Features
-
jamie.features.list_features()
List available featuresets.
-
jamie.features.select_features(f)
Select featureset from name.
-
jamie.features.valid_doc(features, doc)
Check whether a document is valid according to the featureset's required columns. Each featureset usually requires certain attributes to be present in the data for a valid feature transformation.
- Returns
bool – Whether document is valid
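As a minimal sketch of the check described above, the following treats a document as a plain dict and a featureset as an object with a `require_columns` list. Both `SimpleFeatureset` and `has_required_columns` are hypothetical stand-ins written for illustration; they are not part of `jamie`, and the real `valid_doc` may apply different rules (for example, in how empty values are treated).

```python
from dataclasses import dataclass, field


@dataclass
class SimpleFeatureset:
    """Hypothetical stand-in for a featureset; not the jamie API."""
    require_columns: list = field(default_factory=list)


def has_required_columns(features, doc):
    """Return True if every required column is present in the document.

    Assumption: a missing key or a None value counts as absent.
    """
    return all(doc.get(col) is not None for col in features.require_columns)


featureset = SimpleFeatureset(require_columns=["description", "job_title"])
docs = [
    {"description": "Develop research software", "job_title": "Research Software Engineer"},
    {"description": "Teach undergraduate courses"},  # no job_title
]

# Keep only the documents that this featureset could transform
valid_docs = [d for d in docs if has_required_columns(featureset, d)]
```

Filtering documents this way before building features avoids failures later in the transformation step.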
-
class jamie.features.FeatureBase(data, require_columns, clean_columns=None)
Base feature class.
- Parameters
data (pd.DataFrame) – Data file to use
require_columns (list of str) – List of required columns in DataFrame.
clean_columns (list of str, optional) – List of columns to apply text cleaning to
- Raises
ValueError – If any of the required columns are missing
-
fit_transform(X, y=None)
Fit the features and transform with the final estimator. This calls the fit_transform() function of the underlying FeatureUnion object. The transformation pipeline converts data, usually loaded from a CSV file, into a numpy.ndarray. This method is usually called to create the training feature matrix.
- Parameters
X (pd.DataFrame) – Data to fit, usually from a CSV file
y (array-like, optional) – This is ignored but kept for compatibility with other scikit-learn transformers
- Returns
numpy.ndarray – Feature matrix after applying pipeline
-
make_arrays(prediction_field)
Build feature matrix. When the features class is initialized, the matrix is not built until make_arrays() is called. This does nothing in the base class, but is overridden in the derived Feature classes. Typically it is the only function called from outside.
- Parameters
prediction_field (str) – Which column of the data to use as labels for prediction
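The deferred-build pattern described above can be sketched as follows. The classes `BaseFeaturesSketch` and `LengthFeaturesSketch`, the attributes `X` and `labels`, and the `is_rse` column are all hypothetical names chosen for illustration; the sketch uses a list of dicts in place of a DataFrame and does not reproduce what the `jamie` classes actually store.

```python
class BaseFeaturesSketch:
    """Hypothetical base class: stores data, builds nothing up front."""

    def __init__(self, data):
        self.data = data      # list of dict rows, standing in for a DataFrame
        self.X = None         # feature rows, built on demand
        self.labels = None    # label column, built on demand

    def make_arrays(self, prediction_field):
        """Does nothing in the base class; subclasses override it."""


class LengthFeaturesSketch(BaseFeaturesSketch):
    """Hypothetical subclass using text length as its only feature."""

    def make_arrays(self, prediction_field):
        self.X = [[len(row["description"])] for row in self.data]
        self.labels = [row[prediction_field] for row in self.data]


rows = [
    {"description": "Build research software", "is_rse": 1},
    {"description": "Teach", "is_rse": 0},
]
features = LengthFeaturesSketch(rows)   # nothing is built yet
features.make_arrays("is_rse")          # now X and labels are populated
```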
-
set_features(features)
Set features using a FeatureUnion.
-
train_test_split(random_state, test_size=0.2)
Return different train/test splits for an ensemble by varying random_state.
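To illustrate how varying `random_state` yields a different but reproducible split for each ensemble member, here is a standard-library sketch. The helper `split_indices` is hypothetical and works on row indices only; it is not the method above. scikit-learn's `sklearn.model_selection.train_test_split` accepts `random_state` and `test_size` arguments with the same meaning, but whether this method delegates to it is not stated here.

```python
import random


def split_indices(n_samples, random_state, test_size=0.2):
    """Shuffle indices with a seeded RNG and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]  # (train, test)


# One split per ensemble member, each with its own seed
splits = [split_indices(100, random_state=seed) for seed in range(5)]
```

Because each seed fixes the shuffle, any ensemble member's split can be regenerated later from its `random_state` alone.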
-
transform(X)
Transform X separately by each transformer in the FeatureUnion, then concatenate the results. This method is usually called to transform the test data X_test in the same manner as X_train. For text transformation in particular, this preserves the vocabulary fitted from the training data.
- Parameters
X (pd.DataFrame) – Data to transform, usually from a CSV file
- Returns
numpy.ndarray – Feature matrix after applying transformation
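The reason for calling fit_transform() on training data and transform() on test data can be shown with a toy bag-of-words transformer. `TinyBagOfWords` is a hypothetical class written for this sketch; it is not part of `jamie` or scikit-learn and returns nested lists instead of a numpy.ndarray.

```python
class TinyBagOfWords:
    """Hypothetical bag-of-words transformer, for illustration only."""

    def fit_transform(self, texts):
        # Learn the vocabulary from the training texts only
        words = sorted({w for t in texts for w in t.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self.transform(texts)

    def transform(self, texts):
        # Reuse the fitted vocabulary; unseen words are dropped
        rows = []
        for t in texts:
            row = [0] * len(self.vocabulary_)
            for w in t.lower().split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows


train_texts = ["research software engineer", "software developer"]
test_texts = ["senior research engineer"]  # "senior" was never seen in training

vectorizer = TinyBagOfWords()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
```

Both matrices have one column per training-vocabulary word, so a model fitted on `X_train` can be applied to `X_test` directly. Calling fit_transform() on the test data instead would produce columns with a different meaning.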
Installed featuresets
-
class jamie.features.RSEFeatures(data)
Default featureset for finding Research Software Engineering (RSE) jobs. For the available methods, see FeatureBase. The featureset encodes the following features:
description: Description text of job, transformed as below
job_title: Job title, transformed as below
The text is transformed using TF-IDF over unigrams and bigrams, after removing stopwords, with sublinear TF scaling.
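The text transformation described above can be sketched in plain Python. The stopword set, the whitespace tokenizer, and the helpers `ngrams` and `tfidf` are assumptions made for this illustration. The sketch uses the smoothed IDF formula from scikit-learn's documentation and omits the L2 row normalization that `TfidfVectorizer` applies by default. In scikit-learn the same options correspond to the `ngram_range=(1, 2)`, `stop_words` and `sublinear_tf=True` arguments of `TfidfVectorizer`; whether RSEFeatures uses that class is not stated here.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}  # illustrative subset


def ngrams(text):
    """Lowercase, drop stopwords, then emit unigrams and bigrams."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (1 + ln(tf)) * idf, with idf = ln((1 + n) / (1 + df)) + 1.
    The (1 + ln(tf)) factor is the sublinear TF scaling.
    """
    counts = [Counter(ngrams(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for c in counts for term in c)
    return [
        {
            term: (1 + math.log(tf)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, tf in c.items()
        }
        for c in counts
    ]


docs = [
    "Research software engineer for the software group",
    "Lecturer in software engineering",
]
weights = tfidf(docs)
```

With sublinear scaling, a term that occurs twice in a document (here "software" in the first one) gets a TF factor of 1 + ln 2 and not 2, so repeated words in long job descriptions do not dominate the feature vector.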