Features

jamie.features.list_features()

List available featuresets

jamie.features.select_features(f)

Select a featureset by name

jamie.features.valid_doc(features, doc)

Check whether a document is valid according to the featureset's required columns. Each featureset typically requires certain attributes to be present in the data for a valid feature transformation.

Returns

bool – Whether document is valid
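
A minimal usage sketch of these module-level helpers, assuming select_features() returns the featureset selected by name and valid_doc() accepts a dict-like document; the featureset name and column values below are illustrative:

    import jamie.features

    # Show the names of the installed featuresets
    print(jamie.features.list_features())

    # Select a featureset by name (the name 'rse' is assumed for illustration)
    features = jamie.features.select_features('rse')

    # Check a single document against the featureset's required columns
    doc = {'description': 'Develop and maintain research software.',
           'job_title': 'Research Software Engineer'}
    print(jamie.features.valid_doc(features, doc))  # True if all required columns are present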

class jamie.features.FeatureBase(data, require_columns, clean_columns=None)

Base feature class

Parameters
  • data (pd.DataFrame) – Data to use, usually read from a CSV file

  • require_columns (list of str) – List of required columns in DataFrame.

  • clean_columns (list of str, optional) – List of columns to apply text cleaning to

Raises

ValueError – If any of the required columns are missing
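
As a hedged sketch, a derived featureset might call the base constructor as follows; the subclass name and column names are assumptions for illustration only:

    import pandas as pd
    from jamie.features import FeatureBase

    class MyFeatures(FeatureBase):  # hypothetical subclass
        def __init__(self, data: pd.DataFrame):
            # Raises ValueError if 'description' or 'job_title' are missing;
            # both columns are also passed through text cleaning
            super().__init__(data,
                             require_columns=['description', 'job_title'],
                             clean_columns=['description', 'job_title'])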

fit_transform(X, y=None)

Fit the features and transform with the final estimator. This calls the fit_transform() method of the underlying FeatureUnion object. The transformation pipeline converts the data, usually read from a CSV file, into a numpy.ndarray. This method is typically used to create the training feature matrix.

Parameters
  • X (pd.DataFrame) – Data to fit, usually from a CSV file

  • y (array-like, optional) – This is ignored but kept for compatibility with other scikit-learn transformers

Returns

numpy.ndarray – Feature matrix after applying pipeline
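
For example, assuming features is an initialized featureset whose pipeline has already been set (e.g. via set_features()), the training matrix could be built as below; the variable names are illustrative:

    # X_train is a pd.DataFrame, usually read from a CSV file
    X_train_matrix = features.fit_transform(X_train)
    print(X_train_matrix.shape)  # (n_samples, n_features)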

make_arrays(prediction_field)

Build the feature matrix. Initializing the features class does not build the matrix; it is only built when make_arrays() is called. This method does nothing in the base class and is overridden in the derived feature classes. Typically it is the only method called from outside.

Parameters

prediction_field (str) – Which column of the data to use as labels for prediction
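
A typical call, with the label column name assumed purely for illustration:

    # Use the 'label' column of the data as prediction labels
    # ('label' is a hypothetical column name)
    features.make_arrays(prediction_field='label')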

set_features(features)

Set features using a FeatureUnion

train_test_split(random_state, test_size=0.2)

Return different train/test splits for an ensemble by varying random_state.

Parameters
  • random_state (int or RandomState) – Random state to use

  • test_size (float, default=0.2) – Proportion of the data to use for the test set

Returns

tuple of numpy.ndarray – X_train, X_test, y_train, y_test
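
A sketch of how an ensemble could draw several splits by varying the random state; the number of members is illustrative:

    # One train/test split per ensemble member, differing only in random_state
    for seed in range(5):
        X_train, X_test, y_train, y_test = features.train_test_split(random_state=seed)
        # ... fit one ensemble member on (X_train, y_train) and evaluate on (X_test, y_test)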

transform(X)

Transform X separately with each transformer in the FeatureUnion and concatenate the results. This method is usually called to transform the test data X_test in the same manner as X_train; in particular, for text transformations this preserves the vocabulary fitted on the training data.

Parameters

X (pd.DataFrame) – Data to transform, usually from a CSV file

Returns

numpy.ndarray – Feature matrix after applying transformation
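
A sketch of the usual pairing with fit_transform(), so that the test data is transformed with the vocabulary fitted on the training data; variable names are illustrative:

    X_train_matrix = features.fit_transform(X_train)  # fits the text vocabulary
    X_test_matrix = features.transform(X_test)        # reuses the fitted vocabulary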

Installed featuresets

class jamie.features.RSEFeatures(data)

Default featureset for finding Research Software Engineering (RSE) jobs. For the available methods, see FeatureBase. The featureset encodes the following features:

  • description: Description text of job, transformed as below

  • job_title: Job title, transformed as below

The text is transformed using TF-IDF, producing unigrams and bigrams after stopword removal, with sublinear TF scaling.
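
Constructing the featureset itself only requires the data, e.g. RSEFeatures(pd.read_csv('jobs.csv')), where the file name is illustrative. The text transformation described above corresponds roughly to a scikit-learn TF-IDF vectorizer configured as in the sketch below; this is an illustrative approximation, not necessarily the exact pipeline used by RSEFeatures:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Unigrams and bigrams, English stopword removal, sublinear TF scaling
    vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                 stop_words='english',
                                 sublinear_tf=True)

    # e.g. applied to the job description column of the data
    X_description = vectorizer.fit_transform(data['description'])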