Features

jamie.features.list_features()

List available featuresets

jamie.features.select_features(f)

Select a featureset by name

jamie.features.valid_doc(features, doc)

Check whether a document is valid according to the featureset's required columns. Each featureset typically requires certain attributes to be present in the data for a valid feature transformation.

Returns

bool – Whether document is valid
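
A minimal usage sketch of these module-level helpers, assuming select_features() returns the featureset selected by name and valid_doc() accepts a dict-like document; the featureset name and column values below are illustrative:

    import jamie.features

    # Show the names of the installed featuresets
    print(jamie.features.list_features())

    # Select a featureset by name (the name 'rse' is assumed for illustration)
    features = jamie.features.select_features('rse')

    # Check a single document against the featureset's required columns
    doc = {'description': 'Develop and maintain research software.',
           'job_title': 'Research Software Engineer'}
    print(jamie.features.valid_doc(features, doc))  # True if all required columns are present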

class jamie.features.FeatureBase(data, require_columns, clean_columns=None)

Base feature class

Parameters
  • data (pd.DataFrame) – Data to use, usually read from a CSV file

  • require_columns (list of str) – List of required columns in DataFrame.

  • clean_columns (list of str, optional) – List of columns to apply text cleaning to

Raises

ValueError – If any of the required columns are missing
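
As a hedged sketch, a derived featureset might call the base constructor as follows; the subclass name and column names are assumptions for illustration only:

    import pandas as pd
    from jamie.features import FeatureBase

    class MyFeatures(FeatureBase):  # hypothetical subclass
        def __init__(self, data: pd.DataFrame):
            # Raises ValueError if 'description' or 'job_title' are missing;
            # both columns are also passed through text cleaning
            super().__init__(data,
                             require_columns=['description', 'job_title'],
                             clean_columns=['description', 'job_title'])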

fit_transform(X, y=None)

Fit the features and transform with the final estimator. This calls the fit_transform() method of the underlying FeatureUnion object. The transformation pipeline converts the data, usually read from a CSV file, into a numpy.ndarray. This method is typically used to create the training feature matrix.

Parameters
  • X (pd.DataFrame) – Data to fit, usually from a CSV file

  • y (array-like, optional) – This is ignored but kept for compatibility with other scikit-learn transformers

Returns

numpy.ndarray – Feature matrix after applying pipeline
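
For example, assuming features is an initialized featureset whose pipeline has already been set (e.g. via set_features()), the training matrix could be built as below; the variable names are illustrative:

    # X_train is a pd.DataFrame, usually read from a CSV file
    X_train_matrix = features.fit_transform(X_train)
    print(X_train_matrix.shape)  # (n_samples, n_features)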

make_arrays(prediction_field)

Build the feature matrix. Initializing the features class does not build the matrix; it is only built when make_arrays() is called. This method does nothing in the base class and is overridden in the derived feature classes. Typically it is the only method called from outside.

Parameters

prediction_field (str) – Which column of the data to use as labels for prediction
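
A typical call, with the label column name assumed purely for illustration:

    # Use the 'label' column of the data as prediction labels
    # ('label' is a hypothetical column name)
    features.make_arrays(prediction_field='label')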

set_features(features)

Set features using a FeatureUnion

train_test_split(random_state, test_size=0.2)

Return different train/test splits for an ensemble by varying random_state.

Parameters
  • random_state (int or RandomState) – Random state to use

  • test_size (float, default=0.2) – Proportion of the data to use for the test set

Returns

tuple of numpy.ndarray – X_train, X_test, y_train, y_test
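
A sketch of how an ensemble could draw several splits by varying the random state; the number of members is illustrative:

    # One train/test split per ensemble member, differing only in random_state
    for seed in range(5):
        X_train, X_test, y_train, y_test = features.train_test_split(random_state=seed)
        # ... fit one ensemble member on (X_train, y_train) and evaluate on (X_test, y_test)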

transform(X)

Transform X separately with each transformer in the FeatureUnion and concatenate the results. This method is usually called to transform the test data X_test in the same manner as X_train; in particular, for text transformations this preserves the vocabulary fitted on the training data.

Parameters

X (pd.DataFrame) – Data to transform, usually from a CSV file

Returns

numpy.ndarray – Feature matrix after applying transformation
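
A sketch of the usual pairing with fit_transform(), so that the test data is transformed with the vocabulary fitted on the training data; variable names are illustrative:

    X_train_matrix = features.fit_transform(X_train)  # fits the text vocabulary
    X_test_matrix = features.transform(X_test)        # reuses the fitted vocabulary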

Installed featuresets

class jamie.features.RSEFeatures(data)

Default featureset for finding Research Software Engineering (RSE) jobs. For the available methods, see FeatureBase. The featureset encodes the following features:

  • description: Description text of job, transformed as below

  • job_title: Job title, transformed as below

The text is transformed using TF-IDF, producing unigrams and bigrams after stopword removal, with sublinear TF scaling.
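
Constructing the featureset itself only requires the data, e.g. RSEFeatures(pd.read_csv('jobs.csv')), where the file name is illustrative. The text transformation described above corresponds roughly to a scikit-learn TF-IDF vectorizer configured as in the sketch below; this is an illustrative approximation, not necessarily the exact pipeline used by RSEFeatures:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Unigrams and bigrams, English stopword removal, sublinear TF scaling
    vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                 stop_words='english',
                                 sublinear_tf=True)

    # e.g. applied to the job description column of the data
    X_description = vectorizer.fit_transform(data['description'])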