Module auton_survival.preprocessing

Classes

class Imputer (cat_feat_strat='ignore', num_feat_strat='mean', remaining='drop')

A class to impute missing values in the input features.

Real-world datasets often have missing covariates. Imputation replaces the missing values so that downstream experiments can proceed. This class supports multiple strategies for imputing both categorical and numerical/continuous covariates.

For categorical features, the class allows:

  • replace: Replace all null values with a user-specified constant.
  • ignore: Keep all missing values as is.
  • mode: Replace null values with the most commonly occurring value.
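
The three categorical strategies can be illustrated with a small, dependency-free sketch. The helper name impute_categorical and the use of None to mark a missing entry are assumptions made for illustration; the class itself works on pandas DataFrames and its internals may differ.

```python
from collections import Counter


def impute_categorical(values, strategy="ignore", fill_value=-1):
    """Impute None entries in a list of categorical values (illustrative only)."""
    if strategy == "ignore":
        # Keep missing entries untouched.
        return list(values)
    if strategy == "replace":
        # Substitute every missing entry with the user-supplied constant.
        return [fill_value if v is None else v for v in values]
    if strategy == "mode":
        observed = [v for v in values if v is not None]
        if not observed:
            return list(values)  # nothing observed, so no mode to learn
        most_common = Counter(observed).most_common(1)[0][0]
        return [most_common if v is None else v for v in values]
    raise ValueError(f"unknown strategy: {strategy!r}")


colours = ["red", None, "blue", "red", None]
replaced = impute_categorical(colours, "replace", fill_value=-1)
by_mode = impute_categorical(colours, "mode")
```

Here replaced holds -1 in place of each missing entry, while by_mode holds "red", the most frequent observed value.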

For numerical/continuous features, the user can choose between the following strategies:

  • mean: Replace all missing values with the mean in the column.
  • median: Replace all missing values with the median in the column.
  • knn: Use a k Nearest Neighbour model to predict the missing value.
  • missforest: Use the MissForest model to predict the null values.
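
As a rough illustration of the first three numerical strategies, the sketch below uses plain Python lists with None marking a missing entry. The function names are hypothetical, and knn_impute is a deliberately simplified variant (mean squared difference over shared features, unweighted average of the k nearest donors), not necessarily the procedure the class uses. MissForest is model-based and is not shown.

```python
from statistics import mean, median


def impute_numeric(values, strategy="mean"):
    """Fill None entries of one column with the column mean or median."""
    observed = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median}[strategy](observed)
    return [statistic if v is None else v for v in values]


def knn_impute(rows, k=2):
    """Simplified kNN imputation over rows given as lists of floats/None."""

    def distance(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return sum(shared) / len(shared) if shared else float("inf")

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where feature j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )[:k]
            if donors:
                imputed[i][j] = mean(v for _, v in donors)
    return imputed


ages = [20.0, None, 30.0, 100.0]
table = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [100.0, 1000.0]]
filled_table = knn_impute(table, k=2)
```

The example column contains an outlier (100.0), so the mean and median strategies fill the gap with noticeably different values; this is the usual reason to prefer the median for skewed columns.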

Parameters

cat_feat_strat : str
Strategy for imputing categorical features. One of 'replace', 'ignore', 'mode'. Default is 'ignore'.
num_feat_strat : str
Strategy for imputing numerical/continuous features. One of 'mean', 'median', 'knn', 'missforest'. Default is 'mean'.
remaining : str
Strategy for handling remaining columns. One of 'ignore', 'drop'. Default is 'drop'.

Methods

def fit(self, data, cat_feats=None, num_feats=None, fill_value=-1, n_neighbors=5, **kwargs)

Fits the imputer to the dataset.

def transform(self, data)

Imputes the dataset using the fitted imputer.

def fit_transform(self, data, cat_feats, num_feats, fill_value=-1, n_neighbors=5, **kwargs)

Imputes the dataset using the configured imputation strategies.

Parameters

data : pandas.DataFrame
The dataframe to be imputed.
cat_feats : list
List of categorical features.
num_feats : list
List of numerical/continuous features.
fill_value : int
Value to be filled if cat_feat_strat='replace'.
n_neighbors : int
Number of neighbors to be used if num_feat_strat='knn'.
**kwargs
Passed on.

Returns

pandas.DataFrame
Imputed dataset.

class Scaler (scaling_strategy='standard')

Scaler to rescale numerical features.

For scaling, the user can choose between the following strategies:

  • standard: Perform the standard scaling method.
  • minmax: Perform the minmax scaling method.
  • none: Do not perform scaling.
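
The source does not spell out the formulas behind these strategies. The sketch below assumes the conventional meanings: 'standard' subtracts the column mean and divides by the population standard deviation, and 'minmax' rescales the column onto [0, 1]. The function name scale is hypothetical.

```python
from statistics import mean, pstdev


def scale(values, strategy="standard"):
    """Rescale one numerical column given as a list of floats."""
    if strategy == "none":
        return list(values)
    if strategy == "standard":
        # Zero mean, unit variance (population standard deviation).
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma for v in values]
    if strategy == "minmax":
        # Smallest value maps to 0, largest to 1.
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError(f"unknown strategy: {strategy!r}")


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardised = scale(column, "standard")
unit_range = scale(column, "minmax")
```

A constant column would cause a division by zero in both branches; the sketch leaves that case unhandled for brevity.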

Parameters

scaling_strategy : str
Strategy to use for scaling numerical/continuous data. One of 'standard', 'minmax', 'none'. Default is 'standard'.

Methods

def fit(self, data, num_feats=None)

Fits scaler to dataset using scaling strategy.

Parameters

data : pandas.DataFrame
Dataframe to be scaled.
num_feats : list
List of numerical/continuous features to be scaled. NOTE: if left empty, all features are interpreted as numerical.

Returns

Fitted instance of scaler.

def transform(self, data)

Scales data using the fitted scaling strategy.

Parameters

data : pandas.DataFrame
Dataframe to be scaled.

Returns

pandas.DataFrame
Scaled dataset.
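
Separating fit from transform lets statistics learned on one dataset be reused on another, for example on held-out data. The class below is a hypothetical stand-in written with the standard library only (rows are dictionaries rather than DataFrame rows), assuming standard scaling; it is a sketch of the pattern, not the library's implementation.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Hypothetical stand-in illustrating the fit/transform split."""

    def fit(self, rows, num_feats=None):
        # With no features given, treat every feature as numerical.
        feats = list(num_feats) if num_feats else list(rows[0])
        self.stats_ = {}
        for feat in feats:
            column = [row[feat] for row in rows]
            self.stats_[feat] = (mean(column), pstdev(column))
        return self  # fitted instance, so calls can be chained

    def transform(self, rows):
        scaled = []
        for row in rows:
            new_row = dict(row)  # copy so the input is left untouched
            for feat, (mu, sigma) in self.stats_.items():
                new_row[feat] = (row[feat] - mu) / sigma
            scaled.append(new_row)
        return scaled


train = [{"age": 1.0, "sex": "f"}, {"age": 3.0, "sex": "m"}]
new = [{"age": 5.0, "sex": "m"}]

scaler = StandardScalerSketch().fit(train, num_feats=["age"])
scaled_train = scaler.transform(train)
scaled_new = scaler.transform(new)  # reuses the mean and deviation from train
```

The new row is scaled with the training mean and deviation rather than its own, which keeps training and evaluation data on the same scale.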

def fit_transform(self, data, num_feats=[])

Fits a scaler and rescales a dataset using the configured scaling strategy.

Parameters

data : pandas.DataFrame
Dataframe to be scaled.
num_feats : list
List of numerical/continuous features to be scaled. NOTE: if left empty, all features are interpreted as numerical.

Returns

pandas.DataFrame
Scaled dataset.

class Preprocessor (cat_feat_strat='ignore', num_feat_strat='mean', scaling_strategy='standard', one_hot=True, remaining='drop')

A composite transform that combines imputation and scaling, with optional one-hot encoding.

Parameters

cat_feat_strat : str
Strategy for imputing categorical features.
num_feat_strat : str
Strategy for imputing numerical/continuous features.
scaling_strategy : str
Strategy to use for scaling numerical/continuous data.
one_hot : bool
Whether to apply one hot encoding to the data.
remaining : str
Strategy for handling remaining columns.

Methods

def fit(self, data, cat_feats, num_feats, fill_value=-1, n_neighbors=5, **kwargs)

Fit imputer and scaler to dataset.

def transform(self, data)

Impute and scale the dataset.

def fit_transform(self, data, cat_feats, num_feats, fill_value=-1, n_neighbors=5, **kwargs)

Imputes and scales dataset.

Parameters

data : pandas.DataFrame
The dataframe to be imputed and scaled.
cat_feats : list
List of categorical features.
num_feats : list
List of numerical/continuous features.
one_hot : bool
Indicating whether to perform one-hot encoding.
fill_value : int
Value to be filled if cat_feat_strat='replace'.
n_neighbors : int
Number of neighbors to be used if num_feat_strat='knn'.
**kwargs
Passed on.

Returns

pandas.DataFrame
Imputed and scaled dataset.
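
To make the composite transform concrete, the following dependency-free sketch chains the three steps on a list of dictionaries. Several points are assumptions for illustration: the function name preprocess, the order of the steps (impute, then scale, then one-hot encode), the use of None for missing values, and the choice of mean and mode imputation (mode is not the default categorical strategy).

```python
from collections import Counter
from statistics import mean, pstdev


def preprocess(rows, cat_feats, num_feats):
    """Impute, scale, and one-hot encode rows given as dictionaries."""
    rows = [dict(row) for row in rows]  # work on copies

    # Step 1: impute numerical features with the mean, categorical with the mode.
    for feat in num_feats:
        mu = mean(row[feat] for row in rows if row[feat] is not None)
        for row in rows:
            if row[feat] is None:
                row[feat] = mu
    for feat in cat_feats:
        counts = Counter(row[feat] for row in rows if row[feat] is not None)
        most_common = counts.most_common(1)[0][0]
        for row in rows:
            if row[feat] is None:
                row[feat] = most_common

    # Step 2: standard-scale the numerical features.
    for feat in num_feats:
        column = [row[feat] for row in rows]
        mu, sigma = mean(column), pstdev(column)
        for row in rows:
            row[feat] = (row[feat] - mu) / sigma

    # Step 3: replace each categorical feature with one indicator per category.
    for feat in cat_feats:
        categories = sorted({row[feat] for row in rows})
        for row in rows:
            value = row.pop(feat)
            for category in categories:
                row[f"{feat}_{category}"] = int(value == category)
    return rows


patients = [
    {"age": 2.0, "sex": "f"},
    {"age": None, "sex": "m"},
    {"age": 4.0, "sex": None},
    {"age": 6.0, "sex": "m"},
]
processed = preprocess(patients, cat_feats=["sex"], num_feats=["age"])
```

Imputing before scaling means the scaling statistics are computed on a complete column, and encoding last keeps the indicator columns out of the scaling step.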