causallib.preprocessing.transformers module

Copyright 2019 IBM Corp.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

class causallib.preprocessing.transformers.Imputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose='deprecated', copy=True, add_indicator=False)

Bases: sklearn.impute._base.SimpleImputer

transform(X)

Impute all missing values in X.

Parameters: X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The input data to complete.
Returns: X_imputed – X with imputed values.
Return type: {ndarray, sparse matrix} of shape (n_samples, n_features_out)

class causallib.preprocessing.transformers.MatchingTransformer(propensity_transform=None, caliper=None, with_replacement=True, n_neighbors=1, matching_mode='both', metric='mahalanobis', knn_backend='sklearn')

Bases: object

Transform data by removing poorly matched samples.

Parameters

propensity_transform (causallib.preprocessing.transformers.PropensityTransformer) – an object for data preprocessing which adds the propensity score as a feature (default: None)
caliper (float) – maximal distance for a match to be accepted. If not defined, all matches will be accepted. If defined, some samples may not be matched and their outcomes will not be estimated. (default: None)
with_replacement (bool) – whether samples can be used multiple times for matching. If set to False, the matching process will optimize the linear sum of distances between pairs of treatment and control samples and only min(N_treatment, N_control) samples will be estimated. Matching with no replacement does not make use of the fit data and is therefore not implemented for out-of-sample data (default: True)
n_neighbors (int) – number of nearest neighbors to include in match. Must be 1 if with_replacement is False. If larger than 1, the estimate is calculated using the regress_agg_function or classify_agg_function across the n_neighbors. Note that when the caliper variable is set, some samples will have fewer than n_neighbors matches. (default: 1).
matching_mode (str) – Direction of matching: treatment_to_control, control_to_treatment or both, indicating which set should be matched to which. All sets are cross-matched in match; when with_replacement is False all matching modes coincide, whereas with replacement the modes give different results. (default: both)
metric (str) – Distance metric string for calculating distance between samples. Note: if an externally built knn_backend object with a different metric is supplied, metric must be changed to match it, because Matching sets its inverse covariance matrix when “mahalanobis” is set. (default: “mahalanobis”, also supported: “euclidean”)
knn_backend (str or callable) – Backend to use for nearest neighbor search. Options are “sklearn” or a callable which returns an object implementing fit, kneighbors and set_params like the sklearn NearestNeighbors object. (default: “sklearn”).
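
The with_replacement=False behavior described above (one-to-one pairs that minimize the summed distance, with only min(N_treatment, N_control) samples paired) can be sketched by brute force on toy data. This is an illustration only, not causallib's implementation: it assumes a single covariate, absolute difference as the distance, and a hypothetical helper named optimal_pairs. For realistic sizes an assignment solver such as scipy.optimize.linear_sum_assignment would be the practical choice.

```python
from itertools import permutations


def optimal_pairs(treated, control):
    """Brute-force one-to-one matching that minimizes the summed distance.

    Hypothetical helper: `treated` and `control` map sample ids to a single
    covariate value, and distance is the absolute difference.
    """
    # Iterate over the smaller group so each of its samples gets one partner.
    if len(treated) <= len(control):
        small, large = treated, control
    else:
        small, large = control, treated
    small_ids = list(small)

    def total_distance(candidate):
        return sum(abs(small[i] - large[j]) for i, j in zip(small_ids, candidate))

    best = min(permutations(large, len(small_ids)), key=total_distance)
    return dict(zip(small_ids, best))


treated = {"t1": 0.20, "t2": 0.28}
control = {"c1": 0.25, "c2": 0.90, "c3": 0.35}

# With replacement both treated samples would pick c1 as nearest neighbor;
# without replacement each control can be used only once.
pairs = optimal_pairs(treated, control)
print(pairs)
```

Only min(len(treated), len(control)) pairs are produced, which is why the unmatched control sample has no estimate in this mode.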

find_indices_of_matched_samples(X, a)

Find indices of samples which matched successfully.

Given a DataFrame of samples X and treatment assignments a, return a list of indices of samples which matched successfully.

Parameters

X (pd.DataFrame) – Covariates of samples
a (pd.Series) – Treatment assignments

Returns

indices of matched samples to be passed to X.loc

Return type

pd.Series

fit(X, a, y)

Fit data to transform

This function loads the data for matching and must be called before transform. For convenience, consider using fit_transform.

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcomes for the n samples.

Returns

self, the fitted object

Return type

MatchingTransformer

fit_transform(X, a, y)

Match data and return matched subset.

This is a convenience method, calling fit and transform at once. For details, see documentation of each function.

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcomes for the n samples.

Returns

Xm (pd.DataFrame) – Covariates of samples that were matched
am (pd.Series) – Treatment values of samples that were matched
ym (pd.Series) – Outcome values of samples that were matched

set_params(**kwargs)

Set parameters of matching engine. Supported parameters are:

Keyword Arguments

propensity_transform (causallib.preprocessing.transformers.PropensityTransformer) – an object for data preprocessing which adds the propensity score as a feature (default: None)
caliper (float) – maximal distance for a match to be accepted (default: None)
with_replacement (bool) – whether samples can be used multiple times for matching (default: True)
n_neighbors (int) – number of nearest neighbors to include in match. Must be 1 if with_replacement is False (default: 1).
matching_mode (str) – Direction of matching: treatment_to_control, control_to_treatment or both, indicating which set should be matched to which. All sets are cross-matched in match; without replacement the mode makes no difference to the outcome, but with replacement it does, and it affects the results of transform.
metric (str) – Distance metric string for calculating distance between samples (default: “mahalanobis”, also supported: “euclidean”)
knn_backend (str or callable) – Backend to use for nearest neighbor search. Options are “sklearn” or a callable which returns an object implementing fit, kneighbors and set_params like the sklearn NearestNeighbors object. (default: “sklearn”).

Returns

self, the object with the new parameters set

Return type

MatchingTransformer

transform(X, a, y)

Transform data by restricting it to samples which are matched.

Following a matching process, not all of the samples will find matches. Restricting the data to treated samples that have close matches in control, or to control samples that have close matches in treatment, can make other causal methods more effective. This function will call match on the underlying Matching object.

The attribute matching_mode changes the behavior of this function. If set to control_to_treatment each control will attempt to find a match among the treated, hence the transformed data will have a maximum size of N_c + min(N_c,N_t). If set to treatment_to_control, each treatment will attempt to find a match among the control and the transformed data will have a maximum size of N_t + min(N_c,N_t). If set to both, both matching operations will be executed and if a sample succeeds in either direction it will be included, hence the maximum size of the transformed data will be len(X).

If with_replacement is False, matching_mode does not change the behavior. There will be up to min(N_c,N_t) samples in the returned DataFrame, regardless.
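
How matching_mode changes the retained samples can be illustrated with a small hand-rolled sketch. It is not causallib's implementation: it assumes a single covariate, absolute difference as the distance, one nearest neighbor with replacement, and hypothetical helper names (nearest, matched_ids).

```python
MODES = ("treatment_to_control", "control_to_treatment", "both")


def nearest(source, pool, caliper=None):
    """Map each source id to its closest pool id, if within the caliper."""
    matches = {}
    for i, xi in source.items():
        j, dist = min(
            ((j, abs(xi - xj)) for j, xj in pool.items()),
            key=lambda pair: pair[1],
        )
        if caliper is None or dist <= caliper:
            matches[i] = j
    return matches


def matched_ids(treated, control, matching_mode="both", caliper=None):
    """Return the ids kept after matching with replacement."""
    if matching_mode not in MODES:
        raise NotImplementedError(matching_mode)
    kept = set()
    if matching_mode in ("treatment_to_control", "both"):
        m = nearest(treated, control, caliper)
        kept |= set(m) | set(m.values())  # matched sources and their partners
    if matching_mode in ("control_to_treatment", "both"):
        m = nearest(control, treated, caliper)
        kept |= set(m) | set(m.values())
    return sorted(kept)


treated = {"t1": 0.10, "t2": 0.90}
control = {"c1": 0.12, "c2": 0.15}

t2c = matched_ids(treated, control, "treatment_to_control", caliper=0.1)
c2t = matched_ids(treated, control, "control_to_treatment", caliper=0.1)
both = matched_ids(treated, control, "both", caliper=0.1)
print(t2c, c2t, both)
```

In this toy data t2 lies outside the caliper of every control, so it is dropped in every mode, while c2 is kept only when controls search for treated partners.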

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcomes for the n samples.

Raises

NotImplementedError – Raised if a value of attribute matching_mode other than the supported values is set.

Returns

Xm (pd.DataFrame) – Covariates of samples that were matched
am (pd.Series) – Treatment values of samples that were matched
ym (pd.Series) – Outcome values of samples that were matched

class causallib.preprocessing.transformers.MinMaxScaler(only_binary_features=True, ignore_nans=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Scales features to 0-1, allowing for NaNs.

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

Parameters

only_binary_features (bool) – Whether to apply only on binary features or across all.
ignore_nans (bool) – Whether to ignore NaNs during calculation.
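
The scaling formula above can be traced on a single column with a hand-rolled sketch. It is an illustration of the arithmetic rather than the library's code: the helper name min_max_scale is hypothetical, and it assumes NaNs are skipped when computing the minimum and maximum and are left as NaN in the output.

```python
import math


def min_max_scale(column):
    """Scale one feature column to [0, 1] via (x - min) / (max - min)."""
    # Skip NaNs when computing the statistics (assumed ignore_nans behavior).
    observed = [x for x in column if not math.isnan(x)]
    lo, hi = min(observed), max(observed)
    span = hi - lo  # a constant column would give span == 0; not handled here
    # NaN propagates through the arithmetic, so missing entries stay missing.
    return [(x - lo) / span for x in column]


scaled = min_max_scale([2.0, float("nan"), 4.0, 6.0])
print(scaled)
```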

fit(X, y=None)

Compute the minimum and maximum to be used for later scaling.

Parameters

X (pd.DataFrame) – array-like, shape [n_samples, n_features]. The data used to compute the per-feature minimum and maximum used for later scaling along the features axis (axis=0).
y – Passthrough for Pipeline compatibility.

Returns

a fitted MinMaxScaler

Return type

MinMaxScaler

inverse_transform(X)

Undo the scaling of X, mapping the scaled features back to their original range.

Parameters: X (pd.DataFrame) – array-like, shape [n_samples, n_features] Input data that will be transformed.
Returns: array-like, shape [n_samples, n_features]. Transformed data.
Return type: pd.DataFrame

transform(X)

Scale the chosen features of X to the range 0-1.

Parameters: X (pd.DataFrame) – array-like, shape [n_samples, n_features] Input data that will be transformed.
Returns: array-like, shape [n_samples, n_features]. Transformed data.
Return type: pd.DataFrame

class causallib.preprocessing.transformers.PropensityTransformer(learner, include_covariates=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transform covariates by adding/replacing with the propensity score.

Parameters

learner (sklearn.estimator) – A learner implementing fit and predict_proba to use for predicting the propensity score.
include_covariates (bool) – Whether to return the original covariates alongside the “propensity” column.

fit(X, a)

transform(X, treatment_values=None)

Append propensity or replace covariates with propensity.

Parameters

X (pd.DataFrame) – A DataFrame of samples to transform. This will be input to the learner trained by fit. If the columns are different, the results will not be valid.
treatment_values (Any | None) – The treatment value or values for which to extract the propensity (i.e. the treatment value whose probability should be calculated). If not specified, the maximal treatment value is chosen, since the usual case is a treatment (A=1) versus control (A=0) setting.

Returns

DataFrame with a “propensity” column. If “include_covariates” is True, it will include all of the original features plus “propensity”, else it will only have the “propensity” column.

Return type

pd.DataFrame
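
The default choice of treatment value described above can be sketched with plain lists. This is an illustrative assumption about the selection logic, not the library's code: propensity_column is a hypothetical helper, the probability table stands in for a learner's predict_proba output, and only a single treatment value is handled.

```python
def propensity_column(probabilities, classes, treatment_values=None):
    """Pick, per sample, the predicted probability of one treatment value.

    `probabilities` has one row per sample and one column per entry of
    `classes`. When no treatment value is requested, the maximal one is used.
    """
    target = max(classes) if treatment_values is None else treatment_values
    column = classes.index(target)
    return [row[column] for row in probabilities]


# Hypothetical predicted probabilities for three samples, treatments {0, 1}.
proba = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]

propensity = propensity_column(proba, classes=[0, 1])
propensity_of_control = propensity_column(proba, classes=[0, 1], treatment_values=0)
print(propensity, propensity_of_control)
```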

class causallib.preprocessing.transformers.StandardScaler(with_mean=True, with_std=True, ignore_nans=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Standardize continuous features by removing the mean and scaling to unit variance while allowing NaNs.

X = (X - X.mean()) / X.std()

Parameters

with_mean (bool) – Whether to center the data before scaling.
with_std (bool) – Whether to scale the data to unit variance.
ignore_nans (bool) – Whether to ignore NaNs during calculation.
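
The standardization formula and its inverse can be traced on one column with a short sketch. It illustrates the arithmetic only: the helper names are hypothetical, NaNs are assumed to be skipped when computing the statistics, and the sample standard deviation (ddof=1, the pandas default for X.std()) is an assumption.

```python
import math
import statistics


def fit_stats(column):
    """Mean and sample standard deviation of the non-NaN entries."""
    observed = [x for x in column if not math.isnan(x)]
    return statistics.mean(observed), statistics.stdev(observed)


def standardize(column, mean, std, with_mean=True, with_std=True):
    """Center and/or scale a column; NaN entries stay NaN."""
    shift = mean if with_mean else 0.0
    scale = std if with_std else 1.0
    return [(x - shift) / scale for x in column]


def unstandardize(column, mean, std, with_mean=True, with_std=True):
    """Invert standardize using the same fitted statistics."""
    shift = mean if with_mean else 0.0
    scale = std if with_std else 1.0
    return [x * scale + shift for x in column]


column = [1.0, float("nan"), 2.0, 3.0]
mean, std = fit_stats(column)
z = standardize(column, mean, std)
restored = unstandardize(z, mean, std)
print(z, restored)
```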

fit(X, y=None)

Compute the mean and std to be used for later scaling.

Parameters

X (pd.DataFrame) – The data used to compute the mean and standard deviation used for later scaling along the features axis (axis=0).
y – Passthrough for Pipeline compatibility.

Returns

A fitted standard-scaler

Return type

StandardScaler

inverse_transform(X)

Scale back the data to the original representation.

Parameters: X (pd.DataFrame) – array-like, shape [n_samples, n_features] The scaled data to map back to the original representation.
Returns: Un-scaled dataset.
Return type: pd.DataFrame

transform(X, y='deprecated')

Perform standardization by centering and scaling

Parameters

X (pd.DataFrame) – array-like, shape [n_samples, n_features] The data to standardize, using the mean and standard deviation computed in fit along the features axis (axis=0).
y – Passthrough for Pipeline compatibility.

Returns

Scaled dataset.

Return type

pd.DataFrame