causallib.estimation.matching module

class causallib.estimation.matching.KNN(learner, index)

Bases: tuple

Create new instance of KNN(learner, index)

index: Alias for field number 1

learner: Alias for field number 0

class causallib.estimation.matching.Matching(propensity_transform=None, caliper=None, with_replacement=True, n_neighbors=1, matching_mode='both', metric='mahalanobis', knn_backend='sklearn', estimate_observed_outcome=False)[source]

Bases: causallib.estimation.base_estimator.IndividualOutcomeEstimator, causallib.estimation.base_weight.WeightEstimator

Match treatment and control samples with similar covariates.

Parameters

propensity_transform (causallib.transformers.PropensityTransformer) – an object for data preprocessing which adds the propensity score as a feature (default: None)
caliper (float) – maximal distance for a match to be accepted. If not defined, all matches will be accepted. If defined, some samples may not be matched and their outcomes will not be estimated. (default: None)
with_replacement (bool) – whether samples can be used multiple times for matching. If set to False, the matching process will optimize the linear sum of distances between pairs of treatment and control samples and only min(N_treatment, N_control) samples will be estimated. Matching with no replacement does not make use of the fit data and is therefore not implemented for out-of-sample data (default: True)
n_neighbors (int) – number of nearest neighbors to include in match. Must be 1 if with_replacement is False. If larger than 1, the estimate is calculated using the regress_agg_function or classify_agg_function across the n_neighbors. Note that when the caliper variable is set, some samples will have fewer than n_neighbors matches. (default: 1).
matching_mode (str) – Direction of matching: “treatment_to_control”, “control_to_treatment” or “both”, indicating which set is matched to which. All sets are cross-matched in match; when with_replacement is False all matching modes coincide, and they differ only when matching with replacement. (default: “both”)
metric (str) – Distance metric string for calculating distance between samples. Note: if an external built knn_backend object with a different metric is supplied, metric needs to be changed to reflect that, because Matching will set its inverse covariance matrix if “mahalanobis” is set. (default: “mahalanobis”, also supported: “euclidean”)
knn_backend (str or callable) – Backend to use for nearest neighbor search. Options are “sklearn” or a callable which returns an object implementing fit, kneighbors and set_params like the sklearn NearestNeighbors object. (default: “sklearn”).
estimate_observed_outcome (bool) – Whether to allow a match of a sample to a sample other than itself when looking within its own treatment value. If True, the estimated potential outcome for the observed outcome may differ from the true observed outcome. (default: False)
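
As a rough illustration of what the default “mahalanobis” metric measures, here is a minimal sketch for two-dimensional covariates. The function name mahalanobis_2d and the hand-written 2x2 matrix inverse are assumptions made for this example; they are not part of causallib, which delegates the distance computation to its knn_backend.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y.

    cov is a 2x2 covariance matrix given as nested lists.
    Hypothetical helper for illustration only.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - y[0], x[1] - y[1]]
    # Quadratic form diff^T * inv * diff.
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


# With an identity covariance the result equals the Euclidean distance.
d_identity = mahalanobis_2d((0, 0), (3, 4), [[1, 0], [0, 1]])
# A feature with larger variance contributes less to the distance.
d_scaled = mahalanobis_2d((0, 0), (2, 0), [[4, 0], [0, 1]])
```

Scaling each direction by the inverse covariance is what makes the metric insensitive to the units of individual covariates, which is one reason it is a common default for matching.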

classify_agg_function

Aggregating function for outcome estimation when classifying. (default: majority_rule) Usage is determined by type of y during fit

Type: callable

regress_agg_function

Aggregating function for outcome estimation when regressing or predicting prob_a. (default: np.mean) Usage is determined by type of y during fit

Type: callable

treatments_

DataFrame of treatments (created after fit)

Type: pd.DataFrame

outcomes_

DataFrame of outcomes (created after fit)

Type: pd.DataFrame

match_df_

DataFrame of the most recently calculated matches. For details, see match. (created after match)

Type: pd.DataFrame

samples_used_

Series with count of samples used during most recent match. Series includes a count for each treatment value. (created after match)

Type: pd.Series

compute_weight_matrix(X, a, use_stabilized=None, **kwargs)[source]

Computes individual weights across all possible treatment values: f(Pr[A=a_j | X_i]) for every individual i and treatment j.

Parameters

X (pd.DataFrame) – Covariate matrix of size (num_subjects, num_features).
a (pd.Series) – Treatment assignment of size (num_subjects,).
use_stabilized (bool) – Whether to re-weigh the learned weights with the prevalence of the treatment. This overrides the use_stabilized parameter provided at initialization. See Also: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351790/#S6title
**kwargs –

Returns

A matrix of size (num_subjects, num_treatments) with a weight for every individual and every treatment.

Return type

pd.DataFrame

compute_weights(X, a, treatment_values=None, use_stabilized=None, **kwargs)[source]

Calculate weights based on a given set of matches.

Only applicable for matching_mode “control_to_treatment” or “treatment_to_control”.

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
treatment_values – IGNORED.
use_stabilized – IGNORED.
**kwargs –

Returns

a Series of shape (n,) with a weight per sample.

Return type

pd.Series

Raises

ValueError – Raised if Matching().matching_mode == 'both'.

estimate_individual_outcome(X, a, y=None, treatment_values=None, predict_proba=True, dropna=True)[source]

Calculate the potential outcome for each sample and treatment value.

Execute match and calculate, for each treatment value and each sample, the expected outcome.

Note: Out of sample estimation for matching without replacement requires passing a y vector here. If no ‘y’ is passed here, the values received by fit are used, and if the estimation indices are not a subset of the fitted indices, the estimation will fail.

If the attribute estimate_observed_outcome is True, estimates will be calculated for the observed outcomes as well. If not, then the observed outcome will be passed through from the corresponding element of y passed to fit.

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcome values for n samples. This is only used when with_replacement=False. Otherwise, the outcome values passed to fit are used.
predict_proba (bool) – whether to output classifications or probabilities for a classification task. If set to False and data is non-integer, a warning is issued. (default: True)
dropna (bool) – For samples that were unmatched due to caliper restrictions, drop from outcome_df leading to a potentially smaller sized output, or include them as NaN. (default: True)
treatment_values – IGNORED

Note: The args are assumed to share the same index.

Returns: outcome_df (pd.DataFrame)
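
The behaviour described for estimate_individual_outcome can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: the function potential_outcomes, the dictionary-based inputs, and the use of None in place of NaN are all assumptions of this example. It mirrors the case where estimate_observed_outcome is False, so the observed outcome is passed through unchanged.

```python
from statistics import mean


def potential_outcomes(match_result, observed_a, y_fit, agg=mean, dropna=True):
    """Aggregate matched outcomes into one value per sample and treatment.

    match_result: {(treatment_value, sample_id): [ids of matched fitted samples]}
    observed_a:   {sample_id: observed treatment value}
    y_fit:        {sample_id: outcome passed to fit}
    Hypothetical helper for illustration only.
    """
    treatment_values = sorted({tv for tv, _ in match_result})
    rows = {}
    for sample_id, a_obs in observed_a.items():
        row = {}
        for tv in treatment_values:
            if tv == a_obs:
                # Observed arm: pass the recorded outcome through.
                row[tv] = y_fit[sample_id]
            else:
                matched = match_result[(tv, sample_id)]
                # None stands in for NaN when the caliper left no match.
                row[tv] = agg(y_fit[m] for m in matched) if matched else None
        rows[sample_id] = row
    if dropna:
        rows = {s: r for s, r in rows.items() if None not in r.values()}
    return rows


observed_a = {"s1": 1, "s2": 0, "s3": 1}
y_fit = {"s1": 5.0, "s2": 2.0, "s3": 7.0}
match_result = {
    (0, "s1"): ["s2"], (1, "s1"): ["s1"],
    (0, "s2"): ["s2"], (1, "s2"): ["s1", "s3"],
    (0, "s3"): [], (1, "s3"): ["s3"],  # s3 has no control match
}
dropped = potential_outcomes(match_result, observed_a, y_fit)
kept = potential_outcomes(match_result, observed_a, y_fit, dropna=False)
```

With dropna=True the unmatched sample s3 is omitted from the result; with dropna=False it is retained with a missing value for the unmatched treatment arm.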

fit(X, a, y, sample_weight=None)[source]

Load the treatments and outcomes and fit search trees.

Applies transform to covariates X, initializes search trees for each treatment value for performing nearest neighbor searches. Note: Running fit a second time overwrites any information from previous fit or match and re-fits the propensity_transform object.

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcomes for the n samples.
sample_weight – IGNORED. Kept in the signature for compatibility with other estimators.

Note: X, a and y must share the same index.

Returns: self (Matching) the fitted object

get_covariates_of_matches(s, t, covariates)[source]

Look up covariates of closest matches for a given matching.

Using self.match_df_ and the supplied covariates, look up the covariates of the last match. The function can only be called after match has been run.

Parameters

s (int) – source treatment value
t (int) – target treatment value
covariates (pd.DataFrame) – the same covariates which were passed to fit.

Returns

a DataFrame of size (n_matched_samples, n_covariates * 3 + 2) with the covariate values of the sample, covariates of its match, calculated distance and number of neighbors found within the given caliper (with no caliper this will equal self.n_neighbors).

Return type

covariate_df (pd.DataFrame)

match(X, a, use_cached_result=True, successful_matches_only=False)[source]

Matching the samples in X according to the treatment values in a.

Returns a DataFrame including all the results, which is also set as the attribute self.match_df_. The arguments X and a define the “needle” where the “haystack” is the data that was previously passed to fit, for matching with replacement. As such, treatment and control samples from within X will not be matched with each other, unless the same X and a were passed to fit. For matching without replacement, the X and a passed to match provide the “needle” and the “haystack”. If the attribute caliper is set, the matches are limited to those with a distance less than caliper.
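
The “needle” and “haystack” idea for matching with replacement can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions (a single scalar covariate, absolute difference as the distance, one target treatment group); the function match_with_replacement and its dictionary inputs are hypothetical and do not reflect how causallib implements the search.

```python
def match_with_replacement(needles, haystack, n_neighbors=1, caliper=None):
    """For each needle, find up to n_neighbors closest haystack samples.

    needles, haystack: {sample_id: scalar covariate}
    Hypothetical helper for illustration only.
    """
    result = {}
    for needle_id, needle_x in needles.items():
        # Rank every haystack sample by distance to this needle.
        ranked = sorted(
            (abs(needle_x - hay_x), hay_id) for hay_id, hay_x in haystack.items()
        )
        nearest = ranked[:n_neighbors]
        if caliper is not None:
            # Keep only matches closer than the caliper.
            nearest = [(d, i) for d, i in nearest if d < caliper]
        result[needle_id] = {
            "matches": [i for _, i in nearest],
            "distances": [d for d, _ in nearest],
        }
    return result


controls = {"c1": 0.10, "c2": 0.50, "c3": 0.90}  # haystack: fitted samples
treated = {"t1": 0.12, "t2": 0.14, "t3": 3.00}   # needles: samples to match
matches = match_with_replacement(treated, controls, n_neighbors=1, caliper=0.2)
```

Here c1 is selected for both t1 and t2, which is what “with replacement” permits, while t3 lies farther than the caliper from every control and therefore ends up with an empty match list.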

Parameters

X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
use_cached_result (bool) – Whether or not to return the match_df from the most recent matching operation. The cached result will only be used if the sample indices of X and those of match_df are identical, otherwise it will rematch.
successful_matches_only (bool) – Whether or not to filter the matches to those which matched successfully. If set to False, the resulting DataFrame will have shape (n * len(a.unique()), 2), otherwise it may have a smaller shape due to unsuccessful matches.

Note: The args are assumed to share the same index.

Returns

The resulting matches DataFrame is indexed so that match_df.loc[treatment_value, sample_id] has columns matches and distances containing lists of indices to samples and the respective distances for the matches discovered for sample_id from within the fitted samples with the given treatment_value. The indices in the matches column are from the fitted data, not the X argument in match. If sample_id had no match, match_df.loc[treatment_value, sample_id].matches = []. The DataFrame has shape (n * len(a.unique()), 2) if successful_matches_only is set to False.

Return type

match_df

Raises

NotImplementedError – Raised when with_replacement is False and n_neighbors is not 1.
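
Matching without replacement is described above as optimizing the linear sum of distances between pairs. The following brute-force sketch shows what that objective means on a tiny example. The function match_without_replacement, the scalar covariates, and the exhaustive search over permutations are assumptions of this illustration; a dedicated solver such as scipy.optimize.linear_sum_assignment is the practical choice for real data sizes.

```python
from itertools import permutations


def match_without_replacement(treated, controls):
    """Pair each treated sample with a distinct control, minimizing total distance.

    treated, controls: {sample_id: scalar covariate}, len(treated) <= len(controls).
    Exhaustive search, suitable only for very small inputs.
    Hypothetical helper for illustration only.
    """
    treated_ids = list(treated)
    best_pairs, best_cost = None, float("inf")
    # Try every way of assigning distinct controls to the treated samples.
    for chosen in permutations(controls, len(treated_ids)):
        cost = sum(
            abs(treated[t] - controls[c]) for t, c in zip(treated_ids, chosen)
        )
        if cost < best_cost:
            best_pairs, best_cost = dict(zip(treated_ids, chosen)), cost
    return best_pairs, best_cost


treated = {"t1": 0.40, "t2": 0.50}
controls = {"c1": 0.10, "c2": 0.45, "c3": 0.90}
pairs, total = match_without_replacement(treated, controls)
```

Both treated samples have c2 as their nearest control, but it can be used only once. The total distance is smallest when t2 keeps c2 and t1 takes c1, so the optimal assignment differs from a greedy nearest-neighbor pass, and only min(N_treatment, N_control) pairs are produced.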

matches_to_weights(match_df=None)[source]

Calculate weights based on a given set of matches.

For each matching from one treatment value to another, a weight vector is generated. The weights are calculated as the number of times a sample was selected in a matching, with each occurrence weighted according to the number of other samples in that matching. The weights can be used to estimate outcomes or to check covariate balancing. The function can only be called after match has been run.

Parameters

match_df (pd.DataFrame) – a DataFrame of matches returned from match. If not supplied, will use the match_df_ attribute if available, else raises ValueError. Will not execute match to generate a match_df.

Returns

DataFrame of shape (n,M) where M is the number of permutations of a.unique().

Return type

weights_df (pd.DataFrame)
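
One way to read the weighting rule described for matches_to_weights is that each matched sample receives a share of 1/k for every matching in which it appears, where k is the number of samples in that matching. The sketch below follows that reading for a single matching direction. It is an interpretation, not the library's code; the function weights_from_matches and its dictionary input are hypothetical.

```python
from collections import defaultdict


def weights_from_matches(matches):
    """Turn {source_id: [matched ids]} into a weight per matched sample.

    Each source spreads a total weight of 1 evenly across its matches.
    Hypothetical helper for illustration only.
    """
    weights = defaultdict(float)
    for matched_ids in matches.values():
        for matched_id in matched_ids:
            weights[matched_id] += 1.0 / len(matched_ids)
    return dict(weights)


# t1 matched two controls, t2 matched one, t3 found no match within the caliper.
matches = {"t1": ["c1", "c2"], "t2": ["c1"], "t3": []}
weights = weights_from_matches(matches)
```

Under this reading the weights sum to the number of source samples that were matched successfully, and samples that were never selected receive no weight.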

class causallib.estimation.matching.PropensityMatching(learner, **kwargs)[source]

Bases: causallib.estimation.matching.Matching

Matching on propensity score only.

This is a convenience class to execute the common task of propensity score matching. It shares all of the methods of the Matching class but offers a shortcut for initialization.

Parameters

learner (sklearn.estimator) – a trainable propensity model that implements fit and predict_proba. Will be passed to the PropensityTransformer object.
**kwargs – see Matching.__init__ for supported kwargs.

causallib.estimation.matching.majority_rule(x)[source]
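
majority_rule is the default classify_agg_function. As a hedged sketch of what a majority-rule aggregator does, the example below returns the most frequent label among the matched neighbors. The name majority_vote and the tie-breaking behaviour (first label encountered wins) are assumptions of this example and may differ from the library function.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first.

    Hypothetical stand-in for a majority-rule aggregator.
    """
    return Counter(labels).most_common(1)[0][0]


binary_result = majority_vote([1, 0, 1])
string_result = majority_vote(["a", "b", "b"])
```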