causallib.estimation.matching module
- class causallib.estimation.matching.KNN(learner, index)
Bases:
tuple
Create new instance of KNN(learner, index)
- index
Alias for field number 1
- learner
Alias for field number 0
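The field aliases above are those of a named tuple. A minimal stand-in, assuming only what is documented here (two fields, learner at position 0 and index at position 1), behaves as follows. The placeholder values are hypothetical; what the library actually stores in each field is not stated in this section.

```python
from collections import namedtuple

# Stand-in with the same two fields, in the documented order.
KNN = namedtuple("KNN", ["learner", "index"])

# Placeholder values, chosen only for illustration.
knn = KNN(learner="fitted-search-object", index=[10, 11, 12])

learner, index = knn      # unpacks like any tuple
first_field = knn[0]      # positional access returns the same object as knn.learner
```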
- class causallib.estimation.matching.Matching(propensity_transform=None, caliper=None, with_replacement=True, n_neighbors=1, matching_mode='both', metric='mahalanobis', knn_backend='sklearn', estimate_observed_outcome=False)[source]
Bases:
causallib.estimation.base_estimator.IndividualOutcomeEstimator, causallib.estimation.base_weight.WeightEstimator
Match treatment and control samples with similar covariates.
- Parameters
propensity_transform (causallib.transformers.PropensityTransformer) – an object for data preprocessing which adds the propensity score as a feature (default: None)
caliper (float) – maximal distance for a match to be accepted. If not defined, all matches will be accepted. If defined, some samples may not be matched and their outcomes will not be estimated. (default: None)
with_replacement (bool) – whether samples can be used multiple times for matching. If set to False, the matching process will optimize the linear sum of distances between pairs of treatment and control samples and only min(N_treatment, N_control) samples will be estimated. Matching with no replacement does not make use of the fit data and is therefore not implemented for out-of-sample data (default: True)
n_neighbors (int) – number of nearest neighbors to include in match. Must be 1 if with_replacement is False. If larger than 1, the estimate is calculated using the regress_agg_function or classify_agg_function across the n_neighbors. Note that when the caliper variable is set, some samples will have fewer than n_neighbors matches. (default: 1).
matching_mode (str) – Direction of matching: “treatment_to_control”, “control_to_treatment” or “both”, indicating which set should be matched to which. All sets are cross-matched in match, and when with_replacement is False all matching modes coincide; the modes differ only when matching with replacement. (default: “both”)
metric (str) – Distance metric string for calculating the distance between samples. Note: if an externally built knn_backend object with a different metric is supplied, metric must be changed to reflect that, because Matching will set its inverse covariance matrix if “mahalanobis” is set. (default: “mahalanobis”, also supported: “euclidean”)
knn_backend (str or callable) – Backend to use for nearest neighbor search. Options are “sklearn” or a callable which returns an object implementing fit, kneighbors and set_params like the sklearn NearestNeighbors object. (default: “sklearn”).
estimate_observed_outcome (bool) – Whether to allow a match of a sample to a sample other than itself when looking within its own treatment value. If True, the estimated potential outcome for the observed outcome may differ from the true observed outcome. (default: False)
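How caliper and n_neighbors interact when matching with replacement can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name match_with_caliper, the 1-D absolute distance, and the dictionary inputs are all assumptions made for the example.

```python
def match_with_caliper(needles, haystack, n_neighbors=1, caliper=None):
    """Return {needle_id: [(haystack_id, distance), ...]} using 1-D absolute distance."""
    matches = {}
    for nid, x in needles.items():
        # Rank every haystack sample by distance to this needle.
        ranked = sorted((abs(x - h), hid) for hid, h in haystack.items())
        # Keep the closest n_neighbors, then drop any at or beyond the caliper.
        kept = [(hid, d) for d, hid in ranked[:n_neighbors]
                if caliper is None or d < caliper]
        matches[nid] = kept
    return matches


treated = {"t1": 0.30, "t2": 0.90}
control = {"c1": 0.25, "c2": 0.40, "c3": 0.50}

# Without a caliper every treated sample gets exactly n_neighbors matches.
no_caliper = match_with_caliper(treated, control, n_neighbors=2)
# With a caliper some samples get fewer matches, or none at all.
with_caliper = match_with_caliper(treated, control, n_neighbors=2, caliper=0.08)
```

Because control samples are never removed from the haystack, the same control sample may be matched to several treated samples, which is what with_replacement=True allows.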
- classify_agg_function
Aggregating function for outcome estimation when classifying. Usage is determined by the type of y during fit. (default: majority_rule)
- Type
callable
- regress_agg_function
Aggregating function for outcome estimation when regressing or predicting prob_a. Usage is determined by the type of y during fit. (default: np.mean)
- Type
callable
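The two aggregators reduce the outcomes of a sample's n_neighbors matches to a single estimate. A rough sketch of the idea follows; majority_vote is a hypothetical stand-in for the library's majority_rule (whose tie-breaking behavior is not documented here), and statistics.mean stands in for np.mean.

```python
from collections import Counter
from statistics import mean


def majority_vote(values):
    """Return the most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


# Outcomes of three matched neighbors in a classification task.
neighbor_labels = [1, 0, 1]
predicted_label = majority_vote(neighbor_labels)

# Outcomes of three matched neighbors in a regression task.
neighbor_values = [2.0, 4.0, 6.0]
predicted_value = mean(neighbor_values)
```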
- treatments_
DataFrame of treatments (created after fit)
- Type
pd.DataFrame
- outcomes_
DataFrame of outcomes (created after fit)
- Type
pd.DataFrame
- match_df_
Dataframe of most recently calculated matches. For details, see match. (created after match)
- Type
pd.DataFrame
- samples_used_
Series with count of samples used during most recent match. Series includes a count for each treatment value. (created after match)
- Type
pd.Series
- compute_weight_matrix(X, a, use_stabilized=None, **kwargs)[source]
Computes individual weights across all possible treatment values: f(Pr[A=a_j | X_i]) for every individual i and treatment j.
- Parameters
X (pd.DataFrame) – Covariate matrix of size (num_subjects, num_features).
a (pd.Series) – Treatment assignment of size (num_subjects,).
use_stabilized (bool) – Whether to re-weigh the learned weights with the prevalence of the treatment. This overrides the use_stabilized parameter provided at initialization. See Also: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351790/#S6title
**kwargs –
- Returns
A matrix of size (num_subjects, num_treatments) with a weight for every individual and every treatment.
- Return type
pd.DataFrame
- compute_weights(X, a, treatment_values=None, use_stabilized=None, **kwargs)[source]
Calculate weights based on a given set of matches.
Only applicable for matching_mode “control_to_treatment” or “treatment_to_control”.
- Parameters
X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
treatment_values – IGNORED.
use_stabilized – IGNORED.
**kwargs –
- Returns
a Series of shape (n,) with a weight per sample.
- Return type
pd.Series
- Raises
ValueError – if Matching().matching_mode == 'both'.
- estimate_individual_outcome(X, a, y=None, treatment_values=None, predict_proba=True, dropna=True)[source]
Calculate the potential outcome for each sample and treatment value.
Execute match and calculate, for each treatment value and each sample, the expected outcome.
Note: Out of sample estimation for matching without replacement requires passing a y vector here. If no ‘y’ is passed here, the values received by fit are used, and if the estimation indices are not a subset of the fitted indices, the estimation will fail.
If the attribute estimate_observed_outcome is True, estimates will be calculated for the observed outcomes as well. If not, then the observed outcome will be passed through from the corresponding element of y passed to fit.
- Parameters
X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcome values for n samples. This is only used when with_replacement=False. Otherwise, the outcome values passed to fit are used.
predict_proba (bool) – whether to output classifications or probabilities for a classification task. If set to False and data is non-integer, a warning is issued. (default: True)
dropna (bool) – For samples that were unmatched due to caliper restrictions, drop from outcome_df leading to a potentially smaller sized output, or include them as NaN. (default: True)
treatment_values – IGNORED
Note: The args are assumed to share the same index.
- Returns
outcome_df (pd.DataFrame)
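The estimation step described above, with estimate_observed_outcome left at False, can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: estimate_outcomes and its dictionary-based inputs are hypothetical, the mean is used as the aggregator, and None stands in for NaN.

```python
from statistics import mean


def estimate_outcomes(a, y, matches, treatment_values, dropna=True):
    """matches[(treatment_value, sample_id)] -> list of matched sample ids."""
    rows = {}
    for sid in a:
        row = {}
        for t in treatment_values:
            if t == a[sid]:
                # Observed outcome is passed through unchanged.
                row[t] = y[sid]
            else:
                # Counterfactual outcome is aggregated over the matched neighbors.
                neighbors = matches.get((t, sid), [])
                row[t] = mean(y[n] for n in neighbors) if neighbors else None
        rows[sid] = row
    if dropna:
        # Drop samples that have no estimate for some treatment value.
        rows = {sid: r for sid, r in rows.items() if None not in r.values()}
    return rows


a = {"s1": 1, "s2": 0, "s3": 0, "s4": 1}
y = {"s1": 10.0, "s2": 3.0, "s3": 5.0, "s4": 8.0}
matches = {
    (0, "s1"): ["s2", "s3"],
    (1, "s2"): ["s1"],
    (1, "s3"): ["s4"],
    (0, "s4"): [],  # unmatched, e.g. because of a caliper
}

full = estimate_outcomes(a, y, matches, treatment_values=[0, 1])
with_gaps = estimate_outcomes(a, y, matches, treatment_values=[0, 1], dropna=False)
```

With dropna=True the unmatched sample s4 is absent from the result; with dropna=False it is kept with a missing value for the treatment it could not be matched to.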
- fit(X, a, y, sample_weight=None)[source]
Load the treatments and outcomes and fit search trees.
Applies transform to covariates X, initializes search trees for each treatment value for performing nearest neighbor searches. Note: Running fit a second time overwrites any information from previous fit or match and re-fits the propensity_transform object.
- Parameters
X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
y (pd.Series) – Series of shape (n,) containing outcomes for the n samples.
sample_weight – IGNORED In signature for compatibility with other estimators.
Note: X, a and y must share the same index.
- Returns
self (Matching) the fitted object
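Conceptually, fitting one search structure per treatment value starts by partitioning the fitted samples by treatment. A minimal sketch of that partitioning step, assuming dictionary inputs and a hypothetical helper name split_by_treatment (the library's internal data structures are not described in this section):

```python
from collections import defaultdict


def split_by_treatment(X, a):
    """Group covariate rows by treatment value: {treatment_value: {sample_id: row}}."""
    groups = defaultdict(dict)
    for sid, row in X.items():
        groups[a[sid]][sid] = row
    return dict(groups)


X = {"s1": (0.1, 2.0), "s2": (0.4, 1.0), "s3": (0.3, 1.5)}
a = {"s1": 1, "s2": 0, "s3": 0}

# One group of candidate matches per treatment value.
haystacks = split_by_treatment(X, a)
```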
- get_covariates_of_matches(s, t, covariates)[source]
Look up covariates of closest matches for a given matching.
Using self.match_df_ and the supplied covariates, look up the covariates of the last match. The function can only be called after match has been run.
- Parameters
s (int) – source treatment value
t (int) – target treatment value
covariates (pd.DataFrame) – The same covariates which were passed to fit.
- Returns
covariate_df (pd.DataFrame) – a DataFrame of size (n_matched_samples, n_covariates * 3 + 2) with the covariate values of the sample, covariates of its match, calculated distance and number of neighbors found within the given caliper (with no caliper this will equal self.n_neighbors)
- match(X, a, use_cached_result=True, successful_matches_only=False)[source]
Match the samples in X according to the treatment values in a.
Returns a DataFrame including all the results, which is also set as the attribute self.match_df_. For matching with replacement, the arguments X and a define the “needle”, and the “haystack” is the data that was previously passed to fit. As such, treatment and control samples from within X will not be matched with each other unless the same X and a were passed to fit. For matching without replacement, the X and a passed to match provide both the “needle” and the “haystack”. If the attribute caliper is set, the matches are limited to those with a distance less than caliper.
- Parameters
X (pd.DataFrame) – DataFrame of shape (n,m) containing m covariates for n samples.
a (pd.Series) – Series of shape (n,) containing discrete treatment values for the n samples.
use_cached_result (bool) – Whether or not to return the match_df from the most recent matching operation. The cached result will only be used if the sample indices of X and those of match_df are identical, otherwise it will rematch.
successful_matches_only (bool) – Whether or not to filter the matches to those which matched successfully. If set to False, the resulting DataFrame will have shape (n* len(a.unique()), 2 ), otherwise it may have a smaller shape due to unsuccessful matches.
Note: The args are assumed to share the same index.
- Returns
The resulting matches DataFrame is indexed so that match_df.loc[treatment_value, sample_id] has columns matches and distances containing lists of indices to samples and the respective distances for the matches discovered for sample_id from within the fitted samples with the given treatment_value. The indices in the matches column are from the fitted data, not the X argument in match. If sample_id had no match, match_df.loc[treatment_value, sample_id].matches = []. The DataFrame has shape (n * len(a.unique()), 2) if successful_matches_only is set to False.
- Return type
match_df
- Raises
NotImplementedError – Raised when with_replacement is False and n_neighbors is not 1.
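Matching without replacement pairs each sample with at most one partner while minimizing the total distance over all pairs. The sketch below shows that objective with a brute-force search over assignments, which is practical only for tiny inputs and is not how the library solves it. The function name, the 1-D absolute distance, and the assumption that there are no more treated than control samples are all choices made for this example.

```python
from itertools import permutations


def match_without_replacement(treated, control):
    """Brute-force minimum-total-distance pairing; assumes len(treated) <= len(control)."""
    t_ids, c_ids = list(treated), list(control)
    best_cost, best_pairs = None, None
    # Try every way of assigning a distinct control sample to each treated sample.
    for perm in permutations(c_ids, len(t_ids)):
        cost = sum(abs(treated[t] - control[c]) for t, c in zip(t_ids, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, dict(zip(t_ids, perm))
    return best_pairs


treated = {"t1": 0.5, "t2": 0.6}
control = {"c1": 0.55, "c2": 0.9, "c3": 0.1}

pairs = match_without_replacement(treated, control)
```

Both treated samples are closest to c1, but only one of them can take it, so the other is paired with its next-best option. Only min(N_treatment, N_control) pairs are formed, and c3 is left unmatched.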
- matches_to_weights(match_df=None)[source]
Calculate weights based on a given set of matches.
For each matching from one treatment value to another, a weight vector is generated. The weights are calculated as the number of times a sample was selected in a matching, with each occurrence weighted according to the number of other samples in that matching. The weights can be used to estimate outcomes or to check covariate balancing. The function can only be called after match has been run.
- Parameters
match_df (pd.DataFrame) – a DataFrame of matches returned from match. If not supplied, will use the match_df_ attribute if available, else raises ValueError. Will not execute match to generate a match_df.
- Returns
DataFrame of shape (n, M) where M is the number of permutations of a.unique().
- Return type
weights_df (pd.DataFrame)
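One reading of the weighting rule above is that a sample matched together with k - 1 other samples receives 1/k for that occurrence, and occurrences are summed. The sketch below implements that reading for a single matching direction. It is an interpretation offered for illustration, with a hypothetical function name and input layout, and may differ from the library's exact computation.

```python
from collections import defaultdict


def weights_from_matches(matches):
    """matches: {source_id: [matched_id, ...]} for one direction of matching."""
    weights = defaultdict(float)
    for source_id, matched in matches.items():
        for m in matched:
            # Each occurrence is shared equally among the samples in that matching.
            weights[m] += 1.0 / len(matched)
    return dict(weights)


# t1 matched two control samples, t2 matched one, t3 matched none.
matches = {"t1": ["c1", "c2"], "t2": ["c1"], "t3": []}
w = weights_from_matches(matches)
```

Here c1 is selected twice (once shared with c2, once alone) and c2 once, so c1 ends up with three times the weight of c2.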
- class causallib.estimation.matching.PropensityMatching(learner, **kwargs)[source]
Bases:
causallib.estimation.matching.Matching
Matching on propensity score only.
This is a convenience class to execute the common task of propensity score matching. It shares all of the methods of the Matching class but offers a shortcut for initialization.
- Parameters
learner (sklearn.estimator) – a trainable propensity model that implements fit and predict_proba. Will be passed to the PropensityTransformer object.
**kwargs – see Matching.__init__ for supported kwargs.