Module causallib.evaluation

This submodule evaluates the performance of the estimation models defined in causallib.estimation.

The intended usage is to call evaluate from causallib.evaluation to generate an EvaluationResults object. If the cross-validation parameter cv is not supplied, a simple evaluation without cross-validation is performed. The returned object can generate various plots, accessible by name (see the docs) or all at once via plot_all(). It also includes the model’s predictions, the evaluated metrics, the fitted models as models, and a copy of the original data as X, a, and y.

If the cv parameter is set to "auto", evaluate generates a k-fold cross-validation with train and validation phases, refitting the model k times, with k=5. Other options for customizing cross-validation are also supported; see the docs. The EvaluationResults will also contain, in cv, the list of train/test split indices used by cross-validation.
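As an illustration of a custom cross-validation, the parameter reference for evaluate describes cv as a list of (train_idx, validation_idx) tuples of row positions. The sketch below builds such a list by hand in plain Python. The helper name make_kfold_splits is hypothetical and not part of causallib, and the folds are simple interleaved ones rather than the stratified folds that cv="auto" is documented to create.

```python
def make_kfold_splits(n_rows, n_folds):
    """Return a list of (train_idx, valid_idx) tuples of row positions."""
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    rows = list(range(n_rows))
    splits = []
    for fold in range(n_folds):
        # Every n_folds-th row, starting at `fold`, is held out for validation.
        valid_idx = rows[fold::n_folds]
        held_out = set(valid_idx)
        train_idx = [i for i in rows if i not in held_out]
        splits.append((train_idx, valid_idx))
    return splits


splits = make_kfold_splits(n_rows=10, n_folds=5)
# Assumed usage (not run here): evaluate(ipw, data.X, data.a, data.y, cv=splits)
```

Each row lands in the validation set of exactly one fold, so the validation phases together cover the whole dataset once.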

Example: Inverse probability weighting

An IPW method with logistic regression can be evaluated in cross-validation using

from sklearn.linear_model import LogisticRegression
from causallib.estimation import IPW
from causallib.datasets.data_loader import fetch_smoking_weight
from causallib.evaluation import evaluate

data = fetch_smoking_weight()

model = LogisticRegression()
ipw = IPW(learner=model)
ipw.fit(data.X, data.a, data.y)
res = evaluate(ipw, data.X, data.a, data.y, cv="auto")


This refits the model on each fold. Calling res.plot_all() then creates evaluation plots showing the performance on both the training and validation data.

# {'weight_distribution', 'pr_curve', 'covariate_balance_love', 'roc_curve', 'calibration', 'covariate_balance_slope'}
res.plot_covariate_balance(kind="love", phase="valid")

Submodule structure

This section is intended for future contributors and those seeking to customize the evaluation logic.

To generate predictions, the evaluate function instantiates a Predictor object, which handles refitting and generating the necessary predictions for the different models. The results are stored in predictions objects. Metrics are simple functions and do not depend on the structure of these objects. They are applied to the individual predictions via the scoring functions. The results of the predictors and scorers across multiple phases and folds are combined in the EvaluationResults object.

evaluation.plots submodule structure

In order to generate the correct plots from the EvaluationResults objects, we need PlotDataExtractor objects. These objects are responsible for extracting the correct data for a given plot from EvaluationResults, and they are defined under plots/. Plotting is exposed as member functions of EvaluationResults objects through the plotter mixins, which are also defined under plots/. When an EvaluationResults object is produced by evaluate, the EvaluationResults.make factory ensures that it has the correct extractors and plotting mixins.

Finally, plots/ also contains a number of methods for aggregating and combining data to produce curves for ROC, PR and calibration plots, as well as the individual plotting functions.

How to add a new plot

If there is a model evaluation plot that you would like to add to the codebase, you must first determine which models it is relevant for. For example, a confusion matrix makes sense for a classification task but not for continuous outcome prediction or sample weight calculation.

Currently, the types of models are

  • Individual outcome predictions (continuous outcome)

  • Individual outcome predictions (binary outcome)

  • Sample weight predictions

  • Propensity predictions

Propensity predictions combine binary individual outcome predictions (because “is treated” is a binary feature) with sample weight predictions. Something like a confusion matrix would make sense for binary outcome predictions and for propensity predictions, but not for the other categories. In that sense it would behave like the ROC and PR curves, which are already implemented.
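To make the confusion-matrix example concrete, the following sketch counts the four cells from a binary target and predicted scores. It is a plain-Python illustration rather than causallib code: the function name confusion_counts and the 0.5 threshold are assumptions. For propensity predictions, the "truth" would be the treatment assignment a and the scores would be the estimated propensities.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives after thresholding the scores."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and truth:
            counts["tp"] += 1
        elif predicted and not truth:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy data: treatment assignment and estimated propensity scores.
a = [1, 0, 1, 0]
propensity = [0.9, 0.6, 0.3, 0.2]
cells = confusion_counts(a, propensity)
```

The same computation has no natural meaning for continuous outcomes or for bare sample weights, which is why such a plot would be limited to the binary and propensity categories.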

Assuming you want to add a new plot, you would first add the basic plotting function under plots/. Then you would add a case to the relevant extractors’ get_data_for_plot members, also under plots/, to extract the data for the plot based on its name. You would also add the name as an available plot in the relevant frozenset and in the lookup_name function, both under plots/. At this point, the plot should be drawn automatically when you run plot_all on the relevant EvaluationResults object. To expose the plot as a member plot_my_new_plot, you must add it to the correct mixin under plots/.
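The steps above can be sketched with a toy version of the same pattern: a plotting function, a frozenset of plot names, a lookup_name function, an extractor with get_data_for_plot, and a mixin that exposes the plot as a member. Every class and constant below is a hypothetical stand-in written for this illustration. None of it is causallib's actual code, and the "plot" is rendered as text so that the sketch needs no plotting backend.

```python
from collections import Counter

# Hypothetical registry of available plot names for binary predictions.
TOY_BINARY_PLOTS = frozenset({"confusion_matrix"})


def draw_confusion_matrix(counts):
    """Basic 'plotting' function; returns text instead of drawing a figure."""
    return "TP={tp} FP={fp} FN={fn} TN={tn}".format(**counts)


def lookup_name(name):
    """Map a plot name to the function that draws it."""
    return {"confusion_matrix": draw_confusion_matrix}[name]


class ToyBinaryExtractor:
    """Extracts, by plot name, the data a plot needs from stored predictions."""

    plot_names = TOY_BINARY_PLOTS

    def __init__(self, y_true, y_pred):
        self.y_true, self.y_pred = list(y_true), list(y_pred)

    def get_data_for_plot(self, plot_name):
        if plot_name == "confusion_matrix":
            pairs = Counter(zip(self.y_true, self.y_pred))
            return {"tp": pairs[(1, 1)], "fp": pairs[(0, 1)],
                    "fn": pairs[(1, 0)], "tn": pairs[(0, 0)]}
        raise ValueError(f"Unsupported plot: {plot_name}")


class ToyBinaryPlotterMixin:
    """Exposes the plot as a named member function."""

    def plot_confusion_matrix(self):
        return self._make_plot("confusion_matrix")


class ToyResults(ToyBinaryPlotterMixin):
    def __init__(self, extractor):
        self.extractor = extractor

    def _make_plot(self, name):
        data = self.extractor.get_data_for_plot(name)
        return lookup_name(name)(data)

    def plot_all(self):
        # Every registered name is drawn without further wiring.
        return {name: self._make_plot(name) for name in sorted(self.extractor.plot_names)}


results = ToyResults(ToyBinaryExtractor(y_true=[1, 0, 1, 0], y_pred=[1, 1, 0, 0]))
```

In this toy, registering the name is enough for plot_all to pick the plot up, while the mixin method is only needed for the convenience of calling results.plot_confusion_matrix() directly.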



Module contents

Objects and methods to evaluate accuracy of causal models.

causallib.evaluation.evaluate(estimator, X, a, y, cv=None, metrics_to_evaluate='defaults', plots=False)[source]

Evaluate model in cross-validation of the provided data

  • estimator (causallib.estimation.base_estimator.IndividualOutcomeEstimator | causallib.estimation.base_weight.WeightEstimator | causallib.estimation.base_weight.PropensityEstimator) – An estimator. If using cv, it will be refit; otherwise it should already be fit.

  • X (pd.DataFrame) – Covariates.

  • a (pd.Series) – Treatment assignment.

  • y (pd.Series) – Outcome.

  • cv (list[tuples] | generator[tuples] | "auto" | None) – A list (or generator) with one tuple of indices (train_idx, validation_idx) per fold, given in an iloc manner (row number). If None, there will be no cross-validation. If cv="auto", a stratified k-fold with 5 folds will be created and used for cross-validation.

  • metrics_to_evaluate (dict | "defaults" | None) – key: metric’s name, value: callable that receives true labels, prediction, and sample_weights (the latter may be ignored). If “defaults”, default metrics are selected. If None, no metrics are evaluated.

  • plots (bool) – whether to generate plots
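The metrics_to_evaluate dictionary described above maps metric names to callables. The sketch below shows two such callables in plain Python: one that uses the sample weights and one that accepts but ignores them. The function names are hypothetical, and naming the third argument sample_weight (so it can be passed positionally or by keyword) is an assumption about how the callable is invoked.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of samples whose label is predicted correctly."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


def error_rate(y_true, y_pred, sample_weight=None):
    """Unweighted fraction of wrong predictions; sample_weight is ignored."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Key: metric name, value: callable, as the parameter description requires.
custom_metrics = {"weighted_accuracy": weighted_accuracy, "error_rate": error_rate}
# Assumed usage (not run here):
# evaluate(ipw, data.X, data.a, data.y, metrics_to_evaluate=custom_metrics)
```

Both functions assume hard class labels as predictions; a metric meant for predicted probabilities would need a different body.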



causallib.evaluation.evaluate_bootstrap(estimator, X, a, y, n_bootstrap, n_samples=None, replace=True, refit=False, metrics_to_evaluate=None)[source]

Evaluate model on a bootstrap sample of the provided data

  • X (pd.DataFrame) – Covariates.

  • a (pd.Series) – Treatment assignment.

  • y (pd.Series) – Outcome.

  • n_bootstrap (int) – Number of bootstrap samples to create.

  • n_samples (int | None) – Number of samples to draw in each bootstrap sampling. If None, the number of samples (first dimension) of the data is used.

  • replace (bool) – Whether to sample with replacement. If False, n_samples (if provided) should be smaller than X.shape[0].

  • refit (bool) – Whether to refit the estimator on each bootstrap sample. Can be computationally intensive if n_bootstrap is large.

  • metrics_to_evaluate (dict | None) – key: metric’s name, value: callable that receives true labels, prediction and sample_weights (the latter is allowed to be ignored). If not provided, the defaults from causallib.evaluation.metrics are used.
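To illustrate how n_bootstrap, n_samples and replace interact, the sketch below draws the row positions for each bootstrap sample using only the standard library. It shows the sampling idea only, not how evaluate_bootstrap is implemented; the helper bootstrap_indices and its seed argument are hypothetical.

```python
import random


def bootstrap_indices(n_rows, n_bootstrap, n_samples=None, replace=True, seed=0):
    """Return n_bootstrap lists of row positions drawn from range(n_rows)."""
    if n_samples is None:
        n_samples = n_rows  # default: as many draws as there are rows
    if not replace and n_samples > n_rows:
        raise ValueError("cannot draw more rows than exist without replacement")
    rng = random.Random(seed)
    rows = range(n_rows)
    samples = []
    for _ in range(n_bootstrap):
        if replace:
            idx = rng.choices(rows, k=n_samples)  # rows may repeat
        else:
            idx = rng.sample(rows, k=n_samples)   # rows are unique
        samples.append(idx)
    return samples


with_replacement = bootstrap_indices(n_rows=5, n_bootstrap=3)
without_replacement = bootstrap_indices(n_rows=5, n_bootstrap=2, n_samples=3, replace=False)
```

With replace=False and n_samples equal to the number of rows, every sample would be a mere permutation of the data, which is why a smaller n_samples is recommended in that case.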

