causallib.contrib.bicause_tree.BICauseTree

class BICauseTree(outcome_model=None, individual=False, asmd_violation_threshold=0.1, min_leaf_size=0, min_treat_group_size=0, min_split_size=0, max_depth=10, stopping_criterion=<function default_stopping_criterion>, max_splitting_values=50, multiple_hypothesis_test_method='holm', multiple_hypothesis_test_alpha=0.1, positivity_filtering_method=<function prevalence_symmetric>, positivity_filtering_kwargs=None)

A causal effect estimator built on a tree that recursively stratifies the covariate space to balance covariates between the treated and untreated groups.

Parameters:

outcome_model (Union[IndividualOutcomeEstimator, PopulationOutcomeEstimator]) – An outcome model for generating counterfactual predictions at each leaf node of the tree. Defaults to a simple MarginalOutcomeEstimator, which takes the average outcome of each treatment group in each leaf. Any other outcome model may be supplied to further adjust for the covariates, since the tree's stratification might leave some residual bias.
individual (bool) – If True (and if outcome_model has estimate_individual_outcome), generates individual-level predictions for observations within each leaf. Otherwise, each observation takes the average outcome of its leaf (using the estimate_population_outcome method).
asmd_violation_threshold (float) – The value of Absolute Standardized Mean Difference below which a subgroup is considered balanced.
min_leaf_size (int) – The minimum number of samples required in a leaf node.
min_treat_group_size (int) – The minimum number of samples required in every treatment group to split an internal node.
min_split_size (int) – The minimum number of samples required to split an internal node.
max_depth (int) – The maximum depth of the tree. Will be updated for each level of nodes as the tree grows.
stopping_criterion (callable) – A function that takes the node/subtree as well as the data (X, a) and returns True to stop splitting the tree or False to continue splitting.
max_splitting_values (int) – The maximal number of unique values to consider when splitting a single feature
multiple_hypothesis_test_method (str) – The method for correcting p-values in multiple hypothesis testing. Should be a method accepted by statsmodels' multipletests (for example, 'holm').
multiple_hypothesis_test_alpha (float) – The alpha level used when correcting p-values in multiple hypothesis testing. Should be compatible with statsmodels' multipletests.
positivity_filtering_method (callable) – A function that takes the current node (or subtree) as well as the arbitrary kwargs from positivity_filtering_kwargs and returns a list of the leaves/nodes’ indices that do not violate positivity.
positivity_filtering_kwargs – Keyword arguments to call the positivity_filtering_method with.
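
To make the role of asmd_violation_threshold concrete, here is a minimal pure-Python sketch of an absolute standardized mean difference check. It assumes a common ASMD definition (absolute mean difference divided by the square root of the averaged group variances) and assumes a subgroup is balanced when every covariate falls below the threshold. The helper names asmd and is_balanced are hypothetical, and the library's own computation may differ.

```python
import math
import statistics


def asmd(treated, control):
    """Absolute standardized mean difference of one covariate between two groups."""
    mean_diff = statistics.mean(treated) - statistics.mean(control)
    # Assumption: pool the two sample variances by simple averaging.
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    if pooled_sd == 0:
        return 0.0 if mean_diff == 0 else math.inf
    return abs(mean_diff) / pooled_sd


def is_balanced(covariates_treated, covariates_control, threshold=0.1):
    """Assumption: balanced means every covariate's ASMD is below the threshold."""
    return all(
        asmd(covariates_treated[name], covariates_control[name]) < threshold
        for name in covariates_treated
    )


# Toy subgroup: age differs sharply between arms, while blood pressure
# has the same mean in both arms.
treated = {"age": [60, 62, 65, 70], "bp": [120, 125, 130, 135]}
control = {"age": [40, 45, 50, 52], "bp": [121, 124, 131, 134]}

age_asmd = asmd(treated["age"], control["age"])
bp_asmd = asmd(treated["bp"], control["bp"])
balanced = is_balanced(treated, control, threshold=0.1)
```

In this toy subgroup the age covariate alone violates the threshold, so the subgroup would be a candidate for further splitting.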

fit(X, a, y, sample_weight=None)

Build the BICauseTree partition.

Parameters:

X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
sample_weight – Ignored.

Returns:

(class) BICauseTree

apply(X)

Get node assignments based on the BICauseTree partition.

Parameters:

X (pandas.DataFrame) – The feature matrix of all samples

Returns:

(pandas.Series) A vector of node indices indexed according to X
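
As a rough illustration of what a node assignment means, the sketch below walks each sample down a small tree of threshold splits and records the index of the leaf it lands in. The nested-dict tree layout, the feature names, and the assign_leaf helper are all hypothetical and are not the library's internal representation.

```python
def assign_leaf(node, row):
    """Walk from the root to a leaf and return that leaf's index."""
    while "leaf_index" not in node:
        # Go left when the sample's value is at or below the split threshold.
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf_index"]


# Hypothetical fitted partition: split on age first, then on bmi.
tree = {
    "feature": "age",
    "threshold": 50,
    "left": {"leaf_index": 1},
    "right": {
        "feature": "bmi",
        "threshold": 30,
        "left": {"leaf_index": 2},
        "right": {"leaf_index": 3},
    },
}

samples = {
    "p1": {"age": 42, "bmi": 27},
    "p2": {"age": 63, "bmi": 24},
    "p3": {"age": 71, "bmi": 33},
}

# One node index per sample, keyed the same way as the input.
node_assignments = {
    sample_id: assign_leaf(tree, row) for sample_id, row in samples.items()
}
```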

estimate_population_outcome(X, a, y=None, discard_violating_samples=True, agg_func='mean')

Estimates outcomes at a population-level and assigns them to individuals in X

Parameters:

X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
discard_violating_samples (boolean) – Whether to drop NA values (from positivity-violating samples) in the individual outcomes before aggregating.
agg_func – Aggregation function used to go from individual outcome estimates to population outcome estimates.

Returns:

A vector of potential outcomes indexed according to X
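
The aggregation step described by discard_violating_samples and agg_func can be sketched in plain Python as follows. This is only an illustration under assumptions: the population_outcomes helper is hypothetical, individual outcomes are modeled as one dict per sample keyed by treatment value, and the result is assumed to hold one aggregated value per treatment value.

```python
import math
import statistics


def population_outcomes(individual_outcomes, discard_violating_samples=True,
                        agg_func=statistics.mean):
    """Aggregate per-sample potential outcomes into one value per treatment."""
    rows = individual_outcomes
    if discard_violating_samples:
        # Drop samples that carry NaN predictions (e.g. positivity violations).
        rows = [r for r in rows if not any(math.isnan(v) for v in r.values())]
    treatments = individual_outcomes[0].keys()
    return {t: agg_func([r[t] for r in rows]) for t in treatments}


# Per-sample potential outcomes under treatment 0 and treatment 1;
# the last sample has no valid prediction.
individual = [
    {0: 1.0, 1: 3.0},
    {0: 2.0, 1: 4.0},
    {0: math.nan, 1: math.nan},
]

result = population_outcomes(individual)
median_result = population_outcomes(individual, agg_func=statistics.median)
```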

estimate_individual_outcome(X, a, y=None, same_dim_as_input=True)

Estimate individual-level counterfactual predictions.

If self.individual is True and self.outcome_model has estimate_individual_outcome(), each observation gets its own counterfactual value. Otherwise, each individual gets the average prediction of its node, and this function reshapes those node-level averages into one prediction per observation.

Parameters:

X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
same_dim_as_input (boolean) – If True, return NaN values for positivity-violating observations; if False, exclude them.

Returns:

A matrix of individual outcomes indexed according to X

Return type:

pandas.DataFrame
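
The node-average case and the effect of same_dim_as_input can be sketched as below. This is a simplified stand-in under assumptions: the leaf_average_outcomes helper, the precomputed per-leaf means, and the set of positivity-violating leaves are hypothetical inputs, not library objects.

```python
import math


def leaf_average_outcomes(leaf_of, leaf_means, violating_leaves,
                          same_dim_as_input=True):
    """Give every observation the per-treatment average outcome of its leaf."""
    predictions = {}
    for sample_id, leaf in leaf_of.items():
        if leaf in violating_leaves:
            if same_dim_as_input:
                # Keep the row so the output lines up with the input.
                predictions[sample_id] = {0: math.nan, 1: math.nan}
            continue
        predictions[sample_id] = dict(leaf_means[leaf])
    return predictions


leaf_of = {"p1": 1, "p2": 1, "p3": 2, "p4": 3}          # sample -> leaf index
leaf_means = {1: {0: 0.2, 1: 0.5}, 2: {0: 0.4, 1: 0.9}}  # leaf -> mean outcome per treatment
violating_leaves = {3}                                   # leaves that violate positivity

full = leaf_average_outcomes(leaf_of, leaf_means, violating_leaves)
trimmed = leaf_average_outcomes(
    leaf_of, leaf_means, violating_leaves, same_dim_as_input=False
)
```

Samples p1 and p2 share a leaf and therefore receive identical predictions, while p4 is either filled with NaN or dropped depending on the flag.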

fit_outcome_models(X, a, y)

Fits causal models to the nodes of a fitted (already grown) tree.

Parameters:

X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values

Returns:

dict[int, Union[IndividualOutcomeEstimator, PopulationOutcomeEstimator]]
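
The returned mapping from node index to fitted model can be pictured with the sketch below. MarginalMeanModel is a hypothetical stand-in that mimics the averaging behavior described for the default outcome model, and fit_models_per_leaf is likewise an illustrative helper rather than the library's code.

```python
import statistics
from collections import defaultdict


class MarginalMeanModel:
    """Hypothetical stand-in: stores the average outcome of each treatment group."""

    def fit(self, a, y):
        by_treatment = defaultdict(list)
        for treatment, outcome in zip(a, y):
            by_treatment[treatment].append(outcome)
        self.means_ = {t: statistics.mean(v) for t, v in by_treatment.items()}
        return self


def fit_models_per_leaf(leaf_of, a, y):
    """Return {leaf index: model fitted on the samples that fall in that leaf}."""
    grouped = defaultdict(lambda: ([], []))
    for leaf, treatment, outcome in zip(leaf_of, a, y):
        grouped[leaf][0].append(treatment)
        grouped[leaf][1].append(outcome)
    return {
        leaf: MarginalMeanModel().fit(leaf_a, leaf_y)
        for leaf, (leaf_a, leaf_y) in grouped.items()
    }


leaf_of = [1, 1, 1, 1, 2, 2, 2, 2]  # leaf index of each sample
a = [0, 0, 1, 1, 0, 1, 1, 0]        # treatment assignments
y = [1.0, 3.0, 4.0, 6.0, 2.0, 8.0, 10.0, 4.0]

models = fit_models_per_leaf(leaf_of, a, y)
```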

explain(X, a, split_condition=None)

Create a list of data frames summarizing the decision tree and the marginal effect. Each data frame represents a leaf in the tree, and the list represents the tree itself. Each data frame exhibits several summary statistics about the path from the root to the leaf, including the maximal ASMD at that level and the marginal outcome value.

Parameters:

X (pandas.DataFrame) – The feature data. Assumed to be of the same column structure as the training data.
a (pandas.Series) – The treatment assignment vector
split_condition (str) – The string representing the first condition, to which the split explanations are added. Defaults to ‘All’.

Returns:

The list representing the tree, holding a data-frame for every leaf.

Return type:

List[pandas.DataFrame]

set_fit_request(*, a='$UNCHANGED$', sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide on metadata routing for how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
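
The four request options listed above can be summarized with a toy resolver. This is a simplified model of the documented behavior only, not scikit-learn's routing code; resolve_fit_metadata and its arguments are hypothetical names.

```python
def resolve_fit_metadata(requests, provided):
    """Decide which user-provided metadata a meta-estimator forwards to fit.

    requests: {parameter name: True | False | None | alias string}
    provided: {name the user passed to the meta-estimator: value}
    """
    forwarded = {}
    for name, request in requests.items():
        if request is True:
            # Requested: forward it only if the user actually supplied it.
            if name in provided:
                forwarded[name] = provided[name]
        elif request is False:
            # Not requested: never forwarded.
            continue
        elif request is None:
            # No decision made: supplying the metadata is an error.
            if name in provided:
                raise ValueError(f"{name!r} was provided but never requested")
        elif isinstance(request, str):
            # Alias: the user passes the value under the alias name.
            if request in provided:
                forwarded[name] = provided[request]
    return forwarded


routed = resolve_fit_metadata(
    {"a": True, "sample_weight": "w"},
    {"a": [0, 1], "w": [0.5, 1.5]},
)
dropped = resolve_fit_metadata({"a": False}, {"a": [0, 1]})

try:
    resolve_fit_metadata({"a": None}, {"a": [0, 1]})
    raised = False
except ValueError:
    raised = True
```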

Added in scikit-learn version 1.3.

Parameters:

a (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the a parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object