causallib.contrib.bicause_tree.BICauseTree#
- class BICauseTree(outcome_model=None, individual=False, asmd_violation_threshold=0.1, min_leaf_size=0, min_treat_group_size=0, min_split_size=0, max_depth=10, stopping_criterion=<function default_stopping_criterion>, max_splitting_values=50, multiple_hypothesis_test_method='holm', multiple_hypothesis_test_alpha=0.1, positivity_filtering_method=<function prevalence_symmetric>, positivity_filtering_kwargs=None)[source]#
A causal effect estimator built on a tree that recursively stratifies the covariate space to balance the treated and untreated groups.
- Parameters:
outcome_model (Union[IndividualOutcomeEstimator, PopulationOutcomeEstimator]) – An outcome model for generating counterfactual predictions at each leaf node of the tree. Defaults to a simple MarginalOutcomeEstimator that takes the average outcome for each treatment group in each leaf. It may also be any arbitrary outcome model that further adjusts for the covariates (since the stratification might leave some residual bias).
individual (bool) – If True (and if outcome_model has estimate_individual_outcome), generates individual-level predictions for observations within each leaf. Otherwise, each observation takes the value of the average outcome in that leaf (using the estimate_population_outcome method).
asmd_violation_threshold (float) – The value of the Absolute Standardized Mean Difference below which a subgroup is considered balanced.
min_leaf_size (int) – The minimum number of samples required to split an internal node.
min_treat_group_size (int) – The minimum number of samples in all treatment groups required to split an internal node.
min_split_size (int) – The minimum number of samples required to split an internal node.
max_depth (int) – The maximum depth of the tree. Will be updated for each level of nodes as the tree grows.
stopping_criterion (callable) – A function that takes the node/subtree as well as the data (X, a) and returns True to stop splitting the tree and False to continue splitting.
max_splitting_values (int) – The maximal number of unique values to consider when splitting a single feature.
multiple_hypothesis_test_method – The method for correcting p-values in multiple hypotheses testing. Should be compatible with statsmodels’ multipletests.
multiple_hypothesis_test_alpha (float) – The alpha value for correcting p-values in multiple hypotheses testing. Should be compatible with statsmodels’ multipletests.
positivity_filtering_method (callable) – A function that takes the current node (or subtree) as well as the arbitrary kwargs from positivity_filtering_kwargs and returns a list of the leaves/nodes’ indices that do not violate positivity.
positivity_filtering_kwargs – Keyword arguments to call the positivity_filtering_method with.
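The asmd_violation_threshold parameter is easier to read with a concrete number in hand. The following is a minimal sketch of one common ASMD definition for a single covariate: the absolute difference in group means divided by the pooled standard deviation. The helper name `asmd` and the toy data are illustrative assumptions, and the library's internal computation may differ in detail.

```python
from math import sqrt
from statistics import mean, variance


def asmd(treated, untreated):
    """Absolute standardized mean difference of one covariate (illustrative)."""
    # Pooled standard deviation: root of the average of the two sample variances.
    pooled_sd = sqrt((variance(treated) + variance(untreated)) / 2)
    if pooled_sd == 0:
        # Degenerate case: no spread in either group.
        return 0.0 if mean(treated) == mean(untreated) else float("inf")
    return abs(mean(treated) - mean(untreated)) / pooled_sd


# Toy covariate values (hypothetical data) for each treatment group.
age_treated = [50, 52, 54, 56, 58]
age_untreated = [49, 51, 55, 57, 59]

threshold = 0.1  # same value as the default asmd_violation_threshold
value = asmd(age_treated, age_untreated)
is_balanced = value < threshold
```

Under this definition, a subgroup whose covariates all fall below the threshold would count as balanced, which is the condition the parameter description refers to.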
- fit(X, a, y, sample_weight=None)[source]#
Build the BICauseTree partition
- Parameters:
X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
sample_weight – IGNORED
- Returns:
(class) BICauseTree
- apply(X)[source]#
Get node assignments based on BICauseTree partition
- Parameters:
X (pandas.DataFrame) – The feature matrix of all samples
- Returns:
(pandas.Series) A vector of node indices indexed according to X
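To make node assignment concrete, here is a toy sketch that routes samples down a binary partition and returns one leaf index per sample. The nested-dict tree layout, the `assign_leaf` helper, and the feature names are all hypothetical; this is not causallib's internal representation.

```python
# A hypothetical two-level partition: split on "age", then on "bmi".
toy_tree = {
    "feature": "age", "threshold": 50,
    "left": {"leaf": 1},
    "right": {
        "feature": "bmi", "threshold": 30,
        "left": {"leaf": 2},
        "right": {"leaf": 3},
    },
}


def assign_leaf(node, sample):
    """Follow the splits until a leaf is reached and return its index."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


# Samples keyed by an index label, mimicking the index of X.
samples = {
    "s0": {"age": 42, "bmi": 27},
    "s1": {"age": 63, "bmi": 24},
    "s2": {"age": 71, "bmi": 35},
}
node_assignments = {idx: assign_leaf(toy_tree, row) for idx, row in samples.items()}
```

The resulting mapping keeps the labels of the input, in the same spirit as a vector of node indices indexed according to X.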
- estimate_population_outcome(X, a, y=None, discard_violating_samples=True, agg_func='mean')[source]#
Estimates outcomes at a population-level and assigns them to individuals in X
- Parameters:
X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
discard_violating_samples (boolean) – Whether to drop the NA in the individual outcomes
agg_func – Aggregation function to go from individual to population outcome estimates
- Returns:
A vector of potential outcomes indexed according to X
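The population-level estimate can be pictured as a two-step aggregation. The sketch below assumes the default marginal behaviour described for outcome_model (the average outcome per treatment group in each leaf) and a mean aggregation. The data and variable names are hypothetical, and the sketch does not call the library.

```python
from statistics import mean

# Hypothetical fitted partition: each sample's leaf, treatment, and outcome.
leaf = [1, 1, 1, 1, 2, 2, 2, 2]
a = [0, 0, 1, 1, 0, 0, 1, 1]
y = [1.0, 3.0, 4.0, 6.0, 10.0, 12.0, 11.0, 13.0]

# Average observed outcome per (leaf, treatment) cell.
cells = {}
for leaf_id, treatment, outcome in zip(leaf, a, y):
    cells.setdefault((leaf_id, treatment), []).append(outcome)
cell_means = {key: mean(values) for key, values in cells.items()}

# Every individual receives its leaf's average outcome under each treatment value.
individual = [
    {t: cell_means[(leaf_id, t)] for t in (0, 1)}
    for leaf_id in leaf
]

# Aggregate over individuals (mirroring agg_func="mean") for population outcomes.
population = {t: mean(row[t] for row in individual) for t in (0, 1)}
effect = population[1] - population[0]
```

In this toy data the leaves are equally sized, so the population outcome is the plain average of the two leaf-level means for each treatment value.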
- estimate_individual_outcome(X, a, y=None, same_dim_as_input=True)[source]#
Estimate individual-level counterfactual predictions.
If self.individual is True and self.outcome_model has estimate_individual_outcome(), each observation gets a unique counterfactual value. Otherwise, each individual gets the average prediction of its node, and this function reshapes those node-level predictions into one prediction per observation.
- Parameters:
X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
same_dim_as_input (boolean) – Whether to return nan values for positivity-violating observations or exclude them
- Returns:
A matrix of individual outcomes indexed according to X
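The effect of same_dim_as_input can be sketched with plain Python containers. The example assumes one leaf has been flagged as violating positivity. The `individual_outcomes` helper, the leaf labels, and the prediction values are hypothetical stand-ins rather than library objects.

```python
from math import isnan, nan

# Hypothetical per-sample predictions; leaf 3 is assumed to violate positivity.
index = ["s0", "s1", "s2", "s3"]
leaf = [1, 3, 2, 3]
violating_leaves = {3}
predictions = {
    "s0": (2.0, 5.0),
    "s1": (7.0, 7.5),
    "s2": (11.0, 12.0),
    "s3": (7.0, 7.5),
}


def individual_outcomes(same_dim_as_input=True):
    """Return {index: (outcome under a=0, outcome under a=1)} (illustrative)."""
    result = {}
    for idx, leaf_id in zip(index, leaf):
        if leaf_id not in violating_leaves:
            result[idx] = predictions[idx]
        elif same_dim_as_input:
            # Keep the row so the output aligns with X, but mark it as missing.
            result[idx] = (nan, nan)
        # Otherwise the violating sample is left out of the result.
    return result


full = individual_outcomes(same_dim_as_input=True)
trimmed = individual_outcomes(same_dim_as_input=False)
n_missing = sum(isnan(values[0]) for values in full.values())
```

With the flag on, the output has one row per input sample and missing values mark the violating ones; with it off, only the non-violating samples remain.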
- fit_outcome_models(X, a, y)[source]#
Fits causal models to the nodes of a fitted (already grown) tree.
- Parameters:
X (pandas.DataFrame) – The feature matrix of all samples
a (pandas.Series) – The treatment assignments
y (pandas.Series) – The outcome values
- Returns:
dict[int, Union[IndividualOutcomeEstimator, PopulationOutcomeEstimator]]
- explain(X, a, split_condition=None)[source]#
Create a list of data frames summarizing the decision tree and the marginal effect.
Each data-frame represents a leaf in the tree, and the list represents the tree itself. Each data frame exhibits several summary statistics about the path from the root to the leaf, including the maximal asmd at that level and the marginal outcome value.
- Parameters:
X (pandas.DataFrame) – The feature data. Assumed to have the same column structure as the training data.
a (pandas.Series) – The treatment assignment vector
split_condition (str) – The string representing the first condition, to which the split explanations are added. Defaults to ‘All’.
- Returns:
The list representing the tree, holding a data-frame for every leaf.
- Return type:
List[pandas.DataFrame]
- set_fit_request(*, a='$UNCHANGED$', sample_weight='$UNCHANGED$')#
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Returns:
self – The updated object.