causallib.simulation.CausalSimulator3 module

(C) Copyright 2019 IBM Corp.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Created on Jun 21, 2017

class causallib.simulation.CausalSimulator3.CausalSimulator3(topology, var_types, prob_categories, link_types, snr, treatment_importances, treatment_methods='gaussian', outcome_types='categorical', effect_sizes=None, survival_distribution='expon', survival_baseline=1, params=None)

Bases: object

Constructor

Parameters
  • topology (np.ndarray) – A boolean adjacency matrix for variables (including covariates, treatment and outcome variables of the model). Every row is a binary vector for a variable, where v[i, j] = 1 iff j is a parent of i

  • var_types (Sequence[str]) – A vector with one entry per variable, labeling each variable as one of “covariate”, “hidden”, “outcome”, “treatment”, or “censor”. Notes: if var_types is a pd.Series, the variable names are taken from var_types.index; otherwise (no keyed vector) the variable names are simply range(num-of-variables).

  • prob_categories (Sequence[float|None]) – A vector with one entry per variable. If prob_categories[i] is None, then variable i is considered continuous. Otherwise, prob_categories[i] should be a list (or any iterable) whose length specifies the number of categories variable i has and whose entries are the multinomial probabilities of those categories (i.e. the list is non-negative and sums to 1).

  • link_types (str|Sequence[str]) – Either a single string or a sequence of strings specifying the relation between the covariate parents and the covariate itself.

  • snr (float|Sequence[float]) – Signal to noise ratio (use 1.0 to eliminate noise in the system). May be a vector the size of the number of variables, to state a different snr value for each variable.

  • treatment_importances (float|Sequence[float]) – The effect of treatment on the outcome. A float between 0 and 1.0 stating how much weight the treatment variable has vs. the other parents of an outcome variable. To support multi-treatment, pass a list the size of the number of treatment variables (as stated in var_types); treatment variables are matched to importances by the order of the treatment variables and the order of the list. If all treatment variables have the same importance, pass the float value.

  • treatment_methods (str|Sequence[str]) – The method for creating treatment assignment and propensities; can be one of {“random”, “gaussian”, “logistic”}. To support multi-treatment, pass a list the size of the number of treatment variables; treatment variables are matched to creation methods by the order of the treatment variables and the order of the list. If all treatment variables have the same type, pass the str value.

  • outcome_types (str|Sequence[str]) – The type of each outcome variable, either ‘survival’ or ‘binary’ (note that the default in the signature is ‘categorical’). To support multi-outcome, pass a list the size of the number of outcome variables (as stated in var_types); outcome variables are matched to types by the order of the outcome variables and the order of the list. If all outcome variables have the same type, pass the str value.

  • effect_sizes (float|Sequence[float|None]|None) – The desired mean effect size between two counterfactuals. If None, the mean effect size is not adjusted and is left as generated. If float, the mean effect size is adjusted to be approximately the given number (considering the noise). To support multi-outcome, pass a list the size of the number of outcome variables (as stated in var_types); outcome variables are matched to effect sizes by the order of the outcome variables and the order of the list.

  • survival_distribution (Sequence[str] or str) – The distribution family from which to generate the values of outcome variables whose corresponding outcome_types entry is “survival”. The default is the exponential distribution (‘expon’). The same survival distribution is also used for the corresponding censoring variable (if present). To support multi-outcome, pass a list the size of the number of outcome variables of type “survival” (as stated in outcome_types); survival outcome variables are matched to distributions by the order of the outcome variables and the order of the list. If all outcome variables have the same survival distribution, pass the str value. Ignored if no outcome variable is of type survival.

  • survival_baseline (Sequence[float] or float) – The survival baseline from the CoxPH model that serves as the basis for the parameters of the corresponding survival_distribution. The same survival baseline is also used for the corresponding censoring variable (if present). The default is 1 (the baseline value has no multiplicative effect). To support multi-outcome, pass a list the size of the number of outcome variables of type “survival” (as stated in outcome_types); survival outcome variables are matched to baselines by the order of the outcome variables and the order of the list. If all outcome variables have the same survival baseline, pass the float value. Ignored if no outcome variable is of type survival.

  • params (dict | None) – Various parameters related to the generation process (e.g. the slope for sigmoid-based functions). Given in the form {var_name: {param_name: param_value, …}, …}
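
The conventions above can be illustrated with a small, library-free sketch. It lays out the arguments for a three-variable graph (one covariate, one treatment, one outcome) as plain Python lists and checks them against the documented rules. The helper names parents_of and is_valid_prob_vector are hypothetical and not part of causallib; in real use the topology would be an np.ndarray, as documented.

```python
import math

# Variable order: 0 = covariate, 1 = treatment, 2 = outcome.
var_types = ["covariate", "treatment", "outcome"]

# topology[i][j] == 1 iff variable j is a parent of variable i.
topology = [
    [0, 0, 0],  # the covariate has no parents
    [1, 0, 0],  # the treatment depends on the covariate
    [1, 1, 0],  # the outcome depends on the covariate and the treatment
]

# None marks a continuous variable; otherwise multinomial probabilities.
prob_categories = [None, [0.5, 0.5], [0.7, 0.3]]


def parents_of(topology, i):
    """Return the indices j for which topology[i][j] is set (hypothetical helper)."""
    return [j for j, flag in enumerate(topology[i]) if flag]


def is_valid_prob_vector(probs):
    """Check that probs is non-negative and sums to 1 (hypothetical helper)."""
    return all(p >= 0 for p in probs) and math.isclose(sum(probs), 1.0)


# One entry per variable in every per-variable argument.
assert len(var_types) == len(topology) == len(prob_categories)
assert parents_of(topology, 2) == [0, 1]
assert all(p is None or is_valid_prob_vector(p) for p in prob_categories)
```

Reading a row of the matrix gives the parents of that variable, which is the direction stated for v[i, j] above.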

G_LINKING_METHODS = {'affine': <function CausalSimulator3.<lambda>>, 'exp': <function CausalSimulator3.<lambda>>, 'linear': <function CausalSimulator3.<lambda>>, 'log': <function CausalSimulator3.<lambda>>, 'poly': <function CausalSimulator3.<lambda>>}
O_LINKING_METHODS = {'marginal_structural_model': <function CausalSimulator3.<lambda>>, None: <function CausalSimulator3.<lambda>>}
TREATMENT_METHODS = {'gaussian': <function CausalSimulator3.<lambda>>, 'logistic': <function CausalSimulator3.<lambda>>, 'odds_ratio': <function CausalSimulator3.<lambda>>, 'quantile_gauss_fit': <function CausalSimulator3.<lambda>>, 'random': <function CausalSimulator3.<lambda>>}
format_for_training(X, propensities, cf, headers_chars=None, exclude_hidden_vars=True)

Prepares the data for output by merging it into two DataFrames: an observed one and one gathering the counterfactuals.

Parameters
  • X (pd.DataFrame) – Containing the data (covariates), treatment and outcomes

  • propensities (pd.DataFrame) – Containing the propensity values for the treatments

  • cf (pd.DataFrame) – Containing the counterfactuals results for all possible treatments.

  • headers_chars (dict) – Optional. Containing the column header prefix for different types of variables. Examples: {“covariate”: “x”, “treatment”: “t”, “outcome”: “y”}

  • exclude_hidden_vars – Whether to exclude hidden variables from the resulting dataset.

Returns

2-element tuple containing:

  • df_X (pd.DataFrame): The observed dataset (if hidden variables are excluded).

  • df_cf (pd.DataFrame): Containing the two counterfactuals, treatments and propensities.

Return type

(pd.DataFrame, pd.DataFrame)

generate_censor_col(X_parents, link_type, snr, prob_category, outcome_type, treatment_importance=None, survival_distribution=None, survival_baseline=None, var_name=None)

Generates a single censor variable column.

Parameters
  • X_parents (pd.DataFrame) – Sub-dataset containing only the relevant columns (features which are topological parents to the current covariate being created)

  • link_type (str) – How the parent variables (parent covariate columns) influence the column being generated, i.e. the relation between them.

  • snr (float) – Signal to noise ratio that controls the amount of noise to add (value of 1.0 will not generate noise)

  • prob_category (Sequence | None) –

    A k-length distribution vector over k-1 treatments, with the probability of being untreated in prob_category[0] (prob_category.iloc[0]) and the other k-1 probabilities corresponding to the k-1 treatments.

    Notes: vector must sum to 1. If None - the covariate column is left untouched (i.e. continuous)

  • outcome_type (str) – The type of the outcome variable that is dependent on the current censor variable. The censoring mechanism varies given different types of outcome variables.

  • treatment_importance (float) – The strength of the treatment's effect on the outcome variable being generated, as opposed to other variables that may influence it.

  • survival_distribution (str) – The type of distribution from which to sample the survival time. Relevant only if outcome_type is “survival”

  • survival_baseline – The baseline value of the Cox PH model. Relevant only if outcome_type is “survival”

  • var_name (int|str) – The name of the variable currently being generated. Optional.

Returns

2-element tuple containing:

  • x_censor (pd.Series): a column describing the censor variable

  • beta (pd.Series): The coefficients used to generate the current variable from its predecessors.

Return type

(pd.Series, pd.Series)

generate_covariate_col(X_parents, link_type, snr, prob_category, num_samples, var_name=None)

Generates a single signal (covariate) column

Parameters
  • X_parents (pd.DataFrame) – Sub-dataset containing only the relevant columns (features which are topological parents to the current covariate being created)

  • link_type (str) – How the parent variables (parent covariate columns) influence the column being generated, i.e. the relation between them.

  • snr (float) – Signal to noise ratio that controls the amount of noise to add (value of 1.0 will not generate noise)

  • prob_category (pd.Series|None) –

    A vector whose length states the number of classes (number of discrete values) and whose every value is fractional: the probability of the corresponding class.

    Notes: vector must sum to 1. If None - the covariate column is left untouched (i.e. continuous)

  • num_samples (int) – number of samples to generate

  • var_name (int|str) – The name of the variable currently being generated. Optional.

Returns

2-element tuple containing:

  • X_final (pd.Series): The final (i.e. noised and, if needed, discretized) covariate column.

  • beta (pd.Series): The coefficients used to generate the current variable from its predecessors.

Return type

(pd.Series, pd.Series)

Raises

ValueError – if the given link_type is not a valid link_type. (Supported link types are placed in self.G_LINKING_METHODS)
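
A minimal sketch of how a link type could be resolved by name and how a signal-to-noise ratio could control the added noise. The LINKS table, its two link functions, the restriction of snr to [0, 1], and the square-root mixing rule are all assumptions made for illustration; they are not causallib's actual G_LINKING_METHODS entries or noise model. The mixing rule is chosen only because it reproduces the documented behavior that an snr of 1.0 generates no noise.

```python
import math
import random

# Hypothetical link functions: combine parent values with coefficients.
LINKS = {
    "linear": lambda parents, beta: sum(b * x for b, x in zip(beta, parents)),
    "exp": lambda parents, beta: math.exp(sum(b * x for b, x in zip(beta, parents))),
}


def link_signal(parents, beta, link_type):
    """Apply the named link; raise ValueError for an unknown link type."""
    try:
        link = LINKS[link_type]
    except KeyError:
        raise ValueError(f"unsupported link type: {link_type!r}") from None
    return link(parents, beta)


def add_noise(signal, snr, rng):
    """Mix the signal with standard Gaussian noise (assumed rule).

    With snr == 1.0 the noise term is multiplied by zero, so the signal
    is returned unchanged.
    """
    if not 0.0 <= snr <= 1.0:
        raise ValueError("snr must lie in [0, 1] in this sketch")
    return math.sqrt(snr) * signal + math.sqrt(1.0 - snr) * rng.gauss(0.0, 1.0)


rng = random.Random(0)
clean = link_signal([1.0, 2.0], [0.5, -0.25], "linear")  # 0.5*1.0 - 0.25*2.0
noisy = add_noise(clean, snr=0.8, rng=rng)
```

Looking the link up in a dictionary keeps the set of supported names in one place, so an unsupported name fails early with a ValueError rather than partway through generation.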

generate_data(X_given=None, num_samples=None, random_seed=None)

Generates tables of dataset given the object’s initial parameters.

Parameters
  • num_samples (int) – Number of samples that will be in the dataset.

  • X_given (pd.DataFrame) –

    A baseline dataset to generate from. This dataset may contain only some of the variables stated in the initialized topology. The rest of the dataset (variables which are stated in the topology and not in this dataset) will be generated. Notes: The given data will not be overwritten and will be taken as is. It is the user's responsibility to ensure that the given table has no dependent variables, since they will not be re-generated according to the graph.

  • random_seed (int) – A seed for the pseudo-random-number-generator in order to reproduce results.

Returns

3-element tuple containing:

  • X (pd.DataFrame): A (num_samples x num_covariates) matrix of all covariates (including treatments and outcomes) over samples.

  • propensities (pd.DataFrame): A (num_samples x num_treatments) matrix (or vector) of propensity values of every treatment.

  • counterfactuals (pd.DataFrame): A (num_samples x num_outcomes) matrix -

Return type

(pd.DataFrame, pd.DataFrame, pd.DataFrame)
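
Since each variable depends on its parents in the topology, a generation pass presumably has to produce parents before their children, while variables supplied through X_given are taken as is. The sketch below derives such an order from an adjacency matrix using the standard-library graphlib module (Python 3.9+). It illustrates the idea under the documented v[i, j] convention and is not causallib's implementation; generation_order and the variable names are hypothetical.

```python
from graphlib import TopologicalSorter


def generation_order(topology, names, given=()):
    """List the variables to generate, parents first, skipping those in `given`.

    topology[i][j] is truthy iff variable j is a parent of variable i.
    """
    # Map every variable to the set of its parents (its predecessors).
    graph = {
        names[i]: {names[j] for j, flag in enumerate(row) if flag}
        for i, row in enumerate(topology)
    }
    # static_order() yields every node after all of its predecessors.
    return [v for v in TopologicalSorter(graph).static_order() if v not in given]


names = ["age", "treatment", "outcome"]
topology = [
    [0, 0, 0],  # age has no parents
    [1, 0, 0],  # treatment <- age
    [1, 1, 0],  # outcome <- age, treatment
]

print(generation_order(topology, names))                 # all three variables
print(generation_order(topology, names, given={"age"}))  # "age" is supplied, not generated
```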

generate_outcome_col(X_parents, link_type, snr, prob_category, outcome_type, treatment_importance=None, effect_size=None, survival_distribution=None, survival_baseline=None, var_name=None)

Generates a single outcome variable column.

Parameters
  • X_parents (pd.DataFrame) – Sub-dataset containing only the relevant columns (features which are topological parents to the current covariate being created)

  • link_type (str) – How the parent variables (parent covariate columns) influence the column being generated, i.e. the relation between them.

  • treatment_importance (float) – The strength of the treatment's effect on the outcome variable being generated, as opposed to other variables that may influence it.

  • snr (float) – Signal to noise ratio that controls the amount of noise to add (value of 1.0 will not generate noise)

  • prob_category (pd.Series|None) –

    A k-length distribution vector over k-1 treatments, with the probability of being untreated in prob_category[0] (prob_category.iloc[0]) and the other k-1 probabilities corresponding to the k-1 treatments.

    Notes: vector must sum to 1. If None - the covariate column is left untouched (i.e. continuous)

  • effect_size (float) – The desired mean effect size.

  • outcome_type (str) – Type of outcome variable. Either categorical (and continuous) or survival

  • survival_distribution (str) – The type of distribution from which to sample the survival time. Relevant only if outcome_type is “survival”

  • survival_baseline – The baseline value of the Cox PH model. Relevant only if outcome_type is “survival”

  • var_name (int|str) – The name of the variable currently being generated. Optional.

Returns

3-element tuple containing:

  • x_outcome (pd.Series): Outcome assignment for each sample.

  • cf (pd.DataFrame): Holding the counterfactuals for every possible treatment category of the outcome’s treatment predecessor variable.

  • beta (pd.DataFrame): The coefficients used to generate the current variable from its predecessors.

Return type

(pd.Series, pd.DataFrame, pd.DataFrame)

Raises
  • ValueError – if the given link_type is not a valid link_type. (Supported link types are placed in self.G_LINKING_METHODS)

  • ValueError – if prob_category is neither None nor a legitimate distribution vector.

generate_treatment_col(X_parents, link_type, snr, prob_category, method='logistic', var_name=None)

Generates a single treatment variable column.

Parameters
  • X_parents (pd.DataFrame) – Sub-dataset containing only the relevant columns (features which are topological parents to the current covariate being created)

  • link_type (str) – How the parent variables (parent covariate columns) influence the column being generated, i.e. the relation between them.

  • snr (float) – Signal to noise ratio that controls the amount of noise to add (value of 1.0 will not generate noise)

  • prob_category (pd.Series|None) –

    A k-length distribution vector over k-1 treatments, with the probability of being untreated in prob_category[0] (prob_category.iloc[0]) and the other k-1 probabilities corresponding to the k-1 treatments.

    Notes: vector must sum to 1. If None - the covariate column is left untouched (i.e. continuous)

  • method (str) – A type of method to generate the treatment signal and the corresponding propensities.

  • var_name (int|str) – The name of the variable currently being generated. Optional.

Returns

3-element tuple containing:

  • treatment (pd.Series): Treatment assignment to each sample.

  • propensity (pd.DataFrame): The marginal conditional probability of treatment given covariates. A DataFrame shaped (num_samples x num_of_possible_treatment_categories).

  • beta (pd.Series): The coefficients used to generate the current variable from its predecessors.

Return type

(pd.Series, pd.DataFrame, pd.Series)

Raises
  • ValueError – if prob_category is None (treatment must be categorical)

  • ValueError – If prob_category is not a legitimate probability vector (non negative, sums to 1)
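
The two error conditions above can be sketched as a small validator. This is an illustration of the documented checks, not the library's code: check_treatment_probs is a hypothetical helper, and the numerical tolerance used for the sum is an assumption.

```python
import math


def check_treatment_probs(prob_category, tol=1e-9):
    """Validate a treatment probability vector, mirroring the documented errors."""
    if prob_category is None:
        # Treatment must be categorical, so a continuous marker is rejected.
        raise ValueError("treatment must be categorical: prob_category is None")
    probs = list(prob_category)
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=tol):
        raise ValueError("probabilities must sum to 1")
    return probs


# prob_category[0] is the probability of being untreated;
# the remaining k-1 entries belong to the k-1 treatments.
probs = check_treatment_probs([0.6, 0.3, 0.1])
n_treatments = len(probs) - 1
```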

reset_coefficients(variables=None)

Delete the linking coefficients that accumulated in the generating model so far.

Parameters

variables (list|None) – A list of variables whose incoming linking coefficients (those linking into them, not from them) should be reset. If None, all the available coefficients will be deleted.

static to_csv(data, out_file=None)
causallib.simulation.CausalSimulator3.generate_random_topology(n_covariates, p, n_treatments=1, n_outcomes=1, n_censoring=0, given_vars=(), p_hidden=0.0)

Creates a random graph topology, suitable for describing a causal graph model. Generation is based on a G(n,p) random graph model (each edge independently generated or not by a coin toss).

Parameters
  • n_covariates (int) – Number of simple covariates to generate

  • p (float) – Probability to generate an edge.

  • n_treatments (int) – Number of treatment variables.

  • n_outcomes (int) – Number of outcome variables.

  • n_censoring (int) – Number of censoring variables.

  • given_vars (Sequence[Any]) – Vector of names of given variables. These variables are considered independent. They are meant to mimic a situation where a partial dataset is supplied to the generation process; the names correspond to the variable names in that existing baseline dataset.

  • p_hidden (float) – The probability to convert a simple covariate variable into a latent (i.e. hidden) variable.

Returns

2-element tuple containing:

  • topology (pd.DataFrame): A boolean matrix describing graph dependencies, where T[i,j] = True iff j is a predecessor of i.

  • var_types (pd.Series): A Series whose index holds variable names and whose values are variable types (e.g. “treatment”, “covariate”, “hidden”, “outcome”…). The given_vars come first, followed by the generated variables (covariates, then treatments, then outcomes, then censors).

Return type

(pd.DataFrame, pd.Series)
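
A minimal sketch of the G(n, p) idea using only the standard library. To keep the graph acyclic, this sketch only allows an edge from an earlier variable to a later one; that ordering trick, the helper name random_dag, and the use of nested lists instead of a pd.DataFrame are assumptions for illustration and are not taken from generate_random_topology.

```python
import random


def random_dag(n_vars, p, seed=None):
    """Return an n_vars x n_vars boolean matrix T.

    T[i][j] is True iff j is a predecessor of i. Each candidate edge
    j -> i (with j < i) is kept independently with probability p.
    """
    rng = random.Random(seed)
    return [
        [j < i and rng.random() < p for j in range(n_vars)]
        for i in range(n_vars)
    ]


topology = random_dag(n_vars=5, p=0.4, seed=42)
for row in topology:
    # Print a compact view: "1" for an edge, "." for no edge.
    print("".join("1" if flag else "." for flag in row))
```

Restricting edges to j < i makes the matrix strictly lower triangular, which guarantees there are no cycles regardless of the coin tosses.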

causallib.simulation.CausalSimulator3.idx2var_vector(num_vars, args)