causallib.datasets.data_loader module

causallib.datasets.data_loader.load_acic16(instance=1, raw=False)[source]

Loads a single dataset from the 2016 Atlantic Causal Inference Conference data challenge.

The dataset is based on real covariates but synthetically simulates the treatment assignment and potential outcomes. It therefore also contains sufficient ground truth to evaluate the effect estimation of causal models. The competition introduced 7700 simulated files (100 instances for each of the 77 data-generating-processes). We provide a smaller sample of one instance from 10 DGPs. For the full dataset, see the link below to the competition site.

If used for academic purposes, please consider citing the competition organizers:

Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. “Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition.” Statistical Science 34, no. 1 (2019): 43-68.

Parameters
  • instance (int) – number between 1 and 10 (inclusive) selecting the dataset to load.

  • raw (bool) – Whether to apply contrast (“dummify”) on non-numeric columns. If True, returns a (pd.DataFrame, pd.DataFrame) tuple (the first with covariates and the second with treatment assignment, noisy potential outcomes and true potential outcomes).

Returns

dictionary-like object
attributes are: X (covariates), a (treatment assignment), y (outcome), po (ground truth potential outcomes: po[0] potential outcome for controls and po[1] potential outcome for treated), descriptors (feature description).

Return type

Bunch
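Because po holds both potential outcomes for every unit, the true effect can be computed directly and compared with what a model estimates from X, a and y alone. The following is a minimal sketch of that idea using pandas only. The toy values are hypothetical stand-ins (not ACIC data), po is assumed to behave like a DataFrame with columns 0 and 1, and the toy outcome is noise-free.

```python
import pandas as pd

# Toy stand-ins for the loaded attributes (hypothetical values, not ACIC data).
po = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0], 1: [2.0, 4.0, 3.5, 6.0]})
a = pd.Series([0, 1, 0, 1])

# Observed outcome: po[1] for treated units, po[0] for controls.
y = po[1].where(a == 1, po[0])

# Ground-truth effects, available only because both potential outcomes are known.
individual_effect = po[1] - po[0]
true_ate = individual_effect.mean()          # average effect over everyone
true_att = individual_effect[a == 1].mean()  # average effect among the treated

# Naive contrast of observed group means, which a causal model should improve on.
naive_difference = y[a == 1].mean() - y[a == 0].mean()

print(true_ate, true_att, naive_difference)
```

In this toy case the naive difference (3.0) overstates the true average effect (1.375), which is the kind of gap the ground truth lets you measure.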

causallib.datasets.data_loader.load_data_file(file_name, data_dir_name, sep=',')[source]
causallib.datasets.data_loader.load_nhefs(raw=False, restrict=True, augment=True, onehot=True)[source]

Loads the NHEFS smoking-cessation and weight-loss dataset.

Data was gathered during an observational study conducted by the NHANS during the 1970s and 1980s. It follows a cohort of people, some of whom decided to quit smoking and some of whom decided to persist, and records the gain in weight for each individual in order to estimate the causal contribution of smoking cessation to weight gain.

This dataset is used throughout Hernán and Robins’ Causal Inference Book.

https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

If used for academic purposes, please consider citing the book:

Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Parameters
  • raw (bool) – Whether to return the entire DataFrame and descriptors or not. If False, only confounders are used for the data. If True, returns a (pd.DataFrame, pd.Series) tuple (data and description).

  • restrict (bool) – Whether to apply exclusion criteria on missing data or not. Note: if False, the data will have censored (NaN) outcomes.

  • augment (bool) – Whether to add augmented (squared) features. If False, only the original data is returned. If True, squares the continuous-valued columns [‘age’, ‘wt71’, ‘smokeintensity’, ‘smokeyrs’] and joins them to the data frame with the suffix ‘^2’.

  • onehot (bool) – Whether to one-hot encode categorical data. If False, the categorical columns [“active”, “education”, “exercise”] are returned as individual columns with categorical values. If True, they are returned as extra columns with the categorical values one-hot encoded.

Returns

dictionary-like object
attributes are: X (covariates), a (treatment assignment), y (outcome), descriptors (feature description).

Return type

Bunch
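The augment and onehot options described above amount to two common pandas transformations. The sketch below illustrates them on a toy frame. It is not the library's implementation: the values are hypothetical, and details such as the exact names of the one-hot columns or whether a reference level is dropped are assumptions here.

```python
import pandas as pd

# Toy frame with the column names listed in the parameter descriptions
# (values are hypothetical, not NHEFS records).
df = pd.DataFrame({
    "age": [42, 36, 56],
    "wt71": [79.0, 58.6, 56.8],
    "smokeintensity": [30, 20, 20],
    "smokeyrs": [29, 24, 26],
    "active": [0, 0, 1],
    "education": [1, 2, 2],
    "exercise": [2, 0, 2],
})

continuous = ["age", "wt71", "smokeintensity", "smokeyrs"]
categorical = ["active", "education", "exercise"]

# augment: square the continuous columns and join them with a '^2' suffix.
squared = (df[continuous] ** 2).add_suffix("^2")
augmented = df.join(squared)

# onehot: replace each categorical column with one indicator column per level.
# Assumption: no level is dropped; the library may behave differently.
encoded = pd.get_dummies(augmented, columns=categorical)

print(encoded.columns.tolist())
```

With onehot=False the equivalent of augmented would be returned, keeping "active", "education" and "exercise" as single categorical columns.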

causallib.datasets.data_loader.load_nhefs_survival(augment=True, onehot=True)[source]

Loads and pre-processes the NHEFS smoking-cessation dataset.

Data was gathered in an observational study conducted by the NHANS during the 1970s and 1980s. It follows a cohort of people, some of whom decided to quit smoking and some of whom decided to persist, and records the death events within 10 years of follow-up.

This dataset is used throughout Hernán and Robins’ Causal Inference Book.

https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

If used for academic purposes, please consider citing the book:

Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Parameters
  • augment (bool) – Whether to add augmented (squared) features. If False, only the original data is returned. If True, squares the continuous-valued columns [‘age’, ‘wt71’, ‘smokeintensity’, ‘smokeyrs’] and joins them to the data frame with the suffix ‘^2’.

  • onehot (bool) – Whether to one-hot encode categorical data. If False, the categorical columns [“active”, “education”, “exercise”] are returned as individual columns with categorical values. If True, they are returned as extra columns with the categorical values one-hot encoded.

Returns

  • X (pd.DataFrame): Baseline covariate matrix of size (num_subjects, num_features).

  • a (pd.Series): Treatment assignment of size (num_subjects,). Quit smoking vs. non-quit.

  • t (pd.Series): Follow-up duration, size (num_subjects,).

  • y (pd.Series): Observed outcome (1) or right-censoring event (0), size (num_subjects,).

Return type

X (pd.DataFrame), a (pd.Series), t (pd.Series), y (pd.Series)
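To make the encoding of t and y concrete, the sketch below summarises toy survival data by treatment arm: the number of observed events (y = 1), the number of right-censored subjects (y = 0), and a crude event rate per unit of follow-up time. The values are hypothetical, and the time unit of t is not stated above, so no unit is assumed.

```python
import pandas as pd

# Toy stand-ins for the returned series (hypothetical values, not NHEFS records).
a = pd.Series([1, 1, 0, 0, 0], name="a")        # 1 = quit smoking, 0 = did not quit
t = pd.Series([120, 60, 120, 30, 90], name="t")  # follow-up duration
y = pd.Series([0, 1, 0, 1, 1], name="y")        # 1 = event observed, 0 = right-censored

df = pd.concat([a, t, y], axis=1)

# Per-arm counts and total follow-up time.
summary = df.groupby("a").agg(
    n=("y", "size"),
    events=("y", "sum"),
    person_time=("t", "sum"),
)
summary["censored"] = summary["n"] - summary["events"]

# Crude (unadjusted) event rate per unit of follow-up time.
summary["event_rate"] = summary["events"] / summary["person_time"]

print(summary)
```

This summary is descriptive only; it does not adjust for the covariates in X, so the difference between arms should not be read as a causal effect.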