causallib.datasets.load_nhefs

load_nhefs(raw=False, restrict=True, augment=True, onehot=True)

Loads the NHEFS smoking-cessation and weight-loss dataset.

Data was gathered during an observational study conducted by NHANES during the 1970s and 1980s. It follows a cohort of people, some of whom decided to quit smoking and some of whom decided to continue, and records the weight gain of each individual in order to estimate the causal contribution of smoking cessation to weight gain.

This dataset is used throughout Hernán and Robins’ Causal Inference Book.

https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

If used for academic purposes, please consider citing the book:

Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Parameters:
  • raw (bool) – Whether to return the entire DataFrame and its descriptors. If False, only the confounders are used as the data. If True, returns a (pandas.DataFrame, pandas.Series) tuple holding the data and the description.

  • restrict (bool) – Whether to apply exclusion criteria on missing data. Note: if False, the data will contain censored (NaN) outcomes.

  • augment (bool) – Whether to add augmented (squared) features. If False, only the original data is returned. If True, the continuous-valued columns [‘age’, ‘wt71’, ‘smokeintensity’, ‘smokeyrs’] are squared and joined to the data frame with the suffix ‘^2’.

  • onehot (bool) – Whether to one-hot encode categorical data. If False, the categorical columns [“active”, “education”, “exercise”] are returned as individual columns holding categorical values. If True, they are returned as extra columns with the categorical values one-hot encoded.
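
The augment and onehot options describe two common feature transformations. The following is a minimal sketch of what they amount to, written with plain pandas on a tiny made-up table; it is not the library's implementation. The helper names `augment_squares` and `onehot_encode` and the toy values are hypothetical, and the one-hot column names come from the defaults of `pandas.get_dummies`, which may differ from what load_nhefs produces.

```python
import pandas as pd

# Toy rows invented for illustration; these are NOT real NHEFS records.
toy = pd.DataFrame({
    "age": [42, 36, 56],
    "wt71": [79.0, 58.6, 56.8],
    "exercise": [2, 0, 2],
})


def augment_squares(df, columns):
    """Return a copy of df with a squared '<col>^2' column added per column."""
    out = df.copy()
    for col in columns:
        out[col + "^2"] = out[col] ** 2
    return out


def onehot_encode(df, columns):
    """Replace each categorical column with one indicator column per level."""
    return pd.get_dummies(df, columns=columns)


augmented = augment_squares(toy, ["age", "wt71"])
encoded = onehot_encode(augmented, ["exercise"])
print(list(encoded.columns))
```

In this sketch the squared columns sit alongside the originals, while the categorical column is replaced by its indicator columns.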

Returns:

A dictionary-like object whose attributes are X (covariates), a (treatment assignment), y (outcome), and descriptors (feature descriptions).

Return type:

Bunch
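
As a usage illustration, the sketch below loads the dataset with the default (non-raw) settings and summarizes the returned Bunch. The helper `describe_nhefs` is hypothetical. The sketch assumes that X is a pandas DataFrame, that a is a binary 0/1 Series, and that y holds no missing values (which the restrict=True description suggests); none of these assumptions is confirmed on this page.

```python
def describe_nhefs(**load_kwargs):
    """Load NHEFS through causallib and return a few summary figures."""
    # Imported inside the function so that defining this helper
    # does not itself require causallib to be installed.
    from causallib.datasets import load_nhefs

    data = load_nhefs(**load_kwargs)
    return {
        # Assumes X exposes a pandas-style .shape of (rows, columns).
        "X_shape": data.X.shape,
        # Assumes a is coded 0/1, so the sum counts treated individuals.
        "n_treated": int(data.a.sum()),
        # Assumes y has no NaN values, as expected with restrict=True.
        "mean_outcome": float(data.y.mean()),
    }
```

Calling `describe_nhefs(restrict=True, augment=True, onehot=True)` returns a small dictionary that can be printed or logged. Passing raw=True would not work with this helper, since that mode returns a tuple rather than a Bunch.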