causallib.datasets.load_nhefs_survival
- load_nhefs_survival(augment=True, onehot=True)
Loads and pre-processes the NHEFS smoking-cessation dataset.
Data was gathered in an observational study conducted by the NHANES during the 1970s and 1980s. It follows a cohort of people, some of whom decided to quit smoking and some of whom decided to persist, and records the death events within 10 years of follow-up.
This dataset is used throughout Hernán and Robins’ Causal Inference Book. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ If used for academic purposes, please consider citing the book: Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
- Parameters:
augment (bool) – Whether to add augmented (squared) features. If False, only the original data is returned. If True, squares the continuous-valued columns [‘age’, ‘wt71’, ‘smokeintensity’, ‘smokeyrs’] and joins them to the data frame with the suffix ‘^2’.
onehot (bool) – Whether to one-hot encode categorical data. If False, the categorical data [“active”, “education”, “exercise”] are returned in individual columns with categorical values. If True, extra columns are added with the categorical values one-hot encoded.
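The two options described above can be illustrated on a toy data frame. This is a minimal sketch using plain pandas, not causallib's actual implementation: the helper names `add_squared_features` and `one_hot_encode` are hypothetical, the toy values are made up, and the use of `pd.get_dummies` (including the column names it produces and the fact that it drops the original categorical columns) is an assumption.

```python
import pandas as pd

# Column groups as listed in the parameter descriptions above.
CONTINUOUS = ["age", "wt71", "smokeintensity", "smokeyrs"]
CATEGORICAL = ["active", "education", "exercise"]


def add_squared_features(X, columns=CONTINUOUS):
    """Hypothetical helper: join squared copies of `columns`, suffixed '^2'."""
    squared = (X[columns] ** 2).add_suffix("^2")
    return X.join(squared)


def one_hot_encode(X, columns=CATEGORICAL):
    """Hypothetical helper: one-hot encode `columns` with pandas.

    Assumption: pd.get_dummies replaces each original column with
    indicator columns named '<column>_<value>'.
    """
    return pd.get_dummies(X, columns=columns)


# Made-up rows, for illustration only (not taken from NHEFS).
toy = pd.DataFrame({
    "age": [42, 55],
    "wt71": [79.0, 58.6],
    "smokeintensity": [30, 20],
    "smokeyrs": [29, 24],
    "active": [0, 1],
    "education": [1, 2],
    "exercise": [2, 0],
})

augmented = add_squared_features(toy)   # roughly what augment=True describes
encoded = one_hot_encode(augmented)     # roughly what onehot=True describes
print(encoded.columns.tolist())
```

The column layout returned by the real loader may differ, so inspect `X.columns` on the loaded data rather than relying on the names generated here.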
- Returns:
X (pandas.DataFrame): Baseline covariate matrix of size (num_subjects, num_features).
a (pandas.Series): Treatment assignment of size (num_subjects,). Quit smoking vs. non-quit.
t (pandas.Series): Follow-up duration, size (num_subjects,).
y (pandas.Series): Observed outcome (1) or right-censoring event (0), size (num_subjects,).
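A quick sanity check of the four returned objects might look like the sketch below. The helper `summarize_survival_data` is hypothetical and not part of causallib; it assumes treatment is coded 1 for quitters, which the description above does not state explicitly. The commented usage at the end assumes the loader returns the four objects in the order X, a, t, y, so adapt it if your installed version packs them differently.

```python
import pandas as pd


def summarize_survival_data(X, a, t, y):
    """Hypothetical helper: basic counts for covariates, treatment,
    follow-up duration and event indicator."""
    n = len(X)
    if not (len(a) == len(t) == len(y) == n):
        raise ValueError("X, a, t and y must describe the same subjects")
    return {
        "num_subjects": n,
        "num_features": X.shape[1],
        # Assumption: a == 1 marks subjects who quit smoking.
        "num_treated": int((a == 1).sum()),
        # y == 1 is an observed outcome, y == 0 a right-censoring event.
        "num_events": int((y == 1).sum()),
        "num_censored": int((y == 0).sum()),
        "max_followup": t.max(),
    }


# Made-up stand-in data so the sketch runs without causallib installed.
X_toy = pd.DataFrame({"age": [42, 55, 61]})
a_toy = pd.Series([1, 0, 0], name="a")
t_toy = pd.Series([120, 87, 120], name="t")
y_toy = pd.Series([0, 1, 0], name="y")
summary = summarize_survival_data(X_toy, a_toy, t_toy, y_toy)
print(summary)

# With the real data (requires causallib; the unpacking form is an assumption):
# from causallib.datasets import load_nhefs_survival
# X, a, t, y = load_nhefs_survival(augment=True, onehot=True)
# print(summarize_survival_data(X, a, t, y))
```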