causallib.utils.stat_utils module

(C) Copyright 2019 IBM Corp.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

causallib.utils.stat_utils.areColumnsBinary(X)[source]

Assess whether all matrix columns are binary.

Parameters

X (np.ndarray | pd.DataFrame) – Covariate matrix.

Returns

A boolean vector whose length equals the number of features (columns). An entry is True iff the corresponding column is binary.

Return type

np.ndarray
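The per-column check described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name columns_are_binary is hypothetical, and it assumes a column counts as binary when it holds at most two distinct values (missing values are not handled).

```python
import numpy as np


def columns_are_binary(X):
    """Hypothetical helper: one boolean per column of X.

    Assumption: "binary" means at most two distinct values in the column.
    """
    X = np.asarray(X)
    return np.array([np.unique(X[:, j]).size <= 2 for j in range(X.shape[1])])


# Columns 0 and 2 take two values each; column 1 is continuous.
X = np.array([[0, 1.5, 1],
              [1, 2.0, 1],
              [0, 3.7, 0]])
binary_mask = columns_are_binary(X)
```

The resulting mask can be used to route binary and non-binary features to different statistical tests.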

causallib.utils.stat_utils.calc_weighted_ks2samp(x, y, wx, wy)[source]

Weighted Kolmogorov-Smirnov

References

[1] https://stackoverflow.com/a/40059727
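A weighted two-sample Kolmogorov-Smirnov statistic can be sketched along the lines of the approach in [1]: build a weighted empirical CDF for each sample and take the largest absolute gap between them over the pooled data. The function name weighted_ks_statistic is hypothetical, and the sketch returns only the statistic; what the library function returns is not stated here.

```python
import numpy as np


def weighted_ks_statistic(x, y, wx, wy):
    """Hypothetical helper: max gap between two weighted empirical CDFs."""
    x, y, wx, wy = (np.asarray(a, dtype=float) for a in (x, y, wx, wy))
    # Sort each sample and carry its weights along.
    order_x, order_y = np.argsort(x), np.argsort(y)
    x, wx = x[order_x], wx[order_x]
    y, wy = y[order_y], wy[order_y]
    pooled = np.concatenate([x, y])
    # Normalized cumulative weights, with a leading 0 for "below all points".
    cum_wx = np.concatenate([[0.0], np.cumsum(wx) / wx.sum()])
    cum_wy = np.concatenate([[0.0], np.cumsum(wy) / wy.sum()])
    # Evaluate both weighted CDFs at every pooled data point.
    cdf_x = cum_wx[np.searchsorted(x, pooled, side="right")]
    cdf_y = cum_wy[np.searchsorted(y, pooled, side="right")]
    return float(np.max(np.abs(cdf_x - cdf_y)))


# Disjoint samples give the maximal statistic.
ks_disjoint = weighted_ks_statistic([1, 2, 3], [4, 5, 6], [1, 1, 1], [1, 1, 1])
# Same support, different weights: the CDFs differ only at the value 1.
ks_reweighted = weighted_ks_statistic([1, 2], [1, 2], [3, 1], [1, 1])
```

With uniform weights this reduces to the ordinary two-sample KS statistic.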

causallib.utils.stat_utils.calc_weighted_standardized_mean_differences(x, y, wx, wy, weighted_var=False)[source]

Standardized mean difference: $\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}$

References

[1] https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt_A0_basic_use.html#details-on-calculations

[2] https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference#Concept

Note on variance: the variance is calculated on the unadjusted sample to avoid the paradoxical situation in which adjustment decreases both the mean difference and the spread of the sample, yielding a larger SMD than before adjustment even though the adjusted groups are now more similar [1].
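The formula and the note on variance can be illustrated with a short sketch. It is not the library's code: the name weighted_smd is hypothetical, the means are weighted while the variances default to the unweighted (unadjusted) samples, and the population variance (ddof=0) is an assumption.

```python
import numpy as np


def weighted_smd(x, y, wx, wy, weighted_var=False):
    """Hypothetical helper: (mu_1 - mu_2) / sqrt(sigma_1^2 + sigma_2^2)."""
    x, y, wx, wy = (np.asarray(a, dtype=float) for a in (x, y, wx, wy))
    mean_x = np.average(x, weights=wx)
    mean_y = np.average(y, weights=wy)
    if weighted_var:
        # Weighted variance around the weighted mean.
        var_x = np.average((x - mean_x) ** 2, weights=wx)
        var_y = np.average((y - mean_y) ** 2, weights=wy)
    else:
        # Unadjusted variance (assumption: ddof=0), so that reweighting
        # cannot shrink the denominator.
        var_x, var_y = np.var(x), np.var(y)
    return (mean_x - mean_y) / np.sqrt(var_x + var_y)


smd_unweighted = weighted_smd([0, 2], [1, 3], [1, 1], [1, 1])
smd_reweighted = weighted_smd([0, 2], [1, 3], [1, 3], [1, 1])
```

In this toy example the reweighting moves the first group's mean from 1 to 1.5, halving the mean difference while the denominator stays fixed, so the magnitude of the SMD halves as well.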

causallib.utils.stat_utils.chi2_test(X, y)[source]
Parameters
  • X (np.ndarray) – Binary feature matrix

  • y (np.ndarray) – Binary response vector

Returns

A vector of p-values, one for every feature.

Return type

np.ndarray
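One way to obtain a p-value per binary feature is a Pearson chi-square test of independence on the 2x2 table of each feature against the response. The sketch below is an illustration under that assumption and may differ from the statistic the library computes; the name chi2_pvalues is hypothetical, no continuity correction is applied, and every row and column of each table is assumed to be non-empty.

```python
import math

import numpy as np


def chi2_pvalues(X, y):
    """Hypothetical helper: Pearson chi-square p-value for each column of X."""
    X, y = np.asarray(X), np.asarray(y)
    pvals = []
    for j in range(X.shape[1]):
        # Observed 2x2 counts: rows are feature values, columns are response values.
        observed = np.array(
            [[np.sum((X[:, j] == a) & (y == b)) for b in (0, 1)] for a in (0, 1)],
            dtype=float,
        )
        # Expected counts under independence.
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
        stat = np.sum((observed - expected) ** 2 / expected)
        # Survival function of a chi-square variable with 1 degree of freedom.
        pvals.append(math.erfc(math.sqrt(stat / 2.0)))
    return np.array(pvals)


y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([y, [0, 1, 0, 1, 0, 1, 0, 1]])
pvals = chi2_pvalues(X, y)
```

Here the first feature equals the response and gets a small p-value, while the second is perfectly balanced across the response and gets a p-value of 1.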

causallib.utils.stat_utils.computeCorrPvals(X, y, is_X_binary, is_y_binary, isLinear=True)[source]
Parameters
  • X (pd.DataFrame) – The covariate matrix

  • y (pd.Series) – The response

  • is_X_binary (np.ndarray) – Indication which columns are binary

  • is_y_binary (bool) – Indication whether the response vector is binary or not.

  • isLinear (bool) – Whether to perform a linear (slope) test (t-test) on the non-binary features or to perform a two-sample Kolmogorov-Smirnov test

Returns

A vector of p-values, one for every feature.

Return type

np.ndarray

causallib.utils.stat_utils.isBinary(x)[source]

Assess whether a vector is binary.

Parameters

x (pd.Series | np.ndarray) –

Returns

True iff x is binary.

Return type

bool

causallib.utils.stat_utils.is_vector_binary(vec)[source]

causallib.utils.stat_utils.robust_lookup(df, indexer)[source]

Robust way to apply pandas lookup when indices are not unique

Parameters
  • df (pd.DataFrame) –

  • indexer (pd.Series) – A Series whose index is either the same as df.index or a subset of it, and whose values are taken from df.columns. If indexer.index contains values not in df.index, the corresponding entries will be NaN.

Returns

A vector where (logically) extracted[i] = df.loc[indexer.index[i], indexer[i]].

In most cases, when indexer.index == df.index this translates to extracted[i] = df.loc[i, indexer[i]]

Return type

pd.Series
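The lookup semantics above can be sketched for the common case in which indexer.index matches df.index position by position. This is an illustration only: the name lookup_by_position is hypothetical, and it assumes every value of indexer is a column of df. Positional access sidesteps the ambiguity that label-based access has when index labels repeat.

```python
import numpy as np
import pandas as pd


def lookup_by_position(df, indexer):
    """Hypothetical helper: pick, for each row, the column named in indexer.

    Assumptions: indexer.index aligns positionally with df.index and all
    values of indexer appear in df.columns.
    """
    col_positions = df.columns.get_indexer(indexer.to_numpy())
    values = df.to_numpy()[np.arange(len(df)), col_positions]
    return pd.Series(values, index=indexer.index)


# The index label 0 appears twice, so df.loc[0, "a"] alone would be ambiguous.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]}, index=[0, 0, 1])
indexer = pd.Series(["a", "b", "b"], index=[0, 0, 1])
extracted = lookup_by_position(df, indexer)
```

A typical use is selecting, for each sample, the predicted outcome under the treatment that sample actually received.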

causallib.utils.stat_utils.which_columns_are_binary(X)[source]
Parameters

X (pd.DataFrame) –

Returns: