causallib.utils.stat_utils module

Copyright 2019 IBM Corp.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

causallib.utils.stat_utils.areColumnsBinary(X)[source]

Assess whether all matrix columns are binary.

Parameters

X (np.ndarray | pd.DataFrame) – Covariate matrix.

Returns

A boolean vector whose length equals the number of features (columns). An entry is True iff the corresponding column is binary.

Return type

np.ndarray
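A minimal sketch of a column-wise binary check, for illustration only. The helper name `columns_are_binary_sketch` is hypothetical, and the definition of "binary" used here (at most two distinct non-missing values) is an assumption rather than the library's documented rule.

```python
import numpy as np
import pandas as pd


def columns_are_binary_sketch(X):
    """Return a boolean array with one entry per column of X."""
    X = pd.DataFrame(X)  # accepts both np.ndarray and pd.DataFrame input
    # Assumption: a column is "binary" if it holds at most two distinct
    # non-missing values.
    return X.nunique(dropna=True).le(2).to_numpy()


X = np.array([[0, 1.5, 1],
              [1, 2.5, 1],
              [0, 3.5, 0]])
mask = columns_are_binary_sketch(X)  # one flag per column
```

The first and third columns take only the values 0 and 1, while the second takes three distinct values.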

causallib.utils.stat_utils.calc_weighted_ks2samp(x, y, wx, wy)[source]

Weighted Kolmogorov-Smirnov

References

[1] https://stackoverflow.com/a/40059727
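The statistic can be sketched as the largest gap between two weighted empirical CDFs, in the spirit of the approach in [1]. The function name `weighted_ks_sketch` is hypothetical, and this sketch returns only the statistic (no p-value); whether the library function does the same is not stated here.

```python
import numpy as np


def weighted_ks_sketch(x, y, wx, wy):
    """Largest absolute gap between the weighted ECDFs of x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    wx, wy = np.asarray(wx, dtype=float), np.asarray(wy, dtype=float)
    # Sort each sample and carry its weights along.
    ix, iy = np.argsort(x), np.argsort(y)
    x, wx = x[ix], wx[ix]
    y, wy = y[iy], wy[iy]
    # Weighted ECDF values; the leading 0 covers points below the minimum.
    cdf_x = np.concatenate([[0.0], np.cumsum(wx) / wx.sum()])
    cdf_y = np.concatenate([[0.0], np.cumsum(wy) / wy.sum()])
    # Evaluate both ECDFs at every observed point of the pooled sample.
    grid = np.concatenate([x, y])
    fx = cdf_x[np.searchsorted(x, grid, side="right")]
    fy = cdf_y[np.searchsorted(y, grid, side="right")]
    return np.max(np.abs(fx - fy))


ks_same = weighted_ks_sketch([1, 2, 3], [1, 2, 3], [1, 1, 1], [1, 1, 1])
ks_disjoint = weighted_ks_sketch([1, 2, 3], [10, 11, 12], [1, 1, 1], [1, 1, 1])
ks_reweighted = weighted_ks_sketch([0, 1], [0, 1], [3, 1], [1, 1])
```

With uniform weights this reduces to the ordinary two-sample statistic; in the last call, the weights alone create a gap between two samples that share the same values.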

causallib.utils.stat_utils.calc_weighted_standardized_mean_differences(x, y, wx, wy, weighted_var=False)[source]

Standardized mean difference: $\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}$

References

[1] https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt_A0_basic_use.html#details-on-calculations
[2] https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference#Concept

Note on variance:

- The variance is calculated on the unadjusted sample to avoid the paradoxical situation in which adjustment decreases both the mean difference and the spread of the sample, yielding a larger SMD than before adjustment even though the adjusted groups are now more similar [1].
- The denominator is as depicted in the “statistical estimation” section: https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference#Statistical_estimation, namely, disregarding the covariance term [2], and is unweighted as suggested above in [1].
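The formula and the note on variance can be sketched as follows: weighted means in the numerator, unweighted variances in the denominator. The name `weighted_smd_sketch` is hypothetical, the use of the sample variance (`ddof=1`) is an assumption, and the `weighted_var=True` branch is not covered.

```python
import numpy as np


def weighted_smd_sketch(x, y, wx, wy):
    """(weighted mean difference) / sqrt(sum of unweighted variances)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mean_x = np.average(x, weights=wx)
    mean_y = np.average(y, weights=wy)
    # Assumption: sample variance (ddof=1) of the raw, unweighted values,
    # so reweighting cannot shrink the denominator.
    var_x = np.var(x, ddof=1)
    var_y = np.var(y, ddof=1)
    return (mean_x - mean_y) / np.sqrt(var_x + var_y)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 4.0, 5.0]
smd_unweighted = weighted_smd_sketch(x, y, np.ones(4), np.ones(4))
# Up-weighting the larger x values moves the weighted means closer together.
smd_weighted = weighted_smd_sketch(x, y, [1, 1, 2, 2], np.ones(4))
```

Because the denominator is fixed by the unadjusted data, any change in the result after weighting comes from the numerator alone.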

causallib.utils.stat_utils.chi2_test(X, y)[source]

Parameters

X (np.ndarray) – Binary feature matrix
y (np.ndarray) – Binary response vector

Returns

A vector of p-values, one for every feature.

Return type

np.ndarray
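One way to obtain a p-value per binary feature is to run a chi-squared test of independence on each feature's contingency table against the response. The sketch below assumes that approach and uses `scipy.stats.chi2_contingency` (with its default continuity correction for 2x2 tables); the helper name `chi2_pvalues_sketch` is hypothetical, and the library may compute the test differently.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_pvalues_sketch(X, y):
    """One chi-squared p-value per column of a binary matrix X."""
    X = np.asarray(X)
    y = np.asarray(y)
    pvals = []
    for j in range(X.shape[1]):
        # Contingency table of counts: feature value (rows) by response (columns).
        table = pd.crosstab(X[:, j], y).to_numpy()
        _, p, _, _ = chi2_contingency(table)
        pvals.append(p)
    return np.array(pvals)


# Column 0 copies y exactly; column 1 is balanced across both response groups.
X = np.array([[0, 0], [0, 1], [0, 0], [0, 1],
              [1, 0], [1, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pvals = chi2_pvalues_sketch(X, y)
```

The feature that tracks the response gets a small p-value, while the balanced feature shows no evidence of association.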

causallib.utils.stat_utils.computeCorrPvals(X, y, is_X_binary, is_y_binary, isLinear=True)[source]

Parameters

X (pd.DataFrame) – The covariate matrix
y (pd.Series) – The response
is_X_binary (np.ndarray) – Indicates which columns are binary
is_y_binary (bool) – Indicates whether the response vector is binary.
isLinear (bool) – If True, perform a linear (slope) t-test on the non-binary features; otherwise, perform a two-sample Kolmogorov-Smirnov test.

Returns

A vector of p-values, one for every feature.

Return type

np.ndarray

causallib.utils.stat_utils.isBinary(x)[source]

Assess whether a vector is binary.

Parameters

x (pd.Series | np.ndarray) –

Returns: True iff x is binary.
Return type: bool

causallib.utils.stat_utils.is_vector_binary(vec)[source]

causallib.utils.stat_utils.robust_lookup(df, indexer)[source]

Robust way to apply pandas lookup when indices are not unique

Parameters

df (pd.DataFrame) –
indexer (pd.Series) – A Series whose index is either the same as or a subset of df.index, and whose values are taken from df.columns. If indexer.index contains values not in df.index, they will have NaN values.

Returns

A vector where (logically) extracted[i] = df.loc[indexer.index[i], indexer[i]]. In most cases, when indexer.index == df.index, this translates to extracted[i] = df.loc[i, indexer[i]].

Return type

pd.Series
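The lookup semantics for the common case, where indexer.index matches df.index row for row, can be sketched with positional indexing, which sidesteps ambiguity from repeated index labels. The name `lookup_sketch` is hypothetical; this sketch assumes unique column labels and does not reproduce the subset or NaN behavior described above.

```python
import numpy as np
import pandas as pd


def lookup_sketch(df, indexer):
    """Pick, for every row of df, the value in the column named by indexer."""
    # Assumption: indexer.index equals df.index row for row (duplicates allowed)
    # and every value of indexer is an existing, unique column label.
    col_pos = df.columns.get_indexer(indexer.to_numpy())
    row_pos = np.arange(len(df))
    values = df.to_numpy()[row_pos, col_pos]
    return pd.Series(values, index=indexer.index)


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=[0, 0, 1])
indexer = pd.Series(["a", "b", "b"], index=[0, 0, 1])
extracted = lookup_sketch(df, indexer)
```

Here the index label 0 appears twice, so a label-based `df.loc[0, ...]` would return two rows; selecting by position keeps one value per row.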

causallib.utils.stat_utils.which_columns_are_binary(X)[source]

Parameters: X (pd.DataFrame) –

Returns: