causallib.utils.stat_utils module
Copyright 2019 IBM Corp.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- causallib.utils.stat_utils.areColumnsBinary(X)[source]
Assess whether all matrix columns are binary.
- Parameters
X (np.ndarray | pd.DataFrame) – Covariate matrix.
- Returns
A boolean vector whose length equals the number of features (columns). An entry is True iff the corresponding column is binary.
- Return type
np.ndarray
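The following is a minimal sketch of a column-wise binary check, not the library's implementation. It assumes "binary" means a column holds exactly two distinct non-missing values; the helper name `columns_are_binary` and the sample data are hypothetical.

```python
import pandas as pd


def columns_are_binary(X):
    """Hypothetical helper: flag columns with exactly two distinct values."""
    X = pd.DataFrame(X)
    # nunique() ignores missing values by default (dropna=True).
    return (X.nunique() == 2).to_numpy()


X = pd.DataFrame(
    {
        "treated": [0, 1, 1, 0],
        "age": [31, 45, 27, 45],
        "sex": ["f", "m", "m", "f"],
    }
)
mask = columns_are_binary(X)  # one boolean per column, in column order
```

Under this assumption, `treated` and `sex` are flagged as binary and `age` is not.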
- causallib.utils.stat_utils.calc_weighted_ks2samp(x, y, wx, wy)[source]
Weighted Kolmogorov-Smirnov
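As an illustration of the idea, a weighted two-sample Kolmogorov-Smirnov statistic can be taken as the largest absolute gap between the two weighted empirical CDFs. The sketch below is written under that assumption; the name `weighted_ks_statistic` is hypothetical, it returns only the statistic, and it is not necessarily how the library computes its result.

```python
import numpy as np


def weighted_ks_statistic(x, y, wx, wy):
    """Hypothetical sketch: max gap between two weighted empirical CDFs."""
    x, y, wx, wy = (np.asarray(a, dtype=float) for a in (x, y, wx, wy))

    # Sort each sample and carry its weights along.
    order_x, order_y = np.argsort(x), np.argsort(y)
    x, wx = x[order_x], wx[order_x]
    y, wy = y[order_y], wy[order_y]

    # Weighted ECDF values at the sorted points, with a leading 0 so that
    # "no observation <= t" maps to probability 0.
    cdf_x = np.concatenate([[0.0], np.cumsum(wx) / wx.sum()])
    cdf_y = np.concatenate([[0.0], np.cumsum(wy) / wy.sum()])

    # Evaluate both ECDFs at every observed value; side="right" counts
    # the observations that are <= the grid point.
    grid = np.concatenate([x, y])
    fx = cdf_x[np.searchsorted(x, grid, side="right")]
    fy = cdf_y[np.searchsorted(y, grid, side="right")]
    return float(np.max(np.abs(fx - fy)))


separated = weighted_ks_statistic([1, 2, 3], [4, 5, 6], [1, 1, 1], [1, 1, 1])
reweighted = weighted_ks_statistic([1, 2], [1, 2], [1, 1], [3, 1])
```

With uniform weights this reduces to the ordinary two-sample KS statistic.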
References
- causallib.utils.stat_utils.calc_weighted_standardized_mean_differences(x, y, wx, wy, weighted_var=False)[source]
Standardized mean difference: $\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}$
References
[1] https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt_A0_basic_use.html#details-on-calculations
[2] https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference#Concept
Note on variance:
- The variance is calculated on the unadjusted sample to avoid the paradoxical situation in which adjustment decreases both the mean difference and the spread of the sample, yielding a larger SMD than before adjustment even though the adjusted groups are now more similar [1].
- The denominator is as depicted in the “statistical estimation” section: https://en.wikipedia.org/wiki/Strictly_standardized_mean_difference#Statistical_estimation, namely, disregarding the covariance term [2], and is unweighted as suggested above in [1].
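The formula and the variance note can be sketched as follows: weighted means in the numerator, unweighted variances in the denominator, and no covariance term. This is an illustration only. The name `weighted_smd` is hypothetical, the use of the sample variance (`ddof=1`) is an assumption, and the sketch presumably corresponds to the `weighted_var=False` case only.

```python
import numpy as np


def weighted_smd(x, y, wx, wy):
    """Hypothetical sketch: weighted mean difference over unweighted spread."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # The adjustment (weights) affects the means only.
    mean_x = np.average(x, weights=wx)
    mean_y = np.average(y, weights=wy)

    # Variances come from the unadjusted samples (assumption: ddof=1).
    var_x = x.var(ddof=1)
    var_y = y.var(ddof=1)

    # Denominator sqrt(sigma_1^2 + sigma_2^2), covariance term disregarded.
    return float((mean_x - mean_y) / np.sqrt(var_x + var_y))


smd_unadjusted = weighted_smd([0, 2], [1, 3], [1, 1], [1, 1])
smd_adjusted = weighted_smd([0, 2], [1, 3], [1, 3], [1, 1])
```

Because the denominator is the same in both calls, a change in the result reflects only the shift in the weighted means.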
- causallib.utils.stat_utils.chi2_test(X, y)[source]
- Parameters
X (np.ndarray) – Binary feature matrix
y (np.ndarray) – Binary response vector
- Returns
A vector of p-values, one for every feature.
- Return type
np.ndarray
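One way to obtain a p-value per binary feature is a Pearson chi-square test of independence on the 2x2 table of each feature against y. The sketch below takes that approach as an assumption; the library may compute its statistic differently. The name `chi2_pvalues_2x2` is hypothetical, the inputs are assumed to be coded 0/1 with no empty row or column margins, and no continuity correction is applied.

```python
import math

import numpy as np


def chi2_pvalues_2x2(X, y):
    """Hypothetical sketch: Pearson chi-square p-value for each 0/1 column."""
    X = np.asarray(X)
    y = np.asarray(y)
    pvals = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Observed 2x2 contingency table: rows = feature value, cols = y value.
        observed = np.array(
            [[np.sum((col == a) & (y == b)) for b in (0, 1)] for a in (0, 1)],
            dtype=float,
        )
        # Expected counts under independence: row total * col total / n.
        expected = (
            np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
        )
        stat = float(np.sum((observed - expected) ** 2 / expected))
        # Survival function of chi-square with 1 degree of freedom.
        pvals.append(math.erfc(math.sqrt(stat / 2.0)))
    return np.array(pvals)


y = np.array([0] * 10 + [1] * 10)
X = np.column_stack([y, np.tile([0, 1], 10)])  # col 0 copies y, col 1 is unrelated
pvals = chi2_pvalues_2x2(X, y)
```

The first column reproduces y and gets a very small p-value; the second is balanced across both classes and gets a p-value of 1.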
- causallib.utils.stat_utils.computeCorrPvals(X, y, is_X_binary, is_y_binary, isLinear=True)[source]
- Parameters
X (pd.DataFrame) – The covariate matrix
y (pd.Series) – The response
is_X_binary (np.ndarray) – Indication which columns are binary
is_y_binary (bool) – Indication whether the response vector is binary or not.
isLinear (bool) – If True, perform a linear (slope) t-test on the non-binary features; otherwise perform a two-sample Kolmogorov-Smirnov test.
- Returns
A vector of p-values, one for every feature.
- Return type
np.ndarray
- causallib.utils.stat_utils.isBinary(x)[source]
Assess whether a vector is binary.
- Parameters
x (pd.Series | np.ndarray) –
- Returns
True iff x is binary.
- Return type
bool
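For a single vector, the check can be sketched as counting distinct values. This is an illustration under the assumption that "binary" means exactly two distinct values; `is_binary_vector` is a hypothetical name, and the handling of missing values is left open (`np.unique` treats NaN as a value of its own).

```python
import numpy as np


def is_binary_vector(x):
    """Hypothetical sketch: True iff x holds exactly two distinct values."""
    return bool(np.unique(np.asarray(x)).size == 2)


checks = [
    is_binary_vector([0, 1, 1, 0]),     # two distinct values
    is_binary_vector([1, 1, 1]),        # constant vector
    is_binary_vector([0, 1, 2]),        # three distinct values
    is_binary_vector(["a", "b", "a"]),  # two distinct labels
]
```

Note that under this definition the two values need not be 0 and 1.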
- causallib.utils.stat_utils.robust_lookup(df, indexer)[source]
Robust way to apply pandas lookup when indices are not unique
- Parameters
df (pd.DataFrame) –
indexer (pd.Series) – A Series whose index is either the same as, or a subset of, df.index and whose values are taken from df.columns. If indexer.index contains values not in df.index, they will have NaN values.
- Returns
A vector where (logically) extracted[i] = df.loc[indexer.index[i], indexer[i]].
In most cases, when indexer.index == df.index, this translates to extracted[i] = df.loc[i, indexer[i]].
- Return type
pd.Series
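To make the lookup semantics concrete, here is a small sketch of the common case only, where indexer has the same index as df in the same order. It works positionally, so repeated index labels are not a problem. The name `positional_lookup` is hypothetical, this is not the library's implementation, and it assumes df.columns is unique and every value of indexer appears in it.

```python
import numpy as np
import pandas as pd


def positional_lookup(df, indexer):
    """Hypothetical sketch: pick one column per row, row i paired with indexer[i]."""
    # Translate column labels into integer positions (assumes all are present).
    col_pos = df.columns.get_indexer(indexer.to_numpy())
    row_pos = np.arange(len(df))
    # Fancy indexing pairs row i with its chosen column.
    values = df.to_numpy()[row_pos, col_pos]
    return pd.Series(values, index=indexer.index)


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=[0, 0, 1])
indexer = pd.Series(["a", "b", "b"], index=df.index)
extracted = positional_lookup(df, indexer)
```

Here the first row takes its value from column a and the other two from column b, even though the index label 0 appears twice.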