ui module¶
DVHA-Stats classes for user interaction
class dvhastats.ui.ControlChartUI(y, std=3, ucl_limit=None, lcl_limit=None, var_name=None, x=None, plot_title=None)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.ControlChart
Univariate Control Chart
Parameters: - y (list, np.ndarray) – Input data (1-D)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- var_name (str, optional) – Name of the plotted variable
- x (list, np.ndarray, optional) – x-axis values
- plot_title (str, optional) – Override the plot title
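The limit calculation wrapped by this class can be sketched with plain numpy. This is a sketch, assuming sigma is estimated from the average two-point moving range (d2 = 1.128), consistent with the avg_moving_range and sigma properties described in the stats module:

```python
import numpy as np

def control_limits(y, std=3, ucl_limit=None, lcl_limit=None):
    """Sketch of individuals-chart control limits."""
    y = np.asarray(y, dtype=float)
    center_line = np.mean(y)
    avg_mr = np.mean(np.abs(np.diff(y)))  # average 2-point moving range
    sigma = avg_mr / 1.128                # d2 constant for subgroups of 2
    ucl = center_line + std * sigma
    lcl = center_line - std * sigma
    if ucl_limit is not None:             # optional caps, as in ControlChartUI
        ucl = min(ucl, ucl_limit)
    if lcl_limit is not None:
        lcl = max(lcl, lcl_limit)
    return center_line, lcl, ucl

cl, lcl, ucl = control_limits([10, 12, 11, 13, 10, 11, 12])
```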
class dvhastats.ui.CorrelationMatrixUI(X, var_names=None, corr_type='Pearson', cmap='coolwarm')[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.CorrelationMatrix
Correlation matrix UI object (Pearson-R or Spearman)
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- var_names (list, optional) – Optionally set the variable names with a list of str
- corr_type (str) – Either “Pearson” or “Spearman”
- cmap (str) – matplotlib compatible color map
show(absolute=False, corr=True)[source]¶
Create a heat map of the correlation matrix
Parameters: - absolute (bool) – Heat map will display the absolute values of the correlation matrix if True
- corr (bool) – Plot a p-value matrix if False, correlation matrix if True.
Returns: The number of the newly created matplotlib figure
Return type: int
class dvhastats.ui.DVHAStats(data=None, var_names=None, x_axis=None, avg_len=5, del_const_vars=False)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass
The main UI class object for DVHAStats
Parameters: - data (numpy.array, dict, str, None) – Input data (2-D) with N rows of observations and p columns of variables. If data is a str, it is treated as a path to a CSV file, which must have a header row of column names. Test data is loaded if None
- var_names (list of str, optional) – If data is a numpy array, optionally provide the column names.
- x_axis (numpy.array, list, optional) – Specify x_axis for plotting purposes. Default is based on row number in data
- avg_len (int) – When plotting raw data, a trend line will be plotted using this value as an averaging length. If N < avg_len + 1, no trend line will be plotted
- del_const_vars (bool) – Automatically delete any variables that have constant data. The names of these variables are stored in the excluded_vars attr. Default value is False.
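A sketch of the expected 2-D layout, using hypothetical variable names. The dict handling shown here is an illustration of the N-rows-by-p-columns shape described above, not dvhastats's own loader:

```python
import numpy as np

# Hypothetical data: two variables observed six times each
data_dict = {"var_a": [1.0, 2.0, 1.5, 2.5, 2.0, 3.0],
             "var_b": [10.0, 9.0, 11.0, 10.5, 9.5, 10.0]}

var_names = list(data_dict.keys())
# N rows of observations, p columns of variables
data = np.array([data_dict[name] for name in var_names]).T
```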
box_cox(alpha=None, lmbda=None, const_policy='propagate')[source]¶
Apply box_cox_by_index for all data
box_cox_by_index(index, alpha=None, lmbda=None, const_policy='propagate')[source]¶
Apply a Box-Cox power transformation to the data of one variable
Parameters: - index (int, str) – The index corresponding to the variable data to have a box-cox transformation applied. If index is a string, it will be assumed to be the var_name
- lmbda (None, scalar, optional) – If lmbda is not None, do the transformation for that value. If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- alpha (None, float, optional) – If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- const_policy (str) – {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘propagate’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: remove NaN data
Returns: Results from stats.box_cox
Return type: np.ndarray
constant_var_indices¶
Get a list of all constant variable indices
Returns: Indices of variables with no variation Return type: list
constant_vars¶
Get a list of all constant variables
Returns: Names of variables with no variation Return type: list
correlation_matrix(corr_type='Pearson')[source]¶
Get a Pearson-R or Spearman correlation matrix
Parameters: corr_type (str) – Either “Pearson” or “Spearman” Returns: A CorrelationMatrixUI class object Return type: CorrelationMatrixUI
del_var(var_name)[source]¶
Delete the variable corresponding to var_name from the data
Parameters: var_name (int, str) – The var_name to delete (or index of variable)
get_data_by_var_name(var_name)[source]¶
Get the single variable array based on var_name
Parameters: var_name (int, str) – The name (str) or index (int) of the variable of interest Returns: The column of data for the given var_name Return type: np.ndarray
get_index_by_var_name(var_name)[source]¶
Get the variable index by var_name
Parameters: var_name (int, str) – The name (str) or index (int) of the variable of interest Returns: The column index for the given var_name Return type: int
histogram(var_name, bins='auto', nan_policy='omit')[source]¶
Get a Histogram class object
Parameters: - var_name (str, int) – The name (str) or index (int) of the variable to plot
- bins (int, list, str, optional) – See https://numpy.org/doc/stable/reference/generated/numpy.histogram.html for details
- nan_policy (str) – Value must be one of the following: ‘propagate’, ‘raise’, ‘omit’ Defines how to handle when input contains nan. The following options are available (default is ‘omit’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: performs the calculations ignoring nan values
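The three nan_policy options can be sketched with numpy. This is a minimal sketch of the policy logic described above, not the library's implementation:

```python
import numpy as np

def histogram_with_nan_policy(y, bins="auto", nan_policy="omit"):
    """Sketch of a histogram calculation under the three nan policies."""
    y = np.asarray(y, dtype=float)
    if np.any(np.isnan(y)):
        if nan_policy == "raise":
            raise ValueError("input contains nan")
        if nan_policy == "propagate":
            return np.nan, np.nan
        y = y[~np.isnan(y)]  # 'omit': ignore nan values
    hist, bin_edges = np.histogram(y, bins=bins)
    return hist, bin_edges

hist, edges = histogram_with_nan_policy([1, 2, 2, 3, np.nan], bins=3)
```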
hotelling_t2(alpha=0.05, box_cox=False, box_cox_alpha=None, box_cox_lmbda=None, const_policy='omit')[source]¶
Calculate a Hotelling T^2 multivariate control chart
Parameters: - alpha (float) – Significance level used to determine the upper control limit (ucl)
- box_cox (bool, optional) – Set to true to perform a Box-Cox transformation on data prior to calculating the control chart statistics
- box_cox_alpha (float, optional) – If box_cox_alpha is not None, return the 100 * (1 - box_cox_alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- box_cox_lmbda (float, optional) – If box_cox_lmbda is not None, do the transformation for that value. If None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- const_policy (str) – {‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘raise’): ‘raise’: throws an error ‘omit’: exclude constant variables from calculation
Returns: HotellingT2UI class object
Return type: HotellingT2UI
is_constant(var_name)[source]¶
Determine if data by var_name is constant
Parameters: var_name (int, str) – The var_name to check (or index of variable) Returns: True if all values of var_name are the same (i.e., no variation) Return type: bool
linear_reg(y, y_var_name=None, reg_vars=None, saved_reg=None, back_elim=False, back_elim_p=0.05)[source]¶
Initialize a MultiVariableRegression class object
Parameters: - y (np.ndarray, list, str, int) – Dependent data based on DVHAStats.data. If y is str or int, then it is assumed to be the var_name or index of data to be set as the dependent variable
- y_var_name (int, str, optional) – Optionally provide name of the dependent variable. Automatically set if y is str or int
- reg_vars (list, optional) – Optionally specify variable names or indices of data to be used in the regression
- saved_reg (MultiVariableRegression, optional) – If supplied, predicted values (y-hat) will be calculated with DVHAStats.data and the regression from saved_reg. This is useful if testing a regression model on new data.
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
Returns: A LinearRegUI class object.
Return type: LinearRegUI
non_const_data¶
Return self.data excluding any constant variables
Returns: Data with constant variables removed. This does not alter the data property. Return type: np.ndarray
observations¶
Number of observations in data
Returns: Number of rows in data Return type: int
pca(n_components=0.95, transform=True, **kwargs)[source]¶
Return an sklearn PCA-like object, see PCA object for details
Parameters: - n_components (int, float, None or str) – Number of components to keep. If n_components is not set all components are kept: n_components == min(n_samples, n_features) If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
- transform (bool) – Fit the model and apply the dimensionality reduction
- kwargs (any) – Provide any keyword arguments for sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Returns: A principal component analysis object inherited from sklearn.decomposition.PCA
Return type: PCAUI
risk_adjusted_control_chart(y, std=3, ucl_limit=None, lcl_limit=None, saved_reg=None, y_name=None, reg_vars=None, back_elim=False, back_elim_p=0.05)[source]¶
Calculate control limits for a Risk-Adjusted Control Chart
Parameters: - y (list, np.ndarray) – 1-D Input data (dependent data)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- saved_reg (MultiVariableRegression, optional) – If supplied, predicted values (y-hat) will be calculated with DVHAStats.data and the regression from saved_reg. This is useful if testing a regression model on new data.
- y_name (int, str, optional) – Optionally provide name of the dependent variable. Automatically set if y is str or int
- reg_vars (list, optional) – Optionally specify variable names or indices of data to be used in the regression
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
show(var_name=None, plot_type='trend', **kwargs)[source]¶
Display a plot of var_name with matplotlib
Parameters: - var_name (str, int, None) – The name (str) or index (int) of the variable to plot. If None and plot_type=”box”, all variables will be plotted.
- plot_type (str) – Either “trend”, “hist”, “box”
- kwargs (any) – If plot_type is “hist”, pass any of the matplotlib hist key word arguments
Returns: The number of the newly created matplotlib figure
Return type: int
univariate_control_chart(var_name, std=3, ucl_limit=None, lcl_limit=None, box_cox=False, box_cox_alpha=None, box_cox_lmbda=None, const_policy='propagate')[source]¶
Calculate control limits for a standard univariate Control Chart
Parameters: - var_name (str, int) – The name (str) or index (int) of the variable to plot
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- box_cox (bool, optional) – Set to true to perform a Box-Cox transformation on data prior to calculating the control chart statistics
- box_cox_alpha (float, optional) – If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- box_cox_lmbda (float, optional) – If lmbda is not None, do the transformation for that value. If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- const_policy (str) – {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘propagate’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: remove NaN data
Returns: ControlChartUI class object
Return type: ControlChartUI
univariate_control_charts(**kwargs)[source]¶
Calculate control charts for all variables
Parameters: kwargs (any) – See univariate_control_chart for keyword parameters Returns: ControlChart class objects stored in a dictionary with var_names and indices as keys (can use var_name or index) Return type: dict
variable_count¶
Number of variables in data
Returns: Number of columns in data Return type: int
class dvhastats.ui.DVHAStatsBaseClass[source]¶
Bases: object
Base Class for DVHAStats objects and child objects
class dvhastats.ui.HotellingT2UI(data, alpha=0.05, plot_title=None)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.HotellingT2
Hotelling’s t-squared statistic for multivariate hypothesis testing
Parameters: - data (np.ndarray) – A 2-D array of data to perform multivariate analysis. (e.g., DVHAStats.data)
- alpha (float) – The significance level used to calculate the upper control limit (UCL)
- plot_title (str, optional) – Override the plot title
class dvhastats.ui.LinearRegUI(X, y, saved_reg=None, var_names=None, y_var_name=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.MultiVariableRegression
A MultiVariableRegression class UI object
Parameters: - X (array-like) – Independent data
- y (np.ndarray, list) – Dependent data based on DVHAStats.data
- saved_reg (MultiVariableRegression, optional) – If supplied, predicted values (y-hat) will be calculated with DVHAStats.data and the regression from saved_reg. This is useful if testing a regression model on new data.
- var_names (list, optional) – Optionally provide names of the independent variables
- y_var_name (int, str, optional) – Optionally provide name of the dependent variable
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
class dvhastats.ui.PCAUI(X, var_names=None, n_components=0.95, transform=True, **kwargs)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.PCA
Principal Component Analysis (PCA) UI object
Parameters: - X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
- var_names (list, optional) – Names of the independent variables in X
- n_components (int, float, None or str) – Number of components to keep. if n_components is not set all components are kept: n_components == min(n_samples, n_features) If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
- transform (bool) – Fit the model and apply the dimensionality reduction
- kwargs (any) – Provide any keyword arguments for sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
show(plot_type='feature_map', absolute=True)[source]¶
Create a heat map of PCA components
Parameters: - plot_type (str) – Select a plot type to display. Options include: feature_map.
- absolute (bool) – Heat map will display the absolute values in PCA components if True
Returns: The number of the newly created matplotlib figure
Return type: int
class dvhastats.ui.RiskAdjustedControlChartUI(X, y, std=3, ucl_limit=None, lcl_limit=None, x=None, y_name=None, var_names=None, saved_reg=None, plot_title=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.RiskAdjustedControlChart
Risk-Adjusted Control Chart using a Multi-Variable Regression
Parameters: - X (array-like) – Input array (independent data)
- y (list, np.ndarray) – 1-D Input data (dependent data)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- x (list, np.ndarray, optional) – x-axis values
- plot_title (str, optional) – Override the plot title
- saved_reg (MultiVariableRegression, optional) – Optionally provide a previously calculated regression
- var_names (list, optional) – Optionally provide names of the variables
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
stats module¶
Statistical calculations and class objects
class dvhastats.stats.ControlChart(y, std=3, ucl_limit=None, lcl_limit=None, x=None)[source]¶
Bases: object
Calculate control limits for a standard univariate Control Chart
Parameters: - y (list, np.ndarray) – Input data (1-D)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- x (list, np.ndarray, optional) – x-axis values
avg_moving_range¶
Avg moving range based on 2 consecutive points
Returns: Average moving range. Returns NaN if arr is empty. Return type: np.ndarray, np.nan
center_line¶
Center line of charting data (i.e., mean value)
Returns: Mean value of y with np.mean() or np.nan if y is empty Return type: np.ndarray, np.nan
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for control chart visuals. Keys include ‘x’, ‘y’, ‘out_of_control’, ‘center_line’, ‘lcl’, ‘ucl’ Return type: dict
control_limits¶
Calculate the lower and upper control limits
Returns: - lcl (float) – Lower Control Limit (LCL)
- ucl (float) – Upper Control Limit (UCL)
out_of_control¶
Get the indices of out-of-control observations
Returns: An array of indices that are not between the lower and upper control limits Return type: np.ndarray
out_of_control_high¶
Get the indices of observations > ucl
Returns: An array of indices that are greater than the upper control limit Return type: np.ndarray
out_of_control_low¶
Get the indices of observations < lcl
Returns: An array of indices that are less than the lower control limit Return type: np.ndarray
sigma¶
UCL/LCL = center_line +/- sigma * std
Returns: sigma or np.nan if arr is empty Return type: np.ndarray, np.nan
class dvhastats.stats.CorrelationMatrix(X, corr_type='Pearson')[source]¶
Bases: object
Correlation matrix object (Pearson-R or Spearman)
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- corr_type (str) – Either “Pearson” or “Spearman”
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for correlation matrix visuals. Keys include ‘corr’, ‘p’, ‘norm’, ‘norm_p’ Return type: dict
normality¶
The normality and normality p-value of the input array
Returns: - statistic (np.ndarray) – Normality calculated with scipy.stats.normaltest
- p-value (np.ndarray) – A 2-sided chi squared probability for the hypothesis test.
class dvhastats.stats.Histogram(y, bins, nan_policy='omit')[source]¶
Bases: object
Basic histogram plot using matplotlib histogram calculation
Parameters: - y (array-like) – Input array.
- bins (int, list, str, optional) – If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines a monotonically increasing array of bin edges, including the rightmost edge, allowing for non-uniform bin widths. If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges. ‘auto’ - Maximum of the ‘sturges’ and ‘fd’ estimators. Provides good all around performance. ‘fd’ - (Freedman Diaconis Estimator) Robust (resilient to outliers) estimator that takes into account data variability and data size. ‘doane’ - An improved version of Sturges’ estimator that works better with non-normal datasets. ‘scott’ - Less robust estimator that takes into account data variability and data size. ‘stone’ - Estimator based on leave-one-out cross-validation estimate of the integrated squared error. Can be regarded as a generalization of Scott’s rule. ‘rice’ - Estimator does not take variability into account, only data size. Commonly overestimates number of bins required. ‘sturges’ - R’s default method, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets. ‘sqrt’ - Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
- nan_policy (str) – Value must be one of the following: ‘propagate’, ‘raise’, ‘omit’ Defines how to handle when input contains nan. The following options are available (default is ‘omit’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: performs the calculations ignoring nan values
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for Histogram visuals. Keys include ‘x’, ‘y’, ‘mean’, ‘median’, ‘std’, ‘normality’, ‘normality_p’ Return type: dict
hist_data¶
Get the histogram data
Returns: - hist (np.ndarray) – The values of the histogram
- center (np.ndarray) – The centers of the bins
mean¶
The mean value of the input array
Returns: Mean value of y with np.mean() Return type: np.ndarray
median¶
The median value of the input array
Returns: Median value of y with np.median() Return type: np.ndarray
normality¶
The normality and normality p-value of the input array
Returns: - statistic (float) – Normality calculated with scipy.stats.normaltest
- p-value (float) – A 2-sided chi squared probability for the hypothesis test.
std¶
The standard deviation of the input array
Returns: Standard deviation of y with np.std() Return type: np.ndarray
class dvhastats.stats.HotellingT2(data, alpha=0.05, const_policy='raise')[source]¶
Bases: object
Hotelling’s t-squared statistic for multivariate hypothesis testing
Parameters: - data (np.ndarray) – A 2-D array of data to perform multivariate analysis. (e.g., DVHAStats.data)
- alpha (float) – The significance level used to calculate the upper control limit (UCL)
- const_policy (str) – {‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘raise’): ‘raise’: throws an error ‘omit’: exclude constant variables from calculation
Q¶
Calculate Hotelling T^2 statistic (Q) from a 2-D numpy array
Returns: A numpy array of Hotelling T^2 (1-D of length N) Return type: np.ndarray
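The Q statistic can be sketched directly from its definition, assuming the mean vector and covariance matrix are the sample estimates from the data itself:

```python
import numpy as np

def hotelling_t2_q(data):
    """Q_i = (x_i - xbar)' S^-1 (x_i - xbar) for each observation x_i."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(data, rowvar=False))  # sample covariance
    diff = data - mean
    # Quadratic form per row: sum_j sum_k diff[i,j] * s_inv[j,k] * diff[i,k]
    return np.einsum("ij,jk,ik->i", diff, s_inv, diff)

rng = np.random.default_rng(42)
q = hotelling_t2_q(rng.normal(size=(50, 3)))
```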
center_line¶
Center line for the control chart
Returns: Median value of beta distribution. Return type: float
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for control chart visuals. Keys include ‘x’, ‘y’, ‘out_of_control’, ‘center_line’, ‘lcl’, ‘ucl’ Return type: dict
control_limits¶
Lower and Upper control limits
Returns: - lcl (float) – Lower Control Limit (LCL). This is fixed to 0 for Hotelling T2
- ucl (float) – Upper Control Limit (UCL)
get_control_limit(x)[source]¶
Calculate a Hotelling T^2 control limit using a beta distribution
Parameters: x (float) – Value where the beta function is evaluated Returns: The control limit for a beta distribution Return type: float
observations¶
Number of observations in data
Returns: Number of rows in data Return type: int
out_of_control¶
Indices of out-of-control observations
Returns: An array of indices that are greater than the upper control limit. (NOTE: Q is never negative) Return type: np.ndarray
ucl¶
Upper control limit
Returns: ucl – Upper Control Limit (UCL) Return type: float
variable_count¶
Number of variables in data
Returns: Number of columns in data Return type: int
class dvhastats.stats.MultiVariableRegression(X, y, saved_reg=None, var_names=None, y_var_name=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: object
Multi-variable regression using scikit-learn
Parameters: - X (array-like) – Independent data
- y (array-like) – Dependent data
- saved_reg (MultiVariableRegression, optional) – Optionally provide a previously calculated regression
- var_names (list, optional) – Optionally provide names of the variables
- y_var_name (int, str, optional) – Optionally provide name of the dependent variable
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
backward_elimination(p_value=0.05)[source]¶
Remove insignificant variables from regression
Parameters: p_value (float) – Iteratively remove the least significant variable until all variables have p-values less than p_value or only one variable remains.
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for residual visuals. Keys include ‘x’, ‘y’, ‘pred’, ‘resid’, ‘coef’, ‘r_sq’, ‘mse’, ‘std_err’, ‘t_value’, ‘p_value’ Return type: dict
coef¶
Coefficients for the regression
Returns: An array of regression coefficients (i.e., y_intercept, 1st var slope, 2nd var slope, etc.) Return type: np.ndarray
df_error¶
Error degrees of freedom
Returns: Degrees of freedom for the error Return type: int
df_model¶
Model degrees of freedom
Returns: Degrees of freedom for the model Return type: int
f_p_value¶
p-value of the f-statistic
Returns: p-value of the F-statistic of beta coefficients using scipy Return type: float
f_stat¶
The F-statistic of the regression
Returns: F-statistic of beta coefficients using regressors.stats Return type: float
mse¶
Mean squared error of the linear regression
Returns: A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target. Return type: float, nd.array
prob_plot¶
Calculate quantiles for a probability plot
Returns: Data for generating a probability plot. Keys include: ‘x’, ‘y’, ‘y_intercept’, ‘slope’, ‘x_trend’, ‘y_trend’ Return type: dict
r_sq¶
R^2 (coefficient of determination) regression score function.
Returns: The R^2 score Return type: float
residuals¶
Residuals of the prediction and sample data
Returns: y - predictions Return type: np.ndarray
slope¶
The slope of the linear regression
Returns: Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features. Return type: np.ndarray
y_intercept¶
The y-intercept of the linear regression
Returns: Independent term in the linear model. Return type: float
class dvhastats.stats.PCA(X, var_names=None, n_components=0.95, transform=True, **kwargs)[source]¶
Bases: sklearn.decomposition._pca.PCA
Principal Component Analysis with sklearn.decomposition.PCA
Parameters: - X (np.ndarray) – Training data (2-D), where n_samples is the number of samples and n_features is the number of features. shape (n_samples, n_features)
- var_names (list, optional) – Optionally provide names of the features
- n_components (int, float, None or str) – Number of components to keep. if n_components is not set all components are kept: n_components == min(n_samples, n_features) If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
- transform (bool) – Fit the model and apply the dimensionality reduction
- kwargs (any) – Provide any keyword arguments for sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
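The 0 < n_components < 1 selection rule quoted above can be sketched without sklearn, using an SVD of the centered data. This is a sketch of the rule, not sklearn's code path:

```python
import numpy as np

def n_components_for_variance(X, fraction=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `fraction`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    var_ratio = s ** 2 / np.sum(s ** 2)      # explained variance ratios
    return int(np.searchsorted(np.cumsum(var_ratio), fraction) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
k = n_components_for_variance(X, 0.95)
```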
component_labels¶
Get component names
Returns: Labels for plotting. (1st Comp, 2nd Comp, 3rd Comp, etc.) Return type: list
feature_map_data¶
Used for feature analysis heat map
Returns: Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance. Return type: np.ndarray Shape (n_components, n_features)
class dvhastats.stats.RiskAdjustedControlChart(X, y, std=3, ucl_limit=None, lcl_limit=None, x=None, saved_reg=None, var_names=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: dvhastats.stats.ControlChart
Calculate a risk-adjusted univariate Control Chart (with linear MVR)
Parameters: - X (array-like) – Independent data
- y (list, np.ndarray) – Input data (1-D)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- saved_reg (MultiVariableRegression, optional) – Optionally provide a previously calculated regression
- var_names (list, optional) – Optionally provide names of the variables
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
dvhastats.stats.avg_moving_range(arr, nan_policy='omit')[source]¶
Calculate the average moving range (over 2 consecutive points)
Parameters: - arr (array-like (1-D)) – Input array. Must be positive 1-dimensional.
- nan_policy (str, optional) – Value must be one of the following: {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when input contains nan. The following options are available (default is ‘omit’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: performs the calculations ignoring nan values
Returns: Average moving range. Returns NaN if arr is empty
Return type: np.ndarray, np.nan
dvhastats.stats.box_cox(arr, alpha=None, lmbda=None, const_policy='propagate')[source]¶
Apply a Box-Cox power transformation
Parameters: - arr (np.ndarray) – Input array. Must be positive 1-dimensional.
- lmbda (None, scalar, optional) – If lmbda is not None, do the transformation for that value. If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- alpha (None, float, optional) – If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- const_policy (str) – {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘propagate’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: remove NaN data
Returns: box_cox – Box-Cox power transformed array
Return type: np.ndarray
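The lmbda and alpha parameter descriptions above mirror scipy.stats.boxcox, which suggests (an assumption here) that this function wraps that routine. A minimal sketch of the two lmbda modes using scipy directly:

```python
import numpy as np
from scipy import stats as scipy_stats

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # must be positive, 1-D

# lmbda=None: scipy finds the lambda that maximizes the log-likelihood
transformed, lmbda = scipy_stats.boxcox(arr, lmbda=None)

# lmbda given: only the transformed array is returned
log_transformed = scipy_stats.boxcox(arr, lmbda=0.0)  # lambda=0 is log(x)
```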
dvhastats.stats.get_lin_reg_p_values(X, y, predictions, y_intercept, slope)[source]¶
Get p-values of a linear regression using sklearn, based on https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression
Parameters: - X (np.ndarray) – Independent data
- y (np.ndarray, list) – Dependent data
- predictions (np.ndarray, list) – Predictions using the linear regression. (Output from linear_model.LinearRegression.predict)
- y_intercept (float, np.ndarray) – The y-intercept of the linear regression
- slope (float, np.ndarray) – The slope of the linear regression
Returns: - p_value (np.ndarray) – p-value of the linear regression coefficients
- std_errs (np.ndarray) – standard errors of the linear regression coefficients
- t_value (np.ndarray) – t-values of the linear regression coefficients
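A sketch of the standard calculation behind these outputs (coefficient t-statistics from OLS standard errors), assuming homoscedastic errors and a design matrix augmented with an intercept column:

```python
import numpy as np
from scipy import stats as scipy_stats

def lin_reg_p_values(X, y, predictions, y_intercept, slope):
    """p-values, standard errors, and t-values for [intercept, slopes]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    X1 = np.hstack([np.ones((n, 1)), X])      # design matrix with intercept
    mse = np.sum((y - predictions) ** 2) / (n - k - 1)
    cov = mse * np.linalg.inv(X1.T @ X1)      # coefficient covariance matrix
    std_errs = np.sqrt(np.diag(cov))
    t_values = np.append(y_intercept, slope) / std_errs
    p_values = 2 * (1 - scipy_stats.t.cdf(np.abs(t_values), n - k - 1))
    return p_values, std_errs, t_values

# Fit a simple regression with numpy, then recover its p-values
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=30)
coef, *_ = np.linalg.lstsq(np.hstack([np.ones((30, 1)), X]), y, rcond=None)
pred = coef[0] + coef[1] * X[:, 0]
p_values, std_errs, t_values = lin_reg_p_values(X, y, pred, coef[0], coef[1])
```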
dvhastats.stats.get_ordinal(n)[source]¶
Convert number to its ordinal (e.g., 1 to 1st)
Parameters: n (int) – Number to be converted to ordinal Returns: the ordinal of n Return type: str
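A sketch of the usual English-ordinal rules (11th through 13th are exceptions to the 1st/2nd/3rd suffixes); an illustration, not the library's implementation:

```python
def get_ordinal(n):
    """Convert an integer to its ordinal string (e.g., 1 -> '1st')."""
    if 11 <= n % 100 <= 13:          # 11th, 12th, 13th are irregular
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"
```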
-
dvhastats.stats.
is_arr_constant
(arr)[source]¶ Determine if the input array is constant
Parameters: arr (array-like) – Input array or object that can be converted to an array Returns: True if all values the same (i.e., no variation) Return type: bool
-
dvhastats.stats.
is_nan_arr
(arr)[source]¶ Check if array has only NaN elements
Parameters: arr (np.ndarray) – Input array Returns: True if all elements are np.nan Return type: bool
-
dvhastats.stats.
moving_avg
(y, avg_len, x=None, weight=None)[source]¶ Calculate the moving (rolling) average of a set of data
Parameters: - y (array-like) – data (1-D) to be averaged
- avg_len (int) – Data is averaged over this many points (current value and avg_len - 1 prior points)
- x (np.ndarray, list, optional) – Optionally specify the x-axis values. Otherwise index+1 is used.
- weight (np.ndarray, list, optional) – A weighted moving average is calculated based on the provided weights. weight must be of same length as y. Weights of one are assumed by default.
Returns: - x (np.ndarray) – Resulting x-values for the moving average
- moving_avg (np.ndarray) – moving average values
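A sketch of an unweighted trailing moving average matching the documented behavior (current value plus avg_len - 1 prior points); assumed equivalent logic, not dvhastats' exact implementation:

```python
import numpy as np

# Trailing moving average over avg_len points via a cumulative sum.
def moving_avg(y, avg_len):
    y = np.asarray(y, dtype=float)
    cumsum = np.insert(np.cumsum(y), 0, 0.0)
    avg = (cumsum[avg_len:] - cumsum[:-avg_len]) / avg_len
    x = np.arange(avg_len, len(y) + 1)  # index+1 of each window's last point
    return x, avg

x, avg = moving_avg([1, 2, 3, 4, 5], 2)
```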
-
dvhastats.stats.
pearson_correlation_matrix
(X)[source]¶ Calculate a correlation matrix of Pearson-R values
Parameters: X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Returns: - r (np.ndarray) – Array (2-D) of Pearson-R correlations between the row indexed and column indexed variables
- p (np.ndarray) – Array (2-D) of p-values associated with r
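A pairwise Pearson-R sketch matching the documented return shapes; assumed equivalent logic (scipy.stats.pearsonr per variable pair), not dvhastats' exact implementation:

```python
import numpy as np
from scipy import stats

# Build square r and p matrices from pairwise scipy.stats.pearsonr calls.
def pearson_correlation_matrix(X):
    n_vars = X.shape[1]
    r = np.ones((n_vars, n_vars))   # diagonal: each variable vs. itself
    p = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r[i, j], p[i, j] = stats.pearsonr(X[:, i], X[:, j])
            r[j, i], p[j, i] = r[i, j], p[i, j]  # matrices are symmetric
    return r, p

r, p = pearson_correlation_matrix(np.random.default_rng(0).normal(size=(30, 3)))
```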
-
dvhastats.stats.
process_nan_policy
(arr, nan_policy)[source]¶ Process an input array according to the provided nan_policy
Parameters: - arr (array-like (1-D)) – Input array. Must be positive 1-dimensional.
- nan_policy (str) – Must be one of {'propagate', 'raise', 'omit'}; defines how to handle NaN values in the input (default is 'omit'). 'propagate': returns NaN; 'raise': throws an error; 'omit': performs the calculation ignoring NaN values.
Returns: Input array evaluated per nan_policy
Return type: np.ndarray, np.nan
-
dvhastats.stats.
remove_const_column
(arr)[source]¶ Remove all columns with zero variance
Parameters: arr (np.ndarray) – Input array (2-D) Returns: Input array with columns of a constant value removed Return type: np.ndarray
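A sketch of the documented behavior (drop 2-D columns with zero variance); assumed equivalent logic, not dvhastats' exact implementation:

```python
import numpy as np

# Keep only the columns whose values are not all identical to the first row.
def remove_const_column(arr):
    keep = ~np.all(arr == arr[0, :], axis=0)
    return arr[:, keep]

arr = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])  # second column is constant
reduced = remove_const_column(arr)
```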
-
dvhastats.stats.
remove_nan
(arr)[source]¶ Remove indices from 1-D array with values of np.nan
Parameters: arr (np.ndarray (1-D)) – Input array Returns: arr with NaN values deleted Return type: np.ndarray
-
dvhastats.stats.
spearman_correlation_matrix
(X, nan_policy='omit')[source]¶ Calculate a Spearman correlation matrix
Parameters: - X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
- nan_policy (str) – Must be one of {'propagate', 'raise', 'omit'}; defines how to handle NaN values in the input (default is 'omit'). 'propagate': returns NaN; 'raise': throws an error; 'omit': performs the calculation ignoring NaN values.
Returns: - correlation (float or np.ndarray (2-D square)) – Spearman correlation matrix, or a single correlation coefficient if only two variables are given. The correlation matrix is square, with length equal to the total number of variables (columns) in X.
- p-value (float or np.ndarray) – The two-sided p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated; has the same dimensions as the correlation matrix.
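The parameters and returns mirror scipy.stats.spearmanr, which this function likely wraps (an assumption); calling scipy directly shows the documented shapes:

```python
import numpy as np
from scipy import stats

# With 2-D input of shape (n_samples, n_features), spearmanr returns a
# square correlation matrix and a matching p-value matrix.
X = np.random.default_rng(1).normal(size=(20, 3))
correlation, p_value = stats.spearmanr(X, nan_policy="omit")
```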
plot module¶
Basic plotting class objects for DVHA-Stats based on matplotlib
-
class
dvhastats.plot.
BoxPlot
(data, title='Box and Whisker', xlabel='', ylabel='', xlabels=None, **kwargs)[source]¶ Bases:
dvhastats.plot.DistributionChart
Box and Whisker plotting class object
Parameters: - data (array-like) – Input array (1-D or 2-D)
- title (str, optional) – Set the plot title
- xlabel (str, optional) – Set the x-axis title
- xlabels (array-like, optional) – Set the xtick labels (e.g., variable names for each box plot)
- ylabel (str, optional) – Set the y-axis title
- kwargs (any, optional) – Any keyword argument may be set per matplotlib boxplot: https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.boxplot.html
-
class
dvhastats.plot.
Chart
(title=None, fig_init=True)[source]¶ Bases:
object
Base class for charts
Parameters: - title (str, optional) – Set the title suptitle
- fig_init (bool) – Automatically call pyplot.figure, store in Chart.figure
-
class
dvhastats.plot.
ControlChart
(y, out_of_control, center_line, lcl=None, ucl=None, title='Control Chart', xlabel='Observation', ylabel='Charting Variable', line_color='black', line_width=0.75, center_line_color='black', center_line_width=1.0, center_line_style='--', limit_line_color='red', limit_line_width=1.0, limit_line_style='--', **kwargs)[source]¶ Bases:
dvhastats.plot.Plot
ControlChart class object
Parameters: - y (np.ndarray, list) – Charting data
- out_of_control (np.ndarray, list) – The indices of y that are out-of-control
- center_line (float, np.ndarray) – The center line value (e.g., np.mean(y))
- lcl (float, optional) – The lower control limit (LCL). Line omitted if lcl is None.
- ucl (float, optional) – The upper control limit (UCL). Line omitted if ucl is None.
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- line_color (str, optional) – Specify the line color
- line_width (float, int) – Specify the line width
- kwargs (any) – Any additional keyword arguments applicable to the Plot class
-
add_center_line
(color=None, line_width=None, line_style=None)[source]¶ Add the center line to the plot
-
class
dvhastats.plot.
DistributionChart
(data, title='Chart', xlabel='Bins', ylabel='Counts', **kwargs)[source]¶ Bases:
dvhastats.plot.Chart
Distribution plotting class object (base class for Histogram and BoxPlot)
Parameters: - data (array-like) – Input array (1-D or 2-D)
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- kwargs (any) – Any keyword argument may be set per matplotlib histogram: https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html
-
class
dvhastats.plot.
HeatMap
(X, xlabels=None, ylabels=None, title=None, cmap='viridis', show=True)[source]¶ Bases:
dvhastats.plot.Chart
Create a heat map using matplotlib.pyplot.matshow
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- xlabels (list, optional) – Optionally set the variable names with a list of str
- ylabels (list, optional) – Optionally set the variable names with a list of str
- title (str, optional) – Set the title suptitle
- cmap (str) – matplotlib compatible color map
- show (bool) – Automatically show the figure
-
class
dvhastats.plot.
Histogram
(data, bins=10, title='Histogram', xlabel='Bins', ylabel='Counts', **kwargs)[source]¶ Bases:
dvhastats.plot.DistributionChart
Histogram plotting class object
Parameters: - data (array-like) – Input array (1-D)
- bins (int, sequence, str) – default: rcParams[“hist.bins”] (default: 10) If bins is an integer, it defines the number of equal-width bins in the range. If bins is a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin; in this case, bins may be unequally spaced. All but the last (righthand-most) bin is half-open. In other words, if bins is: [1, 2, 3, 4] then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4. If bins is a string, it is one of the binning strategies supported by numpy.histogram_bin_edges: ‘auto’, ‘fd’, ‘doane’, ‘scott’, ‘stone’, ‘rice’, ‘sturges’, or ‘sqrt’.
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- kwargs (any) – Any keyword argument may be set per matplotlib histogram: https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html
-
class
dvhastats.plot.
PCAFeatureMap
(X, features=None, cmap='viridis', show=True, title='PCA Feature Heat Map')[source]¶ Bases:
dvhastats.plot.HeatMap
Specialized Heat Map for PCA feature evaluation
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- features (list, optional) – Optionally set the feature names with a list of str
- title (str, optional) – Set the title suptitle
- cmap (str) – matplotlib compatible color map
- show (bool) – Automatically show the figure
-
class
dvhastats.plot.
Plot
(y, x=None, show=True, title='Chart', xlabel='Independent Variable', ylabel='Dependent Variable', line=True, line_color=None, line_width=1.0, line_style='-', scatter=True, scatter_color=None)[source]¶ Bases:
dvhastats.plot.Chart
Generic plotting class with matplotlib
Parameters: - y (np.ndarray, list) – The y data to be plotted (1-D only)
- x (np.ndarray, list, optional) – Optionally specify the x-axis values. Otherwise index+1 is used.
- show (bool) – Automatically plot the data if True
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- line (bool) – Plot the data as a line series
- line_color (str, optional) – Specify the line color
- line_width (float, int) – Specify the line width
- line_style (str) – Specify the line style
- scatter (bool) – Plot the data as a scatter plot (circles)
- scatter_color (str, optional) – Specify the scatter plot circle color
-
add_line
(y, x=None, line_color=None, line_width=None, line_style=None)[source]¶ Add another line with the provided data
Parameters: - y (np.ndarray, list) – The y data to be plotted (1-D only)
- x (np.ndarray, list, optional) – Optionally specify the x-axis values. Otherwise index+1 is used.
- line_color (str, optional) – Specify the line color
- line_width (float, int) – Specify the line width
- line_style (str) – Specify the line style
utilities module¶
Common functions for the DVHA-Stats.
-
dvhastats.utilities.
apply_dtype
(value, dtype)[source]¶ Convert value with the provided data type
Parameters: - value (any) – Value to be converted
- dtype (function, None) – python reserved types, e.g., int, float, str, etc. However, dtype could be any callable that raises a ValueError on failure.
Returns: The return of dtype(value) or numpy.nan on ValueError
Return type: any
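A sketch of the documented behavior (return dtype(value), or numpy.nan if the conversion raises a ValueError); assumed equivalent logic:

```python
import numpy as np

# Convert value with the provided callable; fall back to NaN on ValueError.
def apply_dtype(value, dtype):
    if dtype is None:
        return value
    try:
        return dtype(value)
    except ValueError:
        return np.nan
```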
-
dvhastats.utilities.
csv_to_dict
(csv_file_path, delimiter=',', dtype=None, header_row=True)[source]¶ Read in a csv file, return data as a dictionary
Parameters: - csv_file_path (str) – File path to the CSV file to be processed.
- delimiter (str) – Specify the delimiter used in the csv file (default = ‘,’)
- dtype (callable, type, optional) – Optionally force values to a type (e.g., float, int, str, etc.).
- header_row (bool, optional) – If True, the first row is interpreted as column keys, otherwise row indices will be used
Returns: CSV data as a dict, using the first row values as keys
Return type: dict
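A sketch of the documented parsing: the header row becomes dict keys and the remaining rows become column-wise value lists. For brevity this hypothetical variant reads CSV text from a string rather than a file path:

```python
import csv
import io

# Parse CSV text into {column_name: [values...]} (illustrative only;
# csv_text_to_dict is a hypothetical in-memory stand-in for csv_to_dict).
def csv_text_to_dict(csv_text, delimiter=","):
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    keys, data_rows = rows[0], rows[1:]
    return {k: [row[i] for row in data_rows] for i, k in enumerate(keys)}

data = csv_text_to_dict("a,b\n1,2\n3,4")
```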
-
dvhastats.utilities.
dict_to_array
(data, key_order=None)[source]¶ Convert a dict of data to a numpy array
Parameters: - data (dict) – Dictionary of data to be converted to np.array.
- key_order (None, list of str) – Optionally the order of columns
Returns: A dictionary with keys of ‘data’ and ‘columns’, pointing to a numpy array and list of str, respectively
Return type: dict
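A sketch matching the documented return (a dict with 'data' as a numpy array and 'columns' as a list of str); assumed equivalent logic, not dvhastats' exact code:

```python
import numpy as np

# Stack dict values into columns of a 2-D array, preserving key order.
def dict_to_array(data, key_order=None):
    keys = key_order if key_order is not None else list(data.keys())
    arr = np.array([data[k] for k in keys], dtype=float).T
    return {"data": arr, "columns": keys}

out = dict_to_array({"a": [1, 2], "b": [3, 4]})
```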
-
dvhastats.utilities.
get_sorted_indices
(list_data)[source]¶ Get original indices of a list after sorting
Parameters: list_data (list) – Any python sortable list Returns: list_data indices of sorted(list_data) Return type: list
-
dvhastats.utilities.
import_data
(data, var_names=None)[source]¶ Generalized data importer for np.ndarray, dict, and csv file
Parameters: - data (numpy.array, dict, str) – Input data (2-D) with N rows of observations and p columns of variables. The CSV file must have a header row for column names.
- var_names (list of str, optional) – If data is a numpy array, optionally provide the column names.
Returns: A tuple: data as an array and variable names as a list
Return type: np.ndarray, list
-
dvhastats.utilities.
is_numeric
(val)[source]¶ Check if value is numeric (float or int)
Parameters: val (any) – Any value Returns: Returns true if float(val) doesn’t raise a ValueError Return type: bool
-
dvhastats.utilities.
sort_2d_array
(arr, index, mode='col')[source]¶ Sort a 2-D numpy array
Parameters: - arr (np.ndarray) – Input 2-D array to be sorted
- index (int, list) – Index of column or row to sort arr. If list, will sort by each index in the order provided.
- mode (str) – Either ‘col’ or ‘row’
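A sketch of sorting by a single column index in 'col' mode (assumed equivalent behavior; the real function also accepts a list of indices and a 'row' mode, handled here by transposing):

```python
import numpy as np

# Sort rows of a 2-D array by the values in one column ('col' mode),
# or columns by the values in one row ('row' mode).
def sort_2d_array(arr, index, mode="col"):
    a = arr if mode == "col" else arr.T
    a = a[np.argsort(a[:, index])]
    return a if mode == "col" else a.T

out = sort_2d_array(np.array([[3, 0], [1, 1], [2, 2]]), 0)
```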
-
dvhastats.utilities.
str_arr_to_date_arr
(arr, date_parser_kwargs=None, force=False)[source]¶ Convert an array of datetime strings to a list of datetime objects
Parameters: - arr (array-like) – Array of datetime strings compatible with dateutil.parser.parse
- date_parser_kwargs (dict, optional) – Keyword arguments to be passed into dateutil.parser.parse
- force (bool) – If True, values that fail to parse are returned unchanged. If False, dateutil.parser.parse's error is raised on failure.
Returns: list of datetime objects
Return type: list
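A portable sketch using the stdlib parser instead of dateutil.parser.parse (an assumption made so the example has no third-party dependency):

```python
from datetime import datetime

# Parse ISO-format date strings; on failure, either keep the original
# value (force=True) or re-raise the parser's error (force=False).
def str_arr_to_date_arr(arr, force=False):
    dates = []
    for s in arr:
        try:
            dates.append(datetime.fromisoformat(s))
        except ValueError:
            if force:
                dates.append(s)  # keep the original value on failure
            else:
                raise
    return dates

dates = str_arr_to_date_arr(["2020-01-01", "2020-02-15"])
```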
-
dvhastats.utilities.
widen_data
(data_dict, uid_columns, x_data_cols, y_data_col, date_col=None, sort_by_date=True, remove_partial_columns=False, multi_val_policy='first', dtype=None, date_parser_kwargs=None)[source]¶ Convert a narrow data dictionary into wide format (i.e., from one row per dependent value to one row per observation)
Parameters: - data_dict (dict) – Data to be converted. The length of each array must be uniform.
- uid_columns (list) – Keys of data_dict used to create an observation uid
- x_data_cols (list) – Keys of columns representing independent data
- y_data_col (int, str) – Key of data_dict representing dependent data
- date_col (int, str, optional) – Key of date column
- sort_by_date (bool, optional) – Sort output by date (date_col required)
- remove_partial_columns (bool, optional) – If true, any columns that have a blank row will be removed
- multi_val_policy (str) – Either ‘first’, ‘last’, ‘min’, ‘max’. If multiple values are found for a particular combination of x_data_cols, one value will be selected based on this policy.
- dtype (function) – python reserved types, e.g., int, float, str, etc. However, dtype could be any callable that raises a ValueError on failure.
- date_parser_kwargs (dict, optional) – Keyword arguments to be passed into dateutil.parser.parse
Returns: data_dict reformatted to one row per UID
Return type: dict