ui module¶
DVHA-Stats classes for user interaction
class dvhastats.ui.ControlChartUI(y, std=3, ucl_limit=None, lcl_limit=None, var_name=None, x=None, plot_title=None)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.ControlChart
Univariate Control Chart
Parameters: - y (list, np.ndarray) – Input data (1-D)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- var_name (str, optional) – Name of the plotted variable
- x (list, np.ndarray, optional) – x-axis values
- plot_title (str, optional) – Override the plot title
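The limit calculation wrapped by this class can be sketched with plain numpy. This is a sketch, assuming sigma is estimated from the average two-point moving range (d2 = 1.128), consistent with the avg_moving_range and sigma properties described in the stats module:

```python
import numpy as np

def control_limits(y, std=3, ucl_limit=None, lcl_limit=None):
    """Sketch of individuals-chart control limits."""
    y = np.asarray(y, dtype=float)
    center_line = np.mean(y)
    avg_mr = np.mean(np.abs(np.diff(y)))  # average 2-point moving range
    sigma = avg_mr / 1.128                # d2 constant for subgroups of 2
    ucl = center_line + std * sigma
    lcl = center_line - std * sigma
    if ucl_limit is not None:             # optional caps, as in ControlChartUI
        ucl = min(ucl, ucl_limit)
    if lcl_limit is not None:
        lcl = max(lcl, lcl_limit)
    return center_line, lcl, ucl

cl, lcl, ucl = control_limits([10, 12, 11, 13, 10, 11, 12])
```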
class dvhastats.ui.CorrelationMatrixUI(X, var_names=None, corr_type='Pearson', cmap='coolwarm')[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.CorrelationMatrix
Correlation matrix UI object (Pearson-R or Spearman)
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- var_names (list, optional) – Optionally set the variable names with a list of str
- corr_type (str) – Either “Pearson” or “Spearman”
- cmap (str) – matplotlib compatible color map
show(absolute=False, corr=True)[source]¶
Create a heat map of the correlation matrix
Parameters: - absolute (bool) – Heat map will display the absolute values of the correlation matrix if True
- corr (bool) – Plot a p-value matrix if False, correlation matrix if True.
Returns: The number of the newly created matplotlib figure
Return type: int
class dvhastats.ui.DVHAStats(data=None, var_names=None, x_axis=None, avg_len=5, del_const_vars=False)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass
The main UI class object for DVHAStats
Parameters: - data (numpy.array, dict, str, None) – Input data (2-D) with N rows of observations and p columns of variables. If data is a str, it is treated as a path to a CSV file, which must have a header row of column names. Test data is loaded if None
- var_names (list of str, optional) – If data is a numpy array, optionally provide the column names.
- x_axis (numpy.array, list, optional) – Specify x_axis for plotting purposes. Default is based on row number in data
- avg_len (int) – When plotting raw data, a trend line will be plotted using this value as an averaging length. If N < avg_len + 1, no trend line will be plotted
- del_const_vars (bool) – Automatically delete any variables that have constant data. The names of these variables are stored in the excluded_vars attr. Default value is False.
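A sketch of the expected 2-D layout, using hypothetical variable names. The dict handling shown here is an illustration of the N-rows-by-p-columns shape described above, not dvhastats's own loader:

```python
import numpy as np

# Hypothetical data: two variables observed six times each
data_dict = {"var_a": [1.0, 2.0, 1.5, 2.5, 2.0, 3.0],
             "var_b": [10.0, 9.0, 11.0, 10.5, 9.5, 10.0]}

var_names = list(data_dict.keys())
# N rows of observations, p columns of variables
data = np.array([data_dict[name] for name in var_names]).T
```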
box_cox(alpha=None, lmbda=None, const_policy='propagate')[source]¶
Apply box_cox_by_index for all data
box_cox_by_index(index, alpha=None, lmbda=None, const_policy='propagate')[source]¶
Apply a Box-Cox power transformation to the data of one variable
Parameters: - index (int, str) – The index corresponding to the variable data to have a box-cox transformation applied. If index is a string, it will be assumed to be the var_name
- lmbda (None, scalar, optional) – If lmbda is not None, do the transformation for that value. If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- alpha (None, float, optional) – If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- const_policy (str) – {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘propagate’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: remove NaN data
Returns: Results from stats.box_cox
Return type: np.ndarray
constant_var_indices¶
Get a list of all constant variable indices
Returns: Indices of variables with no variation Return type: list
constant_vars¶
Get a list of all constant variables
Returns: Names of variables with no variation Return type: list
correlation_matrix(corr_type='Pearson')[source]¶
Get a Pearson-R or Spearman correlation matrix
Parameters: corr_type (str) – Either “Pearson” or “Spearman” Returns: A CorrelationMatrixUI class object Return type: CorrelationMatrixUI
del_var(var_name)[source]¶
Delete the variable corresponding to var_name from the data
Parameters: var_name (int, str) – The var_name to delete (or index of variable)
get_data_by_var_name(var_name)[source]¶
Get the single variable array based on var_name
Parameters: var_name (int, str) – The name (str) or index (int) of the variable of interest Returns: The column of data for the given var_name Return type: np.ndarray
get_index_by_var_name(var_name)[source]¶
Get the variable index by var_name
Parameters: var_name (int, str) – The name (str) or index (int) of the variable of interest Returns: The column index for the given var_name Return type: int
histogram(var_name, bins='auto', nan_policy='omit')[source]¶
Get a Histogram class object
Parameters: - var_name (str, int) – The name (str) or index (int) of the variable to plot
- bins (int, list, str, optional) – See https://numpy.org/doc/stable/reference/generated/numpy.histogram.html for details
- nan_policy (str) – Value must be one of the following: ‘propagate’, ‘raise’, ‘omit’ Defines how to handle when input contains nan. The following options are available (default is ‘omit’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: performs the calculations ignoring nan values
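The three nan_policy options can be sketched with numpy. This is a minimal sketch of the policy logic described above, not the library's implementation:

```python
import numpy as np

def histogram_with_nan_policy(y, bins="auto", nan_policy="omit"):
    """Sketch of a histogram calculation under the three nan policies."""
    y = np.asarray(y, dtype=float)
    if np.any(np.isnan(y)):
        if nan_policy == "raise":
            raise ValueError("input contains nan")
        if nan_policy == "propagate":
            return np.nan, np.nan
        y = y[~np.isnan(y)]  # 'omit': ignore nan values
    hist, bin_edges = np.histogram(y, bins=bins)
    return hist, bin_edges

hist, edges = histogram_with_nan_policy([1, 2, 2, 3, np.nan], bins=3)
```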
hotelling_t2(alpha=0.05, box_cox=False, box_cox_alpha=None, box_cox_lmbda=None, const_policy='omit')[source]¶
Calculate a Hotelling T^2 multivariate control chart
Parameters: - alpha (float) – Significance level used to determine the upper control limit (ucl)
- box_cox (bool, optional) – Set to true to perform a Box-Cox transformation on data prior to calculating the control chart statistics
- box_cox_alpha (float, optional) – If box_cox_alpha is not None, return the 100 * (1 - box_cox_alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- box_cox_lmbda (float, optional) – If box_cox_lmbda is not None, do the transformation for that value. If None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- const_policy (str) – {‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘raise’): ‘raise’: throws an error ‘omit’: exclude constant variables from calculation
Returns: HotellingT2UI class object
Return type: HotellingT2UI
is_constant(var_name)[source]¶
Determine if data by var_name is constant
Parameters: var_name (int, str) – The var_name to check (or index of variable) Returns: True if all values of var_name are the same (i.e., no variation) Return type: bool
linear_reg(y, y_var_name=None, reg_vars=None, saved_reg=None, back_elim=False, back_elim_p=0.05)[source]¶
Initialize a MultiVariableRegression class object
Parameters: - y (np.ndarray, list, str, int) – Dependent data based on DVHAStats.data. If y is str or int, then it is assumed to be the var_name or index of data to be set as the dependent variable
- y_var_name (int, str, optional) – Optionally provide name of the dependent variable. Automatically set if y is str or int
- reg_vars (list, optional) – Optionally specify variable names or indices of data to be used in the regression
- saved_reg (MultiVariableRegression, optional) – If supplied, predicted values (y-hat) will be calculated with DVHAStats.data and the regression from saved_reg. This is useful if testing a regression model on new data.
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
Returns: A LinearRegUI class object.
Return type: LinearRegUI
non_const_data¶
Return self.data excluding any constant variables
Returns: Data with constant variables removed. This does not alter the data property. Return type: np.ndarray
observations¶
Number of observations in data
Returns: Number of rows in data Return type: int
pca(n_components=0.95, transform=True, **kwargs)[source]¶
Return an sklearn PCA-like object, see PCA object for details
Parameters: - n_components (int, float, None or str) – Number of components to keep. If n_components is not set all components are kept: n_components == min(n_samples, n_features) If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
- transform (bool) – Fit the model and apply the dimensionality reduction
- kwargs (any) – Provide any keyword arguments for sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Returns: A principal component analysis object inherited from sklearn.decomposition.PCA
Return type: PCAUI
risk_adjusted_control_chart(y, std=3, ucl_limit=None, lcl_limit=None, saved_reg=None, y_name=None, reg_vars=None, back_elim=False, back_elim_p=0.05)[source]¶
Calculate control limits for a Risk-Adjusted Control Chart
Parameters: - y (list, np.ndarray) – 1-D Input data (dependent data)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- saved_reg (MultiVariableRegression, optional) – If supplied, predicted values (y-hat) will be calculated with DVHAStats.data and the regression from saved_reg. This is useful if testing a regression model on new data.
- y_name (int, str, optional) – Optionally provide name of the dependent variable. Automatically set if y is str or int
- reg_vars (list, optional) – Optionally specify variable names or indices of data to be used in the regression
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
show(var_name=None, plot_type='trend', **kwargs)[source]¶
Display a plot of var_name with matplotlib
Parameters: - var_name (str, int, None) – The name (str) or index (int) of the variable to plot. If None and plot_type=”box”, all variables will be plotted.
- plot_type (str) – Either “trend”, “hist”, “box”
- kwargs (any) – If plot_type is “hist”, pass any of the matplotlib hist key word arguments
Returns: The number of the newly created matplotlib figure
Return type: int
univariate_control_chart(var_name, std=3, ucl_limit=None, lcl_limit=None, box_cox=False, box_cox_alpha=None, box_cox_lmbda=None, const_policy='propagate')[source]¶
Calculate control limits for a standard univariate Control Chart
Parameters: - var_name (str, int) – The name (str) or index (int) of the variable to plot
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- box_cox (bool, optional) – Set to true to perform a Box-Cox transformation on data prior to calculating the control chart statistics
- box_cox_alpha (float, optional) – If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- box_cox_lmbda (float, optional) – If lmbda is not None, do the transformation for that value. If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- const_policy (str) – {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘propagate’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: remove NaN data
Returns: ControlChartUI class object
Return type: ControlChartUI
univariate_control_charts(**kwargs)[source]¶
Calculate control charts for all variables
Parameters: kwargs (any) – See univariate_control_chart for keyword parameters Returns: ControlChart class objects stored in a dictionary with var_names and indices as keys (can use var_name or index) Return type: dict
variable_count¶
Number of variables in data
Returns: Number of columns in data Return type: int
class dvhastats.ui.DVHAStatsBaseClass[source]¶
Bases: object
Base Class for DVHAStats objects and child objects
class dvhastats.ui.HotellingT2UI(data, alpha=0.05, plot_title=None)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.HotellingT2
Hotelling’s t-squared statistic for multivariate hypothesis testing
Parameters: - data (np.ndarray) – A 2-D array of data to perform multivariate analysis. (e.g., DVHAStats.data)
- alpha (float) – The significance level used to calculate the upper control limit (UCL)
- plot_title (str, optional) – Override the plot title
class dvhastats.ui.LinearRegUI(X, y, saved_reg=None, var_names=None, y_var_name=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.MultiVariableRegression
A MultiVariableRegression class UI object
Parameters: - X (array-like) – Independent data
- y (np.ndarray, list) – Dependent data based on DVHAStats.data
- saved_reg (MultiVariableRegression, optional) – If supplied, predicted values (y-hat) will be calculated with DVHAStats.data and the regression from saved_reg. This is useful if testing a regression model on new data.
- var_names (list, optional) – Optionally provide names of the independent variables
- y_var_name (int, str, optional) – Optionally provide name of the dependent variable
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
class dvhastats.ui.PCAUI(X, var_names=None, n_components=0.95, transform=True, **kwargs)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.PCA
Principal Component Analysis (PCA) UI object
Parameters: - X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
- var_names (list, optional) – Names of the independent variables in X
- n_components (int, float, None or str) – Number of components to keep. if n_components is not set all components are kept: n_components == min(n_samples, n_features) If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
- transform (bool) – Fit the model and apply the dimensionality reduction
- kwargs (any) – Provide any keyword arguments for sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
show(plot_type='feature_map', absolute=True)[source]¶
Create a heat map of PCA components
Parameters: - plot_type (str) – Select a plot type to display. Options include: feature_map.
- absolute (bool) – Heat map will display the absolute values in PCA components if True
Returns: The number of the newly created matplotlib figure
Return type: int
class dvhastats.ui.RiskAdjustedControlChartUI(X, y, std=3, ucl_limit=None, lcl_limit=None, x=None, y_name=None, var_names=None, saved_reg=None, plot_title=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: dvhastats.ui.DVHAStatsBaseClass, dvhastats.stats.RiskAdjustedControlChart
Risk-Adjusted Control Chart using a Multi-Variable Regression
Parameters: - X (array-like) – Input array (independent data)
- y (list, np.ndarray) – 1-D Input data (dependent data)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- x (list, np.ndarray, optional) – x-axis values
- plot_title (str, optional) – Override the plot title
- saved_reg (MultiVariableRegression, optional) – Optionally provide a previously calculated regression
- var_names (list, optional) – Optionally provide names of the variables
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
stats module¶
Statistical calculations and class objects
class dvhastats.stats.ControlChart(y, std=3, ucl_limit=None, lcl_limit=None, x=None)[source]¶
Bases: object
Calculate control limits for a standard univariate Control Chart
Parameters: - y (list, np.ndarray) – Input data (1-D)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- x (list, np.ndarray, optional) – x-axis values
avg_moving_range¶
Avg moving range based on 2 consecutive points
Returns: Average moving range. Returns NaN if arr is empty. Return type: np.ndarray, np.nan
center_line¶
Center line of charting data (i.e., mean value)
Returns: Mean value of y with np.mean() or np.nan if y is empty Return type: np.ndarray, np.nan
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for control chart visuals. Keys include ‘x’, ‘y’, ‘out_of_control’, ‘center_line’, ‘lcl’, ‘ucl’ Return type: dict
control_limits¶
Calculate the lower and upper control limits
Returns: - lcl (float) – Lower Control Limit (LCL)
- ucl (float) – Upper Control Limit (UCL)
out_of_control¶
Get the indices of out-of-control observations
Returns: An array of indices that are not between the lower and upper control limits Return type: np.ndarray
out_of_control_high¶
Get the indices of observations > ucl
Returns: An array of indices that are greater than the upper control limit Return type: np.ndarray
out_of_control_low¶
Get the indices of observations < lcl
Returns: An array of indices that are less than the lower control limit Return type: np.ndarray
sigma¶
UCL/LCL = center_line +/- sigma * std
Returns: sigma or np.nan if arr is empty Return type: np.ndarray, np.nan
class dvhastats.stats.CorrelationMatrix(X, corr_type='Pearson')[source]¶
Bases: object
Correlation matrix object (Pearson-R or Spearman)
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- corr_type (str) – Either “Pearson” or “Spearman”
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for correlation matrix visuals. Keys include ‘corr’, ‘p’, ‘norm’, ‘norm_p’ Return type: dict
normality¶
The normality and normality p-value of the input array
Returns: - statistic (np.ndarray) – Normality calculated with scipy.stats.normaltest
- p-value (np.ndarray) – A 2-sided chi squared probability for the hypothesis test.
class dvhastats.stats.Histogram(y, bins, nan_policy='omit')[source]¶
Bases: object
Basic histogram plot using matplotlib histogram calculation
Parameters: - y (array-like) – Input array.
- bins (int, list, str, optional) – If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines a monotonically increasing array of bin edges, including the rightmost edge, allowing for non-uniform bin widths. If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges. ‘auto’ - Maximum of the ‘sturges’ and ‘fd’ estimators. Provides good all around performance. ‘fd’ - (Freedman Diaconis Estimator) Robust (resilient to outliers) estimator that takes into account data variability and data size. ‘doane’ - An improved version of Sturges’ estimator that works better with non-normal datasets. ‘scott’ - Less robust estimator that takes into account data variability and data size. ‘stone’ - Estimator based on leave-one-out cross-validation estimate of the integrated squared error. Can be regarded as a generalization of Scott’s rule. ‘rice’ - Estimator does not take variability into account, only data size. Commonly overestimates number of bins required. ‘sturges’ - R’s default method, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets. ‘sqrt’ - Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
- nan_policy (str) – Value must be one of the following: ‘propagate’, ‘raise’, ‘omit’ Defines how to handle when input contains nan. The following options are available (default is ‘omit’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: performs the calculations ignoring nan values
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for Histogram visuals. Keys include ‘x’, ‘y’, ‘mean’, ‘median’, ‘std’, ‘normality’, ‘normality_p’ Return type: dict
hist_data¶
Get the histogram data
Returns: - hist (np.ndarray) – The values of the histogram
- center (np.ndarray) – The centers of the bins
mean¶
The mean value of the input array
Returns: Mean value of y with np.mean() Return type: np.ndarray
median¶
The median value of the input array
Returns: Median value of y with np.median() Return type: np.ndarray
normality¶
The normality and normality p-value of the input array
Returns: - statistic (float) – Normality calculated with scipy.stats.normaltest
- p-value (float) – A 2-sided chi squared probability for the hypothesis test.
std¶
The standard deviation of the input array
Returns: Standard deviation of y with np.std() Return type: np.ndarray
class dvhastats.stats.HotellingT2(data, alpha=0.05, const_policy='raise')[source]¶
Bases: object
Hotelling’s t-squared statistic for multivariate hypothesis testing
Parameters: - data (np.ndarray) – A 2-D array of data to perform multivariate analysis. (e.g., DVHAStats.data)
- alpha (float) – The significance level used to calculate the upper control limit (UCL)
- const_policy (str) – {‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘raise’): ‘raise’: throws an error ‘omit’: exclude constant variables from calculation
Q¶
Calculate Hotelling T^2 statistic (Q) from a 2-D numpy array
Returns: A numpy array of Hotelling T^2 (1-D of length N) Return type: np.ndarray
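The Q statistic can be sketched directly from its definition, assuming the mean vector and covariance matrix are the sample estimates from the data itself:

```python
import numpy as np

def hotelling_t2_q(data):
    """Q_i = (x_i - xbar)' S^-1 (x_i - xbar) for each observation x_i."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(data, rowvar=False))  # sample covariance
    diff = data - mean
    # Quadratic form per row: sum_j sum_k diff[i,j] * s_inv[j,k] * diff[i,k]
    return np.einsum("ij,jk,ik->i", diff, s_inv, diff)

rng = np.random.default_rng(42)
q = hotelling_t2_q(rng.normal(size=(50, 3)))
```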
center_line¶
Center line for the control chart
Returns: Median value of beta distribution. Return type: float
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for control chart visuals. Keys include ‘x’, ‘y’, ‘out_of_control’, ‘center_line’, ‘lcl’, ‘ucl’ Return type: dict
control_limits¶
Lower and Upper control limits
Returns: - lcl (float) – Lower Control Limit (LCL). This is fixed to 0 for Hotelling T2
- ucl (float) – Upper Control Limit (UCL)
get_control_limit(x)[source]¶
Calculate a Hotelling T^2 control limit using a beta distribution
Parameters: x (float) – Value where the beta function is evaluated Returns: The control limit for a beta distribution Return type: float
observations¶
Number of observations in data
Returns: Number of rows in data Return type: int
out_of_control¶
Indices of out-of-control observations
Returns: An array of indices that are greater than the upper control limit. (NOTE: Q is never negative) Return type: np.ndarray
ucl¶
Upper control limit
Returns: ucl – Upper Control Limit (UCL) Return type: float
variable_count¶
Number of variables in data
Returns: Number of columns in data Return type: int
class dvhastats.stats.MultiVariableRegression(X, y, saved_reg=None, var_names=None, y_var_name=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: object
Multi-variable regression using scikit-learn
Parameters: - X (array-like) – Independent data
- y (array-like) – Dependent data
- saved_reg (MultiVariableRegression, optional) – Optionally provide a previously calculated regression
- var_names (list, optional) – Optionally provide names of the variables
- y_var_name (int, str, optional) – Optionally provide name of the dependent variable
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
backward_elimination(p_value=0.05)[source]¶
Remove insignificant variables from regression
Parameters: p_value (float) – Iteratively remove the least significant variable until all variables have p-values less than p_value or only one variable remains.
chart_data¶
JSON compatible dict for chart generation
Returns: Data used for residual visuals. Keys include ‘x’, ‘y’, ‘pred’, ‘resid’, ‘coef’, ‘r_sq’, ‘mse’, ‘std_err’, ‘t_value’, ‘p_value’ Return type: dict
coef¶
Coefficients for the regression
Returns: An array of regression coefficients (i.e., y_intercept, 1st var slope, 2nd var slope, etc.) Return type: np.ndarray
df_error¶
Error degrees of freedom
Returns: Degrees of freedom for the error Return type: int
df_model¶
Model degrees of freedom
Returns: Degrees of freedom for the model Return type: int
f_p_value¶
p-value of the f-statistic
Returns: p-value of the F-statistic of beta coefficients using scipy Return type: float
f_stat¶
The F-statistic of the regression
Returns: F-statistic of beta coefficients using regressors.stats Return type: float
mse¶
Mean squared error of the linear regression
Returns: A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target. Return type: float, nd.array
prob_plot¶
Calculate quantiles for a probability plot
Returns: Data for generating a probability plot. Keys include: ‘x’, ‘y’, ‘y_intercept’, ‘slope’, ‘x_trend’, ‘y_trend’ Return type: dict
r_sq¶
R^2 (coefficient of determination) regression score function.
Returns: The R^2 score Return type: float
residuals¶
Residuals of the prediction and sample data
Returns: y - predictions Return type: np.ndarray
slope¶
The slope of the linear regression
Returns: Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features. Return type: np.ndarray
y_intercept¶
The y-intercept of the linear regression
Returns: Independent term in the linear model. Return type: float
class dvhastats.stats.PCA(X, var_names=None, n_components=0.95, transform=True, **kwargs)[source]¶
Bases: sklearn.decomposition._pca.PCA
Principal Component Analysis with sklearn.decomposition.PCA
Parameters: - X (np.ndarray) – Training data (2-D), where n_samples is the number of samples and n_features is the number of features. shape (n_samples, n_features)
- var_names (list, optional) – Optionally provide names of the features
- n_components (int, float, None or str) – Number of components to keep. if n_components is not set all components are kept: n_components == min(n_samples, n_features) If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
- transform (bool) – Fit the model and apply the dimensionality reduction
- kwargs (any) – Provide any keyword arguments for sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
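The 0 < n_components < 1 selection rule quoted above can be sketched without sklearn, using an SVD of the centered data. This is a sketch of the rule, not sklearn's code path:

```python
import numpy as np

def n_components_for_variance(X, fraction=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `fraction`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    var_ratio = s ** 2 / np.sum(s ** 2)      # explained variance ratios
    return int(np.searchsorted(np.cumsum(var_ratio), fraction) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
k = n_components_for_variance(X, 0.95)
```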
component_labels¶
Get component names
Returns: Labels for plotting. (1st Comp, 2nd Comp, 3rd Comp, etc.) Return type: list
feature_map_data¶
Used for feature analysis heat map
Returns: Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance. Return type: np.ndarray Shape (n_components, n_features)
class dvhastats.stats.RiskAdjustedControlChart(X, y, std=3, ucl_limit=None, lcl_limit=None, x=None, saved_reg=None, var_names=None, back_elim=False, back_elim_p=0.05)[source]¶
Bases: dvhastats.stats.ControlChart
Calculate a risk-adjusted univariate Control Chart (with linear MVR)
Parameters: - X (array-like) – Independent data
- y (list, np.ndarray) – Input data (1-D)
- std (int, float, optional) – Number of standard deviations used to calculate if a y-value is out-of-control.
- ucl_limit (float, optional) – Limit the upper control limit to this value
- lcl_limit (float, optional) – Limit the lower control limit to this value
- saved_reg (MultiVariableRegression, optional) – Optionally provide a previously calculated regression
- var_names (list, optional) – Optionally provide names of the variables
- back_elim (bool) – Automatically perform backward elimination if True
- back_elim_p (float) – p-value threshold for backward elimination
dvhastats.stats.avg_moving_range(arr, nan_policy='omit')[source]¶
Calculate the average moving range (over 2 consecutive points)
Parameters: - arr (array-like (1-D)) – Input array. Must be positive 1-dimensional.
- nan_policy (str, optional) – Value must be one of the following: {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when input contains nan. The following options are available (default is ‘omit’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: performs the calculations ignoring nan values
Returns: Average moving range. Returns NaN if arr is empty
Return type: np.ndarray, np.nan
dvhastats.stats.box_cox(arr, alpha=None, lmbda=None, const_policy='propagate')[source]¶
Apply a Box-Cox power transformation
Parameters: - arr (np.ndarray) – Input array. Must be positive 1-dimensional.
- lmbda (None, scalar, optional) – If lmbda is not None, do the transformation for that value. If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.
- alpha (None, float, optional) – If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.
- const_policy (str) – {‘propagate’, ‘raise’, ‘omit’} Defines how to handle when data is constant. The following options are available (default is ‘propagate’): ‘propagate’: returns nan ‘raise’: throws an error ‘omit’: remove NaN data
Returns: box_cox – Box-Cox power transformed array
Return type: np.ndarray
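The lmbda and alpha parameter descriptions above mirror scipy.stats.boxcox, which suggests (an assumption here) that this function wraps that routine. A minimal sketch of the two lmbda modes using scipy directly:

```python
import numpy as np
from scipy import stats as scipy_stats

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # must be positive, 1-D

# lmbda=None: scipy finds the lambda that maximizes the log-likelihood
transformed, lmbda = scipy_stats.boxcox(arr, lmbda=None)

# lmbda given: only the transformed array is returned
log_transformed = scipy_stats.boxcox(arr, lmbda=0.0)  # lambda=0 is log(x)
```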
dvhastats.stats.get_lin_reg_p_values(X, y, predictions, y_intercept, slope)[source]¶
Get p-values of a linear regression using sklearn, based on https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression
Parameters: - X (np.ndarray) – Independent data
- y (np.ndarray, list) – Dependent data
- predictions (np.ndarray, list) – Predictions using the linear regression. (Output from linear_model.LinearRegression.predict)
- y_intercept (float, np.ndarray) – The y-intercept of the linear regression
- slope (float, np.ndarray) – The slope of the linear regression
Returns: - p_value (np.ndarray) – p-value of the linear regression coefficients
- std_errs (np.ndarray) – standard errors of the linear regression coefficients
- t_value (np.ndarray) – t-values of the linear regression coefficients
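A sketch of the standard calculation behind these outputs (coefficient t-statistics from OLS standard errors), assuming homoscedastic errors and a design matrix augmented with an intercept column:

```python
import numpy as np
from scipy import stats as scipy_stats

def lin_reg_p_values(X, y, predictions, y_intercept, slope):
    """p-values, standard errors, and t-values for [intercept, slopes]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    X1 = np.hstack([np.ones((n, 1)), X])      # design matrix with intercept
    mse = np.sum((y - predictions) ** 2) / (n - k - 1)
    cov = mse * np.linalg.inv(X1.T @ X1)      # coefficient covariance matrix
    std_errs = np.sqrt(np.diag(cov))
    t_values = np.append(y_intercept, slope) / std_errs
    p_values = 2 * (1 - scipy_stats.t.cdf(np.abs(t_values), n - k - 1))
    return p_values, std_errs, t_values

# Fit a simple regression with numpy, then recover its p-values
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=30)
coef, *_ = np.linalg.lstsq(np.hstack([np.ones((30, 1)), X]), y, rcond=None)
pred = coef[0] + coef[1] * X[:, 0]
p_values, std_errs, t_values = lin_reg_p_values(X, y, pred, coef[0], coef[1])
```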
dvhastats.stats.get_ordinal(n)[source]¶
Convert number to its ordinal (e.g., 1 to 1st)
Parameters: n (int) – Number to be converted to ordinal Returns: the ordinal of n Return type: str
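A sketch of the usual English-ordinal rules (11th through 13th are exceptions to the 1st/2nd/3rd suffixes); an illustration, not the library's implementation:

```python
def get_ordinal(n):
    """Convert an integer to its ordinal string (e.g., 1 -> '1st')."""
    if 11 <= n % 100 <= 13:          # 11th, 12th, 13th are irregular
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"
```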
-
dvhastats.stats.
is_arr_constant
(arr)[source]¶ Determine if the input array is constant
Parameters: arr (array-like) – Input array or object that can be converted to an array Returns: True if all values the same (i.e., no variation) Return type: bool
-
dvhastats.stats.
is_nan_arr
(arr)[source]¶ Check if array has only NaN elements
Parameters: arr (np.ndarray) – Input array Returns: True if all elements are np.nan Return type: bool
-
dvhastats.stats.
moving_avg
(y, avg_len, x=None, weight=None)[source]¶ Calculate the moving (rolling) average of a set of data
Parameters: - y (array-like) – data (1-D) to be averaged
- avg_len (int) – Data is averaged over this many points (current value and avg_len - 1 prior points)
- x (np.ndarray, list, optional) – Optionally specify the x-axis values. Otherwise index+1 is used.
- weight (np.ndarray, list, optional) – A weighted moving average is calculated based on the provided weights. weight must be of same length as y. Weights of one are assumed by default.
Returns: - x (np.ndarray) – Resulting x-values for the moving average
- moving_avg (np.ndarray) – moving average values
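A sketch of an unweighted trailing moving average matching the documented behavior (current value plus avg_len - 1 prior points); assumed equivalent logic, not dvhastats' exact implementation:

```python
import numpy as np

# Trailing moving average over avg_len points via a cumulative sum.
def moving_avg(y, avg_len):
    y = np.asarray(y, dtype=float)
    cumsum = np.insert(np.cumsum(y), 0, 0.0)
    avg = (cumsum[avg_len:] - cumsum[:-avg_len]) / avg_len
    x = np.arange(avg_len, len(y) + 1)  # index+1 of each window's last point
    return x, avg

x, avg = moving_avg([1, 2, 3, 4, 5], 2)
```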
-
dvhastats.stats.
pearson_correlation_matrix
(X)[source]¶ Calculate a correlation matrix of Pearson-R values
Parameters: X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Returns: - r (np.ndarray) – Array (2-D) of Pearson-R correlations between the row indexed and column indexed variables
- p (np.ndarray) – Array (2-D) of p-values associated with r
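A pairwise Pearson-R sketch matching the documented return shapes; assumed equivalent logic (scipy.stats.pearsonr per variable pair), not dvhastats' exact implementation:

```python
import numpy as np
from scipy import stats

# Build square r and p matrices from pairwise scipy.stats.pearsonr calls.
def pearson_correlation_matrix(X):
    n_vars = X.shape[1]
    r = np.ones((n_vars, n_vars))   # diagonal: each variable vs. itself
    p = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r[i, j], p[i, j] = stats.pearsonr(X[:, i], X[:, j])
            r[j, i], p[j, i] = r[i, j], p[i, j]  # matrices are symmetric
    return r, p

r, p = pearson_correlation_matrix(np.random.default_rng(0).normal(size=(30, 3)))
```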
-
dvhastats.stats.
process_nan_policy
(arr, nan_policy)[source]¶ Process an input array according to the provided nan_policy
Parameters: - arr (array-like (1-D)) – Input array. Must be positive 1-dimensional.
- nan_policy (str) – Must be one of {'propagate', 'raise', 'omit'}; defines how to handle NaN values in the input (default is 'omit'). 'propagate': returns NaN; 'raise': throws an error; 'omit': performs the calculation ignoring NaN values.
Returns: Input array evaluated per nan_policy
Return type: np.ndarray, np.nan
-
dvhastats.stats.
remove_const_column
(arr)[source]¶ Remove all columns with zero variance
Parameters: arr (np.ndarray) – Input array (2-D) Returns: Input array with columns of a constant value removed Return type: np.ndarray
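A sketch of the documented behavior (drop 2-D columns with zero variance); assumed equivalent logic, not dvhastats' exact implementation:

```python
import numpy as np

# Keep only the columns whose values are not all identical to the first row.
def remove_const_column(arr):
    keep = ~np.all(arr == arr[0, :], axis=0)
    return arr[:, keep]

arr = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])  # second column is constant
reduced = remove_const_column(arr)
```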
-
dvhastats.stats.
remove_nan
(arr)[source]¶ Remove indices from 1-D array with values of np.nan
Parameters: arr (np.ndarray (1-D)) – Input array Returns: arr with NaN values deleted Return type: np.ndarray
-
dvhastats.stats.
spearman_correlation_matrix
(X, nan_policy='omit')[source]¶ Calculate a Spearman correlation matrix
Parameters: - X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
- nan_policy (str) – Must be one of {'propagate', 'raise', 'omit'}; defines how to handle NaN values in the input (default is 'omit'). 'propagate': returns NaN; 'raise': throws an error; 'omit': performs the calculation ignoring NaN values.
Returns: - correlation (float or np.ndarray (2-D square)) – Spearman correlation matrix, or a single correlation coefficient if only two variables are given. The correlation matrix is square, with length equal to the total number of variables (columns) in X.
- p-value (float or np.ndarray) – The two-sided p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated; has the same dimensions as the correlation matrix.
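The parameters and returns mirror scipy.stats.spearmanr, which this function likely wraps (an assumption); calling scipy directly shows the documented shapes:

```python
import numpy as np
from scipy import stats

# With 2-D input of shape (n_samples, n_features), spearmanr returns a
# square correlation matrix and a matching p-value matrix.
X = np.random.default_rng(1).normal(size=(20, 3))
correlation, p_value = stats.spearmanr(X, nan_policy="omit")
```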
plot module¶
Basic plotting class objects for DVHA-Stats based on matplotlib
-
class
dvhastats.plot.
BoxPlot
(data, title='Box and Whisker', xlabel='', ylabel='', xlabels=None, **kwargs)[source]¶ Bases:
dvhastats.plot.DistributionChart
Box and Whisker plotting class object
Parameters: - data (array-like) – Input array (1-D or 2-D)
- title (str, optional) – Set the plot title
- xlabel (str, optional) – Set the x-axis title
- xlabels (array-like, optional) – Set the xtick labels (e.g., variable names for each box plot)
- ylabel (str, optional) – Set the y-axis title
- kwargs (any, optional) – Any keyword argument may be set per matplotlib boxplot: https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.boxplot.html
-
class
dvhastats.plot.
Chart
(title=None, fig_init=True)[source]¶ Bases:
object
Base class for charts
Parameters: - title (str, optional) – Set the title suptitle
- fig_init (bool) – Automatically call pyplot.figure, store in Chart.figure
-
class
dvhastats.plot.
ControlChart
(y, out_of_control, center_line, lcl=None, ucl=None, title='Control Chart', xlabel='Observation', ylabel='Charting Variable', line_color='black', line_width=0.75, center_line_color='black', center_line_width=1.0, center_line_style='--', limit_line_color='red', limit_line_width=1.0, limit_line_style='--', **kwargs)[source]¶ Bases:
dvhastats.plot.Plot
ControlChart class object
Parameters: - y (np.ndarray, list) – Charting data
- out_of_control (np.ndarray, list) – The indices of y that are out-of-control
- center_line (float, np.ndarray) – The center line value (e.g., np.mean(y))
- lcl (float, optional) – The lower control limit (LCL). Line omitted if lcl is None.
- ucl (float, optional) – The upper control limit (UCL). Line omitted if ucl is None.
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- line_color (str, optional) – Specify the line color
- line_width (float, int) – Specify the line width
- kwargs (any) – Any additional keyword arguments applicable to the Plot class
-
add_center_line
(color=None, line_width=None, line_style=None)[source]¶ Add the center line to the plot
-
class
dvhastats.plot.
DistributionChart
(data, title='Chart', xlabel='Bins', ylabel='Counts', **kwargs)[source]¶ Bases:
dvhastats.plot.Chart
Distribution plotting class object (base class for Histogram and BoxPlot)
Parameters: - data (array-like) – Input array (1-D or 2-D)
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- kwargs (any) – Any keyword argument may be set per matplotlib histogram: https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html
-
class
dvhastats.plot.
HeatMap
(X, xlabels=None, ylabels=None, title=None, cmap='viridis', show=True)[source]¶ Bases:
dvhastats.plot.Chart
Create a heat map using matplotlib.pyplot.matshow
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- xlabels (list, optional) – Optionally set the variable names with a list of str
- ylabels (list, optional) – Optionally set the variable names with a list of str
- title (str, optional) – Set the title suptitle
- cmap (str) – matplotlib compatible color map
- show (bool) – Automatically show the figure
-
class
dvhastats.plot.
Histogram
(data, bins=10, title='Histogram', xlabel='Bins', ylabel='Counts', **kwargs)[source]¶ Bases:
dvhastats.plot.DistributionChart
Histogram plotting class object
Parameters: - data (array-like) – Input array (1-D)
- bins (int, sequence, str) – default: rcParams[“hist.bins”] (default: 10) If bins is an integer, it defines the number of equal-width bins in the range. If bins is a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin; in this case, bins may be unequally spaced. All but the last (righthand-most) bin is half-open. In other words, if bins is: [1, 2, 3, 4] then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4. If bins is a string, it is one of the binning strategies supported by numpy.histogram_bin_edges: ‘auto’, ‘fd’, ‘doane’, ‘scott’, ‘stone’, ‘rice’, ‘sturges’, or ‘sqrt’.
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- kwargs (any) – Any keyword argument may be set per matplotlib histogram: https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html
-
class
dvhastats.plot.
PCAFeatureMap
(X, features=None, cmap='viridis', show=True, title='PCA Feature Heat Map')[source]¶ Bases:
dvhastats.plot.HeatMap
Specialized Heat Map for PCA feature evaluation
Parameters: - X (np.ndarray) – Input data (2-D) with N rows of observations and p columns of variables.
- features (list, optional) – Optionally set the feature names with a list of str
- title (str, optional) – Set the title suptitle
- cmap (str) – matplotlib compatible color map
- show (bool) – Automatically show the figure
-
class
dvhastats.plot.
Plot
(y, x=None, show=True, title='Chart', xlabel='Independent Variable', ylabel='Dependent Variable', line=True, line_color=None, line_width=1.0, line_style='-', scatter=True, scatter_color=None)[source]¶ Bases:
dvhastats.plot.Chart
Generic plotting class with matplotlib
Parameters: - y (np.ndarray, list) – The y data to be plotted (1-D only)
- x (np.ndarray, list, optional) – Optionally specify the x-axis values. Otherwise index+1 is used.
- show (bool) – Automatically plot the data if True
- title (str) – Set the plot title
- xlabel (str) – Set the x-axis title
- ylabel (str) – Set the y-axis title
- line (bool) – Plot the data as a line series
- line_color (str, optional) – Specify the line color
- line_width (float, int) – Specify the line width
- line_style (str) – Specify the line style
- scatter (bool) – Plot the data as a scatter plot (circles)
- scatter_color (str, optional) – Specify the scatter plot circle color
-
add_line
(y, x=None, line_color=None, line_width=None, line_style=None)[source]¶ Add another line with the provided data
Parameters: - y (np.ndarray, list) – The y data to be plotted (1-D only)
- x (np.ndarray, list, optional) – Optionally specify the x-axis values. Otherwise index+1 is used.
- line_color (str, optional) – Specify the line color
- line_width (float, int) – Specify the line width
- line_style (str) – Specify the line style
utilities module¶
Common functions for the DVHA-Stats.
-
dvhastats.utilities.
apply_dtype
(value, dtype)[source]¶ Convert value with the provided data type
Parameters: - value (any) – Value to be converted
- dtype (function, None) – python reserved types, e.g., int, float, str, etc. However, dtype could be any callable that raises a ValueError on failure.
Returns: The return of dtype(value) or numpy.nan on ValueError
Return type: any
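A sketch of the documented behavior (return dtype(value), or numpy.nan if the conversion raises a ValueError); assumed equivalent logic:

```python
import numpy as np

# Convert value with the provided callable; fall back to NaN on ValueError.
def apply_dtype(value, dtype):
    if dtype is None:
        return value
    try:
        return dtype(value)
    except ValueError:
        return np.nan
```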
-
dvhastats.utilities.
csv_to_dict
(csv_file_path, delimiter=',', dtype=None, header_row=True)[source]¶ Read in a csv file, return data as a dictionary
Parameters: - csv_file_path (str) – File path to the CSV file to be processed.
- delimiter (str) – Specify the delimiter used in the csv file (default = ‘,’)
- dtype (callable, type, optional) – Optionally force values to a type (e.g., float, int, str, etc.).
- header_row (bool, optional) – If True, the first row is interpreted as column keys, otherwise row indices will be used
Returns: CSV data as a dict, using the first row values as keys
Return type: dict
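A sketch of the documented parsing: the header row becomes dict keys and the remaining rows become column-wise value lists. For brevity this hypothetical variant reads CSV text from a string rather than a file path:

```python
import csv
import io

# Parse CSV text into {column_name: [values...]} (illustrative only;
# csv_text_to_dict is a hypothetical in-memory stand-in for csv_to_dict).
def csv_text_to_dict(csv_text, delimiter=","):
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    keys, data_rows = rows[0], rows[1:]
    return {k: [row[i] for row in data_rows] for i, k in enumerate(keys)}

data = csv_text_to_dict("a,b\n1,2\n3,4")
```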
-
dvhastats.utilities.
dict_to_array
(data, key_order=None)[source]¶ Convert a dict of data to a numpy array
Parameters: - data (dict) – Dictionary of data to be converted to np.array.
- key_order (None, list of str) – Optionally the order of columns
Returns: A dictionary with keys of ‘data’ and ‘columns’, pointing to a numpy array and list of str, respectively
Return type: dict
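A sketch matching the documented return (a dict with 'data' as a numpy array and 'columns' as a list of str); assumed equivalent logic, not dvhastats' exact code:

```python
import numpy as np

# Stack dict values into columns of a 2-D array, preserving key order.
def dict_to_array(data, key_order=None):
    keys = key_order if key_order is not None else list(data.keys())
    arr = np.array([data[k] for k in keys], dtype=float).T
    return {"data": arr, "columns": keys}

out = dict_to_array({"a": [1, 2], "b": [3, 4]})
```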
-
dvhastats.utilities.
get_sorted_indices
(list_data)[source]¶ Get original indices of a list after sorting
Parameters: list_data (list) – Any python sortable list Returns: list_data indices of sorted(list_data) Return type: list
-
dvhastats.utilities.
import_data
(data, var_names=None)[source]¶ Generalized data importer for np.ndarray, dict, and csv file
Parameters: - data (numpy.array, dict, str) – Input data (2-D) with N rows of observations and p columns of variables. The CSV file must have a header row for column names.
- var_names (list of str, optional) – If data is a numpy array, optionally provide the column names.
Returns: A tuple: data as an array and variable names as a list
Return type: np.ndarray, list
-
dvhastats.utilities.
is_numeric
(val)[source]¶ Check if value is numeric (float or int)
Parameters: val (any) – Any value Returns: Returns true if float(val) doesn’t raise a ValueError Return type: bool
-
dvhastats.utilities.
sort_2d_array
(arr, index, mode='col')[source]¶ Sort a 2-D numpy array
Parameters: - arr (np.ndarray) – Input 2-D array to be sorted
- index (int, list) – Index of column or row to sort arr. If list, will sort by each index in the order provided.
- mode (str) – Either ‘col’ or ‘row’
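A sketch of sorting by a single column index in 'col' mode (assumed equivalent behavior; the real function also accepts a list of indices and a 'row' mode, handled here by transposing):

```python
import numpy as np

# Sort rows of a 2-D array by the values in one column ('col' mode),
# or columns by the values in one row ('row' mode).
def sort_2d_array(arr, index, mode="col"):
    a = arr if mode == "col" else arr.T
    a = a[np.argsort(a[:, index])]
    return a if mode == "col" else a.T

out = sort_2d_array(np.array([[3, 0], [1, 1], [2, 2]]), 0)
```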
-
dvhastats.utilities.
str_arr_to_date_arr
(arr, date_parser_kwargs=None, force=False)[source]¶ Convert an array of datetime strings to a list of datetime objects
Parameters: - arr (array-like) – Array of datetime strings compatible with dateutil.parser.parse
- date_parser_kwargs (dict, optional) – Keyword arguments to be passed into dateutil.parser.parse
- force (bool) – If True, values that fail to parse are returned unchanged. If False, dateutil.parser.parse's error is raised on failure.
Returns: list of datetime objects
Return type: list
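A portable sketch using the stdlib parser instead of dateutil.parser.parse (an assumption made so the example has no third-party dependency):

```python
from datetime import datetime

# Parse ISO-format date strings; on failure, either keep the original
# value (force=True) or re-raise the parser's error (force=False).
def str_arr_to_date_arr(arr, force=False):
    dates = []
    for s in arr:
        try:
            dates.append(datetime.fromisoformat(s))
        except ValueError:
            if force:
                dates.append(s)  # keep the original value on failure
            else:
                raise
    return dates

dates = str_arr_to_date_arr(["2020-01-01", "2020-02-15"])
```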
-
dvhastats.utilities.
widen_data
(data_dict, uid_columns, x_data_cols, y_data_col, date_col=None, sort_by_date=True, remove_partial_columns=False, multi_val_policy='first', dtype=None, date_parser_kwargs=None)[source]¶ Convert a narrow data dictionary into wide format (i.e., from one row per dependent value to one row per observation)
Parameters: - data_dict (dict) – Data to be converted. The length of each array must be uniform.
- uid_columns (list) – Keys of data_dict used to create an observation uid
- x_data_cols (list) – Keys of columns representing independent data
- y_data_col (int, str) – Key of data_dict representing dependent data
- date_col (int, str, optional) – Key of date column
- sort_by_date (bool, optional) – Sort output by date (date_col required)
- remove_partial_columns (bool, optional) – If true, any columns that have a blank row will be removed
- multi_val_policy (str) – Either ‘first’, ‘last’, ‘min’, ‘max’. If multiple values are found for a particular combination of x_data_cols, one value will be selected based on this policy.
- dtype (function) – python reserved types, e.g., int, float, str, etc. However, dtype could be any callable that raises a ValueError on failure.
- date_parser_kwargs (dict, optional) – Keyword arguments to be passed into dateutil.parser.parse
Returns: data_dict reformatted to one row per UID
Return type: dict