Usage

To use dvha-stats in a project, access statistical data through the dvhastats.ui.DVHAStats class.

Getting Started

Before attempting the examples below, run these lines first:

>>> from dvhastats.ui import DVHAStats
>>> s = DVHAStats("your_data.csv")  # use s = DVHAStats() for test data

This assumes that your csv is formatted such that it contains one row per observation (i.e., wide format). If your csv contains multivariate data with one row per dependent value (i.e., narrow format), you can use dvhastats.utilities.widen_data(). See Reformatting CSV for an example.

Basic Plotting

>>> s.var_names
['V1', 'V2', 'V3', 'V4', 'V5', 'V6']

>>> s.get_data_by_var_name('V1')
array([56.5, 48.1, 48.3, 65.1, 47.1, 49.9, 49.5, 48.9, 35.5, 44.5, 40.3,
       43.5, 43.7, 47.5, 39.9, 42.9, 37.9, 48.7, 41.3, 47.1, 35.9, 46.5,
       45.1, 24.3, 43.5, 45.1, 46.3, 41.1, 35.5, 41.1, 37.3, 42.1, 47.1,
       46.5, 43.3, 45.9, 39.5, 50.9, 44.1, 40.1, 45.7, 20.3, 46.1, 43.7,
       43.9, 36.5, 45.9, 48.9, 44.7, 38.1,  6.1,  5.5, 45.1, 46.5, 48.9,
       48.1, 45.7, 57.1, 35.1, 46.5, 29.5, 41.5, 53.3, 45.3, 41.9, 45.9,
       43.1, 43.9, 46.1])

>>> s.show('V1')  # or s.show(0), can provide index or var_name

Histogram

Calculation with numpy.

>>> h = s.histogram('V1')
>>> hist, center = h.hist_data
>>> hist
array([ 2,  0,  0,  0,  0,  1,  1,  0,  1,  0,  5,  4,  9, 15, 17, 10,  1,
        1,  1,  0,  1])
>>> center
array([ 6.91904762,  9.75714286, 12.5952381 , 15.43333333, 18.27142857,
       21.10952381, 23.94761905, 26.78571429, 29.62380952, 32.46190476,
       35.3       , 38.13809524, 40.97619048, 43.81428571, 46.65238095,
       49.49047619, 52.32857143, 55.16666667, 58.0047619 , 60.84285714,
       63.68095238])
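The pairing of counts with bin centers can be sketched in plain Python. This is only an illustration of equal-width binning, not the library's code: the helper name `histogram`, the sample values (the first eleven V1 values), and the choice of five bins are assumptions made for the example.

```python
def histogram(values, bins):
    """Count values in equal-width bins and return (counts, bin centers)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp the index.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # A bin center is the midpoint between its two edges.
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return counts, centers


sample = [56.5, 48.1, 48.3, 65.1, 47.1, 49.9, 49.5, 48.9, 35.5, 44.5, 40.3]
counts, centers = histogram(sample, bins=5)
print(counts)
print([round(c, 2) for c in centers])
```

The counts always sum to the number of observations, which is a quick sanity check on any histogram output.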

Calculation with matplotlib.

>>> s.show(0, plot_type="hist")  # histogram recalculated using matplotlib

Box & Whisker Plot

Calculation with matplotlib.

>>> s.show(0, plot_type="box")

>>> s.show(plot_type="box")

Pearson-R Correlation Matrix

Calculation with scipy.

>>> pearson_mat = s.correlation_matrix()
>>> pearson_mat.corr  # correlation array
array([[1.        , 0.93160407, 0.72199862, 0.56239953, 0.51856243, 0.49619153],
       [0.93160407, 1.        , 0.86121347, 0.66329274, 0.5737434 , 0.51111648],
       [0.72199862, 0.86121347, 1.        , 0.88436716, 0.7521324 ,  0.63030588],
       [0.56239953, 0.66329274, 0.88436716, 1.        , 0.90411476, 0.76986654],
       [0.51856243, 0.5737434 , 0.7521324 , 0.90411476, 1.        , 0.9464186 ],
       [0.49619153, 0.51111648, 0.63030588, 0.76986654, 0.9464186 , 1.        ]])
>>> pearson_mat.p  # p-values
array([[0.00000000e+00, 3.70567507e-31, 2.54573222e-12, 4.92807604e-07, 5.01004755e-06, 1.45230750e-05],
       [3.70567507e-31, 0.00000000e+00, 2.27411745e-21, 5.28815300e-10, 2.55750429e-07, 7.19979746e-06],
       [2.54573222e-12, 2.27411745e-21, 0.00000000e+00, 7.41613930e-24, 9.37849945e-14, 6.49207976e-09],
       [4.92807604e-07, 5.28815300e-10, 7.41613930e-24, 0.00000000e+00, 1.94118606e-26, 1.06898267e-14],
       [5.01004755e-06, 2.55750429e-07, 9.37849945e-14, 1.94118606e-26, 0.00000000e+00, 1.32389842e-34],
       [1.45230750e-05, 7.19979746e-06, 6.49207976e-09, 1.06898267e-14, 1.32389842e-34, 0.00000000e+00]])
>>> pearson_mat.show()

Spearman Correlation Matrix

Calculation with scipy.

>>> spearman_mat = s.correlation_matrix("Spearman")
>>> spearman_mat.show()
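To make the difference between the two correlation types concrete, here is a minimal plain-Python sketch. It is not how dvha-stats or scipy compute the matrices, and p-values are omitted. The function names and the sample data are assumptions for illustration. Spearman's coefficient is Pearson's coefficient applied to the ranks of the data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic in x, but not linear
print(round(pearson_r(x, y), 4))
print(round(spearman_rho(x, y), 4))
```

Because `y` increases monotonically with `x`, the rank-based coefficient is 1 while the Pearson coefficient is slightly below 1.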

Univariate Control Chart

>>> ucc = s.univariate_control_charts()
>>> ucc['V1']
center_line: 42.845
control_limits: 22.210, 63.480
out_of_control: [ 3 41 50 51]

>>> ucc['V1'].show()  # or ucc[0].show(), can provide index or var_name
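The values printed for V1 are consistent with a standard individuals (I) chart: the center line is the mean, and the limits sit at the mean plus or minus 3 times the average moving range divided by 1.128. The source does not state the formula, so treat the sketch below as an assumption that happens to reproduce the printed numbers for this data, not as the library's implementation. The function name `individuals_chart` is hypothetical.

```python
# V1 values copied from the s.get_data_by_var_name('V1') output above.
v1 = [
    56.5, 48.1, 48.3, 65.1, 47.1, 49.9, 49.5, 48.9, 35.5, 44.5, 40.3,
    43.5, 43.7, 47.5, 39.9, 42.9, 37.9, 48.7, 41.3, 47.1, 35.9, 46.5,
    45.1, 24.3, 43.5, 45.1, 46.3, 41.1, 35.5, 41.1, 37.3, 42.1, 47.1,
    46.5, 43.3, 45.9, 39.5, 50.9, 44.1, 40.1, 45.7, 20.3, 46.1, 43.7,
    43.9, 36.5, 45.9, 48.9, 44.7, 38.1, 6.1, 5.5, 45.1, 46.5, 48.9,
    48.1, 45.7, 57.1, 35.1, 46.5, 29.5, 41.5, 53.3, 45.3, 41.9, 45.9,
    43.1, 43.9, 46.1,
]

D2 = 1.128  # standard control-chart constant for moving ranges of size 2


def individuals_chart(values):
    """Return (center, lower limit, upper limit, out-of-control indices)."""
    center = sum(values) / len(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    half_width = 3 * mr_bar / D2
    lcl, ucl = center - half_width, center + half_width
    out = [i for i, v in enumerate(values) if v < lcl or v > ucl]
    return center, lcl, ucl, out


center, lcl, ucl, out_of_control = individuals_chart(v1)
print(f"center_line: {center:.3f}")
print(f"control_limits: {lcl:.3f}, {ucl:.3f}")
print(f"out_of_control: {out_of_control}")
```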

Multivariate Control Chart

>>> ht2 = s.hotelling_t2()
>>> ht2
Q: [ 5.75062092  3.80141786  3.67243782 18.80124504  2.03849294 18.15447155
     4.54475048 10.40783971  3.60614333  4.03138994  6.45171623  4.60475303
     2.29185301 15.7891342   3.0102578   6.36058098  5.56477106  3.92950273
     1.70534379  2.14021007  7.3839626   1.16554558  7.89636669 20.13613585
     3.76034723  0.93179106  2.05542886  2.65257506  1.31049764  1.59880892
     2.13839258  3.33331329  4.01060102  2.71837612 10.0744586   4.50776545
     1.87955428  7.13423455  4.1773818   3.70446025  3.49570988 11.52822658
     5.874624    2.34515306  2.71884639  2.58457841  3.2591779   4.69554484
     9.1358149   2.64106059 21.21960037 22.6229493   1.55545875  2.29606726
     3.96926714  2.69041382  1.47639788 17.83532339  4.03627833  1.78953536
    15.7485067   1.56110637  2.53753085  2.04243193  6.20630748 14.39527077
     9.88243129  3.70056854  4.92888799]
center_line: 5.375
control_limits: 0, 13.555
out_of_control: [ 3  5 13 23 50 51 57 60 65]

>>> ht2.show()
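For intuition about the Q values, the textbook Hotelling T² statistic for a single observation is the squared distance from the mean vector, scaled by the inverse covariance matrix. The sketch below computes it for two variables only, using the closed-form 2x2 inverse. It assumes the ordinary sample covariance and does not compute control limits. The source does not say which covariance estimator or limit formula dvha-stats uses, so the function name, the data, and the estimator are all assumptions.

```python
def hotelling_t2_2d(points):
    """Textbook T^2 value for each (x, y) observation, two variables only."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (n - 1 in the denominator).
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    q = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^T S^-1 d, with S^-1 = [[syy, -sxy], [-sxy, sxx]] / det
        q.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return q


# Hypothetical paired measurements, for illustration only.
points = [(56.5, 60.1), (48.1, 50.3), (48.3, 47.9),
          (65.1, 66.0), (47.1, 52.2), (49.9, 48.8)]
q_values = hotelling_t2_2d(points)
print([round(v, 3) for v in q_values])
```

With the sample covariance, the values always sum to (n - 1) times the number of variables, which gives a simple check on the arithmetic.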

Box-Cox Transformation (for non-normal data)

Calculation with scipy.

>>> bc = s.box_cox_by_index(0)
>>> bc
array([3185.2502073 , 2237.32503551, 2257.79294148, 4346.90639712,
       2136.50469314, 2425.19594298, 2382.73410297, 2319.80580872,
       1148.63472597, 1886.15962058, 1517.3226398 , 1794.37742725,
       1812.53465647, 2176.52932216, 1484.4619302 , 1740.50195077,
       1326.0093692 , 2299.03324672, 1601.1904051 , 2136.50469314,
       1177.23656545, 2077.22485894, 1942.42664844,  499.72380601,
       1794.37742725, 1942.42664844, 2057.66647538, 1584.22036354,
       1148.63472597, 1584.22036354, 1280.36568471, 1670.05579771,
       2136.50469314, 2077.22485894, 1776.31962594, 2018.85154453,
       1451.99231252, 2533.13894266, 1849.14775291, 1500.84335095,
       1999.59482773,  336.62160027, 2038.20873211, 1812.53465647,
       1830.79140224, 1220.85798302, 2018.85154453, 2319.80580872,
       1904.81531264, 1341.41740006,   23.64034973,   18.74313335,
       1942.42664844, 2077.22485894, 2319.80580872, 2237.32503551,
       1999.59482773, 3259.95515527, 1120.41519999, 2077.22485894,
        764.99904232, 1618.25887705, 2802.6765172 , 1961.38246534,
       1652.69148146, 2018.85154453, 1758.36116355, 1830.79140224,
       2038.20873211])
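The transformation itself is a one-line formula once the power parameter lambda is known. The sketch below applies the standard Box-Cox formula for a lambda supplied by hand. It does not estimate lambda, whereas scipy chooses it by maximum likelihood when none is given. The function name `box_cox` and the lambda values used here are assumptions for illustration.

```python
import math


def box_cox(values, lmbda):
    """Apply the Box-Cox transform with a fixed, user-supplied lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        # The limit of (x**lmbda - 1) / lmbda as lmbda approaches 0 is log(x).
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]


sample = [56.5, 48.1, 48.3, 65.1]  # first four V1 values
print([round(v, 3) for v in box_cox(sample, 0.5)])
print([round(v, 3) for v in box_cox(sample, 0)])
```

A lambda above 1 stretches large values more than small ones, and a lambda below 1 compresses them.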

Multivariate Control Chart (w/ non-normal data)

>>> ht2_bc = s.hotelling_t2(box_cox=True)
>>> ht2_bc.show()

Multi-Variable Linear Regression

Calculation with sklearn.

>>> mvr = s.linear_reg("V1")
>>> mvr

Multi-Variable Regression results/model
R²: 0.906
MSE: 7.860
f-stat: 121.632
f-stat p-value: 1.000
+-------+------------+-----------+---------+---------+
|       |       Coef | Std. Err. | t-value | p-value |
+-------+------------+-----------+---------+---------+
| y-int |  1.262E+01 | 1.326E+00 |   9.518 |   0.000 |
|   V2  |  1.107E+00 | 7.547E-02 |  14.664 |   0.000 |
|   V3  | -4.442E-01 | 1.135E-01 |  -3.914 |   0.000 |
|   V4  |  1.786E-01 | 1.340E-01 |   1.333 |   0.187 |
|   V5  | -1.789E-01 | 2.538E-01 |  -0.705 |   0.483 |
|   V6  |  2.833E-01 | 2.355E-01 |   1.203 |   0.233 |
+-------+------------+-----------+---------+---------+

>>> mvr.show()

>>> mvr.show("prob")

>>> mvr2 = s.linear_reg("V1", back_elim=True)
>>> mvr2

Multi-Variable Regression results/model
R²: 0.903
MSE: 8.096
f-stat: 202.431
f-stat p-value: 1.000
+-------+------------+-----------+---------+---------+
|       |       Coef | Std. Err. | t-value | p-value |
+-------+------------+-----------+---------+---------+
| y-int |  1.276E+01 | 1.321E+00 |   9.656 |   0.000 |
|   V2  |  1.070E+00 | 6.700E-02 |  15.967 |   0.000 |
|   V3  | -3.318E-01 | 6.852E-02 |  -4.843 |   0.000 |
|   V6  |  2.000E-01 | 7.542E-02 |   2.652 |   0.010 |
+-------+------------+-----------+---------+---------+
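As a rough picture of what a multi-variable fit produces, the sketch below solves ordinary least squares in plain Python through the normal equations, then reports coefficients, R², and mean squared error. This is a stand-in for illustration, not the sklearn-based calculation the library uses. It omits standard errors, p-values, and backward elimination. The function names and the data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return ([intercept, coefficients...], R^2, MSE) for a least-squares fit."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    predicted = [sum(c * v for c, v in zip(coef, r)) for r in design]
    ss_res = sum((t - p) ** 2 for t, p in zip(y, predicted))
    mean_y = sum(y) / len(y)
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return coef, 1 - ss_res / ss_tot, ss_res / len(y)


# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [-3, 2, -5, 0, -7]
coef, r_squared, mse = fit_ols(rows, y)
print([round(c, 6) for c in coef], round(r_squared, 6), round(mse, 6))
```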

Risk-Adjusted Control Chart

>>> ra_cc = s.risk_adjusted_control_chart("V1", back_elim=True)
>>> ra_cc.show()

Principal Component Analysis (PCA)

Calculation with sklearn.

>>> pca = s.pca()
>>> pca.feature_map_data
array([[ 0.35795147,  0.44569046,  0.51745294,  0.48745318,  0.34479542, 0.22131141],
       [-0.52601728, -0.51017406, -0.02139406,  0.4386136 ,  0.43258992, 0.28819198],
       [ 0.42660699,  0.01072412, -0.5661977 , -0.24404558,  0.39945093, 0.52743943]])
>>> pca.show()

Reformatting CSV

Below is an example of how to reformat a “narrow” csv (one row per dependent variable value) to a “wide” format (one row per observation). Please see dvhastats.utilities.widen_data for additional documentation.

Let’s assume the contents of your csv file look like:

patient,plan,field id,image type,date,DD(%),DTA(mm),Threshold(%),Gamma Pass Rate(%)
ANON1234,Plan_name,3,field,6/13/2019 7:27,3,2,10,99.94708217
ANON1234,Plan_name,3,field,6/13/2019 7:27,3,3,5,99.97934552
ANON1234,Plan_name,3,field,6/13/2019 7:27,3,3,10,99.97706894
ANON1234,Plan_name,3,field,6/13/2019 7:27,2,3,5,99.88772435
ANON1234,Plan_name,4,field,6/13/2019 7:27,3,2,10,99.99941874
ANON1234,Plan_name,4,field,6/13/2019 7:27,3,3,5,100
ANON1234,Plan_name,4,field,6/13/2019 7:27,3,3,10,100
ANON1234,Plan_name,4,field,6/13/2019 7:27,2,3,5,99.99533258

All rows here share the same patient, plan, and date. In this example, we want to evaluate how Gamma Pass Rate(%) varies as a function of DD(%), DTA(mm), and Threshold(%). In this context, we therefore want only two rows of data, one for each field id (i.e., 3 or 4).

>>> from dvhastats.utilities import csv_to_dict, widen_data
>>> data_dict = csv_to_dict("path_to_csv_file.csv")
>>> uid_columns = ['patient', 'plan', 'field id']  # only field id really needed in this case
>>> x_data_cols = ['DD(%)', 'DTA(mm)', 'Threshold(%)']
>>> y_data_col = 'Gamma Pass Rate(%)'
>>> wide_data = widen_data(data_dict, uid_columns, x_data_cols, y_data_col)
>>> wide_data
    {'uid': ['ANON1234Plan_name3', 'ANON1234Plan_name4'],
     '2/3/5': ['99.88772435', '99.99533258'],
     '3/2/10': ['99.94708217', '99.99941874'],
     '3/3/10': ['99.97706894', '100'],
     '3/3/5': ['99.97934552', '100']}
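The reshaping step can be sketched with only the standard library. This is not the source of dvhastats.utilities.widen_data, and the helper name `widen` is hypothetical. The sketch assumes the uid is the concatenation of the uid columns, each new column name joins the x values with "/", column names are sorted, and a missing combination becomes an empty string. The first three assumptions match the output shown above, while the last is a guess.

```python
import csv
import io

NARROW_CSV = """\
patient,plan,field id,image type,date,DD(%),DTA(mm),Threshold(%),Gamma Pass Rate(%)
ANON1234,Plan_name,3,field,6/13/2019 7:27,3,2,10,99.94708217
ANON1234,Plan_name,3,field,6/13/2019 7:27,3,3,5,99.97934552
ANON1234,Plan_name,3,field,6/13/2019 7:27,3,3,10,99.97706894
ANON1234,Plan_name,3,field,6/13/2019 7:27,2,3,5,99.88772435
ANON1234,Plan_name,4,field,6/13/2019 7:27,3,2,10,99.99941874
ANON1234,Plan_name,4,field,6/13/2019 7:27,3,3,5,100
ANON1234,Plan_name,4,field,6/13/2019 7:27,3,3,10,100
ANON1234,Plan_name,4,field,6/13/2019 7:27,2,3,5,99.99533258
"""


def widen(rows, uid_columns, x_data_cols, y_data_col):
    """Pivot narrow rows into one list per x-value combination."""
    uids = []   # unique ids, in order of first appearance
    cells = {}  # {"x1/x2/x3": {uid: y value}}
    for row in rows:
        uid = "".join(row[c] for c in uid_columns)
        key = "/".join(row[c] for c in x_data_cols)
        if uid not in uids:
            uids.append(uid)
        cells.setdefault(key, {})[uid] = row[y_data_col]
    wide = {"uid": uids}
    for key in sorted(cells):
        # Assumption: fill combinations that never occurred with "".
        wide[key] = [cells[key].get(uid, "") for uid in uids]
    return wide


rows = list(csv.DictReader(io.StringIO(NARROW_CSV)))
wide = widen(
    rows,
    uid_columns=["patient", "plan", "field id"],
    x_data_cols=["DD(%)", "DTA(mm)", "Threshold(%)"],
    y_data_col="Gamma Pass Rate(%)",
)
for name, values in wide.items():
    print(name, values)
```

Values stay as strings, as in the output above, so convert them to floats before any numerical analysis.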