TITLE:
An Empirical Study of Downstream Analysis Effects of Model Pre-Processing Choices
AUTHORS:
Jessica M. Rudd, Herman “Gene” Ray
KEYWORDS:
Empirical Analysis, Bias-Variance Decomposition, Mean Squared Error, Downstream Analysis Effects, Empirical Risk
JOURNAL NAME:
Open Journal of Statistics, Vol.10 No.5, October 27, 2020
ABSTRACT: This study uses an empirical analysis to quantify the downstream analysis effects of data pre-processing
choices. Bootstrap data simulation is used to measure the bias-variance
decomposition of an empirical risk function, mean squared error (MSE). Results
of the risk function decomposition are used to measure the effects of model
development choices on model bias,
variance, and irreducible error. Measurements of bias and variance are then
applied as diagnostic procedures for model pre-processing and development. The
best-performing combinations of model, normalization method, and data structure were identified to
illustrate the downstream analysis effects of these model development choices. In addition,
results found from simulations were verified and expanded to include additional
data characteristics (imbalanced, sparse) by testing on benchmark datasets
available from the UCI Machine Learning Repository. Normalization results on
benchmark data were consistent with those found using simulations, while also
illustrating that more complex and/or non-linear models provide better
performance on datasets with additional complexities. Finally, applying the
findings from simulation experiments to previously tested applications led to
equivalent or improved results with less model development overhead and processing
time.
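
The bootstrap decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (fit_line, bootstrap_bias_variance), the quadratic ground truth, the noise level, and the choice of a straight-line model are all assumptions made for illustration. For each test point, the mean squared deviation of the bootstrap predictions from the known true function splits into squared bias plus variance; when error is measured against noisy observations instead, the noise variance adds the irreducible term.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def bootstrap_bias_variance(train, test_xs, true_f, n_boot=200, seed=0):
    """Estimate squared bias, variance, and MSE (against true_f) of a
    straight-line model by refitting on bootstrap resamples of `train`.

    Hypothetical helper for illustration. `train` is a list of (x, y)
    pairs and `true_f` is the noise-free function, which is known only
    in a simulation setting.
    """
    rng = random.Random(seed)
    preds = [[] for _ in test_xs]  # one list of predictions per test point
    for _ in range(n_boot):
        # Resample the training set with replacement and refit the model.
        sample = [rng.choice(train) for _ in train]
        xs, ys = zip(*sample)
        a, b = fit_line(xs, ys)
        for i, x in enumerate(test_xs):
            preds[i].append(a + b * x)

    # Squared bias: (average prediction - truth)^2, averaged over test points.
    bias_sq = statistics.fmean(
        [(statistics.fmean(p) - true_f(x)) ** 2 for p, x in zip(preds, test_xs)]
    )
    # Variance: spread of predictions across resamples, averaged over test points.
    variance = statistics.fmean([statistics.pvariance(p) for p in preds])
    # MSE against the noise-free truth; equals bias_sq + variance.
    mse = statistics.fmean(
        [statistics.fmean([(v - true_f(x)) ** 2 for v in p])
         for p, x in zip(preds, test_xs)]
    )
    return bias_sq, variance, mse


if __name__ == "__main__":
    # Assumed simulation: quadratic truth with Gaussian noise (sigma = 0.1),
    # deliberately fitted with a straight line so that bias is visible.
    noise_rng = random.Random(42)
    train = [(x / 10, (x / 10) ** 2 + noise_rng.gauss(0, 0.1)) for x in range(50)]
    bias_sq, variance, mse = bootstrap_bias_variance(
        train, [0.5, 2.5, 4.5], lambda x: x * x
    )
    print(f"bias^2={bias_sq:.4f} variance={variance:.4f} mse={mse:.4f}")
    # Against noisy targets, expected error would add sigma^2 = 0.01 on top.
```

Repeating such a run for each candidate model and normalization choice, then comparing the bias and variance terms, is one way the diagnostic use described above could be approached.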