ABSTRACT
Challenges in Big Data analysis arise due to the way
the data are recorded, maintained, processed and stored. We demonstrate that a
hierarchical, multivariate, statistical machine learning algorithm, namely
Boosted Regression Tree (BRT), can address Big Data challenges to drive decision
making. The challenge in this study is a lack of interoperability, since the data,
a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated
spatio-temporal information, are stored in monolithic hardware components. For
the modelling process, it was necessary to create one common input file.
Merging the data sources produced a structured but noisy input file with
inconsistencies and redundancies. Here, it is shown that BRT can
process different data granularities, heterogeneous data and missingness. In
particular, BRT handles missing data by default: a split can be made on
whether or not a value is missing as well as on what the value is. Most
importantly, BRT offers a wide range of possibilities for the
interpretation of results. Variable selection is performed automatically by
considering how frequently a variable is used to define a split in the trees. A
comparison with two similar regression models (Random Forests and Least
Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms these in this instance. BRT
can also be a starting point for sophisticated hierarchical modelling in real
world scenarios. For example, a BRT, used alone or within an ensemble, could be
tested with existing models in order to improve results for a wide range of
data-driven decisions and applications.
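
The missing-data handling described above can be illustrated with a minimal sketch of one split search on one variable. It is an illustration under stated assumptions, not the implementation used in this study: the function names (sse, best_split), the squared-error criterion, the use of None to mark a missing value, and the toy data are all hypothetical. The sketch considers two kinds of candidate split, one on whether the value is missing and one on what the value is, and keeps whichever gives the lower error.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the best single split on one variable (hypothetical helper).

    Returns (kind, threshold, loss), or None if no split is possible.
    kind is "missing" for a missing-vs-observed split, or "value" for
    an x <= threshold split where missing rows are sent to whichever
    side gives the lower loss.
    """
    observed = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    missing_y = [yi for xi, yi in zip(x, y) if xi is None]
    candidates = []

    # Candidate 1: split on whether or not the value is missing.
    if missing_y and observed:
        loss = sse(missing_y) + sse([yi for _, yi in observed])
        candidates.append(("missing", None, loss))

    # Candidate 2: split on what the value is (every distinct threshold
    # except the largest, which would leave the right side empty).
    for t in sorted({xi for xi, _ in observed})[:-1]:
        left = [yi for xi, yi in observed if xi <= t]
        right = [yi for xi, yi in observed if xi > t]
        loss = min(
            sse(left + missing_y) + sse(right),   # missing rows go left
            sse(left) + sse(right + missing_y),   # missing rows go right
        )
        candidates.append(("value", t, loss))

    return min(candidates, key=lambda c: c[2]) if candidates else None


# Toy data (invented): the response is high exactly where x is missing,
# so missingness itself is the informative signal.
x_a = [1.0, 2.0, None, 3.0, None, 4.0]
y_a = [1.0, 1.1, 5.0, 0.9, 5.2, 1.0]
split_a = best_split(x_a, y_a)

# Toy data (invented): the response depends on the value of x, and the
# single missing row behaves like the high-x rows.
x_b = [1.0, 2.0, 3.0, 4.0, None]
y_b = [0.0, 0.0, 10.0, 10.0, 10.0]
split_b = best_split(x_b, y_b)

print(split_a[0], split_b[0], split_b[1])
```

In the first toy data set the search selects the missingness split, and in the second it selects a value threshold. A boosted model would repeat such searches over all variables and many trees, and counting how often each variable is chosen gives the split-frequency measure mentioned above.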