TITLE:
Multivariate Statistical Analysis of Large Datasets: Single Particle Electron Microscopy
AUTHORS:
Marin van Heel, Rodrigo V. Portugal, Michael Schatz
KEYWORDS:
Single Particle Cryo-EM, Multivariate Statistical Analysis, Unsupervised Classification, Modulation Distance, Manifold Separation
JOURNAL NAME:
Open Journal of Statistics, Vol.6 No.4, August 31, 2016
ABSTRACT: Biology is a challenging and complicated
mess. Understanding this complexity is the realm of the biological
sciences: trying to make sense of massive, messy data by discovering
patterns and revealing the underlying general rules. Among the most powerful
mathematical tools for organizing and helping to structure complex,
heterogeneous and noisy data are the tools provided by multivariate statistical
analysis (MSA) approaches. These eigenvector/eigenvalue data-compression
approaches were first introduced to electron microscopy (EM) in 1980 to help
sort out different views of macromolecules in a micrograph. After 35 years of
continuous use and development, new MSA applications are still being proposed
regularly. The speed of computing has increased dramatically in the decades
since their first use in electron microscopy. However, we have also seen a
possibly even more rapid increase in the size and complexity of the EM data
sets to be studied. MSA computations had thus become a very serious bottleneck
limiting their general use. The parallelization of our programs, which speeds up the
process by orders of magnitude, has opened whole new avenues of research. The
speed of the automatic classification in the compressed eigenvector space had also
become a bottleneck that needed to be removed. In this paper we explain the
basic principles of multivariate statistical eigenvector-eigenvalue data
compression; we provide practical tips and application examples for those
working in structural biology, and we provide the more experienced researcher
in this and other fields with the formulas associated with these powerful MSA
approaches.