Construction and Update of an Online Ensemble Score Involving Linear Discriminant Analysis and Logistic Regression

Abstract

The present aim is to update, upon arrival of new learning data, the parameters of a score constructed with an ensemble method involving linear discriminant analysis and logistic regression in an online setting, without the need to store all of the previously obtained data. Poisson bootstrap and stochastic approximation processes, whose convergence has been established theoretically, were used with online standardized data to avoid numerical explosions. The empirical convergence of online ensemble scores to a reference “batch” score was studied on five different datasets from which data streams were simulated, comparing six different processes used to construct the online scores. For each score, 50 replications using a total of 10N observations (N being the size of the dataset) were performed to assess the convergence and stability of the method, computing the mean and standard deviation of a convergence criterion. A complementary study using 100N observations was also performed. All tested processes converged on all datasets after N iterations, except for one process on one dataset. The best processes were averaged processes using online standardized data and a piecewise constant step-size.


Lalloué, B., Monnez, J. and Albuisson, E. (2022) Construction and Update of an Online Ensemble Score Involving Linear Discriminant Analysis and Logistic Regression. Applied Mathematics, 13, 228-242. doi: 10.4236/am.2022.132018.

1. Introduction

When considering the problem of predicting the values of a dependent variable $y$, whether continuous (in the case of regression) or categorical (in the case of classification), from observed variables $x^1, \ldots, x^p$, which are themselves continuous or categorical, many different predictors can be constructed to address this problem. The principle of ensemble methods is to construct a set of “basic” individual predictors (using classical methods) whose predictions are then aggregated by average or by vote. Provided that the individual predictors are relatively good and sufficiently different from each other, ensemble methods generally yield more stable predictors than individual predictors [1].

This set of individual predictors can be constructed through different means, used separately or in combination, in order to obtain differences between them: various types of regression or classification rules, different samples (e.g. bootstrap), different variable selection methods (random, stepwise selection, shrinkage methods, etc.), or, more generally, the introduction of a random element in the construction of the predictors. Bagging [2], boosting [3], random forests [1] and Random Generalized Linear Models (RGLM) [4] are examples of ensemble methods. Another method for constructing an ensemble score in seven steps was recently proposed in Duarte et al. [5], was used in Lalloué et al. [6], and will serve as the reference in this article:

1) Selection of $n_1$ classification rules.

2) Generation of $n_2$ bootstrap samples, the same samples being used for the $n_1$ rules.

3) Choice of $n_3$ modalities of random selection of variables. For each bootstrap sample, selection of $m$ variables according to these modalities.

4) Selection of $m^*$ variables among the $m$ by a classical method (stepwise, shrinkage, etc.).

5) For each classification rule, construction of the $n_2 n_3$ predictors corresponding to the bootstrap samples and the selected variables.

6) For each classification rule, aggregation of the predictors into an intermediate score.

7) Aggregation of the $n_1$ intermediate scores from the previous step by averaging or voting.

Herein, we consider the case where y is a binary variable and the classification rules are linear discriminant analysis (LDA) and logistic regression (LR).
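To make the pipeline concrete, here is a minimal Python sketch of these seven steps for the two rules considered in this article, using scikit-learn's LDA and logistic regression implementations, shared bootstrap samples, and aggregation by arithmetic mean. It is an illustration only: steps 3) and 4) (variable selection) are skipped, predictors are aggregated on the probability scale rather than through the normalized coefficient scores described in [5], and all names are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def batch_ensemble_score(X, y, n2=100):
    """Batch ensemble score sketch: n1 = 2 rules (LDA, LR), n2 shared
    bootstrap samples, aggregation of predictors and of the intermediate
    scores by arithmetic mean (steps 3-4 omitted)."""
    rules = {"LDA": LinearDiscriminantAnalysis, "LR": LogisticRegression}
    # Step 2: the same n2 bootstrap samples are used for both rules.
    boots = [rng.integers(0, len(y), size=len(y)) for _ in range(n2)]
    intermediate = {}
    for name, Rule in rules.items():                            # Step 1
        preds = [Rule().fit(X[b], y[b]).predict_proba(X)[:, 1]  # Step 5
                 for b in boots]
        intermediate[name] = np.mean(preds, axis=0)             # Step 6
    # Step 7: aggregation of the n1 intermediate scores by averaging.
    return 0.5 * intermediate["LDA"] + 0.5 * intermediate["LR"]
```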

In the context of online data, i.e. a flow of data arriving continuously, one wishes to be able to update such an ensemble score when new data becomes available, without having to store all of the previously obtained data and without performing the entire analysis. To achieve this goal, stochastic approximation processes [7] [8] [9] can be used. In particular, processes that we have previously studied theoretically [10] [11] will be detailed in Section 2.

However, the theoretical guarantees of convergence already established for this type of process provide little information on the practical choices to be made in order to obtain the best performance: e.g. “classical” or averaged processes [8] [10] [11], a continuously decreasing step-size, a decreasing piecewise constant step-size [11] [12] or a constant step-size [10], the use at each step of a mini-batch of observations or of all observations up to the current step in the case of LDA [10]. Therefore, Section 3 is dedicated to empirically testing several online ensemble scores on several datasets, using several stochastic approximation processes for each classifier and comparing the accuracy of the estimations. To avoid a numerical explosion in the presence of heterogeneous data or outliers, an online standardization of the data is used, as tested in [10] [11]. Moreover, an inadequate choice of the step-size can also lead to a numerical explosion in the non-asymptotic phase of the process or slow down its convergence; thus, several types of step-sizes are tested. One conclusion of this study is that the best-performing processes among those tested are not the classical processes with a continuously decreasing step-size and a mini-batch of observations at each step in the case of LDA.

2. Theoretical Construction and Update of an Online Ensemble Score

In order to be able to update online the ensemble score defined in [5] based on linear discriminant analysis and logistic regression, each bootstrap sample and each predictor must be updated when new data arrive [13]. Once the predictors are updated, the intermediate scores and the resulting final ensemble score are obtained using the same aggregation rules as for the offline ensemble method.

2.1. Updating the Bootstrap Samples

Starting from a sample of size $n$, the usual construction of a bootstrap sample consists in drawing at random with replacement $n$ elements of the sample. In the case of a data stream, the Poisson bootstrap method proposed by Oza and Russell [14] can be used to update a bootstrap sample: for any new data, for each bootstrap sample $b_i$ ($i = 1, \ldots, n_2$), a realization $k_i$ of a random variable following a Poisson distribution with parameter 1 is simulated, and the new data is added $k_i$ times to sample $b_i$. These new data can then be used to update the predictors defined using sample $b_i$.
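A minimal sketch of this update rule, assuming a hypothetical `update_predictor` callback that performs one online update of the predictor attached to a bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_bootstrap_step(new_obs, predictors, update_predictor):
    """Online Poisson bootstrap [14]: for each of the n2 bootstrap samples,
    draw k_i ~ Poisson(1) and feed the new observation k_i times to the
    predictor attached to that sample."""
    for predictor in predictors:
        k_i = rng.poisson(lam=1.0)  # copies of new_obs entering sample b_i
        for _ in range(k_i):
            update_predictor(predictor, new_obs)
```

Note that with this scheme the bootstrap samples themselves never need to be stored: only the predictors, which are updated online, are kept.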

2.2. Updating the Predictors

Recursive stochastic approximation algorithms which take into account a mini-batch of new data at each step can be used to update the predictors. Such algorithms have been developed to estimate linear [10] or logistic [11] regression parameters, or to estimate the class centers in unsupervised classification [15] or the principal components of a factor analysis [16]. These algorithms do not require storing data and can, within a fixed timeframe, process more data than offline methods. Stochastic approximation algorithms able to update predictors obtained by linear discriminant analysis (LDA, equivalent to linear regression in the case of a binary dependent variable) and logistic regression (LR) are described below.

2.2.1. Updating Logistic and Linear Regressions Using a Mini-Batch of Observations at Each Step

Note that all stochastic approximation algorithms described in this section use an online standardization of the data. Indeed, in practical applications, an inadequate choice of the step-size of these processes or the presence of heterogeneous data or outliers can lead to numerical explosion issues in the non-asymptotic phase of the stochastic approximation process. To avoid numerical explosions in the presence of heterogeneous data, an online standardization of the data was proposed in [10] [11]; in the case of a data stream, the moments of the regression variables are a priori unknown, but can be estimated online in order to perform the standardization. However, in this instance, the convergence of the stochastic approximation process is not ensured by classical theorems and was therefore proven in [10] in the case of linear regression and in [11] in the case of logistic regression. Moreover, an overly rapid decrease of the step-size may reduce the speed of convergence in the non-asymptotic phase of the process. For this reason, following [12], the use of a decreasing piecewise constant step-size was tested in [11].

Consider first the case of logistic regression. Let $S$ be a random variable taking its values in $\{0, 1\}$ and $R = (R^1 \cdots R^p \ 1)'$, with $R^1, \ldots, R^p$ random variables taking values in $\mathbb{R}$. Let $m = (\mathrm{E}[R^1] \cdots \mathrm{E}[R^p] \ 0)'$, $R^c = R - m$, $\sigma_k$ the standard deviation of $R^k$, $\Gamma$ the diagonal square matrix with diagonal elements $\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_p}, 1$, $Z = \Gamma R^c$ the standardized $R$ vector, $\theta \in \mathbb{R}^{p+1}$ the vector of parameters and $h(u) = \frac{e^u}{1 + e^u}$. The vector $\theta$ is the unique solution of the system of equations $\mathrm{E}\left[ \nabla_x \left( \ln\left(1 + e^{Z'x}\right) - Z'x \, S \right) \right] = 0$, and thus of

$$\mathrm{E}\left[ Z \left( h(Z'x) - S \right) \right] = 0. \qquad (1)$$

Let $((R_n, S_n), n \geq 1)$ denote an i.i.d. sample of $(R, S)$ and, for $k \in \{1, \ldots, p\}$, let $\bar{R}_n^k$ denote the average of the sample $(R_1^k, \ldots, R_n^k)$ of $R^k$ and $(V_n^k)^2 = \frac{1}{n} \sum_{i=1}^{n} (R_i^k - \bar{R}_n^k)^2$ its variance (both computed recursively), $\bar{R}_n$ the vector $(\bar{R}_n^1 \cdots \bar{R}_n^p \ 0)'$ and $\Gamma_n$ the diagonal matrix with diagonal elements $\frac{1}{\sqrt{\frac{n}{n-1}} V_n^1}, \ldots, \frac{1}{\sqrt{\frac{n}{n-1}} V_n^p}, 1$.
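These recursive computations can be implemented, for instance, with Welford-type recursions; a minimal sketch with illustrative names (the intercept coordinate, which is not standardized, is left out):

```python
import numpy as np

class OnlineMoments:
    """Running mean and variance of the p regressors, updated one
    observation at a time (Welford's recursion)."""
    def __init__(self, p):
        self.n = 0
        self.mean = np.zeros(p)   # current R_bar_n
        self.m2 = np.zeros(p)     # running sum of squared deviations

    def update(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)   # (V_n)^2 = m2 / n

    def standardize(self, r):
        # (r - R_bar_n) scaled by 1 / (sqrt(n/(n-1)) V_n); requires n >= 2
        return (r - self.mean) / np.sqrt(self.m2 / (self.n - 1))
```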

Assume that a mini-batch of $m_n$ new observations $(R_i, S_i)$, constituting an i.i.d. sample of $(R, S)$, is taken into account at step $n$. Denote $M_n = \sum_{i=1}^{n} m_i$ and $I_n = \{M_{n-1} + 1, \ldots, M_n\}$. Define, for $j \in I_n$, $\tilde{Z}_j = \Gamma_{M_{n-1}} (R_j - \bar{R}_{M_{n-1}})$, the vector $R_j$ standardized with respect to the estimations of the means and variances of the components of $R$ at step $n-1$. Recursively define the stochastic approximation process $(X_n, n \geq 1)$ and the averaged process $(\bar{X}_n, n \geq 1)$:

$$X_{n+1} = X_n - a_n \frac{1}{m_n} \sum_{j \in I_n} \tilde{Z}_j \left( h\big(\tilde{Z}_j' X_n\big) - S_j \right) \qquad (2)$$

$$\bar{X}_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} X_i = \bar{X}_n - \frac{1}{n+1} \left( \bar{X}_n - X_{n+1} \right) \qquad (3)$$

In the case of linear regression, the same type of process is used in [10], taking $h(u) = u$.
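A minimal sketch of one iteration of processes (2) and (3); the mini-batch is assumed to be already standardized with the step $n-1$ moments (e.g. with the `OnlineMoments` class above) and augmented with a trailing 1 for the intercept, and `h` is the logistic function for LR or the identity for linear regression (LDA):

```python
import numpy as np

def h_logistic(u):
    return 1.0 / (1.0 + np.exp(-u))   # h(u) = e^u / (1 + e^u)

def sgd_step(x, x_bar, n, a_n, Z_batch, S_batch, h=h_logistic):
    """One step of process (2) and of the averaged process (3).
    Z_batch: (m_n, p+1) standardized mini-batch; S_batch: labels in {0, 1}."""
    m_n = len(S_batch)
    grad = Z_batch.T @ (h(Z_batch @ x) - S_batch) / m_n  # gradient estimate
    x_new = x - a_n * grad                               # Equation (2)
    x_bar_new = x_bar - (x_bar - x_new) / (n + 1)        # Equation (3)
    return x_new, x_bar_new
```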

The following theorem is established for linear regression in [10] and for logistic regression in [11]. Assume:

(H1a) There is no affine relation between the components of $R$.

(H1b) The moments of order 4 of $R$ exist.

(H2a) $a_n > 0$, $\sum_{n=1}^{\infty} a_n = \infty$, $\sum_{n=1}^{\infty} \frac{a_n}{\sqrt{n}} < \infty$, $\sum_{n=1}^{\infty} a_n^2 < \infty$.

Theorem. Under H1a, H1b and H2a, $(X_n, n \geq 1)$ and $(\bar{X}_n, n \geq 1)$ converge almost surely to $\theta$.

In [10] and [11], these processes were compared to others (with or without online standardization, and with or without averaging) on real or simulated data. Empirical results showed the interest of using online standardization of the data to avoid numerical explosions as well as the better performance of averaged processes using a piecewise constant step-size (see Section 3).

2.2.2. Updating Linear Regression Using All Observations up to the Current Step

Recursively define the stochastic approximation processes $(X_n, n \geq 1)$ and $(\bar{X}_n, n \geq 1)$:

$$X_{n+1} = X_n - a_n \frac{1}{M_n} \sum_{i=1}^{n} \sum_{j \in I_i} \tilde{Z}_j \left( \tilde{Z}_j' X_n - S_j \right), \quad \tilde{Z}_j = \Gamma_{M_n} \left( R_j - \bar{R}_{M_n} \right) \qquad (4)$$

$$\bar{X}_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} X_i = \bar{X}_n - \frac{1}{n+1} \left( \bar{X}_n - X_{n+1} \right) \qquad (5)$$

Note that
$$\frac{1}{M_n} \sum_{i=1}^{n} \sum_{j \in I_i} \tilde{Z}_j \tilde{Z}_j' = \Gamma_{M_n} \left( \frac{1}{M_n} \sum_{i=1}^{n} \sum_{j \in I_i} R_j R_j' - \bar{R}_{M_n} \bar{R}_{M_n}' \right) \Gamma_{M_n}$$
and
$$\frac{1}{M_n} \sum_{i=1}^{n} \sum_{j \in I_i} \tilde{Z}_j S_j = \Gamma_{M_n} \left( \frac{1}{M_n} \sum_{i=1}^{n} \sum_{j \in I_i} R_j S_j - \bar{R}_{M_n} \bar{S}_{M_n} \right), \quad \bar{S}_{M_n} = \frac{1}{M_n} \sum_{i=1}^{M_n} S_i.$$
Thus, the updating does not necessitate storing previous data, since all empirical means and variances can be computed recursively. The same type of process would not be possible for logistic regression without storing the data, since in that case $\tilde{Z}_j$ in $\tilde{Z}_j h(\tilde{Z}_j' X_n)$ would have to be updated for all $j$.
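Concretely, the process only needs a handful of recursively updated statistics. A minimal sketch under these identities, assuming raw (non-standardized) mini-batches; the class and variable names are illustrative, the intercept is kept as a last coordinate, and its cross-terms vanish because the regressors are standardized with the current-step moments:

```python
import numpy as np

class AllObsLinearProcess:
    """Processes (4)-(5) for linear regression (LDA), driven entirely by
    recursively updated sufficient statistics; no observation is stored."""
    def __init__(self, p):
        self.M = 0                       # M_n, observations seen so far
        self.sum_R = np.zeros(p)         # sum of R_j
        self.sum_S = 0.0                 # sum of S_j
        self.sum_RR = np.zeros((p, p))   # sum of R_j R_j'
        self.sum_RS = np.zeros(p)        # sum of R_j S_j
        self.x = np.zeros(p + 1)         # X_n (last coordinate: intercept)
        self.x_bar = np.zeros(p + 1)     # averaged process

    def step(self, R_batch, S_batch, a_n, n):
        # Accumulate the sufficient statistics with the new mini-batch.
        self.M += len(S_batch)
        self.sum_R += R_batch.sum(axis=0)
        self.sum_S += S_batch.sum()
        self.sum_RR += R_batch.T @ R_batch
        self.sum_RS += R_batch.T @ S_batch
        R_bar, S_bar = self.sum_R / self.M, self.sum_S / self.M
        C = self.sum_RR / self.M - np.outer(R_bar, R_bar)  # covariance of R
        gamma = 1.0 / np.sqrt(C.diagonal() * self.M / (self.M - 1))
        # Standardized Gram matrix and cross-moments (identities above);
        # the intercept row/column reduces to (0, ..., 0, 1) and S_bar.
        G = np.eye(len(self.x))
        G[:-1, :-1] = C * np.outer(gamma, gamma)
        g = np.append(gamma * (self.sum_RS / self.M - R_bar * S_bar), S_bar)
        self.x = self.x - a_n * (G @ self.x - g)          # Equation (4)
        self.x_bar -= (self.x_bar - self.x) / (n + 1)     # Equation (5)
```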

Denote by $\lambda_{\max}$ the largest eigenvalue of the covariance matrix of $R$. Assume:

(H2b) $\left( a_n = a < \frac{1}{\lambda_{\max}} \right)$ or $\left( a_n \to 0, \ \sum_{n=1}^{\infty} a_n = \infty \right)$.

Theorem. Under H1a, H1b and H2b, $(X_n, n \geq 1)$ and $(\bar{X}_n, n \geq 1)$ converge almost surely to $\theta$.

This theorem was also proven in [10]. Empirical results again showed the interest of using online standardization of the data as well as all observations up to the current step to avoid numerical explosions and to increase the speed of convergence.

It is therefore possible to use the processes described in this section to update the predictors by linear discriminant analysis and logistic regression in the ensemble score, taking into account the sample of new data generated by the Poisson bootstrap at each step for each predictor.

3. Empirical Study of Convergence

3.1. Material and Methods

3.1.1. Datasets

Four datasets available on the Internet and one dataset derived from the EPHESUS study [17] were used, all of which have previously been utilized to test the performance of stochastic approximation processes with online standardized data in the case of online linear regression [10] and online logistic regression [11]. The Twonorm, Ringnorm, Quantum and Adult datasets are commonly used to test classification methods. Twonorm and Ringnorm, introduced by Breiman [18], contain simulated data with homogeneous variables. Quantum contains observed “clean” data, without outliers and with most of its variables on a similar scale. Adult and HOSPHF30D contain observed data with outliers, as well as heterogeneous variables of different types and scales. A summary of these datasets is provided in Table 1.

The following preprocessing was performed on the data:

· Twonorm and Ringnorm: no preprocessing.

· Quantum: a stepwise variable selection (using AIC) was performed on the 6197 observations without any missing value. The dataset with complete observations for the 12 selected variables was used.

· Adult2: from the Adult dataset, modalities of several categorical variables were merged (in order to obtain a larger number of observations for each modality) and all categorical variables were then replaced by sets of binary variables, leading to a dataset with 38 variables.

· HOSPHF30D: 13 variables were selected using a stepwise selection.

From each dataset, a data stream was simulated step by step by randomly drawing, with replacement, 100 new observations at each step. Online scores were then constructed and updated from these data streams.
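For instance, with numpy, such a stream can be simulated with a generator (a sketch; `data` is the N-row dataset):

```python
import numpy as np

def simulate_stream(data, batch_size=100, seed=0):
    """Yield mini-batches drawn at random with replacement from the
    dataset, simulating a data stream."""
    rng = np.random.default_rng(seed)
    while True:
        yield data[rng.integers(0, len(data), size=batch_size)]
```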

3.1.2. Reference Batch Score

For each dataset, a batch ensemble score was constructed using a method adapted from Duarte et al. [5] with the following parameters:

1) Two classification rules were used: linear discriminant analysis (LDA) and logistic regression (LR).

2) A total of 100 bootstrap samples were drawn for both rules (i.e. the same samples were used by each rule).

3) All available variables were included.

4) For each classification rule, the 100 associated predictors were aggregated by arithmetic mean and the coefficients subsequently normalized such that the score varied between 0 and 100 (as described in [5], Subsection 4.4.2).

5) The aggregation of the two intermediate scores $S_{LDA}$ and $S_{LR}$ was achieved by arithmetic mean: $S = \lambda S_{LDA} + (1 - \lambda) S_{LR}$ with $\lambda = 0.5$ (a sketch of this aggregation is given below).
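A sketch of steps 4) and 5), under one plausible reading of the 0-100 normalization of [5] (a linear rescaling of the aggregated linear score over the observed data); the names `coefs_lda` and `coefs_lr` (the 100 coefficient vectors of each rule) and `X` (the data matrix with an intercept column) are illustrative:

```python
import numpy as np

def final_score(coefs_lda, coefs_lr, X, lam=0.5):
    """Aggregate each rule's predictors by arithmetic mean, rescale each
    intermediate score to [0, 100], then mix them with lambda = 0.5."""
    def intermediate(coefs):
        beta = np.mean(coefs, axis=0)   # arithmetic mean of the predictors
        raw = X @ beta                  # linear score on the data
        # One plausible reading of the 0-100 normalization in [5], 4.4.2.
        return 100.0 * (raw - raw.min()) / (raw.max() - raw.min())
    return lam * intermediate(coefs_lda) + (1.0 - lam) * intermediate(coefs_lr)
```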

The score obtained for each dataset was used as a “gold standard” to assess the convergence of the tested online processes (Figure 1).

Table 1. Description of the datasets.

Na: number of available observations; N: number of selected observations; pa: number of available parameters; p: number of selected parameters.

Figure 1. Methodology of construction and update of the online ensemble score.

3.1.3. Tested Processes

Types of processes: Three different types of stochastic processes $(X_n)$ were used, as defined below.

1) “Classical” stochastic gradient (notation C_ _ _). At step $n$, $\operatorname{card} I_n = m_n$ observations $(R_j, S_j)$ were taken into account and the process was updated recursively: $X_{n+1} = X_n - a_n \frac{1}{m_n} \sum_{j \in I_n} \tilde{Z}_j \big( h(\tilde{Z}_j' X_n) - S_j \big)$, with $\tilde{Z}_j$ the vector of standardized explanatory variables, $S_j \in \{0, 1\}$, $h(u) = u$ for the LDA, and $h(u) = \frac{e^u}{1 + e^u}$ for the LR.

2) “Averaged” stochastic gradient (notation A_ _ _): $\bar{X}_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} X_i$.

3) Only in the case of the LDA: a process taking into account all of the previous observations $(R_j, S_j)$ up to the current step, $j \in I_1 \cup \cdots \cup I_n$ (final mention “all”) [10]: $X_{n+1} = X_n - a_n \frac{1}{M_n} \sum_{i=1}^{n} \sum_{j \in I_i} \tilde{Z}_j \big( \tilde{Z}_j' X_n - S_j \big)$, $\tilde{Z}_j = \Gamma_{M_n} (R_j - \bar{R}_{M_n})$.

In all cases, the explanatory variables were standardized online (notation _S _ _): the principle and practicality of this method to avoid numerical explosions have already been shown [10] [11]. Indeed, for some datasets (Adult2, HOSPHF30D), processes with raw data led to a numerical explosion, contrary to those with online standardized data.

Step-size choice: The tested step-sizes $a_n$ were:

1) Continuously decreasing: $a_n = \frac{c}{(b + n)^{\alpha}}$ (notation _ _ _V);

2) Constant: $a_n = \frac{1}{p}$ (with $p$ the number of explanatory variables) (notation _ _ _C);

3) Piecewise constant [12]: $a_n = \frac{c}{\left( b + \left\lfloor \frac{n}{\tau} \right\rfloor \tau \right)^{\alpha}}$ ($\lfloor \cdot \rfloor$ being the integer part and $\tau$ the size of the levels) (notation _ _ _P).

In all cases, $\alpha = 2/3$ was taken, as suggested by Xu [8] in the case of linear regression, with $b = 1$ and $c = 1$.
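The three schedules in a small helper (a sketch; `n` is the step index and the function name is illustrative):

```python
def step_size(n, kind, alpha=2/3, b=1.0, c=1.0, tau=200, p=None):
    """Step-size a_n: 'V' continuously decreasing, 'C' constant (1/p),
    'P' piecewise constant with levels of size tau."""
    if kind == "V":
        return c / (b + n) ** alpha
    if kind == "C":
        return 1.0 / p
    if kind == "P":
        return c / (b + (n // tau) * tau) ** alpha
    raise ValueError(kind)
```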

Tested processes: Six couples of processes were tested (Table 2). These were among the processes which performed best in the studies published for online LDA [10] and online LR [11], or represented “usual” processes in frequent use (apart from the online standardization of the data). A total of 100 new observations were used per step. Each process was applied to each of the streams generated from the datasets.

In the notation describing a couple of processes, the first term is for the LDA and the second for the LR. For example, AS100Call-AS100P200 is the couple formed by using, for the LDA, an averaged process (A) with online standardization of the data (S), 100 new observations per step (100), a constant step-size (C), taking into account all the observations up to the current step (all), and, for the LR, an averaged process (A) with online standardization of the data (S), 100 new observations per step (100) and a piecewise constant step-size with levels of size 200 (P200).

Table 2. List of the couples of processes studied.

All processes used online standardized data and 100 new observations per step.

Note that the six couples of processes can be grouped into three pairs. Within each pair, the LR process is the same, while for the LDA part one couple uses 100 observations at each step and the other uses all observations up to the current step.

Convergence criterion: The convergence criterion used was the relative difference of norms $\frac{\| \theta_b - \hat{\theta}_N \|}{\| \theta_b \|}$ between the vector $\theta_b$ of coefficients obtained for the batch score and the vector $\hat{\theta}_N$ of coefficients estimated by a process after $N$ iterations, the variables being standardized and the score being normalized to vary between 0 and 100 [5]. Convergence was considered to have occurred when the value of this criterion was less than the arbitrary threshold of 0.05. Three indicators were compared for each couple of processes: the criterion value for the intermediate score $S_{LDA}$ obtained by aggregating the LDAs, the criterion value for the intermediate score $S_{LR}$ obtained by aggregating the LRs, and the criterion value for the final score $S$.
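A direct transcription of the criterion (a sketch):

```python
import numpy as np

def convergence_criterion(theta_b, theta_hat):
    """Relative difference of norms; values below 0.05 count as convergence."""
    return np.linalg.norm(theta_b - theta_hat) / np.linalg.norm(theta_b)
```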

3.1.4. Convergence and Stability Analyses

In order to study the empirical convergence of the processes, an analysis using a total of 10N observations was performed for each couple of processes. Since 100 observations are introduced at each step, the number of iterations of a process is N/10. Due to the stochastic nature of the processes studied, some variability is expected in the results. In order to evaluate this variability, the entire analysis using 10N observations was replicated 50 times for each couple of processes and each dataset. The mean, standard deviation (SD) and relative standard deviation (RSD, i.e. the standard deviation divided by the mean) of the criterion values were studied for the intermediate and final scores. For each dataset, the average of the criterion values over all couples of processes was also studied.

For each replication and each dataset, the couples of processes were ranked from the best (lowest relative difference of norms for the final score S) to the worst (highest). Thereafter, the mean rank of each couple and its standard deviation over the 50 replications were computed, first by dataset, then over all datasets.

To study the long-term convergence of the processes, a single analysis using 100N observations was performed for each couple of processes. Again, for each dataset, the values of the criterion for the intermediate and final scores were studied, and the couples of processes were ranked from best to worst. The mean rank over all datasets was used to compare the global performance of the couples. All analyses were performed with R 3.6.2.

3.2. Results

3.2.1. Convergence and Stability Analysis for 10N Observations

When replicating each couple of processes 50 times, the mean criterion values were lower than 0.05 for all couples of processes applied to the Twonorm, Ringnorm and Quantum datasets (Table 3). However, only three out of six couples of processes converged for Adult2 (AS100C-AS100P200, AS100P50all-AS100P50 and AS100Call-AS100P200) as well as for HOSPHF30D (AS100P50-AS100P50, AS100P50all-AS100P50 and AS100Call-AS100P200). Note that for Twonorm, Ringnorm and Quantum, the maximum criterion values (not shown) for all couples of processes were always lower than 0.05 (i.e. even the worst performing processes still converged), whereas this was not the case for certain couples applied to Adult2 and HOSPHF30D.

Generally, intermediate LDA scores had smaller mean values, i.e. faster convergence, than intermediate LR scores. However, the worst performing intermediate process was the LDA process AS100P50 applied to Adult2. In most cases, the mean criterion value for the final score S lay between those of the two intermediate scores $S_{LDA}$ and $S_{LR}$. For some couples of processes applied to some datasets (for instance AS100Call-AS100C on Adult2 or AS100C-AS100P200 on HOSPHF30D), this led to convergence of the final score towards the reference while one of the intermediate scores had not yet converged according to the criterion.

When studying the rankings of the couples of processes over the 50 replications, the best couple overall was AS100P50all-AS100P50. This couple was consistently among the three best and had the best performance for three datasets. Note that the three best couples across all datasets were those using all observations up to the current step for the LDA intermediate scores.

Table 3. Mean, standard deviation and relative standard deviation of the criterion after 50 replications.

*denotes criterion values < 0.05. SD: standard deviation; RSD: relative standard deviation.

The observed differences in the average criterion were greater between datasets than between couples of processes (Table 4). Indeed, the means of each couple of processes were lowest for Twonorm and Ringnorm compared to the other datasets. Conversely, all couples had their worst results for HOSPHF30D. Generally, all couples of processes performed better when applied to simulated data (Twonorm and Ringnorm) than to observed data (Quantum, Adult2, HOSPHF30D). This was also true when comparing the standard deviations and RSDs.

When comparing the overall variability of the rankings between the couples, AS100P50all-AS100P50 and AS100Call-AS100P200, the two best performing couples of processes on average, also had the lowest standard deviations of the mean overall rank (1.17 and 1.12, respectively), while the couple with the largest standard deviation was CS100Vall-CS100V (1.72). The two best couples thus consisted of averaged processes, with a piecewise constant step-size or, for one LDA process, a constant step-size; both used, in the case of LDA, all observations up to the current step instead of a mini-batch of observations.

3.2.2. Convergence Analysis for 100N Observations

When studying the couples of processes after using 100N observations (i.e. N iterations) in order to assess the “long-term” convergence, the final online score S was very similar (criterion value < 0.05) to the reference “batch” score for all couples on four of the five datasets tested (Table 5).

Table 4. Mean (SD) rank of the processes across the 50 replications, by dataset and overall (ordered by overall rank).

Table 5. Criterion values after 100N observations for the intermediate and final scores.

*denotes criterion values < 0.05. First abbreviation: LDA process; second abbreviation: LR process. Type of process: C for classical SGD, A for averaged SGD (ASGD). Data: R for raw data, S for online standardization of the data (first number: number of new data per step). Step-size: V for continuously decreasing, C for constant, P for piecewise constant (second number: size of the levels of the piecewise constant step-size).

Only the AS100P50-AS100P50 couple applied to the Adult2 dataset did not converge after 100N observations (criterion = 1.697). More precisely, the result for the LDA part of this couple differed substantially from its batch counterpart (criterion = 2.756), whereas the LR part appeared to converge to the batch LR part (criterion = 0.035).

For each couple of processes, the best performances were achieved for the Twonorm and Ringnorm datasets, which consist of simulated data. The worst performances were obtained for Adult2 and HOSPHF30D datasets, which contain observed data.

Although these results are not directly comparable with the average results using 10N observations presented in the previous subsection (since there was only one replication using 100N observations), it should be noted that the criterion values of all couples of processes on all datasets were lower after 100N observations than the mean values after 10N observations, except for the LDA and global scores of AS100P50-AS100P50 applied to Adult2.

When the couples of processes were ranked from best to worst for each dataset and the average ranks were calculated across all datasets (Table 5), the two worst performing couples were CS100Vall-CS100V and CS100V-CS100V, i.e. the only two couples using classical processes with a continuously decreasing step-size. The best couple was again AS100P50all-AS100P50. The two best couples were thus the same as for 10N observations, and the two worst were classical processes with a continuously decreasing step-size.

4. Conclusions

This study presented the construction of an online ensemble score obtained by bagging and the aggregation of two classification rules, LDA and LR. The online ensemble score was constructed using Poisson bootstrap together with stochastic approximation processes with online standardized data whose convergence has already been theoretically established, of different types: averaged or not, using either a mini-batch of data at each step or, in the case of LDA, all observations up to the current step, and with different choices of step-size. The convergence of this online score towards the “batch” score was studied empirically. The two best processes were averaged processes with a piecewise constant step-size (or, for one LDA process, a constant step-size), using in the case of LDA all observations up to the current step instead of a mini-batch of observations. Thus, the best processes were not the classical ones with a continuously decreasing step-size and a mini-batch of observations at each step in the case of LDA.

This study can be extended in several directions. More than two models of classification could be taken into account. Other classification models could be used such as the probit model. Other experiments could be carried out using randomly selected variables with different modalities of random selection [5] [6]. This study can also be extended to the regression framework when y is a continuous variable.

Acknowledgements

The authors thank Mr. Pierre Pothier for editing this manuscript.

Funding

This work was supported by the investments for the Future Program under grant ANR-15-RHU-0004.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Genuer, R. and Poggi, J.M. (2017) Arbres CART et Forêts aléatoires, Importance et sélection de variables.
https://hal.archives-ouvertes.fr/hal-01387654
[2] Breiman, L. (1996) Bagging Predictors. Machine Learning, 24, 123-140.
https://doi.org/10.1007/BF00058655
[3] Freund, Y. and Schapire, R.E. (1996) Experiments with a New Boosting Algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, 148-156.
[4] Song, L., Langfelder, P. and Horvath, S. (2013) Random Generalized Linear Model: A Highly Accurate and Interpretable Ensemble Predictor. BMC Bioinformatics, 14, Article No. 5.
https://doi.org/10.1186/1471-2105-14-5
[5] Duarte, K., Monnez, J.M. and Albuisson, E. (2018) Methodology for Constructing a Short-Term Event Risk Score in Heart Failure Patients. Applied Mathematics, 9, 954-974.
https://doi.org/10.4236/am.2018.98065
[6] Lalloué, B., Monnez, J.M., Lucci, D. and Albuisson, E. (2021) Construction of Parsimonious Event Risk Scores by an Ensemble Method. An Illustration for Short-Term Predictions in Chronic Heart Failure Patients from the GISSI-HF Trial. Applied Mathematics, 12, 627-653.
https://doi.org/10.4236/am.2021.127045
[7] Ljung, L., Pflug, G.C. and Walk, H. (1992) Stochastic Approximation and Optimization of Random Systems. Birkhäuser, Basel.
https://doi.org/10.1007/978-3-0348-8609-3
[8] Xu, W. (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. arXiv:1107.2490.
[9] Kek, S.L., Sim, S.Y., Leong, W.J. and Teo, K.L. (2018) Discrete-Time Nonlinear Stochastic Optimal Control Problem Based on Stochastic Approximation Approach. Advances in Pure Mathematics, 8, 232-244.
https://doi.org/10.4236/apm.2018.83012
[10] Duarte, K., Monnez, J.M. and Albuisson, E. (2018) Sequential Linear Regression with Online Standardized Data. PLoS ONE, 13, e0191186.
https://doi.org/10.1371/journal.pone.0191186
[11] Lalloué, B., Monnez, J.M. and Albuisson, E. (2021) Streaming Constrained Binary Logistic Regression with Online Standardized Data. Journal of Applied Statistics.
https://doi.org/10.1080/02664763.2020.1870672
[12] Bach, F. (2014) Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression. Journal of Machine Learning Research, 15, 595-627.
[13] Lalloué, B., Monnez, J.M. and Albuisson, E. (2019) Actualisation en ligne d’un score d’ensemble. 51e Journées de Statistique, Nancy, France, June 2019.
https://hal.archives-ouvertes.fr/hal-02152352
[14] Oza, N.C. and Russell, S.J. (2001) Online Bagging and Boosting. Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, Key West, Florida, USA, 4-7 January 2001, 229-236.
[15] Cardot, H., Cénac, P. and Monnez, J.M. (2012) A Fast and Recursive Algorithm for Clustering Large Datasets with κ-Medians. Computational Statistics & Data Analysis, 56, 1434-1449.
https://doi.org/10.1016/j.csda.2011.11.019
[16] Monnez, J.M. and Skiredj, A. (2021) Widening the Scope of an Eigenvector Stochastic Approximation Process and Application to Streaming PCA and Related Methods. Journal of Multivariate Analysis, 182, Article ID: 104694.
https://doi.org/10.1016/j.jmva.2020.104694
[17] Pitt, B., Remme, W., Zannad, F., Neaton, J., Martinez, F., Roniker, B., et al. (2003) Eplerenone, a Selective Aldosterone Blocker, in Patients with Left Ventricular Dysfunction after Myocardial Infarction. The New England Journal of Medicine, 348, 1309-1321.
https://doi.org/10.1056/NEJMoa030207
[18] Breiman, L. (1996) Bias, Variance, and Arcing Classifiers. Technical Report 460, University of California, Berkeley.
