1. Introduction
Clinacanthus nutans (C. nutans), known as the alligator flower or Sabah snake grass, is a plant belonging to the genus Clinacanthus in the family Acanthaceae. It is found primarily in southern and southwestern China as well as Malaysia, Indonesia, and Thailand [1] [2] [3]. Numerous studies of the chemical composition of C. nutans have confirmed that it is a rich source of flavonoids, phenolics, steroids, triterpenoids, cerebrosides, glycoglycerolipids, glycerides, and sulfur-containing glycosides, which make it a useful folk medicine and an interesting health food [4] [5] [6] [7]. Moreover, various compositional and health studies have concluded that C. nutans herbal tea has considerable potential as a natural antioxidant source. In summary, C. nutans may provide beneficial effects on people’s health and represents a great economic resource.
As the demand for healthy food grows, consumer attitudes are slowly changing and C. nutans is attracting greater interest due to its benefits. Therefore, accurate determination of the origin of C. nutans seeds is scientifically important and has applications in relevant medicines and health food materials, as well as in establishing product quality standards [8].
Currently, the identification of the origin of C. nutans seeds and the determination of C. nutans composition are performed primarily using High-Performance Liquid Chromatography (HPLC) [9] [10] [11] and Gas Chromatography-Mass Spectrometry (GC-MS) [12] [13] [14]. However, high equipment cost, complicated operation, and the need for chemical reagents have restricted the widespread use of these techniques. Therefore, it is of great significance to develop a rapid, simple, and green method for identifying the origin of C. nutans seeds.
Near-Infrared (NIR) spectroscopy primarily reveals the overtone bands and combination bands of fundamental vibrations of X-H functional groups (such as C-H, O-H, and N-H) [15] [16] [17] [18]. It provides rich qualitative and quantitative information, is rapid and simple, and does not require chemical reagents. The technique has now been applied in agriculture [19], food science [20], medicine [21], and other fields. Researchers have used NIR spectroscopy combined with chemometrics to confirm the geographical area of durian and have found good application prospects [22]. Herrero Latorre, Peña Crecente, García Martín, and Barciela García [5] used NIR spectroscopy combined with pattern recognition technology to identify honey samples from different sources, developing a fast and simple food authentication system to distinguish authentic PGI-Galicia honey samples from commercial honey samples of other origins. C. nutans contains different X-H functional groups with significant absorption in the NIR region. However, there have been few reports on identifying the geographic origin of C. nutans using NIR spectroscopy combined with chemometrics.
We collected and analyzed 81 C. nutans samples from three geographic locations including Malaysia, Hainan (China), and Guangxi (China). By combining NIR spectroscopy and chemometrics, we established a seed origin classification model for C. nutans with high classification accuracy.
2. Materials and Methods
2.1. Experimental Samples
The 81 C. nutans samples used in the study originated from Malaysia, Hainan (China), and Guangxi (China): 39 from Malaysia, 30 from Hainan, and 12 from Guangxi. All samples were identified by experts from the Institute of Medicinal Plant Development of Guangdong Academy of Agricultural Sciences.
2.2. Spectral Acquisition
We used an NIRS XDS Rapid Content Analyzer with a dispersive grating (FOSS, Denmark) and its diffuse reflectance accessories. The spectrum acquisition range was 400 - 2500 nm, sampled at 2 nm intervals, and the detectors were Si (400 - 1100 nm) and PbS (polycrystalline lead sulphide; 1100 - 2500 nm). Each C. nutans sample was scanned three times and the scans were averaged, giving a total of 81 spectra.
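As a small illustration of the acquisition settings above, the following sketch builds the 400 - 2500 nm wavelength axis at 2 nm steps and averages replicate scans point by point. The function names and the toy absorbance values are assumptions made for illustration; they are not part of the instrument software or of the measured data.

```python
def wavelength_axis(start_nm=400, stop_nm=2500, step_nm=2):
    """Return the list of sampled wavelengths, end points included."""
    return list(range(start_nm, stop_nm + step_nm, step_nm))


def average_replicates(scans):
    """Average replicate scans of one sample point by point.

    `scans` is a list of equal-length lists of absorbance values.
    """
    n_points = len(scans[0])
    if any(len(scan) != n_points for scan in scans):
        raise ValueError("all replicate scans must have the same length")
    return [sum(values) / len(scans) for values in zip(*scans)]


axis = wavelength_axis()
# Three hypothetical replicate scans (toy values, only 3 points each).
toy_scans = [[0.10, 0.20, 0.30], [0.12, 0.22, 0.32], [0.14, 0.24, 0.34]]
mean_scan = average_replicates(toy_scans)
```

With these settings each averaged spectrum contains (2500 - 400)/2 + 1 = 1051 points.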
2.3. Sample Set Partitioning
Currently, sample selection methods primarily include the random sampling method, the Kennard-Stone (KS) method, the duplex method, and the sample set partitioning based on joint X-Y distance (SPXY) method. The SPXY method is a sample partitioning method based on the KS method that can be effectively applied to the analysis of spectral calibration models [23]. Compared to the KS method, the SPXY method considers both the x and y variables when calculating the distance between samples. The formula for calculating the distance in the x variable is the same as in the KS method (Equation (1)). Equation (2) gives the formula for calculating the distance in the y variable.
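$$d_x(p,q) = \sqrt{\sum_{j=1}^{J}\left[x_p(j) - x_q(j)\right]^2}, \quad p, q \in [1, N] \tag{1}$$
$$d_y(p,q) = \sqrt{(y_p - y_q)^2} = |y_p - y_q|, \quad p, q \in [1, N] \tag{2}$$
$$d_{xy}(p,q) = \frac{d_x(p,q)}{\max_{p,q \in [1,N]} d_x(p,q)} + \frac{d_y(p,q)}{\max_{p,q \in [1,N]} d_y(p,q)}, \quad p, q \in [1, N] \tag{3}$$
Here, $x_p(j)$ is the spectral response of sample $p$ at the $j$-th wavelength, $J$ is the number of wavelengths, $y_p$ is the y value of sample $p$, and $N$ is the number of samples. The stepwise selection process of the SPXY method is similar to that of the KS method, except that $d_{xy}(p,q)$ replaces $d_x(p,q)$. The distances $d_x(p,q)$ and $d_y(p,q)$ are each divided by their maximum value in the data set to give $d_{xy}(p,q)$ as the standardized xy distance, so that the samples have the same weight in x- and y-spaces. The formula for this calculation is shown in Equation (3).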
In this study, the SPXY method was used to partition the 81 C. nutans samples into a training set and a test set at a 3:1 ratio. There were 61 C. nutans samples in the training set and 20 C. nutans samples in the test set. The details on sample partitioning according to region are shown in Table 1.
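The SPXY selection procedure described in this section can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `spxy_split` is hypothetical, and it assumes the y variable is available as a number per sample (for a classification task, for example, a numeric code for each origin, which is an assumption the article does not spell out).

```python
import math


def spxy_split(X, y, n_train):
    """Return (train_idx, test_idx) chosen by the SPXY rule.

    X: list of spectra (lists of floats); y: list of numeric responses.
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")

    # Pairwise distances in x-space and y-space.
    dx = [[math.dist(X[p], X[q]) for q in range(n)] for p in range(n)]
    dy = [[abs(y[p] - y[q]) for q in range(n)] for p in range(n)]
    max_dx = max(map(max, dx)) or 1.0  # guard against identical samples
    max_dy = max(map(max, dy)) or 1.0

    # Joint distance: each space is scaled by its own maximum.
    dxy = [[dx[p][q] / max_dx + dy[p][q] / max_dy for q in range(n)]
           for p in range(n)]

    # Start with the two most distant samples.
    first, second = max(
        ((p, q) for p in range(n) for q in range(p + 1, n)),
        key=lambda pq: dxy[pq[0]][pq[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Repeatedly add the sample farthest from its nearest selected neighbour.
    while len(selected) < n_train:
        best = max(remaining, key=lambda i: min(dxy[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)

    return selected, remaining
```

For a data set shaped like the one in this study, a call such as `spxy_split(spectra, origin_codes, 61)` would return 61 training indices and 20 test indices, where `spectra` and `origin_codes` are placeholder names for the user's own data.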
2.4. Algorithm
Support vector machine (SVM) is a machine learning method based on statistical learning theory. It has many unique advantages in solving small sample, non-linear, and high-dimensional pattern recognition problems [24].
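The sample training set is represented by $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the input vector and $y_i \in \{-1, +1\}$ is its corresponding expected output. SVM can identify the optimal hyperplane $w \cdot x + b = 0$ (where $w$ is the normal vector of the plane and $b$ is the offset, with $|b|/\|w\|$ being the distance from the plane to the origin) between two categories of data. In cases of linear separability, the data are partitioned into two categories by the plane after classification, and the margin between the two categories of data is $2/\|w\|$. The classifier is:
$$f(x) = \mathrm{sign}\{w \cdot x + b\} = \mathrm{sign}\left\{\sum_{i=1}^{n} a_i y_i (x_i \cdot x) + b\right\} \tag{4}$$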
In cases of nonlinearity, SVM maps data from low-dimensional space to high-dimensional space. The classifier is:
$$f(x) = \mathrm{sign}\left\{\sum_{i=1}^{n} a_i y_i K(x_i, x) + b\right\} \tag{5}$$
Here, sign{} is the sign function, $a_i$ is a Lagrange multiplier, $x_i$ is a training sample, $x$ is a sample to be classified, and $K(x_i, x)$ is a kernel function. Selecting the most appropriate kernel function is the most important step in developing a high-performance SVM model, and usually includes two parts: one is to select an appropriate kernel function type, and the other is to optimize the important parameters after determining the kernel function type. Studies have found that models developed with the radial basis function (RBF) kernel have good learning ability. Therefore, the RBF kernel function was used in this study to implement SVM modelling. The two important parameters of an SVM with the RBF kernel function are the penalty parameter c and the kernel function parameter g. These two parameters have significant effects on the complexity, approximation error, and measurement accuracy of the model. Therefore, it is necessary to optimize these two parameters.
Table 1. Clinacanthus nutans sample partitioning results.
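To make the kernel classifier concrete, the following minimal sketch evaluates the decision rule sign{sum of a_i y_i K(x_i, x) + b} with an RBF kernel. It assumes the commonly used form K(x_i, x) = exp(-g ||x_i - x||^2), which the article does not write out explicitly. The function names are hypothetical, and the multipliers and offset passed in are supplied by hand rather than obtained from training.

```python
import math


def rbf_kernel(x_i, x, g):
    """Assumed RBF form: K(x_i, x) = exp(-g * ||x_i - x||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-g * squared_distance)


def svm_decide(x, support_vectors, labels, multipliers, b, g):
    """Evaluate sign{sum_i a_i * y_i * K(x_i, x) + b} for one sample x."""
    score = b + sum(
        a_i * y_i * rbf_kernel(x_i, x, g)
        for x_i, y_i, a_i in zip(support_vectors, labels, multipliers)
    )
    return 1 if score >= 0 else -1


# Toy, hand-picked values (not a trained model): one support vector per class.
toy_vectors = [[0.0], [2.0]]
toy_labels = [-1, 1]
toy_multipliers = [1.0, 1.0]
near_positive = svm_decide([1.9], toy_vectors, toy_labels, toy_multipliers, 0.0, 1.0)
near_negative = svm_decide([0.1], toy_vectors, toy_labels, toy_multipliers, 0.0, 1.0)
```

A larger g makes the kernel decay faster with distance, so each training sample influences a smaller neighbourhood. The penalty parameter c does not appear in the decision rule itself; in the standard soft-margin formulation it bounds the multipliers during training (0 ≤ a_i ≤ c).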
Commonly used parameter optimization algorithms include the grid search algorithm (GS), the genetic algorithm (GA), and the particle swarm optimization algorithm (PSO). GS is a traversal algorithm that tries all (c, g) parameter pairs on a grid and then finds the (c, g) parameter pair with the highest accuracy, namely the optimal parameters, through cross-validation [25]. GA is a computational model that simulates the natural selection and genetic mechanisms of Darwin’s theory of evolution and is a method of searching for an optimal solution [26]. PSO is a population-based stochastic optimization method. By imitating the swarm behavior of herds, birds, insects, and fish, each member of the group constantly changes its search mode by learning from its own and other members’ experience [27].
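The grid search idea can be sketched in a few lines. The version below is illustrative only: it assumes a power-of-two grid for c and g (a common convention, not one stated in the article), breaks ties in favour of the smaller c (also an assumption), and uses a made-up scoring function `toy_cv_accuracy` in place of a real cross-validation run.

```python
def grid_search(cv_accuracy, c_exponents, g_exponents):
    """Try every (c, g) = (2**i, 2**j) pair and keep the best-scoring one.

    `cv_accuracy(c, g)` must return the cross-validation accuracy for that
    parameter pair. Ties are resolved in favour of the smaller c.
    """
    best = None
    for i in c_exponents:
        for j in g_exponents:
            c, g = 2.0 ** i, 2.0 ** j
            score = cv_accuracy(c, g)
            better = best is None or score > best[0]
            tie_with_smaller_c = best is not None and score == best[0] and c < best[1]
            if better or tie_with_smaller_c:
                best = (score, c, g)
    return best


def toy_cv_accuracy(c, g):
    """Hypothetical stand-in for k-fold cross-validation accuracy."""
    return 1.0 - 0.01 * abs(c - 1.0) - 0.001 * abs(g - 32.0)


best_score, best_c, best_g = grid_search(toy_cv_accuracy, range(-5, 6), range(-5, 8))
```

In practice `cv_accuracy` would train an SVM on each cross-validation fold of the training set and return the mean accuracy. GA and PSO search the same (c, g) space but sample it adaptively instead of exhaustively.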
2.5. Model Evaluation Indicators
Model evaluation is used to compare the effectiveness of different models, for example across parameter settings and feature extraction choices. The performance of classification models is generally evaluated by the accuracy on the test set [28]. The closer the accuracy is to 1, the better the classification effectiveness of the model. Classification accuracy is obtained by applying the established model to the test set and is computed as the ratio of the number of correctly classified samples to the total number of samples. In this experiment, the accuracy and the confusion matrix are used to evaluate the performance of the multi-class model, and the calculation formulas are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{6}$$
(7)
In the equation, TP represents the number of positive samples from the pre-training set that were correctly classified by the model, FN represents the number of positive samples from the pre-training set that were wrongly classified by the model, FP represents the number of negative samples from the pre-training set that were wrongly classified by the model, and TN represents the number of negative samples from the pre-training set that were correctly classified by the model.
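As an illustration of these evaluation quantities, the sketch below builds a multi-class confusion matrix, computes the overall accuracy from its diagonal, and derives TP, FN, FP, and TN for one class treated as positive against the rest. The function names and the six toy labels are invented for the example and are not results from this study.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {label: k for k, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Correctly classified samples (the diagonal) over all samples."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def one_vs_rest_counts(matrix, k):
    """Return (TP, FN, FP, TN) when class k is treated as the positive class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fn, fp, tn


# Toy labels for illustration only.
classes = ["Malaysia", "Hainan", "Guangxi"]
truth = ["Malaysia", "Malaysia", "Malaysia", "Hainan", "Hainan", "Guangxi"]
guess = ["Malaysia", "Malaysia", "Hainan", "Hainan", "Hainan", "Guangxi"]
toy_matrix = confusion_matrix(truth, guess, classes)
toy_accuracy = accuracy(toy_matrix)
```

In the toy case one Malaysian sample is assigned to Hainan, so five of the six samples lie on the diagonal of the matrix.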
3. Results and Discussion
3.1. Spectral Analysis
C. nutans has a complex composition, including saponins, phenolic compounds, flavonoids, diterpenes, and phytosteroids. These substances have different hydrogen-containing groups and can produce specific absorption bands in the NIR spectrum (780 - 2526 nm), as shown in Figure 1. The peaks at 1452 nm and 1939 nm are the two major absorption peaks of water in the NIR region: the 1452 nm peak is the first overtone of the O-H stretching vibration, and the 1939 nm peak is the combination band of O-H stretching and bending vibrations. In addition, the 1771 nm peak is the second overtone of C=O absorption, the 2100 nm peak arises from O-H deformation and C-O stretching vibrations, and the 2276 nm peak is the combination band of C-H stretching and deformation vibrations [29].
Figure 1. Near-infrared (NIR) spectrum of Clinacanthus nutans.
3.2. Principal Component Analysis (PCA)
Due to collinearity between the NIR spectral signals, the information is redundant, and Figure 1 shows little difference among the spectra of the 81 samples. Therefore, it is necessary to reduce the dimensionality of the C. nutans NIR spectra to simplify the data. PCA is a statistical dimensionality-reduction method that uses an orthogonal transformation to convert the original, correlated variables into a new set of uncorrelated variables (the principal components). This reduces the dimensionality of the multidimensional variable system so that it can be converted into a low-dimensional variable system with high precision (Zou et al., 2006). Figure 2 shows the PCA score plots of the NIR spectra of C. nutans. Figure 2(a) is a two-dimensional score plot for PC1 and PC2. It shows that the samples from the three locations had a wide distribution. Compared to the C. nutans samples from Malaysia and Hainan, the samples from Guangxi were more concentrated. Figure 2(b) is a three-dimensional score plot of the first three principal components of C. nutans, showing the projection of sample points in three-dimensional space. The cumulative variance explained by the first three principal components was 95.52%, which indicates that the first three principal components could reflect most of the characteristic information of the original spectrum. The three-dimensional score plot shows that the C. nutans samples from Malaysia are the most dispersed, indicating that there is a large intragroup difference in the C. nutans samples from Malaysia. The samples from the three C. nutans seed locations exhibited large areas of overlap on the PCA score plots. Therefore, PCA alone cannot be used to make a clear judgment on the origin of C. nutans seeds, and further algorithmic processing of the C. nutans NIR spectra is needed in order to develop a model with high classification accuracy and good prediction accuracy.
Figure 2. Principal component analysis (PCA) plot of C. nutans. (a) 2D analysis plot; (b) 3D analysis plot. Black, Malaysian seed origin; red, Hainan seed origin; green, Guangxi seed origin.
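The cumulative explained variance quoted for the principal components can be computed as sketched below. This is a teaching sketch under stated assumptions: it uses power iteration with deflation on the covariance matrix (chosen here only to stay dependency-free; the article does not say how PCA was computed), the function name is hypothetical, and the four toy points are invented.

```python
import math
import random


def explained_variance_ratios(X, n_components, n_iter=500, seed=0):
    """Fraction of total variance captured by each leading principal component.

    Power iteration with deflation: adequate for a small illustration,
    but slow for full-length spectra.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))

    ratios = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start direction
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance along the direction found.
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        ratios.append(eigenvalue / total_variance)
        # Deflate: remove the component just found from the covariance matrix.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return ratios


# Toy data: variance 8/3 along the first axis and 2/3 along the second.
toy_points = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
toy_ratios = explained_variance_ratios(toy_points, 2)
```

Summing the first three ratios for a set of spectra gives the kind of cumulative figure reported in this section; for real spectra a linear algebra library would normally be used instead of this loop.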
3.3. SVM Model Analysis
The SVM has many unique advantages in solving small sample, non-linear, and high-dimensional pattern recognition issues. Thus, the SVM algorithm was used in this study to analyse the NIR spectra of C. nutans, and the three parameter optimization algorithms GS, GA, and PSO were used to optimize the two SVM parameters c and g in order to establish a classification model for C. nutans seed origin with high accuracy and good predictability.
Data pre-processing is an important factor for improving prediction precision in qualitative analysis and modelling. The acquired spectra contain not only the original information of the samples to be tested but also various external interfering information, which can result in some degree of difference between the measured and true values [30]. In order to eliminate errors as much as possible, various data processing methods must be used to reduce the impact of interfering factors, thereby laying the foundation for subsequent data processing. In this study, multiplicative scatter correction (MSC), standard normal variate transformation (SNV), first derivative, and second derivative were used for pre-treatment of spectral data. Figure 3 shows the pre-treated average spectra of the C. nutans samples.
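Three of the pre-treatments named above can be sketched as follows. These are minimal textbook versions, not the study's implementation: the derivative uses a plain central difference (the article does not state which derivative algorithm or smoothing window was used), MSC uses the mean spectrum as its reference, and all function names are hypothetical.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum is regressed on the mean spectrum (s = a + b * ref)
    and then corrected as (s - a) / b.
    """
    n_points = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_points)]
    ref_mean = sum(ref) / n_points
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_points
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


def first_derivative(spectrum, step_nm=2.0):
    """Central-difference first derivative (the two end points are dropped)."""
    return [(spectrum[j + 1] - spectrum[j - 1]) / (2.0 * step_nm)
            for j in range(1, len(spectrum) - 1)]
```

For example, two toy spectra that differ only by an offset and a scale factor, such as `[1, 2, 3]` and `[3, 5, 7]`, are both mapped by `msc` onto their mean spectrum, which is the kind of scatter difference MSC is designed to remove.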
In order to compare the effects of different pre-treatment methods on the accuracy of the C. nutans seed origin model, SVM models with default c and g parameters (default value of c was 1, default value of g was 1/k, where k was the number of categories) were established for the four pre-treatment methods and compared with the original spectra. The model establishment results in Table 2 showed that different pre-treatment methods have different effects on the modelling results. Among them, spectra processed by the first derivative yielded the best model prediction effectiveness, with a training set accuracy of 93.44%, and a test set accuracy of 85.00%.
After determining the best pre-treatment method, the parameters c and g were optimized using GS, GA, and PSO. The parameter optimization process and cross-validation results are shown in Figure 4. Figure 4(a) is a three-dimensional plot of the GS optimization results. As cross-validation accuracy increased, the colour of the grid formed by different c and g values changed from cooler (dark blue) to warmer (bright yellow), and the height of the corresponding grid vertices increased accordingly. When c = 1 and g = 27.8576, the cross-validation accuracy reached a maximum of 96.72%. Figure 4(b) shows the contour map of the GS parameter optimization results, obtained by projecting Figure 4(a) onto a two-dimensional plane. Figure 4(c) is a plot of the GA optimization results. The best-fit curve shows that the cross-validation accuracy continued to increase over the first 25 iterations and then saturated at 96.72%, at which point c = 1.6327 and g = 55.3856. Figure 4(d) is a plot of the PSO optimization results. After 50 iterations, the cross-validation accuracy was stable at 97.62%, with an optimal penalty parameter c = 0.8343 and kernel function parameter g = 57.8741.
Figure 3. Pre-treated average spectra of C. nutans samples. (a) Multiplicative scatter correction (MSC); (b) Standard normal variate (SNV); (c) First derivative; (d) Second derivative.
Table 2. Support vector machine (SVM) model classification results using different pre-treatments.
*MSC, multiplicative scatter correction; SNV, standard normal variate.
Figure 4. Results of optimizing the parameters c and g in support vector machine (SVM) models. (a) 3D plot of grid search (GS) optimization results; (b) Contour plot of GS optimization results; (c) Genetic algorithm (GA) optimization results; (d) Particle swarm optimization (PSO) results.
After optimization of c and g by the three algorithms GS, GA, and PSO, the cross-validation accuracy was at least 96.72% in every case. In the next step, the optimal values of c and g were used to establish SVM models, and the test set accuracy was used to select the best SVM model. These results are shown in Table 3. The prediction accuracy of the SVM model was greatly improved after optimization of c and g: the test set accuracy reached 95.00% for all three optimization algorithms. Of the three, PSO yielded the best accuracy (its cross-validation accuracy of 97.62% was the highest) and the smallest value of the penalty parameter c; therefore, the parameter pair found by PSO was selected as the optimal parameters. The penalty parameter c = 0.8343 and kernel function parameter g = 57.8741 corresponded to the best SVM model for C. nutans seed origin, with a training set accuracy of 98.36% (60/61) and a test set accuracy of 95.00% (19/20). The specific results are represented by the confusion matrix in Figure 5.
Table 3. SVM model classification results after parameter optimization.
Figure 5. Confusion matrix of SVM model classification results with PSO.
4. Conclusion
In this study, a classification model for the origin of C. nutans seeds based on NIR spectroscopy was developed. NIR spectroscopy data were collected from 81 C. nutans samples from three geographic locations: Malaysia, Hainan (China), and Guangxi (China). PCA of the acquired NIR spectra showed that the samples from the three geographic locations were dispersed and overlapped in the PCA score plot. Therefore, the NIR spectra of C. nutans were further analyzed by SVM modelling. Before the SVM model was established, the spectra were pre-treated with MSC, SNV, the first derivative, and the second derivative; the spectra obtained after first derivative processing achieved the best modelling results. The optimal parameters c and g were then found using optimization algorithms. The model parameters c = 0.8343 and g = 57.8741 resulted in a training set accuracy of 98.36% (60/61) and a test set accuracy of 95.00% (19/20). This method for tracing C. nutans seed origin based on NIR spectroscopy combined with chemometrics has the advantages of being simple, rapid, and green.
Acknowledgements
This work was supported by the China Postdoctoral Science Foundation (2019M663360), the National Natural Science Foundation of China (61975069), the Guangdong Provincial Science and Technology Foundation (2014a020221068), the Discipline Construction Project of Guangdong Medical University (4SG21009G), and the Funds for PhD Researchers of Guangdong Medical University in 2021 (4SG21252G). We also gratefully acknowledge our colleagues for their stimulating discussions in this field.