Combined Use of k-Mer Numerical Features and Position-Specific Categorical Features in Fixed-Length DNA Sequence Classification ()
1. Introduction
In recent years, biological data have been generated at a tremendous rate. According to [ 1 ], the number of DNA sequences contained in GenBank repository increased dramatically from 116,461,672 to 181,336,445 between February 2010 and February 2015. The sequences in UniProt doubled during the period of just one year, from 40.4 (June 2013) to 80.7 (August 2014) million [ 2 ]. Analysis and interpretation of these data are two of the most crucial tasks in bioinformatics, and classification and prediction methods are key techniques to address such tasks.
As summarized by Xing et al. [ 3 ], the sequence classification methods can be categorized into three main groups. The first class is distance-based method, which defines distance functions to compute the similarity between two sequences. After that, some of the current classification methods, such as k-nearest neighbor classifier, are applied. The second category is feature-based methods. Before employing conventional algorithms such as decision trees and Support Vector Machine (SVM) to address the problem, sequences are converted into numerical feature vectors. In order to improve the accuracy of prediction, feature selection plays a key role in this type of methods. The last type is model-based classification, which applies hidden Markov model (HMM) and other statistical models to perform sequence classification.
With regard to the first category, in the research by Borozan et al. [ 4 ] in 2015, they exploited the complementarity between alignment-free and alignment-based similarity measures to improve biological sequence classification performance. They used five different sequence similarity measures: three of them were alignment-free and two of them were alignment-based, which revealed that their model outperformed previous models. In 2014, Chen et al. [ 5 ] also tackled the problem of categorical data in a typical distance-based manner. They defined four weighted functions for categorical features, two of them named as simple matching coefficient measures with global weights (WSMCglobal) and the other two named as simple matching coefficient measures with local weights (WSMClocal), then applied these functions to formulate new nearest neighbor classification algorithms. The classifiers were evaluated by using real datasets and biological datasets. The results showed that their proposed classifiers outperformed the traditional methods.
Moving to the second class, the application of feature selection technique and feature-based method to classify protein sequence data was carried out by Iqbal et al. [ 6 ] in 2014. The experimental results of their research showed that their model significantly improved in terms of accuracy, sensitivity, specificity, F-measure, and other performance measure metrics. In the study of Weitschek et al. [ 7 ] in 2015, they used the combination of alignment-free approaches and rule-based classifiers so as to classify biological sequences. At first, the biological sequences were converted into numerical feature vectors with alignment-free techniques, then rule-based classifiers were applied in order to assign them to their taxa.
The study about classifying occupancy, acetylation, and methylation of nucleosomes was carried out by Pham et al. [ 8 ]. Their method was also a kind of feature-based classification, which converted sequences into numerical feature vectors, then applied a conventional classification method. They adopted SVM with RBF kernel, and feature vectors were k-mer based vectors with a variety of window sizes (k = 3, 4, 5, 6, etc.). Using 10 datasets derived from a research of Pokholok et al. [ 9 ], they gained a high prediction accuracy. In order to improve prediction accuracy, a technique termed feature selection was used by Higashihara et al. [ 10 ] to solve this problem. In this research, the importance of features was first measured by MeanDecreaseGini value computed through training and prediction by random forest, then features were ranked as the order from the most to least importance. Exploiting feature selection along feature ranking, they achieved slight improvements in prediction accuracy. What is more, by searching neighbors of the best feature subset, accuracy of prediction improved further.
Although the active researches on sequence classification above, numerical and categorical features were separately studied until now. Since the numerical features like k-mer are typically position-indepen- dent and categorical features like nucleotide at a position are position-specific, we can expect that these two types of features could contribute to the classification performance in a complementary manner. In addition, it is still unclear how effective a feature selection algorithm is against the union of numerical and categorical features of sequence. In this study, we propose an effective framework for improving fixed-length DNA sequence classification by using the combination of categorical features (i.e. subsequence at a position like “A”, “AG”, etc.) and numerical features (i.e. k-mer frequency). By conducting feature selection on this mixture of features, we could also find which type of features is more effective in each dataset.
The remainder of this paper is organized as follows. In Section 2, we describe the validation datasets and how the model works. We present the experiments and results of evaluating the model in Section 3. Finally, some conclusions and discussions are given in Section 4.
2. Materials and Methods
2.1. Datasets
To show the validity of the proposed method in dealing with genetic sequence classification problem, we applied our approach to six datasets listed in Table 1.
2.1.1. UCI Datasets
We chose the benchmark datasets from UCI machine learning repository, Splice and Promoter datasets, for evaluation of our model. These datasets were used in researches [ 11 , 12 ]. The Splice dataset is about the splice-junction gene sequences. There are two types of splice junctions. The exon-intron “EI” is the part of DNA sequence ranging from the ending of an exon and the starting of an intron while intron-exon “IE” is a region of DNA between the ending of an intron and beginning of exon. The part of sequence which does not belong to “IE” and “EI” is called no junction “N”. This dataset is composed of 3175 labeled samples and each sample has the length of 60 base pair.
During RNA transcription process, transcription factors such as RNA polymerase and accessory proteins bind to the promoter region and carry out the initiation of transcription. Promoter parts are DNA sequences located adjacent to the initial sites of transcription. Promoter dataset consists of 106 labeled promoter sequences, “Positive” and “Negative”, with length of 57 base pair. “Positive” sequence contains a DNA region from promoter whereas “Negative” sequence does not include a DNA from promoter.
2.1.2. Nucleosome Benchmark Datasets
The other four datasets are about nucleosome forming and inhibiting sequences of four species (H. sapiens, C. elegans, D. melanogaster, and S. cerevisiae). The first three datasets were collected by Guo et al. [ 13 ]. These datasets were previously used in the research [ 13 - 15 ] and their details are described as follows. Human (H. sapiens) involved 2273 nucleosome-forming sequences (positive) and 2300 nucleosome-inhibiting sequences (negative). Worm (C. elegans) includes 2567 nucleosome-forming sequences (positive) and 2608 nucleosome-inhibiting sequences (negative). Fly (D. melanogaster) contains 2900 nucleosome-forming sequences (positive) and 2850 nucleosome-inhibiting sequences (negative). All the sequences in these three datasets have the same length of 147 base pair. In addition, Yeast (S. cerevisiae) consists of 1880 nucleosome-forming sequences (positive) and 1740 nucleosome-inhibiting sequences
(negative). Each of these sequences has the length of 150 base pair and this dataset was used in [ 15 , 18 , 19 ].
2.2. Features
In this study, we used the combination of the five different vectors named as 1-categorical vector (1CAT), 2-categorical vector (2CAT), 2-mer vector (2MER), 3-mer vector (3MER), and 4-mer vector (4MER). Given a biological sequence s of length n, S1, S2, ∙∙∙, Sn, where Si Î {A, C, G, T} and i = 1, 2, ∙∙∙, n, each of these vectors can be defined as follows:
2.2.1. 1-Categorical Vector (1CAT)
1CAT = (A1, A2, ∙∙∙, An), where n is the length of sequence s and Ai is a nucleotide at position ith, i = 1, 2, ∙∙∙, n. For example, with sequence s, AGGTCCTACT, 1CAT = (A, G, G, T, C, C, T, A, C, T).
2.2.2. 2-Categorical Vector (2CAT)
2CAT = (B1, B2, ∙∙∙, Bn−1), where n is the length of DNA sequence s and Bi is two consecutive nucleotides from position ith to position at (i + 1)th, i = 1, 2, ∙∙∙, n − 1. For instance, with sequence s above, 2CAT = (AG, GG, GT, TC, CC, CT, TA, AC, CT).
2.2.3. 2-Mer Vector (2MER), 3-Mer Vector (3MER), and 4-Mer Vector (4MER)
In term of biological sequence, k-mers can be defined as all possible subsequences of length k within a sequence. A k-mer is a string of k successive nucleotides contained the genetic sequence and there are 4k possible k-mers:
. The k-mer vector denoted as kMER is defined by
, where
is a number of occurrences of the
in a sequence s and i
= 1, 2, ∙∙∙, 4k. Therefore, using sequence s above, 2MER will be (0, 1, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0).
2.3. Algorithm
The proposed algorithm consists of main four steps. The flowchart of our algorithm is shown in Figure 1, and works as below:
1) Block A in Figure 1 is in charge of converting DNA sequences into feature vectors.
2) At Block B in Figure 1, feature ranking is conducted by the randomForest function for R [ 17 ].
3) Block C in Figure 1 is responsible for feature selection by performing learning and predicting with the ksvm function for R in kernlab package [ 16 ]. Each feature subset is evaluated by the average of prediction accuracies of 10-fold cross-validation.
4) The prediction performance of the proposed approach is achieved at Block D in Figure 1 where the best feature subset {f1, ∙∙∙, fk} obtained in the previous step are used to evaluate by 10-fold cross-valida- tion ten times. Herein, the best feature subset is the feature subset with the best accuracy.
2.4. Feature Selection
Nowadays, there has been a remarkable increase the number of researches exploiting feature selection techniques. They have been applied to both supervised learning and unsupervised learning. Their aims are threefold: the most important one being to avoid overfitting and improve model performance, the second advantage being to reduce computational time and space required to execute models, and the final goal being to identify which features are relevant to a problem and to gain a deeper insight into the data.
The feature selection approach used in our research is a kind of greedy algorithm, and works as two following steps.
Step 1) With pre-calculated feature set F = {f1, f2, ∙∙∙, fm} being ordered top-to-bottom as highest-to- smallest important values, we evaluate a feature subset {f1, f2, ∙∙∙, f10}, a feature subset {f1, f2, ∙∙∙, f20}, and so on, until a feature subset {f1, f2, ∙∙∙, fm} by conducting training and predicting with ksvm function [ 16 ].
Step 2) Neighbors of the feature subset with the best accuracy of prediction in the preceding step are tested.
Figure 1. The flowchart of the proposed algorithm.
3. Experiments and Results
3.1. Feature Ranking by Random Forest
Random forests are well-known ensemble learning method which can conduct both classification and regression. Apart from these tasks, the randomForest function for R in randomForest package [ 17 ] can measure the importance of all features by Mean Decrease Accuracy or Mean Decrease Gini values. In this research we adopted the Mean Decrease Accuracy value as the importance of features. The connection between rank and Mean Decrease Accuracy normalized into the range between 0 and 1 in each dataset is shown in the Figure 2.
In general, there is a sharp decrease in the importance of features in Promoter and Human datasets in the areas of top 1 - 5. This is then followed by a steady decline trend in the rest region. With Splice, Worm and Fly datasets, the importance of features fall slowly in the region of from top 1 to 16. The importance in the remainder declines gradually.
Features with high importance in validation datasets are listed in Table 2. From this table it is clear that Human, Worm, and Fly datasets have features with high importance containing mainly “A” (adenine) and “T” (thymine). However, the percentage of “C” (cytosine) and “G” (guanine) increase slightly in the Fly dataset. More observations are that TTT, TTTT, AA, AAA, AAAA, AAAT, ATTT features are highly important in Human and Worm datasets. It is similar to the case of Fly dataset where TTT, TT, TTTT,
Figure 2. Mean Decrease Accuracy along feature ranking from top 1 - 60.
Table 2. List of important features.
ATA, AAAA, TAT are so important. AAAA, TTTT, TA, AAA, TTT, TAT, ATA are highly important for Yeast dataset. This coincides with the results in the research of Higashihara et al. [ 10 ] for classification of nucleosome datasets. The research showed that T and A were both highly important. Additionally in Table 2, it was partially demonstrated that the combination of numerical and categorical might be effective. In case of Worm dataset, the first and the fourth important features are categorical (B1 and A1), and others are numerical.
For Splice and Promoter datasets, however, features in 2CAT vectors are so important. B30, B29, B31, B28 features are highly important in Splice dataset, which means that the nucleotides around the center of splice sequences play a vital role in prediction. This finding agrees with the structure of splice site sequences where splice-junctions are at the middle of sequences. B17, B16, B15, B14 are highly important in Promoter dataset. Figure 3 demonstrates the relationship between features in 2CAT vectors of Splice and Promoter datasets and Mean Decrease Accuracy normalized into the interval of [0, 1]. The figure illustrates that the highly important features in 2CAT vector of Splice dataset are located at the region of from 25 to 35. While those of Promoter dataset settled at the area of between 12 to 18.
3.2. Prediction Accuracy of Feature Subsets along Ranking
As described in step 1 in subsection 2.4, feature subsets along the ranking were assessed by SVM. With Human, Worm and Fly datasets, there are 63 different feature subsets at intervals of 10: {f1, f2, ∙∙∙, f10}, {f1, f2, ∙∙∙, f20}, {f1, f2, ∙∙∙, f30}, ∙∙∙, {f1, f2, ∙∙∙, fm} being tested. The prediction accuracy is based on the average accuracy of 10-fold cross-validation. However, there are around 45 feature subsets for Promoter dataset and Splice dataset. The results of prediction are demonstrated in Table 3.
3.3. Prediction Accuracy of Neighbors around the Best Feature Subset
With best feature subset obtained in subsection 3.2, we conducted step 2 in subsection 2.4. The results and the number of features in each feature subset are presented in Table 4.
3.4. Evaluation
3.4.1. Evaluation Metrics
The six key terms are used to computing classification evaluation metrics. Positives (P) is the number of positive samples. Negatives (N) is the number of negative samples. True positives (TP) is the number of the positive samples that were correctly classified by the classifier. True negatives (TN) is the number of the negative samples that were correctly classified by the classifier. False positives (FP) is the number of the negative samples that were incorrectly classified as positive. False negatives (FN) is the number of the positive samples that are misclassified as negative.
(1)
(a) (b)
Figure 3. Mean Decrease Accuracy of features in 2CAT vector on (a) Splice and (b) Promoter datasets.
Table 3. Prediction accuracies obtained by using either the whole set of features and the best feature subset in step 1.
Table 4. Prediction accuracies in step 2 compared with those in step 1.
(2)
(3)
#Math_14# (4)
3.4.2. Performance Evaluation of the Method
Using the best feature subsets obtained in the step 2 in subsection 2.4, we applied our model to classify the DNA sequences in the validation datasets and compared its performance with the previous research. For evaluation, we mainly carried out 10-fold cross-validation ten times, and then computed average prediction results. With Promoter data, however, we employed leave one out ten times due to the fact that the number of its samples is small, 106 samples.
For Splice and Promoter datasets, we compared the performance of proposed model with the performance of previous model conducted by Nguyen et al. [ 12 ]. The results from this research are known as the best performance prior to our research. The motivation behind this model was the desire to apply a deep learning model for text classification to DNA sequence classification. At first, the researchers translated DNA sequence into sequence of words as a text sentence, then applied the representation technique for text to this produced sequence. Lastly, two-dimensional matrices representing DNA sequences using one hot vectors were directly used as input to the convolutional neural network model.
However, for Human, Worm and Fly datasets, the performance of our model was compared with the results taken from researches [ 13 - 15 ]. The predictors iNuc-PseKNC was proposed by Guo et al. in 2014 [ 13 ] and two years later, Tahir and Hayat introduced the predictors iNuc-PseSTNC in 2016 [ 14 ]. Both predictors achieved better results on predicting nucleosome positioning than previous predictors. In late 2016, Awazu developed two models predicting nucleosome positioning called as 3LS and TNS models [ 15 ].
For Yeast dataset, we compared our results with those taken from [ 15 , 18 , 19 ]. Chen et al. [ 18 ] developed the predictor based on DNA deformation energy in 2015 while Yi et al. [ 19 ] introduced the predictor based on the nearest neighbor algorithm in 2012.
3.5. Comparison with Other Methods
As can be seen from Table 5, the prediction accuracy of our method for Promoter dataset reached 100%. This means that all samples in this dataset were correctly predicted by our proposed model. This result has not obtained by any previous methods. Our method also achieved the high prediction accuracy for Splice dataset with 96.81%.
To confirm whether the average of prediction accuracies of our method and the previous method are significantly different, we performed the two-sample t-test assuming equal variances. The p-values of these t-test comparisons are illustrated in Table 6. All the p-values are far smaller than 0.05, which means that our method outperforms the previous research in the term of prediction accuracy on these two datasets.
With Human, Worm and Fly datasets, we compared the performance of the proposed model and the performances of models in [ 13 - 15 ] on four metrics: Accuracy, sensitivity, specificity and Matthews correlation coefficient. Table 7 indicates the results of all methods in detail. From this table, the first thing to note is that our method outperformed all of competing methods on Worm dataset with Acc of 89.35%, Sen of 92.45% and MCC of 0.79. The second result worth pointing out is that on the Fly dataset our model also achieved better results than those of the other previous models with Acc of 81.75%, Sen of 79.14%, Sp of 84.40% and MCC of 0.64 except 3LS. Moreover, on Human dataset, the prediction Acc of the proposed method (86.33%) was higher than that of iNuc-PseKNC, TNS but lower than iNuc-PseSTNC and 3LS. Although iNuc-PseKNC model achieved the same MCC (0.73) with our model, Acc, Sen of our method were better than those of iNuc-PseKNC except for sensitivity. For Yeast dataset, our method and TNS completely outperformed the previous methods. Our model achieved the Acc of 100%, Sen of 100%, Sp of 100% and MCC of 1.0.
4. Discussion and ConclusionS
In this research, we proposed a simple but powerful model for solving DNA sequence classification problems. The model was tested on six different datasets: Splice, Promoter, Human, Worm, Fly, and Yeast datasets. On Splice and Promoter datasets, the experimental results show that there was a significant increase in the performance of our model. The improvements were also proved by performing the two- sample t-test assuming equal variances, and all p-values were less than 0.05. Especially, the proposed mo- del reached the accuracy of 100% on Promoter and Yeast datasets.
We also compared our model with the other four models: iNuc-PseKNC [ 13 ], iNuc-PseSTNC [ 14 ], TNS and 3LS [ 15 ]. In terms of accuracy, sensitivity and MCC, our method achieved better performance than any other competing method for predicting nucleosome positioning in worm genome. For fly genome, the proposed method also outperformed the other methods except 3LS model. For predicting nuc-
Table 5. Prediction accuracy comparison of proposed model and accuracy in [ 12 ].
Table 6. Assessment of our model and model in [ 12 ] by two-sample t-test assuming equal variances.
Table 7. Performance comparison of our model and previous models.
leosome positioning in human genome, our method performance was higher than iNuc-PseKNC and TNS, but lower than the other two models. Therefore, it can be concluded that our model is effective for DNA sequence classification.
The combination vector can reflect not only the categorical features of DNA sequence, but also the numerical features of sequence. It can characterize a genetic sequence. Moreover, we utilized the ability of executing categorical data and numerical data of random forest and SVM to solve our problem. We also made use of the advantages of random forest in automatically producing variable importance to rank features, then applied the feature ranking to conduct feature selection. The used feature selection technique is a greedy based on technique which does not learning and predicting on all possible feature subsets. This can reduce dramatically computational cost. However, one limitation of this model is that all DNA sequences in one dataset need to be the same length.
Due to the fact that our proposed model was successful in classifying DNA sequence data, in the future, the proposed model can be extended to other areas of sequence recognition like the classification protein sequence data.
Acknowledgements
In this research, the super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo. Additional computation time was provided by the super computer system in Research Organization of Information and Systems (ROIS), National Institute of Genetics (NIG). This work was supported by JSPS KAKENHI Grant Number 26330328.