1. Introduction
As economies have developed, investor sentiment has received growing attention. The efficient market hypothesis (EMH) has been a core pillar of modern financial theory since the 1960s. According to Fama, in an efficient market the price fully reflects all available information [1]. However, financial markets are considered complex non-linear systems, and predicting stock prices by technical means is very challenging [2] [3] [4] [5]. Market anomalies have been observed that contradict the basic assumptions of the EMH, under which the prediction of share prices should not be possible [6] [7] [8] [9].
In recent years, financial economists have studied the financial behavior of investors from the perspective of human science, which has given rise to a new field of financial research, behavioral finance, tracing back to the early 1990s [10] - [15]. Within it, an important branch that takes investor sentiment as its research object has gradually emerged as information technology has boomed. Single events (e.g., sports results, the daylight saving anomaly) and continuous effects (e.g., weather, air pollution) influence people’s emotions [16] [17] [18] [19]. The prediction of share returns based on mood states can be seen as a market anomaly that contradicts the efficient market hypothesis [20]. These mood-related anomalies can be explained by the misattribution bias, according to which people make risky decisions depending on their mood states [21]. The Affect Infusion Model (AIM), which postulates that people in a positive mood rely on positive cues to make decisions, can explain the relationship between positive and negative mood states and risk-taking tendency [22] [23] [24] [25].
Because individual emotion is a vague concept, previous research has made significant progress on sentiment techniques that track indicators of public mood directly from social media content, such as Facebook and Twitter feeds [26] [27] [28] [29] [30]. In a seminal work using cross-validated time series, Bollen et al. compared the ability of two mood-tracking tools, OpinionFinder and Google-Profile of Mood States, to detect the public response in daily Twitter feeds to the Dow Jones Industrial Average during the presidential election and Thanksgiving Day [31].
Research on sentiment analysis is not limited to text processing; it extends to machine learning, where it has achieved high accuracy. Leung et al. and Chen et al. introduced data mining techniques for predicting the sign of stock market index movements, and Schumaker et al. predicted the S&P 500 index with an SVM, using four text feature vectors to represent the emotional dimension of the entire text and reaching an accuracy of 58.2% [32] [33] [34]. Hassan, Nath, and Kirley proposed and implemented a fusion model combining a Hidden Markov Model (HMM), Artificial Neural Networks (ANN) and Genetic Algorithms (GA) to forecast financial market behavior [35]. Kumar & Thenmozhi compared five approaches, namely SVM, random forest, neural network, logit and LDA, for predicting Indian stock index movement from economic indicators [36].
In this paper, we analyze individual sentiment by assessing the accuracy of seven machine learning algorithms in classifying stock comments into positive and negative classes. Eastmoney is China’s most popular dedicated community for financial professionals, with average daily traffic exceeding 200 million, making it the preferred platform for domestic investors to interact. We compare the accuracy of these classifiers using a single feature model, unigram TF-IDF. Rather than proposing an optimal prediction model, we assess the effect of including public mood information on the accuracy of a “baseline” prediction model.
2. Methods and Materials
2.1. Methods
As shown in Figure 1, the methodology proceeds in three phases.
2.1.1. Data Preparation and Feature Engineering
In the first phase, after data pre-processing (word segmentation, stop word removal and tokenization), we compute the unigram TF-IDF metric, a measure of a word's importance in a document. TF-IDF for a term t is defined as the product of its term frequency TF(t) and its inverse document frequency IDF(t). TF measures how frequently a term (feature) occurs in a comment. Because comments differ in length, a term may appear many more times in a long comment than in a short one. The term frequency is therefore often divided by the comment length as a way of normalization. The normalized TF for a given term t is defined as:
$$\mathrm{TF}(t) = \frac{n}{N} \tag{1}$$
where n is the number of times term t occurs in the comment and N is the total number of terms in the comment.
Figure 1. Diagram outlining the methodology overview.
In contrast, IDF measures the importance of terms based on how frequently they appear across multiple comments. Intuitively, a term that appears frequently in a single comment is important and gets a high weight. However, if the term appears in many comments, it becomes less discriminative; hence, IDF reduces its weight. IDF for a term t is given by:
$$\mathrm{IDF}(t) = \log\frac{Q}{q} \tag{2}$$
where q is the number of comments containing term t and Q is the total number of comments.
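To make the two definitions concrete, the following is a minimal sketch of the TF-IDF weighting in plain Python. It assumes the comments are already segmented into tokens, that the logarithm in the IDF is the natural logarithm, and that a term absent from every comment gets zero weight; the function names and the toy corpus are illustrative and not taken from the study.

```python
import math
from collections import Counter


def tf(term, tokens):
    """Normalized term frequency n / N for one tokenized comment."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus):
    """Inverse document frequency log(Q / q) over a list of tokenized comments."""
    q = sum(1 for tokens in corpus if term in tokens)
    if q == 0:
        return 0.0  # assumption: a term seen in no comment gets zero weight
    return math.log(len(corpus) / q)  # assumption: natural logarithm


def tf_idf(term, tokens, corpus):
    """TF-IDF weight of a term in one comment, relative to the whole corpus."""
    return tf(term, tokens) * idf(term, corpus)


# Toy, already-segmented comments (hypothetical data).
corpus = [
    ["market", "up", "up", "buy"],
    ["market", "down", "sell"],
    ["market", "flat"],
]

weight_up = tf_idf("up", corpus[0], corpus)          # frequent in one comment only
weight_market = tf_idf("market", corpus[0], corpus)  # present in every comment
```

In this toy corpus "market" occurs in every comment, so its IDF and hence its TF-IDF weight is zero, while "up" is concentrated in one comment and keeps a positive weight.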
2.1.2. K-Fold Cross Validation with Multiple Machine Learning Algorithms
In the second phase, we build a bag-of-words representation from messages that were manually sorted into positive and negative classes. We train the models with K-fold cross-validation: the data are divided into 5 splits, with 80% of the observations used for training and the remaining 20% for testing. We apply multiple machine learning algorithms to analyze emotional polarity (Table 1).
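The five-way split can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the folds are contiguous and unshuffled, the classifier is passed in as a pair of hypothetical `fit`/`predict` callables, and a majority-label stand-in replaces the real models.

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(samples, labels, fit, predict, k=5):
    """Return per-fold accuracy for a classifier given as fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Hypothetical stand-in classifier: always predicts the majority training label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Toy data (hypothetical): 1 = positive comment, 0 = negative comment.
samples = ["c%d" % i for i in range(10)]
labels = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
scores = cross_validate(samples, labels, fit_majority, predict_majority, k=5)
mean_accuracy = sum(scores) / len(scores)
```

With k = 5 each fold trains on 80% of the data and tests on the other 20%. If the labeled messages are stored grouped by class, shuffling them before splitting would be needed so that every fold contains both classes.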
Table 1. Machine learning classifiers overview.
Since the problem is one of classification, we test the efficacy of each classifier. Accuracy and F1-score are used to evaluate the performance of the proposed models. Computing these measures requires estimating precision and recall, which are derived from the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). These quantities are defined in Equations (3)-(8). Since the prediction has two dimensions, i.e., true or false and negative or positive, we have the verification matrix (Table 2):
(3)
(4)
(5)
(6)
where P is the precision of the model, R is the recall, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
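Combining the two through their harmonic mean, we calculate the F1-score, which, together with the accuracy, is defined as:
$$F_1 = \frac{2PR}{P + R} \tag{7}$$
$$A = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
where F1 is the F1-score of the model and A is the accuracy of the model.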
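As a worked illustration of these measures, the sketch below computes precision, recall, F1-score and accuracy from the four confusion-matrix counts, using the standard textbook definitions. The function names, the zero-division convention and the example counts are assumptions made for illustration, not values from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary labeling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F1-score and accuracy from the four counts."""
    # Assumption: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for 100 test comments.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because the F1-score is a harmonic mean, it stays close to the smaller of precision and recall, which makes it a stricter summary than accuracy when the two diverge.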
2.1.3. Bivariate Correlation Analysis for the Two Time Series
In the third phase, we select the model with the best accuracy and examine the relationship between bullish sentiment and the stock market trend. The bull/bear ratio is a market-sentiment indicator that reflects how market professionals feel about the market and how they are likely to advise their clients to invest based on those feelings. In this paper, we define the bullish indicator as:
(9)
Table 2. Positive and negative-accuracy verification matrix.
For bivariate correlation analysis, three coefficients are commonly used: the Pearson, Spearman and Kendall coefficients. Among them, we choose the Pearson correlation coefficient, which measures the linear relationship between two variables.
The Pearson correlation coefficient formula is as follows:
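$$r = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m}(y_i - \bar{y})^2}} \tag{10}$$
where $x_i$ and $y_i$ are the paired observations of the two series, $\bar{x}$ and $\bar{y}$ are their means, and m is the number of paired observations.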
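The Pearson coefficient can be computed directly from its standard definition. The sketch below is a plain-Python illustration; the function name `pearson_r` and the two short series are hypothetical stand-ins for the daily sentiment indicator and the closing prices, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical daily values: a sentiment indicator and an index level.
sentiment = [1.0, 2.0, 3.0, 4.0]
index_level = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(sentiment, index_level)
```

The coefficient lies between -1 and 1. The two series must be aligned on the same trading days before it is computed, since the formula pairs observations by position.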
2.2. Data
We perform analysis on the Shanghai Composite Index. All price data and comments data are drawn from the period between April 2017 and May 2018, totaling 266 trading days. Two main datasets were used.
2.2.1. Comments Data
Comments data were collected from the Eastmoney financial forum (http://guba.eastmoney.com/) in CSV format and contain over 480,000 messages. In addition, we manually labeled about 5000 positive messages and 5000 negative messages.
2.2.2. Price Data
Daily split-adjusted price data for the Shanghai Composite Index were collected via Tushare, a Python module that provides stock price data in dataframe format. We use only the closing prices.
3. Results and Discussions
As shown in Table 3 and Figure 2, the chosen algorithms classify both positive and negative sentiment well, with a worst-case accuracy of 75%; SVC yielded the best accuracy, 88%.
Table 3. The test accuracy for each of the learning models.
We choose SVM as the basic classification algorithm for our prediction model. We calculate the time series of the sentiment indicator through the bullish index (BI) and plot it together with the time series of stock prices (Figure 3). The BI and the Shanghai Composite Index were then selected as variables for a Pearson correlation test. The two series yielded a statistically significant Pearson correlation coefficient of 0.689 (Table 4).
Figure 2. Diagram showing the test accuracy according to the four measurements.
Figure 3. The two merged time series graph consisting of bullish sentiment and stock market trend.
Table 4. Correlation test result (**: Correlation is significant at the 0.01 level).
4. Conclusions
This research focused on predicting the direction of stocks and stock price indices. The prediction performance of seven models, namely SVM, logistic regression, naive Bayes, KNN, decision tree, random forest and AdaBoost, is compared on one year of Shanghai Composite Index data and comments from the Eastmoney platform.
Experiments with continuous-valued data show that the AdaBoost model performs worst, with 77.2% accuracy, and SVM performs best, with 88.16% accuracy. The SVM classifier fits binary classification problems well; since we divide emotions into positive and negative, SVM is the most suitable classifier. Although all seven classification algorithms achieve good results, none exceeds 90 percent accuracy. On the one hand, Chinese words are more complex than English ones. On the other hand, most natural language processing tools are aimed at English and are not well suited to Chinese.
Further research will extend the technical indicator’s opinion about stock price movement to finer categories such as “highly possible to go up”, “highly possible to go down”, “less possible to go up”, “less possible to go down” and “neutral signal”. This may give more accurate input to the inference engine of the sentiment analysis algorithms. Besides calculating the correlation coefficient of the two time series, future work will analyze stocks’ long-term quarterly performance, using an ARIMA model with exogenous variables for the empirical test.
Acknowledgements
Upon the completion of this thesis, we would like to take this opportunity to express our sincere gratitude to our supervisor, Professor Patrick Houlihan, who has given us important guidance on the thesis. Without his help and encouragement, our thesis would have been impossible. Besides his help with our thesis, he has also given us much advice on research methods, which is of great value to our future academic life.
We are also obliged to the authors of the references, whose work has broadened our vision of data science and helped us lay the necessary foundation for writing this thesis.
Last but not least, we would like to express our gratitude to all the friends and family members who have previously offered their help.
NOTES
*These are co-first authors, sorted alphabetically by last name.