Analysing Effectiveness of Sentiments in Social Media Data Using Machine Learning Techniques
1. Introduction
Social media and digital platforms produce massive amounts of data every second due to the growth of online shopping and e-commerce, gathering consumer reviews and opinions about a wide variety of products. These reviews, which offer important insights into customer sentiment and product quality, are increasingly being kept in online repositories and databases. This review data is an essential tool for businesses looking to better understand customer perceptions and enhance their products, as well as for potential customers. Effective analysis of this data necessitates sophisticated data mining and machine learning techniques, which can reveal patterns and trends in user feedback that would be challenging to find by hand. One popular use of data mining and machine learning is sentiment analysis, which aims to evaluate the thoughts and feelings that customers express in their reviews. Businesses can determine the perceived quality and usability of their products by analyzing customer sentiment, which is frequently divided into three categories: positive, negative, and neutral.
Machine learning techniques such as Support Vector Machines (SVM), Naive Bayes (NB), and Random Forest (RF) are central to this analysis because they can automatically classify and predict sentiments from review data. This work applies a number of machine learning algorithms to headphone review data gathered from online repositories in order to categorize the sentiments that customers express.
The architecture for processing and evaluating customer reviews to ascertain sentiment polarity, particularly in the Boat Headphone dataset gathered from Flipkart, is depicted in Figure 1. Customer reviews are the first step in the process, and they are prepared using preprocessing techniques: lowercase conversion, stop word removal, punctuation and symbol removal, stemming, tokenization, and emoji removal. Following preprocessing, the data is converted into a numerical format so that machine learning algorithms can use it.
![]()
Figure 1. Architecture of research work.
Once processed, the data is fed into different classification algorithms for sentiment analysis: Random Forest, Decision Tree, Support Vector Machine (SVM), and Naive Bayes. Alongside these conventional algorithms, a proposed hybrid model known as SDA (Support Vector Machine and Decision Tree Algorithm) is presented. Every algorithm classifies the polarity of each customer review as positive, negative, or neutral.
The rest of this paper is organized as follows: Section 2 reviews the literature, Section 3 describes the dataset, Section 4 presents the materials and methods, Section 5 reports the experimental results, and Section 6 concludes.
2. Literature Survey
A literature review offers a thorough summary of the body of knowledge and theoretical frameworks pertaining to a particular topic, which forms the critical basis of scholarly research. As a result, a thorough literature review is crucial for determining the significance and novelty of new research as well as providing a critical lens through which current knowledge is evaluated and expanded upon.
Loukili et al. [1] applied artificial intelligence methods such as machine learning and natural language processing and compared several algorithms, including KNN, Random Forest, Logistic Regression (LR), and CatBoost Classifier. Their results indicate that LR is the model with the highest accuracy, scoring 0.900 (90%). Mujawar et al. [2] examined how sentiment analysis methods perform on user reviews of wireless earphones from the Indonesian online retailer Tokopedia. Owing to its superior performance across several evaluation metrics, the Naïve Bayes classifier was determined to be the best method overall.
In the paper “A combined approach of sentimental analysis using machine learning techniques”, Gupta et al. [3] show that the Random Forest classifier, with an accuracy of more than 78%, is the best-performing approach among the models tested and the most useful model for sentiment analysis in that work. In another study, Elangovan, Durai, and Varatharaj Subedha [4] use a Deep Belief Network (DBN) for sentiment classification in their suggested technique. After a series of tests, the APGWO-DLSA approach proved to be the most effective method in that research, reaching a maximum accuracy of 94.77% on the Cell Phones and Accessories (CPAA) dataset and 85.31% on the Amazon Products (AP) dataset.
In “Sentiment analysis and fake amazon reviews classification using SVM supervised machine learning model”, Tabany, Myasar, and Meriem Gueffal [5] report that the SVM model achieves 70% accuracy and is superior to the Naive Bayes, Logistic Regression, and Random Forest classifiers. The SVM’s performance was further enhanced through hyperparameter tuning, which led to 93% sentiment analysis accuracy.
Reviews of the literature give a succinct overview of previous studies and shed light on the current state of knowledge. They help direct the development of research questions and identify research gaps. By examining earlier studies, they establish the significance and background of new research. They also support methodological decisions and strengthen the research’s credibility.
3. Description of the Dataset
The Boat_Headphone Flipkart dataset, obtained from Kaggle, contains 9977 instances with attributes such as user reviews and ratings; a sample is shown in Figure 2. The dataset is split into training and testing subsets in order to assess the effectiveness of sentiment prediction models. The testing set comprises 2019 examples and is used to evaluate how well the model generalizes to previously unseen data. This division enables a thorough assessment of the model’s efficacy and accuracy in gauging sentiments from user reviews and ratings.
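The hold-out split can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name `split_dataset`, the random seed, and the placeholder records are assumptions, since the text does not state how the split was drawn.

```python
import random

def split_dataset(records, test_size, seed=42):
    """Shuffle a copy of the records and hold out `test_size` items for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Placeholder indices standing in for the 9977 review instances.
records = range(9977)
train, test = split_dataset(records, test_size=2019)
print(len(train), len(test))  # 7958 2019
```

Holding out 2019 of the 9977 instances leaves 7958 for training under this reading.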
![]()
Figure 2. Sample Dataset of Boat Headphone Reviews
4. Methods and Materials
This research takes a systematic approach to sentiment analysis of customer reviews, with a focus on the Boat Headphone dataset gathered from Flipkart.
4.1. Preprocessing Methods
Text mining relies heavily on preprocessing techniques because they improve the quality and suitability of the data for analysis [6]. These methods help to eliminate noise and inconsistencies from text by standardizing and cleaning it, making it possible to derive more precise and insightful conclusions. Figure 3 illustrates the steps involved in preparing the text data for further analysis.
Lowercase Review: In this stage, all of the reviews’ text is changed to lowercase. It lessens variability and enhances consistency in text processing by helping to standardize the text and ensuring that terms like “Excellent” and “excellent” are treated as the same term by the analysis.
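$$r' = \operatorname{lower}(r) \tag{1}$$
Here $r'$ is a function of the original review text $r$: the mapping $\operatorname{lower}$ replaces every uppercase letter with the corresponding lowercase letter and leaves all other characters unchanged.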
Stopwords: Stopwords are frequently eliminated from texts because they don’t significantly add meaning to sentiment analysis. Examples of these words are “and,” “the,” and “is.” Eliminating these stopwords helps the model concentrate on more significant terms by lowering noise in the data.
$$D' = D \setminus S \tag{2}$$
where $D'$ is the resulting document after removing all words that belong to the stopword set $S$. The symbol $\setminus$ denotes the set difference operation: it removes from $D$ all the elements that are also in $S$. This yields the document $D'$, containing only the words from $D$ that are not in the stopword set $S$.
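Lowercasing followed by stopword removal can be sketched as below. The three-word stopword set and the sample review are illustrative assumptions; a real pipeline would use a much larger list. A list comprehension is used instead of a literal set difference so that word order and repeated words are preserved.

```python
STOPWORDS = {"and", "the", "is"}  # tiny illustrative set, not a full stopword list

def remove_stopwords(words, stopwords=STOPWORDS):
    """Keep the words of the document that are not in the stopword set."""
    return [w for w in words if w not in stopwords]

review = "The sound is Excellent and the bass is deep"
words = review.lower().split()      # lowercase first so "The" matches "the"
filtered = remove_stopwords(words)
print(filtered)  # ['sound', 'excellent', 'bass', 'deep']
```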
Review of the Tokenized Text: Tokenization divides the text into tokens. This process is essential to convert continuous text into tokens that can be examined and used as input for machine learning models.
Let $X$ be a text sequence of characters and let $T$ be the tokenization function that splits $X$ into words or meaningful units:
$$T(X) = \{t_1, t_2, \ldots, t_n\} \tag{3}$$
where each $t_i$ is a token.
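A minimal tokenizer along these lines is shown below. The regular expression, which treats runs of letters, digits, and apostrophes as tokens, is an assumption chosen for illustration; the tokenizer actually used in the study is not specified.

```python
import re

def tokenize(text):
    """Split text into word tokens (runs of letters, digits, or apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("Good sound quality, worth the price!")
print(tokens)  # ['Good', 'sound', 'quality', 'worth', 'the', 'price']
```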
Review of stemmed words: A stemmed word is reduced to its base or root form. As an illustration, “running” could be stemmed to “run.” By combining various word forms into a single representation, this technique can help the model identify and analyze related terms more effectively [7].
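The stemming process applied to the entire set of words $W$ is then
$$W' = \{\operatorname{stem}(w) \mid w \in W\} \tag{4}$$
Here $W'$ represents the set of stemmed words, where each word $w$ has been transformed into its root form $\operatorname{stem}(w)$ by the function $\operatorname{stem}$.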
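To make the idea concrete, the sketch below strips one common suffix per word. It is a toy stand-in, not the Porter stemmer or whichever stemmer the study used; the suffix list and the minimum stem length of three characters are assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative suffix list

def simple_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

words = ["running", "played", "reviews"]
print([simple_stem(w) for w in words])  # ['run', 'play', 'review']
```

Real stemmers apply many more rules and exceptions, so their output on other words will differ from this sketch.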
Lemmatized Review: By taking into account the context and meaning of the word, lemmatization reduces words to their base or dictionary form more precisely than stemming does. For example, the lemma of “better” is “good.” Text analysis and sentiment prediction are improved by this method’s more accurate text normalization.
Let
$$W = \{w_1, w_2, \ldots, w_n\} \tag{5}$$
be the set of words in a review. For each $w_i$ in $W$,
$$l_i = \operatorname{lemma}(w_i) \tag{6}$$
Applying lemmatization to the entire set of words $W$ yields
$$L = \{\operatorname{lemma}(w_i) \mid w_i \in W\} \tag{7}$$
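Lemmatization is essentially a lookup from an inflected form to its dictionary form. The sketch below uses a hand-written table as a hypothetical stand-in for a real lexical resource; production lemmatizers also use part-of-speech information, which is omitted here.

```python
# Hypothetical lookup table; a real lemmatizer relies on a full dictionary.
LEMMA_TABLE = {"better": "good", "best": "good", "headphones": "headphone"}

def lemmatize(word, table=LEMMA_TABLE):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return table.get(word, word)

words = ["better", "headphones", "sound"]
print([lemmatize(w) for w in words])  # ['good', 'headphone', 'sound']
```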
Together, these preprocessing procedures standardize, clean, and condense the text data into a format that is suitable for sentiment analysis and other natural language processing applications. The hybrid and Machine Learning approaches were used in this research to assess the effectiveness of the methods and forecast the sentiments expressed in customer reviews. These strategies include more sophisticated approaches like hybrid models that combine several techniques, as well as more conventional ones like Support Vector Machines and Naïve Bayes algorithms. The objective was to evaluate these techniques’ performance through comparative analysis and ascertain how well they classified sentiments.
4.2. Machine Learning Algorithms
By identifying patterns in the data and categorizing text into positive, negative, and neutral categories, machine learning algorithms are essential for predicting polarities in sentiment analysis tasks. Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF) are frequently employed algorithms for this purpose. Each of these algorithms makes use of a different strategy: DT and RF create decision rules based on feature values, SVM determines the best hyperplane for classification, and NB employs probabilistic techniques.
4.2.1. Naïve Bayes Algorithm
Based on the premise that the features used for classification are conditionally independent given the class label, Naïve Bayes is a probabilistic classification algorithm. Because of its simplicity, it can function well in high-dimensional spaces and with little data [8]-[10]. Because of its efficiency and effectiveness in handling large datasets, Naïve Bayes often delivers robust performance in various applications, like text classification and spam filtering, despite its “naïve” assumption that rarely holds true in practice.
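The Naïve Bayes algorithm aims to find the class $C_k$ that maximizes the posterior probability $P(C_k \mid x)$ of a feature vector $x = (x_1, \ldots, x_n)$, using Bayes’ theorem:
$$P(C_k \mid x) = \frac{P(x \mid C_k)\, P(C_k)}{P(x)} \tag{8}$$
Since $P(x)$ is constant for all classes,
$$P(C_k \mid x) \propto P(x \mid C_k)\, P(C_k) \tag{9}$$
To compute $P(x \mid C_k)$, the Naive Bayes assumption treats the features as conditionally independent given the class:
$$P(x \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k) \tag{10}$$
Thus, the equation for predicting the class $\hat{y}$ is
$$\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \tag{11}$$
This equation forms the Naïve Bayes classifier. Its primary benefits are its simplicity of use and speed in producing probabilistic predictions, which make it a preferred option for a variety of classification tasks.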
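The Naïve Bayes decision rule, which picks the class with the largest prior times the product of per-feature likelihoods, can be sketched in a few lines of standard-library Python. The multinomial word-count model, the add-one (Laplace) smoothing, the log-space scoring, and the four-document toy corpus are all assumptions made for illustration; the text does not specify these details.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        total_words = sum(word_counts[c].values())
        score = math.log(n_c / total_docs)  # log prior
        for w in tokens:
            # Add-one smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy corpus, invented for illustration only.
docs = [["good", "sound"], ["great", "bass"], ["bad", "quality"], ["poor", "sound"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)
print(predict_nb(["good", "bass"], model))  # positive
```

Working in log space replaces the product of probabilities with a sum, which avoids numerical underflow on long reviews without changing the arg max.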
4.2.2. Support Vector Machine
Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. It works by finding the hyperplane that best separates data points of different classes in a high-dimensional space. The optimal hyperplane maximizes the margin, or distance, between the closest data points of each class, known as support vectors [11] [12]. SVM is effective in handling both linear and non-linear classification problems through the use of kernel functions, which map data into higher dimensions to make it linearly separable [13] [14].
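The distance between the hyperplane $w \cdot x + b = 0$ and the closest data points is called the margin. With labels $y_i \in \{-1, +1\}$ and the hyperplane scaled so that the closest points satisfy $|w \cdot x_i + b| = 1$, every training point obeys
$$y_i (w \cdot x_i + b) \geq 1 \quad \text{for all } i \tag{12}$$
and the distance from the hyperplane to the closest points is $1 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. Thus the SVM optimization problem can be formulated as follows:
$$\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^2 \tag{13}$$
subject to
$$y_i (w \cdot x_i + b) \geq 1, \quad \text{for all } i \tag{14}$$
Once $w$ and $b$ are determined, a new data point $x$ can be classified based on the sign of
$$f(x) = w \cdot x + b \tag{15}$$
The predicted class for $x$ is
$$\hat{y} = \operatorname{sign}(f(x)) \tag{16}$$
In this way, SVM classifies new data points by determining which side of the hyperplane they fall on.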
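Once a linear SVM has been trained, prediction reduces to evaluating $f(x) = w \cdot x + b$ and taking its sign. The sketch below shows only this prediction step; the weight vector and bias are hypothetical values chosen for illustration, not parameters fitted to the review data.

```python
def decision_value(w, b, x):
    """Compute f(x) = w . x + b for a linear SVM."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Classify by the sign of f(x): +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Hypothetical parameters, not fitted to any data.
w, b = [2.0, -1.0], -0.5
print(predict(w, b, [1.0, 0.5]))  # f = 2.0 - 0.5 - 0.5 = 1.0, so +1
print(predict(w, b, [0.0, 1.0]))  # f = -1.0 - 0.5 = -1.5, so -1
```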
4.2.3. Decision Tree Algorithm
The decision tree algorithm is a supervised learning technique used for both regression and classification problems. It builds a model in the shape of a tree structure, with each internal node standing for a feature-based decision, each branch for the decision’s result, and each leaf node for the final classification or prediction [15]-[17].
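For a node with classes $C_1, C_2, \ldots, C_k$ and probabilities $p_i$ for each class $C_i$, the Gini impurity is
$$\text{Gini} = 1 - \sum_{i=1}^{k} p_i^2 \tag{17}$$
Entropy measures the randomness in the information being processed. For a node with classes $C_1, C_2, \ldots, C_k$,
$$H = -\sum_{i=1}^{k} p_i \log_2 p_i \tag{18}$$
For a dataset $D$ with entropy $H(D)$ and a split on feature $A$ resulting in subsets $D_1, D_2, \ldots, D_m$, the information gain $IG(D, A)$ for feature $A$ is
$$IG(D, A) = H(D) - \sum_{j=1}^{m} \frac{|D_j|}{|D|}\, H(D_j) \tag{19}$$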
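The split criteria used by decision trees (Gini impurity, entropy, and information gain) can be computed directly from class labels. The sketch below uses the standard definitions; the four-label toy node and its split are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class distribution at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of the class distribution at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy node with two classes, split perfectly into two pure children.
parent = ["pos", "pos", "neg", "neg"]
split = [["pos", "pos"], ["neg", "neg"]]
print(entropy(parent), gini(parent), information_gain(parent, split))
```

A perfectly mixed two-class node has entropy 1 bit and Gini impurity 0.5; a split into pure children recovers the full 1 bit as information gain.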
Recursively dividing the dataset into subsets based on feature values that best separate the data in accordance with a criterion―such as information gain or Gini impurity―builds the tree. Decision trees are helpful for comprehending the decision-making process because they are simple to interpret and visualize.
4.2.4. SDA Hybrid Algorithm (Hybrid Algorithm)
The SDA Hybrid Algorithm, also referred to as the Novel Random Decision Algorithm, was created to improve sentiment analysis of customer reviews. To increase classification accuracy, this technique combines elements of traditional classifiers with decision-making procedures [18]. In order to capture various facets of the sentiment landscape, the SDA Hybrid Algorithm combines multiple decision trees, each trained on randomly selected subsets of the data. By using both traditional algorithms and structured decision-making, SDA seeks to build a strong model that can manage the complexity and variability of customer reviews. This hybrid method aims to provide a more accurate and nuanced sentiment prediction from customer feedback by reducing overfitting and enhancing generalization [19] [20].
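Apply the Decision Tree to segment the dataset $D$ based on optimal splits:
$$D = D_1 \cup D_2 \cup \cdots \cup D_m \tag{20}$$
where each $D_i$ represents a subset of data points. The SVM component is then fitted on each subset, giving a classifier
$$f_i = \operatorname{SVM}(D_i) \tag{21}$$
The overall prediction for the input features $x$ is the combination of the predictions from the subsets:
$$\hat{y}(x) = \operatorname{combine}\big(f_1(x), f_2(x), \ldots, f_m(x)\big) \tag{22}$$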
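The text does not spell out how the tree and the SVM are combined, so the following is only one plausible reading: a decision-tree split routes each input to a segment, and a separate linear classifier makes the prediction inside that segment. Every name and parameter below (`route`, `hybrid_predict`, the split, the per-segment weights) is hypothetical, and nothing is fitted to real data; the sketch covers the prediction step only.

```python
def route(x, feature_index, threshold):
    """Depth-1 decision tree: return the index of the segment that x falls into."""
    return 0 if x[feature_index] <= threshold else 1

def linear_predict(w, b, x):
    """Sign of w . x + b, the decision rule of a linear SVM."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def hybrid_predict(x, split, segment_models):
    """Route x through the tree split, then apply that segment's linear model."""
    feature_index, threshold = split
    w, b = segment_models[route(x, feature_index, threshold)]
    return linear_predict(w, b, x)

# Hypothetical split and per-segment parameters; none are fitted to real data.
split = (0, 0.5)
segment_models = [([0.0, 1.0], -0.5), ([0.0, -1.0], 0.5)]
print(hybrid_predict([0.2, 0.9], split, segment_models))  # segment 0 -> +1
print(hybrid_predict([0.8, 0.9], split, segment_models))  # segment 1 -> -1
```

In this toy setup the two segments use opposite decision rules, a pattern no single linear classifier could represent, which illustrates why partitioning before classification can help.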
5. Results and Discussions
The experimental findings in this research shed light on how well different hybrid and machine learning approaches perform in predicting sentiment from customer reviews [15] [16].
Figure 4 represents the outcomes of the various preprocessing techniques applied to the text data. It highlights the impact of each technique on the quality of the data before it is fed into analytical models. The preprocessing steps listed, namely lowercasing, cleaning, stopword removal, tokenization, stemming, and lemmatization, are evaluated based on specific metrics, which might include text clarity, data consistency, and model performance improvements.
![]()
Figure 4. Results of preprocessing techniques.
Table 1 displays the word counts obtained with the vectorization method, showing how frequently particular words appear in the customer reviews. Words with corresponding counts, such as “good,” “sound,” “product,” and “quality,” are listed in the table. For instance, “good” is the most frequently occurring word with 4276 occurrences, followed by “sound” with 2827 occurrences and “product” with 2658.
![]()
Table 1. Frequency of words in customer reviews
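Word frequencies of this kind can be obtained with a simple count over all tokens. The sketch below uses `collections.Counter` from the Python standard library on three invented reviews; the reviews and the whitespace tokenization are assumptions, and the counts are unrelated to those in Table 1.

```python
from collections import Counter

# Invented reviews for illustration only.
reviews = ["good sound good bass", "good product", "sound quality"]
counts = Counter(word for review in reviews for word in review.split())
print(counts.most_common(2))  # [('good', 3), ('sound', 2)]
```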
While Naïve Bayes shows competitive performance, particularly in the Positive category, it performs less well when classifying neutral sentiments. Figure 5 shows the words that customers use most frequently in their reviews, which offers insights into the important themes and sentiments.
![]()
Figure 5. Frequency of words repeated in reviews.
These steps improved performance and accuracy in sentiment analysis by preparing the dataset for the application of algorithms such as Naive Bayes, SVM, Random Forest, and hybrid models.
![]()
Figure 6. Length of reviews in different phase.
Figure 6 displays the impact of each preprocessing step on the text data. Its columns show the review lengths after lowercasing, cleaning, tokenization, stopword removal, lemmatization, and stemming, along with the rating and the original review length.
Table 2 reports the initial length of the review text, in characters or words, before any preprocessing was done, alongside its length in each later phase. These metrics enable a thorough comparison of each preprocessing method’s effect and help quantify the amount of redundancy or noise eliminated at each step. They also offer insights into how the dataset is transformed for tokenization, feature extraction, and classification algorithms.
Table 3 represents a comparison of the sentiment polarity identification performance of four distinct classification algorithms: Decision Tree, Naïve Bayes, Support Vector Machine (SVM), and SDA (a hybrid algorithm). Precision, recall, and F1-score for three sentiment categories―Negative, Neutral, and Positive―as well as the support (number of samples) for each category are used to assess each algorithm’s performance.
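For reference, the per-class metrics can be computed from true and predicted labels as sketched below, using the standard definitions of precision, recall, F1-score, and support. The six labels are invented for illustration and do not come from the study’s results.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one sentiment class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Invented labels for illustration only.
y_true = ["pos", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neu", "pos", "pos"]
print(per_class_metrics(y_true, y_pred, "pos"))
```

For the "pos" class here there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3 with a support of 3.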
Figure 7 displays the counts of positive, neutral, and negative sentiments extracted from the customer reviews by the existing algorithms, Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM), and by the SDA hybrid algorithm.
Table 4 shows the effectiveness of a number of machine learning algorithms, including Decision Tree, Naïve Bayes, Random Forest, Support Vector Machine, and the hybrid SDA algorithm (Support Vector Machine + Decision Tree), in determining the sentiment polarity (negative, neutral, and positive) of the dataset. The precision, recall, F1-score, and support metrics are used to evaluate each model’s performance. Compared with the other algorithms, the SDA (hybrid) algorithm is the best model for identifying sentiment polarity in this dataset, as shown in Figure 8. It receives nearly flawless scores on every metric for all sentiment classes, indicating that it strikes the best balance between accuracy and dependability when determining sentiment polarities.
![]()
Table 2. Length of reviews in each preprocessing phase.
![]()
Table 3. Polarities predicted by algorithms.
![]()
Figure 7. Polarities Predicted by Algorithms.
![]()
Table 4. Performance analysis of algorithms.
SVM and Naïve Bayes perform reasonably well, while Random Forest and Decision Tree have lower recall, particularly for negative and neutral sentiments. Table 5 shows the accuracy of the algorithms in the classification task. With an accuracy of 72%, the Naïve Bayes algorithm performed moderately. With an accuracy of 82%, the Support Vector Machine (SVM) outperformed the other conventional algorithms, indicating a strong capacity for accurate data classification.
The Decision Tree algorithm yields 78.20% accuracy, slightly lower than SVM but still good. With an accuracy of 96.01%, the proposed approach, SDA, significantly outperformed the other algorithms.
![]()
Figure 8. Comparison of Predicting Sentiments by Algorithms
With a substantially higher accuracy of 96.01%, SDA (the proposed method) outperformed the traditional algorithms by a wide margin. Figure 9 shows that the suggested approach is very good at identifying the underlying patterns in the data, which results in more accurate predictions.
6. Conclusion
Customer review data posted on social media are useful for assessing product quality and for further analysis. To evaluate the performance of machine learning algorithms, the Boat headphone reviews were given as input to the chosen algorithms. The existing machine learning algorithms Naïve Bayes, Support Vector Machine, Random Forest, and Decision Tree, and a hybrid algorithm named SDA, were evaluated in terms of precision, recall, F1-score, and accuracy. The polarities of the chosen dataset were identified in order to find the sentiments of the customer reviews. The experimental results show that the hybrid algorithm performs better than the other existing algorithms. Hence, it is concluded that the hybrid algorithm yields better results when analysing customer reviews for sentiments. In future work, other machine learning algorithms will be applied with the same procedure to find sentiments and measure accuracy on different kinds of customer review data.