Development of Answer Validation System Using Responders’ Attributes and Crowd Ranking
1. Introduction
The information era has made information readily accessible, especially with the advent of the internet. Questions requiring correct answers are posted on the internet on a daily basis, which has led to the development of question answering (QA) systems. Unlike document retrieval, these systems aim to provide accurate answers to explicit questions (Ojokoh & Adebisi, 2019
Several studies have examined how to improve the quality of the answers provided by QA systems, focusing on textual entailment, question type analysis, answer ranking by crowd workers and domain experts, and the personal and community features (past history) of the answerer (Ríos-Gaona et al., 2012; Su et al., 2007; Ishikawa et al., 2011; Ojokoh & Ayokunle, 2012; Anderson et al., 2012; Schofield & Thielscher, 2019). Since past history alone may not be sufficient to determine the quality of an answer, the level of confidence in the answer provided is introduced in order to obtain credible answers from respondents. The proposed system aims to use community presence interaction as one of the bases for quality answer selection; to capture crowd specialty as part of the personal features used to validate answers; to model the evaluation criteria automatically; and to prevent biased crowd ranking of answers by enabling rankers to specify their preference schedule, using a Naïve Bayes spam filter and the Borda count ranking algorithm.
The remaining part of this paper is structured as follows: Section 2 reviews related works. Section 3 presents the proposed system architecture and describes the components that make up the architecture. Section 4 is dedicated to the experimental setup and results, while Section 5 concludes the paper and presents some future works.
2. Related Works
Question Answering (QA), according to Chandra et al. (2017), is a computer science discipline concerned with developing systems that automatically provide answers to questions posed by humans in natural language. QA research attempts to deal with a wide range of question types, including facts, lists, definitions, how, why, putative, semantically constrained, and cross-lingual questions (Cimiano et al., 2014
Dobšovič et al. (2014) proposed and developed a CQA system, “Askalot”, which focuses on education by implementing functionality that reflects the educational goals and specifics of universities, based on open source technologies. Answers to questions are verified by other students, while a teacher provides comments using a five-grade scale on which the quality of a question or answer can be assessed. Toba et al. (2014) proposed a hybrid hierarchy-of-classifiers framework to model QA pairs, integrating question type analysis and answer quality information in a single framework. The quality classifier outputs two probabilities, one for good quality and one for bad quality. They tested the framework on a dataset of about 50 thousand QA pairs from Yahoo! Answers, and their evaluation showed that it identified high-quality answers effectively. Tran et al. (2015) presented a method to detect the right or potentially right answers in the answer threads of Community Question Answering pools. They used multiple features for quality answer selection, exploiting the surface word-based similarity between the question and answer to assign a score using a regression model. Afterwards, translation probabilities were computed via IBM and Hidden Markov Models to obtain the likelihood of an answer being the translation of the question. Savenkov et al. (2016) presented a system that could be used to filter or re-rank candidate answers by providing validation for the answers. They focused specifically on the effect of time restrictions in the near real-time QA setting, developing ways for the crowd to create answer candidates directly within a limited amount of time and to rank sets of given answers to a question within a specified amount of time. Hung et al. (2017) developed a probabilistic model that helps to identify the most valuable validation questions for improving the accuracy of results and detecting faulty workers, in order to validate and control the quality of crowd answers while reducing the cost incurred from utilizing experts.
Nie et al. (2017) presented a novel scheme to rank answer candidates via pairwise comparisons consisting of one offline learning and one online search component. In the online search component, a pool of candidate answers for the given question was extracted via finding its similar questions. The extracted answers were then sorted by leveraging the offline trained model to judge the preference orders.
Fan et al. (2019) proposed to enhance answer selection in CQA using multidimensional feature combination and similarity order. They made full use of the information in answers to questions to determine the similarity between questions and answers, and used the text-based description of the answer to determine its sensibility. Le et al. (2019) proposed a framework for automatically assessing answer quality by integrating different groups of features (personal, community-based, textual, and contextual) to build a classification model and determine what constitutes answer quality. Experiments conducted on Brainly and Stack Overflow datasets show that the random forest model achieves high accuracy in identifying high-quality answers, and indicate that personal and community-based features have more predictive power in assessing answer quality.
In this paper, we build on the fact that the performance of crowd workers determines the quality of the result of a crowdsourcing task. Because crowd workers vary in reliability, as established in past works (Hung et al., 2017; Savenkov et al., 2016), there is a need for an effective and reliable question answering system that is capable of validating and evaluating the answers they provide. These are important issues to be addressed in Artificial Intelligence.
3. The Proposed System
The architectural overview of the proposed system is presented in Figure 1. The subsections that follow describe each of the segments.
3.1. User Interface
The user interface module consists of four (4) components listed as follows:
1) Ask Question: This component enables the asker (that is someone who wishes to ask any computer-related questions) to post his/her questions on the platform.
2) Answer Question: This component enables experts or anyone familiar with the question asked to provide answers.
3) Rank Answers: This component allows users from the crowd to rank answers provided by other users based on their knowledge of the question.
4) View Recent Questions: This component provides a view of the list of the most recently posted questions.
3.2. Database
The database is the component of the Answer Validation model that stores information about the system and its users. It stores legitimate questions and answers from web users and, most importantly, answerers’ personal information, which is obtained the first time a respondent uses the system and is used to validate their answers.
3.3. Naïve Bayes Spam Filter
The Naïve Bayes (NB) spam filter, a machine learning algorithm and one of the powerful tools of Artificial Intelligence, was used in this work to filter inconsequential and redundant messages from the collection of messages or information provided by the crowd. Every incoming text (both question and answer) passes through the trained Naïve Bayes spam filter to determine the probability of the message being legitimate or spam. The NB spam filter is trained with commonly used online spam words and a spam dataset downloaded from kaggle.com. A sample is shown in Figure 2.
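From Bayes’ theorem, the probability that a message with vector $\vec{x}$ belongs in category $c$ is:
$$P(c \mid \vec{x}) = \frac{P(c)\, P(\vec{x} \mid c)}{P(\vec{x})} \qquad (1)$$
Using the Naïve Bayes spam filter, a message is classified as spam whenever
$$P = P(c_s \mid \vec{x}) = \frac{P(c_s)\, P(\vec{x} \mid c_s)}{P(c_s)\, P(\vec{x} \mid c_s) + P(c_h)\, P(\vec{x} \mid c_h)} \qquad (2)$$
$$P > T \qquad (3)$$
where
$c_s$ denotes the spam category;
$c_h$ denotes the ham category;
$P(c_s)$ is the probability that a response belongs to the spam category;
$P(c_h)$ is the probability that a response belongs to the ham category;
$P(\vec{x} \mid c_s)$ is the likelihood of response $\vec{x}$ given the spam category;
$P(\vec{x} \mid c_h)$ is the likelihood of response $\vec{x}$ given the ham category; and
$T$ is a threshold value.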
If P is greater than T, the incoming message is classified as spam and discarded; otherwise (P less than or equal to T), the message is accepted by the system and either presented as a question or accepted as an incoming answer.
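The thresholded decision described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's actual implementation: the multinomial word model, the Laplace smoothing, the whitespace tokenizer, the class name `NaiveBayesSpamFilter` and the four toy training messages are all assumptions introduced here. Only the rule "spam if P > T" and the threshold of 0.5 come from the paper.

```python
import math
from collections import Counter

# Toy training data (hypothetical; the paper trains on a Kaggle spam dataset).
TOY_TRAINING = [
    ("win free cash prize now", "spam"),
    ("click here for free offer", "spam"),
    ("how do i configure a network router", "ham"),
    ("what is the difference between ram and rom", "ham"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not specified in the paper)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Minimal multinomial Naive Bayes with Laplace smoothing (assumed variant)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # T in the decision rule P > T
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    def train(self, labelled_messages):
        """Count documents and words per category; call before classifying."""
        for text, label in labelled_messages:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])

    def _log_joint(self, tokens, label):
        """log P(c) + sum of log P(x_i | c), ignoring words never seen in training."""
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for token in tokens:
            if token in self.vocabulary:
                log_prob += math.log((self.word_counts[label][token] + 1) / denominator)
        return log_prob

    def spam_probability(self, text):
        """Posterior P(spam | x), normalised over the spam and ham categories."""
        tokens = tokenize(text)
        log_spam = self._log_joint(tokens, "spam")
        log_ham = self._log_joint(tokens, "ham")
        shift = max(log_spam, log_ham)  # subtract the max to avoid overflow
        spam_term = math.exp(log_spam - shift)
        ham_term = math.exp(log_ham - shift)
        return spam_term / (spam_term + ham_term)

    def is_spam(self, text):
        """Apply the decision rule: spam if P > T, accepted otherwise."""
        return self.spam_probability(text) > self.threshold


if __name__ == "__main__":
    spam_filter = NaiveBayesSpamFilter(threshold=0.5)
    spam_filter.train(TOY_TRAINING)
    for message in ("free cash offer", "how do i configure ram"):
        print(message, "->", "spam" if spam_filter.is_spam(message) else "accepted")
```

Working in log space keeps the product of many small word likelihoods from underflowing; a message made up entirely of unseen words falls back to the category priors and, with equal priors and T = 0.5, is accepted.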
3.4. Separate Question from Answer
This is the component of the system where a legitimate message from the user is identified as either a question or an answer. If the incoming message is a question, this component ensures that the question is presented at the User Interface for the answerers to provide answers; otherwise, the system passes it to the next component, where the criteria for quality answers are implemented.
Figure 2. Spam dataset from https://www.kaggle.com/.
3.5. Criteria for Quality Answers
The quality of the result of a question answering system rests on the source of the answers provided by the system. Since the aim of a question answering system is to provide a precise answer in natural language, it is important to provide quality assurance on every answer obtained from web users, as these users can vary in reliability. The criteria employed in this work to validate answers and ensure their quality are User attributes, Area of Specialization, Understandability and Confidence (displayed in Table 1).
3.6. Weighted Voting System
A game-playing situation is applied for ranking answers using a collection of weighted players $P_1, P_2, \ldots, P_k$ together with a quota $q$, which is the total number of votes required to pass a motion. This is used to determine the level of reliability of the users that provide answers. A player is a user attribute that is used to allot points to answerers. In a weighted voting system, a player’s weight $w_i$ refers to the number of points allotted to that player and is always a positive integer. A weighted voting system is described by specifying the voting weights $w_1, w_2, \ldots, w_k$ of the players $P_1, P_2, \ldots, P_k$, and the quota $q$. A coalition is called winning if the sum of its players’ weights is greater than or equal to the quota, and losing otherwise. The criteria used in this work to ensure quality answers from web users are user attributes, area of specialization, understandability and confidence. The user attributes comprise the user’s course of study, grade point, number of years of experience in computing and general level of knowledge of computing. Points are added to the weight of the responder based on their selections from the range of values of each attribute. A user is also allowed to choose an area of specialization such as networking, cyber security, or hardware and repairs. The user’s understanding of the given question is measured on a five-level rating scale, as is confidence, through which the answerer indicates how much the system can trust the answer provided. Combining these with the weighted voting system, this phase of the system is represented by:
$$[q : P_1, P_2, P_3, P_4]$$
where
$P_1$ is the user’s personal attributes, $P_2$ is specialization;
$P_3$ is understandability, $P_4$ is confidence.
Table 1. Weight distribution table.
The total weight $W$ per answerer is computed as:
$$W = \sum_{i=1}^{4} w_i \qquad (4)$$
where $w_i$ is the weight corresponding to each player $P_i$. The maximum weight $N$ obtainable by an answerer, with $q$ being the minimum weight required for an acceptable (valid) answer, is expressed as:
$$N = \sum_{i=1}^{4} \max(w_i) \qquad (5)$$
then
$$\frac{N}{2} < q \le N \qquad (6)$$
holds.
From Equation (6), $q$ is less than or equal to $N$ but greater than $N/2$. Since this work is based on quality answer validation, 70% of $N$ was used as the quota:
$$q = 0.7 \times N$$
Therefore the quota $q$ will be 25. Table 2 depicts the different criteria considered in this work with the respective maximum weight obtainable.
The points obtained by a responder (answerer) from each criterion are aggregated based on their selections. The total weight of the answer is then checked against the quota. If the total weight of the answer is greater than or equal to the quota, the answer is considered valid and is passed to the next phase, the ranking phase; otherwise, the answer is discarded.
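The validation step can be illustrated with a short Python sketch. The quota of 25 and the four criteria are taken from the text; the field names, the `AnswerWeights` type and the example point values are hypothetical, since the per-criterion points of Table 1 are not reproduced here.

```python
from dataclasses import dataclass

QUOTA = 25  # q, the minimum total weight for a valid answer (from the text)


@dataclass(frozen=True)
class AnswerWeights:
    """Points earned by one answer on each criterion (field names are illustrative)."""
    user_attributes: int    # P1: course of study, grade point, experience, knowledge level
    specialization: int     # P2
    understandability: int  # P3: five-level self-rating
    confidence: int         # P4: five-level self-rating

    def total(self):
        """Total weight W: the sum of the points over the four criteria."""
        return (self.user_attributes + self.specialization
                + self.understandability + self.confidence)


def is_valid_answer(weights, quota=QUOTA):
    """An answer passes to the ranking phase when W >= q; otherwise it is discarded."""
    return weights.total() >= quota


def quota_is_admissible(quota, max_weight):
    """Check the usual weighted-voting condition N/2 < q <= N."""
    return max_weight / 2 < quota <= max_weight


if __name__ == "__main__":
    strong = AnswerWeights(user_attributes=15, specialization=4,
                           understandability=4, confidence=4)
    weak = AnswerWeights(user_attributes=10, specialization=3,
                         understandability=3, confidence=3)
    print(strong.total(), is_valid_answer(strong))  # 27 True
    print(weak.total(), is_valid_answer(weak))      # 19 False
```

Keeping the quota as a parameter rather than a hard-coded comparison makes it easy to experiment with fractions of the maximum weight other than 70%.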
3.7. Crowd Ranking
The last phase employs a crowdsourcing ranking algorithm called Borda count. The algorithm ranks all the valid answers from phase two using preference schedule points. It awards points to candidates based on the preference schedule, and the candidate with the highest points is declared the winner. For instance, given $M$, the number of candidate answers, each first-place, second-place and third-place vote is worth $M$, $M-1$ and $M-2$ points respectively. Consequently, each $M$th-place (that is, last-place) vote is worth 1 point. Now, suppose there are $n$ voters and every voter ranks the $M$ candidates according to his or her preference; a candidate answer then has an average rank score $\bar{r}$:
$$\bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i \qquad (7)$$
where $r_i$ is the point assigned by the $i$th of the $n$ crowd rankers.
Table 2. Maximum weight obtainable (N).
The candidate answers will be ranked according to their performance starting from the best on top of the list (answer with the highest point) to the worst (answer with the lowest point).
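As a minimal sketch of this ranking phase, the Python functions below compute the average Borda score of each candidate answer and sort the candidates from best to worst. The function names, the ballot representation (one best-to-worst list of answer identifiers per ranker) and the alphabetical tie-break are assumptions made for illustration; the paper does not describe how ties are resolved.

```python
from collections import defaultdict


def borda_scores(ballots):
    """Average Borda score per candidate.

    Each ballot lists the same M candidate ids from most to least preferred.
    A first-place vote is worth M points, and a last-place vote is worth 1 point.
    """
    if not ballots:
        return {}
    candidates = set(ballots[0])
    m = len(candidates)
    totals = defaultdict(int)
    for ballot in ballots:
        if len(ballot) != m or set(ballot) != candidates:
            raise ValueError("every ballot must rank the same candidates exactly once")
        for position, candidate in enumerate(ballot):
            totals[candidate] += m - position
    n = len(ballots)
    return {candidate: totals[candidate] / n for candidate in candidates}


def rank_answers(ballots):
    """Candidates ordered best first; ties are broken by id (an assumed convention)."""
    scores = borda_scores(ballots)
    return sorted(scores, key=lambda candidate: (-scores[candidate], candidate))


if __name__ == "__main__":
    # Three hypothetical rankers ordering three candidate answers.
    ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
    print(borda_scores(ballots))
    print(rank_answers(ballots))
```

With these three ballots, answer A collects 3 + 3 + 2 = 8 points, B collects 2 + 1 + 3 = 6 and C collects 1 + 2 + 1 = 4, so the order is A, B, C.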
4. Experiments and Evaluation
4.1. Data and Tools
A dataset consisting of 185 spam messages was downloaded from Kaggle.com and used to train the Naïve Bayes filter to distinguish between legitimate and inconsequential information provided by the crowd. The system was implemented using HTML, Python and the Django web framework.
4.2. Experimental Setup
Experiments were conducted to verify the system performance and to determine how useful and precise the answers provided were. Users of the system post questions, which are answered by responders who are well versed in the field of the question being asked. Before responders are allowed to provide answers, they are required to sign in or sign up, as the case may be, supplying their course of study, area of specialization, grade point, number of years of experience in computing, general level of computing knowledge and level of understanding of the question. The confidence level of the responder is also confirmed before the answer is posted. In cases where a minimum of five different answers are provided to a particular question, they are ranked by the crowd from the most correct to the least correct answer. A sample of asked questions and answers provided is shown in Figure 3.
Figure 3. Sample of questions and answers.
4.3. Evaluation
The method of evaluation used in this work is based on ISO/IEC 9126 standard metrics and the usability and user experience (UX) measurement instruments adopted in Tan et al. (2010). The model consists of 21 subcharacteristics distributed over six main characteristics of software measurement metrics. Using the common Goal Question Metric (GQM) approach, a nomenclature for usability and UX attributes was defined, and an extensive set of questions and measures was identified for each attribute. The metrics used for this work are shown in Table 3.
From the metrics stated above, twenty (20) questions were formed for users to evaluate the Answer Validation system. Eighty-five users out of a sample of one hundred evaluated the system, with each question (Q1, …, Q20) answered on a four-level rating scale: Very High, High, Medium and Low. Ratings obtained from the users were analyzed using the weighted mean technique, in which weights are assigned to user feedback (Very High = 4, High = 3, Medium = 2 and Low = 1). A sample of the questionnaire is shown in Table 4.
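The weighting scheme can be illustrated with a small Python sketch. The rating weights (4, 3, 2, 1) and the total of 85 respondents come from the text; the function names and the example frequency counts are hypothetical and are not the study's data.

```python
# Weights attached to each rating level, as stated in the text.
RATING_WEIGHTS = {"Very High": 4, "High": 3, "Medium": 2, "Low": 1}


def weighted_mean(counts):
    """Weighted mean rating for one question, given the number of users per level."""
    respondents = sum(counts.values())
    weighted_sum = sum(RATING_WEIGHTS[level] * count for level, count in counts.items())
    return weighted_sum / respondents


def percentages(counts):
    """Share of respondents (in %) that chose each rating level."""
    respondents = sum(counts.values())
    return {level: 100.0 * count / respondents for level, count in counts.items()}


def combined_high_percentage(counts):
    """Share of respondents (in %) that chose either Very High or High."""
    respondents = sum(counts.values())
    high = counts.get("Very High", 0) + counts.get("High", 0)
    return 100.0 * high / respondents


if __name__ == "__main__":
    # Hypothetical frequencies for a single question, summing to 85 respondents.
    example = {"Very High": 40, "High": 36, "Medium": 7, "Low": 2}
    print(round(weighted_mean(example), 2))             # 3.34
    print(round(combined_high_percentage(example), 2))  # 89.41
```

Reporting the combined Very High and High share alongside the weighted mean gives a single percentage per metric, which is the form in which the results are summarised.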
4.4. Results and Discussion
The ratings were analyzed and the frequency at which each rating point occurs was obtained. The metrics were measured and analyzed to form a continuous score in percentage (%). Table 5 shows the number of users, out of eighty-five (85), that rated the system Very High, High, Medium or Low on the given questionnaire. Figure 4 and Figure 5 show the graphical representation of the obtained results. Table 6 shows the combination of the Very High and High ratings, which reduces the user ratings to High, Medium and Low. Figure 6 and Figure 7 show the combined Very High and High ratings for Usability and User Experience respectively.
Table 3. Usability and user experience metrics.
Table 4. Questionnaire for answer validation system evaluation.
Table 5. User rating frequency table and their percentage.
Figure 6. Combined very high and high rating for usability.
Figure 7. Combined very high and high rating for user experience.
Table 6. Combined very high and high rating.
The overall results show that the user experience evaluations of the system based on the given metrics are excellent. For most of the metrics, the combined “Very High” and “High” ratings (which indicate a favourable opinion) reach 90% or above, while “Medium” ratings fall below 10%.
The Relevance of the system is calculated thus:
,
where N is the total number of rate point,
and
is the sum of user that selected a given rate point for all the metrics.
.
5. Conclusion and Future Works
An answer validation system using answerers’ attributes and crowd ranking has been developed. To make the system effective, illegitimate questions and answers were filtered out using a trained Naïve Bayes spam filter with a threshold of 0.5. Answerers’ personal attributes (grade point, area of specialization, years of experience, level of computing knowledge, course of study, question understandability and answer confidence level (trustworthiness)) were used to ensure high-quality answers by employing a weighted system that assigns weights to individual attributes in order to determine the weight of each answer for validation. The crowd ranks the candidate answers obtained from the answerers using the Borda count ranking algorithm to obtain the best four answers, and the lowest-ranked answer is discarded. The system correctness is 96.47%, answer satisfaction is 100%, answer validation is 97.65%, system simplicity is 97.6%, system feedback is 88.23% and system efficiency is 96.47%. Future works could include more user attributes, such as age and qualification, and improve the system feedback so that users can receive instant live answers to their questions. Answerers should also be motivated for the task performed in order to enhance their performance. In addition, the system should be made more general to accommodate questions from other science-related domains.