Comparison of Ontology-Based Semantic-Similarity Measures in the Biomedical Text
1. Introduction
Semantic similarity between concepts is a measure of the semantic similarity, or the semantic distance, between two concepts according to a given ontology. Semantic similarity techniques can be classified into three classes. The first class measures semantic similarity by using an ontology or taxonomy (e.g., WordNet, UMLS/ICD10) to calculate the distance between the concept nodes in the ontology tree or hierarchy [1]. The second class uses training corpora and information content (IC) to estimate the semantic similarity and relatedness between two concepts. The third class combines techniques from the first two classes. Measures of semantic similarity and relatedness are used in applications such as information extraction and retrieval, classification and ranking, detection of redundancy, and detection and correction of malapropisms. In this paper, we analyze an ontology-based semantic similarity measure and apply it in the biomedical domain, using ICD10 as the knowledge source.
2. Background and Related Work
2.1. UMLS and ICD10
The Unified Medical Language System (UMLS) project started at the National Library of Medicine (NLM) in 1986. It consists of three main knowledge sources. First, the Metathesaurus contains more than 1 million biomedical concepts from over 130 sources and supports 17 languages. Second, the Semantic Network contains 135 broad categories and 54 relationships between categories. Third, the SPECIALIST Lexicon & Lexical Tools include lexical information and programs for language processing [2]. In the Metathesaurus of UMLS 2005AB (June 2005), there are 133 source vocabularies classified into 73 families. They have multiple translations (e.g., MeSH, ICPC, and ICD-10) and many variants (American-British equivalents, Australian extension/adaptation) [2].
ICD10 stands for the International Classification of Diseases, 10th revision, an international standard for classifying diseases and other health problems adopted by the World Health Organization (WHO) [3]. The newest edition (ICD-10) is divided into 21 chapters (Infections, Neoplasms, Blood Diseases, Endocrine Diseases, etc.) and denotes about 14,000 classes of diseases and related problems.
2.2. ICD10 Taxonomy
The first character of the ICD code is a letter, and each letter is associated with a particular chapter, except for the letter D, which is used in both Chapter II, Neoplasms, and Chapter III, Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism, and the letter H, which is used in both Chapter VII, Diseases of the eye and adnexa, and Chapter VIII, Diseases of the ear and mastoid process. Four chapters (Chapters I, II, XIX, and XX) use more than one letter in the first position of their codes. Each chapter contains sufficient three-character categories to cover its content; not all available codes are used, allowing space for future revision and expansion. Chapters I-XVII relate to diseases and other morbid conditions, and Chapter XIX to injuries, poisoning and certain other consequences of external causes. The remaining chapters complete the range of subject matter nowadays included in diagnostic data. Chapter XVIII covers Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified. Chapter XX, External causes of morbidity and mortality, was traditionally used to classify causes of injury and poisoning, but, since the Ninth Revision, has also provided for any recorded external cause of diseases and other morbid conditions. Finally, Chapter XXI, Factors influencing health status and contact with health services, is intended for the classification of data explaining the reason for contact with health-care services of a person not currently sick, or the circumstances in which the patient is receiving care at that particular time or otherwise having some bearing on that person’s care. The chapters are subdivided into homogeneous “blocks” of three-character categories. Most of the three-character categories are subdivided by means of a fourth, numeric character after a decimal point, allowing up to 10 subcategories [3].
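The code structure described above can be illustrated with a minimal Python sketch. It assumes the plain WHO shape of a code (one letter, two digits, and an optional single-digit subcategory after a decimal point); the names `ICD10_CODE`, `parse_icd10`, and `parent_category` are hypothetical helpers introduced only for illustration.

```python
import re

# Assumed shape of a WHO ICD-10 code: one letter, two digits, and an optional
# single-digit subcategory after a decimal point (e.g. "A00" or "A00.1").
ICD10_CODE = re.compile(r"^([A-Z])(\d{2})(?:\.(\d))?$")


def parse_icd10(code):
    """Split a code into (first letter, three-character category, subcategory)."""
    match = ICD10_CODE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised ICD-10 code: {code!r}")
    letter, digits, sub = match.groups()
    return letter, letter + digits, sub  # sub is None for a bare category


def parent_category(code):
    """Return the three-character category of a four-character code, else None."""
    _, category, sub = parse_icd10(code)
    return category if sub is not None else None
```

For instance, `parent_category("A00.1")` yields the category `"A00"`, which is the is-a link used when the codes are arranged as nodes of a taxonomy. Mapping the first letter to a chapter is deliberately left out, because the letters D and H each serve two chapters.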
We used the following taxonomy to measure the similarity between health-domain types, which are represented as nodes in the taxonomy. In our experiment, the similarity is measured using different semantic similarity measures. Based on the evaluation results, the best measure will be used in our benchmark model. Figure 1 below describes the ICD10 nodes:
2.3. Related Work
Many methods for determining semantic similarity have been proposed in the last few decades; the similarity between two concepts/nodes is also called relatedness. To measure the relatedness of two concepts C1 and C2, researchers have used methods that can be classified into two categories. The first is graph-based and uses a distance that mainly considers the lengths of the paths connecting the concepts. The second is based on the information content shared by the concepts.
Rada [5] defined the conceptual distance between two words in an “is-a” hierarchy as the length of the shortest path connecting the two words. In this measure, the semantic distance is computed by counting the number of edges between concepts in the taxonomy. The experiments were conducted using the MeSH (Medical Subject Headings) biomedical ontology.
Figure 1. Fragment of the ICD-10 taxonomy [4].
Resnik [6] was the first to use information content to calculate semantic similarity in an “is-a” hierarchy. He proposed that the more information content two words share, the more similar they are. Following Resnik’s method, several other measures were proposed.
Lin [7] proposed a measure based on an ontology restricted to hierarchical links and a corpus. Like Resnik’s measure, it takes into account the information shared by two concepts, but the definition differs: it contains the same components as the Resnik measure, but the combination is a ratio, not a difference.
3. Semantic Similarity Measures for Single Ontology
In this paper, we focus only on semantic similarity measures that use an ontology as the primary information source. The main measures can be classified into structure-based measures and information content (IC) measures:
3.1. Ontology Structure-Based Similarity Measures
Most measures based on the structure of the ontology rely on the path length (shortest path) between the two concept nodes and on the depth of the concept nodes in the ontology/is-a hierarchy tree. Measures based on WordNet include path length, Wu & Palmer, Leacock & Chodorow, Resnik, and Lin [8] [9].
3.1.1. Path Length Based Measures (Shortest Path)
In this method, the similarity between concepts is determined by the path distance that separates them in the taxonomy or ontology structure. The distance between two concepts C1 and C2 is estimated as the length of the shortest path linking them:
$$\mathrm{dist}(c_1, c_2) = \min_i \left| \mathrm{path}_i(c_1, c_2) \right| \tag{1}$$
Also, a simple edge-counting measure was proposed by Rada [5]:
$$\mathrm{dist}_{\mathrm{Rada}}(c_1, c_2) = N_1 + N_2 \tag{2}$$
where:
N1 and N2 are the minimum numbers of taxonomical links from c1 and c2 to their LCS, respectively. The similarity between two concepts C1 and C2 can be formulated as follows:
$$\mathrm{Sim}(c_1, c_2) = 2 \cdot \mathit{Max} - \mathrm{Length}(c_1, c_2) \tag{3}$$
where:
Max is the maximum depth of the taxonomy.
Length(c1, c2) is the shortest path length between C1 and C2.
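The path-based quantities defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `PARENT` is a hypothetical fragment loosely modeled on the A00-A09 block rather than the full taxonomy of Figure 1, path length and depth use node counting, and the similarity is assumed to take the form 2 · Max − Length.

```python
# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}


def ancestors(node):
    """Chain from the node up to the root, the node itself included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def lcs(c1, c2):
    """Least common subsumer: the deepest node that subsumes both concepts."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def rada_distance(c1, c2):
    """Edge count N1 + N2 from each concept up to their LCS."""
    common = lcs(c1, c2)
    return ancestors(c1).index(common) + ancestors(c2).index(common)


def path_length(c1, c2):
    """Shortest path length with node counting (edges + 1)."""
    return rada_distance(c1, c2) + 1


# Maximum depth of the fragment, node counting (the root has depth 1).
MAX_DEPTH = max(len(ancestors(node)) for node in PARENT)


def path_similarity(c1, c2):
    """Assumed form of the path-length similarity: 2 * Max - Length."""
    return 2 * MAX_DEPTH - path_length(c1, c2)
```

In this fragment, the pairs (A00, A01) and (A00.0, A00.1) both have a node-counted path length of 3, so a purely path-based similarity cannot tell them apart.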
3.1.2. Wu and Palmer Measure
This similarity measure considers the position of C1 and C2 relative to the position of the most specific common concept C. C1 and C2 can share several parents through multiple paths. The most specific common concept is the closest common ancestor C (the common parent related to C1 and C2 by the minimum number of is-a links) [10].
$$\mathrm{Sim}_{\mathrm{W\&P}}(c_1, c_2) = \frac{2 N_3}{N_1 + N_2 + 2 N_3} \tag{4}$$
where:
N1 and N2 are the distances from the least common subsumer to concepts C1 and C2, respectively, and N3 is the depth of the least common subsumer. The least common subsumer, LCS(C1, C2), of two concept nodes C1 and C2 is the lowest node that is an ancestor of both C1 and C2. For example, in Figure 1, LCS(A00.0, A00.9) = A00 and LCS(A00.0, A09.0) = A00-A09. From our taxonomy (Figure 1), we can calculate the similarity between concepts C1 and C2 as follows:
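As an illustrative sketch, the Wu and Palmer measure can be computed in Python from N1, N2, and N3 as defined above, assuming the standard form 2·N3 / (N1 + N2 + 2·N3). The `PARENT` map is a hypothetical fragment loosely modeled on the A00-A09 block, not the full taxonomy of Figure 1, and depth is node-counted with the root at depth 1.

```python
# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}


def ancestors(node):
    """Chain from the node up to the root, the node itself included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth with node counting: the root has depth 1."""
    return len(ancestors(node))


def lcs(c1, c2):
    """Least common subsumer: the deepest node that subsumes both concepts."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def wu_palmer(c1, c2):
    """Assumed standard form: 2 * N3 / (N1 + N2 + 2 * N3)."""
    common = lcs(c1, c2)
    n1 = depth(c1) - depth(common)  # links from c1 up to the LCS
    n2 = depth(c2) - depth(common)  # links from c2 up to the LCS
    n3 = depth(common)              # depth of the LCS
    return 2 * n3 / (n1 + n2 + 2 * n3)
```

For the pair (A00.0, A00.1) in this fragment, N1 = N2 = 1 and N3 = 3, giving 6 / 8 = 0.75; a concept compared with itself gives 1.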
3.1.3. Leacock and Chodorow Measure
In this method, the similarity between two concepts is determined by finding the shortest path that connects the two concepts in the taxonomy/ontology. The similarity is calculated as the negative logarithm of this length, scaled by the depth of the taxonomy. The similarity between two concepts C1 and C2 can be formulated as follows [6]:
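$$\mathrm{Sim}_{\mathrm{LC}}(c_1, c_2) = -\log\left(\frac{\mathrm{length}(c_1, c_2)}{2 \cdot \mathit{max\_depth}}\right) \tag{5}$$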
where max_depth is the length of the longest of the shortest paths linking a concept to the concept that subsumes all others.
From our taxonomy (Figure 1), we can calculate the similarity between concepts C1 and C2 as follows:
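As an illustrative sketch, the Leacock and Chodorow measure can be computed in Python under the assumption that it takes the standard form −log(length / (2 · max_depth)) with node counting. The `PARENT` map is a hypothetical fragment loosely modeled on the A00-A09 block, not the full taxonomy of Figure 1.

```python
import math

# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}


def ancestors(node):
    """Chain from the node up to the root, the node itself included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def path_length(c1, c2):
    """Shortest path through the least common subsumer, node counting."""
    seen = set(ancestors(c1))
    common = next(node for node in ancestors(c2) if node in seen)
    return ancestors(c1).index(common) + ancestors(c2).index(common) + 1


# Longest root-to-node chain in the fragment, node counting.
MAX_DEPTH = max(len(ancestors(node)) for node in PARENT)


def leacock_chodorow(c1, c2):
    """Assumed standard form: -log(length / (2 * max_depth))."""
    return -math.log(path_length(c1, c2) / (2 * MAX_DEPTH))
```

Because the measure depends only on the path length and a global constant, the pairs (A00, A01) and (A00.0, A00.1) receive the same score in this fragment.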
3.2. Information Content (IC) Measures
Following the standard argumentation of information theory [Ross, 1976], the information content of a concept c can be quantified as the negative log likelihood [11] [12]:
$$IC(c) = -\log p(c) \tag{6}$$
From our taxonomy (Figure 1), we can calculate the information content of a concept as follows:
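As an illustrative sketch, information content can be estimated in Python by taking p(c) as the share of corpus occurrences subsumed by c. Both the `PARENT` fragment and the `FREQ` counts are hypothetical values invented for illustration; they are not taken from Figure 1 or from any corpus used in this paper.

```python
import math

# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}

# Hypothetical corpus frequencies (direct mentions of each code only).
FREQ = {"A00.0": 5, "A00.1": 3, "A00.9": 2, "A01": 6, "A09": 4}


def subsumed_count(concept):
    """Occurrences of the concept plus those of all of its descendants."""
    total = FREQ.get(concept, 0)
    for child, parent in PARENT.items():
        if parent == concept:
            total += subsumed_count(child)
    return total


def information_content(concept):
    """IC(c) = -log p(c), with p(c) the subsumed share of all occurrences."""
    total = subsumed_count("I")  # the root subsumes every occurrence
    return -math.log(subsumed_count(concept) / total)
```

With these counts, A00 subsumes 10 of the 20 occurrences, so IC(A00) = −log(0.5) = log 2, while the root subsumes everything and carries no information.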
3.2.1. Resnik Measure
In this measure, the similarity of two concepts (c1, c2) is defined as the Information Content (IC) of their LCS, as shown in the following Equation (7):
$$\mathrm{Sim}_{\mathrm{Res}}(c_1, c_2) = IC(\mathrm{LCS}(c_1, c_2)) \tag{7}$$
Where:
(8)
From our taxonomy (Figure 1), we can calculate the similarity between concepts C1 and C2 as follows:
Then:
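As an illustrative sketch, the Resnik measure can be computed in Python as the information content of the least common subsumer. The `PARENT` fragment and the `FREQ` counts are hypothetical values invented for illustration, not data from Figure 1 or from the experiments.

```python
import math

# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}

# Hypothetical corpus frequencies (direct mentions of each code only).
FREQ = {"A00.0": 5, "A00.1": 3, "A00.9": 2, "A01": 6, "A09": 4}


def ancestors(node):
    """Chain from the node up to the root, the node itself included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def lcs(c1, c2):
    """Least common subsumer: the deepest node that subsumes both concepts."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def subsumed_count(concept):
    """Occurrences of the concept plus those of all of its descendants."""
    return FREQ.get(concept, 0) + sum(
        subsumed_count(child) for child, parent in PARENT.items() if parent == concept
    )


def information_content(concept):
    """IC(c) = -log p(c), with p(c) the subsumed share of all occurrences."""
    return -math.log(subsumed_count(concept) / subsumed_count("I"))


def resnik(c1, c2):
    """Resnik similarity: the information content of LCS(c1, c2)."""
    return information_content(lcs(c1, c2))
```

Here the pair (A00.0, A00.1) shares the subsumer A00 and scores log 2, whereas (A00, A01) shares only the block A00-A09, which subsumes every occurrence in this toy corpus and therefore scores 0.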
3.2.2. Lin Similarity Measure
This measure depends on the relation between the information content (IC) of the LCS of two concepts and the sum of the information content of the individual concepts [7] [13]:
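$$\mathrm{Sim}_{\mathrm{Lin}}(c_1, c_2) = \frac{2 \cdot IC(\mathrm{LCS}(c_1, c_2))}{IC(c_1) + IC(c_2)} \tag{9}$$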
From our taxonomy (Figure 1), we can calculate the similarity between concepts C1 and C2 as follows:
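As an illustrative sketch, the Lin measure can be computed in Python as the ratio of twice the information content of the LCS to the summed information content of the two concepts. The `PARENT` fragment and the `FREQ` counts are hypothetical values invented for illustration, and returning 0.0 when both concepts carry no information is a convention chosen for this sketch.

```python
import math

# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}

# Hypothetical corpus frequencies (direct mentions of each code only).
FREQ = {"A00.0": 5, "A00.1": 3, "A00.9": 2, "A01": 6, "A09": 4}


def ancestors(node):
    """Chain from the node up to the root, the node itself included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def lcs(c1, c2):
    """Least common subsumer: the deepest node that subsumes both concepts."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def subsumed_count(concept):
    """Occurrences of the concept plus those of all of its descendants."""
    return FREQ.get(concept, 0) + sum(
        subsumed_count(child) for child, parent in PARENT.items() if parent == concept
    )


def information_content(concept):
    """IC(c) = -log p(c), with p(c) the subsumed share of all occurrences."""
    return -math.log(subsumed_count(concept) / subsumed_count("I"))


def lin(c1, c2):
    """Lin similarity: 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    denominator = information_content(c1) + information_content(c2)
    if denominator == 0:
        return 0.0  # assumption of this sketch: undefined case mapped to 0
    return 2 * information_content(lcs(c1, c2)) / denominator
```

Unlike the Resnik score, the result is normalized: a concept compared with itself scores 1, and pairs whose only shared subsumer carries no information score 0.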
3.3. Semantic Similarity in the Biomedical Domain
3.3.1. Rada Measure
Rada et al. [5] proposed semantic distance as a potential measure of semantic similarity between two concepts in MeSH and implemented the shortest path length measure, called CDist, based on the shortest distance between two concept nodes in the ontology. They evaluated CDist on the UMLS Metathesaurus (MeSH, SNOMED, ICD9) and compared the CDist similarity scores to human expert scores using correlation coefficients.
3.3.2. Pedersen Measure
Pedersen et al. [1] proposed semantic similarity and relatedness measures for the biomedical domain by applying a corpus-based context vector approach to measuring the similarity between concepts in SNOMED-CT. Their context vector approach is ontology-free but requires training text, for which they used text data from the Mayo Clinic corpus of medical notes.
3.3.3. Nguyen and Al-Mubaid Measure
Al-Mubaid & Nguyen [14] [15] proposed a measure that takes into account the depth of the least common subsumer (LCS) of the two concepts and the length of the shortest path between them. Higher similarity arises when the two concepts are at a lower level of the hierarchy. Their similarity measure is:
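$$\mathrm{Sim}(c_1, c_2) = \log_2\left(\left[L(c_1, c_2) - 1\right] \times \left[D - \mathrm{Depth}(\mathrm{LCS}(c_1, c_2))\right] + 2\right) \tag{10}$$
where:
Depth(LCS(c1, c2)) is the depth of LCS(c1, c2) using node counting.
L(c1, c2) is the shortest distance between c1 and c2.
D is the maximum depth of the taxonomy.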
The measure equals 1 when the two concept nodes are in the same cluster/ontology. The maximum value of this measure occurs when one of the concepts is the left-most leaf node and the other concept is a right-most leaf node in the tree.
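As an illustrative sketch, the measure can be computed in Python under the assumption that it takes the commonly cited form log2([L − 1] × [D − Depth(LCS)] + 2), with node counting for both path length and depth. The `PARENT` map is a hypothetical fragment loosely modeled on the A00-A09 block, not the full taxonomy of Figure 1, so the values it produces are not those of Table 1.

```python
import math

# Hypothetical child -> parent fragment (an assumption, not Figure 1 itself).
PARENT = {
    "I": None, "A00-A09": "I",
    "A00": "A00-A09", "A01": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}


def ancestors(node):
    """Chain from the node up to the root, the node itself included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth with node counting: the root has depth 1."""
    return len(ancestors(node))


def lcs(c1, c2):
    """Least common subsumer: the deepest node that subsumes both concepts."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


# D: maximum depth of the fragment, node counting.
MAX_DEPTH = max(depth(node) for node in PARENT)


def al_mubaid_nguyen(c1, c2):
    """Assumed form: log2((L - 1) * (D - Depth(LCS)) + 2), node counting."""
    common = lcs(c1, c2)
    length = ancestors(c1).index(common) + ancestors(c2).index(common) + 1
    return math.log2((length - 1) * (MAX_DEPTH - depth(common)) + 2)
```

Under this assumed form the value behaves like a distance: a concept compared with itself gives the minimum of 1, and for equal path lengths the pair with the deeper LCS receives the smaller value.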
Figure 2 shows that the path length between “Cholera [A00]” and “Typhoid and paratyphoid fevers [A01]” is 3 using node counting. The path length between “Cholera due to Vibrio cholerae 01, biovar cholerae [A00.0]” and “Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]” is also 3. Thus, the similarity in these two cases is the same by the path length measure. However, the similarity between “Cholera [A00]” and “Typhoid and paratyphoid fevers [A01]” is less than the similarity between “Cholera due to Vibrio cholerae 01, biovar cholerae [A00.0]” and “Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]”, as the latter two concepts lie at a lower level in the hierarchy tree and share more information. Table 1 shows that path length (P.L.), Wu & Palmer, and Leacock & Chodorow (L.C.) produce the same semantic similarity for the two pairs (A00, A09) and (A00.1, A00.9), whereas the Al-Mubaid & Nguyen measure gives a higher similarity (3.0) for the pair (A00.1, A00.9), as it occurs lower down in the ontology hierarchy than (A00, A09), which received the lower similarity (1.0). Recall that, in the Al-Mubaid & Nguyen measure, Equation (10), the higher the numeric result for (c1, c2), the lower the semantic similarity between (c1, c2). In the Wu & Palmer measure, the path length between two concepts is not used, only the depths of concepts; consequently, its performance is lower than that of the Al-Mubaid & Nguyen method [15].
4. Experiments and Results
4.1. Datasets
In the biomedical domain, there are no standard human rating sets of terms/concepts for semantic similarity and relatedness like the M & C or R & G sets for general English [16]. To compare methods, we used the set of 30 concept pairs from Pedersen, Pakhomov, & Patwardhan (2005) [1], which was annotated by 3 physicians and 9 medical index experts. Each pair was annotated on a 4-point scale: “practically synonymous, related, marginally related, and unrelated.” The average correlation between physicians is 0.68, and between experts is 0.78.
In this paper, we examine only ontology-only techniques, and we use ICD10 as the ontology instead of MeSH. We could find only 21 of the 30 concept pairs in ICD10 using the ICD-10 Version: 2010 browser (http://apps.who.int/classifications/icd10/browse/2010/en), as some terms could not be found, so we used 21 pairs in the experiments (Pedersen et al. [1] tested 29 of the 30 concept pairs, as one pair was not found in SNOMED-CT). The concept pairs in bold in Table 2 are those containing a term that was not found in ICD10; we did not include them in our experiments.
4.2. Experiments and Results
We implemented the Al-Mubaid & Nguyen similarity measure and compared it with four other ontology-based semantic similarity measures. All the measures use node counting for path length and for the depth of concept nodes. For pairs in which a term belongs to more than one category tree, we take into account only its position(s) in the same category as the other term. Table 3 shows, for the five measures, the correlation with the human ratings of physicians and experts, with the ranks in parentheses. These correlation values show that the Al-Mubaid & Nguyen method is ranked #1 in correlation relative to the experts’ judgments, whereas relative to the physicians’ judgments it is ranked second. The expert scores are more reliable, because the correlation among the expert scores (0.78) is higher than that among the physicians (0.68), and there are more experts than physicians (3 physicians and 9 experts).
Table 2. The test set of 30 medical term pairs, sorted in the order of the averaged physicians’ scores.
Table 3. Absolute values of correlation of the five measures relative to human judgments.
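The evaluation step, correlating each measure’s scores with the human ratings and reporting the absolute value, can be sketched in Python as follows. The Pearson coefficient is assumed here as the correlation statistic, and the two score lists are hypothetical numbers invented for illustration; they are not the values of Table 2 or Table 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical averaged human ratings (4 = practically synonymous).
human_ratings = [4.0, 3.0, 2.0, 1.0]

# Hypothetical scores of a distance-like measure for the same four pairs.
measure_scores = [1.0, 2.0, 2.6, 3.0]

# A distance-like measure correlates negatively with human similarity
# ratings, so the absolute value is what gets compared across measures.
absolute_correlation = abs(pearson(human_ratings, measure_scores))
```

Taking the absolute value puts distance-like measures, for which larger values mean less similar, on the same footing as similarity measures when ranking them against human judgments.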
5. Conclusion and Future Works
We have compared an ontology-based semantic similarity measure with four other ontology-based measures. The experiments presented in this paper indicate the superiority of the Al-Mubaid & Nguyen method relative to human judgments when compared with the other ontology-based measures. In future work, we intend to experiment with applications of semantic relatedness measures to NLP tasks such as word sense discrimination, information retrieval, and spelling correction in the biomedical domain. We will further use that set to compare taxonomies as well as to calculate the semantic similarity of two concepts within and across UMLS terminology sources. Finally, we plan to implement a web-based user interface for all these semantic similarity measures and to make it freely available to researchers over the Internet, which should be helpful for researchers in the biomedical field.