Regional Information Center for Science and Technology - Exploiting information content and semantics to accurately compute similarity of GO-based annotated entities

Abstract:

Gene Ontology (GO) annotations encode scientific knowledge that states the properties of GO-based annotated entities, e.g., proteins, enzymes, or genes. On the one hand, taxonomic knowledge, which is encoded in class hierarchies, states the abstract types of the annotations. On the other hand, the semantics of the object properties, expressed as logic axioms, determines the facts, inferred or stated, in which the annotations participate in the ontology, i.e., the annotation neighborhood. Further, the informativeness of a term in a corpus characterizes the specificity of the GO terms used in a dataset of GO-based annotated entities. We hypothesize that the combination of all these characteristics provides the basis for an accurate estimation of the similarity between GO-based annotated entities; we thus devise a novel semantic similarity measure named IC-OnSim. IC-OnSim treats informativeness, the semantics encoded in the ontology, and the taxonomic information as first-class citizens, and differentiates annotations that are taxonomically similar but that either are not equally informative or have different neighborhoods in GO. IC-OnSim determines semantics based on the neighborhood facts of the corresponding GO terms, i.e., the object properties in which these GO facts participate and the justifications that support the entailment of these facts; further, the informativeness of a term is measured as Information Content (IC). We empirically study the performance of IC-OnSim on benchmarks of GO-based annotated proteins published by the Collaborative Evaluation of Semantic Similarity Measures (CESSM) tool. In the evaluation, IC-OnSim exhibits the highest Pearson's correlation coefficients with respect to the gold standard similarity measures Sequence Similarity and Pfam. These results support our hypothesis that the combination of the properties of the GO terms allows for the precise computation of the relatedness of GO-based annotated entities.
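To make the notion of informativeness concrete, the following is a minimal sketch of a corpus-based Information Content computation. It assumes the commonly used definition IC(t) = -log2 p(t), where p(t) is the fraction of annotated entities that carry term t or one of its descendants; the abstract does not state which IC formula IC-OnSim uses, so this is an illustration rather than the authors' method. The hierarchy, the term names, and the protein identifiers are hypothetical toy data, not real GO content.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (assumption, not real GO data): term -> parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
}

# Hypothetical annotation dataset (assumption): entity -> directly annotated terms.
ANNOTATIONS = {
    "P1": {"ion_binding"},
    "P2": {"binding"},
    "P3": {"catalysis"},
    "P4": {"ion_binding", "catalysis"},
}


def ancestors_or_self(term):
    """Return the term together with all of its ancestors in PARENTS."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content(annotations):
    """Map each used term t to log2(N / n_t), which equals -log2 p(t).

    N is the number of entities; n_t is the number of entities annotated
    with t directly or through one of its descendants.
    """
    counts = Counter()
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors_or_self(term)
        # Each entity contributes at most once per term.
        counts.update(covered)
    total = len(annotations)
    return {term: math.log2(total / count) for term, count in counts.items()}


IC = information_content(ANNOTATIONS)
```

In this toy dataset the root term covers every entity and therefore carries no information, while the more specific `ion_binding` term has a higher IC than its parent `binding`. This is the specificity effect that the abstract refers to when it distinguishes annotations that are taxonomically similar but not equally informative.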