• DocumentCode
    660642
  • Title
    Information Extraction for Computer Science Academic Rankings System
  • Author
    Chengkai Shi ; Jiahui Quan ; Minglu Li
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Shanghai Jiao Tong Univ., Shanghai, China
  • fYear
    2013
  • fDate
    4-6 Nov. 2013
  • Firstpage
    69
  • Lastpage
    76
  • Abstract
    Academic ranking for computer science is currently a prominent and important problem. This paper introduces the Computer Science Academic Rankings System (CSAR), which aims at extracting, mining, and ranking academic information. We mainly present approaches for information extraction and normalization in CSAR. For semi-structured and unstructured web pages such as paper-view pages, we propose a method that combines a natural language processing n-gram model with web grammar. We generate an optimal matching on a bipartite graph to extract author and organization information with maximum likelihood. CSAR also uses the KM algorithm and the Hungarian algorithm to find the correspondence between authors and emails. For information normalization, we use an n-gram model, the EM algorithm, and a trigram model with linear interpolation to construct a part-of-speech tagger, which is then used to extract useful information from web sources. The TF-IDF model and string edit distance are then applied to normalize organization names. In experiments, the proposed approaches achieve high accuracy and large improvements in academic information extraction.
  • Keywords
    computer science education; expectation-maximisation algorithm; graph theory; information retrieval; natural language processing; CSAR system; Hungarian algorithm; KM algorithm; TF-IDF model; Web grammar; Web pages; computer science academic rankings system; expectation-maximization algorithm; information extraction; information normalization; linear interpolation; maximum likelihood estimation; natural language processing n-gram model; optimal matching bipartite graph; paper-view pages; part-of-speech tagger; term frequency-inverse document frequency model; trigram model; Bipartite graph; Data mining; Electronic mail; Grammar; Organizations; Social network services; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud and Service Computing (CSC), 2013 International Conference on
  • Conference_Location
    Beijing
  • Type
    conf
  • DOI
    10.1109/CSC.2013.19
  • Filename
    6693181