• DocumentCode
    1909427
  • Title
    High accuracy citation extraction and named entity recognition for a heterogeneous corpus of academic papers
  • Author
    Powley, Brett; Dale, Robert
  • Author_Institution
    Centre for Language Technology, Macquarie University, Sydney, NSW
  • fYear
    2007
  • fDate
    Aug. 30 2007-Sept. 1 2007
  • Firstpage
    119
  • Lastpage
    124
  • Abstract
    Citation indices are increasingly being used not only as navigational tools for researchers, but also as the basis for measurement of academic performance and research impact. This means that the reliability of tools used to extract citations and construct such indices is becoming more critical; however, existing approaches to citation extraction still fall short of the high accuracy required if critical assessments are to be based on them. In this paper, we present techniques for high accuracy extraction of citations from academic papers, designed for applicability across a broad range of disciplines and document styles. We integrate citation extraction, reference parsing, and author named entity recognition to significantly improve performance in citation extraction, and demonstrate this performance on a cross-disciplinary heterogeneous corpus. Applying our algorithm to previously unseen documents, we demonstrate high F-measure performance of 0.98 for author named entity recognition and 0.97 for citation extraction.
  • Keywords
    citation analysis; information retrieval; text analysis; academic papers; document styles; heterogeneous corpus; high accuracy citation extraction; named entity recognition; reference parsing; textual citation indices; Australia; Automation; Citation analysis; Computer science; Data mining; Hidden Markov models; Intersymbol interference; Navigation; Paper technology; Prototypes
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2007. NLP-KE 2007. International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-1610-3
  • Electronic_ISBN
    978-1-4244-1611-0
  • Type
    conf
  • DOI
    10.1109/NLPKE.2007.4368021
  • Filename
    4368021