• DocumentCode
    2530319
  • Title

    Citation recognition for scientific publications in digital libraries

  • Author

    Besagni, Dominique ; Belaïd, Abdel

  • Author_Institution
    URI, INIST, Vandoeuvre-les-Nancy, France
  • fYear
    2004
  • fDate
    2004
  • Firstpage
    244
  • Lastpage
    252
  • Abstract
    A method based on part-of-speech (PoS) tagging is used for structuring bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structures are heterogeneous, the method works bottom-up, without an a priori model, assembling structural elements from basic tags into subfields and fields. Significant tags are first grouped into homogeneous classes according to their categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabeled tokens are integrated into one field or another either by applying PoS correction rules or by using an inter- or intra-field model generated from well-detected records. The prototype performs satisfactorily across different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of the 2,575 references are completely segmented.
  • Keywords
    bibliographic systems; citation analysis; digital libraries; document image processing; optical character recognition; publishing; scientific information systems; ASCII file; OCR; bibliographic reference structure; character recognition qualities; citation recognition; nonlabeled tokens; part-of-speech tagging; scientific publications; Bibliometrics; Character recognition; Information analysis; Optical character recognition software; Production; Prototypes; Software libraries; Tagging; Turning; Watches
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on
  • Print_ISBN
    0-7695-2088-X
  • Type

    conf

  • DOI
    10.1109/DIAL.2004.1263253
  • Filename
    1263253
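
The bottom-up grouping described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the tag names, the `TAG_TO_FIELD` mapping, the `group_fields` function, and the rule that an unlabeled token joins the neighbouring field are all assumptions made for illustration, standing in for the PoS correction rules and the inter- or intra-field model that the record does not detail.

```python
from itertools import groupby

# Hypothetical mapping from low-level tags to record fields.
# The tag set is an assumption, not taken from the paper.
TAG_TO_FIELD = {
    "INITIAL": "authors",
    "NAME": "authors",
    "WORD": "title",
    "YEAR": "date",
}


def group_fields(tagged_tokens):
    """Group (token, tag) pairs into (field, text) segments, bottom-up.

    Tokens whose tag is unknown (e.g. None) are attached to the preceding
    field, or to the following field if they open the reference. This
    neighbour rule is a simplification assumed for this sketch.
    """
    tokens = [token for token, _ in tagged_tokens]
    labels = [TAG_TO_FIELD.get(tag) for _, tag in tagged_tokens]

    # Forward pass: an unlabeled token inherits the field on its left.
    for i in range(1, len(labels)):
        if labels[i] is None:
            labels[i] = labels[i - 1]

    # Backward pass: leading unlabeled tokens inherit the field on their right.
    for i in range(len(labels) - 2, -1, -1):
        if labels[i] is None:
            labels[i] = labels[i + 1]

    # Merge runs of adjacent tokens that share a field label.
    segments = []
    for field, run in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        segments.append((field, " ".join(token for _, token in run)))
    return segments


# Made-up reference tokens, used only to exercise the function.
example = [
    ("J.", "INITIAL"),
    ("Smith", "NAME"),
    ("Parsing", "WORD"),
    ("of", None),
    ("references", "WORD"),
    ("2004", "YEAR"),
]
segments = group_fields(example)
```

Here the untagged token "of" is absorbed into the surrounding title run, so the six tokens collapse into three segments, one each for authors, title, and date.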