  • DocumentCode
    1206109
  • Title
    Noisy text categorization
  • Author
    Vinciarelli, Alessandro
  • Author_Institution
    IDIAP Res. Inst., Martigny, Switzerland
  • Volume
    27
  • Issue
    12
  • fYear
    2005
  • Firstpage
    1882
  • Lastpage
    1895
  • Abstract
    This work presents categorization experiments performed over noisy texts. By noisy, we mean any text obtained through an extraction process (affected by errors) from media other than digital texts (e.g., transcriptions of speech recordings extracted with a recognition system). The performance of a categorization system over the clean and noisy (word error rate between ∼10 and ∼50 percent) versions of the same documents is compared. The noisy texts are obtained through handwriting recognition and simulation of optical character recognition. The results show that the performance loss is acceptable for recall values up to 60-70 percent depending on the noise sources. New measures of the extraction process performance, allowing a better explanation of the categorization results, are proposed.
  • Keywords
    feature extraction; handwriting recognition; image denoising; optical character recognition; text analysis; extraction process; noisy text categorization; word error rate; Digital recording; Error analysis; Optical character recognition software; Optical noise; Optical recording; Speech processing; Speech recognition; Text categorization; Text recognition; indexing; noisy text; offline cursive handwriting recognition; Algorithms; Artificial Intelligence; Automatic Data Processing; Documentation; Handwriting; Image Enhancement; Image Interpretation, Computer-Assisted; Information Storage and Retrieval; Models, Statistical; Pattern Recognition, Automated; Reading; Stochastic Processes
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type
    jour
  • DOI
    10.1109/TPAMI.2005.248
  • Filename
    1524982