• DocumentCode
    1816009
  • Title

    Down on the OCR farm: how we produced searchable PDFs for 7 million documents in a student computer lab

  • Author

    Mason, Robert ; Schmidt, Heidi ; Trott, Richard

  • Author_Institution
    UCSF Libr./CKM, San Francisco, CA
  • fYear
    2005
  • fDate
    7-11 June 2005
  • Firstpage
    391
  • Lastpage
    391
  • Abstract
    The Legacy Tobacco Documents Library began with tobacco industry documents released to the public under the terms of the Master Settlement Agreement between the United States Attorneys General and five USA tobacco companies. Under terms of the agreement, approximately four million documents in digital format were produced in 2000. Roughly another three million have been produced since then and added to the collection. The documents were produced as TIF and PDF files. They were accompanied by some descriptive metadata and can be retrieved by elements such as date of production, title, author and document type. Unfortunately, the metadata varies between companies, is inconsistent over time and was produced without authority control or keyword indexing. From the initial planning stages, the value of full-text searching of Legacy library documents was recognized, but it was prohibited by constraints on staff, software and system resources. Using idle workstations in a student computer laboratory, we generated 7 million searchable PDF documents from 42 million TIF page images.
  • Keywords
    document handling; file organisation; full-text databases; law administration; meta data; optical character recognition; tobacco industry; Legacy Tobacco Documents Library; Master Settlement Agreement; OCR; United States Attorneys General; descriptive metadata; full-text searching; idle workstations; searchable PDF; student computer laboratory; Distributed computing; Grid computing; Image databases; Image recognition; Legged locomotion; Optical character recognition software; Optical computing; Software libraries; Software systems; Workstations; text-searching
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Libraries, 2005. JCDL '05. Proceedings of the 5th ACM/IEEE-CS Joint Conference on
  • Conference_Location
    Denver, CO
  • Print_ISBN
    1-58113-876-8
  • Type

    conf

  • DOI
    10.1145/1065385.1065494
  • Filename
    4118594