Title :
Down on the OCR farm: how we produced searchable PDFs for 7 million documents in a student computer lab
Author :
Mason, Robert ; Schmidt, Heidi ; Trott, Richard
Author_Institution :
UCSF Libr./CKM, San Francisco, CA
Abstract :
The Legacy Tobacco Documents Library began with tobacco industry documents released to the public under the terms of the Master Settlement Agreement between the United States Attorneys General and five U.S. tobacco companies. Under the terms of the agreement, approximately four million documents in digital format were produced in 2000. Roughly another three million have been produced since then and added to the collection. The documents were produced as TIF and PDF files. They were accompanied by some descriptive metadata and can be retrieved by elements such as date of production, title, author and document type. Unfortunately, the metadata varies between companies, is inconsistent over time and was produced without authority control or keyword indexing. From the initial planning stages, the value of full-text searching of Legacy library documents was recognized, but it was precluded by constraints on staff, software and system resources. Using idle workstations in a student computer laboratory, we generated 7 million searchable PDF documents from 42 million TIF page images.
Keywords :
document handling; file organisation; full-text databases; law administration; meta data; optical character recognition; tobacco industry; Legacy Tobacco Documents Library; Master Settlement Agreement; OCR; United States Attorneys General; descriptive metadata; full-text searching; idle workstations; searchable PDF; student computer laboratory; Distributed computing; Grid computing; Image databases; Image recognition; Legged locomotion; Optical character recognition software; Optical computing; Software libraries; Software systems; Workstations; text-searching
Conference_Title :
Digital Libraries, 2005. JCDL '05. Proceedings of the 5th ACM/IEEE-CS Joint Conference on
Conference_Location :
Denver, CO
Print_ISBN :
1-58113-876-8
DOI :
10.1145/1065385.1065494