Abstract:
We consider the general problem of converting documents available in print-ready or image format into a structured format that reflects their logical structure. One aspect of the problem is reconstructing conventional constructs such as titles, headings, captions, and footnotes. In practice, another important aspect is putting in place an automated Quality Assessment (QA) method. We propose a method to automate QA for a homogeneous collection by considering multiple documents at once instead of focusing only on the document being processed.