Title of article :
Predicting Reading Difficulty With Statistical Language Models
Author/Authors :
Kevyn Collins-Thompson and Jamie Callan
Issue Information :
Monthly, serial issue, 2005
Abstract :
A potentially useful feature of information retrieval systems
for students is the ability to identify documents
that not only are relevant to the query but also match the
student’s reading level. Manually obtaining an estimate
of reading difficulty for each document is not feasible for
very large collections, so we require an automated technique.
Traditional readability measures, such as the
widely used Flesch-Kincaid measure, are simple to
apply but perform poorly on Web pages and other nontraditional
documents. This work focuses on building a
broadly applicable statistical model of text for different
reading levels that works for a wide range of documents.
To do this, we recast the well-studied problem of readability
in terms of text categorization and use straightforward
techniques from statistical language modeling.
We show that with a modified form of text categorization,
it is possible to build generally applicable classifiers
with relatively little training data. We apply this method
to the problem of classifying Web pages according to
their reading difficulty level and show that by using a
mixture model to interpolate evidence of a word’s frequency
across grades, it is possible to build a classifier
that achieves an average root mean squared error of
between one and two grade levels for 9 of 12 grades.
Such classifiers have very efficient implementations and
can be applied in many different scenarios. The models
can be varied to focus on smaller or larger grade ranges
or easily retrained for a variety of tasks or populations.
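
The abstract's central idea, grade-specific language models whose word frequencies are interpolated across grades, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Gaussian weighting over neighbouring grades, the additive smoothing constant, and all function and parameter names (`train_grade_models`, `smoothed_prob`, `predict_grade`, `rmse`, `width`, `alpha`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_grade_models(docs_by_grade):
    """Build one unigram count table per grade from {grade: [text, ...]}."""
    return {
        grade: Counter(w for doc in docs for w in doc.lower().split())
        for grade, docs in docs_by_grade.items()
    }


def smoothed_prob(word, grade, counts, vocab_size, width=1.0, alpha=0.1):
    """Mix a word's probability across grades, weighting nearby grades more.

    The Gaussian kernel and additive smoothing are assumptions of this
    sketch, not the smoothing scheme described in the paper.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other_grade, table in counts.items():
        weight = math.exp(-((other_grade - grade) ** 2) / (2 * width ** 2))
        total_tokens = sum(table.values())
        prob = (table[word] + alpha) / (total_tokens + alpha * vocab_size)
        weighted_sum += weight * prob
        weight_total += weight
    return weighted_sum / weight_total


def predict_grade(text, counts, width=1.0, alpha=0.1):
    """Return the grade whose smoothed model gives the text the highest log-likelihood."""
    vocab = set().union(*counts.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    words = text.lower().split()

    def log_likelihood(grade):
        return sum(
            math.log(smoothed_prob(w, grade, counts, vocab_size, width, alpha))
            for w in words
        )

    return max(counts, key=log_likelihood)


def rmse(predicted, actual):
    """Root mean squared error between predicted and true grade levels."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Toy training data, invented purely for this example.
toy_docs = {
    1: ["the cat sat on the mat", "the dog ran"],
    6: ["the experiment measured temperature", "scientists measured the reaction"],
    12: ["the hypothesis necessitates empirical validation"],
}
models = train_grade_models(toy_docs)
easy_grade = predict_grade("the cat ran", models)
harder_grade = predict_grade("scientists measured temperature", models)
```

A larger `width` lets evidence from more distant grades contribute to each grade's model, which is one way to compensate for sparse training data at a given level; the error measure reported in the abstract corresponds to `rmse` applied to predicted versus labeled grades.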
Journal title :
Journal of the American Society for Information Science and Technology