Abstract:
This paper describes a study of the computer essay-scoring program BETSY. While the use of computers in rating written scripts has been criticised in some quarters for lacking transparency or for failing to fit with how human raters rate written scripts, a number of essay-rating programs are available commercially, many of which claim to offer reliability comparable with that of human raters. Much of the validation of such programs has focused on native-speaking tertiary-level students writing in subject content areas. In contrast, the data for this study are drawn from a representative sample of scripts from an English as a second language (ESL) Year 11 public examination in Hong Kong. The scripts (900 in total) are taken from a writing test consisting of three topics (300 scripts per topic), each representing a different genre. Results of the study show good correlations between human raters’ scores and those of the program BETSY. A rater discrepancy rate, where scripts need to be re-marked because of disagreement between two raters, emerged at levels broadly comparable with those derived from discrepancies between paired human raters. Little difference was apparent in the ratings of test takers across the three genres. The paper concludes that while computer essay-scoring programs may appear to rate inside a ‘black box’, with a concomitant lack of transparency, they do have potential to act as a third-rater, time-saving assessment tool. As technology develops and rating becomes more transparent, so will their acceptability.
Keywords:
assessment, computer scoring, writing, English language