New software described in this New York Times story allows teachers to leave essay grading to the computer. It was developed by EdX, the nonprofit organization that was founded jointly by Harvard University and the Massachusetts Institute of Technology and that will give the software to other schools for free. The story says that the software “uses artificial intelligence to grade student essays and short written answers.”
Multiple-choice exams have, of course, been graded by machine for a long time, but essays are another matter. Can a computer program accurately capture the depth, beauty, structure, relevance, creativity etc., of an essay? The National Council of Teachers of English says, unequivocally, “no” in its newest position statement. Here it is:
Machine Scoring Fails the Test
[A] computer could not measure accuracy, reasoning, adequacy of evidence, good sense, ethical stance, convincing argument, meaningful organization, clarity, and veracity in your essay. If this is true I don’t believe a computer would be able to measure my full capabilities and grade me fairly. — Akash, student
[H]ow can the feedback a computer gives match the carefully considered comments a teacher leaves in the margins or at the end of your paper? — Pinar, student
(Responses to New York Times The Learning Network blog post, “How Would You Feel about a Computer Grading Your Essays?”, 5 April 2013)
Writing is a highly complex ability developed over years of practice, across a wide range of tasks and contexts, and with copious, meaningful feedback. Students must have this kind of sustained experience to meet the demands of higher education, the needs of a 21st-century workforce, the challenges of civic participation, and the realization of full, meaningful lives.
As the Common Core State Standards (CCSS) sweep into individual classrooms, they bring with them a renewed sense of the importance of writing to students’ education. Writing teachers have found many aspects of the CCSS to applaud; however, we must be diligent in developing assessment systems that do not threaten the possibilities for the rich, multifaceted approach to writing instruction advocated in the CCSS. Effective writing assessments need to account for the nature of writing, the ways students develop writing ability, and the role of the teacher in fostering that development.
Research1 on the assessment of student writing consistently shows that high-stakes writing tests alter the normal conditions of writing by denying students the opportunity to think, read, talk with others, address real audiences, develop ideas, and revise their emerging texts over time. Often, the results of such tests can affect the livelihoods of teachers, the fate of schools, or the educational opportunities for students.
In such conditions, the narrowly conceived, artificial form of the tests begins to subvert attention to other purposes and varieties of writing development in the classroom. Eventually, the tests erode the foundations of excellence in writing instruction, resulting in students who are less prepared to meet the demands of their continued education and future occupations. Especially in the transition from high school to college, students are ill served when their writing experience has been dictated by tests that ignore the ever-more complex and varied types and uses of writing found in higher education.
Note: (1) All references to research are supported by the extensive work documented in the annotated bibliography attached to this report.
These concerns — increasingly voiced by parents, teachers, school administrators, students, and members of the general public — are intensified by the use of machine-scoring systems to read and evaluate students’ writing. To meet the outcomes of the Common Core State Standards, various consortia, private corporations, and testing agencies propose to use computerized assessments of student writing. The attraction is obvious: once programmed, machines might reduce the costs otherwise associated with the human labor of reading, interpreting, and evaluating the writing of our students. Yet when we consider what is lost because of machine scoring, the presumed savings turn into significant new costs — to students, to our educational institutions, and to society. Here’s why:
- Computers are unable to recognize or judge those elements that we most associate with good writing (logic, clarity, accuracy, ideas relevant to a specific topic, innovative style, effective appeals to audience, different forms of organization, types of persuasion, quality of evidence, humor or irony, and effective uses of repetition, to name just a few). Using computers to “read” and evaluate students’ writing (1) denies students the chance to have anything but limited features recognized in their writing; and (2) compels teachers to ignore what is most important in writing instruction in order to teach what is least important.
- Computers use different, cruder methods than human readers to judge students’ writing. For example, some systems gauge the sophistication of vocabulary by measuring the average length of words and how often the words are used in a corpus of texts; or they gauge the development of ideas by counting the length and number of sentences per paragraph.
- Computers are programmed to score papers written to very specific prompts, reducing the incentive for teachers to develop innovative and creative occasions for writing, even for assessment.
- Computers get progressively worse at scoring as the length of the writing increases, compelling test makers to design shorter writing tasks that don’t represent the range and variety of writing assignments needed to prepare students for the more complex writing they will encounter in college.
- Computer scoring favors the most objective, “surface” features of writing (grammar, spelling, punctuation), but problems in these areas are often created by the testing conditions and are the most easily rectified in normal writing conditions when there is time to revise and edit. Privileging surface features disproportionately penalizes nonnative speakers of English who may be on a developmental path that machine scoring fails to recognize.
- Conclusions that computers can score as well as humans are the result of humans being trained to score like the computers (for example, being told not to make judgments on the accuracy of information).
- Computer scoring systems can be “gamed” because they are poor at working with human language, further weakening the validity of their assessments and separating students not on the basis of writing ability but on whether they know and can use machine-tricking strategies.
- Computer scoring discriminates against students who are less familiar with using technology to write or complete tests. Further, machine scoring disadvantages school districts that lack funds to provide technology tools for every student and skews technology acquisition toward devices needed to meet testing requirements.
- Computer scoring removes the purpose from written communication — to create human interactions through a complex, socially consequential system of meaning making — and sends a message to students that writing is not worth their time because reading it is not worth the time of the people teaching and assessing them.
What Are the Alternatives?
Together with other professional organizations, the National Council of Teachers of English has established research-based guidelines for effective teaching and assessment of writing, such as the Standards for the Assessment of Reading and Writing (rev. ed., 2009), the Framework for Success in Postsecondary Writing (2011), the NCTE Beliefs about the Teaching of Writing (2004), and the Framework for 21st Century Curriculum and Assessment (2008, 2013). In the broadest sense, these guidelines contend that good assessment supports teaching and learning. Specifically, high-quality assessment practices will
- encourage students to become engaged in literacy learning, to reflect on their own reading and writing in productive ways, and to set respective literacy goals;
- yield high-quality, useful information to inform teachers about curriculum, instruction, and the assessment process itself;
- balance the need to assess summatively (make final judgments about the quality of student work) with the need to assess formatively (engage in ongoing, in-process judgments about what students know and can do, and what to teach next);
- recognize the complexity of literacy in today’s society and reflect that richness through holistic, authentic, and varied writing instruction;
- at their core, involve professionals who are experienced in teaching writing, knowledgeable about students’ literacy development, and familiar with current research in literacy education.
A number of effective practices enact these research-based principles, including portfolio assessment; teacher assessment teams; balanced assessment plans that involve more localized (classroom- and district-based) assessments designed and administered by classroom teachers; and “audit” teams of teachers, teacher educators, and writing specialists who visit districts to review samples of student work and the curriculum that has yielded them. We focus briefly here on portfolios because of the extensive scholarship that supports them and the positive experience that many educators, schools, and school districts have had with them.
Engaging teams of teachers in evaluating portfolios at the building, district, or state level has the potential to honor the challenging expectations of the CCSS while also reflecting what we know about effective assessment practices. Portfolios offer the opportunity to
- look at student writing across multiple events, capturing growth over time while avoiding the limitations of “one test on one day”;
- look at the range of writing across a group of students while preserving the individual character of each student’s writing;
- review student writing through multiple lenses, including content accuracy and use of resources;
- assess student writing in the context of local values and goals as well as national standards.
Just as portfolios provide multiple types of data for assessment, they also allow students to learn as a result of engaging in the assessment process, something seldom associated with more traditional one-time assessments. Students gain insight about their own writing, about ways to identify and describe its growth, and about how others — human readers — interpret their work. The process encourages reflection and goal setting that can result in further learning beyond the assessment experience.
Similarly, teachers grow as a result of administering and scoring the portfolio assessments, something seldom associated with more traditional one-time assessments. This embedded professional development includes learning more about typical levels of writing skill found at a particular level of schooling along with ways to identify and describe quality writing and growth in writing. The discussions about collections of writing samples and criteria for assessing the writing contribute to a shared investment among all participating teachers in the writing growth of all students.
Further, when the portfolios include a wide range of artifacts from learning and writing experiences, teachers assessing the portfolios learn new ideas for classroom instruction as well as ways to design more sophisticated methods of assessing student work on a daily basis.
Several states such as Kentucky, Nebraska, Vermont, and California have experimented with the development of large-scale portfolio assessment projects that make use of teams of teachers working collaboratively to assess samples of student work. Rather than investing heavily in assessment plans that cannot meet the goals of the CCSS, various legislative groups, private companies, and educational institutions could direct those funds into refining these nascent portfolio assessment systems. This investment would also support teacher professional development and enhance the quality of instruction in classrooms — something that machine-scored writing prompts cannot offer.
In 2010, the federal government awarded $330 million to two consortia of states “to provide ongoing feedback to teachers during the course of the school year, measure annual school growth, and move beyond narrowly focused bubble tests” (United States Department of Education).
Further, these assessments will need to align to the new standards for learning in English and mathematics. This has proven to be a formidable task, but it is achievable. By combining the already existing National Assessment of Educational Progress (NAEP) assessment structures for evaluating school system performance with ongoing portfolio assessment of student learning by educators, we can cost-effectively assess writing without relying on flawed machine-scoring methods. By doing so, we can simultaneously deepen student and educator learning while promoting grass-roots innovation at the classroom level. For a fraction of the cost in time and money of building a new generation of machine assessments, we can invest in rigorous assessment and teaching processes that enrich, rather than interrupt, high-quality instruction. Our students and their families deserve it, the research base supports it, and literacy educators and administrators will welcome it.
The current trend toward machine-scoring of student work, Ericsson and Haswell argue, has created an emerging issue with implications for higher education across the disciplines, but with particular importance for those in English departments and in administration. The academic community has been silent on the issue-some would say excluded from it-while the commercial entities who develop essay-scoring software have been very active.
Machine Scoring of Student Essaysis the first volume to seriously consider the educational mechanisms and consequences of this trend, and it offers important discussions from some of the leading scholars in writing assessment.
Reading and evaluating student writing is a time-consuming process, yet it is a vital part of both student placement and coursework at post-secondary institutions. In recent years, commercial computer-evaluation programs have been developed to score student essays in both of these contexts. Two-year colleges have been especially drawn to these programs, but four-year institutions are moving to them as well, because of the cost-savings they promise. Unfortunately, to a large extent, the programs have been written, and institutions are installing them, without attention to their instructional validity or adequacy.
Since the education software companies are moving so rapidly into what they perceive as a promising new market, a wider discussion of machine-scoring is vital if scholars hope to influence development and/or implementation of the programs being created. What is needed, then, is a critical resource to help teachers and administrators evaluate programs they might be considering, and to more fully envision the instructional consequences of adopting them. And this is the resource that Ericsson and Haswell are providing here.