10/29/2009
Your Personal Assignment Help for Your Essay
The literature relevant to automated scoring of essays arises from three disciplines: rhetoric and composition, computational linguistics, and educational measurement. Each of these disciplines takes a different perspective on the rise of automated scoring and how the future of education may be impacted by it.
As was illustrated in the preceding sections of this chapter, these three perspectives are almost always at odds with each other. Computational linguists see automated scoring of essays as a welcome, long-awaited evolutionary step.
Evidence accumulated in this study indicates that academicians in rhetoric and composition see any such use of the computer as a threat, both to their profession and to education at large. Those in the educational measurement profession see legitimacy in both views and must defend against both sides.
As a rule, the educational measurement community as a whole gains from almost any implementation of automated scoring of assessments, as such advances often move the construct-representation enterprise a quantum leap forward.
Moreover, a number of large organizations in the community have a vested interest in seeing such implementations succeed.
The media attention that computer-based tests have attracted in recent years has made this all too apparent to the public. Educational measurement must, therefore, fend off the inevitable attacks by purists in the rhetoric and composition camp.
However, also as a rule, the educational measurement community takes validity quite seriously.
As members of the community focus increasing attention on the validity of interpretations and uses of computer-generated essay scores, some computational linguists may come to view those in educational measurement as self-designated "hall monitors" whose aim is to slow down the remarkably rapid evolution of a technology that the linguists created and have painstakingly nurtured.
The intractability consequently shown by both groups seems destined to continue.
And while a resolution of the professional bitterness between the developers of this technology and its critics is not yet in sight, the challenge for educational measurement is clear: given the stated interpretations and uses, evaluate the validity of essay scores generated by computer. The balance of this study is designed to address that challenge.
The Automated Essay Scoring Systems for Writers
As mentioned earlier, this study was designed to evaluate, within the constraints of the methods presented, the validity of computer-generated essay scores as substitutes for scores assigned by raters, given the score interpretation and use specified by users of such scores.
The principal objective of the study was to establish whether evidence exists supporting a parallelism of automated scoring system computational processes with rater cognitive processes.
However, while a claim of construct equivalence of the two scoring processes hinges on this evidence, there are five other sources of evidence relevant to the validity claim overall. Ultimately, supportive evidence is required jointly from all six of these sources to substantiate a validity claim for the specified interpretations and uses of the scores.
First, an overview of the e-rater automated essay scoring system, the GRE Writing Assessment, and the study sample is provided. Then, the procedures followed are presented in six phases, paralleling the six aspects of Messick's (1995) unified construct validity concept.
Briefly, content relevance and representativeness of the e-rater models were gauged by the comprehensiveness with which the factor structure identified for each e-rater model appeared to represent the constructs of writing measured by the GRE Writing Assessment.
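As a rough illustration of the kind of analysis implied here, the sketch below fits an exploratory factor model to a matrix of essay-feature data. The feature names, the number of factors, and the data are hypothetical stand-ins, not the e-rater model or the study's actual results.

```python
# Hypothetical sketch: exploratory factor analysis of essay-feature data.
# Feature names and the two-factor solution are illustrative assumptions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
feature_names = ["discourse_units", "syntactic_variety", "topical_vocab",
                 "word_length", "essay_length", "error_rate"]
X = rng.normal(size=(500, len(feature_names)))  # stand-in for real essay features

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# Loadings indicate how strongly each feature is associated with each factor;
# in the study, such a structure would be compared against the writing
# constructs the GRE Writing Assessment is intended to measure.
for name, loadings in zip(feature_names, fa.components_.T):
    print(f"{name:18s} {loadings.round(2)}")
```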
The factors guided the construction of factor-specific scoring rubrics and corresponding factor-specific e-rater submodels. Reflectivity of the task and domain structures was appraised from expert reviews of the rubrics and submodels. Of particular importance were experts' judgments of the likelihood that the rubrics would prompt rater engagement in desired cognitive processes, not merely in counting, and the adequacy with which the e-rater submodels reflected the factors they subsumed.
The degree of rater engagement in substantive theories and process models was evidenced from the contents of "think-aloud" protocols transcribed from verbalized mock essay scoring sessions and from the strength of correlations of factor-specific with holistic scores of essays scored by raters and by e-rater.
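A minimal sketch of the correlational part of this evidence, assuming simple Pearson correlations between factor-specific scores and a holistic score, is shown below; the column names and score values are invented for illustration.

```python
# Illustrative only: correlate factor-specific scores with a holistic score.
import pandas as pd

# Hypothetical scores for a handful of essays (a 1-6 scale is assumed).
scores = pd.DataFrame({
    "holistic":     [4, 5, 3, 6, 2, 4, 5, 3],
    "organization": [4, 5, 3, 6, 2, 5, 5, 3],
    "development":  [3, 5, 4, 6, 2, 4, 4, 3],
    "language_use": [4, 4, 3, 5, 3, 4, 5, 2],
})

# Pearson correlations of each factor-specific score with the holistic score;
# strong correlations would suggest the factor scores tap the same construct.
print(scores.corr(method="pearson")["holistic"].drop("holistic").round(2))
```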
The degree of e-rater score convergent and discriminant correlations with external variables was evidenced from the magnitudes of correlations of e-rater scores generated across tasks within the GRE Writing Assessment program and of e-rater scores generated for essays written for a different essay test program.
Similarly, evidence of the generalizability and boundaries of score meaning was drawn from the magnitudes of correlations of generic-model e-rater scores with prompt-specific-model e-rater scores and with rater-assigned scores across all six prompts used in the study, as well as from the consistency e-rater scores exhibited when a key distributional assumption underlying the scoring models was changed.
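The generalizability analysis described above might look something like the following sketch, which tabulates per-prompt agreement between generic-model scores, prompt-specific-model scores, and rater scores; the prompt labels and simulated data are invented for illustration and carry no relation to the study's results.

```python
# Illustrative sketch: per-prompt agreement between generic-model scores,
# prompt-specific-model scores, and rater-assigned scores (simulated data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rows = []
for prompt in ["P1", "P2", "P3", "P4", "P5", "P6"]:        # six hypothetical prompts
    rater = rng.integers(1, 7, size=200)                    # simulated rater scores, 1-6
    generic = np.clip(rater + rng.normal(0, 0.7, 200), 1, 6)
    specific = np.clip(rater + rng.normal(0, 0.5, 200), 1, 6)
    df = pd.DataFrame({"rater": rater, "generic": generic, "specific": specific})
    rows.append({
        "prompt": prompt,
        "generic_vs_specific": round(df["generic"].corr(df["specific"]), 2),
        "generic_vs_rater": round(df["generic"].corr(df["rater"]), 2),
    })

print(pd.DataFrame(rows).to_string(index=False))
```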
Finally, consequences as validity evidence were elucidated by a stratified random survey of graduate program admissions decision-makers that identified actual and potential interpretations and uses of partially computer-generated essay scores.
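To illustrate what a stratified random sample of programs could look like in practice, here is a small sketch. The strata mirror the program groups mentioned later in this chapter (arts and sciences, social sciences, business), but the program lists and sample sizes are hypothetical.

```python
# Hypothetical sketch of stratified random sampling of graduate programs.
import random

random.seed(42)

# Strata follow the survey's program groups; the contents are invented.
strata = {
    "arts_and_sciences": [f"A&S program {i}" for i in range(1, 41)],
    "social_sciences":   [f"SocSci program {i}" for i in range(1, 31)],
    "business":          [f"Business program {i}" for i in range(1, 26)],
}

# Assumed per-stratum sample sizes, for illustration only.
sample_sizes = {"arts_and_sciences": 8, "social_sciences": 6, "business": 5}

sample = {name: random.sample(programs, sample_sizes[name])
          for name, programs in strata.items()}

for name, programs in sample.items():
    print(name, "->", programs[:3], "...")
```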
Respective Graduate Programs For Essay Help
The Implementation
Once the samples were selected and the respective graduate programs identified, each program was contacted by telephone and an attempt was made to reach a faculty member with admissions decision-making capacity within the program. In many cases, multiple telephone calls were required to locate and speak directly with an appropriate program admissions decision-maker.
Each participant in the survey was read an introduction to the study and an overview of the purpose of the telephone survey. An oral informed consent statement was also read, and each participant was asked to agree to it. Lastly, each participant was asked for permission to record the balance of the telephone call on audio tape for later transcription.
The participants were then asked the set of four questions pertaining to their perceptions of using essay scores generated partially by computer.
Participants from arts and sciences and social sciences programs were asked to speculate on how such scores would be interpreted and used if available, while the business program participants were asked to respond concretely, if possible, to how such scores are interpreted and used by their programs.
Specifically, each of these decision-makers was asked to respond to the following questions:
- Has your receipt of essays with scores that are based on both human scoring and computer scoring affected, as a matter of official policy at your institution, the way the essay scores are interpreted and used, such as the relative weight placed on the scores in the admissions process?
- Might this, in your opinion, unofficially affect the interpretation or use of the scores in the admissions process?
- How comfortable do you feel personally with interpreting and using these scores? Is it any different from using entirely human-generated scores?
- In light of how you use essay scores in admissions decisions, do you believe that using an average of one human-generated score and one computer-generated score, instead of an average of two rater-generated scores, might potentially create an unfairness to an applicant?
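The last question turns on how a human score and a machine score are combined. As a purely illustrative sketch, and not the GRE program's actual policy, one common arrangement averages the two scores unless they disagree by more than a set threshold, in which case the essay is routed to a second human rater.

```python
# Illustrative only: combining one human score with one machine score.
# The one-point adjudication threshold is an assumption, not GRE policy.
def combined_score(human: float, machine: float, threshold: float = 1.0):
    """Average the two scores, or flag the essay for human adjudication
    when they disagree by more than `threshold` points."""
    if abs(human - machine) > threshold:
        return None          # send to a second human rater instead
    return (human + machine) / 2

print(combined_score(4, 5))  # 4.5 -- scores agree closely enough
print(combined_score(3, 5))  # None -- discrepancy triggers adjudication
```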
Methods for Professional Dissertation Proofreading and Editing
The researcher reasoned in advance that the fairly commonplace term, "characteristic of writing," might be more familiar to the raters than the more technical term, "factor," and so the former was used in all discussions.
During this initial group interview as well as throughout the day, each rater was encouraged to express any difficulties she encountered due to the phrasing of a rubric, the contrast of a factor-specific focus with the holistic scoring paradigm to which she was accustomed, or any other circumstance.
To examine in greater detail whether the raters were following processes that focused implicitly on the features analyzed by e-rater, the researcher asked each rater to think out loud while scoring essays, applying each rubric in turn.
Intermittently, she was stopped by the researcher and asked to report retrospectively on which cognitive processes she believed she had engaged in while performing the immediately preceding scoring task. The ETS expert in think-aloud procedures examined the set of rater instructions and the setting in which the think-aloud sessions would be conducted. He also observed portions of the sessions on the first day.
Next, each rater participated in what Ericsson and Simon (1993) labeled a "social verbalization" exercise, in which the rater scores essays interactively with the researcher and, in a second step, with another rater as well. Social verbalization differs from think-aloud with retrospective reporting in that social verbalization permits the participant to offer more complete explanations of thoughts as they occur, rather than mere utterances of singular, perhaps disjointed thoughts.
Each mock scoring session concluded with a debriefing session, preceded by a short normalization period of silent, independent scoring intended to get each rater "into the groove" prior to the debriefing.
In the debriefing session, the raters were asked to revisit their first impressions of the rubrics, recount their experiences with using each rubric that day, and render a summative, "last impression" of each rubric.