Reviewer, essay, and reviewing-process characteristics that predict errors in web-based peer review
Introduction
Peer review is defined as an educational activity in which students assess the quality of work produced by other students of similar status. Students of similar status are often students enrolled in the same class or program, who share the context but are not yet experts in the content being reviewed. Although peer review is sometimes also referred to as peer assessment, we use peer review throughout this paper to focus on the reviewing activity and its characteristics. Peer review has been widely used in both K-12 and higher education across different disciplines, and increasingly in web-based form (Li et al., 2016; Sanchez et al., 2017; Topping, 1998). It has been widely used for formative purposes to guide student learning (Sanchez et al., 2017; Topping, 1998) and for summative purposes to give instructors and students summative information (Patchan, Schunn, & Clark, 2017; Suen, 2013). For example, Cho and Schunn (2007) showed how a web-based peer review system can be used by students to effectively revise papers. Peer review is popular both for the logistical benefit of reducing instructors' grading burden and for its pedagogical benefits: disciplinary content learning (Sadler & Good, 2006), promotion of cognitive and metacognitive skills (Topping, 1998), and enhancement of social relationships and trust in a learning community (van Gennip, Segers, & Tillema, 2009, 2010).
A common observation motivating the current work is that students and instructors are reluctant to rely on peer-provided feedback or grades for formative or summative purposes, due to concerns about the accuracy and usefulness of peer feedback (e.g., Kaufman & Schunn, 2011). Indeed, there are a number of reasons for concern about using peer reviews even in formative feedback situations: 1) students tend not to revise when they receive very high ratings (Patchan et al., 2016); 2) harsh ratings can lead to negative self-evaluations, which can in turn produce avoidant behaviors (Elizondo-Garcia et al., 2019); and 3) several online systems hold students accountable for rating accuracy as a way to pressure them to take reviewing tasks seriously (Patchan et al., 2017).
To address this concern, a large portion of peer review research, especially at the higher education level, has focused on the reliability and validity of ratings (e.g., Chang et al., 2011; Cho et al., 2006; Falchikov & Goldfinch, 2000; Hovardas et al., 2014; Li et al., 2016; Luo et al., 2014; Preston & Colman, 2000; Tsivitanidou et al., 2011). Interestingly, reliability and validity of peer ratings were generally found to be at acceptable levels in those studies. A recent meta-analysis found a high average correlation of 0.63 between peer and instructor ratings (Li et al., 2016). An earlier study of 16 different higher education courses reported inter-rater reliability among peer raters that was generally medium to high, ranging from 0.45 to 0.88 (Cho et al., 2006). However, a few other studies have found lower levels of peer reliability and validity (e.g., Chang et al., 2011; Tsivitanidou et al., 2011). The varied levels of reliability and validity may be related to how the peer review was carried out (Patchan et al., 2017; Schunn et al., 2016).
This concern about reliability and validity holds across the wide range of contexts in which peer review is applied, including assignments/tasks/artifacts of different forms (e.g., oral presentations, written documents and reports, programming code, or design products) and different subject disciplines. Although some researchers hypothesized that peer reviews in science/engineering disciplines may be more accurate than those in social science/arts, meta-analyses showed no significant difference between the two in terms of peer review validity as measured by the correlation between peer and expert ratings (Falchikov & Goldfinch, 2000; Li et al., 2016). The factors most strongly associated with peer review accuracy concerned how the peer review activities were carried out, e.g., peer raters' understanding of the rating rubrics and whether peer reviewers and reviewees were matched at random (Li et al., 2016).
Most importantly here, even a high overall accuracy of peer review results (e.g., a correlation of 0.7 between peer and expert ratings) can mean that a non-trivial number of documents have received inaccurate grades, so it is important to develop indicators of which documents are likely to have been incorrectly graded. With such information, new systems could be developed that automatically discount certain ratings, assign documents to additional reviewers, or flag reviews requiring further evaluation by instructors or teaching assistants.
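As a minimal sketch of how such flagging might work, the following Python fragment marks ratings that deviate strongly from the peer median so a system could route the document for extra review. The function name and the deviation threshold are hypothetical illustrations, not part of any system described in this paper:

```python
import statistics

def flag_outlier_ratings(ratings, threshold=1.5):
    """Return one flag per rating: True if the rating deviates from the
    peer median by more than `threshold` points (hypothetical heuristic;
    real systems would combine richer signals)."""
    med = statistics.median(ratings)
    return [abs(r - med) > threshold for r in ratings]

# A document rated [4, 5, 4, 1] by four peers: only the rating of 1
# deviates from the median (4) by more than 1.5 points.
print(flag_outlier_ratings([4, 5, 4, 1]))  # [False, False, False, True]
```

A flagged rating need not be wrong; it simply warrants a second look by an instructor or an additional reviewer.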
To date, peer review reliability and validity issues have been well documented within what we term the macro-level lens. Peer review accuracy at the macro level is defined as measurement of accuracy at the level of the assignment or higher, e.g., peer review accuracy in a course measured by the correlation between peer ratings and instructor ratings. In such cases, a single statistic (e.g., Pearson's r) represents accuracy for a course involving many peer reviewers/reviewees participating in a peer review activity. Studies addressing peer review accuracy at the macro level usually tackle two types of research questions: 1) what the overall reliability/validity is for peer review implemented in a particular context (e.g., descriptive studies: Chang et al., 2011); and 2) what affects overall reliability/validity across different peer assessment implementations (e.g., course content, course level, face-to-face vs. online reviewing, assignment type, rater training, rubric explicitness: Falchikov & Goldfinch, 2000). The second type usually consists of meta-analyses synthesizing many studies of the first type (e.g., Sanchez et al., 2017). Despite wide variation in methods across those studies, the common feature is that peer review accuracy is treated as a property of the whole peer review activity, rather than analyzing variation in accuracy at the level of the individual reviewer, reviewee, or review.
While research through a macro-level lens can help system designers and instructors arrange for higher overall validity of scores in courses implementing peer review, some individual documents within a peer review task will inevitably still be incorrectly scored, and instructors will seek support in addressing those mis-scored documents. It is therefore important to investigate what factors are associated with the accuracy of reviews of a specific document provided by a specific reviewer. This individual level of accuracy, which we term peer review accuracy at the micro level, has rarely been studied. Peer review accuracy at the micro level is defined as measurement of accuracy at the level of the individual review (or reviewer or document). A study using a micro-level lens focuses predominantly on features that vary at the micro level (e.g., characteristics of the document being evaluated, the reviewer, or the review itself). A macro-level study might consider averages of those same features (e.g., the general characteristics of the pool of documents or reviewers), but it will also consider features that can only vary at the larger grain size (e.g., features and general parameters of the assignment, level or discipline of the course, or overall class/school climate).
Some recent studies that have reported peer review results through a micro-level lens mainly focused on cognitive aspects of the peer review process, e.g., examining which characteristics of peer feedback (e.g., directive/non-directive, global/local, and the presence of solutions or explanations) were associated with student acceptance of the feedback, implementation of the feedback in revisions, and quality of the revisions (e.g., Cho & MacArthur, 2010; Gao et al., 2019; Patchan & Schunn, 2016; Patchan et al., 2016; Saeed & Ghazali, 2017; Wu, Petit, & Chen, 2020), or on qualitative differences between peer and expert comments (e.g., Wu, Petit, & Chen, 2015). These studies revealed several consistent findings. Specifically, several characteristics of peer feedback were associated with improved higher-level revision, e.g., non-directive comments, global-level comments, and the presence of solutions, explanations, hedges, and mitigating praise in the comments (Cho & MacArthur, 2010; Saeed & Ghazali, 2017; Wu, Petit, & Chen, 2020).
However, prior studies including micro-level measures mainly focused on cognitive or qualitative aspects of peer review comments rather than investigating the quantitative discrepancy between peer and expert reviews, i.e., peer review errors. We found only one recent study that tackled peer review accuracy through a micro-level lens, examining the occurrence of lenient and severe errors (a micro-level measure) (Liu et al., 2019). Interestingly, that study investigated how a macro-level contextual variable (peer review requirement: compulsory vs. voluntary) affected micro-level lenient and severe errors in peer ratings.
Only by further investigating the accuracy attached to individual peer reviews (particularly the process by which peers interact with documents during peer review) can we uncover the factors associated with those errors and eventually build interventions that efficiently address the errors that do occur and further improve the accuracy of all individual reviews. After all, the goal of formative assessment in education is to provide fair assessment to each individual student.
Section snippets
Review accuracy and errors at the micro-level
The aim of this study is to examine peer review errors, the inverse of peer review accuracy. In peer review contexts, accuracy normally refers to the agreement between peer reviews and expert reviews (AERA, APA, & NCME, 2014; Cho et al., 2006; Li et al., 2016). Therefore, review error, defined here as the discrepancy between peer reviews and expert reviews, is a threat to validity when expert reviews are treated as the "gold standard" (AERA et al., 2014). By contrast, reliability-related
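Under this definition, a signed error can be computed for each individual review, with positive values indicating leniency and negative values severity. The sketch below assumes peer and expert ratings on a common numeric scale; the tolerance band is an illustrative assumption, not the cutoff used in the study:

```python
def review_error(peer_rating, expert_rating):
    """Signed discrepancy between a peer rating and the expert
    ("gold standard") rating: positive = lenient, negative = severe."""
    return peer_rating - expert_rating

def classify_review(peer_rating, expert_rating, tolerance=0.5):
    """Label one review as lenient, severe, or accurate, using a
    hypothetical tolerance band around the expert rating."""
    err = review_error(peer_rating, expert_rating)
    if err > tolerance:
        return "lenient"
    if err < -tolerance:
        return "severe"
    return "accurate"

print(classify_review(6, 4))  # lenient
print(classify_review(3, 5))  # severe
```

Because the error is computed per review, it can be aggregated by reviewer, by document, or by course, linking the micro-level and macro-level lenses discussed above.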
Participants
Participants included in the study were 818 students from ten different secondary schools. They were taking the same Advanced Placement course in writing (AP Language and Composition) taught by ten different teachers, each teaching between two and five different sections of this course at their school. This course has the largest annual enrollment among AP courses, with over 500,000 students enrolled in the course each year. Among the ten schools, four were Title I schools (high rates of
What characteristics predict web-based peer review errors?
Significant Predictors of Overall Errors. Table 4 presents the correlations among the predictors. Most correlations between the predictors were very small. There were only a couple of exceptions: review disagreement had a small negative correlation with essay quality; comment length had a small positive correlation with reviewer ability, a small negative correlation with speeded review, and a medium correlation with average sentence length. Variance inflation factor (VIF) was also checked to
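For readers unfamiliar with this multicollinearity check, the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on all the others; values near 1 indicate little collinearity. A self-contained sketch on synthetic data (illustrative only, not the study's analysis code):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # three independent predictors
X_bad = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=200)])
print([round(v, 2) for v in vif(X)])     # all near 1: little collinearity
print(vif(X_bad)[0] > 10)               # True: column 0 is nearly duplicated
```

A common rule of thumb treats VIF values above 5 or 10 as signaling problematic collinearity among predictors.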
Revisiting the framework of review errors
Most prior studies of overall reliability and validity of peer reviews have found acceptable levels of both reliability and validity in general. The present study went beyond this general level by separately examining the two types of review errors (i.e., leniency and severity) and possible factors that related to these two types of review errors so that more can be done to address errors that inevitably come for some students, even when overall reliability and validity is acceptable. Five
Author credit statement
Yao Xiong: Conceptualization, Data curation, Formal analysis, Visualization, Writing - original draft, Writing - review & editing. Christian D. Schunn: Data acquisition, Conceptualization, Formal analysis, Visualization, Writing - original draft, Writing - review & editing.
Acknowledgement
The current project was partially funded by grant R305A120370 from the Institute of Education Sciences.
References (55)
- The impact of peer solution quality on peer-feedback provision on geometry proofs: Evidence from eye-movement analysis. Learning and Instruction (2018).
- Reliability and validity of web-based portfolio peer assessment: A case study for a senior high school's students taking computer course. Computers & Education (2011).
- Student revision with peer and expert reviewing. Learning and Instruction (2010).
- Scaffolded writing and rewriting in the discipline: A web-based reciprocal peer review system. Computers & Education (2007).
- Peer assessment for learning from a social perspective: The influence of interpersonal variables and structural features. Educational Research Review (2009).
- Peer assessment as a collaborative learning activity: The role of interpersonal variables and conceptions. Learning and Instruction (2010).
- Peer versus expert feedback: An investigation of the quality of peer feedback among secondary school students. Computers & Education (2014).
- Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica (2000).
- Automatic detection of inconsistencies between numerical scores and textual feedback in peer-assessment processes with machine learning. Computers & Education (2019).
- Psychology and neurobiology of simple decisions. Trends in Neurosciences (2004).