Reviewer, essay, and reviewing-process characteristics that predict errors in web-based peer review

https://doi.org/10.1016/j.compedu.2021.104146

Highlights

  • Reviewer, essay, and reviewing process characteristics that predict errors in web-based peer review.

  • Review comment length predicted both severe and lenient errors but in opposite directions.

  • Review disagreement predicted increases in severe errors.

  • Reviewer ability predicted a decrease in severe errors.

Abstract

Accuracy of peer review continues to be a concern for instructors implementing computer-supported peer review in their instructional practices. A large body of literature has descriptively documented overall levels of reliability and validity of peer review and which factors across different peer review implementations impact that overall reliability and validity (e.g., use of rubrics, education level, training). However, few studies have examined what factors within a peer review implementation contribute to the accuracy of individual reviews, and knowledge about these factors could shape new interventions to avoid or remediate errors in particular reviews. In the current study, we tested a three-level framework (reviewer, essay, and reviewing process) for predicting the location of peer review errors. Further, we examined which factors within each level are predictive of two different types of review errors: severity and leniency. Leveraging a large dataset from an Advanced Placement English Language and Composition course implementing a common assignment with web-based peer review across 10 high schools, we found support for all levels in the framework and for the importance of separating severity and leniency errors. Review comment length predicted both severe and lenient errors but in opposite directions: longer comments were more likely to be associated with severe errors and less likely to be associated with lenient errors. Review disagreement, reviewer ability, and average sentence length of comments predicted severe errors, and essay quality predicted lenient errors. Implications for the development of new web-based tools for supporting peer review are discussed.

Introduction

Peer review is defined as an educational activity in which students assess the quality of work produced by other students of similar status. Students of similar status are often students enrolled in the same class or program, who share the same context but are not yet experts in the content being reviewed. Although peer review is sometimes also referred to as peer assessment, we use peer review throughout this paper to focus on the reviewing activity and its characteristics. Peer review has been widely used in both K-12 and higher education across different disciplines, and now increasingly in web-based form (Li et al., 2016; Sanchez et al., 2017; Topping, 1998). It has been used for formative assessment purposes to guide student learning (Sanchez et al., 2017; Topping, 1998) and for summative purposes to provide instructors and students with summative information (Patchan, Schunn, & Clark, 2017; Suen, 2013). For example, Cho and Schunn (2007) showed how a web-based peer review system can be used by students to effectively revise papers. Peer review is popular both for the logistical benefit of reducing instructors' grading burden and for its pedagogical benefits: supporting disciplinary content learning (Sadler & Good, 2006), promoting cognitive and metacognitive skills (Topping, 1998), and enhancing social relationships and trust in a learning community (van Gennip, Segers, & Tillema, 2009, 2010).

A common observation that motivates the current work is that students and instructors are reluctant to rely on peer-provided feedback or grades for formative or summative purposes due to concerns about the accuracy and usefulness of peer feedback (e.g., Kaufman & Schunn, 2011). Indeed, there are a number of reasons to be concerned about using peer reviews even in formative feedback situations: 1) students tend not to revise when they receive very high ratings (Patchan et al., 2016); 2) harsh ratings can lead to negative self-evaluations, which can then produce avoidant behaviors (Elizondo-Garcia et al., 2019); and 3) several online systems hold students accountable for rating accuracy as a way of pressuring them to take the reviewing task seriously (Patchan et al., 2017).

To address the concern about peer review, a large portion of peer review research, especially at the higher education level, has focused on the reliability and validity of ratings (e.g., Chang et al., 2011; Cho et al., 2006; Falchikov & Goldfinch, 2000; Hovardas et al., 2014; Li et al., 2016; Luo et al., 2014; Preston & Colman, 2000; Tsivitanidou et al., 2011). Interestingly, reliability and validity of peer ratings were generally found to be at acceptable levels among those studies. A recent meta-analysis found a high average correlation of 0.63 between peer and instructor ratings (Li et al., 2016). An earlier study of 16 different higher education courses reported inter-rater reliability among peer raters that was generally medium to high, ranging from 0.45 to 0.88 (Cho et al., 2006). However, a few other studies have found lower levels of peer reliability and validity (e.g., Chang et al., 2011; Tsivitanidou et al., 2011). The varied levels of reliability and validity may be related to how the peer review was carried out (Patchan et al., 2017; Schunn et al., 2016).

This concern about reliability and validity holds across the wide range of contexts in which peer review is applied, such as assignments/tasks/artifacts of different forms (e.g., oral presentations, written documents and reports, programming code, or design products) and different subject disciplines. While some researchers hypothesized that peer reviews in science/engineering disciplines may be more accurate than those in social science/arts, meta-analyses showed no significant difference between science/engineering and social science/arts in peer review validity measured by the correlation between peer and expert ratings (Falchikov & Goldfinch, 2000; Li et al., 2016). The factors most strongly associated with peer review accuracy concerned how the peer review activities were carried out, e.g., peer raters' understanding of the rating rubrics and whether peer reviewers and reviewees were matched at random (Li et al., 2016).

Most importantly here, even a high overall accuracy of peer review results (e.g., a correlation of 0.7 between peer and expert ratings) can mean that a non-trivial number of documents have received inaccurate grades, so it is important to develop indicators of which documents are likely to have been incorrectly graded. With such information, new systems could be developed that automatically discount certain ratings, assign documents to additional reviewers, or flag reviews requiring further evaluation by instructors or teaching assistants.
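
To make this concrete, consider a minimal simulation sketch (ours, not from the paper; it assumes normally distributed ratings rounded onto a 1-7 rubric scale): even when the peer-expert correlation is 0.7, a noticeable share of documents ends up scored well off the expert mark.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 10_000, 0.7  # hypothetical number of reviews; target peer-expert correlation

# Expert scores, and peer scores constructed to correlate with them at roughly r.
expert = rng.normal(size=n)
peer = r * expert + np.sqrt(1 - r**2) * rng.normal(size=n)

# Map both onto a 1-7 rubric (mean 4, SD 1) and round to whole points.
expert_pts = np.clip(np.round(expert + 4), 1, 7)
peer_pts = np.clip(np.round(peer + 4), 1, 7)

# Share of reviews that miss the expert rating by 2 or more rubric points.
off_by_two = np.abs(peer_pts - expert_pts) >= 2
print(f"correlation ~ {np.corrcoef(peer_pts, expert_pts)[0, 1]:.2f}, "
      f"off by >= 2 points: {off_by_two.mean():.1%}")
```

Under these toy assumptions, roughly 5% of reviews miss the expert score by two or more rubric points, i.e., several hundred documents out of 10,000, despite the seemingly high overall correlation.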

To date, peer review reliability and validity issues have been well documented within what we term the macro-level lens. Peer review accuracy at the macro level is defined as the measurement of peer review accuracy at the level of the assignment or higher, e.g., peer review accuracy in a course measured by the correlation between peer ratings and instructor ratings. In those cases, one statistic (e.g., Pearson's r) can represent accuracy for a course in which many peer reviewers/reviewees participate in a peer review activity. Studies addressing peer review accuracy at the macro level usually tackle two types of research questions: 1) what the overall reliability/validity is for peer review implemented in a certain context (e.g., descriptive studies: Chang et al., 2011); and 2) what affects the overall reliability/validity across different peer assessment implementations (e.g., course content, course level, face-to-face vs. online reviewing, assignment type, rater training, rubric explicitness: Falchikov & Goldfinch, 2000). The second type of research usually consists of meta-analyses synthesizing many studies of the first type (e.g., Sanchez et al., 2017). Despite wide variation in methods across those studies, the common feature is that peer review accuracy is treated as a property of the whole peer review activity, rather than analyzing variation in accuracy at the level of the individual reviewer, reviewee, or review.

While research through a macro-level lens can help system designers and instructors arrange for higher overall validity of scores in courses implementing peer review, some individual documents within a peer review task will inevitably still be incorrectly scored, and instructors will seek support in addressing those mis-scored documents. It is, therefore, important to investigate what factors are associated with the accuracy of reviews of a specific document provided by a specific reviewer. This individual-review level of accuracy, which we term peer review accuracy at the micro level, has rarely been studied. Peer review accuracy at the micro level is defined as the measurement of peer review accuracy at the individual review (or reviewer or document) level. A study involving peer review accuracy measured through a micro-level lens focuses predominantly on features that vary at the micro level (e.g., the characteristics of the document being evaluated, the reviewer, or the review itself). A study at the macro level might consider averages of those same features (e.g., the general characteristics of the pool of documents or reviewers), but it will also consider features that can only vary at the larger grain size (e.g., features and general parameters of the assignment, level or discipline of the course, overall class/school climate).
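
The distinction can be made concrete with a small sketch (a hypothetical illustration, not the paper's analysis code): the macro-level lens collapses a whole course into a single peer-expert correlation, whereas the micro-level lens retains one signed error per individual review, which can then be modeled against reviewer, essay, and reviewing-process characteristics.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ratings for one course: one expert rating and one peer rating
# per document, on the same rubric scale.
expert = np.array([5, 3, 6, 4, 2, 7, 5, 4])
peer = np.array([5, 4, 5, 6, 2, 6, 3, 4])

# Macro-level lens: a single number summarizing the whole peer review activity.
course_accuracy, _ = pearsonr(peer, expert)

# Micro-level lens: one signed error per review (positive = rated above the
# expert, negative = rated below), one value for every reviewer-document pair.
review_errors = peer - expert

print(f"macro-level accuracy (r): {course_accuracy:.2f}")
print(f"micro-level errors per review: {review_errors}")
```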

Some recent studies that have reported peer review results through a micro-level lens mainly focused on the cognitive aspects of the peer review process, e.g., examining what characteristics of peer feedback (e.g., directive/non-directive, global/local, and presence of a solution or explanation in the feedback) were associated with students' acceptance of the feedback, implementation of the feedback in their revisions, and quality of their revisions (e.g., Cho & MacArthur, 2010; Gao et al., 2019; Patchan & Schunn, 2016; Patchan et al., 2016; Saeed & Ghazali, 2017; Wu, Petit, & Chen, 2020), or focused on the qualitative differences between peer and expert comments (e.g., Wu, Petit, & Chen, 2015). Many interesting findings were revealed. Specifically, several characteristics of peer feedback were associated with improved higher-level revision, e.g., non-directive comments, global-level comments, presence of a solution, explanation and hedges in the comments, and mitigating praise in the comments (Cho & MacArthur, 2010; Saeed & Ghazali, 2017; Wu, Petit, & Chen, 2020).

However, prior studies including micro-level measures mainly focused on cognitive or qualitative aspects of peer review comments, rather than investigating the quantitative discrepancy between peer and expert reviews, i.e., peer review errors. We found only one recent study that tackled peer review accuracy through a micro-level lens, examining the occurrence of lenient and severe errors (a micro-level measure) (Liu et al., 2019). Interestingly, this study investigated how a macro-level contextual variable, the peer review requirement (compulsory vs. voluntary), affected micro-level lenient and severe errors in peer ratings.
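
Following that framing, lenient and severe errors can be read directly off the signed peer-expert discrepancy. A minimal sketch is below; the one-point threshold is hypothetical, since the exact operationalization used in these studies is not reproduced here.

```python
import numpy as np

def classify_review_errors(peer, expert, threshold=1.0):
    """Label each review as 'lenient', 'severe', or 'accurate' based on how far
    the peer rating falls above or below the expert rating.
    `threshold` is a hypothetical tolerance in rubric points."""
    diff = np.asarray(peer, dtype=float) - np.asarray(expert, dtype=float)
    labels = np.full(diff.shape, "accurate", dtype=object)
    labels[diff > threshold] = "lenient"   # peer rated the essay too generously
    labels[diff < -threshold] = "severe"   # peer rated the essay too harshly
    return labels

# Example: peer vs. expert ratings on the same rubric scale.
print(classify_review_errors(peer=[6, 2, 4, 7], expert=[4, 5, 4, 6]))
# -> ['lenient' 'severe' 'accurate' 'accurate']
```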

By further investigating the accuracy attached to individual peer reviews (particularly the process by which peers interact with documents during peer review), we can uncover the factors that are associated with those errors and eventually build interventions to efficiently address errors that do occur and further improve the accuracy of all individual reviews. After all, the goal of formative assessment in education is to provide fair assessment to each individual student.

Section snippets

Review accuracy and errors at the micro-level

The aim of this study is to examine peer review errors, the opposite of peer review accuracy. In peer review contexts, accuracy normally refers to the agreement between peer reviews and expert reviews (AERA, APA, & NCME, 2014; Cho et al., 2006; Li et al., 2016). Therefore, review error, defined here as the discrepancy between peer reviews and expert reviews, is a threat to validity in which expert reviews are treated as the “gold standard” (AERA et al., 2014). By contrast, reliability-related

Participants

Participants included in the study were 818 students from ten different secondary schools. They were taking the same Advanced Placement course in writing (AP Language and Composition) taught by ten different teachers, each teaching between two and five different sections of this course at their school. This course has the largest annual enrollment among AP courses, with over 500,000 students enrolled in the course each year. Among the ten schools, four were Title I schools (high rates of

What characteristics predict web-based peer review errors?

Significant Predictors of Overall Errors. Table 4 presents the correlations among the predictors. Most correlations between the predictors were very small. There were only a couple of exceptions: review disagreement has a small negative correlation with essay quality; comment length has a small positive correlation with reviewer ability, a small negative correlation with speeded review, and a medium correlation with average sentence length. Variance inflation factor (VIF) was also checked to
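
The snippet above mentions a variance inflation factor check; a minimal sketch of such a check (assuming the predictors sit in a pandas DataFrame with hypothetical column names mirroring the predictors discussed, filled here with random placeholder values) could look like this:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# Hypothetical predictor table; random placeholders stand in for the real data.
rng = np.random.default_rng(1)
n = 200
predictors = pd.DataFrame({
    "review_disagreement": rng.normal(size=n),
    "reviewer_ability": rng.normal(size=n),
    "comment_length": rng.normal(size=n),
    "avg_sentence_length": rng.normal(size=n),
    "essay_quality": rng.normal(size=n),
})

# VIF is computed for each predictor against the others; values well below the
# usual cutoffs (5 or 10) suggest multicollinearity is not a concern.
X = add_constant(predictors)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=predictors.columns,
)
print(vif.round(2))
```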

Revisiting the framework of review errors

Most prior studies of overall reliability and validity of peer reviews have found acceptable levels of both reliability and validity in general. The present study went beyond this general level by separately examining the two types of review errors (i.e., leniency and severity) and possible factors that related to these two types of review errors so that more can be done to address errors that inevitably come for some students, even when overall reliability and validity is acceptable. Five

Author credit statement

Yao Xiong: Conceptualization, Data curation, Formal analysis, Visualization, Original drafting, Reviewing and editing; Christian D. Schunn: Data acquisition, Conceptualization, Formal analysis, Visualization, Original drafting, Reviewing and editing.

Acknowledgement

The current project was partially funded by grant R305A120370 from the Institute of Education Sciences.

References (55)

  • J.-W. Strijbos et al.

    Peer feedback content and sender's competence level in academic writing revision tasks: Are they critical for feedback perceptions and efficiency?

    Learning and Instruction

    (2010)
  • O.E. Tsivitanidou et al.

    Investigating secondary school students' unmediated peer assessment skills

    Learning and Instruction

    (2011)
  • AERA et al.

    Standards for educational and psychological testing

    (2014)
  • L. de Alfaro et al.

    Dynamics of peer grading: An empirical study

  • D. Bates et al.

    Fitting linear mixed-effects models using lme4

    Journal of Statistical Software

    (2015)
  • C.B. Begg et al.

    Calculation of polychotomous logistic regression parameters using individualized regressions

    Biometrika

    (1984)
  • S.F. Beers et al.

    Syntactic complexity as a predictor of adolescent writing quality: Which measures? Which genre?

    Reading and Writing

    (2009)
  • K. Cho et al.

    Validity and reliability of scaffolded peer assessment of writing from instructor and student perspectives

    Journal of Educational Psychology

    (2006)
  • J. Cohen

    A power primer

    Psychological Bulletin

    (1992)
  • J. Elizondo-Garcia et al.

    Quality of peer feedback in relation to instructional design: A comparative study in energy and sustainability MOOCs

    International Journal of Instruction

    (2019)
  • N. Falchikov et al.

    Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks

    Review of Educational Research

    (2000)
  • Y. Gao et al.

    The alignment of written peer feedback with draft problems and its impact on revision in peer assessment

    Assessment & Evaluation in Higher Education

    (2019)
  • S. Huang

    When peers are not peers and don't know it: The Dunning-Kruger effect and self-fulfilling prophecy in peer-review

    BioEssays

    (2013)
  • B. Huisman et al.

    Peer assessment in MOOCs: The relationship between peer reviewers' ability and authors' essay performance

    British Journal of Educational Technology

    (2018)
  • J.H. Kaufman et al.

    Students’ perceptions about peer assessment for writing: Their origin and impact on revision work

    Instructional Science

    (2011)
  • Institute of Education Sciences

    National assessment of Title I final report: Summary of key findings

    (2007)
  • J. Kruger et al.

    Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments

    Journal of Personality and Social Psychology

    (1999)