🔎 Evaluation Projects

Several research projects have evaluated, or are currently evaluating, the RESQUE rating scheme:

📔 Etzel, F. T., Seyffert-Müller, A., Schönbrodt, F. D., Kreuzer, L., Gärtner, A., Knischewski, P., & Leising, D. (2025, May 15). Inter-Rater Reliability in Assessing the Methodological Quality of Research Papers in Psychology. https://doi.org/10.31234/osf.io/4w7rb_v2

Abstract

As many widely used research productivity metrics—such as h-indices and journal impact factors—have come under scrutiny for lacking validity, the demand for more viable alternatives is growing. This paper presents two empirical studies in which groups of raters (n1 = 3, n2 = 9) assessed the methodological rigor of research papers (k1 = 52, k2 = 110) using detailed catalogs of relatively well-defined quality criteria. The main endpoint in both studies was inter-rater reliability, which is a necessary prerequisite for any subsequent use of such assessments (e.g., as part of hiring or promotion procedures). Both studies showed that the application of some open science practices (e.g., open data, preregistration) may in fact be assessed with good reliability (Kappa > .60, ICCs > .75), even by raters who received little to no training, and within reasonable amounts of time (approximately 1 minute per criterion). When aggregated across criteria, inter-rater reliability for this type of assessment was good to excellent (Study 1: ICC(1,1) = .91, Study 2: ICC(1,1) = .74). A subsample of papers in Study 2, drawn randomly from the recent literature (2020-2022), indicated that typical papers in contemporary psychology continue to exhibit very low methodological rigor, a pattern that likely reflects the still-ongoing process of slowly implementing better infrastructure, incentives, and norms for rigor-enhancing practices. Study 2 also showed that criteria related to consensus-building do not yet exhibit sufficient reliability. Standard criterion sets for assessing the methodological rigor of empirical research should be used more widely in evaluating submissions to scientific journals, as well as published research (e.g., in evaluating the research productivity of individuals, groups or institutions). Such evaluations will also be facilitated by establishing clearer and more widely adopted reporting standards.

One key finding is that the overall Relative Rigor Score of RESQUE can be assessed with good to excellent inter-rater reliability (ICC(1,1) = .91) by student assistants.
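For readers unfamiliar with the coefficient reported above, the following is a minimal sketch of how ICC(1,1), the one-way random-effects, single-rater intraclass correlation (Shrout & Fleiss, 1979), can be computed from a fully crossed ratings matrix. The function name and the simulated data are purely illustrative and are not taken from RESQUE or the studies cited here.

```python
import numpy as np

def icc_1_1(ratings: np.ndarray) -> float:
    """ICC(1,1): one-way random effects, single rater (Shrout & Fleiss, 1979).

    `ratings` has shape (n_targets, k_raters); this sketch assumes a fully
    crossed design with no missing values.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)
    # Between-target mean square: spread of target means, scaled by k raters
    ms_between = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
    # Within-target mean square: rater disagreement around each target's mean
    ms_within = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical example: 52 papers rated by 3 raters (mirroring Study 1's design)
rng = np.random.default_rng(0)
true_rigor = rng.normal(size=(52, 1))                   # paper-level "signal"
ratings = true_rigor + 0.3 * rng.normal(size=(52, 3))   # rater noise
print(round(icc_1_1(ratings), 2))  # approaches 1 as raters agree more
```

In real analyses one would typically rely on a vetted implementation, such as pingouin.intraclass_corr in Python or the psych package in R, rather than a hand-rolled function.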

📔 Christoph Heller & Jakob Fink-Lamotte (in prep.) are testing the RESQUE v0.3 rating scheme in the field of clinical psychology.