Reviewing others’ work for the purpose of scoring it does not advance science. Scoring work does not help authors improve it. Scoring does not help a work’s audience understand the work, identify its limitations, or evaluate its credibility.
Scoring does, however, undermine our objectivity as peer reviewers because scoring activates our biases. We are predisposed to like works that are familiar in approach, language, and style to our own work; we trust results more easily when they confirm our existing beliefs and hopes. We cannot help but be influenced by those biases when quantifying how much a work deserves to be trusted, recognized, or otherwise valued. Biases introduced through scoring inevitably propagate to our written feedback. We may try to be objective, such as by ensuring every word we write is backed by facts, but we can only report a small fraction of the facts we observe. Facts that confirm the fairness of our scoring intentions are likely to appear more salient than facts that contradict our fairness.
Despite the hazards of scoring, the principal function of the most widely available form of peer review, publication review, is to score works as accepts or rejects. Different journals and conferences may use different criteria to reduce works to binary scores, but they must ultimately choose between publishing a work or refusing to do so.1
The scant evidence available certainly does not suggest that we assign accepts and rejects objectively. In experiments by the NeurIPS conference in 2014 and again in 2021, half of the works selected for acceptance by one set of peer reviewers were rejected by parallel sets of reviewers.2 It’s fashionable to refer to this inconsistency as randomness, but doing so presumes that our reviewing errors are unbiased and that their impact diminishes over the span of a career.3 Biases can grow, becoming systemic. For example, a bias against underrepresented researchers, or against those willing to challenge popular ideas, would reduce the likelihood that young scientists in these categories survive long enough in their careers to become reviewers themselves.4
Scoring is surely not the only source of bias when we review others’ work, but it is a source we can eliminate. If we will never assign a score, we should have no temptation to bias our feedback to justify a score.
Moreover, scoreless peer review can serve the two scientific functions we most often ascribe to peer review: to help the work’s authors improve their research work, and to help the audience of a research work understand it and evaluate its credibility.5 I will refer to these functions as author-assistive and audience-assistive review.
When we scrutinize our students’ and colleagues’ research work to catch errors, offer clarifications, and suggest other ways to improve their work, we are informally conducting author-assistive peer review. Author-assistive review is almost always *scoreless*, as scores serve no purpose even for work being prepared for publication review.
Alas, the social norm of offering author-assistive review only to those close to us, and reviewing most everyone else’s work through publication review, exacerbates the disadvantages faced by underrepresented groups and other outsiders. A field’s insiders have access to other insiders to provide constructive author-assistive feedback. Those whose only way to seek feedback is to submit work for publication review may wait months longer, often only to receive feedback that is written to justify rejection.6 The problem is not that we make ourselves available to provide author-assistive review to students and friends. Rather, the problem is that, for everyone else, we make ourselves more available to cast judgment on their work than to help them improve it.
We can address those unintended harms by making ourselves at least as available for scoreless author-assistive peer review as we are for publication review.7
We can also conduct scoreless peer review to help audiences understand research work and evaluate its credibility.8 And we should, because publication review is poorly suited to this function. Publications’ acceptance criteria include factors such as ‘novelty’ and ‘significance’ that are not only unrelated to credibility, but subvert it by rewarding unscientific practices and chance results.9 Publication review not only amplifies our biases, but it buries the evidence of bias in confidential communications and unpublished feedback, allowing those biases to hide and fester.10 Publication review slows the pace of scientific discourse because reviewers must collectively decide on a final outcome, delaying feedback until reviewers can reach agreement. To agree, reviewers must also review and discuss other works competing for a limited number of accept scores. Publication review further delays the release of credible research by rejecting it for reasons unrelated to its credibility. Publication review rarely detects outright fraud, but it often legitimizes fraudulent work.11 It also leads audiences to overlook limitations in research works because we have trained the public that publication is a signal of a work’s credibility—an implicit endorsement.
Scoreless audience-assistive peer review takes an entirely different approach, assuming work will reach an audience and enlisting reviewers to help that audience understand the work and any issues that might impact its credibility.12 When reviewing, we can provide nuanced feedback to help audiences differentiate extremely meticulous findings from those that are somewhat credible, those that are glaringly wrong, and everything in between.13 Our feedback can help audiences develop the skills to identify strengths and weaknesses of research results on their own. We can provide feedback quickly, without comparing the work to others’ work and without coordinating with others to choose an outcome. And, while we may still harbor biases that affect the contents of our reviews, sharing that content with the work’s audience exposes those biases for others to identify, refute, and even study.
As authors, writing for scoreless peer review empowers us to write for our audience, rather than to appeal to reviewers, as it is our audience who will ultimately judge our work.14
We should be as open to helping the audience of a work understand it as we are to deciding whether that audience will find the work in an exclusive publication or on a preprint server. We should each make ourselves at least as accessible for scoreless audience-assistive peer review as we are for publication review.
How You Can Assist (Right Now!)
You can help make science more objective, equitable, and just by making yourself at least as available for scoreless peer review as you are for publication review. You can do this right now, on your own, without waiting for others to declare a consensus or to offer you a seat on a scoreless review panel.
Just write that you are “available to conduct scoreless peer review” on your webpage, blog, or social media profile. You can add conditions, such as which topic areas you feel qualified to review, which you are interested in reviewing, and which you do not want to review. You can signal that you have limited time,15 such as by writing that you “try to be available to conduct scoreless peer review when possible”. So long as you include the phrase “available to conduct scoreless peer review” on media indexed by search engines, you will make yourself discoverable to others (e.g. by searching via Google or Bing).16 17 (I do so here.) And, if you’re worried you might change your mind, opting back out is as easy as deleting that text.
Your action will have near-immediate impact.18 Authors who may not otherwise have access to assistive review will now be able to find you. You will be visible to those not brave enough to be the first in their field to signal openness to scoreless peer review. You will be discoverable by those who would like to form scoreless peer review panels and want to know if your field might have a sufficient number of potential reviewers.
Fixing Publication Review
Many of us have no choice but to submit work for publication review because universities and research labs make hiring and promotion decisions based on where our work is published.19 This not only shifts labor from hiring committees to publication-review committees, but it also shifts the moral and legal responsibility for subjective criteria used to score research and the biases of those doing the scoring. It repackages the subjective opinions of the senior members of a field into objective-looking statistics.
Scoreless peer review is intentionally not a replacement for publication review because publication review serves functions that we may not want to replace.
Rather, scoreless peer review provides authors an alternative to publication review when we need help to improve our research work, or to help our audience evaluate its credibility, without seeking recognition or prestige. It provides an alternative form of reviewing service, empowering us to assist others without producing scores and becoming complicit in how those scores will be used. Through such service, we can undermine the pretense that scoring research is service to the endeavor of science and correct the misconception that accepted means credible.
Footnotes follow the comment stream.
Similarly, funding agencies, such as the US National Science Foundation (NSF), must ultimately decide which of their reviewed projects to fund. But, they do not glorify exclusivity. The last NSF reviewing instructions I received noted that it was “unfortunate” that most proposals cannot be accepted. In this author’s experience, the NSF works far harder than most publication review committees to ensure reviews are assistive to authors. Their instructions encouraged reviewers to provide detailed comments to improve future proposals, and I’ve witnessed NSF staff ask panelists to write kinder and more constructive reviews. ↩︎
The NeurIPS 2021 Consistency Experiment followed the earlier experiment at NIPS 2014 (the conference would be renamed NeurIPS). The 2014 call for papers stated “submissions will be refereed on the basis of technical quality, novelty, potential impact, and clarity” whereas the 2021 call for papers listed no evaluation criteria. ↩︎
Referring to our scoring errors as randomness may make us feel better about scoring others’ work, and having our work rejected, but it also delegitimizes the challenges faced by those who face actual bias. ↩︎
In The natural selection of bad science, Smaldino and McElreath argue that publication review also favors scientists whose “poor” methods “produce the greatest number of publishable results” which leads to “increasingly high false discovery rates”. ↩︎
Some believe that curating work by segregating it into accepts and rejects assists audiences with limited attention, since attention remains a limited resource, just as the capacity to print, mail, and store paper once was. As publication costs went to zero, we adapted to the deluge of information by evolving socio-technological innovations such as email, mailing lists, USENET, blogs & RSS feeds, Wikis, and social media. Those who want to offer curation recommendations have ample mechanisms to do so that do not require delaying work they find less significant from reaching its audience.
It is a common folly of age to believe that our juniors appreciate our priorities and values, when seniority can make us less open to new trends and less capable of recognizing them. Theocracies employ councils of elders to control access to information; science should not. ↩︎
Since journals and conferences compete to be exclusive, those submitting to the publications with the most experts can expect the highest rejection rates. ↩︎
We should also offer to assist authors before they conduct experiments, helping them refine their hypotheses and improve their experimental designs prior to incurring the risks, costs, and time the experiments require. The current practice of reviewing and rejecting experimental designs at publication time, after the costs of conducting the experiment have been sunk, is wasteful and frustrating for all involved. Pre-experimental review and registration of experimental designs can increase the chance that null results will be published. A 2018 examination of 127 bio-medical and psychological science registered reports (pre-registered studies) showed a 61% null result rate, as opposed to a typical 5-20% null result rate for studies published in venues that did not require pre-registration. ↩︎
It’s tempting to assume that we could accept or reject experiments objectively if we examined only their credibility. To understand why we cannot, consider an experiment that attempts to prove a hypothesis by rejecting a null hypothesis. The experiment does not consider or attempt to test a third hypothesis that would also lead the null hypothesis to be rejected. If a reviewer considers that third hypothesis sufficiently implausible, the third hypothesis does not impact the credibility of the experiment and it must be accepted. If a reviewer considers the third hypothesis sufficiently plausible, they might conclude that the experiment should have been designed to disprove it as well, and must reject the study. ↩︎
Evaluating experiments by their outcomes encourages authors to highlight those hypotheses that yielded statistically significant results rather than those that didn’t, to hypothesize potentially significant results only after seeing data (HARKing), and to engage in other irresponsible scientific practices. ↩︎
There are forms of publication review that open up some of their discourse, such as open peer review, in which the reviews of accepted papers become public, but biases that cause papers to be rejected still remain hidden. ↩︎
The failure of peer review to detect fraud is discussed by Adam Mastroianni in the postmortem of The rise and fall of peer review, starting at paragraph 4. (For those who read the full article, I feel obligated to note an issue with the prior paragraph, which cites three experiments in which participants acting as peer reviewers caught 25%-30% of major flaws to argue that “reviewers mostly didn’t notice” major flaws. In those three cited experiments, researchers added 8, 9, and 10 major errors to the research work participants reviewed. Reviewers need only find one major error to conclude a work should not be published as is, and after documenting two or three of 8-10 errors, participants conducting publication review can surely be expected to stop adding evidence for rejection.) ↩︎
Whereas publication review can reinforce knowledge asymmetries, audience-assistive feedback is designed to reduce knowledge asymmetries, reducing the knowledge gap between authors and audience. ↩︎
Further, when rejected work goes unpublished, it disappears from scientific discourse. We can’t learn from any mistakes that were made, and so the same mistakes and failed experiments can be re-run over and over. We bias research in favor of chance results. ↩︎
Research works today are often written more for reviewers than for their true audience, at a cost both to that audience and to authors. For example, reviewers’ expectations for ever-increasing numbers of cited works, and our need to exceed their expectations for citation counts when submitting work, have caused the number of citations to grow out of hand. Bibliographies have evolved like peacocks’ trains: vast assemblies of plumage, most of which serve no function but to attract the admiration of those who co-evolved to prize it. ↩︎
We each have limited precious time to dedicate to service, and we each have to decide how much time to spend scoring others and how much time to spend assisting others. If you are worried about the time required, consider that scoreless peer review could actually reduce all of our workloads. As more work is submitted for publication having already benefited from scoreless review, with its credibility already evaluated, less of that work may need to be rejected. Consider that work submitted to review committees with acceptance rates of 25% will require, in expectation, four submissions to be published, consuming the time of four sets of reviewers. Increasing the probability of acceptance to 34% reduces the expected number of submissions by about one (from 4 to roughly 3), reducing the expected review workload by an entire set of reviewers. Further, if publication review is no longer needed to establish the credibility of work, more conferences could forgo using peer reviewers to choose which works they think attendees will want to see and, instead, ask prospective attendees which works they actually want to see. ↩︎
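The arithmetic in this footnote can be checked with a short sketch. It assumes, as a simplification not stated in the text, that each submission of a work is accepted independently with the same probability p, so the number of submissions until acceptance follows a geometric distribution with mean 1/p. The function name `expected_submissions` is hypothetical and used only for illustration.

```python
def expected_submissions(p_accept: float) -> float:
    """Mean number of submissions until the first acceptance.

    Assumes independent submissions with a constant acceptance
    probability (a geometric distribution), which is a simplification.
    """
    if not 0.0 < p_accept <= 1.0:
        raise ValueError("p_accept must be in (0, 1]")
    return 1.0 / p_accept


for p in (0.25, 0.34):
    # Each submission consumes one set of reviewers, so this is also
    # the expected number of reviewer sets a work occupies.
    print(f"acceptance rate {p:.0%}: about {expected_submissions(p):.2f} submissions")
```

Under these assumptions, a 25% acceptance rate gives exactly 4 expected submissions and a 34% rate gives a little under 3, which matches the footnote's rounded figures.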
If you are not willing to perform both author- and audience-assistive peer review, I’d suggest “available to conduct author-assistive peer review” or “available to conduct audience-assistive peer review”. ↩︎
Peer review as unpaid (volunteer) service has been a norm in much of the scientific community. Yet, author-assistive reviews are often performed by colleagues at the same employer, or collaborators who are on the same grants, and so are often implicitly paid. We need to understand that reviewers may reasonably expect to be paid when reviewing industrial research and research by authors who are as well compensated as, if not better compensated than, the reviewer. ↩︎
Pending any delay for search engines to crawl your website for updates. ↩︎
Those universities and research labs could, without much disruption, make hiring and promotion decisions without examining where work is published. They could still use letters of support from others in the candidates’ field, read the scientific discourse that comments on and cites the candidates’ research, read the candidates’ research itself, and/or watch the candidate present their research. This would shift accountability for any bias closer to those making the hiring decisions, hopefully raising awareness of it. ↩︎