Standards and the Role of Preprints in Scholarly Communication

A high-level vision of the role preprints properly play in scholarly publishing should inform the development of National Information Standards Organization (NISO) community standards for preprint formatting, publication, and search and retrieval.

The Model

One vision (hereafter referred to as “the model”) for preprint publication, disclosed in an evolving preprint [1], focuses on physics preprints in arXiv. This focus is natural, given that physicists and allied practitioners of other mathematical and quantitative fields have been long-standing adopters of preprints, and arXiv has accordingly played a leading role in the preprint space. Preprint servers that share the “-rXiv” suffix with arXiv have since emerged.

The model may never materialize in pristine form; realizing it would take decades at a minimum. However, it provides one analytical framework for understanding the interplay of the various components of scholarly journal publishing and for thinking about how preprints can mitigate problems that beset this complicated market. (A new version of the aforementioned preprint will further develop this critique and discuss open access generally; both topics are beyond the scope of this paper.)

The model suggests that journal publishing and preprints in physics should be increasingly symbiotic. They have distinct roles that reflect historically recurring needs in physics (and STEM) publishing generally.

One need is rapid publication that allows researchers to stake claims of priority to new discoveries.

Robert Merton [2] documents how claims to priority figure in the history of science. He also points out problems that arise when originality becomes an end in itself: “Contentiousness, self-assertive claims, secretiveness lest one be forestalled, reporting only the data that support an hypothesis, false charges of plagiarism, even the occasional theft of ideas and, in rare cases, the fabrication of data...” (p. 323). While no system will be perfect, there are good arguments for incentivizing originality by enabling staking of claims to priority of discovery [3].

Preprints play this role by enabling open, cutting-edge disclosure of discrete findings at the research frontier within unfolding research agendas [4].

The model suggests, by contrast, that journal articles take the form of traditional review articles [5] that cite other journal articles, conference proceedings, books, and — increasingly — research disclosed over several preprints. Journal articles should play a pedagogical role in orienting researchers and students to new fields, creating narratives about newly emerging trends, contextualizing discoveries, and fostering interdisciplinary research.

The model calls for the journal market to contract significantly, though not entirely, as preprints supplant journal articles as the place to disclose small slivers of research. Re-purposing journal articles and correspondingly trimming their numbers will decrease demand for the journal subscriptions that pressure budget-strapped libraries. An increased emphasis on review articles will assist researchers in navigating their fields, help counter hyper-specialization, and make the inter-generational transmission of science much more efficient. Contracting the journals market and re-purposing it almost exclusively toward review articles can also save genius-hours spent doing peer review, time better spent doing research disclosed in preprints, writing or reviewing integrative journal articles, and teaching.


arXiv and the Development of NISO Standards

Guided by the model’s framework, the remainder of the article reviews arXiv features and suggests some enhancements, all in the context of proposing community standards useful for physics publishing and, perhaps, other subject domains.

Standards will create a high bar that discourages haphazard or non-serious preprint submissions. Also, investing preprints with the capabilities and features now standard in electronic journal publishing will underscore the symbiosis between preprints and journal articles that the model envisions.

Indexing

Doesn’t the model just perpetuate a preprint publishing glut that mirrors the journal article glut?

In reply, the model foresees journal articles increasingly serving as pointers to, and critical maps of, research disclosed in the frontier preprint space. Also, it argues that fine-grained thesaurus and classification schemes, useful in the past to navigate the journal article glut, can help navigate the preprint space.

arXiv displays a high-level subject classification scheme on its website and, for submissions to the math and “cs” archives, provides fields for the “mathematical classification code according to the Mathematics Subject Classification” and the “ACM Computing Classification System,” respectively. The arXiv advanced search includes these in its drop-down field menu.

Detailed classification schemes, as well as granular thesauri, enable focused retrieval of results on a particular subject, which the searcher can then refine further with keywords.
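As a minimal illustration of this kind of focused retrieval, the following Python sketch combines a fine-grained arXiv category with a keyword using the existing arXiv API (the endpoint and query syntax are documented by arXiv; the particular category and keyword are arbitrary examples, not recommendations):

```python
# Minimal sketch: focused retrieval via the arXiv API, combining a
# fine-grained category with a keyword. The category (cond-mat.str-el)
# and keyword are arbitrary examples chosen for illustration.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def focused_search(category: str, keyword: str, max_results: int = 10):
    """Return (title, id) pairs for preprints in `category` matching `keyword`."""
    query = f'cat:{category} AND all:"{keyword}"'
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(
        {"search_query": query, "start": 0, "max_results": max_results}
    )
    with urllib.request.urlopen(url) as response:
        feed = ET.parse(response).getroot()
    return [
        (entry.findtext(f"{ATOM}title"), entry.findtext(f"{ATOM}id"))
        for entry in feed.findall(f"{ATOM}entry")
    ]

if __name__ == "__main__":
    for title, arxiv_id in focused_search("cond-mat.str-el", "spin liquid"):
        print(arxiv_id, "-", " ".join(title.split()))
```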

Best practices emanating from the NISO information community could recommend that preprint servers exploit the full range of existing thesauri and classification schemes to enhance preprint organization and searchability. Preprint servers might, as a practice, require persons submitting preprints to label their work with controlled index terms. To accommodate interdisciplinary work, one preprint could fall under more than one hierarchy of indexing terms, much as a heading in the NLM MeSH trees can appear in multiple hierarchies.

Standardized indexing can also help abstracting and indexing databases in efforts to provide coverage of preprint literature.

Bibliometric Tools

The expectation that researchers publish large quantities of journal articles as a condition for securing tenure, promotion, and grant funding helps drive the glut of article publishing. Rewarding faculty for publishing high-impact preprints would help advance the model’s goals by decreasing the number of journal articles while accommodating the drive to stake claims to discrete research findings at the cutting edge of science. Arguably, however, rewards as significant as those conferred for generating original research results should be conferred for writing excellent review articles, given that reviews play an essential role in educating researchers and transmitting knowledge.

The vexing and enduring question, of course, is how to assess the impacts of preprints and review articles.

Citing and cited data will remain important, including metrics that weigh how much a citing item is itself cited. Consistent with the model, citations from integrative journal articles of the type mentioned earlier should carry particular gravitas. Standards for the availability and use of altmetrics would also help in this context. If a quality, peer-reviewed journal article of the review sort mentions a preprint, this will often (though not as a rule) suggest that the preprint has “made it,” since it merits mention in the critical narrative of a review article.

Community standards for preprint servers might encourage easy collection of a wide range of relevant bibliometric data. Such data would allow flexibility of approach in assessing impact. One noteworthy attempt is SciMeter, which enables analysis of arXiv papers.

There also need to be ways to rank the citing data for arXiv articles made available via Semantic Scholar, INSPIRE-HEP, and NASA ADS. This would help users of arXiv discover influential preprints, make it easy for journal editors to identify preprints worth citing in review articles, and more readily enable bibliometric work to track macro-trends in physics research, such as the identification of newly emerging sub-domains in physics.
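As a rough sketch of what such ranking might look like in practice, the snippet below uses the public Semantic Scholar Graph API, one of the sources named above, to rank a short list of arXiv identifiers by citation count; INSPIRE-HEP and NASA ADS expose comparable APIs, and the two identifiers used here are simply the arXiv papers cited in this article’s footnotes.

```python
# Rough sketch: rank arXiv preprints by citation count using the public
# Semantic Scholar Graph API. Any of the citation sources named above
# could stand in for Semantic Scholar.
import json
import urllib.request

API = "https://api.semanticscholar.org/graph/v1/paper/arXiv:{}?fields=title,citationCount"

def citation_count(arxiv_id: str):
    """Return (title, citation count) for one arXiv identifier."""
    with urllib.request.urlopen(API.format(arxiv_id)) as response:
        record = json.load(response)
    return record.get("title", arxiv_id), record.get("citationCount", 0)

if __name__ == "__main__":
    arxiv_ids = ["1904.01470", "1312.3872"]  # example IDs from this article's footnotes
    ranked = sorted((citation_count(a) for a in arxiv_ids),
                    key=lambda pair: pair[1], reverse=True)
    for title, count in ranked:
        print(f"{count:6d}  {title}")
```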

Email Alerts

arXiv offers email alerts for broad subject headings. The indexing just discussed would allow users to set up much more focused email alerts than arXiv currently supports. It would also be helpful for arXiv to enable email alerts for the publication of new preprints that cite a preprint or author of interest.
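A hedged sketch of how a finer-grained alert could work, assuming only the existing arXiv API and a locally stored set of already-seen identifiers (the alert query is a placeholder, and an actual service would send email rather than print):

```python
# Sketch of a finer-grained alert: poll the arXiv API for a narrow query
# (here a category plus a keyword, both placeholders), compare against a
# locally stored set of already-seen IDs, and report anything new. A real
# alert service would email the new items instead of printing them.
import json
import pathlib
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
SEEN_FILE = pathlib.Path("seen_ids.json")

def fetch_ids(query: str, max_results: int = 50):
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(
        {"search_query": query, "start": 0, "max_results": max_results,
         "sortBy": "submittedDate", "sortOrder": "descending"}
    )
    with urllib.request.urlopen(url) as response:
        feed = ET.parse(response).getroot()
    return {entry.findtext(f"{ATOM}id") for entry in feed.findall(f"{ATOM}entry")}

def run_alert(query: str) -> None:
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    current = fetch_ids(query)
    for new_id in sorted(current - seen):
        print("New preprint matching alert:", new_id)  # stand-in for an email
    SEEN_FILE.write_text(json.dumps(sorted(current | seen)))

if __name__ == "__main__":
    run_alert('cat:hep-th AND all:"holography"')  # placeholder alert query
```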

Credentialing

arXiv mentions, “During the submission process, however, we may require authors who are submitting papers to a subject category for the first time to get an endorsement from an established arXiv author.” Also, “arXiv submitters are … encouraged to associate an institutional email address, if they have one, with their arXiv account.”

These practices set a very helpful bar that submitters must clear before publishing preprints.

As a matter of community standards, institutional affiliations or endorsements should generally be a requirement for publishing in a preprint server. Use of ORCID iDs [6] provides one more reasonable bar to clear before publishing preprints.

arXiv mentions, “We encourage all arXiv authors to link their ORCID iD with arXiv.” Perhaps the general standard for preprint servers should be to make this a requirement for submission.

Data

Da Silva concludes that the “rush to publish work as a free OA document with a citable identifier, the digital object identifier (DOI), may also invite a wealth of bad, weak, or poor science. To reduce this risk, given the centrality of preprints in the open science movement, preprints should also have open data policies, that is, preprints cannot be published unless the data sets are also placed in the public domain.” [7]

arXiv allows data sets to accompany a preprint, but a best practice along the lines Da Silva suggests would be to recommend that all preprints disclosing empirical results be accompanied by data, with authors either uploading the data and the computer code used to process or model them directly to the preprint server, or identifying within the preprint’s metadata a DOI for a data set deposited in a stable independent repository.
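A purely hypothetical sketch of what such submission metadata might look like follows; the field names are invented for illustration and do not correspond to arXiv’s actual submission schema, and the DOI is a placeholder.

```python
# Purely hypothetical submission-metadata sketch: the field names below are
# invented for illustration and are not arXiv's actual schema. The point is
# that a preprint disclosing empirical results either carries its data and
# code or points to a DOI for a data set in an independent repository.
preprint_submission = {
    "title": "Example empirical preprint",        # placeholder
    "primary_category": "cond-mat.str-el",        # placeholder category
    "data_availability": {
        "dataset_doi": "10.5281/zenodo.0000000",  # placeholder DOI
        "code_uploaded": True,                    # code archived with the preprint
        "repository": "Zenodo",                   # placeholder repository name
    },
}

# A server-side check under this hypothetical schema: refuse empirical
# submissions that neither upload data/code nor point to a dataset DOI.
def has_open_data(submission: dict) -> bool:
    data = submission.get("data_availability", {})
    return bool(data.get("dataset_doi")) or bool(data.get("code_uploaded"))

assert has_open_data(preprint_submission)
```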

Such a practice might better ensure higher quality and seriousness of purpose in preprints. Per the model, it would also help consolidate the preprint format's role as the point of initial disclosure of new research findings. It would align with the aims of the open data movement and underscore the possible [8] role of preprints as a place to record negative results. 

Journal articles, again per the model, can provide meta-analyses of how well empirically focused preprints use data to support hypotheses that inform research agendas.

Bibliographic Visualization

Standards defining best practices for visually displaying citing-cited relationships between preprints and journal articles can help reinforce the model’s emphasis on their complementarity [9]. Another possibility is visualization based on topic clustering, showing points of contact between various subject clusters.
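One hedged sketch of such a display, assuming citing-cited pairs have already been harvested from sources like those discussed in the bibliometrics section (the edge list below is an invented placeholder), uses the networkx and matplotlib libraries:

```python
# Sketch: draw a small citing-cited graph linking preprints and the journal
# review articles that cite them. The edge list is an invented placeholder;
# in practice it would be harvested from citation sources such as those
# discussed in the bibliometrics section.
import matplotlib.pyplot as plt
import networkx as nx

# (citing item, cited item) pairs -- placeholders for harvested data
edges = [
    ("review-article-A", "preprint-1"),
    ("review-article-A", "preprint-2"),
    ("review-article-B", "preprint-2"),
    ("review-article-B", "preprint-3"),
    ("preprint-3", "preprint-1"),
]

graph = nx.DiGraph(edges)
colors = ["lightblue" if node.startswith("review") else "lightgreen"
          for node in graph.nodes]
nx.draw_networkx(graph, node_color=colors, node_size=1500, font_size=8,
                 arrows=True)
plt.axis("off")
plt.show()
```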

In this context, see the keyword cloud visualization and other capabilities at Scimeter.org.

Text Mining

APIs can help in studying the symbiosis between preprints and journal articles that the model enjoins. They would also help in bibliometric studies relevant to developing metrics for the research impact of preprints, as well as in tracking new trends in scientific research.

It is important to enable text mining to the extent possible. Cf. arXiv: “arXiv supports real-time programmatic access to metadata and our search engine via the arXiv API. Results are returned using the Atom XML format for easy integration with web services and toolkits.”
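As a small illustration of such programmatic access, the sketch below pulls Atom metadata for recent preprints in one category via the arXiv API quoted above and counts frequent abstract terms; the category and the crude tokenization are placeholders for a real text-mining workflow.

```python
# Sketch: pull Atom metadata for recent preprints in one category via the
# arXiv API and count the most frequent abstract terms. The category and
# the naive tokenization are placeholders for a real text-mining pipeline.
import collections
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def top_abstract_terms(category: str, n_terms: int = 15, max_results: int = 100):
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(
        {"search_query": f"cat:{category}", "start": 0,
         "max_results": max_results, "sortBy": "submittedDate",
         "sortOrder": "descending"}
    )
    with urllib.request.urlopen(url) as response:
        feed = ET.parse(response).getroot()
    counts = collections.Counter()
    for entry in feed.findall(f"{ATOM}entry"):
        abstract = entry.findtext(f"{ATOM}summary") or ""
        counts.update(re.findall(r"[a-z]{5,}", abstract.lower()))
    return counts.most_common(n_terms)

if __name__ == "__main__":
    for term, count in top_abstract_terms("astro-ph.CO"):  # placeholder category
        print(f"{count:4d}  {term}")
```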

Manuscript Templates and Journal Article Recruitment

Enforcing a uniform manuscript style for preprints is a way to make their appearance less haphazard and to help reinforce their symbiosis with journals, which typically have standardized formatting.

Preprint servers offer journal editors a recruitment ground for articles of the kind the model envisions. arXiv mentions that “arXiv-hosted manuscripts are used as the submission channel to traditional publishers such as the American Physical Society, and newer forms of publication such as the Journal for High Energy Physics and overlay journals.” This is yet another area for discussion about best practices that will help reinforce the preprint/journal symbiosis.

Labeling preprints that critically review a discipline could facilitate this recruitment.

"Caveat Lector" Boilerplate

At the fall Charleston meeting, Kent Anderson argued that preprints can mislead the public. Since my concern is more with arXiv’s domain, which includes coverage of many topics that do not immediately bear on human welfare, I will not evaluate this claim as it pertains to biomedical research. One may ask, however, whether the "NIH Expectations for Researchers Who Post Preprints" [10] enjoining researchers to "clearly state work was not peer reviewed" might help for preprints generally.

Such a “caveat lector” (reader beware) warning, to be effective, should explain the meaning of “peer review” for the benefit of laypeople and benighted journalists unfamiliar with the concept. This becomes a valuable teaching moment.

As Polka suggests, we need transparency about the level of peer review to which preprints may already have been subjected. As she says, “The preprint disclosures [suggesting that a preprint has not been subjected to peer review] undermines the work that’s gone into producing the existing [preprint manuscript] evaluation.” [11]

Her analysis suggests the need for a community standard calling for clear identification of the level of scrutiny a preprint has undergone. Funding agencies that support research disclosed in preprints could recommend or require this. Polka adds that a comparable point also extends to journal articles: “ … the state of the journal articles should be clear, too. A framework for concise expression of peer review attributes was proposed by the Peer Review Transparency project.”

There are also questions about how to make practicable Anderson’s suggestion [12] about restricting preprint access to institutions and persons who can responsibly use preprint content. If this involves limiting access to paying institutions, researchers at non-subscribing institutions would not learn immediately about claims to priority for new discoveries. Questions about equitable access to this important information arise.

Perhaps access to preprint servers could be limited to .edu or .org institutions, but how would this be implemented, what technological workarounds would be readily available to circumvent such an arrangement, and how would it impact independent scholars or (responsible!) journalists?

Also, some preprints will never be peer reviewed. Is this such a bad thing? Preprints accommodate the publication of negative experimental results, which lack the flashy appeal that newsworthy scientific publishing in major journals affords. Disclosure of negative results helps other researchers avoid wasting time or, on the other hand, gives them ideas about how to improve experimental approaches that did not yield results [13].

Anderson’s concern that a sort of graveyard of preprints will emerge that have not proved scientifically useful should not extend to physics, and possibly other areas, though again I will not weigh in on biomedical areas. Should we mothball preprints that disclose negative results? Removing public access to preprints that have not turned into peer-reviewed journal articles also removes them from ready study by historians of science. Moreover, there is a distinct possibility that a preprint will disclose an important but not immediately appreciated finding, one for which it is useful to have a date-stamp of disclosure to help in adjudicating claims to priority.

Professional Standards and the Role of Societies

Academic societies need to police the integrity of the preprint space. NISO could work with them to define standards of professional conduct in the use of preprints, as well as penalties for misuse, such as expulsion from a society for stealing ideas initially disclosed in a preprint, plagiarizing preprint text, manufacturing data that appear with a preprint, or cherry-picking data to support one’s points. Societies should also work with journalists and journalism schools to greatly improve journalistic standards in science reporting.

Ideally, societies would work toward reconceptualizing journals as properly serving as venues for review articles, contracting the journal article glut, and transforming the tenure, promotion, and grant funding criteria that drive that glut.

Commenting and Annotating Capabilities

In their presentation at the recent NISO conference on preprints, Tom Narock and E. Goldstein mention that "a total of 135 annotations were found across the >9000 on the nine services" they examined [14].

The nine services did not include arXiv, and I am not aware of a comparable study of arXiv. (The ScienceWISE service is relevant here. From ScienceWISE: "The ScienceWISE system allows scientists, in the course of their daily work ... [to do] annotating [of] scientific research papers, uploaded to ArXiv.org, and linking them to the ScienceWISE ontology, thus expanding content of their papers with supporting material in the form of encyclopedia-like articles." Cf. also the “Trackbacks” capability in arXiv.)

One can see value in developing standards that promote critical evaluation of work appearing in other preprints and that enable researchers to find easily preprints taking opposing positions on a given research topic. For example, a preprint’s metadata could include links to other preprints that cite it.
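As a sketch of how such links might be populated, one could, for example, query the public Semantic Scholar Graph API for the items citing a given preprint and keep those that are themselves on arXiv; the identifier below is a placeholder, and other citation sources could serve equally well.

```python
# Sketch: list the papers citing a given arXiv preprint, keeping only those
# that are themselves arXiv preprints, via the public Semantic Scholar
# Graph API. The identifier below is a placeholder.
import json
import urllib.request

CITATIONS = ("https://api.semanticscholar.org/graph/v1/paper/arXiv:{}"
             "/citations?fields=title,externalIds")

def citing_arxiv_preprints(arxiv_id: str):
    with urllib.request.urlopen(CITATIONS.format(arxiv_id)) as response:
        payload = json.load(response)
    results = []
    for item in payload.get("data", []):
        citing = item.get("citingPaper", {})
        ext = citing.get("externalIds") or {}
        if "ArXiv" in ext:  # keep only citing items that are arXiv preprints
            results.append((ext["ArXiv"], citing.get("title")))
    return results

if __name__ == "__main__":
    for cid, title in citing_arxiv_preprints("1904.01470"):  # placeholder ID
        print(cid, "-", title)
```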

Payment Models

An obvious challenge for preprint servers is funding. Anderson suggested that there could be a charge for using preprint servers. A preprint submission fee would be one more helpful bar to haphazard preprint submission [15]. The charge should be as nominal as possible to avoid consuming grant funds better spent for other purposes.

Concerning Anderson’s suggestion, referenced above, that access to preprints be limited just to researchers [16]: while doing so would provide a revenue source through institutional or individual subscriptions, limiting access in this way would undercut the model’s view that preprints are valuable for establishing priority of discoveries. Public accessibility without a paywall ensures that researchers are not arbitrarily beholden to subscription-based access to a plethora of preprint servers in order to learn about new discoveries, especially interdisciplinary ones. Consolidation of preprint servers might help avoid this problem, but is that desirable, given that independent preprint servers for various subjects can attend in an expert way to the specific needs of particular subject domains?

Incidentally, questions about the sustainability of preprint servers are not unique to them, given plausible questions about whether the current surfeit of journal publishing is itself sustainable in the face of limited library budgets.

Conclusions

In sum, the interrelated long-run goals are to cut down the material that researchers must read just to stay afloat, to make the primary function of journals the peer-reviewed evaluation of swaths of research at the preprint frontier, and to place downward pressure on the tenure, promotion, and grant evaluation criteria that create demand for huge quantities of journal articles.

Following are some concluding suggestions for achieving these goals. First, decision-making about NISO preprint standards would benefit from continuous surveying of user perceptions across subjects. Surveys can focus in part on researcher behaviors in contexts where Big Deal journal package negotiations have broken down; these afford “natural experiments” of a sort. Such surveys might ask how researchers navigated an environment in which they did not have as much access to journal literature. Did preprints help satisfy their needs? If so, what do researchers think about the quality of the preprints on which they relied in this environment? These questions can segue into more general questions about researcher perceptions of the future role preprints should play vis-à-vis journal articles.

Second, libraries whose budgets cannot keep up with the usual course of journal prices can, in a coordinated way (including within their consortia), work toward defining a core set of mission-critical journals and make these the sole focus of purchase negotiations.

In response to the argument that this makes libraries and their consortia far too activist in transforming scholarly publishing, consider that they already advocate for the end goal of universal open access, often without attending to plausible counterarguments about the unintended consequences of particular schemes for achieving it.

Finally, questions about the model’s particular perspective should not distract from the larger point. A concept of how preprints satisfy historically longstanding needs in scholarly publishing should drive development of community preprint standards. The process should be slow, deliberate, and open-mindedly responsive to a wide range of counterarguments to this or that proposed idea.

Acknowledgement

Many thanks to Jill O’Neill for helpful comments.

Disclaimer

Views expressed are those of the author and do not necessarily represent those of his employer.

Footnotes

1. Edition 2 is available in two places:

https://arxiv.org/abs/1904.01470 and https://www.researchgate.net/publication/332144796_arXiv_and_the_Symbiosis_of_Physics_Preprints_and_Journal_Review_Articles

Edition 3 of the preprint will include substantive changes.

2. R.K. Merton, "Priorities in Scientific Discoveries", in The Sociology of Science: Theoretical and Empirical Investigations, edited by N. W. Storer (University of Chicago Press, Chicago, 1973), pp. 286-324.

3. Cf. Merton, op. cit., pp. 321-322, about the incentivizing role of ascribing priority to discoveries. Merton correctly suggests in his essay that the obsession to stake priority can become pathological but, in my view, the benefits of enabling claims to priority of discovery outweigh the negatives, even if it is important to find strategies to dampen the ridiculous excesses that he describes.

Incidentally, by virtue of being open access, perhaps preprints will mitigate some of the problems that Merton identifies if they include or point to open data, thereby enabling rapid and open disclosure of data necessary to evaluate results and encourage replication.

4. See Jessica Polka, “Why ‘what is a preprint?’ is the wrong question” (December 2019, https://www.niso.org/niso-io/2019/12/why-what-preprint-wrong-question) for some discussion of the variety of meanings of “preprint.” Preprint will here mean: openly accessible; not yet published in a journal; including manuscripts that have been submitted and undergone peer review but been rejected, and not ones accepted by a journal after peer review.

5. For a discussion of Garfield’s views about the importance of review articles, see pp. 12-13 of S. Bensman, “Eugene Garfield, Francis Narin, and Pagerank: The Theoretical Bases of the Google Search Engine”, submitted 13 December 2013, https://arxiv.org/ftp/arxiv/papers/1312/1312.3872.pdf

6. See mention of ORCID iDs here and the need to supplement them: https://twitter.com/RouhiRoo/status/1195093309179518978

7. J. Da Silva. The preprint debate: What are the issues? Medical Journal Armed Forces India 74, 162-164 (2018).

8. "Do preprints offer a solution for the publication of negative results?" in Judy Luther, "The Stars Are Aligning for Preprints," The Scholarly Kitchen, April 18, 2017. https://scholarlykitchen.sspnet.org/2017/04/18/stars-aligning-preprints/

9. See B. Simboli, Web of Science's "Citation Mapping" Tool. Issues in Science and Technology Librarianship (Summer 2008), http://www.istl.org/08-summer/electronic-1.html, DOI:10.5062/F4NZ85MT, for some discussion of citing-cited visualization.

10. NIH presentation, “Driving Use: Identifiers and Enhanced Metadata.” Slide 4 at https://www.niso.org/events/2019/11/open-access-role-and-impact-preprint-servers

11. Polka, op. cit.

12. Jill O'Neill. Preprints: An Interview with Kent Anderson: What Is The Function of Preprints Now? October 2019. https://www.niso.org/niso-io/2019/10/preprints-interview-kent-anderson

13. J. Da Silva. Preprint policies among 14 academic publishers. The Journal of Academic Librarianship 45, 162 (2019).

14. T. Narock and E. Goldstein, "Acceptance and Use of Preprints in Diverse Domains and Disciplines" at slide 17, https://www.niso.org/events/2019/11/open-access-role-and-impact-preprint-servers

15. Jill O'Neill. Preprints: An Interview with Kent Anderson: What Is The Function of Preprints Now? This is consistent with charging a subscription for access, a fee to post, or both. These are distinct; in my view there is an argument for a submission fee but not for a subscription cost. See also https://twitter.com/acochran12733/status/1195051301970857988

16. Jill O'Neill. Preprints: An Interview with Kent Anderson: What Is The Function of Preprints Now? October 2019. https://www.niso.org/niso-io/2019/10/preprints-interview-kent-anderson