Innovations in Search and Retrieval of the COVID-19 Literature

In mid-March, the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health jointly announced CORD-19, a continually updated corpus that held some 26,000 scientific articles at launch. (The database is updated weekly on Fridays and, as of this writing, contains more than 52,000 articles.) In the White House Call to Action, the allocation of effort was described this way: “Microsoft’s web-scale literature curation tools were used to identify and bring together worldwide scientific efforts and results, CZI provided access to pre-publication content, NLM provided access to literature content, and the Allen AI team transformed the content into machine-readable form, making the corpus ready for analysis and study.”

What might not be immediately obvious to the public is the number of organizations that have since launched search and discovery initiatives in support of researchers working to minimize the impact of the virus. A sampling appears here:

There are some intriguing presentations of search functionality across that range. 

The interface on the Mendel engine allows for Boolean query building, with buttons allowing the user to select All (and), Any (or), or None (not). As with the search tool Yewno, Mendel focuses on searching terms in context rather than matching keywords.
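The All, Any, and None buttons correspond directly to the classic Boolean operators. As a rough illustration of how such a builder assembles a query behind the scenes, here is a minimal sketch; the function name and query syntax are hypothetical and do not reflect Mendel’s actual implementation:

```python
def build_boolean_query(all_terms=(), any_terms=(), none_terms=()):
    """Combine three term groups into one Boolean query string.

    all_terms  -> joined with AND  (the "All" button)
    any_terms  -> joined with OR   (the "Any" button)
    none_terms -> excluded via NOT (the "None" button)
    """
    parts = []
    if all_terms:
        parts.append("(" + " AND ".join(all_terms) + ")")
    if any_terms:
        parts.append("(" + " OR ".join(any_terms) + ")")
    if none_terms:
        parts.append("NOT (" + " OR ".join(none_terms) + ")")
    return " AND ".join(parts)


query = build_boolean_query(
    all_terms=["COVID-19", "saliva"],
    any_terms=["viral load", "titer"],
    none_terms=["influenza"],
)
# → '(COVID-19 AND saliva) AND (viral load OR titer) AND NOT (influenza)'
```

A real engine would also tokenize and quote multi-word phrases, but the mapping from buttons to operators is the essential idea.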

The CORD-19 Search tool from Amazon initially suggests that the user submit queries in natural language; one example is “When is the salivary viral load highest for COVID-19?” On the results page, a left-hand sidebar allows filtering by topic; the list appears limited to nine options, of which the user may select up to five. Topics are ordered by the number of retrieved results, with the largest count at the top. The salivary viral load question surfaces such topics as public-health-policy and labs-trial-human.

Microsoft’s CORD-19 AI Powered Search begins with more of an assumption about current awareness, suggesting that the user will want to search for the latest contributions by date range; once dates have been entered, a powerful set of twelve filters is exposed.

The Allen Institute’s offering is blindingly fast in retrieval but mystifying in that it is not immediately obvious how retrieval is driven or how artificial intelligence is being leveraged in determining the results.

Verizon Media’s Vespa-driven search relies on Boolean operators, displaying the older operator symbols (plus signs, parentheses, quotation marks, etc.) in its query structures. The system stands out for highlighting search terms in the article title, in the context of the abstract and full text, and in a machine-generated summary. The display also includes a clickable DOI.

Cactus Communications offers a particularly nice platform and interface, overlaying the CORD-19 corpus as well as those from Dimensions and LitCovid. The site describes itself as providing “a global research repository of available literature on COVID-19 from all the prominent sources and made it searchable using state-of-the-art AI technologies....Additionally, we have added relevant records from CrossRef. We are also looking beyond just English sources and will be covering content from Chinese, Japanese, and Korean publications, journals, and preprints by April 24 for more complete coverage.” The dashboard interface is far more engaging than those of the aforementioned search tools, designed as it is to spotlight the full range of Cactus Communications technology options. Modules in the sidebar point the visitor toward recommended and popular reads, must-watch videos, and theme-based expert reviews of the literature.

BIP! Finder for COVID-19 occupies a slightly adjacent space: it aims to “ease the exploration of COVID-19-related literature by enabling ranking articles based on various impact metrics.” Those metrics include popularity, influence, reader attention, and social media attention. It is not, however, a search tool; it offers no means of searching the literature dataset being ranked.

Over on Reddit, individual programmers are announcing search tools of their own, albeit on a more limited scale. A Kaggle data science community is actively engaged as well, along with other collaborators working on the Allen Institute for AI challenge. Technology providers such as Ontotext are touting the use of their knowledge-base-building tools by entities such as the Mayo Clinic and Cochrane.

Under the auspices of the Department of Commerce, the National Institute of Standards and Technology (NIST), in conjunction with the White House Office of Science and Technology Policy (OSTP), announced that it will sponsor a TREC-COVID Challenge. The announcement describes it as being “a unique opportunity for the information retrieval (IR) and text processing communities to contribute to the response to this pandemic, as well as to study methods for quickly standing up information systems for similar future events. The results of the TREC-COVID Challenge will identify answers for some of today's questions while building infrastructure to improve tomorrow's search systems.” TREC (the Text REtrieval Conference) is a long-established government initiative to encourage research in information retrieval from large text collections.

What the information community is seeing in the midst of this global pandemic harkens back to events in the wake of Sputnik’s launch more than fifty years ago, when researchers and scholars in the United States were jolted into recognizing that existing current awareness practices for tracking scientific development were failing even as the scientific literature was exploding. The stakes were high, and the response from the scholarly associations and government agencies of the time fostered the emergence of new abstracting and indexing services designed to save the time of the reader. The tools currently under development reflect a more sophisticated set of available technologies and a practical recognition of how researchers expect to uncover the literature most relevant to their work. Today’s efforts will undoubtedly influence the development of information services yet to come.