Text and Data Mining

Virtual Conference


Not so long ago, Text and Data Mining (TDM) — the automated detection of patterns and extraction of knowledge from machine-readable content or data — was a particular area of interest. So much so, that libraries and content providers developed licensing language and other resources to support researchers wanting to work with and manipulate this material, including a proliferation of LibGuides and APIs. But where are we now in identifying available resources and tools for TDM activities?

This virtual conference will provide an “explainer” for information professionals tasked with supporting researchers who are just beginning to engage with TDM, and wondering how to pull the data they need, how it is structured, and how they can expect to engage with it. Our speakers will cover essential technology, how it is deployed and used, the scope of support that the library may be asked to provide, and the spectrum of options for collaboration between information professionals and content and service providers.

Confirmed speakers include (among others): 

  • Dr. Nathan Kelber, Director, Text Analysis Pedagogy Institute, JSTOR Labs
  • Petr Knoth, Senior Research Fellow in Text and Data Mining, Open University
  • Dr. Prathik Roy, Group Product Manager, Springer Nature
  • Shyama Saha, Senior Machine Learning/Text-mining Scientist, Literature Service, EMBL-EBI
  • Aravind Venkatesan, Senior Data Scientist, Literature Service, EMBL-EBI
  • John Walsh, Director, HathiTrust Research Center
  • Huajin Wang, Co-Director, Open Science & Data Collaborations Program, Carnegie Mellon University

Event Sessions

12:00pm - 12:15pm Welcome

12:15pm - 12:45pm Vision Interview


Petr Knoth

Senior Research Fellow in Text and Data Mining
Open University (UK)

Petr Knoth, Senior Research Fellow in Text and Data Mining at the Open University, will be the subject of our Vision Interview for this event.

12:45pm - 1:15pm Handling the Basics


Prathik Roy

Product Director, Data Solutions and Strategy
Springer Nature

The path to innovation requires the systematic analysis of millions of documents. Springer Nature has built Open APIs for Text & Data Mining (TDM) purposes to enable researchers to better explore new paths to innovation. By leveraging the power of our APIs and mining our multidisciplinary portfolio of open access content, researchers can open up endless possibilities for developing new knowledge models or creating new APIs that can be developed and monetized.

As the volume of scientific publications increases and TDM software tools improve, we are moving towards a more formalized process to enable TDM, and strive to make this as simple as possible for information managers and researchers.

In this session, Dr. Prathik Roy will explore the power of Springer Nature's Text & Data Mining/Open APIs and how they have been helping researchers uncover new data and solve complex problems.

1:15pm - 1:45pm Text and Data-Mined Content at Europe PMC & the Impact of Licensing


Shyama Saha

Senior Machine Learning/Text-mining Scientist
Literature Service, EMBL-EBI

Europe PMC is a digital repository that indexes life science scholarly publications. It provides intuitive and powerful search tools and links the underlying data to the relevant biological data resources. Europe PMC hosts 40.5 million abstracts and 7.8 million full-text articles, including research articles, preprints, books, protocols, and reviews. Europe PMC uses text-mining techniques, including machine learning, to annotate literature from the Open Access and CC-BY set with relevant biological terms, their relationships, data citations and accession numbers, and more. The text-mined information is publicly available and programmatically accessible through our Annotation API in a standardised machine-readable format for reusability, helping stakeholders including scientists, bioinformaticians and curators access the underlying data and promoting Open Science. In this talk, we will give an overview of our text and data mining activities and their impact.


1:45pm - 2:30pm Comfort Break (45 minutes)

2:30pm - 3:00pm Case Study - Carnegie Mellon


Huajin Wang

Liaison Librarian
Carnegie Mellon University Libraries

In this talk, Huajin Wang will provide a brief overview of resources at Carnegie Mellon University Libraries that support text and data mining, offer a few use cases that highlight challenges that students and faculty face in their text and data mining projects, and discuss how the Libraries help them navigate these challenges.

3:00pm - 3:30pm Case Study - Improving Library Services and Researcher Outcomes for Text Analysis


How can we improve support for librarians and researchers interested in text analysis? In the past three years, JSTOR Labs has interviewed hundreds of librarians and researchers across the globe about the challenges they face in completing text analysis projects. Constellate, the successor to Data for Research, offers researchers access to textual data from JSTOR, Portico, and other providers. This data access is a significant boon for research, yet our experience has shown that the most significant way to improve research outcomes is to improve text analysis literacy. How are libraries doing this now? And how will that change over the next decade?

3:30pm - 4:00pm Case Study - HathiTrust


The mission of the HathiTrust Research Center (HTRC) is to provide tools, environments, and services for computational research on the content of the 17-million-volume HathiTrust Digital Library. In this talk, I will provide an overview of the Text Data Mining (TDM) activities and services provided by HTRC, with additional detail on two current initiatives, Scholar Curated Worksets for Analysis, Re-use, and Dissemination (SCWAReD), supported by the Andrew W. Mellon Foundation, and Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), supported by the National Endowment for the Humanities.

Additional Information

NISO assumes organizations register as a group. The model assumes that an unlimited number of staff will be watching the live broadcast in a single location, but also includes access to an archived recording of the event for those who may have timing conflicts. 

NISO understands that, during the current pandemic, staff at a number of organizations may be practicing safe social distancing or working remotely. To accommodate those workers, we are allowing registrants to share the sign-on instructions with all colleagues so that they may join the broadcast directly. 

Registrants receive sign-on instructions via email on the Friday prior to the virtual event. If you have not received your instructions by the day before an event, please contact NISO headquarters for assistance via email (nisohq@niso.org). 

Registrants for an event may cancel participation and receive a refund (less $35.00) if the notice of cancellation is received at NISO HQ (nisohq@niso.org) at least one full week prior to the event date. If notice is received fewer than 7 days before the event, no refund will be provided.

Links to the archived recording of the broadcast are distributed to registrants 24-48 hours following the close of the live event. Access to that recording is intended for internal use by staff at the registrant’s organization or institution. Speaker presentations are posted to the NISO event page.

Broadcast Platform

NISO uses the Zoom platform to broadcast our live events. Zoom provides apps for a variety of computing devices (tablets, laptops, etc.). To view the broadcast, you will need a device that supports the Zoom app. Attendees may also choose to listen to the audio only on their phones; sign-on credentials include the necessary dial-in numbers for those who prefer this option. Once registrants are notified that recordings are available, they may download them from the Zoom platform for local viewing.