Text and Data Mining

Virtual Conference


Not so long ago, Text and Data Mining (TDM), the automated detection of patterns and extraction of knowledge from machine-readable content or data, was a particular area of interest. So much so that libraries and content providers developed licensing language and other resources to support researchers wanting to work with and manipulate this material, including a proliferation of LibGuides and APIs. But where are we now in identifying available resources and tools for TDM activities?

This virtual conference will provide an “explainer” for information professionals tasked with supporting researchers who are just beginning to engage with TDM and who are wondering how to pull the data they need, how it is structured, and how they can expect to engage with it. Our speakers will cover essential technology, how it is deployed and used, the scope of support that the library may be asked to provide, and the spectrum of options for collaboration between information professionals and content and service providers.

Confirmed speakers include (among others): 

  • Dr. Nathan Kelber, Director, Text Analysis Pedagogy Institute, JSTOR Labs
  • Petr Knoth, Senior Research Fellow in Text and Data Mining, Open University
  • Dr. Prathik Roy, Group Product Manager, Springer Nature
  • Shyama Saha, Senior Machine Learning/Text-mining Scientist, Literature Service, EMBL-EBI
  • John Walsh, Director, HathiTrust Research Center
  • Huajin Wang, Co-Director, Open Science & Data Collaborations Program, Carnegie Mellon University

Event Sessions

12:00pm - 12:15pm Welcome

12:15pm - 12:45pm Vision Interview


Petr Knoth

Senior Research Fellow in Text and Data Mining
Open University (UK)

Petr Knoth, Senior Research Fellow in Text and Data Mining at the Open University, will be the subject of our Vision Interview for this event.

Important Resources:

LaTeX - a document preparation system for high-quality typesetting. It is most often used for medium-to-large technical or scientific documents, but it can be used for almost any form of publishing

The Text Encoding Initiative (TEI) - a consortium which collectively develops and maintains a standard for the representation of texts in digital form

JATS (Journal Article Tag Suite) - an application of NISO Z39.96-2019, which defines a set of XML elements and attributes for tagging journal articles and describes three article models.
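
For readers new to JATS, the following minimal Python sketch shows how a researcher might pull the title, abstract, and section text out of a JATS-tagged article for mining. The sample document is a hand-written, heavily simplified illustration (real JATS files carry far more metadata and may use namespaces), and the function name extract_jats_text is hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified JATS-style sample for illustration only.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <abstract><p>A short abstract used for illustration.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p><p>Third paragraph.</p></sec>
  </body>
</article>"""


def paragraph_text(element):
    """Return the full text of an element, including text inside inline tags."""
    return "".join(element.itertext()).strip()


def extract_jats_text(xml_string):
    """Collect the title, abstract, and per-section paragraphs from JATS XML."""
    root = ET.fromstring(xml_string)

    # The article title lives under front/article-meta/title-group.
    title = root.findtext(".//article-title", default="")

    # Join every paragraph found inside any abstract element.
    abstract_parts = []
    for abstract in root.iter("abstract"):
        for p in abstract.iter("p"):
            abstract_parts.append(paragraph_text(p))

    # Map each top-level body section heading to its direct paragraphs.
    sections = {}
    for sec in root.findall("./body/sec"):
        heading = sec.findtext("title", default="")
        sections[heading] = [paragraph_text(p) for p in sec.findall("p")]

    return {
        "title": title,
        "abstract": " ".join(abstract_parts),
        "sections": sections,
    }


if __name__ == "__main__":
    article = extract_jats_text(SAMPLE_JATS)
    print(article["title"])
    for heading, paragraphs in article["sections"].items():
        print(heading, len(paragraphs))
```

Because the structure is explicit in the tagging, a mining workflow can choose which parts of an article to analyze (for example, only Methods sections) rather than treating the article as one undifferentiated block of text.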

12:45pm - 1:15pm Handling the Basics


Prathik Roy

Product Director, Data Solutions and Strategy
Springer Nature

The path to innovation requires the systematic analysis of millions of documents. Springer Nature has built Open APIs for Text & Data Mining (TDM) purposes to enable researchers to better explore new paths to innovation. By leveraging the power of our APIs and mining our multidisciplinary portfolio of open access content, researchers can open up endless possibilities for developing new knowledge models or creating new APIs that can be developed and monetized.

As the volume of scientific publications increases and TDM software tools improve, we are moving towards a more formalized process to enable TDM, and strive to make this as simple as possible for information managers and researchers.

In this session, Dr. Prathik Roy will explore the power of Springer Nature's Text & Data Mining/Open APIs and how they have been helping researchers uncover new data and solve complex problems.

1:15pm - 1:45pm Text and Data-Mined Content at Europe PMC & the Impact of Licensing


Shyama Saha

Senior Machine Learning/Text-mining Scientist
Literature Service, EMBL-EBI

Europe PMC is a digital repository that indexes life science scholarly publications. It provides intuitive and powerful search tools and links the underlying data to the relevant biological data resources. Europe PMC hosts 40.5 million abstracts and 7.8 million full-text articles, including research articles, preprints, books, protocols, and reviews. Europe PMC uses text-mining techniques, including machine learning, to annotate literature from the Open Access and CC-BY set with relevant biological terms, their relationships, data citations and accession numbers, and more. The text-mined information is publicly available and programmatically accessible through our Annotation API in a standardised machine-readable format for reusability, helping stakeholders including scientists, bioinformaticians and curators access the underlying data and promoting Open Science. In this talk, we will give an overview of our text and data mining activities and their impact.
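
To give a sense of what working with machine-readable annotations can look like, here is a minimal Python sketch that groups annotated terms by type. The record layout and field names ("annotations", "exact", "type"), the article identifier, and the function name group_annotations_by_type are assumptions made for illustration; they are not taken from the Annotation API documentation, which should be consulted for the actual response format.

```python
import json
from collections import defaultdict

# Hypothetical payload: the structure and field names are assumptions
# for illustration, not the documented Annotation API response.
SAMPLE_RESPONSE = json.dumps([
    {
        "articleId": "EXAMPLE-0001",  # placeholder identifier
        "annotations": [
            {"exact": "BRCA1", "type": "Gene_Proteins"},
            {"exact": "breast cancer", "type": "Diseases"},
            {"exact": "TP53", "type": "Gene_Proteins"},
        ],
    }
])


def group_annotations_by_type(payload):
    """Group the annotated text spans in a JSON payload by annotation type."""
    grouped = defaultdict(list)
    for article in json.loads(payload):
        for annotation in article.get("annotations", []):
            # Fall back to "unknown" if a record carries no type.
            kind = annotation.get("type", "unknown")
            grouped[kind].append(annotation.get("exact", ""))
    return dict(grouped)


if __name__ == "__main__":
    for kind, terms in group_annotations_by_type(SAMPLE_RESPONSE).items():
        print(kind, terms)
```

A curator or bioinformatician could adapt a sketch like this to count how often particular genes or diseases are mentioned across a set of articles.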


1:45pm - 2:30pm Comfort Break (45 minutes)

2:30pm - 3:00pm Case Study - Carnegie Mellon


Huajin Wang

Senior Librarian & Co-director, Open Science & Data Collaborations Program
Carnegie Mellon University Libraries

In this talk, Huajin Wang will provide a brief overview of resources at Carnegie Mellon University Libraries that support text and data mining, offer a few use cases that highlight challenges that students and faculty face in their text and data mining projects, and discuss how the Libraries help them navigate these challenges.

3:00pm - 3:30pm Case Study - Improving Library Services and Researcher Outcomes for Text Analysis


How can we improve support for librarians and researchers interested in text analysis? In the past three years, JSTOR Labs has interviewed hundreds of librarians and researchers across the globe about the challenges they face in completing text analysis projects. Constellate, the successor to Data for Research, offers researchers access to textual data from JSTOR, Portico, and other providers. This data access is a significant boon for research, yet our experience has shown that the most significant way to improve research outcomes is to improve text analysis literacy. How are libraries doing this now? And how will that change over the next decade?

Important Resources:

Voyant Tools - a web-based reading and analysis environment for digital texts

Constellate - a platform for teaching, learning, and performing text analysis using the world's leading archival repositories of scholarly and primary source content

AntConc - a freeware, multi-platform, multi-purpose corpus analysis toolkit

Text Analysis Pedagogy Institute (TAPI) - an open educational institute for the benefit of teachers (and aspiring teachers) of text analysis in the digital humanities

Data Feminism by Catherine D'Ignazio and Lauren Klein (MIT Press) - a new way of thinking about data science and data ethics that is informed by the ideas of intersectional feminism

2021 Text Analysis Pedagogy Institute - open course materials for the 2021 Text Analysis Pedagogy Institute which concluded in July 2021

ITHAKA TDM Notebooks - example notebooks and tutorials from Constellate, the text analysis service from ITHAKA

3:30pm - 4:00pm Case Study - HathiTrust


The mission of the HathiTrust Research Center (HTRC) is to provide tools, environments, and services for computational research on the content of the 17-million-volume HathiTrust Digital Library. In this talk, I will provide an overview of the Text Data Mining (TDM) activities and services provided by HTRC, with additional detail on two current initiatives, Scholar Curated Worksets for Analysis, Re-use, and Dissemination (SCWAReD), supported by the Andrew W. Mellon Foundation, and Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), supported by the National Endowment for the Humanities.

Additional Information

NISO assumes organizations register as a group. The model assumes that an unlimited number of staff will be watching the live broadcast in a single location, but also includes access to an archived recording of the event for those who may have timing conflicts. 

NISO understands that, during the current pandemic, staff at a number of organizations may be practicing safe social distancing or working remotely. To accommodate those workers, we are allowing registrants to share the sign-on instructions with all colleagues so that they may join the broadcast directly. 

Registrants receive sign-on instructions via email on the Friday prior to the virtual event. If you have not received your instructions by the day before an event, please contact NISO headquarters for assistance via email (nisohq@niso.org). 

Registrants for an event may cancel participation and receive a refund (less $35.00) if the notice of cancellation is received at NISO HQ (nisohq@niso.org) one full week prior to the event date. If the notice is received fewer than 7 days before the event, no refund will be provided.

Links to the archived recording of the broadcast are distributed to registrants 24-48 hours following the close of the live event. Access to that recording is intended for internal use of fellow staff at the registrant’s organization or institution. Speaker presentations are posted to the NISO event page.

Broadcast Platform

NISO uses the Zoom platform to broadcast our live events. Zoom provides apps for a variety of computing devices (tablets, laptops, etc.). To view the broadcast, you will need a device that supports the Zoom app. Attendees may also choose to listen to the audio only on their phones; sign-on credentials include the necessary dial-in numbers. Once registrants are notified that recordings are available, they may download them from the Zoom platform for local viewing.