AI & Prompt Design: A 2024 NISO Training Series

Scope

This course will introduce students to prompt engineering with large language models (LLMs). It is designed for students with no coding experience and presumes no prior knowledge of machine learning or LLMs. Prompt engineering is the way in which you design and tailor a message to an LLM so that it performs a specific task. Because a basic knowledge of machine learning is helpful for this work, the course will bring students up to speed with all the necessary terminology and concepts.

This course will be hands-on: each session will be divided between instruction and exercises in which students apply the material as they learn it. LLMs are stochastic, meaning they can behave unpredictably and inconsistently due to the randomness inherent in them. This can make it challenging to reproduce results. While this may introduce some inconsistencies for students during the course, it is also an opportunity to learn. We will discuss the issues and challenges that surface in order to better understand how and why they occur and how to address them.
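For readers who are curious about where this randomness comes from, the following is a minimal sketch of temperature sampling, one common way a language model chooses its next word. It is a toy illustration rather than a description of how ChatGPT is implemented: the function name `sample_next_token` and the scores in `toy_scores` are hypothetical values invented for this example.

```python
import math
import random

def sample_next_token(scores, temperature, rng):
    """Pick one token from a dict of token -> raw score using temperature sampling."""
    if temperature <= 0:
        # Treat a temperature of zero as greedy: always return the top-scoring token.
        return max(scores, key=scores.get)
    # Softmax with temperature: a higher temperature flattens the distribution.
    scaled = {tok: score / temperature for tok, score in scores.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical scores for the word that follows "The archive holds many ..."
toy_scores = {"letters": 2.0, "photographs": 1.6, "maps": 1.1, "secrets": 0.3}

rng = random.Random()  # unseeded on purpose, so repeated runs may differ
greedy = [sample_next_token(toy_scores, 0, rng) for _ in range(5)]
sampled = [sample_next_token(toy_scores, 1.0, rng) for _ in range(5)]
print("temperature 0:", greedy)
print("temperature 1:", sampled)
```

With a temperature of zero the same word is chosen every time, while a temperature of one can produce a different sequence on each run. Passing a seeded generator such as `random.Random(7)` makes the toy example repeatable, which is the kind of control that chat interfaces generally do not expose.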

When working with students across different operating systems, hardware, and experience, it is challenging to work with open-source software and machine learning models. For this reason, the course will require students to have a subscription to ChatGPT for the duration of the course (2 months). This will ensure that all students are working with the same resources and eliminate potential challenges. That said, this course will devote a week to introducing students to the world of open-source machine learning so that they may explore it independently after the course concludes.

Because the field of machine learning is advancing rapidly, this course outline may change between its construction (January 2024) and its use (April–May 2024). Nevertheless, the core structure is expected to remain the same.

REMINDER: A paid version of ChatGPT is necessary for students to have consistent experiences. You may use the free version for this course; however, results will vary.

Training Facilitator

William Mattingly is a Postdoctoral Fellow at the Smithsonian Institution Data Science Lab in collaboration with the United States Holocaust Memorial Museum (USHMM). He has a B.A. and M.A. in History from Florida Gulf Coast University and a Ph.D. in History from the University of Kentucky. His dissertation research explored using historical social network analysis, cluster analysis, and computational methods for identifying ninth-century intellectual and pedagogical networks. Most recently, his research has focused on developing text classification neural network models to identify sources in medieval texts and developing natural language processing (NLP) methods for medieval Latin. At the Smithsonian and USHMM, he is developing machine learning methods to aid in, among other things, the cataloging of Holocaust documents. He is co-investigator and developer for the Structured Data Extraction and Enhancement in South Africa’s Truth and Reconciliation Archive project and lead investigator and developer for the Digital Alcuin Project.

Course Duration and Dates

The series consists of eight (8) weekly segments, each lasting 90 minutes. Specific dates are:

April 4, 11, 18, 25
May 2, 9, 16, 23

Each session will be recorded, and a link to the archived recording will be sent to course registrants within 2 business days of the close of that session. We strongly encourage attendees to download these files to ensure continued access.

Event Sessions

Session One: April 4 - Introduction and Machine Learning

In this week, we dive into machine learning: how it works, how it's used, and its limitations. This week provides an overview of the essential concepts and terminology that we will use throughout the next seven weeks.

Recommended Reading and Resources

Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence, by Kate Crawford

Code Dependent: Living in the Shadow of AI, by Madhumita Murgia

Data Feminism, by Catherine D'Ignazio and Lauren F. Klein

How AIs, like ChatGPT, Learn

Session Two: April 11 - Large Language Models and Key Concepts

With a foundational understanding of machine learning, students will move into Week 2. Here, we will learn about large language models, the technology behind popular tools, such as ChatGPT. You will learn the key terminology and concepts relevant to prompt engineering and working with LLMs.

Shared Resource

42: GPT’s answer to Life, the Universe, and Everything by Juan Ignacio Specht

Session Three: April 18 - Beginning Conversations

In Week 3, we will begin learning about prompt engineering and some of its best practices. We will apply the concepts from the first two weeks through in-class exercises. The goal is to learn about the strengths and weaknesses of certain prompt designs.

Shared Resource:

ChatGPT — Release Notes: A changelog of the latest updates for ChatGPT

Session Four: April 25 - Structured Data and Assistants

Over the last year, we have seen numerous advances in LLMs. One is the output of structured data; another is what are known as assistants. In this week, we will learn why structured data is important and, most importantly, how to engineer a prompt to produce consistent structured outputs from an LLM.
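As a rough illustration of what engineering a prompt for structured output can look like, the sketch below assembles a prompt that names the fields it wants and shows the exact JSON shape to return. It is one possible pattern under stated assumptions, not the approach taught in the session: the helper `build_extraction_prompt`, the field names, and the sample sentence are all hypothetical. No model is called; the code only builds the text you would paste into ChatGPT.

```python
import json

def build_extraction_prompt(text, fields):
    """Return a prompt asking for one JSON object with exactly the given fields.

    `fields` maps each key name to a plain-language description of the value.
    """
    # An empty template shows the model the exact keys it is expected to use.
    template = {name: "..." for name in fields}
    return "\n".join([
        "Extract the following fields from the text below.",
        "Field definitions:",
        json.dumps(dict(fields), indent=2),
        "Respond with a single JSON object and nothing else.",
        "Use exactly these keys, and use null when a value is not stated:",
        json.dumps(template),
        "Text:",
        '"""',
        text,
        '"""',
    ])

# Hypothetical fields and input, invented for this example.
fields = {
    "sender": "Name of the person who wrote the letter",
    "recipient": "Name of the person the letter is addressed to",
    "place": "Where the letter was written",
}
prompt = build_extraction_prompt("A short letter goes here.", fields)
print(prompt)
```

Spelling out the keys, the handling of missing values, and the instruction to return nothing but JSON is what tends to make the output easier to parse consistently, although a model may still deviate from the requested format.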

Assistants are LLMs that are designed to perform a specific task by receiving instructions and, in some cases, data before the user first engages with a model. We will learn how to design assistants and deploy them through platforms like ChatGPT.

Shared Resource:

ChatGPT Structured Data Assistant By William J Mattingly: Generates structured data output from text.

Session Five: May 2 - Named Entity Recognition with LLMs

In Week 5, we shift course a little and begin looking at the real-world applications of LLMs and their limitations. We will focus on performing named entity recognition (NER) with these models. The goal of this week is to provide you with a broad understanding of NER, its importance, and how to generate an NER output with an LLM.
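One practical step when using an LLM for NER is checking the entities it returns against the source text. The sketch below is a minimal example of that check, assuming the model has already returned a list of (label, text) pairs: it finds where each entity occurs and sets aside any that do not appear in the text at all. The function `locate_entities`, the labels, and the sample model output are hypothetical.

```python
def locate_entities(text, entities):
    """Attach character offsets to (label, surface form) pairs.

    Returns (found, missing): `found` lists every occurrence with its offsets,
    and `missing` lists pairs whose surface form never appears in the text.
    """
    found, missing = [], []
    for label, surface in entities:
        start = text.find(surface)
        if start == -1:
            # The model named something that is not literally in the text.
            missing.append((label, surface))
            continue
        while start != -1:
            end = start + len(surface)
            found.append({"label": label, "text": surface, "start": start, "end": end})
            start = text.find(surface, end)
    return sorted(found, key=lambda e: e["start"]), missing

text = "Alcuin left York and joined the court of Charlemagne. Alcuin later retired to Tours."
# Hypothetical model output; "Aachen" is included to show an entity absent from the text.
model_entities = [
    ("PERSON", "Alcuin"),
    ("PLACE", "York"),
    ("PERSON", "Charlemagne"),
    ("PLACE", "Aachen"),
]

found, missing = locate_entities(text, model_entities)
for entity in found:
    print(entity)
print("not found in text:", missing)
```

Exact string matching is a deliberate simplification here. It will not catch entities the model has normalized or respelled, so anything in the `missing` list deserves a manual look rather than automatic rejection.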

Shared Resources:

RegExr: RegExr is an online tool to learn, build, & test Regular Expressions (RegEx / RegExp).

Natural Language Toolkit (NLTK) modules: a suite of libraries and programs in Python for Natural Language Processing Tasks. It is one of the most widely used NLP Python libraries. It can perform various NLP tasks like tokenization, stemming, POS tagging, lemmatization and classification to name a few.

UMAP Projection of Proper Nouns in Text: Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

Hugging Face Space, GLiNER: a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and Large Language Models (LLMs) that, despite their flexibility, are costly and large for resource-constrained scenarios.

GitHub Urchade GLiNER: Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 24

JSTOR Text Analyzer: Text Analyzer is a beta tool built by JSTOR Labs. With it, researchers can search for content on JSTOR just by uploading a document.

Session Six: May 9 - Text Classification with LLMs

Building on NER from Week 5, we will switch to another real-world application of LLMs, namely text classification. Text classification allows us to classify a document or portions of a document. LLMs are important in this area of applied machine learning due, in part, to their large context windows, or the amount of data that they can ingest at once. However, this is not without limitations. In this week, we will learn how to classify text with an LLM and how to address some of the key challenges that surface.
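When a document is longer than a model's context window, one common workaround is to split it into overlapping pieces and classify each piece separately. The sketch below shows a minimal version of that idea. It assumes word counts as a stand-in for tokens, which is only an approximation since models count tokens rather than words, and the function name `chunk_words` and the sizes used are hypothetical.

```python
def chunk_words(text, max_words, overlap=0):
    """Split text into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so that a passage cut at a
    boundary still appears whole in at least one chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be at least 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            # This chunk already reaches the end of the document.
            break
    return chunks

document = " ".join(f"word{i}" for i in range(10))
for piece in chunk_words(document, max_words=4, overlap=1):
    print(piece)
```

Each chunk can then be sent to the model with the same classification prompt. How the per-chunk labels are combined into a label for the whole document is a separate design decision that this sketch leaves open.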

Shared Resource:

Enhancing Retrieval-Augmented Generation: Tackling Polysemy, Homonyms and Entity Ambiguity with GLiNER for Improved Performance: The Large Language Models (LLMs) capture people's attention due to their ability to solve many general problems. Still, when it comes to emerging knowledge or domain-specific tasks, the LLMs tend to fail by either hallucinating or failing to give the correct answer. To solve the problem, the Retrieval Augmented Generation (RAG) is one of the many solutions that is preferred among the solutions.

Session Seven: May 16 - Open Source Language Models

Although this course is structured around the use of ChatGPT, it is vital that students learn about the world of open-source machine learning. This is a thriving community centered around Hugging Face, a machine learning platform similar to GitHub that hosts machine learning models and datasets, and makes them freely available to all.

For many, the future of machine learning and LLMs is open-source and for that reason, students should be aware of what is available from the open-source community and how to access it. In this week, we will address both of these things.

Shared Resource:

ANNIF: Tool for automated subject indexing and classification

Session Eight: May 23 - Limitations and Potential Solutions

A constant theme throughout the previous seven weeks will have been the limitations of LLMs. These cannot and should not be ignored. Putting an LLM into production without properly vetting the output has the potential to lead to catastrophic consequences, especially if the data is sensitive in nature, as is the case with many archives around the world. In this week, we will dive more deeply into these limitations and learn about some of the potential solutions to them.

Shared Resources:

Hugging Face Models

OpenAI Flagship Models: The OpenAI API is powered by a diverse set of models with different capabilities and price points. You can also make customizations to our models for your specific use case with fine-tuning.

OpenAI Apps

ChatGPT Libra Classifier, by William Mattingly: Libra Classifier is specifically engineered to process texts and output multiple Library of Congress subject headings in JSON format. For each text provided, it will analyze the content comprehensively and generate a JSON array, containing multiple dictionaries. Each dictionary will represent a unique subject heading, complete with a 'heading' field for the subject title, a 'justification' field explaining the rationale for the heading selection, and a 'class' field indicating the classification code. This structured approach ensures detailed and thorough classification, adhering to the Library of Congress standards. The Classifier will focus solely on generating these JSON outputs, refraining from engaging in discussions or offering explanations beyond the classification data. It will also address requests for clarification on ambiguous texts for more accurate classification.
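Because the output of a tool like this should be vetted before it is used, the following is a minimal sketch of a format check for the JSON shape described above: an array of dictionaries, each with 'heading', 'justification', and 'class' fields. It only checks structure and does not confirm that a heading or class code is a real Library of Congress value. The function `validate_headings` and the sample response, which uses placeholder values, are hypothetical and are not part of the Libra Classifier itself.

```python
import json

REQUIRED_FIELDS = ("heading", "justification", "class")

def validate_headings(raw):
    """Parse a response string and return (records, problems).

    `records` holds the items that have every required field as a non-empty
    string; `problems` describes everything that was rejected.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return [], ["top-level value is not a JSON array"]
    records, problems = [], []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        absent = [
            field for field in REQUIRED_FIELDS
            if not isinstance(item.get(field), str) or not item[field].strip()
        ]
        if absent:
            problems.append(f"item {i} is missing or has empty: {', '.join(absent)}")
        else:
            records.append(item)
    return records, problems

# Hypothetical response with placeholder values; the second item is incomplete.
sample = """[
  {"heading": "Example heading", "justification": "Placeholder rationale", "class": "XX"},
  {"heading": "Second heading", "justification": ""}
]"""

records, problems = validate_headings(sample)
print("accepted:", records)
print("problems:", problems)
```

A check like this catches malformed or incomplete output early. Verifying that the headings themselves are appropriate still requires a person or an authoritative source.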

LangChain: a framework for developing applications powered by large language models (LLMs).

Additional Information

Registrants receive sign-on instructions via email three business days prior to the scheduled session. If you have not received your instructions by the day before an event, please contact NISO headquarters for assistance via email (nisohq@niso.org).

Registrants for an event may cancel participation and receive a refund (less $30.00) if the notice of cancellation is received at NISO HQ (nisohq@niso.org) one full week prior to the event date. If received less than 7 days before, no refund will be provided.

Broadcast Platform

NISO uses the Zoom platform to broadcast our live events. Zoom provides apps for a variety of computing devices (tablets, laptops, etc.). To view the broadcast, you will need a device that supports the Zoom app. Attendees may also choose to listen to the audio only on their phones; sign-on credentials include the necessary dial-in numbers if that is your preference. Once you are notified that recordings are available, they may be downloaded from the Zoom platform to your machine for local viewing.

Event Dates

April 04, 2024 11:00am – May 23, 2024 12:30pm

Fees

Members:

Early bird registration: Register by 11:59 pm EDT March 21 and pay a discounted rate of USD $750.00.
Register on or after March 22 and pay USD $850.00.

Non-Members:

Early bird registration: Register by 11:59 pm EDT March 21 and pay a discounted rate of USD $825.00.
Register on or after March 22 and pay USD $925.00.

Group Rates:

Tier One
- 3-5 individuals - 17% discount
Tier Two
- 6-9 individuals - 25% discount
Tier Three
- 10+ individuals - 30% discount

Please note that it is not possible to register for individual program segments or lectures.

Additionally, please register using an institutional/work email.

Location

Educational events are online programs.

Registrants receive sign-on instructions prior to the virtual event. If you have not received your instructions by the day before an event, please contact NISO headquarters for assistance via email (nisohq@niso.org).

This is an 8-week series, with each weekly segment having a duration of 90 minutes. It is a virtual event. NISO uses the Zoom platform to deliver our virtual events. Please check your system in advance to make sure it meets Zoom (US) requirements.

IMPORTANT: All times are in Eastern Time (US & Canada).