Common Corpus: Multilingual Data Set for Training LLMs

March 2024

Information Industry News

Releasing Common Corpus https://t.co/QTPjfKaNJK
— Jill ONeill (@jillmwo) March 21, 2024

Additional Details

March 20, 2024

Common Corpus is the largest public domain dataset released for training LLMs.
Common Corpus includes 500 billion words from a wide diversity of cultural heritage initiatives.
Common Corpus is multilingual and the largest corpus to date in English, French, Dutch, Spanish, German and Italian.
Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.

Common Corpus is an international initiative coordinated by Pleias, involving researchers in LLM pretraining, AI ethics and cultural heritage, in association with major organizations committed to an open science approach for AI (HuggingFace, Occiglot, Eleuther, Nomic AI). Common Corpus has received the support of Lang:IA, a state start-up supported by the French Ministry of Culture and the Direction du numérique (Agent Public). Pleias is a French start-up specializing in the training of Large Language Models for document processing on fully open and auditable corpora.

Full Text of the Announcement
