Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here


RedPajama has unveiled the latest version of its dataset, RedPajama-Data-v2, a colossal repository of web data aimed at advancing language model training. The dataset encompasses a staggering 30 trillion tokens, filtered and deduplicated from a raw pool of more than 100 trillion tokens drawn from 84 CommonCrawl data dumps across five languages: English, French, Spanish, German, and Italian.

The dataset and its processing tooling are available in the togethercomputer/RedPajama-Data repository on GitHub.
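For readers who want to explore the corpus directly, the data is also distributed through Hugging Face. The snippet below is a minimal sketch, assuming the dataset is published as togethercomputer/RedPajama-Data-V2 and that the loader accepts partition, snapshots, and languages arguments and exposes a raw_content text field, as described on the dataset card; exact configuration and field names should be checked there.

```python
from datasets import load_dataset

# Stream a small English slice of a single CommonCrawl snapshot instead of
# downloading the full 30-trillion-token corpus.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",   # assumed Hugging Face dataset ID
    name="default",
    partition="head_middle",                # assumed: higher-quality perplexity buckets
    snapshots=["2023-06"],                  # assumed: one of the 84 CommonCrawl dumps
    languages=["en"],
    streaming=True,
)

# Print the beginning of the first few documents' plain-text content.
for i, doc in enumerate(ds["train"]):
    print(doc["raw_content"][:200])         # field name assumed from the CCNet output
    if i == 2:
        break
```

Streaming keeps the example runnable on a laptop; a real training run would instead shard the snapshots across workers.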

RedPajama-Data-v2 also ships with more than 40 pre-computed data quality annotations per document, giving practitioners ready-made signals for further filtering and weighting.

The dataset covers 5 languages, with 40+ pre-computed data quality annotations that can be used for further filtering and weighting. Here is one example of how to filter RedPajama-Data-v2 in a similar way as Gopher:

— Together AI (@togethercompute) October 30, 2023
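The filtering example referenced in the tweet is an image and is not reproduced here, but a Gopher-style rule filter over the per-document quality signals might look like the sketch below. The signal names (rps_doc_word_count and so on) and their layout as plain scalars are assumptions for illustration, and the thresholds are the commonly cited Gopher heuristics rather than values taken from the announcement.

```python
def passes_gopher_rules(signals: dict) -> bool:
    """Apply a few Gopher-style heuristics to one document's pre-computed
    quality signals (signal names and scalar layout assumed for illustration)."""
    word_count = signals["rps_doc_word_count"]
    mean_word_length = signals["rps_doc_mean_word_length"]
    symbol_to_word = signals["rps_doc_symbol_to_word_ratio"]
    ellipsis_lines = signals["rps_doc_frac_lines_end_with_ellipsis"]

    return (
        50 <= word_count <= 100_000          # drop very short or very long pages
        and 3 <= mean_word_length <= 10      # plausible natural-language words
        and symbol_to_word <= 0.1            # few '#' / '...' style symbols
        and ellipsis_lines <= 0.3            # not mostly truncated snippet lines
    )


# Toy documents standing in for real corpus records.
corpus = [
    {"quality_signals": {"rps_doc_word_count": 812,
                         "rps_doc_mean_word_length": 4.6,
                         "rps_doc_symbol_to_word_ratio": 0.02,
                         "rps_doc_frac_lines_end_with_ellipsis": 0.0}},
    {"quality_signals": {"rps_doc_word_count": 12,
                         "rps_doc_mean_word_length": 3.1,
                         "rps_doc_symbol_to_word_ratio": 0.4,
                         "rps_doc_frac_lines_end_with_ellipsis": 0.5}},
]
filtered = [doc for doc in corpus if passes_gopher_rules(doc["quality_signals"])]
print(len(filtered))  # -> 1: only the first toy document survives the rules
```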

Over the past six months, the impact of RedPajama’s previous release, RedPajama-1T, has been profound in the language model community. This 5TB dataset of high-quality English tokens has been downloaded by more than 190,000 individuals, who have harnessed its potential in creative ways. 

RedPajama-1T served as a stepping stone towards the goal of creating open datasets for language model training, but RedPajama-Data-v2 takes this ambition to new heights with its mammoth 30 trillion token web dataset.

RedPajama-Data-v2 stands out as the largest public dataset specifically crafted for LLM training, significantly contributing to the field. Most notably, it introduces 40+ pre-computed quality annotations, empowering the community to enhance the dataset’s utility. This release encompasses over 100 billion text documents derived from 84 CommonCrawl data dumps, constituting a total of 100+ trillion raw tokens.

Together AI says the dataset offers a solid foundation for advancing state-of-the-art open LLMs such as Llama, Mistral, Falcon, MPT, and the RedPajama models.

RedPajama-Data-v2 focuses primarily on CommonCrawl data, while sources such as Wikipedia remain available in RedPajama-Data-v1. To further enrich the dataset, users are encouraged to integrate The Stack (by BigCode) for code and s2orc (by AI2) for scientific articles. RedPajama-Data-v2 itself is built from publicly available web data and comprises three core elements: plain-text source data, 40+ quality annotations, and deduplication clusters.
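As a rough sketch of such a mixture, the snippet below interleaves the web corpus with code from The Stack using the Hugging Face datasets library. The dataset identifiers, column names, and mixing weights are assumptions for illustration (both corpora may require accepting access terms on Hugging Face), and s2orc is omitted here because its distribution is more restricted.

```python
from datasets import load_dataset, interleave_datasets

# Stream both sources; identifiers, arguments, and column names are assumed.
web = load_dataset(
    "togethercomputer/RedPajama-Data-V2", name="default",
    snapshots=["2023-06"], languages=["en"],
    partition="head_middle", streaming=True,
)["train"]
code = load_dataset("bigcode/the-stack", data_dir="data/python",
                    streaming=True)["train"]

# Normalise both streams to a single "text" column so they can be mixed.
web = web.rename_column("raw_content", "text").select_columns(["text"])
code = code.rename_column("content", "text").select_columns(["text"])

# Weighted mixture: mostly web text with a slice of code.
mix = interleave_datasets([web, code], probabilities=[0.85, 0.15], seed=42)
```

In practice the weights would be tuned to the token budget intended for each source.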

To create the source data, each CommonCrawl snapshot is passed through the CCNet pipeline, chosen for its light processing that keeps as much of the raw data intact as possible. This step yields the roughly 100 billion individual text documents in the release, in keeping with the overarching principle of preserving the raw data.

The post Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here appeared first on Analytics India Magazine.

Disclaimer

We strive to uphold the highest ethical standards in all of our reporting and coverage. We at StartupNews.fyi want to be transparent with our readers about any potential conflicts of interest that may arise in our work. It’s possible that some of the investors we feature may have connections to other businesses, including competitors or companies we write about. However, we want to assure our readers that this will not have any impact on the integrity or impartiality of our reporting. We are committed to delivering accurate, unbiased news and information to our audience, and we will continue to uphold our ethics and principles in all of our work. Thank you for your trust and support.

