Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here

RedPajama has unveiled the latest version of its dataset, RedPajama-Data-v2, a vast repository of web data aimed at advancing language model training. The dataset contains 30 trillion tokens, filtered and deduplicated from a raw pool of more than 100 trillion tokens sourced from 84 CommonCrawl data dumps, and covers five languages: English, French, Spanish, German, and Italian.

The dataset and processing code are available in the project’s GitHub repository.

RedPajama-Data-v2 also ships with 40+ pre-computed data quality annotations that serve as tools for further data filtering and weighting.

In a post announcing the release on October 30, 2023, Together AI (@togethercompute) wrote: “The dataset covers 5 languages, with 40+ pre-computed data quality annotations that can be used for further filtering and weighting. Here is one example of how to filter RedPajama-Data-v2 in a similar way as Gopher.”
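
To give a concrete sense of what such Gopher-style filtering can look like, here is a minimal Python sketch that keeps only documents whose pre-computed quality signals fall within a few of Gopher’s heuristics (word count, mean word length, symbol-to-word ratio). The signal names such as rps_doc_word_count, the span-style layout of the values, and the quality_signals.jsonl file name are assumptions based on the dataset’s documentation, and the thresholds are illustrative rather than official.

import json

def signal_value(quality_signals, key):
    # Document-level signals are stored as [start, end, value] spans (assumed layout);
    # the first span carries the value for whole-document metrics.
    return quality_signals[key][0][2]

def passes_gopher_rules(quality_signals):
    # Illustrative thresholds loosely following Gopher's filtering rules.
    word_count = signal_value(quality_signals, "rps_doc_word_count")
    mean_word_length = signal_value(quality_signals, "rps_doc_mean_word_length")
    symbol_ratio = signal_value(quality_signals, "rps_doc_symbol_to_word_ratio")
    return (
        50 <= word_count <= 100_000
        and 3 <= mean_word_length <= 10
        and symbol_ratio <= 0.1
    )

kept = []
with open("quality_signals.jsonl") as f:  # one JSON record per document (hypothetical file)
    for line in f:
        record = json.loads(line)
        if passes_gopher_rules(record["quality_signals"]):
            kept.append(record)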

Over the past six months, the impact of RedPajama’s previous release, RedPajama-1T, has been profound in the language model community. This 5TB dataset of high-quality English tokens has been downloaded by more than 190,000 individuals, who have harnessed its potential in creative ways. 

RedPajama-1T served as a stepping stone towards the goal of creating open datasets for language model training, but RedPajama-Data-v2 takes this ambition to new heights with its mammoth 30 trillion token web dataset.

RedPajama-Data-v2 stands out as the largest public dataset built specifically for LLM training. Most notably, it introduces 40+ pre-computed quality annotations that let the community filter and reweight the data for their own needs. The release comprises over 100 billion text documents derived from 84 CommonCrawl data dumps, amounting to more than 100 trillion raw tokens.

Together AI says the dataset offers a solid foundation for advancing state-of-the-art open LLMs such as Llama, Mistral, Falcon, MPT, and the RedPajama models.

RedPajama-Data-v2 focuses on CommonCrawl data, while other sources such as Wikipedia are available in RedPajama-Data-v1. To further enrich the dataset, users are encouraged to integrate The Stack (by BigCode) for code and S2ORC (by AI2) for scientific articles. RedPajama-Data-v2 is built from publicly available web data and consists of three core components: plain-text source data, 40+ quality annotations, and deduplication clusters.
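
For readers who want to explore these components, the sketch below loads a small sample of the dataset from the Hugging Face Hub and prints one record’s plain text alongside its quality signals. The repository id togethercomputer/RedPajama-Data-V2, the “sample” configuration name, and the raw_content / quality_signals field names are assumptions about how the release is published and may need adjusting.

from datasets import load_dataset  # pip install datasets

# Hypothetical identifiers; adjust if the published dataset uses different names.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    split="train",
    trust_remote_code=True,
)

record = ds[0]
print(record["raw_content"][:500])   # plain-text source data
print(record["quality_signals"])     # pre-computed quality annotations (may be serialized JSON)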

To create the source data, each CommonCrawl snapshot is passed through the CCNet pipeline, chosen for its light-touch processing that preserves as much of the raw data as possible. This step yields the roughly 100 billion individual text documents in the release.

The post Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here appeared first on Analytics India Magazine.

Disclaimer

We strive to uphold the highest ethical standards in all of our reporting and coverage. We at StartupNews.fyi want to be transparent with our readers about any potential conflicts of interest that may arise in our work. It’s possible that some of the investors we feature may have connections to other businesses, including competitors or companies we write about. However, we want to assure our readers that this will not have any impact on the integrity or impartiality of our reporting. We are committed to delivering accurate, unbiased news and information to our audience, and we will continue to uphold our ethics and principles in all of our work. Thank you for your trust and support.
