Copied


NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

Iris Coleman   Jan 11, 2025 02:13 2 Min Read


NVIDIA has announced the release of Nemotron-CC, a groundbreaking 6.3-trillion-token English language dataset designed to advance the pretraining of large language models (LLMs). This dataset, derived from Common Crawl, aims to elevate the accuracy and efficiency of LLMs through innovative data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM Pretraining

NVIDIA's initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models like Meta's Llama series have been based on datasets comprising up to 15 trillion tokens, the exact composition of these datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.

Traditional datasets often sacrifice up to 90% of data to improve benchmark accuracies, limiting their utility for extensive training. Nemotron-CC, however, demonstrates how to transform Common Crawl data into a superior dataset, surpassing even the Llama 3.1 8B model through advanced methods such as classifier ensembling and synthetic data rephrasing.

Significant Results

Nemotron-CC's efficacy is evidenced by its performance in various benchmarks. When training 8B parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B in multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Additionally, rephrasing techniques reduced noise and errors, yielding diverse and valuable data variants. The decision to disable traditional heuristic filters further boosted the dataset's quality without compromising accuracy.

NVIDIA utilized its NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, contributing approximately two trillion tokens to the dataset.

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.


Read More
x.ai introduces Grok 3 Beta, a sophisticated reasoning agent model, showcasing advanced capabilities through extensive pretraining. Discover how this innovative AI technology is transforming the landscape.
Bitcoin (BTC) has held the top spot in the cryptocurrency world since its creation in 2009. It remains the largest and most recognized digital asset by market capitalization.
Institutional interest in crypto surges; regulatory clarity and tokenization reshape the landscape.
AI and blockchain converge, enabling decentralized data ownership and real-time integration for better predictions.
Crypto for Everyone: Crypto must focus on real-world utility and user experience to gain mainstream acceptance and rebuild trust.
Online casinos have experienced rapid growth during the last decade as they have had to overcome security issues all while working to establish transparency.
Blockchain technology transformed digital transactions, with crypto apps playing a crucial role in this transformation.
Grayscale is expanding its ETF lineup with two new Bitcoin income funds designed to generate monthly payouts.