NVIDIA NeMo Curator Enhances Non-English Dataset Preparation for LLM Training

Timothy Morano Jul 12, 2024 12:53 3 Min Read

Data curation is critical for developing effective and fair large language models (LLMs). High-quality, diverse training data directly improves LLM performance, and curation is how issues such as bias, inconsistencies, and redundancy are addressed. NVIDIA has recently announced the open-source release of NVIDIA NeMo Curator, a data curation library designed to enhance LLM training accuracy through scalable and efficient dataset preparation.

Importance of Data Curation

When training localized multilingual LLMs, particularly for low-resource languages, web-crawled data such as OSCAR is vital. However, this data often contains noise, irrelevant content, duplicates, and formatting issues. Effective data curation is essential to mitigate these problems and ensure high-quality LLM performance. NeMo Curator offers a customizable and modular interface that simplifies pipeline expansion and accelerates model convergence by preparing high-quality tokens.

NeMo Curator Overview

The NeMo Curator leverages GPU-accelerated data curation using Dask and RAPIDS, enabling users to mine high-quality text at scale from massive uncurated web corpora as well as custom datasets. For instance, a data curation pipeline can be constructed using the Thai Wikipedia dataset, a smaller subset of the Wikipedia dataset, which can be processed on a single GPU. Wikipedia is considered high-quality for LLM pretraining due to its accurate, well-structured content. NeMo Curator enhances this by detecting and filtering low-quality documents, ensuring only the best data is used for training.

Data Curation Pipeline Example

Using the Thai Wikipedia as an example, the data curation pipeline involves several steps:

Download and extract the dataset to a JSONL file.
Perform preliminary data cleaning, including language separation and Unicode text fixes.
Apply advanced cleaning, such as GPU-accelerated exact and fuzzy deduplication and heuristic filtering.

For the complete code sample for this tutorial, see the NVIDIA NeMo Curator GitHub repo.
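
As a rough illustration of the preliminary cleaning step, the sketch below uses only the Python standard library. It is not the NeMo Curator API: the function names, the "text" field name, and the 0.5 threshold are assumptions made for this example, and a simple Thai script-range check stands in for a trained language identifier.

```python
import json
import unicodedata

# Thai script occupies the Unicode block U+0E00..U+0E7F.
THAI_START, THAI_END = 0x0E00, 0x0E7F


def fix_unicode(text: str) -> str:
    """Normalize to NFC and drop control characters other than newline/tab."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def thai_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Thai block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    thai = sum(THAI_START <= ord(ch) <= THAI_END for ch in letters)
    return thai / len(letters)


def clean_records(jsonl_lines, min_thai_ratio=0.5):
    """Yield records (hypothetical "text" field) that are predominantly Thai."""
    for line in jsonl_lines:
        record = json.loads(line)
        record["text"] = fix_unicode(record["text"])
        if thai_ratio(record["text"]) >= min_thai_ratio:
            yield record


if __name__ == "__main__":
    sample = [
        json.dumps({"id": 1, "text": "สวัสดี"}),
        json.dumps({"id": 2, "text": "hello world"}),
    ]
    for rec in clean_records(sample):
        print(rec["id"], rec["text"])
```

In practice the input lines would come from the extracted JSONL file, and the kept records would be written back out one JSON object per line.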

Prerequisites and Setup

To use GPU-accelerated deduplication, the following hardware and software setup is recommended:

NVIDIA GPU: This tutorial uses the NVIDIA A10 24GB GPU.
CUDA and NVIDIA Drivers: CUDA 12.2 with Driver 535.154.05.
Ubuntu 22.04.
NVIDIA Container Toolkit version 1.14.6.

To install the NeMo Curator library, run the following commands:

git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

Advanced Data Cleaning

Advanced data curation techniques such as deduplication and heuristic filtering are applied to improve data quality. For example, the ExactDuplicates class removes identical documents using GPU-accelerated implementations from the RAPIDS cuDF library. Similarly, the FuzzyDuplicates class removes near-identical documents using the MinhashLSH algorithm, which is computationally efficient because it groups documents by compact hashed signatures rather than comparing every pair of documents directly.
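
To make the two ideas concrete, here is a minimal CPU-only sketch in standard-library Python. It is not how ExactDuplicates or FuzzyDuplicates are implemented (those run on GPUs via RAPIDS); the function names, the MD5 and BLAKE2 hash choices, the character 3-gram shingles, and the band and threshold values are all assumptions chosen for illustration.

```python
import hashlib
import re


def exact_dedup(docs):
    """Keep the first document for each distinct hash of the full text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Set of character n-grams; runs of whitespace are collapsed first."""
    text = re.sub(r"\s+", " ", text.strip())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=16):
    """Index pairs whose signatures agree on at least one whole band."""
    if not signatures:
        return set()
    rows = len(signatures[0]) // bands
    pairs = set()
    for b in range(bands):
        buckets = {}
        for idx, sig in enumerate(signatures):
            key = tuple(sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(idx)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    pairs.add((members[i], members[j]))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Share of signature positions that match, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs, threshold=0.8):
    """Drop the later document of each candidate pair above the threshold."""
    sigs = [minhash_signature(d) for d in docs]
    drop = set()
    for i, j in sorted(lsh_candidate_pairs(sigs)):
        if i not in drop and estimated_jaccard(sigs[i], sigs[j]) >= threshold:
            drop.add(j)
    return [d for k, d in enumerate(docs) if k not in drop]
```

Only documents that land in the same bucket for some band are compared, which is what keeps the approach cheaper than an all-pairs comparison. Exact deduplication would treat "the cat sat" and "the  cat   sat" as different documents, while the fuzzy pass above treats them as duplicates.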

Heuristic Filtering

Heuristic filtering helps remove low-quality content from the dataset using simple, efficient-to-compute rules. At the time of publication, NeMo Curator provides 24 heuristics for natural languages and eight for programming languages. The filters to apply, along with their parameters, can be defined in a YAML config file.
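
The sketch below shows the general shape of config-driven heuristic filtering in plain Python. The three rules, their names, and their thresholds are hypothetical examples, not NeMo Curator's built-in heuristics, and a Python list stands in for what a parsed YAML file might contain.

```python
import unicodedata

# Hypothetical config, shaped like the result of parsing a YAML file.
FILTER_CONFIG = [
    {"name": "min_length", "params": {"min_chars": 20}},
    {"name": "max_symbol_ratio", "params": {"max_ratio": 0.3}},
    {"name": "max_repeated_line_ratio", "params": {"max_ratio": 0.5}},
]


def min_length(text, min_chars):
    """Keep documents with at least min_chars non-padding characters."""
    return len(text.strip()) >= min_chars


def max_symbol_ratio(text, max_ratio):
    """Reject text dominated by punctuation (P*) and symbol (S*) characters."""
    if not text:
        return False
    symbols = sum(unicodedata.category(ch)[0] in "PS" for ch in text)
    return symbols / len(text) <= max_ratio


def max_repeated_line_ratio(text, max_ratio):
    """Reject text where too many non-empty lines repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    repeated = len(lines) - len(set(lines))
    return repeated / len(lines) <= max_ratio


# Map config names to the functions that implement them.
FILTERS = {
    f.__name__: f
    for f in (min_length, max_symbol_ratio, max_repeated_line_ratio)
}


def passes_all(text, config=FILTER_CONFIG):
    """True when the text satisfies every filter listed in the config."""
    return all(FILTERS[entry["name"]](text, **entry["params"]) for entry in config)


def heuristic_filter(docs, config=FILTER_CONFIG):
    """Return only the documents that pass every configured filter."""
    return [doc for doc in docs if passes_all(doc, config)]
```

Keeping the rules in a config rather than in code means thresholds can be tuned per language without touching the pipeline. The symbol check relies on Unicode categories so that Thai combining vowel marks are not miscounted as symbols.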

Next Steps

The tutorial demonstrated how to construct a sample data curation pipeline for Thai Wikipedia data. For more information and examples, see the collection of data curation examples on GitHub. Enterprises can also request access to the NVIDIA NeMo Curator microservice, which provides streamlined performance and scalability.
