NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training


Joerg Hiller

May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, integrated with NeMo Curator. This innovative pipeline optimizes data quality and quantity for superior AI model training.

NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, offering a groundbreaking approach to curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset draws on a 6.3-trillion-token English-language collection from Common Crawl and, according to NVIDIA, aims to significantly enhance the accuracy of LLMs.

Advancements in Data Curation

The Nemotron-CC pipeline addresses the limitations of traditional data curation methods, which often discard potentially useful data due to heuristic filtering. By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of content lost to filtering.
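
To make the rephrasing idea concrete, here is a minimal sketch that rewrites a low-quality document with an instruction-tuned LLM. It assumes an OpenAI-compatible endpoint; the model name and prompt are placeholders, not the prompts used by the actual Nemotron-CC pipeline.

```python
# Minimal sketch of synthetic data rephrasing: rewrite a low-quality web
# document with an instruction-tuned LLM. Assumes an OpenAI-compatible
# endpoint; the model name and prompt are placeholders, not the prompts
# used by the Nemotron-CC pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REPHRASE_PROMPT = (
    "Rewrite the following web text in clear, well-structured English, "
    "preserving every fact and figure:\n\n{doc}"
)

def rephrase(doc: str) -> str:
    """Return a cleaned-up rewrite of a single document."""
    response = client.chat.completions.create(
        model="placeholder-model",  # any instruction-tuned LLM
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(doc=doc)}],
    )
    return response.choices[0].message.content
```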

Innovative Pipeline Features

The pipeline's data curation process begins with HTML-to-text extraction using tools such as jusText, with fastText handling language identification. It then applies deduplication to remove redundant data, using NVIDIA RAPIDS libraries for efficient GPU-accelerated processing. The process includes 28 heuristic filters to ensure data quality and a PerplexityFilter module for further refinement.
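
A minimal sketch of the first two stages, using the jusText and fastText libraries directly; the model path and confidence threshold below are illustrative assumptions, not the pipeline's actual settings.

```python
# Sketch of the first pipeline stages: HTML-to-text extraction with jusText,
# then language identification with a pretrained fastText model. The model
# path and confidence threshold are illustrative assumptions.
import fasttext
import justext

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model

def extract_text(html: bytes) -> str:
    """Strip boilerplate from raw HTML, keeping only main-content paragraphs."""
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

def is_english(text: str, threshold: float = 0.9) -> bool:
    """Keep a document only if fastText confidently labels it English."""
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold
```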

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.
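
As an illustration of the ensembling step, here is a small sketch that averages scores from several quality classifiers and maps each document to a bucket; the scoring functions and bucket boundaries are assumptions, not the classifiers Nemotron-CC actually uses.

```python
# Illustrative classifier ensembling: average per-classifier quality scores
# in [0, 1] and map each document to a quality bucket. The scoring functions
# and bucket boundaries are assumptions, not Nemotron-CC's actual classifiers.
from statistics import mean
from typing import Callable

def quality_bucket(
    doc: str,
    classifiers: list[Callable[[str], float]],
    edges: tuple[float, float] = (0.33, 0.66),
) -> str:
    """Ensemble several quality classifiers into one bucket label."""
    score = mean(clf(doc) for clf in classifiers)
    if score < edges[0]:
        return "low"     # candidates for synthetic rephrasing
    if score < edges[1]:
        return "medium"
    return "high"        # source material for QA pairs and knowledge lists
```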

Impact on LLM Training

Training LLMs with the Nemotron-CC dataset yields significant improvements. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC achieved a 5.6-point increase in MMLU score compared with models trained on traditional datasets. Furthermore, models trained on long-horizon token budgets that included Nemotron-CC saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available for developers aiming to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, enabling users to optimize the pipeline for specific needs. The integration into NeMo Curator allows for seamless development of both pretraining and fine-tuning datasets.
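
As a starting point, a minimal NeMo Curator sketch that reads JSONL documents and applies a single heuristic filter; the paths and word-count threshold are placeholders, and the full Nemotron-CC configuration is covered in NVIDIA's tutorial.

```python
# Minimal NeMo Curator sketch: read JSONL documents and apply one heuristic
# filter. Paths and the word-count threshold are placeholders; see NVIDIA's
# tutorial for the full Nemotron-CC pipeline configuration.
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter

dataset = DocumentDataset.read_json("input_data/")  # directory of .jsonl files
filter_step = ScoreFilter(
    WordCountFilter(min_words=80),  # threshold chosen for illustration
    text_field="text",
)
filtered = filter_step(dataset)
filtered.to_json("output_data/")
```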

For more information, visit the NVIDIA blog.

Image source: Shutterstock
