Natural Language Processing
Tokenization
Natural Language Processing· Intermediate
Definition
The process of breaking text into smaller units called tokens — words, subwords, or characters — that a model processes numerically. Modern LLMs use subword tokenization (BPE, SentencePiece). A token is roughly 0.75 words on average.
Maxx Stacks Context
Example: 'enterprise AI' → ['enterprise', 'AI'] (word) or ['enter', '##prise', 'AI'] (subword).
Tags
#preprocessing#tokens#BPE#vocabulary
MS
Maxx Stacks Editorial
Reviewed by enterprise AI practitioners
Maxx University
Keep learning. Keep building.
250+ terms. 5 learning paths. AI maturity assessment. Jargon translator. All free, always.