
What is a token

A token is the text unit a model processes. It shapes input splitting, context limits, latency, and usually API cost.

What is it

A token is the unit of text a language model processes. It is not necessarily a full word: it can be a whole word, part of a word, punctuation, or even whitespace, depending on the tokenizer the model uses.

This matters because models do not read text "as written". Before inference or generation, they convert text into tokens. Everything you do with an LLM goes through that layer: the input prompt, the output, the context limit, and usually the API bill.

If you understand tokens, you understand why a short-looking text can cost more than expected, why two models can count the same input differently, and why prompt optimization is not just about wording.
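One way to see why appearance is misleading is to compare token counts for strings of similar visible length. A minimal sketch using tiktoken's cl100k_base encoding (the sample strings and exact counts are illustrative; other encodings will split them differently):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Strings of similar visible length can tokenize very differently:
# common English words compress into few tokens, while rare words,
# other scripts, or emoji usually split into many more.
samples = [
    "the cat sat on the mat",
    "electroencephalography",
    "🚀🚀🚀 naïve résumé",
]

for text in samples:
    print(f"{len(encoding.encode(text)):3d} tokens | {len(text):3d} chars | {text}")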

Mental model

Think of a token as an intermediate piece between human text and the numbers the model actually processes. You write a sentence. The tokenizer breaks it into pieces. The model then operates on those pieces, not on natural-language words.

For example, the phrase "tokenization in English" might be split into fragments such as "token", "ization", "in", "English". It will not always be exactly that, because different model families use different vocabularies and rules, but the core idea stays the same: the model works on frequent text fragments rather than grammar-aware word units.

flowchart LR
    A[Original text] --> B[Tokenizer]
    B --> C[Token sequence]
    C --> D[Numeric IDs]
    D --> E[Model]
    E --> F[Next token prediction]
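To make that pipeline concrete, here is a small sketch that decodes each ID back into the fragment it stands for, using tiktoken's cl100k_base encoding as an example (other models use other vocabularies, so the exact split will differ):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "tokenization in English"

# encode() maps the text to numeric IDs; decoding each ID on its own
# reveals the fragment of text it represents.
ids = encoding.encode(text)
fragments = [encoding.decode([i]) for i in ids]

print(ids)        # a short list of integers
print(fragments)  # subword pieces such as 'token' / 'ization', not whole words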

Two practical consequences follow from this:

  • More tokens usually mean more cost and more latency.
  • The context window is measured in tokens, not words or characters.

How it's used

In practice, you deal with tokens whenever you prepare prompts, estimate costs, or check whether a conversation fits inside a model's context window.

One common way to inspect them is with a tokenization library:

import tiktoken

# cl100k_base is one of the encodings bundled with tiktoken;
# use the encoding that matches the model you will actually call.
encoding = tiktoken.get_encoding("cl100k_base")
text = "Tokenization shapes cost, context, and latency."

# encode() converts the text into a list of numeric token IDs.
tokens = encoding.encode(text)

print(f"Total tokens: {len(tokens)}")
print(tokens)

That token count helps answer concrete questions:

  • Will my prompt fit together with the expected response?
  • How much will it cost to process this document?
  • Should I summarize, trim, or chunk the input?
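Once you have a count, those checks are easy to turn into code. A rough sketch, assuming a hypothetical 8,000-token context window and a hypothetical price of $0.0005 per 1,000 input tokens; replace both with the real numbers for the model and provider you use:

import tiktoken

CONTEXT_WINDOW = 8_000            # assumed limit; check your model's documentation
PRICE_PER_1K_INPUT = 0.0005       # assumed price; check your provider's pricing
EXPECTED_RESPONSE_TOKENS = 500    # budget reserved for the model's answer

encoding = tiktoken.get_encoding("cl100k_base")
prompt = "..."  # your actual prompt or document goes here

prompt_tokens = len(encoding.encode(prompt))
fits = prompt_tokens + EXPECTED_RESPONSE_TOKENS <= CONTEXT_WINDOW
estimated_cost = prompt_tokens / 1000 * PRICE_PER_1K_INPUT

print(f"Prompt tokens: {prompt_tokens}")
print(f"Fits with response budget: {fits}")
print(f"Estimated input cost: ${estimated_cost:.6f}")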

One important caveat: the same text can produce different counts across models. Tokenization is not universal. If you are measuring real limits or cost, use the tokenizer that matches the model you will actually call.
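You can see this at small scale without calling two different models, because tiktoken ships several encodings. A quick sketch comparing two of them on the same sentence (the point is that the counts can differ, not the specific numbers):

import tiktoken

text = "Tokenization shapes cost, context, and latency."

# p50k_base is an older encoding, cl100k_base a newer one;
# the same text often maps to a different number of tokens.
for name in ["p50k_base", "cl100k_base"]:
    encoding = tiktoken.get_encoding(name)
    print(f"{name}: {len(encoding.encode(text))} tokens")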

When to use it / when not to

Use it as a design concept when:

  • You need to estimate model usage costs.
  • You are debugging prompts that exceed the context window.
  • You want to split documents into RAG chunks without losing too much useful content (a sketch follows after this list).
  • You are comparing models and need to understand differences in latency or effective capacity.
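Token-based chunking, the RAG case mentioned above, can be sketched directly with the tokenizer: encode the document, slice the ID sequence into fixed-size windows, and decode each window back into text. The chunk size and overlap below are illustrative defaults, not recommendations:

import tiktoken

def chunk_by_tokens(text, chunk_size=300, overlap=50, encoding_name="cl100k_base"):
    """Split text into chunks of roughly chunk_size tokens, with some overlap."""
    encoding = tiktoken.get_encoding(encoding_name)
    ids = encoding.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + chunk_size]
        # Note: slicing IDs can cut through a multi-byte character at a
        # boundary; decode() replaces the broken piece rather than failing.
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(ids):
            break
    return chunks

document = "..."  # the document you want to split into RAG chunks
for i, chunk in enumerate(chunk_by_tokens(document)):
    print(i, len(chunk), "characters")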

Do not over-focus on it when:

  • You are running a quick test and the text volume is small.
  • Your application already manages limits and costs well at a higher layer.
  • You need semantic clarity more than micro-optimizing text fragments.
  • The real issue is prompt quality, not saving 5 or 10 tokens.

History and evolution


The idea of splitting text into processable units predates modern LLMs by decades. Classical NLP systems worked with full words, characters, or subwords depending on the task.

The major shift came with subword tokenization methods such as Byte Pair Encoding (BPE), WordPiece, and SentencePiece. These approaches made it possible to handle large vocabularies without relying on a closed list of full words, which reduced unknown-word problems and improved multilingual coverage.

Modern LLMs inherit that logic: they convert text into reusable pieces that balance efficiency, vocabulary coverage, and computational cost.