Skip to main content

How to split text by tokens

Prerequisites

This guide assumes familiarity with the following concepts:

Language models have a token limit. You should not exceed the token limit. When you split your text into chunks it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text you should use the same tokenizer as used in the language model.

js-tiktoken​

note

js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.

We can use js-tiktoken to estimate tokens used. It is tuned to OpenAI models.

  1. How the text is split: by character passed in.
  2. How the chunk size is measured: by the js-tiktoken tokenizer.

You can use the TokenTextSplitter like this:

import { TokenTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";

// Load an example document
const rawData = await fs.readFileSync(
"../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();

const textSplitter = new TokenTextSplitter({
chunkSize: 10,
chunkOverlap: 0,
});

const texts = await textSplitter.splitText(stateOfTheUnion);

console.log(texts[0]);
Madam Speaker, Madam Vice President, our

Note: Some written languages (e.g.Β Chinese and Japanese) have characters which encode to 2 or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks causing malformed Unicode characters.

Next steps​

You’ve now learned a method for splitting text based on token count.

Next, check out the full tutorial on retrieval-augmented generation.


Was this page helpful?


You can leave detailed feedback on GitHub.