tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI. We can use tiktoken to estimate the number of tokens used. It will probably be more accurate for the OpenAI models.
- How the text is split: by character passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.
To split text with a CharacterTextSplitter and then merge chunks with tiktoken, use its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.
The .from_tiktoken_encoder() method takes either encoding_name as an argument (e.g. cl100k_base), or the model_name (e.g. gpt-4). All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate CharacterTextSplitter:
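For example, a minimal sketch (assuming the langchain-text-splitters and tiktoken packages are installed; the sample text and chunk size are illustrative):

```python
from langchain_text_splitters import CharacterTextSplitter

sample_text = "First paragraph about tokenizers.\n\nSecond paragraph about chunk sizes."

# Split on the default "\n\n" separator, then merge splits while measuring
# their size with the cl100k_base tiktoken encoding.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # alternatively: model_name="gpt-4"
    chunk_size=20,
    chunk_overlap=0,
)
chunks = text_splitter.split_text(sample_text)
print(chunks)
```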
To enforce a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where each split will be recursively re-split if it is larger than the chunk size:
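A sketch of the same idea with the recursive splitter (same assumptions as above; parameters are illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_text = "First paragraph about tokenizers.\n\nSecond paragraph about chunk sizes."

# Oversized splits are recursively re-split until they fit within chunk_size tokens.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",  # alternatively: encoding_name="cl100k_base"
    chunk_size=20,
    chunk_overlap=0,
)
chunks = text_splitter.split_text(sample_text)
print(chunks)
```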
We can also use a TokenTextSplitter, which works with tiktoken directly and will ensure each split is smaller than the chunk size.
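A minimal sketch (the chunk size in tokens is illustrative):

```python
from langchain_text_splitters import TokenTextSplitter

# Splits directly on tiktoken token boundaries, so each chunk is at most 10 tokens.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
chunks = text_splitter.split_text("Some text whose chunks should each stay under ten tokens.")
print(chunks)
```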
Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using the TokenTextSplitter directly can therefore split the tokens for a single character between two chunks, causing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
spaCy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
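A minimal sketch using the SpacyTextSplitter (assumes spacy and its en_core_web_sm pipeline are installed; the sample text and chunk size are illustrative):

```python
from langchain_text_splitters import SpacyTextSplitter

# Sentences are detected with spaCy, then merged into chunks of at most 1000 characters.
text_splitter = SpacyTextSplitter(chunk_size=1000)
chunks = text_splitter.split_text("spaCy detects sentence boundaries. Chunks are measured in characters.")
print(chunks)
```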
SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:
- chunk_overlap: integer count of token overlap;
- model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
- tokens_per_chunk: desired token count per chunk.
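A minimal sketch (assumes the sentence-transformers package is installed and the model can be downloaded on first use; the values shown are illustrative):

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=256,
    chunk_overlap=0,
)
text = "Lorem ipsum dolor sit amet. " * 50
chunks = splitter.split_text(text)
print(splitter.count_tokens(text=text), len(chunks))
```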
NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.
Rather than splitting on a fixed character, we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
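A minimal sketch (assumes nltk is installed and its sentence tokenizer data has been downloaded; the sample text and chunk size are illustrative):

```python
import nltk
from langchain_text_splitters import NLTKTextSplitter

nltk.download("punkt")  # sentence tokenizer data; newer NLTK releases may also need "punkt_tab"

# Sentences are detected with NLTK, then merged into chunks of at most 1000 characters.
text_splitter = NLTKTextSplitter(chunk_size=1000)
chunks = text_splitter.split_text("NLTK splits on sentences. Chunks are then merged up to the character limit.")
print(chunks)
```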
KoNLPy
KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.
Token splitting for Korean with KoNLPy’s Kkma Analyzer
In the case of Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks down sentences into words and words into their respective morphemes, identifying parts of speech for each token. It can segment a block of text into individual sentences, which is particularly useful for processing long texts.
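A minimal sketch using the Kkma-based splitter (assumes the konlpy package, which needs a Java runtime, is installed; the Korean sample sentences are illustrative):

```python
from langchain_text_splitters import KonlpyTextSplitter

# Uses KoNLPy's Kkma analyzer to detect Korean sentence boundaries.
text_splitter = KonlpyTextSplitter()
chunks = text_splitter.split_text("안녕하세요. 만나서 반갑습니다.")  # "Hello. Nice to meet you."
print(chunks)
```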
Usage Considerations
While Kkma is renowned for its detailed analysis, it is important to note that this precision may impact processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.
Hugging Face tokenizer
Hugging Face has many tokenizers. We use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.
- How the text is split: by character passed in.
- How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.
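A minimal sketch (assumes the transformers package is installed; the chunk size is illustrative):

```python
from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Split on the default separator; chunk size is measured in GPT-2 tokens.
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=100,
    chunk_overlap=0,
)
chunks = text_splitter.split_text("First paragraph.\n\nSecond paragraph.")
print(chunks)
```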