huggingface-tokenizers


Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

AI & Automation · 9,636 stars · 724 forks · Updated 1 month ago · MIT


Quality Score: 94/100

Stars (20%): 100
Recency (20%): 75
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

## When to use HuggingFace Tokenizers

**Use HuggingFace Tokenizers when you:**

- Need extremely fast tokenization (<20s per GB of text)
- Are training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Are building production NLP pipelines
- Need to tokenize large corpora efficiently

**Performance**:

- **Speed**: <20 seconds to tokenize 1GB on CPU
- **Implementation**: Rust core with Python/Node.js bindings
- **Efficiency**: 10-100× faster than pure Python implementations

**Use alternatives instead**:

- **SentencePiece**: Language-independent, used by T5/ALBERT
- **tiktoken**: OpenAI's BPE tokenizer for GPT models
- **transformers AutoTokenizer**: When you only need to load pretrained tokenizers (uses this library internally)

## Quick start

### Installation

```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```

### Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text (BERT's post-processor adds [CLS] and [SEP] by default)
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# Decode back (special tokens are skipped by default)
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

### Train custom...
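
A minimal sketch of training a custom tokenizer and then using the alignment tracking and padding/truncation features mentioned above. It is an illustration under stated assumptions, not the skill's own example: the toy corpus, the vocabulary size, the special tokens, and the fixed length of 8 are all hypothetical choices made for this sketch.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus; a real run would stream lines from files.
corpus = [
    "Fast tokenizers are written in Rust.",
    "Tokenizers turn text into tokens and ids.",
    "Training a tokenizer learns a vocabulary from text.",
]

# BPE model with an explicit unknown token; text is first split on
# whitespace and punctuation before merges are learned.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Illustrative settings (assumptions): small vocabulary, two special tokens.
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Alignment tracking: each token carries (start, end) character offsets
# into the original string.
text = "Fast tokenizers learn tokens."
encoding = tokenizer.encode(text)
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r:12} -> text[{start}:{end}] = {text[start:end]!r}")

# Padding and truncation: force every sequence in a batch to length 8.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=8,
)
batch = tokenizer.encode_batch([
    "Fast tokenizers.",
    "Training a tokenizer learns a vocabulary from text.",
])
print([len(e.ids) for e in batch])
```

Training from an in-memory iterator keeps the sketch free of file paths and network access. For a real corpus, the same trainer can be fed a generator that yields lines from disk, and the vocabulary size would typically be in the tens of thousands.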

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT
