huggingface-tokenizers

Solid

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
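As a minimal sketch of the features listed above (custom vocabulary training, alignment tracking, padding and truncation), the example below trains a tiny BPE tokenizer in memory. The toy corpus, the vocabulary size, the `[UNK]`/`[PAD]` token names, and the length of 8 are illustrative assumptions, not values from the skill itself.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus; a real run would stream a much larger dataset.
corpus = [
    "Fast tokenizers for research and production.",
    "Train custom vocabularies and track alignments.",
    "Handle padding and truncation in one place.",
]

# BPE model with an explicit unknown token; split on whitespace/punctuation first.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Special tokens receive the first ids in the order given ([UNK]=0, [PAD]=1).
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Alignment tracking: each token carries (start, end) character offsets
# into the original string.
text = "Train custom tokenizers."
encoding = tokenizer.encode(text)
spans = [text[start:end] for start, end in encoding.offsets]

# Padding and truncation are configured once on the tokenizer (assumed length: 8).
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=8,
)
batch = tokenizer.encode_batch(["Train custom vocabularies.", "Fast."])
lengths = [len(e.ids) for e in batch]
```

Because this sketch uses no normalizer and no subword prefix, each token is exactly the slice of the input that its offsets point to, which makes the alignment easy to check by eye.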

AI & Automation · 9,636 stars · 724 forks · Updated 1 month ago · MIT


Quality Score: 94/100

- Stars (20%): 100
- Recency (20%)
- Frontmatter (20%)
- Documentation (15%): 100
- Issue Health (10%)
- License (10%): 100
- Description (5%): 100

Skill Content

# HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

## When to use HuggingFace Tokenizers

**Use HuggingFace Tokenizers when:**

- You need extremely fast tokenization (<20s per GB of text)
- You are training custom tokenizers from scratch
- You want alignment tracking (token → original text position)
- You are building production NLP pipelines
- You need to tokenize large corpora efficiently

**Performance**:

- **Speed**: <20 seconds to tokenize 1GB on CPU
- **Implementation**: Rust core with Python/Node.js bindings
- **Efficiency**: 10-100× faster than pure Python implementations

**Use alternatives instead**:

- **SentencePiece**: Language-independent, used by T5/ALBERT
- **tiktoken**: OpenAI's BPE tokenizer for GPT models
- **transformers AutoTokenizer**: For loading pretrained tokenizers only (uses this library internally)

## Quick start

### Installation

```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```

### Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text (special tokens such as [CLS]/[SEP] are added by default)
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# Decode back (special tokens are skipped by default)
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

### Train custom...

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

OpenAI (AI) · Hugging Face (AI)
