sentencepiece

Solid

Language-independent tokenizer that treats text as a raw Unicode sequence. Supports the BPE and Unigram algorithms. Fast (about 50k sentences/sec), lightweight (about 6 MB of memory), with a deterministic vocabulary. Used by T5, ALBERT, XLNet, and mBART. Trains on raw text without pre-tokenization. Use it when you need multilingual support, CJK language coverage, or reproducible tokenization.

AI & Automation · 9,609 stars · 724 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Stars (weight 20%): 100
Recency (weight 20%): 75
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

## When to use SentencePiece

**Use SentencePiece when:**

- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)

**Performance**:

- **Speed**: 50,000 sentences/sec
- **Memory**: ~6MB for loaded model
- **Languages**: All (language-independent)

**Use alternatives instead**:

- **HuggingFace Tokenizers**: Faster training, more flexibility
- **tiktoken**: OpenAI models (GPT-3.5/4)
- **BERT WordPiece**: English-centric tasks

## Quick start

### Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

### Train model

```bash
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: ~1-2 minutes for 100MB corpus

### Encode and decode

```python
import sentencepiece as spm

# Load model
sp = s...
```
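
To illustrate why training and decoding can work on raw text without pre-tokenization, here is a minimal standard-library sketch of the whitespace convention SentencePiece uses: spaces are replaced by the meta symbol `▁` (U+2581) and treated like any other character, so concatenating pieces recovers the original string. The helper names `escape_whitespace`, `toy_segment`, and `detokenize` are hypothetical and not part of the `sentencepiece` API. The fixed-width chunking is only a stand-in for a trained BPE or Unigram model, and the sketch skips the normalization that the real library applies by default.

```python
from typing import List

# Illustrative only: none of these helpers belong to the sentencepiece package.
META = "\u2581"  # the meta symbol that marks whitespace


def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol and add one dummy prefix."""
    return META + text.replace(" ", META)


def toy_segment(escaped: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a trained model: fixed-width chunks."""
    return [escaped[i:i + piece_len] for i in range(0, len(escaped), piece_len)]


def detokenize(pieces: List[str]) -> str:
    """Join pieces, map meta symbols back to spaces, drop the dummy prefix."""
    joined = "".join(pieces).replace(META, " ")
    return joined[1:] if joined.startswith(" ") else joined


text = "Hello world. こんにちは世界"
pieces = toy_segment(escape_whitespace(text))
# The round trip is lossless regardless of where the piece boundaries fall.
assert detokenize(pieces) == text
```

Because whitespace is carried inside the pieces themselves, the same procedure applies to languages that do not separate words with spaces, which is one reason no language-specific rules are required.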

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT
