sentencepiece


Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

AI & Automation · 9,609 stars · 724 forks · Updated 1 month ago · MIT


Quality Score: 94/100

Stars (20%): 100
Recency (20%)
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

## When to use SentencePiece

**Use SentencePiece when:**
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)

**Performance**:
- **Speed**: 50,000 sentences/sec
- **Memory**: ~6MB for loaded model
- **Languages**: All (language-independent)

**Use alternatives instead**:
- **HuggingFace Tokenizers**: Faster training, more flexibility
- **tiktoken**: OpenAI models (GPT-3.5/4)
- **BERT WordPiece**: English-centric tasks

## Quick start

### Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

### Train model

```bash
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: ~1-2 minutes for 100MB corpus

### Encode and decode

```python
import sentencepiece as spm

# Load model
sp = s...
```
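To give a feel for what `model_type=bpe` does with raw, un-pre-tokenized text, here is a minimal pure-Python sketch. It is a toy illustration, not the library's implementation: the helper names (`to_symbols`, `merge_pair`, `train_bpe`, `encode`, `decode`) are hypothetical, normalization is skipped, and the input is assumed not to contain the `▁` character already. It marks whitespace with the `▁` (U+2581) meta symbol, learns byte-pair-style merges that never cross a word boundary, and shows that decoding the pieces restores the original string.

```python
from collections import Counter

WS = "\u2581"  # "▁", the meta symbol used to mark whitespace


def to_symbols(text):
    """Escape spaces as a visible symbol, then split into single characters."""
    return list(WS + text.replace(" ", WS))


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(sentences, num_merges):
    """Toy trainer: learn up to `num_merges` merge rules from raw sentences."""
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for symbols in corpus:
            # Skip pairs whose right side starts a new word, so that
            # pieces never span a whitespace boundary.
            counts.update(
                p for p in zip(symbols, symbols[1:]) if not p[1].startswith(WS)
            )
        if not counts:
            break
        # Most frequent pair first; ties broken lexicographically so that
        # repeated runs on the same data give the same merges.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(text, merges):
    """Apply the learned merges, in order, to new text."""
    symbols = to_symbols(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Join pieces, turn the meta symbol back into spaces, drop the prefix."""
    return "".join(pieces).replace(WS, " ")[1:]


if __name__ == "__main__":
    sample = ["low lower lowest", "new newer newest"]  # illustrative data
    rules = train_bpe(sample, num_merges=10)
    pieces = encode("lower newest", rules)
    print(pieces)
    print(decode(pieces))
```

Because whitespace is kept as an ordinary symbol inside the pieces, decoding is plain concatenation, so no language-specific detokenizer is needed.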

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

OpenAI · AI
Hugging Face · AI
