simpo-training

Solid

Simple Preference Optimization for LLM alignment. Reference-free alternative to DPO with better performance (+6.4 points on AlpacaEval 2.0). No reference model needed, more efficient than DPO. Use for preference alignment when want simpler, faster training than DPO/PPO.

AI & Automation 9,609 stars 724 forks Updated 1 months ago MIT

Install

View on GitHub

Quality Score: 94/100

Stars 20%

100

Recency 20%

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# SimPO - Simple Preference Optimization ## Quick start SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model. **Installation**: ```bash # Create environment conda create -n simpo python=3.10 && conda activate simpo # Install PyTorch 2.2.2 # Visit: https://pytorch.org/get-started/locally/ # Install alignment-handbook git clone https://github.com/huggingface/alignment-handbook.git cd alignment-handbook python -m pip install . # Install Flash Attention 2 python -m pip install flash-attn --no-build-isolation ``` **Training** (Mistral 7B): ```bash ACCELERATE_LOG_LEVEL=info accelerate launch \ --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py \ training_configs/mistral-7b-base-simpo.yaml ``` ## Common workflows ### Workflow 1: Train from base model (Mistral 7B) **Config** (`mistral-7b-base-simpo.yaml`): ```yaml # Model model_name_or_path: mistralai/Mistral-7B-v0.1 torch_dtype: bfloat16 # Dataset dataset_mixer: HuggingFaceH4/ultrafeedback_binarized: 1.0 dataset_splits: - train_prefs - test_prefs # SimPO hyperparameters beta: 2.0 # Reward scaling (2.0-10.0) gamma_beta_ratio: 0.5 # Target margin (0-1) loss_type: sigmoid # sigmoid or hinge sft_weight: 0.0 # Optional SFT regularization # Training learning_rate: 5e-7 # Critical: 3e-7 to 1e-6 num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 8 # Ou...