
nlp-alignment

Best practices for LLM alignment techniques including RLHF, DPO, and instruction tuning. Use when working on alignment or safety.
thada2402/AutoResearchClaw · ★ 1 · AI & Automation · score 75
Install: claude install-skill thada2402/AutoResearchClaw
## LLM Alignment Best Practice

Methods:
- RLHF: Train reward model → PPO fine-tuning (complex but powerful)
- DPO: Direct preference optimization (simpler, no reward model needed)
- GRPO: Group relative policy optimization
- SFT: Supervised fine-tuning as alignment baseline

Training recipe:
- Start with SFT on high-quality instruction data
- DPO: lr=5e-7, beta=0.1, batch_size=64
- PPO: lr=1e-6, clip=0.2, KL coeff=0.02
- Use reference model for KL penalty
- Evaluate on safety benchmarks (TruthfulQA, BBQ, etc.)

Common pitfalls:
- Reward hacking: model finds shortcuts to high reward
- Mode collapse: model generates repetitive outputs
- Catastrophic forgetting: loses general capabilities
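
To make the DPO entry in the recipe concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference model. The function and argument names (`dpo_loss`, `policy_chosen_logp`, and so on) are illustrative and do not come from this skill or from any particular library. The default `beta=0.1` matches the value listed above.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.

    All argument names are hypothetical; a real trainer would compute
    these values in batches from model logits.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin; beta controls how far the policy
    # may drift from the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy prefers the chosen response more than the reference does:
    # the loss falls below log(2).
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference model enters only through the two log-ratios, which is how DPO applies a KL-style constraint without a separate reward model. In practice this would be computed over batches of tensors and averaged rather than on scalars.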