sparse-autoencoder-training

Solid

Provides guidance for training and analyzing Sparse Autoencoders (SAEs) using SAELens to decompose neural network activations into interpretable features. Use when discovering interpretable features, analyzing superposition, or studying monosemantic representations in language models.

AI & Automation 2,279 stars 168 forks Updated 3 weeks ago Apache-2.0

Install

View on GitHub

Quality Score: 97/100

Stars 20%
100
Recency 20%
90
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# SAELens: Sparse Autoencoders for Mechanistic Interpretability SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs) - a technique for decomposing polysemantic neural network activations into sparse, interpretable features. Based on Anthropic's groundbreaking research on monosemanticity. **GitHub**: [jbloomAus/SAELens](https://github.com/jbloomAus/SAELens) (1,100+ stars) ## The Problem: Polysemanticity & Superposition Individual neurons in neural networks are **polysemantic** - they activate in multiple, semantically distinct contexts. This happens because models use **superposition** to represent more features than they have neurons, making interpretability difficult. **SAEs solve this** by decomposing dense activations into sparse, monosemantic features - typically only a small number of features activate for any given input, and each feature corresponds to an interpretable concept. ## When to Use SAELens **Use SAELens when you need to:** - Discover interpretable features in model activations - Understand what concepts a model has learned - Study superposition and feature geometry - Perform feature-based steering or ablation - Analyze safety-relevant features (deception, bias, harmful content) **Consider alternatives when:** - You need basic activation analysis → Use **TransformerLens** directly - You want causal intervention experiments → Use **pyvene** or **TransformerLens** - You need production steering → Consider direct activatio...

Details

Author
foryourhealth111-pixel
Repository
foryourhealth111-pixel/Vibe-Skills
Created
3 months ago
Last Updated
3 weeks ago
Language
Python
License
Apache-2.0

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category