sparse-autoencoder-training

Solid

Provides guidance for training and analyzing Sparse Autoencoders (SAEs) using SAELens to decompose neural network activations into interpretable features. Use when discovering interpretable features, analyzing superposition, or studying monosemantic representations in language models.

AI & Automation 191,515 stars 33299 forks Updated today MIT

Install

View on GitHub

Quality Score: 93/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# SAELens: Sparse Autoencoders for Mechanistic Interpretability SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs) - a technique for decomposing polysemantic neural network activations into sparse, interpretable features. Based on Anthropic's groundbreaking research on monosemanticity. **GitHub**: [jbloomAus/SAELens](https://github.com/jbloomAus/SAELens) (1,100+ stars) ## The Problem: Polysemanticity & Superposition Individual neurons in neural networks are **polysemantic** - they activate in multiple, semantically distinct contexts. This happens because models use **superposition** to represent more features than they have neurons, making interpretability difficult. **SAEs solve this** by decomposing dense activations into sparse, monosemantic features - typically only a small number of features activate for any given input, and each feature corresponds to an interpretable concept. ## When to Use SAELens **Use SAELens when you need to:** - Discover interpretable features in model activations - Understand what concepts a model has learned - Study superposition and feature geometry - Perform feature-based steering or ablation - Analyze safety-relevant features (deception, bias, harmful content) **Consider alternatives when:** - You need basic activation analysis → Use **TransformerLens** directly - You want causal intervention experiments → Use **pyvene** or **TransformerLens** - You need production steering → Consider direct activatio...

Details

Author
NousResearch
Repository
NousResearch/hermes-agent
Created
10 months ago
Last Updated
today
Language
Python
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category