LLM Interpretability with Sparse Autoencoders explores how to reverse-engineer large language models (LLMs) to better understand their internal representations. Traditional neuron-level analysis falls short because of polysemanticity: a single neuron often encodes multiple unrelated concepts. To address this, we adopt Sparse Autoencoders (SAEs) to extract monosemantic features that correspond more cleanly to human-understandable concepts.
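
The core component is a sparse autoencoder trained on a model's internal activations. As a rough illustration, here is a minimal PyTorch sketch of a Top-K SAE of the kind described below; the class name, dimensions, and default hyperparameters are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch of a Top-K sparse autoencoder over residual-stream activations.
# Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16384, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_features)   # activation -> feature pre-activations
        self.decoder = nn.Linear(d_features, d_model)   # sparse features -> reconstruction

    def forward(self, x: torch.Tensor):
        pre = self.encoder(x)
        # Keep only the k largest pre-activations per example; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        features = torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))
        recon = self.decoder(features)
        return recon, features

def reconstruction_loss(x: torch.Tensor, recon: torch.Tensor) -> torch.Tensor:
    # Plain reconstruction error; sparsity comes from the Top-K mask, not an L1 term.
    return torch.mean((recon - x) ** 2)
```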

Inspired by recent work from Anthropic, we train Top-K autoencoders across multiple LLMs and evaluate the quality of the learned features using visualizations, feature steering, and interpretability metrics. We also use techniques such as Ghost Grads and propose a novel kScheduler to address the dead-latent problem in sparse training. Applications include insights into safety-relevant behaviors (e.g., bias, deception) and the ability to steer generation with targeted features instead of finetuning the model.
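
The kScheduler is a contribution of this project and its exact schedule is not reproduced here. As a hedged sketch of the general idea, one simple approach is to start training with a loose sparsity level k and anneal it down to the target value, so that more latents receive gradient early on and fewer end up dead. The class below is purely illustrative.

```python
# Illustrative only: a simple linear anneal of the Top-K sparsity level.
# The project's actual kScheduler may differ.
class LinearKScheduler:
    def __init__(self, k_start: int, k_end: int, warmup_steps: int):
        self.k_start = k_start
        self.k_end = k_end
        self.warmup_steps = warmup_steps

    def k_at(self, step: int) -> int:
        # After warmup, stay at the target k; before that, interpolate linearly.
        if step >= self.warmup_steps:
            return self.k_end
        frac = step / self.warmup_steps
        return round(self.k_start + frac * (self.k_end - self.k_start))

# Usage inside a training loop (assuming the TopKSAE sketch above):
#   scheduler = LinearKScheduler(k_start=256, k_end=32, warmup_steps=10_000)
#   sae.k = scheduler.k_at(step)
```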

Sample Results

Example: Steering Vector Influence on Generation

| Steering Vector | Generated Poem |
| --- | --- |
| Unsteered | The sun, oh glorious sun,<br>Bringing warmth to everyone,<br>Never to fade away.<br>Its rays, so bright and bold,<br>Painting the sky with gold,<br>Waking up the earth,<br>Making all things unfold. |
| Donald Trump | NYC is the city that never sleeps,<br>Where dreams come true and secrets keep.<br>A place where people from all walks of life,<br>Gather to make their own unique strife.<br>From Wall Street to Times |
| Taj Mahal | India, you beautiful country.<br>A land of history and culture so rich.<br>Your cities are magnificent, your people so kind.<br>I'm in love with you, oh India!<br>From the Himalayas to the Bay of Bengal, |
| Adolf Hitler | I am a cog in the great machine,<br>Wound up tight in the gears of industry.<br>My hands move with precision and speed,<br>As I churn out product after product |

This example illustrates how activating specific features (via steering vectors) influences the style, theme, and content of generated outputs — demonstrating the interpretability and controllability of internal representations.
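
Concretely, steering can be implemented by adding a scaled feature direction (for example, a column of the SAE decoder) to the hidden states of a chosen layer during generation. The sketch below uses a PyTorch forward hook; the module path, layer index, feature index, and scale are assumptions for illustration, not the project's exact setup.

```python
# Sketch of feature steering via a forward hook (assumes a HuggingFace-style
# decoder whose transformer blocks live at model.transformer.h; names are illustrative).
import torch

def make_steering_hook(direction: torch.Tensor, scale: float = 8.0):
    direction = direction / direction.norm()          # unit-norm feature direction
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage: steer layer 12 with a chosen SAE feature while generating.
# feature_dir = sae.decoder.weight[:, 1234]           # shape: (d_model,)
# handle = model.transformer.h[12].register_forward_hook(make_steering_hook(feature_dir))
# out = model.generate(**inputs, max_new_tokens=64)
# handle.remove()
```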

We also developed a real-time, web-based feature activation visualizer and demonstrated feature-level alignment with real-world concepts such as gender bias, chain-of-thought (CoT) reasoning, and landmarks. These methods pave the way for more interpretable and steerable LLMs.