Arguably one of the most important papers of 2024
Interpretability is considered by many to be one of the next frontiers in LLMs. This new generation of frontier models is often seen as opaque: data goes in, a response comes out, and the reasoning behind that specific response remains hidden. This opacity makes the models harder to trust, raising concerns about their potential to produce harmful, biased, or untruthful outputs. If the inner workings are a mystery, how can one be confident in their safety and reliability?
Delving into the model’s internal state doesn’t necessarily clarify things. The internal state, essentially a collection of numbers (neuron activations), lacks clear meaning. Through interaction with models like Claude, it is evident they comprehend and utilize various concepts, yet these concepts cannot be directly discerned by examining the neurons. Each concept spans multiple neurons, and each neuron contributes to multiple concepts.
Last year, Anthropic published some very relevant work in the interpretability space, focused on matching patterns of neuron activations, termed features, to human-understandable concepts. Using "dictionary learning", a technique from classical machine learning, they identified recurring neuron activation patterns across many different contexts. As a result, the model's internal state can be represented by a few active features instead of many active neurons. Just as every word in a dictionary is built by combining letters, and every sentence by combining words, every feature in an AI model is built by combining neurons, and every internal state by combining features.
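To make the analogy concrete, here is a minimal, illustrative sketch (not Anthropic's code) using scikit-learn's classical dictionary-learning implementation on synthetic "neuron activations"; the dimensions, sparsity level, and variable names are invented for the example.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Toy setup: 64 "neurons", 256 underlying "features" (directions in neuron space).
n_neurons, n_features, n_samples = 64, 256, 300
true_features = rng.normal(size=(n_features, n_neurons))

# Each synthetic "internal state" is a sparse mix of roughly 5 features.
true_codes = np.zeros((n_samples, n_features))
for i in range(n_samples):
    idx = rng.choice(n_features, size=5, replace=False)
    true_codes[i, idx] = rng.normal(size=5)
activations = true_codes @ true_features            # shape: (n_samples, n_neurons)

# Dictionary learning re-expresses each state as a sparse code over learned features.
learner = DictionaryLearning(
    n_components=n_features,       # overcomplete: more features than neurons
    transform_algorithm="lasso_lars",
    transform_alpha=0.1,           # sparsity pressure: prefer few active features
    max_iter=20,
    random_state=0,
)
codes = learner.fit_transform(activations)

active = (np.abs(codes) > 1e-6).sum(axis=1)
print(f"avg active features per state: {active.mean():.1f} (out of {n_features})")
```

The point of the toy example is only that a dense vector of neuron activations can be rewritten as a short list of active features, which is the representational shift the work relies on.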
Anthropic’s earlier work was based on a relatively small model. The next obvious challenge was to determine whether the approach scales to large frontier models. In a new paper, Anthropic used dictionary learning to extract interpretable features from its Claude 3 Sonnet model. The core of the technique is a familiar architecture: a sparse autoencoder trained on the model’s internal activations.
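As a rough illustration of that architecture, the sketch below shows a minimal sparse autoencoder in PyTorch: an overcomplete encoder with a ReLU and an L1 penalty on the feature activations, trained to reconstruct the original activations. This is not Anthropic's implementation; the dimensions, L1 coefficient, and the random tensor standing in for collected activations are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps model activations into an overcomplete feature space and back,
    with an L1 penalty encouraging only a few features to fire per input."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)              # reconstruction of the original activations
        return recon, features

# Illustrative sizes: 512-dim activations, 4096 candidate features.
sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                     # sparsity strength (illustrative)

# 'acts' stands in for activations collected from a transformer layer.
acts = torch.randn(1024, 512)
for _ in range(100):
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The trade-off the loss encodes is the whole game: reconstruct the activations faithfully while keeping as few features active as possible, so that each surviving feature has a better chance of corresponding to a single human-interpretable concept.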