
The Digital Data Design Institute at Harvard is now the Harvard Business School AI Institute.

Bridging the Gap Between Understanding and Control: Insights into AI Interpretability

As large language model (LLM) systems grow in complexity, the challenge of ensuring their outputs align with human intentions has become critical. Interpretability (the ability to explain how models reach their decisions) and control (the ability to steer them toward desired outcomes) are two sides of the same coin.

Research by Usha Bhalla, Graduate Fellow and PhD student at the Harvard University Kempner Institute and the Digital Data Design Institute (D^3) Trustworthy AI Lab; Suraj Srinivas, Research Scientist at Bosch AI; Himabindu Lakkaraju, Assistant Professor of Business Administration at Harvard Business School and PI in D^3's Trustworthy AI Lab; and Asma Ghandeharioun, Senior Research Scientist at Google DeepMind, found that many methods developed to address these issues focus on one aspect while neglecting the other. The study introduces a new approach that unifies interpretability and control, proposes intervention as the primary goal, and evaluates how well different methods enable control through intervention.

Key Insight: Intervention as a Fundamental Goal of Interpretability

“[W]e view intervention as a fundamental goal of interpretability, and propose to measure the correctness of interpretability methods by their ability to successfully edit model behaviour.” [1]

The authors define intervention as the deliberate modification of specific human-interpretable features within a model's latent representations (1) to achieve desired changes in its outputs, or its responses to prompts. They argue that the ability to intervene in a model's behavior in this way should be a core objective of interpretability methods. By focusing on intervention, they provide a practical way to assess the effectiveness of various interpretability techniques. This approach shifts the focus from merely understanding a model's inner workings to actively influencing its outputs, bridging the gap between theory and application.
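The idea can be sketched with a toy linear edit: shift a latent vector along a feature direction until that feature reaches a target strength. The function, vectors, and the assumption that features correspond to linear directions in latent space are our own illustration, not the paper's code.

```python
import numpy as np

def intervene(latent, feature_direction, target_strength):
    """Set the strength of one (assumed linear) feature in a latent vector.

    Measures how strongly the feature is currently expressed, then shifts
    the latent along the feature direction to the desired strength,
    leaving orthogonal directions untouched. Illustrative sketch only.
    """
    d = feature_direction / np.linalg.norm(feature_direction)
    current = latent @ d  # current feature strength (scalar projection)
    return latent + (target_strength - current) * d

latent = np.array([1.0, 2.0, 3.0])     # toy hidden state
direction = np.array([0.0, 0.0, 1.0])  # hypothetical "feature" axis
edited = intervene(latent, direction, target_strength=5.0)
print(edited)  # [1. 2. 5.]: only the targeted feature changed
```

The edited latent can then be fed back into the model's forward pass, which is what makes the edit an intervention on behavior rather than a passive explanation.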

Key Insight: A Unified Framework for Interpretability and Control

“[W]e present an encoder-decoder framework that unifies four popular mechanistic interpretability methods: sparse autoencoders, logit lens, tuned lens, and probing.” [2]

The study uncovered a critical limitation of current interpretability methods: their performance varies significantly across models and features. To address this, Bhalla et al. present an approach that unifies diverse interpretability methods under a single framework: the encoder-decoder model. Their framework maps a model's intermediate latent representations to feature spaces that humans can understand, allowing interventions on those features. The edited features can then be translated back into latent representations to influence the model's outputs. The study evaluates four methods within this unified framework to determine their relative strengths and weaknesses for both interpretability and control:

  • Logit Lens: Easy to use, requires no training, maps features directly to individual tokens in the model's vocabulary, and generally has high causal fidelity (2), but is limited by predefined, static features
  • Tuned Lens: Extends Logit Lens with an additional learned linear transformation (3), which improves its flexibility and effectiveness, but requires additional training and tuning
  • Sparse autoencoders (SAEs): Can learn a large dictionary of low-level and high-level (abstract) features, but are difficult to train and label and show lower causal fidelity
  • Probing: Trains simple classifiers (often linear) on top of model representations to predict specific features or concepts, but is prone to spurious correlations, leading to low causal fidelity
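As a simplified illustration of the encoder-decoder view, the sketch below uses a single linear map as the encoder, loosely in the spirit of the logit lens: the encoder projects a latent vector onto feature scores, an intervention edits one score, and a pseudoinverse decoder maps the edited scores back to latent space. All names, shapes, and the purely linear setup are our own assumptions, not the paper's implementation.

```python
import numpy as np

class LinearEncoderDecoder:
    """Toy encoder-decoder interpretability method (illustrative only).

    encode: latent vector -> human-interpretable feature scores
    decode: feature scores -> latent vector
    With a full-column-rank W, decode(encode(z)) reconstructs z exactly,
    though an edited feature is generally realized only approximately
    after decoding, which is where causal fidelity enters.
    """

    def __init__(self, W):
        self.W = W                        # (n_features, d_model)
        self.W_pinv = np.linalg.pinv(W)   # pseudoinverse for decoding

    def encode(self, latent):
        return self.W @ latent            # feature scores

    def decode(self, features):
        return self.W_pinv @ features     # back to latent space

    def intervene(self, latent, feature_idx, value):
        scores = self.encode(latent)
        scores[feature_idx] = value       # edit one interpretable feature
        return self.decode(scores)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))               # 8 toy features over a 4-d latent
method = LinearEncoderDecoder(W)
z = rng.normal(size=4)
assert np.allclose(method.decode(method.encode(z)), z)  # faithful round trip
z_edited = method.intervene(z, feature_idx=2, value=3.0)
```

Each of the four methods above fits this template by swapping in a different encoder/decoder pair: a fixed unembedding for the logit lens, a learned affine map for the tuned lens, a sparse dictionary for SAEs, and a trained classifier for probing.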

Key Insight: Measuring Success Through Interventions

“[W]e propose two evaluation metrics for encoder-decoder interpretability methods, namely (1) intervention success rate; and (2) the coherence-intervention tradeoff to evaluate the ability of interpretability methods to control model behavior.” [3]

The authors introduce two metrics to assess whether interventions are accurate and preserve the integrity and functionality of AI systems in real-world applications:

  • Intervention success rate: Measures effectiveness, i.e., whether the intervention achieves its intended goal
  • Coherence-intervention tradeoff: Measures practical utility, ensuring the intervention does not render the model's outputs unusable by degrading their coherence and quality
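A minimal way to compute both metrics over a batch of intervention attempts might look like the following. The result schema and the use of a simple coherence flag are our own simplifications; the paper's coherence measure is more involved.

```python
def intervention_success_rate(results):
    """Fraction of interventions that achieved their intended change."""
    return sum(r["goal_met"] for r in results) / len(results)

def coherent_success_rate(results):
    """Fraction achieving the goal while keeping output coherent,
    a crude proxy for the coherence-intervention tradeoff."""
    return sum(r["goal_met"] and r["coherent"] for r in results) / len(results)

# Hypothetical outcomes of four intervention attempts
results = [
    {"goal_met": True,  "coherent": True},
    {"goal_met": True,  "coherent": False},  # goal met, but output degraded
    {"goal_met": False, "coherent": True},
    {"goal_met": True,  "coherent": True},
]
print(intervention_success_rate(results))  # 0.75
print(coherent_success_rate(results))      # 0.5
```

The gap between the two numbers is the tradeoff: interventions that technically succeed can still be useless if they wreck the output.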

Among the methods evaluated, the two lens-based approaches had the highest intervention success rates. However, given current shortcomings, such as inconsistency across models and features and the risk of degrading output coherence and quality, the authors found that simpler options, such as prompting, still outperform intervention methods for directing model behavior.

Why This Matters

For business professionals and C-suite executives, the insights presented by Bhalla and her team represent a pivotal development in the practical application of AI technologies. As organizations increasingly rely on AI for tasks ranging from routine to mission-critical, understanding how to align these systems with human and organizational values is paramount. The proposed framework and metrics provide actionable tools for ensuring AI systems are both correct and usable. The study also underscores the need to select and evaluate interpretability methods carefully based on the specific models and tasks involved.

Footnotes

(1) Latent representation refers to the internal, abstract representation of data within a machine learning model. These representations are not directly interpretable by humans but encode meaningful patterns or features of the input data.

(2) Causal fidelity is the extent to which intervening on a specific feature of an explanation results in the corresponding change in the model’s output.

(3) A linear transformation is a mathematical function that converts one vector into another while maintaining the properties of vector addition and scalar multiplication. Put simply, it changes the direction and size of vectors without warping or distorting the structure of the space they occupy.
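In symbols, a map T between vector spaces is linear exactly when it preserves those two operations, for any vectors u, v and scalar c:

```latex
T(u + v) = T(u) + T(v), \qquad T(c\,u) = c\,T(u)
```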

References

[1] Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, and Himabindu Lakkaraju, “Towards Unifying Interpretability and Control: Evaluation via Intervention”, arXiv preprint arXiv:2411.04430v1 (November 7, 2024): 2.

[2] Bhalla, et al. “Towards Unifying Interpretability and Control: Evaluation via Intervention”, 3.

[3] Bhalla, et al. “Towards Unifying Interpretability and Control: Evaluation via Intervention”, 3.

Meet the Authors

Usha Bhalla is a PhD student in the Computer Science program at Harvard University and the Kempner Institute, and a fellow at the Digital Data Design Institute (D^3) Trustworthy AI Lab. Advised by Hima Lakkaraju, her research focuses on machine learning interpretability. Bhalla is also a dedicated advocate for diversity in computer science, mentoring early-career minority students to support their growth in the field.

Suraj Srinivas is a Research Scientist at Bosch AI with a focus on model interpretability, data-centric machine learning, and the "science" of deep learning. They completed their Ph.D. with François Fleuret at the Idiap Research Institute & EPFL, Switzerland, and were a postdoctoral research fellow with Hima Lakkaraju at Harvard University. They have organized workshops and seminars on interpretable AI, including sessions at NeurIPS 2023 and 2024, and contributed to teaching an explainable AI course at Harvard. Their work bridges theoretical advancements and practical applications of explainable AI.

Himabindu Lakkaraju is an Assistant Professor of Business Administration at Harvard Business School and PI in D^3's Trustworthy AI Lab. She is also a faculty affiliate in the Department of Computer Science at Harvard University, the Harvard Data Science Initiative, the Center for Research on Computation and Society, and the Laboratory of Innovation Science at Harvard. She teaches the first-year course on Technology and Operations Management, and has previously offered multiple courses and guest lectures on a diverse set of topics pertaining to Artificial Intelligence (AI) and Machine Learning (ML) and their real-world implications.

Asma Ghandeharioun is a Senior Research Scientist at Google DeepMind, where she focuses on aligning AI with human values by understanding, controlling, and demystifying language models. She earned her Ph.D. from the MIT Media Lab's Affective Computing Group and has conducted research at Google Research, Microsoft Research, and EPFL. Previously, she worked in digital mental health, collaborating with Harvard medical professionals and publishing in leading journals.


Engage With Us

Join Our Community

Ready to dive deeper with the HBS AI Institute? Subscribe to our newsletter, contribute to the conversation and begin to invent the future for yourself, your business and society as a whole.