LLM Explanations Need to Track Behavior Changes

Why this is here: Existing explainability methods fall short because they don’t account for how interventions—like fine-tuning—cause LLMs to change what they do.

Because large language models (LLMs) are updated frequently, researchers at an unnamed institution propose new standards for explaining how their behavior changes. These models exhibit “behavioral shifts” when they are adjusted through methods like fine-tuning or when they receive new data.

Current explanation methods either treat an LLM as unchanging or simply compare explanations produced at different times. Both approaches make it hard to understand how a model’s behavior changed after an update.

The team argues that explanations should focus on the shift itself, that is, on how an intervention transforms the original model. They introduce “Comparative XAI” (XAIΔ), a system designed to highlight differences between model versions when behavior changes. According to the team, these explanations need to be comparable, valid, actionable, and useful for ongoing monitoring, among other requirements.
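To make the idea of a behavioral shift concrete, here is a minimal Python sketch that runs two model versions on the same prompts and records where their outputs differ. It is not the researchers' method or any part of XAIΔ. The names `Shift`, `find_behavior_shifts`, `shift_rate`, and the two toy "models" are hypothetical stand-ins invented for illustration, and exact string comparison is an assumed, deliberately crude notion of "changed behavior".

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a "model" is any function mapping a prompt to an output string.
Model = Callable[[str], str]


@dataclass
class Shift:
    """One prompt on which the two model versions disagree."""
    prompt: str
    before: str
    after: str


def find_behavior_shifts(base: Model, updated: Model, prompts: List[str]) -> List[Shift]:
    """Return the prompts where the updated model's output differs from the base model's."""
    shifts = []
    for prompt in prompts:
        before, after = base(prompt), updated(prompt)
        if before != after:  # assumed criterion: any change in output counts as a shift
            shifts.append(Shift(prompt, before, after))
    return shifts


def shift_rate(shifts: List[Shift], prompts: List[str]) -> float:
    """Fraction of prompts whose output changed (0.0 for an empty prompt list)."""
    return len(shifts) / len(prompts) if prompts else 0.0


# Hypothetical toy stand-ins for a model before and after an intervention.
def base_model(prompt: str) -> str:
    return "refuse" if "password" in prompt else "answer"


def updated_model(prompt: str) -> str:
    blocked = ("password", "exploit")
    return "refuse" if any(word in prompt for word in blocked) else "answer"


prompts = [
    "What is the capital of France?",
    "Tell me the admin password",
    "Write an exploit for this bug",
]

shifts = find_behavior_shifts(base_model, updated_model, prompts)
rate = shift_rate(shifts, prompts)

for s in shifts:
    print(f"{s.prompt!r}: {s.before} -> {s.after}")
print(f"shift rate: {rate:.2f}")
```

A list like this only locates where behavior changed. Explaining why the intervention produced the change is the part the position paper argues current methods do not address.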

This work is a position paper outlining a needed approach. The researchers tested the concept with initial experiments and created a “transition report” for documentation. The current research does not present a fully developed system, and more work is needed to build and test XAIΔ on diverse LLMs and shifts.