M11 - AI Explainability
What if we could step inside an LLM and watch it think in real time?
This talk distills the latest research from Anthropic, DeepMind, and OpenAI to present the current state of the art in LLM interpretability.
We’ll start with the modern interpretation of embeddings as combinations of sparse, monosemantic features living in high-dimensional space.
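To make that framing concrete, here is a deliberately toy sketch in the spirit of sparse-autoencoder decompositions; the dimensions, random weights, and top-k sparsity rule are illustrative assumptions, not any lab's actual setup.

```python
import numpy as np

# Toy sketch only: decompose a dense activation vector into a few sparse
# "feature" directions, in the spirit of sparse autoencoders. All sizes,
# weights, and the top-k rule are illustrative assumptions.
rng = np.random.default_rng(0)
d_model, n_features = 64, 512                       # hypothetical dimensions
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)

def encode_sparse(x, k=8):
    """Map a dense activation to a sparse feature vector (top-k ReLU units)."""
    acts = np.maximum(x @ W_enc, 0.0)               # ReLU: a feature fires or stays silent
    keep = np.argsort(acts)[-k:]                    # crude sparsity: keep the k strongest
    sparse = np.zeros_like(acts)
    sparse[keep] = acts[keep]
    return sparse

x = rng.normal(size=d_model)                        # a dense, polysemantic activation
f = encode_sparse(x)
x_hat = f @ W_dec                                   # approximate reconstruction from few features
print("active features:", int(np.count_nonzero(f)), "of", n_features)
```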
From there, we’ll explore emerging techniques such as circuit tracing and attribution graphs, and see how researchers reconstruct the computational pathways behind behaviors like multilingual reasoning, refusals, and hallucinations.
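As an intuition pump for attribution graphs (not the actual research tooling), the sketch below builds a tiny directed graph whose edges score how much each hypothetical upstream feature contributed to a downstream output feature; every feature name, weight, and threshold is invented for illustration.

```python
# Toy sketch only: an "attribution graph" as a directed graph whose edges
# record how strongly an upstream feature contributed to a downstream one
# for a single prompt. All names and numbers here are made up.
upstream_activations = {
    "token: 'Texas'": 1.8,
    "relation: 'capital of'": 1.3,
    "language: English": 0.9,
}
# Hypothetical feature-to-feature influence weights (upstream -> downstream).
influence = {
    ("token: 'Texas'", "output: 'Austin'"): 0.9,
    ("relation: 'capital of'", "output: 'Austin'"): 0.7,
    ("language: English", "output: 'Austin'"): 0.05,
}

# Edge attribution ~= upstream activation * influence weight; weak edges are
# pruned, roughly how tracing tools keep graphs readable.
edges = []
for (src, dst), w in influence.items():
    score = upstream_activations[src] * w
    if abs(score) > 0.2:
        edges.append((src, dst, round(score, 2)))

for src, dst, score in sorted(edges, key=lambda e: -e[2]):
    print(f"{src} --[{score}]--> {dst}")
```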
We’ll also look at new evidence suggesting that models may have limited forms of introspection—clarifying what they can, and crucially cannot, reliably report about their internal processes.
Finally, we’ll connect these “microscopic” insights to real engineering practice: how feature-level understanding can improve debugging, safety, and robustness in deployed AI systems, and where current methods still fall short.
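One way feature-level understanding translates into practice is activation steering: nudging an activation along an identified feature direction to amplify or suppress a behavior while debugging. The sketch below is a minimal illustration under assumed data; the direction, scale, and activation are not taken from any real model.

```python
import numpy as np

# Toy sketch only: "feature steering" as adding a scaled feature direction
# to a dense activation, the kind of intervention that feature-level
# understanding enables. The direction and strengths are illustrative.
rng = np.random.default_rng(1)
d_model = 64
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)   # unit-length feature vector

def steer(activation, direction, strength):
    """Shift an activation along a feature direction by `strength`."""
    return activation + strength * direction

x = rng.normal(size=d_model)
x_more = steer(x, feature_direction, +4.0)   # amplify the feature
x_less = steer(x, feature_direction, -4.0)   # suppress it
print("projection before:  ", round(float(x @ feature_direction), 2))
print("projection amplified:", round(float(x_more @ feature_direction), 2))
print("projection suppressed:", round(float(x_less @ feature_direction), 2))
```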