It took me many years to arrive at the field of natural language processing and artificial intelligence. Before that, I completed a degree in bioinformatics at Tel Aviv University and worked in the cybersecurity industry. What connected all those stages was my fascination with solving mysteries — finding meaningful patterns hidden inside noisy data.
In bioinformatics, we try to decipher DNA, a long sequence of molecules that encodes our hereditary information. In my work in industry, I analyzed cyberattacks on large organizations, tracking digital evidence and investigating unfamiliar malicious files. In both cases, there is a deep sense of satisfaction when meaning suddenly emerges from a sequence of symbols or events that initially appears random and incomprehensible.
Large language models such as ChatGPT and Gemini are the most difficult puzzle I have studied so far. These systems are based on artificial neural networks inspired by processes in the human brain and encode vast amounts of knowledge about language and the world. In practice, such a network is an extremely complex mathematical function with trillions of parameters, mapping the text we provide as input to the text it generates as output.
When we ask a question like, “Who was the country’s first prime minister?” the model performs an internal computation, retrieving knowledge embedded in its parameters and producing an answer. But how exactly does this process work? What knowledge is stored inside the model, and how is it accessed during computation? These questions become increasingly critical as we rely more heavily on these systems in everyday life.
In my laboratory, we study a field known as interpretability, which seeks to explain the internal computations of neural-network-based models. Just as neuroscientists attempt to understand how the human brain functions, we aim to understand how artificial intelligence systems operate. This may surprise some readers — after all, humans created these models. But because they are trained through data-driven learning algorithms, there is no simple way to know what they have actually learned. I often compare this to infants: they are exposed to enormous amounts of information every day, yet we cannot directly observe what they know or how they think.
To better understand and control language models, we develop advanced methods that allow us to peer into the model’s “brain” and observe what happens inside. We study how knowledge is encoded in the model’s parameters and how it is used during computation. For example, we have found that certain parameters function as a kind of memory, storing information that the model retrieves much like a dictionary definition.
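To make this "dictionary" picture concrete, here is a minimal toy sketch in Python (using PyTorch). The matrices, sizes and random weights are invented purely for illustration; they are not taken from any real model or from our papers.

```python
import torch

# Toy illustration of viewing a feed-forward layer as a key-value memory.
# All sizes and weights here are invented placeholders, not a real model.
d_model, d_mem = 8, 32
torch.manual_seed(0)

W_keys = torch.randn(d_mem, d_model)    # each row acts as a "key" pattern
W_values = torch.randn(d_mem, d_model)  # each row is a stored "value"

x = torch.randn(d_model)                # one token's hidden representation

# How strongly the input matches each key (the layer's activations).
match = torch.relu(W_keys @ x)          # shape: (d_mem,)

# The output is a weighted sum of stored values: strongly matched keys
# contribute their values, much like looking up entries in a dictionary.
out = match @ W_values                  # shape: (d_model,)

# Which memory slots were "retrieved" most strongly for this input?
print(torch.topk(match, k=3).indices.tolist())
```

Roughly speaking, in a real transformer the analogous roles are played by the two weight matrices of each feed-forward layer, with thousands of such slots in every layer.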
To understand how information flows during computation, we intervene in the model in various ways and examine how those interventions affect its output. If we identify a region responsible for a specific type of knowledge, we can amplify or erase that knowledge, effectively steering the model's behavior.
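As a rough illustration of this kind of intervention, the sketch below zeroes out a few hidden units in a toy network and measures how the output shifts. The network, the chosen units and the hook are placeholders for illustration only, not the actual tools used on large language models.

```python
import torch
import torch.nn as nn

# Toy intervention experiment: suppress part of a hidden representation and
# compare the output before and after. In real interpretability work the
# hook would target a specific layer or component suspected of carrying a
# particular piece of knowledge.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
x = torch.randn(1, 8)

baseline = model(x)

suppressed_units = [0, 1, 2]  # hypothetical "region" to erase

def erase(module, inputs, output):
    # Zero out the chosen hidden units; returning a tensor replaces the output.
    output = output.clone()
    output[:, suppressed_units] = 0.0
    return output

handle = model[1].register_forward_hook(erase)  # hook on the ReLU's output
intervened = model(x)
handle.remove()

# How much did the intervention change the model's output?
print((baseline - intervened).norm().item())
```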
Beyond developing interpretability methods, we also use them to address broader questions. Recently, my lab introduced a technique that allows specific knowledge to be precisely removed from a model with minimal impact on its other abilities. Such targeted deletion can be useful for limiting topics a model discusses or preventing harmful or offensive language.
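Purely as an illustration of the general idea of a targeted edit, and not of the specific technique we published, one can picture deleting a fact by zeroing the memory slots associated with it in the toy key-value sketch above; the slot indices here are arbitrary placeholders.

```python
import torch

# Toy "forgetting": zero the value vectors of the memory slots that encode
# one piece of information, leaving every other slot untouched.
torch.manual_seed(0)
d_model, d_mem = 8, 32
W_keys = torch.randn(d_mem, d_model)
W_values = torch.randn(d_mem, d_model)

slots_to_forget = [5, 17]            # hypothetical slots tied to one fact
W_values[slots_to_forget] = 0.0      # targeted edit; other slots unchanged

x = torch.randn(d_model)
out = torch.relu(W_keys @ x) @ W_values  # forward pass after the edit
print(out.shape)
```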
In another study, conducted with Prof. Liat Mudrik of Tel Aviv University and Dr. Ariel Goldstein of the Hebrew University of Jerusalem, we use interpretability tools to track the "thought processes" of language models as they encounter input that contradicts their internal knowledge. Based on these analyses, we examine whether models show indicators of awareness similar to those found in humans: the ability to consciously detect conflicting information and update beliefs accordingly.
Looking ahead, I hope we will develop interpretability methods that are more precise, consistent and efficient. The field still struggles to explain long chains of computation, which are essential for solving complex problems that require reasoning and the use of external tools. In a world where AI agents perform tasks on our behalf, it is crucial that we can “open the system” when needed to understand what is happening inside and why certain decisions are made.
Today, people often rely on explanations generated by the models themselves, but research has shown that such explanations do not always faithfully reflect the model's internal computations. Our own readings of those explanations may also diverge from how the model actually uses information. My goal is to use insights from interpretability research to help develop AI systems with stronger cognitive abilities that remain transparent and controllable, systems that ultimately serve humans rather than confuse them.
Dr. Mor Geva is a senior lecturer at the Blavatnik School of Computer Science and Artificial Intelligence at Tel Aviv University.


