OpenAI is developing a tool to provide insight into the “black box” workings of large language models (LLMs), such as those behind its own ChatGPT. The tool aims to automatically identify which components of an LLM are responsible for specific behaviours.
The code to run the tool has been released as open source on GitHub. William Saunders, the interpretability team manager at OpenAI, explained that the company wants to anticipate problems that could arise with AI systems and to ensure that its models can be trusted.
OpenAI’s tool is designed to break a model down into its individual components and examine the behaviour of each neuron. It runs text sequences through the model under study, records where individual neurons activate strongly, and then asks OpenAI’s latest text-generating AI model, GPT-4, to produce a natural-language explanation of what each neuron responds to.
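As a rough sketch of how that explanation step might look in code, the fragment below formats a neuron’s per-token activations into a prompt for an explainer model. The helper names and prompt format are assumptions for illustration, not the released tool’s actual interface.

```python
# A minimal sketch of the explanation step, assuming a hypothetical
# call_gpt4() helper; the prompt layout and function names here are
# illustrative rather than the released tool's actual interface.
from typing import List, Tuple

TokenActivation = Tuple[str, float]  # (token, normalised activation strength)

def build_explanation_prompt(excerpts: List[List[TokenActivation]]) -> str:
    """Render text excerpts with per-token activations for the explainer model."""
    lines = ["Explain, in one sentence, what this neuron responds to.", ""]
    for i, excerpt in enumerate(excerpts, start=1):
        rendered = " ".join(f"{token}({activation:.1f})" for token, activation in excerpt)
        lines.append(f"Excerpt {i}: {rendered}")
    return "\n".join(lines)

def call_gpt4(prompt: str) -> str:
    """Placeholder for a call to the explainer model (e.g. GPT-4)."""
    raise NotImplementedError("Wire this up to an actual LLM API.")

def explain_neuron(excerpts: List[List[TokenActivation]]) -> str:
    """Return a natural-language explanation of one neuron's behaviour."""
    return call_gpt4(build_explanation_prompt(excerpts))
```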
To test the accuracy of an explanation, the tool then simulates how a neuron matching that explanation would respond to text sequences and compares the simulated activations with the neuron’s real ones. The tool has been used to generate explanations for all 307,200 neurons in OpenAI’s GPT-2 model, and the dataset containing these explanations has been released alongside the tool’s code.
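The scoring step can be sketched in a similar way: an explanation is judged by how well activations simulated from it track the neuron’s real activations. The helper below is again hypothetical, and the correlation-based score stands in for the tool’s exact metric.

```python
# A minimal sketch of the scoring step, assuming a hypothetical
# simulate_activations() helper backed by a simulator model such as GPT-4;
# the correlation-based score illustrates the idea rather than the exact metric.
from statistics import correlation  # requires Python 3.10+
from typing import List

def simulate_activations(explanation: str, tokens: List[str]) -> List[float]:
    """Placeholder: ask the simulator model how strongly a neuron matching
    `explanation` would fire on each token of this text."""
    raise NotImplementedError("Call the simulator model here.")

def score_explanation(explanation: str,
                      tokens: List[str],
                      real_activations: List[float]) -> float:
    """Compare simulated activations with the neuron's real activations;
    a higher correlation means the explanation predicts the neuron better."""
    simulated = simulate_activations(explanation, tokens)
    return correlation(simulated, real_activations)
```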
Researchers say that tools like this could eventually be used to enhance an LLM’s performance, reducing bias and toxicity. However, the tool is still in its early stages, and the researchers acknowledge that there is a long way to go before it is useful.
The tool produced explanations it was confident in for only about 1,000 neurons, a small fraction of the total. While some might argue that the tool is simply an advertisement for GPT-4, the researchers insist that this is not the case. Jeff Wu, who leads OpenAI’s scalable alignment team, said that the tool’s use of GPT-4 is incidental and, if anything, shows GPT-4’s weaknesses in this area. He also said that the tool was not created with commercial applications in mind and could potentially be adapted for use with other LLMs besides GPT-4.
Despite the tool’s limitations, the researchers hope that it will open up a new avenue for addressing interpretability in an automated way.
They aim to provide good explanations not only of what neurons are responding to but also of the overall behaviour of these models, including how specific neurons affect others. While larger, more complex models present additional challenges, the researchers believe that the tool could be adapted to address these in time.