On Sept. 12, OpenAI announced a new ChatGPT model that the company says is substantially better at math and science than previous versions, which struggled with reasoning. An earlier model scored just 13% on the qualifying exam for the International Mathematical Olympiad (the top high school math competition). The new model, called “o1,” raised that score to 83%.
Niloofar Mireshghallah, a University of Washington postdoctoral scholar in the Paul G. Allen School of Computer Science & Engineering, studies the privacy and societal implications of large language models, such as ChatGPT. UW News spoke with her about why math and reasoning have so challenged these artificial intelligence models and what the public should know about OpenAI’s new release.
ChatGPT and other LLMs work by predicting what word comes next with great fluency. Why have math and reasoning been so hard for LLMs?
Niloofar Mireshghallah: There are two main reasons. One is that it is hard to “figure out” rules and principles when the model does next-word prediction. To do math, you need to go back and forth a bit and deduce. Regarding more logical or commonsense reasoning, another reason for difficulty is that, as my advisor Yejin Choi says, common sense is like dark matter. It’s there, but we don’t see it or say it. We know that the door to the fridge shouldn’t be left open, but there is little text saying that. If there is no text for something, models won’t pick it up. The same goes for social norms and other forms of reasoning!
Jakub Pachocki, OpenAI’s chief scientist, told The New York Times: “This model can take its time. It can think through the problem — in English — and try to break it down and look for angles in an effort to provide the best answer.” Is this a big shift? Is this new model doing something closer to “thinking”?
NM: This whole “take its time” is a simplification of what is happening, which we call “test-time computation.” Up until now, big companies would scale models by sizing up both the models and the training data. But the companies might have reached a saturation point there: there is no more pre-training data, and sizing up models may not help us much more. This investment in test time helps the model do internal reasoning, so it can try to decompose problems and do multiple iterations. This is called chain-of-thought reasoning, which is like showing your work in a math problem, but for language and thinking tasks. Instead of just giving a final answer, the AI works step by step, writing down each step of its reasoning process.
Imagine you’re asked to solve a word problem: “If Sally has 3 apples and gives 2 to her friend, how many does she have left?” A normal AI response might just say: “1 apple.”
But with chain-of-thought reasoning, it would look more like this:
- Sally starts with 3 apples
- She gives away 2 apples
- To find out how many are left, we subtract: 3 − 2 = 1
- Therefore, Sally has 1 apple left
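The contrast above can be sketched in a few lines of Python. This is an illustrative sketch, not OpenAI’s actual API: the `ask_model()` stub and the prompt wording are assumptions standing in for a real LLM call, and the stub simply echoes the two answer styles described in the interview.

```python
# Minimal sketch contrasting a direct prompt with a chain-of-thought prompt.
# ask_model() is a hypothetical stand-in for a real LLM call.

QUESTION = ("If Sally has 3 apples and gives 2 to her friend, "
            "how many does she have left?")

direct_prompt = f"{QUESTION}\nAnswer:"

cot_prompt = (
    f"{QUESTION}\n"
    "Let's think step by step, writing down each step of the reasoning "
    "before giving the final answer.\nAnswer:"
)

def ask_model(prompt: str) -> str:
    """Stub model: returns a terse answer, or a worked-out one if the
    prompt asks for step-by-step reasoning."""
    if "step by step" in prompt:
        return ("Sally starts with 3 apples. She gives away 2. "
                "3 - 2 = 1, so she has 1 apple left.")
    return "1 apple."

print(ask_model(direct_prompt))   # terse final answer only
print(ask_model(cot_prompt))      # reasoning steps, then the answer
```

The only real change between the two calls is the prompt: asking the model to write out intermediate steps is what makes the reasoning visible and checkable.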
This step-by-step process helps in a few ways: It makes the AI’s reasoning more transparent, so we can see how it arrived at its answer and, in the case of a mistake, potentially spot where things went wrong. Chain-of-thought reasoning is especially useful for more complicated tasks, such as answering multi-step questions, solving math problems or analyzing situations that require several logical steps.
In a sense, the model can test its own response, as opposed to just doing next-word prediction. One problem before was that if a model predicted one word wrong, it kind of had to commit, and it would get derailed because all its following predictions are based in part on that wrong prediction.
This form of chain-of-thought reasoning and response generation is the closest procedure we have to human thinking so far. We are not entirely sure how this internal reasoning fully works, but now the model can take the time to test its own response. Researchers have shown models finding their own mistakes and ranking their own responses when offered multiple choices. For instance, in a recent paper we showed that LLMs would spoil birthday surprises when generating a response, but when asked if their response is appropriate, they would realize the mistake. So this self-testing can help the model come up with a more logical response.
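The self-testing loop described above can be sketched as a generate-critique-revise cycle. Everything here is a toy assumption: `generate()`, `critique()`, and `revise()` are hypothetical stubs standing in for separate LLM calls, with the first draft containing a deliberate arithmetic slip so the check has something to catch.

```python
# Toy sketch of "self-testing": draft an answer, have the model check it,
# and revise if the check fails. All three functions are stubs; a real
# system would make an LLM call at each step.

def generate(question: str) -> str:
    # Stub first draft with a deliberate arithmetic slip.
    return "3 - 2 = 2, so Sally has 2 apples left."

def critique(draft: str) -> bool:
    # Stub check: does the draft contain the correct arithmetic?
    return "3 - 2 = 1" in draft

def revise(question: str, draft: str) -> str:
    # Stub revision that fixes the slip.
    return "3 - 2 = 1, so Sally has 1 apple left."

def answer_with_self_check(question: str, max_rounds: int = 2) -> str:
    draft = generate(question)
    for _ in range(max_rounds):
        if critique(draft):
            break  # the draft passed its own check
        draft = revise(question, draft)
    return draft

print(answer_with_self_check(
    "If Sally has 3 apples and gives 2 away, how many are left?"))
```

The design point is that the model spends extra compute at test time looping over its own output, rather than committing to the first token-by-token draft.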
What should people know and pay attention to when companies announce new AI models like this?
NM: I think one thing people should be careful about is to still fact-check the model’s outputs, and not be fooled by the model “thinking” and taking its time. Yes, we are getting better responses, but there are still failure modes.
For more information, contact Mireshghallah at niloofar@cs.washington.edu.