Recent research from Apple's AI team challenges the assumption that large language models (LLMs) can handle mathematical reasoning as easily as their benchmark scores suggest. In their paper, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," the authors show that even the most advanced models struggle with math problems that require genuine logical reasoning, especially when subtle modifications are introduced.
For example, take this simple word problem: Oliver picks 44 kiwis on Friday, 58 on Saturday, and on Sunday he picks double the number he picked on Friday. How many kiwis does he have? The correct answer is 190 (44 + 58 + 44 × 2). But when an irrelevant detail was added, that five of the kiwis picked on Sunday were slightly smaller than average, even the best LLMs stumbled. OpenAI's o1-mini, for example, responded:
"On Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis."
The researchers found that LLMs often fail when faced with minor changes like this because they don't truly understand the problems; they are reproducing patterns they have seen during training. They can solve familiar problems, but their accuracy drops when the same problems are presented with slight variations in names, numbers, or wording, which suggests the models are not reasoning so much as mimicking.
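The paper probes this by generating many variants of each problem from templates, changing names and numbers and sometimes appending a clause that looks relevant but is not. Below is a minimal sketch of that idea; the template, names, and values are our own illustration, not the paper's actual GSM-Symbolic code.

```python
import random

# Hypothetical template in the spirit of the paper's GSM-Symbolic / GSM-NoOp sets:
# names and numbers are sampled, and an irrelevant clause can be appended.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday, {sat} on Saturday, and double the "
    "number picked on Friday on Sunday.{noop} How many kiwis does {name} have?"
)
NOOP_CLAUSE = " Five of the kiwis picked on Sunday were a bit smaller than average."

def make_variant(add_noop: bool) -> tuple[str, int]:
    """Return a perturbed problem and its (unchanged) correct answer."""
    name = random.choice(["Oliver", "Mia", "Ravi"])
    fri = random.randint(20, 60)
    sat = random.randint(20, 60)
    question = TEMPLATE.format(
        name=name, fri=fri, sat=sat,
        noop=NOOP_CLAUSE if add_noop else "",
    )
    answer = fri + sat + 2 * fri   # the irrelevant clause never affects the count
    return question, answer

question, answer = make_variant(add_noop=True)
print(question)
print("expected answer:", answer)
```

A model that genuinely reasons should score the same with or without the appended clause; the reported drops when such clauses are added are what lead the authors to the mimicking conclusion.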
The research also shows that while LLMs can generate responses that appear logical, they lack genuine understanding, much as a model can correctly reply "I love you, too" without grasping what love means. The problem arises when LLMs are marketed as capable of complex reasoning when they are, for the most part, repeating learned patterns.
As AI becomes more common in daily life, it's important to recognize both the capabilities and the limitations of these models. This research highlights the need for continued work to address these shortcomings, while reminding us that human intelligence, with all its complexity, is far from being fully replicated by machines, at least for now.