Bottom line up front: Appleās āIllusion of Thinkingā paper claimed that AI reasoning models catastrophically fail at complex tasks, but methodological flaws and suspicious timing suggest the study reveals more about corporate strategy than AI limitations.
On June 6, 2025, Appleās research team led by Mehrdad Farajtabar dropped a bombshell: a study claiming that state-of-the-art AI reasoning models experience ācomplete accuracy collapseā when faced with complex puzzles. The paper, titled āThe Illusion of Thinking,ā tested models like OpenAIās o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet on classic logic problems, concluding that what appears to be reasoning is actually sophisticated pattern matching.
But Appleās timing was suspect. The paper appeared just days before WWDC 2025, where the company was expected to showcase limited AI advancement compared to competitors. What followed was one of the most contentious academic controversies in recent AI history.
The LinkedIn Hype Train Derails
Before technical experts could properly evaluate Appleās methodology, business influencers on LinkedIn had already picked sides. The speed of these reactions reveals how modern tech controversies unfold in real-time, with strategic narratives racing ahead of scientific rigor.
Dion Wiggins characterized Appleās research as corporate manipulation, arguing that Apple had āhijacked the message, erased the messengers, and timed it for applause.ā He contended that Apple couldnāt lead on innovation, so it tried to steal relevance by reframing the entire AI conversation.
Nicolas Ahar argued that Apple had ālit the fuseā on Silicon Valleyās ā$100 billion AI reasoning bubble,ā characterizing it as a āclassic Apple moveā where the company watches competitors burn through venture capital before taking advantage of the situation.
Michael Kisilenko framed the sentiment as āIf you canāt beat them, debunk them,ā describing the research as āsophisticated damage control from a company that bet wrong on AI.ā
Meanwhile, Saanya Ojha noted the irony of the situation, observing that the paper had āstrong āguy on the couch yelling at Olympic athletesā energyā and pointing out that Apple was critiquing reasoning approaches while having āpublicly released no foundation modelā and being āwidely seen as lagging in generative AI.ā
Gary Marcus Claims Victory
Gary Marcus, the longtime AI skeptic, saw vindication. On June 7, he published āA knockout blow for LLMs?ā arguing that Appleās findings validated his decades of criticism about neural networks dating back to 1998.
Marcus highlighted what he saw as devastating details: large reasoning models (LRMs) failed on Tower of Hanoi with just 8 discs, a puzzle whose optimal solution runs to only 255 moves, well within token limits. Models couldn't execute basic algorithms even when explicitly provided. These were problems that first-year computer science students could solve, yet billion-dollar AI systems collapsed completely.
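For context, the algorithm in question really is first-week-of-CS material: the standard recursive solution fits in a few lines. The sketch below is my own illustration in Python, not code from Apple's test harness or Marcus's post.

```python
def hanoi(n, src="A", aux="B", dst="C"):
    """Yield the optimal move sequence for an n-disc Tower of Hanoi (2**n - 1 moves)."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)   # park the n-1 smaller discs on the spare peg
    yield (src, dst)                         # move the largest disc to its destination
    yield from hanoi(n - 1, aux, src, dst)   # restack the smaller discs on top of it

print(len(list(hanoi(8))))   # 255, the 8-disc case Marcus cites
```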
Marcus revealed private communications with Apple researchers, including co-author Iman Mirzadeh, who confirmed that models failed even when given solution algorithms. If these systems couldnāt solve problems that Herbert Simon tackled with 1950s technology, Marcus contended, this raised serious questions about the prospects for artificial general intelligence.
The Devastating Technical Rebuttal
Then came the academic equivalent of a precision strike. On June 13, Alex Lawsen from Open Philanthropy published his rebuttal: "The Illusion of the Illusion of Thinking," co-authored with Anthropic's Claude Opus model.
Lawsenās credentials made his critique particularly damaging. A Senior Program Associate at Open Philanthropy focusing on AI risks, he holds a Master of Physics from Oxford and has deep experience in AI safety research. This wasnāt a corporate hit jobāit was rigorous academic analysis.
Lawsen identified three critical flaws that undermined Appleās headline-grabbing conclusions:
Token Budget Deception: Models were hitting output-token limits at precisely the problem sizes where Apple claimed they were "collapsing." According to Lawsen's analysis, Claude would explicitly flag when it was approaching its token limit, stating in its output that it was cutting the Tower of Hanoi move list short to save tokens. Apple's automated evaluation couldn't distinguish between a reasoning failure and a practical output constraint.
Impossible Puzzle Problem: Most damaging of all, Apple's River Crossing experiments included mathematically unsolvable instances: with 6 or more actor/agent pairs and a boat capacity of 3, no valid crossing sequence exists (a brute-force check of this claim appears just below). Models were penalized on these instances even when they correctly recognized them as impossible, which Lawsen characterized as equivalent to penalizing a SAT solver for correctly reporting that a formula is unsatisfiable.
Evaluation Script Bias: Appleās system judged models solely on complete, enumerated move lists, unfairly classifying partial solutions as total failures even when the reasoning process was sound.
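The impossibility claim behind the second flaw can be checked mechanically. The sketch below is my own brute-force search, assuming the constraint as the paper describes it (no actor may share a bank or the boat with another pair's agent unless their own agent is also present); the names and structure are mine, not code from Apple or Lawsen. Under this formulation it confirms the classical result: five pairs can cross with a three-seat boat, six cannot.

```python
from collections import deque
from itertools import combinations

def safe(group):
    # A bank (or the boat) is safe if no actor is present with another
    # pair's agent while their own agent is absent.
    actors = {i for role, i in group if role == "actor"}
    agents = {i for role, i in group if role == "agent"}
    return all(not agents or i in agents for i in actors)

def solvable(pairs, capacity):
    # Breadth-first search over states (people on the left bank, boat side).
    people = frozenset((role, i) for i in range(pairs) for role in ("actor", "agent"))
    start = (people, 0)                      # everyone on the left, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left and boat == 1:           # everyone has crossed
            return True
        bank = left if boat == 0 else people - left
        for size in range(1, capacity + 1):
            for passengers in combinations(bank, size):
                group = frozenset(passengers)
                new_left = left - group if boat == 0 else left | group
                state = (new_left, 1 - boat)
                if state not in seen and safe(group) and safe(new_left) and safe(people - new_left):
                    seen.add(state)
                    queue.append(state)
    return False

print(solvable(5, 3))   # expected True: five pairs can cross
print(solvable(6, 3))   # expected False: the instances Apple scored as failures
```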
Lawsen demonstrated the flaw by asking models to generate recursive solutions instead of exhaustive move lists. Claude, Gemini, and OpenAI's o3 successfully produced algorithmically correct solutions for 15-disc Hanoi problems, far beyond the complexity where Apple reported zero success.
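The contrast with an exhaustive move list also explains the token-budget point above. A rough back-of-envelope estimate follows; the tokens-per-move figure is my own assumption for illustration, not a number from Apple's paper or Lawsen's rebuttal.

```python
# Estimated output size of a fully enumerated Tower of Hanoi move list.
# ASSUMPTION: roughly 10 output tokens per written move, a ballpark figure of my own.
TOKENS_PER_MOVE = 10

for discs in (8, 10, 12, 15):
    moves = 2 ** discs - 1
    print(f"{discs} discs: {moves} moves, ~{moves * TOKENS_PER_MOVE} output tokens")

# 8 discs: 255 moves, ~2550 output tokens
# 10 discs: 1023 moves, ~10230 output tokens
# 12 discs: 4095 moves, ~40950 output tokens
# 15 discs: 32767 moves, ~327670 output tokens
```

A recursive program like the one sketched earlier expresses the same 15-disc solution in a few lines, which is exactly the compact form Lawsen asked the models to produce.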
Appleās Conspicuous Silence
Apple has remained notably quiet about the methodological criticisms, issuing no official statement that addresses Lawsen's technical analysis or other researchers' concerns. The silence was especially conspicuous given the company's own uneven AI track record: while Apple questioned whether AI models could truly "think," many users were still wondering whether Siri could truly "listen."
The only acknowledgment came indirectly at WWDC 2025, where Apple executives admitted to significant delays in AI development. Craig Federighi and Greg Joswiak stated that features āneeded more time to meet our quality standardsā and didnāt āwork reliably enough to be an Apple product.ā The AI-powered Siri upgrade was pushed to 2026.
Industry analysts described the event as showing āsteady but slow progressā and being ālargely unexciting.ā The limited AI announcements came amid intense competition from Googleās I/O conference, which showcased massive new AI features.
The Broader Pattern of Questionable Research
The controversy becomes more troubling when viewed alongside Appleās research history. The same team led by Mehrdad Farajtabar previously published the GSM-Symbolic paper, which faced similar methodological criticisms but was still accepted at major conferences despite identified flaws.
Researcher Alan Perotti noted that Appleās latest paper was āunderwhelmingā and drew concerning parallels to the teamās previous questionable work. This pattern raises serious questions about peer review standards for corporate AI research.
Meanwhile, Apple has traditionally lagged behind competitors like Google, Microsoft, and OpenAI in AI development, focusing instead on privacy-preserving, on-device processing rather than cloud-based solutions.
The Academic Pile-On
As technical experts examined Apple's methodology, the scientific community's response grew increasingly harsh. Conor Grennan, after reading Lawsen's rebuttal, voiced his irritation with Apple and said the study had been shown to be seriously flawed.
Sergio Richter published a critique titled āThe Illusion of Thinking (Apple, 2025) ā A Masterclass in Overclaiming,ā arguing that Apple proved nothing and characterizing the research as āa branding heistā rather than genuine science. He noted the irony of Apple criticizing other models while deploying them directly into Apple Intelligence.
The research community began viewing Apple's study as an attack on legitimate scientific work, with particular concern about its dismissal of prior research without adequate justification and its failure to seek any external review, including from the organizations whose models it critiqued.
What This Reveals About AI Research
The Apple controversy exposes uncomfortable truths about the intersection of corporate interests and scientific research in the AI field. When a companyās competitive position influences its research conclusions, can we trust the science?
As the LinkedIn reactions demonstrated, business influencers formed strong opinions before technical experts could evaluate the methodology. In modern tech controversies, strategic narratives race ahead of scientific rigor, with business implications driving the conversation.
This matters because the stakes are enormous. If Appleās research had been sound, it would have fundamentally challenged assumptions about AI capabilities and potentially influenced billions in investment decisions. Instead, the methodological flaws suggest the study was more about corporate positioning than scientific advancement.
The Lasting Damage
While subsequent analysis largely debunked Appleās claims, the initial narrative spread faster than the corrections. YouTube channels titled their coverage āApple Just SHOCKED Everyone: AI IS FAKE!?ā generating millions of views before the technical rebuttals emerged.
The controversy highlights the need for stronger norms around corporate AI research: independent peer review requirements, clearer disclosure of conflicts of interest, and more rigorous standards for experimental design in capability assessment.
Perhaps most importantly, it demonstrates that extraordinary claims about AI limitations, like extraordinary claims about AI capabilities, require extraordinary evidenceāsomething Appleās study failed to provide.
The Real Illusion
The true illusion wasnāt in AI thinkingāit was in thinking that Appleās research represented objective science rather than strategic positioning. The irony wasnāt lost on observers that Appleāwhose Siri still struggles to understand basic requests after 13 yearsāwas now positioning itself as the authority on AI reasoning capabilities. The company couldnāt lead on AI innovation, so it attempted to lead the conversation by undermining competitorsā achievements.
As one LinkedIn observer noted, Apple didnāt discover new limitations in AI reasoningāthey repackaged existing criticisms, timed them strategically, and presented them as breakthrough research. The real breakthrough was in corporate messaging, not scientific understanding.
The field will likely see more such controversies as competitive pressures intensify. But if this episode leads to better standards for AI evaluation and more transparency in corporate research motivations, perhaps some good will emerge from this academic battlefield.
Key takeaway: Appleās study said more about corporate strategy than AI limitations, serving as a cautionary tale about distinguishing genuine scientific inquiry from weaponized research in the high-stakes AI competition.