The Architecture Decision Point: Why Current AI May Need a Complete Rethink

NYAI AI Breakthroughs & Limitations: The Great Architecture Debate
Hosted by: New York AI (NYAI)
Date: July 17, 2025
Duration: 180+ minutes
Host: Tone Fonseca
Speakers: Tone Fonseca, Andrea Jordan, Rose Kudlac, Jody Solomon, Phineas Samuel
The conversation began with warm, collaborative energy as participants solved technical issues together (Magnus helping Tone pin comments), setting a tone of mutual support that would characterize the entire session. Over three intense hours, the NYAI community dove deep into one of AI’s most fundamental questions: are we simply scaling what works, or do we need entirely new architectures to achieve true artificial general intelligence?
The meeting opened with technical troubleshooting that quickly gave way to enthusiastic intellectual exploration; sentiment stayed predominantly positive and neutral throughout as the group tackled complex theoretical concepts with remarkable engagement.
The Sandcastle vs. Stone Castle Framework
Tone Fonseca opened with a compelling metaphor that would thread through the entire discussion. He distinguished between “sandcastle representations” created by gradient descent and “stone castle representations” built through compositional evolution:
“A sandcastle is just coordinating tremendous amount of degrees of freedom that have nothing to do with one another, so that outwardly they resemble the form of a castle. But there’s really not much of an intermediate step to a sand castle.”
In contrast, stone castles are built from functional components that maintain value even when separated. Most importantly, if parts break during battle, “the parts that are left are composable enough that they can almost instruct how to rebuild those broken parts of the castle.”
This metaphor perfectly captures the core tension in modern AI: current models achieve impressive results by coordinating millions of parameters through gradient descent, but they lack the compositional structure that would make them robust and truly adaptable.
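To make the contrast concrete, here is a toy Python sketch (my illustration, not code from the meetup; `half_wave` and `antisymmetric_extend` are invented names). Gradient descent co-adapts every free parameter to one final output, while the compositional build reaches the same form from parts that keep their meaning on their own:

```python
import numpy as np

# Target form: one full period of a sine wave.
target = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))

# "Sandcastle": 64 independent parameters, co-adapted by gradient
# descent until they outwardly resemble the target. No intermediate
# piece is reusable on its own.
params = np.zeros(64)
for _ in range(200):
    params -= 0.05 * 2 * (params - target)  # gradient of squared error

# "Stone castle": named components that retain value when separated.
def half_wave(n):
    """One half-period of a sine: meaningful outside this composition."""
    return np.sin(np.linspace(0, np.pi, n, endpoint=False))

def antisymmetric_extend(x):
    """A symmetry operator: append the sign-flipped copy of a pattern."""
    return np.concatenate([x, -x])

composed = antisymmetric_extend(half_wave(32))
print(np.allclose(composed, target))           # True, and rebuildable
print(np.allclose(params, target, atol=1e-3))  # True, but opaque
```

If half of `composed` were lost, the surviving components say how to rebuild it; if half of `params` were lost, nothing about the remainder instructs the repair.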
Grok 4: The First Signs of Fluid Intelligence?
The discussion turned to Grok 4’s breakthrough performance, which has been making waves in the AI community. Tim Sweeney’s endorsement carried particular weight:
“All these things, and I think it’s actually general intelligence.” — Tim Sweeney
What makes this significant is that Sweeney, as CEO of Epic Games, has access to cutting-edge models beyond what’s publicly available. His assessment suggests Grok 4 might represent something genuinely different.
The group analyzed Grok 4’s performance on ARC AGI benchmarks, noting a crucial distinction:
- ARC AGI 1: Clearly separated reasoning models from standard autoregressive models
- ARC AGI 2: Shows Grok 4 achieving the highest performance levels, potentially indicating genuine compositional reasoning
As one participant noted, Grok 4 might be demonstrating “the first signs of genuinely fluid intelligence”: the ability to compose solutions from reusable parts rather than pattern-matching against training data.
The Four Core Papers: A Research Convergence
Tone presented four research papers that, when viewed together, reveal converging evidence about AI’s current limitations:
1. The DGM (Darwin-Gödel Machine) Paper
The paper explores open-ended evolution in AI systems, in which agents modify themselves based on empirical validation rather than formal proofs. The key insight: biological evolution never gets “stuck” on a single agent because it is constantly exploring an open tree of possibilities.
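A minimal sketch of that loop, with toy stand-ins for the paper’s self-modification and validation steps (`mutate` and `evaluate` here are placeholders, not the DGM’s actual mechanisms):

```python
import random

def mutate(agent):
    """Toy self-modification: extend the agent's 'code' by one token."""
    return agent + [random.choice("abc")]

def evaluate(agent):
    """Stand-in for empirical validation against a benchmark."""
    return random.random()

# Open-ended search keeps a growing archive, not a single champion:
# any archived agent can branch, so the tree of possibilities stays open.
archive = [[]]                       # seed agent
for _ in range(100):
    parent = random.choice(archive)  # branch from anywhere in the tree
    child = mutate(parent)
    if evaluate(child) > 0.2:        # keep anything that validates,
        archive.append(child)        # not just improvements on the best

print(f"{len(archive)} agents archived; no single lineage dominates")
```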
2. Kenneth Stanley’s Compositional Evolution Work
Stanley’s research demonstrates the difference between “fractured and tangled representations” (gradient descent) versus “unified factored representations” (evolutionary approaches). The breakthrough insight: when you build systems compositionally, intermediate steps have independent value.
As Tone explained: “Along the route to make the face, one of the things you discover is symmetry. And just think for a second how useful it would be to just have symmetry as something that can be applied.”
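A hypothetical illustration of that point in code (the operator and pattern functions below are my inventions, not Stanley’s): once symmetry exists as a standalone component, it can be applied to any pattern the system discovers later.

```python
import numpy as np

def symmetrize(pattern_fn):
    """Lift any 2-D pattern into its mirror-symmetric version."""
    return lambda x, y: pattern_fn(np.abs(x), y)  # reflect across x = 0

# Two unrelated patterns, both reusing the same discovered component.
blob = lambda x, y: np.exp(-(x - 0.5) ** 2 - y ** 2)
stripes = lambda x, y: np.sin(6 * x) * np.exp(-y ** 2)

xs, ys = np.meshgrid(np.linspace(-1, 1, 9), np.linspace(-1, 1, 9))
eyes = symmetrize(blob)(xs, ys)          # a symmetric pair of blobs
markings = symmetrize(stripes)(xs, ys)   # same operator, different pattern
print(np.allclose(eyes, eyes[:, ::-1]))  # mirror symmetry holds: True
```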
3. Apple’s Reasoning Model Research
Recent Apple research shows that reasoning models actually reduce their reasoning effort as problem complexity increases past a threshold: exactly the opposite of what should happen with true intelligence.
4. Life Is a Function Paper
This work explores the limits of prediction in complex systems, suggesting that some phenomena may be fundamentally irreducible to computational models.
The Compositionality Crisis
The heart of the discussion focused on compositionality - the ability to build complex systems from simpler, reusable parts. Current AI systems excel at many tasks but fail catastrophically when they encounter scenarios requiring genuine compositional reasoning.
Rose asked for clarification on compositionality, leading to one of the session’s most illuminating exchanges. Tone explained that evolution discovered compositionality early, which is why biological systems use fractal patterns:
“Why do you think we fractal all the things? It’s like we fractal like all the internal things because evolution discovered quite early. Oh great. It’s very useful to have a very minimal instruction set that basically just bifurcates in branches, irrespective of whether you’re bifurcating or branching to create tubes in the lungs or a circulatory system to carry blood.”
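The “minimal instruction set” in that quote amounts to a single recursive rule. A toy sketch (illustrative parameter values, not biology):

```python
def branch(length, depth):
    """One reused rule: a tube bifurcates into two shorter tubes."""
    if depth == 0:
        return [length]
    left = branch(length * 0.7, depth - 1)
    right = branch(length * 0.7, depth - 1)
    return [length] + left + right

# Five rounds of the same instruction yield a 63-segment tree, whether
# the "tubes" are airways in a lung or vessels in a circulatory system.
segments = branch(length=1.0, depth=5)
print(len(segments))  # 63
```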
The Benchmark Wars: Understanding What We’re Really Measuring
A significant portion of the discussion examined how AI benchmarks shape our understanding of progress. Rose’s question about who creates benchmarks led to insights about how metrics become institutionalized.
The group explored how different benchmarks serve different purposes:
- SWE-bench & Polyglot: Programming capability assessment
- Humanity’s Last Exam: Broad knowledge evaluation designed to test human-like understanding
- ARC AGI: Abstract reasoning tasks that can’t be solved through pattern matching
The key insight: once a benchmark gains a “critical mass of people that interpret it to be a good display of performance,” it becomes institutionalized regardless of its actual validity.
This connects to a deeper challenge in measuring complex systems: call it a benchmark validity crisis. In biological systems, “fitness” is contextual, multidimensional, and constantly shifting with environmental changes. Our AI benchmarks, by contrast, tend to be narrow, static, and divorced from real-world complexity.
The discussion revealed Moravec’s Paradox operating at the benchmark level: tasks that seem cognitively demanding to humans (like competition mathematics) turn out to be easily automated, while tasks that seem simple (like contextual reasoning in natural environments) remain challenging for AI systems.
This suggests that current benchmarks may be systematically mismeasuring intelligence by focusing on formal, well-defined problems rather than the kind of open-ended, compositional reasoning that characterizes genuine intelligence. The ARC AGI benchmark represents an attempt to address this, but even it may capture only a narrow slice of true compositional reasoning.
The Taleb-Wolfram-Levin Convergence: The meetup mentioned these thinkers in passing, but there’s a deeper convergence here about irreducibility, complexity, and the limits of formal prediction. Nassim Taleb’s work on complex systems, Stephen Wolfram’s computational irreducibility principle, and Michael Levin’s biological intelligence research all point to the same conclusion: intelligent behavior in complex environments may be fundamentally non-reducible to simple metrics.
The Open-Endedness Question
Andrea Jordan raised a crucial question about the practical implications of open-ended search versus bounded optimization. This led to one of the session’s deepest theoretical discussions.
The group explored whether artificial general intelligence requires truly open-ended capabilities or whether bounded systems could achieve generality. Tone argued that open-ended search generates “completely different things than saying there’s an endpoint we want to converge to.”
The key distinction: gradient descent forces “an innumerable number of degrees of freedom to artificially converge to a particular output,” while open-ended evolution accumulates compositional steps that remain valuable independently.
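A toy contrast in code, under assumed definitions of fitness and novelty (the threshold values are arbitrary): bounded optimization converges to its endpoint and stops, while open-ended search accumulates stepping stones with no endpoint at all.

```python
import random

def mutate(x):
    return x + random.gauss(0, 0.3)

# Bounded optimization: everything is judged by distance to one target.
best = 0.0
for _ in range(200):
    candidate = mutate(best)
    if abs(candidate - 5.0) < abs(best - 5.0):  # single endpoint
        best = candidate

# Open-ended search: keep whatever is sufficiently unlike anything seen.
archive = [0.0]
for _ in range(200):
    candidate = mutate(random.choice(archive))
    if min(abs(candidate - a) for a in archive) > 0.25:  # novelty test
        archive.append(candidate)

print(f"bounded: one answer near {best:.1f}; "
      f"open-ended: {len(archive)} distinct stepping stones")
```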
The Alignment-Complexity Catastrophe
Drawing on recent research using sparse autoencoders, Tone highlighted a troubling discovery: instead of finding neat, monosemantic features (one concept per dimension), researchers found superposition, with “features that are embedded in multiple dimensions.”
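A toy numerical illustration of why this happens (the counts are arbitrary, not from the research): when a model must represent more features than it has dimensions, every feature direction necessarily overlaps with others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 12, 4  # more concepts than dimensions

# Random unit directions standing in for learned feature vectors.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# With 12 features in 4 dimensions, "one concept per dimension" is
# impossible: every feature interferes with several others.
overlap = directions @ directions.T
np.fill_diagonal(overlap, 0.0)
print(f"max interference between feature pairs: {np.abs(overlap).max():.2f}")
```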
This complexity explosion led to a stark conclusion about AI alignment:
“My general conjecture is that naive alignment is intractable. I’m not saying that any alignment is not possible, but it’s just not alignment of the provable sort.”
This connects to a fundamental principle in complex systems theory: when systems become sufficiently complex, formal verification becomes computationally intractable. This isn’t unique to AI; it is a challenge in any complex engineered system, from aerospace to biology. Max Tegmark’s research findings represent empirical evidence of this complexity catastrophe occurring in real AI systems.
The implications extend far beyond technical alignment. When we consider that biological intelligence operates through what Michael Levin describes as “nested hierarchies of goal-seeking systems”—from individual cells to tissues to organisms—we see that natural intelligence solved the alignment problem through distributed, contextual goal structures rather than centralized formal verification.
The implication: we may need to approach AI safety more like developmental biology, through distributed goal alignment and emergent constraint satisfaction rather than top-down formal verification.
The Evolution-Intelligence Connection: Beyond Fractal Patterns
Throughout the discussion, parallels between biological evolution and AI development emerged repeatedly. The group noted how biological systems achieve both robustness and adaptability through compositional architecture.
As Tone observed: “The more you get into these waters, the more you notice like, oh, stuff that Michael Levin says is oddly relevant to things that are being said by Kenneth Stanley talking about breeding images.”
This convergence reveals deeper insights about intelligence itself. Michael Levin’s developmental biology research shows that biological intelligence operates through nested hierarchies of goal-seeking systems—from individual cells to tissues to organs to organisms. Each level maintains its own “intelligence” while contributing to higher-level goals. This is precisely the kind of compositional architecture that current gradient descent methods fail to capture.
The fractal patterns Tone mentioned represent just the visible surface of a much deeper principle: substrate independence through compositional constraint. Biology achieved compositionality through specific material constraints (chemistry, physics, thermodynamics). AI might need different but equally constraining “substrate rules” to achieve similar compositional intelligence.
This connects to the Mesa-Optimizer Emergence phenomenon Tone briefly mentioned. Truly intelligent systems might need to develop internal optimization processes that operate at different timescales and abstraction levels—exactly what we see in biological systems where cellular metabolism, tissue development, and organism behavior operate through distinct but coordinated optimization cycles.
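As an analogy only (the objective and names below are invented), nested optimization at two timescales can be sketched as a fast inner gradient loop whose settings are selected by a slower outer search:

```python
def inner_optimize(lr, steps=20):
    """Fast timescale: gradient descent on f(x) = (x - 3)^2."""
    x = 0.0
    for _ in range(steps):
        x -= lr * 2.0 * (x - 3.0)
    return (x - 3.0) ** 2  # loss left over for the outer loop to judge

# Slow timescale: the outer process tunes how the inner process learns,
# the way development tunes cellular dynamics without micromanaging them.
best_lr = min((lr / 100 for lr in range(1, 60)), key=inner_optimize)
print(f"outer loop selected inner learning rate {best_lr:.2f}")  # 0.50
```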
The implication: understanding biological intelligence may be crucial for developing more robust AI systems that can handle the kind of open-ended challenges that biological intelligence navigates effortlessly.
Implications for the Future of AI: The Architecture Decision Point
The conversation revealed several critical insights for AI development:
Current scaling may hit fundamental limits: Simply making models larger may not lead to AGI if the underlying architecture is fundamentally non-compositional.
Evolutionary approaches deserve serious attention: Open-ended search and compositional building blocks may be necessary for true intelligence.
Benchmark diversity is crucial: Relying on narrow metrics may mislead us about genuine progress toward AGI.
Alignment strategies need fundamental rethinking: If AI systems become truly complex, formal verification may be impossible.
The Substrate Independence Principle: Biology achieved compositionality through specific material constraints. AI might need different but equally constraining “substrate rules” to achieve similar intelligence.
The Enterprise Transformation Implications: These architectural insights have direct relevance for how organizations should approach AI adoption. If current models have fundamental compositional limitations, this affects strategic decisions about AI investment, implementation timelines, and risk assessment. Companies building AI transformation strategies need to consider whether they’re betting on architectures that can scale to truly general intelligence or whether they should prepare for paradigm shifts in AI development.
The Research Priority Matrix: The discussion suggests several high-priority research directions:
- Compositional learning architectures that build knowledge from reusable components
- Open-ended search algorithms that don’t require predefined objectives
- Multi-level optimization systems that mirror biological intelligence hierarchies
- Benchmarks that test genuine compositional reasoning rather than pattern matching
- Alignment approaches based on distributed goal structures rather than centralized control
Wrap-Up & Takeaways
This NYAI session exemplified the deep, collaborative thinking that makes the AI community so valuable. The 3-hour discussion maintained remarkable intellectual energy, with participants building on each other’s insights to explore fundamental questions about intelligence, consciousness, and the future of AI.
Key themes that emerged:
- The urgent need for compositional architectures in AI development
- Growing evidence that current approaches may face fundamental scaling limits
- The importance of evolutionary and biological insights for AI research
- The challenge of measuring true intelligence versus pattern matching
Major questions for the field:
- Can we develop AI systems that build knowledge compositionally rather than through gradient descent?
- How do we create benchmarks that truly test reasoning rather than memorization?
- What role should open-ended search play in future AI architectures?
- How can we implement distributed goal alignment systems similar to biological intelligence?
- What are the implications of computational irreducibility for AI safety and control?
The Cross-System Memory Integration Challenge: One emerging research direction involves building AI systems that can maintain knowledge across multiple representational formats and timescales—similar to how biological systems integrate genetic, epigenetic, neural, and behavioral information. This connects to the broader compositionality challenge: truly intelligent systems may need to operate simultaneously across multiple levels of abstraction and optimization.
The discussion reinforced that we’re at a critical juncture in AI development. The choices we make about architectures, benchmarks, and research priorities in the next few years may determine whether we achieve robust, beneficial AGI or hit fundamental limitations with current approaches.
For those interested in diving deeper into these questions, the NYAI community continues to host some of the most thoughtful technical discussions in the AI space. Their willingness to tackle fundamental questions with intellectual rigor makes them an essential voice in the ongoing AI revolution.
Related Reading:
- The Big Ideas So Far: AI, Consciousness, and Transformation at NYC’s Deepest Tech Meetup
- The Vibe Coding Paradox: When Understanding Became Optional
- Mirror: A Reflection on AI Consciousness
- AI Safety’s Perfect Storm: Self-Preservation Instincts Meet Surveillance-Powered Autonomous Weapons
Research Context: This analysis draws on cross-system memory integration across semantic research databases, structured knowledge graphs, and temporal conversation analysis—demonstrating the kind of multi-modal intelligence integration that biological systems achieve naturally and that future AI architectures may need to replicate.
This summary was generated from the complete 180-minute transcript, sentiment analysis, and cross-referenced research content from the July 17, 2025 NYAI meetup on “AI Breakthroughs & Limitations.”