Deeper Than Grammar: What Sperm Whale Vowels Tell Us About Translation

Researchers went looking for the whale equivalent of letters. They found something closer to a vowel system.
That is the short version of a paper Gašper Beguš of UC Berkeley published this month in Proceedings B, the Royal Society’s flagship biology journal. And it reframes the story I told last year about AI teaching us to speak whale, with the whales speaking back. Since then, the story has gotten deeper, the math has gotten stranger, and a couple of claims from the original post need cleaning up.
If you read the original, you know the shape of it already. What I want to do here is upgrade the thesis. It is not just that whale communication is surprisingly rich. It is that the richness itself is the thing; it is what makes the translation problem tractable, and it may be what two different kinds of intelligence, separated by 90 million years of evolution, look like when they solve the same problem.
What we got right, and why it mattered
The core finding from last year still holds. In 2024, Project CETI researchers fed nearly nine thousand sperm whale vocalizations through a machine-learning pipeline and found a combinatorial system with four features: rhythm, tempo, rubato, and ornamentation. Those features combine freely to produce around 156 distinct coda types, which the team described as a phonetic alphabet. Different clans speak different dialects. Males who leave their birth clans adjust their vocal style when they join new pods.
That was a big deal. It made “grammar” a defensible word for what sperm whales do.
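To get a feel for how a small feature set produces an inventory that size, here is a toy sketch in Python. The 18 rhythm and 5 tempo types are the counts the 2024 paper reported; the rubato and ornamentation inventories are my simplified stand-ins for what are really graded modulations, so treat the arithmetic as illustrative.

```python
from itertools import product

# Rhythm and tempo counts from the 2024 paper; rubato and ornamentation
# are graded modulations in the real data, discretized here for the sketch.
rhythms = [f"rhythm_{i}" for i in range(18)]
tempos = [f"tempo_{i}" for i in range(5)]
rubato = ["steady", "accelerating", "decelerating"]   # simplified stand-in
ornamentation = ["plain", "ornamented"]               # extra final click, or not

space = list(product(rhythms, tempos, rubato, ornamentation))
print(len(space))  # 540 possible combinations; roughly 156 are attested in the data
```

Four independent axes, a few values each, and the combinatorial space already dwarfs anything a fixed call inventory could offer.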
What the 2024 paper could not say much about was how the individual clicks themselves were structured. It looked at when a click happened. It did not look closely at what each click sounded like.
That is the gap the new paper fills.
The vowel layer
Beguš and his collaborators analyzed the acoustic shape of individual clicks inside codas and found two independent axes of variation. The first is duration: clicks can be short or elongated. The second is pitch contour: the frequency trajectory inside a click can rise or fall. These are not metaphorical vowels. They are acoustic structures that whales modulate deliberately, and they map onto exactly the distinctions linguists use for vowels: duration, as in Latin's long and short vowels, and pitch contour, as in the tones of Mandarin or the pitch accents of Slovenian, where prosodic modulation, not a fixed frequency target, carries the distinction.
The 2024 paper showed that whales vary when clicks happen. The 2026 paper shows that whales also vary how each click sounds. That is the difference between discovering word order and discovering a vowel system.
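Neither axis requires exotic machinery to observe. Here is a minimal sketch of how one might pull the two features out of a single click, using an amplitude envelope for duration and instantaneous frequency for contour. This is my illustration, not the paper's pipeline; the threshold and the linear fit are simplifications.

```python
import numpy as np
from scipy.signal import hilbert

def click_features(click: np.ndarray, sr: int):
    """Toy duration and pitch-contour estimate for a single click waveform.

    Illustrative only; the Proceedings B paper uses its own methods.
    """
    analytic = hilbert(click)
    envelope = np.abs(analytic)

    # Duration: time the envelope spends above 10% of its peak.
    above = envelope > 0.1 * envelope.max()
    duration_ms = 1000.0 * above.sum() / sr

    # Pitch contour: slope of the instantaneous frequency across the click.
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * sr / (2 * np.pi)
    t = np.arange(inst_freq.size) / sr
    slope = np.polyfit(t[above[:-1]], inst_freq[above[:-1]], 1)[0]

    contour = "rising" if slope > 0 else "falling"
    return duration_ms, contour
```

On real recordings you would band-limit and denoise first, but the two axes themselves, how long and which way the frequency bends, are that simple to state.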
Beguš applied the same analytical framework to parrots and elephants, and reports that sperm whale communication exceeds any non-human system he has studied in complexity. That is a comparative claim worth sitting with. He has also published earlier work using deep-learning latent spaces to characterize whale vocalizations, so this is not a one-off hunch.
Here is the part that clarifies rather than surprises, once you see it.
Sperm whales and humans share a last common ancestor who lived more than 90 million years ago, before mammals diversified into the forms we recognize today. Two lineages separated by that much evolutionary distance have independently arrived at layered, hierarchical vocal systems with vowel-like distinctions. That is convergent evolution on the structure of communication itself.
The original post celebrated how extraordinary it was that whales had anything so complex. The better framing is this: these organizational principles may be near-universal solutions to the problem of efficient acoustic communication at scale. They are not human-specific achievements. Intelligence finds similar answers whether it develops in the ocean or on land.
The math of translation
Here is the part that is most likely to surprise anyone who followed the original story.
In 2023, a team including Shafi Goldwasser (Turing Award winner), David Gruber (Project CETI founder), Adam Kalai, and Orr Paradise published a theoretical paper at NeurIPS that proves something counterintuitive. In their framework, unsupervised translation error rates fall as the complexity of the source language rises and as the common ground between source and target grows. The more complex the animal communication system, the more amenable it is to machine translation without parallel data.
Read that twice, because it runs against intuition. A richer language sounds like a harder translation target. The theorem says the opposite. A richer system provides more internal structure for a model to exploit as an alignment signal. A simple system, with only a handful of distinct signals, offers almost nothing for an unsupervised model to grip onto. Complexity is not noise in the translation problem; it is the handhold.
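You can see the intuition in miniature. The simulation below is my own illustration, not the paper's construction: treat a "language" as a bigram distribution over symbols, then count how many relabelings of those symbols an eavesdropper could not rule out from the statistics alone.

```python
import numpy as np
from itertools import permutations

def consistent_mappings(bigram: np.ndarray) -> int:
    """Count symbol permutations that leave the bigram statistics unchanged.

    Each surviving permutation is a candidate translation the statistics
    cannot rule out; fewer survivors means an easier unsupervised problem.
    """
    n = bigram.shape[0]
    count = 0
    for perm in permutations(range(n)):
        p = np.array(perm)
        if np.allclose(bigram[np.ix_(p, p)], bigram):
            count += 1
    return count

n = 5
uniform = np.full((n, n), 1 / n**2)                  # "simple" language: no structure
rng = np.random.default_rng(0)
rich = rng.dirichlet(np.ones(n * n)).reshape(n, n)   # "complex" language: rich structure

print(consistent_mappings(uniform))  # 120: every mapping fits, translation is hopeless
print(consistent_mappings(rich))     # 1: only the true mapping survives
```

The uniform system leaves all 120 mappings on the table; the structured one pins down exactly one. Scale that logic up and you have the shape of the theorem: structure is what collapses the space of possible translations.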
The team has continued to refine the framework, including a 2025 paper on evaluating animal-communication translators without access to ground-truth translations, which is the kind of methodological problem you only have to solve once you take the first theorem seriously.
Put the two threads together and a practical implication emerges. Every new layer of complexity we find in sperm whale communication, whether it is the phonetic alphabet, the vowel-like distinctions, or cross-clan vocal-style convergence, is not only biologically interesting. It is also evidence that the mathematical preconditions for machine translation are being met. The whales have been making our job easier with every finding.
From listening to speaking
The original post described a moment when AI-generated whale calls were played back to wild pods and the whales “responded appropriately” sixty-eight percent of the time, with a link to an MIT CSAIL blog. I need to correct that.
The peer-reviewed Nature Communications paper does not contain playback-experiment data. The authors explicitly declined to run playback experiments, citing the vulnerability of the Eastern Caribbean population they study. The sixty-eight percent figure does not appear to come from the peer-reviewed literature and I should not have presented it the way I did. Early informal observations may have motivated the number; formal playback experiments remain pending.
The more defensible version of the “we’re learning to speak whale” milestone arrived in December 2025 with WhAM, the Whale Acoustics Model, from researchers at Northwestern, MIT, and Project CETI. WhAM is the first transformer model that can generate synthetic sperm whale codas from arbitrary audio prompts. It was built by fine-tuning VampNet, a model originally trained on musical audio, on roughly 10,000 coda recordings collected over two decades.
The music-prior choice is not arbitrary. Coda sequences have phrase-level temporal structure that genuinely resembles musical phrasing, so starting from a model that already understands musical time gives the system useful inductive bias without requiring a biology-specific architecture built from scratch.
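In outline, the recipe is ordinary transfer learning: freeze most of the pretrained network so the musical-time prior survives, and let the coda data retrain the rest on a masked-prediction objective. The sketch below is a generic PyTorch stand-in, not WhAM's actual code; the architecture, sizes, mask token, and tokenizer are all placeholders.

```python
import torch
from torch import nn

class MusicPretrainedGenerator(nn.Module):
    """Placeholder for a VampNet-style masked acoustic-token model."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.backbone(self.embed(tokens)))

model = MusicPretrainedGenerator()
for p in model.backbone.parameters():   # freeze the musical-time prior
    p.requires_grad = False
opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

codas = torch.randint(1, 1024, (8, 128))     # fake batch of tokenized codas
mask = torch.rand(codas.shape) < 0.5         # hide half the tokens
logits = model(codas.masked_fill(mask, 0))   # 0 serves as the mask token
loss = nn.functional.cross_entropy(logits[mask], codas[mask])
loss.backward()
opt.step()
```

Which layers to thaw is an empirical question in practice; the point of the sketch is only that the music prior arrives for free and the whale data does the rest.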
Two results from the WhAM paper stand out.
First, the codas it generates are rated as acoustically plausible by marine biologists who have spent careers listening to the real thing. That is the honest version of the “speaking back” milestone.
Second, and more interesting, the internal representations WhAM learned from a pure generation task turned out to classify rhythm type, social-unit identity, and vowel-like categories without any classification-specific training. Generative pretraining produced structural understanding as a byproduct. That pattern is familiar from large language models: train a system to predict the next token and it implicitly learns syntax, semantics, and pragmatics. WhAM extends the same principle into cetacean bioacoustics. A related benchmark, WoW-Bench, is now trying to measure that kind of emergent structural understanding in whale-audio models more systematically.
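The standard test for that kind of emergence is a linear probe: freeze the model, take its internal activations, and ask whether a classifier too simple to do any real work on its own can still read the category out of them. A sketch with synthetic embeddings standing in for WhAM's real activations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 600 codas, 256-dim frozen embeddings, 3 rhythm types.
# If structure is emergent, class information is linearly separable here.
labels = rng.integers(0, 3, size=600)
centers = rng.normal(size=(3, 256))
embeddings = centers[labels] + 0.5 * rng.normal(size=(600, 256))

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
# High accuracy means the generator's representations already encode
# rhythm type, even though nothing in training asked for that.
```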
Culture is horizontal too
One more finding deserves a mention because it updates the “whale families have accents” section of the original post.
I described dialect transmission there as mostly top-down: parents to calves, with rover males adapting when they join new pods. A 2023 Project CETI paper by Leitão and colleagues adds a horizontal dimension. Geographically overlapping clans converge in vocal style (the micro-variation patterns within individual codas) across most of their communication, even while maintaining distinct identity-coda repertoires. Whales absorb communicative style from their cultural neighbors without adopting the symbolic markers that would compromise their clan identity.
That is a more nuanced, more human-like picture of cultural transmission than the simple inheritance model I leaned on before.
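One way to make that two-level claim concrete (my illustration, not the statistic the Leitão paper uses) is to compare clan-level feature distributions: the divergence should be small for style features and large for identity-coda usage. The numbers below are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical feature-usage distributions for two overlapping clans.
# Style features (e.g., within-coda micro-timing bins): near-identical.
clan_a_style = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
clan_b_style = np.array([0.28, 0.26, 0.21, 0.14, 0.11])

# Identity-coda usage: sharply different, preserving clan markers.
clan_a_id = np.array([0.70, 0.20, 0.05, 0.03, 0.02])
clan_b_id = np.array([0.05, 0.05, 0.10, 0.30, 0.50])

print(f"style JS distance:    {jensenshannon(clan_a_style, clan_b_style):.3f}")  # small
print(f"identity JS distance: {jensenshannon(clan_a_id, clan_b_id):.3f}")        # large
```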
Two-year-old to five-year-old
The original post repeated Project CETI’s stated goal of establishing a core vocabulary of 50 whale “words” by 2027. Gruber has publicly revised that.
The near-term target now is comprehending about 20 vocalized expressions that relate to specific behaviors, things like diving and sleeping, within five years. Gruber’s own framing for the revision is worth keeping. “At the moment we are like a two-year-old, just saying a few words. In a few years’ time, maybe we will be more like a five-year-old.”
That is a more honest milestone than the 2027 deadline, and a better metaphor. The point of interspecies communication research is not to hit a round number of vocabulary entries by a particular year. The point is to develop, slowly, the kind of shared ground that lets two very different nervous systems begin to understand one another.
What the convergence means
Here is where I think the story has actually moved.
A year ago, the question was whether whale communication was structured enough to be called a language. The answer turned out to be yes, in ways that continue to get stronger as better tools arrive.
The more interesting question now is what it means that two forms of intelligence, separated by 90 million years of independent evolution, arrived at such similar structural answers. Hierarchical organization. Combinatorial features. Vowel-like distinctions carried by prosodic modulation. Dialect systems that transmit vertically and horizontally. These are not human achievements with a whale footnote. They look more like deep features of what efficient acoustic communication becomes once a social species gets complicated enough to need it.
And if the Goldwasser-Gruber-Kalai-Paradise theorem is right, that complexity is not an obstacle to the translation project. It is the precondition. The whales are not just speaking. They have been, in effect, leaving structure in the signal for us to find. We are finally building the tools that can find it.
Two-year-old today. Maybe five in a few years. That is still further along than any of this was when I first wrote about it.