The Infinite Monkeys Aren’t Random Anymore

When OpenAI announced that one of its systems had disproved a long-standing conjecture in combinatorial geometry, the headline wrote itself: an AI solved a famous open problem in mathematics. The story traveled fast, and the framing was irresistible: a machine had done the thing we’d been told machines couldn’t do.

I read the paper, along with the remarks a group of mathematicians wrote up afterward and the announcement itself, and one detail kept pulling at me. At times the problem was described as being solved “in a completely automated fashion”: the model received an AI-written statement of the problem, and its output passed through an “AI grading pipeline” before any human looked at it. Read carefully, the picture is more specific than the headline. The proof was generated by the model in a single pass, then graded automatically, and only afterward cleaned up, reorganized, and checked by human mathematicians who improved it considerably. So this isn’t quite the chatbot-produces-proof story, and it isn’t quite an industrial search engine either. It’s something in between: a strong generation step, an automated filter, and a lot of human work around both. The architecture around the model turns out to matter as much as the model, and I found myself more curious about that whole apparatus than about the headline.

To get a feel for the difference between the apparatus and the chat window, I ran a small experiment of my own, not on the unit-distance problem from the announcement but on a genuinely hard open problem closer to my own work. I won’t dwell on which one; the specifics matter less than what happened when I changed how I asked. My first prompts were modeled closely on OpenAI’s published style, and they failed instructively. The model went straight into survey mode: summarizing known results, classifying what’s open, listing partial theorems, gently reminding me the problem was unresolved. It behaved exactly like a conscientious graduate student asked a question above their pay grade: careful, well-read, and not remotely trying to solve anything. If you’ve ever caught yourself reaching for the literature review instead of the problem, you know the posture from the inside.

Then I changed one thing. Instead of asking it to solve the open problem, I asked it to behave like an active research mathematician attempting to solve it, and I was explicit about what that meant. Don’t summarize the literature. Don’t tell me it’s open. Generate candidate lemmas. Find the bottleneck. Propose proof architectures. Try to build a counterexample. Locate the precise place where the standard machinery breaks.

The output changed character completely. The system stopped narrating mathematics and started doing something that looked like attempting it. It generated candidate strategies, proposed reductions, and kept circling back to a particular reframing of the problem, converging on it across multiple runs as if it had decided that was where the difficulty lived.

I want to be careful here, because this is exactly where it’s easy to oversell. I’m genuinely not sure the direction it kept returning to is a good one. It might be a dead end dressed in suggestive vocabulary. What struck me wasn’t that the model found the right path (I have no reason to believe it did) but that it had stopped reciting and started committing. It behaved as though it had a stake in an outcome. Whether the directions were any good is a separate question from whether the behavior was different, and the behavior was unmistakable.

This is where the old analogy comes back, and where I think it’s both right and badly misleading. The image of infinite monkeys typing Shakespeare is the natural one to reach for: generate enormous numbers of candidate arguments, discard the failures, keep what survives, eventually stumble onto something true. There’s truth in it: a process that generates and prunes is, in spirit, monkey-like. And it’s roughly how the mathematicians who examined the unit-distance proof describe the model’s actual chain of thought: trying a wide array of ideas from across mathematics, moving through them quickly, then locking on and working methodically once it hit the one that mattered.

But these are not random monkeys, and that’s not a footnote; it’s the whole story. A true monkey process explores all strings with equal probability, and the search is hopeless precisely because nothing constrains it. A model trained on the mathematical literature explores high-probability mathematical continuations instead. It carries proof templates, structural analogies, asymptotic instincts, a statistical sense of what a real argument looks like. The space it searches has already been bent by everything mathematicians have written, an astronomical compression. The monkeys are still typing more or less at random within a branch, but they’ve been handed the collected structure of the field as a prior, and that changes what “random” even means.
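
To make the contrast concrete, here is a toy sketch of my own. It compares the log-probability of typing one short phrase under a uniform monkey with the log-probability under a crude character-bigram model fit to a few words of text. Everything here is an illustrative assumption: the corpus, the target phrase, the add-one smoothing, and the bigram model itself, which is a stand-in for "a prior learned from text" and nothing like how a real language model works.

```python
import math
from collections import Counter

# Hypothetical alphabet: lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Hypothetical training text and target phrase, chosen only for illustration.
CORPUS = ("to be or not to be that is the question "
          "whether tis nobler in the mind to suffer")
TARGET = "to be or not to be"


def uniform_log_prob(target: str) -> float:
    """Log-probability that a uniform monkey types `target` exactly."""
    return len(target) * math.log(1.0 / len(ALPHABET))


def bigram_log_prob(target: str, corpus: str, alpha: float = 1.0) -> float:
    """Log-probability of `target` under a smoothed character-bigram model.

    The first character is drawn uniformly; each later character is drawn
    conditioned on the one before it, with add-`alpha` smoothing.
    """
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    log_p = math.log(1.0 / len(ALPHABET))
    for prev, nxt in zip(target, target[1:]):
        numerator = pair_counts[(prev, nxt)] + alpha
        denominator = context_counts[prev] + alpha * len(ALPHABET)
        log_p += math.log(numerator / denominator)
    return log_p


if __name__ == "__main__":
    print("uniform monkey:", uniform_log_prob(TARGET))
    print("bigram monkey: ", bigram_log_prob(TARGET, CORPUS))
```

Both monkeys are still sampling; the second one is simply sampling from a distribution that text has already bent toward the target, which is the sense in which the prior changes what "random" means.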

Which brings me to the gap between what OpenAI seems to have built and what the rest of us are using. When I run a public model, I get one trajectory of thought: one pass, one line of reasoning, take it or leave it. The paper calls its result automated and, in a sense, generated in one shot, but it also describes sending the output to a grading pipeline and studying how the success rate changed with more compute. You don’t build a grader for a process that produces a single answer, and success rate isn’t a dial you can turn unless you’re sampling many attempts and keeping the ones that pass. So “one shot” most plausibly describes the surviving trajectory (a single clean chain that didn’t need to be stitched together by hand), not the number of attempts it took to find one. What I’m shown is the winner; the discarded siblings never make the paper.
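
The arithmetic behind "success rate as a dial" is simple enough to sketch. Assuming each attempt succeeds independently with the same probability (a simplification of mine, not something the paper states), the chance that at least one of n attempts succeeds is 1 - (1 - p)^n. The function names and the numbers in the usage lines are hypothetical.

```python
import math


def success_rate(p_single: float, n_attempts: int) -> float:
    """Chance that at least one of `n_attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** n_attempts


def attempts_needed(p_single: float, target_rate: float) -> int:
    """Smallest number of independent attempts reaching `target_rate`.

    Assumes 0 < p_single < 1 and 0 < target_rate < 1.
    """
    return math.ceil(math.log(1.0 - target_rate) / math.log(1.0 - p_single))


if __name__ == "__main__":
    # Hypothetical per-attempt success probabilities, for illustration only.
    for p in (0.5, 0.01, 0.0001):
        print(p, attempts_needed(p, 0.5))
```

Under these assumptions the only way to move the success rate at fixed p is to change n, which is why a reported success-versus-compute curve reads to me like evidence of repeated sampling.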

That reframing is, I think, more interesting than the headline, and it explains why my own experiments stall. The hard part of long, autonomous mathematics was never generating ideas; a model will generate ideas all day. The hard part is telling a real one from a confident-sounding dead end without a human reading every attempt. Scale without that filter is the actual infinite-monkey regime: produce a mountain of candidate proofs and you’ve gained nothing, because you can’t find the right one. The grader is what makes the scale pay off, and the grader is exactly what a chat window doesn’t have. When I prompt a public model into research mode, I get a single draw from a distribution whose right tail is where the hard proofs live, and I can’t afford to sample that tail by hand. The leverage is starting to live in the apparatus around the model (generation, but also criticism, verification, and selection), not only in the weights.
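
A back-of-the-envelope sketch of why the grader matters, under assumptions that are entirely mine: a fixed base rate of correct candidates, and a grader characterized only by how often it passes correct proofs and how often it wrongly passes incorrect ones. None of these numbers come from the paper.

```python
from dataclasses import dataclass


@dataclass
class FilterOutcome:
    expected_passing: float  # candidates a human would still have to read
    precision: float         # fraction of passing candidates that are correct


def grade_filter(n_candidates: int, base_rate: float,
                 true_pos_rate: float, false_pos_rate: float) -> FilterOutcome:
    """Expected effect of an imperfect grader on a pile of candidate proofs."""
    pass_correct = base_rate * true_pos_rate
    pass_wrong = (1.0 - base_rate) * false_pos_rate
    pass_any = pass_correct + pass_wrong
    precision = pass_correct / pass_any if pass_any > 0 else 0.0
    return FilterOutcome(n_candidates * pass_any, precision)


if __name__ == "__main__":
    # Hypothetical numbers: 1 in 1000 candidates is correct.
    no_grader = grade_filter(1000, 0.001, 1.0, 1.0)   # everything "passes"
    with_grader = grade_filter(1000, 0.001, 0.9, 0.01)
    print(no_grader)
    print(with_grader)
```

With no grader, every candidate lands on a human desk and the fraction worth reading equals the base rate. With even a mediocre grader, the reading pile shrinks by orders of magnitude and the fraction worth reading rises, which is the sense in which the filter, not the generator, makes scale pay off.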

There’s a further wrinkle, and it complicates any clean story about progress. Not every open problem looks equally hospitable to this kind of work. Some yield to recombination: chaining existing machinery, optimizing an asymptotic, assembling known pieces into a construction nobody had bothered to assemble. This result looks like exactly that: the mathematicians who digested it were candid that it introduced no powerful new geometric tools, and that with hindsight it’s a natural, if highly non-trivial, generalization of a construction Erdős himself had used. Other problems fail right at the endpoint, where the standard tools lose just enough control to break, and no amount of recombining the literature gets you across. Those may need genuinely new ideas latent nowhere in the training data, and no search over existing proof space, however well guided, will conjure them. If that distinction is real, “AI is getting good at math” is too coarse a claim. Some problems are hidden constructions waiting to be found. Others are waiting for a principle that doesn’t exist yet. The methods that crack one may be useless on the other, which is also quietly consequential for anyone deciding which problem to stake the next few years on.

My own experiment solved nothing. The problem is still open, and the direction the model fixated on may lead nowhere. But it shifted the question I find interesting. For a few years the framing has been: can a language model reason like a mathematician? After watching one trajectory of a public model and reading carefully about the apparatus behind a private one, I think the sharper question is what happens when theorem search itself gets industrialized: when generating, criticizing, verifying, and writing up candidate arguments becomes something you scale and orchestrate rather than something a single mind does in a single sitting. We may be looking at an early, partial version of that right now.

That question reaches past the research frontier and into what we teach. If the valuable unit of work shifts even partway from the lone insight toward the orchestrated effort, then some of what we train students to prize (the heroic solo struggle, the conceptual leap held in one head) may need to share the stage with skills we barely teach at all: posing problems well, designing and grading searches, judging machine output, knowing which problems are search-shaped and which are waiting for an idea no search can reach. We may be looking at the early, awkward, easy-to-overhype first version of all this. Worth watching closely, and worth not mistaking the headline for the result.