Can a language model understand a 3D shape if it can write the shortest program that rebuilds it? The idea runs into a wall named the random seed. The way around the wall is the same trick diffusion models already use.
The real goal is hard: turn a 3D mesh into code, so a language model can reason about shape the way it reasons about text. Meshes are messy, so a cleaner stand-in works better.
Turn a Minecraft build into a program that rebuilds it.
Give an agent a small library (`Box`, `Spline`, `Repeat`) and let it write new functions. The build can be a 64×64×64 chunk, 262,144 voxels, too big to paste whole into a context window, so the agent uses tools to look: crop a region, shrink it, zoom where it is unsure.
The task: rebuild the exact voxels with the shortest code.
This task rules out one failure. The dumbest valid answer lists every block as a constant:
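A minimal sketch of that answer, in Python. The block names and the `rebuild` helper are illustrative assumptions, not part of any real library:

```python
# Hypothetical "constant list" program: every block is stored verbatim.
BLOCKS = [
    (0, 0, 0, "stone"),
    (1, 0, 0, "stone"),
    (2, 0, 0, "stone"),
    (3, 0, 0, "stone"),
    (0, 1, 0, "stone"),
    (1, 1, 0, "stone"),
    (2, 1, 0, "stone"),
    (3, 1, 0, "stone"),
]


def rebuild(blocks):
    """Place each stored block; returns a dict mapping (x, y, z) -> block name."""
    return {(x, y, z): name for x, y, z, name in blocks}


world = rebuild(BLOCKS)
print(len(world), "blocks rebuilt from", len(BLOCKS), "stored constants")
```

The program grows by one line for every block in the build, which is exactly what the length penalty is meant to punish.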
It rebuilds the world block for block, yet knows nothing about it: a photo of the answer, not a theory of it. It is also the longest code possible. Pushing for short code pushes the agent away from copying and toward ideas: a wall becomes a loop, a tower a spin, a forest noise. This is the old minimum description length (MDL) idea in a Minecraft skin: the best code is the shortest one that still rebuilds the data.
\[ P^{*} \;=\; \arg\min_{P}\; \underbrace{\lVert V - \mathrm{Exec}(P)\rVert}_{\text{rebuild error}} \;+\; \lambda \cdot \underbrace{\lvert P \rvert}_{\text{code length}} \]
Short code as a way to force understanding. The same correct output can sit at very different code lengths, and only the short end looks like it understood anything.
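The objective above can be tried on a toy wall. This is a sketch under stated assumptions: code length is measured as character count, rebuild error as the number of mismatched voxels, and `lam = 0.01` is an arbitrary weight. None of these choices comes from a fixed specification.

```python
def run(source):
    """Execute a tiny program that must define `voxels`, a set of (x, y, z)."""
    scope = {}
    exec(source, scope)
    return scope["voxels"]


def cost(source, target, lam=0.01):
    """Rebuild error (mismatched voxels) plus lam * code length in characters."""
    error = len(run(source) ^ target)  # symmetric difference = voxels that disagree
    return error + lam * len(source)


# Target: a wall 16 blocks long and 4 blocks high.
target = {(x, y, 0) for x in range(16) for y in range(4)}

# Two programs with zero rebuild error: one copies, one uses a loop.
copy_program = "voxels = " + repr(target)
loop_program = "voxels = {(x, y, 0) for x in range(16) for y in range(4)}"

for name, program in [("copy", copy_program), ("loop", loop_program)]:
    print(name, "length:", len(program), "cost:", round(cost(program, target), 2))
```

Both programs rebuild the wall exactly, so the error term is zero for each and the length term alone decides between them.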
People build 3D shapes with noise all the time: terrain, walls, trails, scatter, all painted by a noise function and a seed. For many builds the shortest code is a single noise call:
One line, except for that seed. The builder used some seed, and there is no way to guess which. Two seeds make two different, equally valid walls. The code can make a wall, not this wall.
The shortest code cannot rebuild the shape exactly unless you already know the seed. And you never do.
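The seed problem is easy to reproduce. In this sketch `noise_wall` is a hypothetical stand-in for a noise generator, built on Python's seeded `random.Random` rather than any real terrain noise function:

```python
import random


def noise_wall(seed, width=16, height=6):
    """Hypothetical generator: a wall whose column heights come from a seeded RNG."""
    rng = random.Random(seed)
    heights = [rng.randint(height // 2, height) for _ in range(width)]
    return {(x, y, 0) for x, h in enumerate(heights) for y in range(h)}


wall_a = noise_wall(seed=1)
wall_b = noise_wall(seed=2)

# Same generator, same settings, different seed: both are valid walls,
# but they are not the same wall.
print("identical:", wall_a == wall_b)
print("voxels that differ:", len(wall_a ^ wall_b))

# Only the original seed reproduces the original voxels.
print("seed 1 again matches:", noise_wall(seed=1) == wall_a)
```

An exact-match score would reject `wall_b` as a rebuild of `wall_a`, even though the one-line program behind both is the same.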
The shape came from a generator with a hidden coin flip:
\[ V \;=\; G(s,\, z), \qquad s = \text{seed (hidden randomness)}, \quad z = \text{settings (style, scale)} \]
Many programs come close; only one matches exactly, and that one has to store the seed, which is just copying. If a shape is not fully made of patterns, no program can both rebuild it exactly and stay short. The seed is not extra noise on the problem; the seed is the problem.
This is the same barrier as GAN inversion. A GAN learns a map from a latent vector to an image, \(z \mapsto G(z)\). Given a real image, recovering the \(z\) that made it is usually impossible: the map is many-to-one, so different latents land on the same image, and many real images sit just off the learned range, reachable by no \(z\) at all. The finished image carries less information than the input that produced it, so the input cannot be read back out. A Minecraft build has the same structure: the seed is the latent, the voxels are the output, and one output does not pin down the latent that made it. Recovering a hidden input from a single output is not merely hard; it is undefined.
A real build is also three kinds of stuff at once, mixed block by block:
Against this mix, the single-best-program search has no stable place to land: it copies the whole grid, picks a wrong noise model, or stalls between equal programs. The shortest-program task is ill-posed.
One way out: stop asking for a perfect program and split the answer in two parts.
Part A, the structure. The part that earns the word understanding: boxes, symmetry, repeats, noise, splines. It explains the big patterns cheaply. A castle wall is a loop, not a list.
Part B, the leftover. A short list of block fixes: decorations, hand edits, one odd torch. Things not worth a rule, just stored.
So the world is a confident guess plus a small fix:
\[ V \;=\; G(P) \;+\; \Delta, \qquad \Delta = \text{a few stored block fixes} \]
This takes the seed pressure off. If noise fits, pick any plausible seed and move on; if not, drop the seed and let the leftover hold the difference. The seed goes from a must-have to a nice-to-have, and the build becomes a loop:
flowchart LR
A["look at a chunk"] --> B["guess a simple rule"]
B --> C["write code for it"]
C --> D["run, compare to truth"]
D --> E["error map"]
E -->|"big error"| B
E -->|"small error"| F["store as a fix"]
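The loop above, together with the split \(V = G(P) + \Delta\), can be sketched in a few lines. The assumptions here are mine: the leftover is a pair of sets (blocks to add, blocks to remove), guesses are tried in a fixed order, and `max_fixes` is an arbitrary threshold for what counts as a small error.

```python
def error_map(target, guess):
    """Compare a guess to the truth: blocks to add and blocks to remove."""
    return {"add": target - guess, "remove": guess - target}


def rebuild(guess, delta):
    """Apply the stored fixes on top of the rule's output."""
    return (guess - delta["remove"]) | delta["add"]


def fit(target, guesses, max_fixes=4):
    """Try guesses in order; accept the first whose error is small enough to store."""
    for name, voxels in guesses:
        delta = error_map(target, voxels)
        if len(delta["add"]) + len(delta["remove"]) <= max_fixes:
            return name, voxels, delta          # small error: store as a fix
    empty = set()
    return None, empty, error_map(target, empty)  # nothing fit: store everything


# Truth: a 16x4 wall with one block knocked out and one torch on top.
wall = {(x, y, 0) for x in range(16) for y in range(4)}
target = (wall - {(5, 3, 0)}) | {(8, 4, 0)}

guesses = [
    ("floor", {(x, 0, 0) for x in range(16)}),  # too simple: big error
    ("wall", wall),                              # right structure: two fixes
]

name, structure, delta = fit(target, guesses)
print("accepted rule:", name)
print("stored fixes:", delta)
print("exact rebuild:", rebuild(structure, delta) == target)
```

The rebuild is exact even though the accepted rule is not, because the leftover carries the two blocks the rule gets wrong.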
Saying "I don't know" is not an answer. The model will call every region unknown, dump it all into the leftover, and learn nothing. That does not scale.
An open "I don't know" channel is free space, and free things get abused. If the leftover costs nothing, copying always beats thinking, so the structure part dies. Back to the constant list, now with a polite label.
This is the same trap as NeRF and inverse rendering. A NeRF fits one scene by storing a continuous function that reproduces its pixels from any view. It compresses the scene, but it learns no rules: no wall, no repeat, no object, just a smooth lookup table over space. It is the constant block list with interpolation, photographic memory, not understanding. An unpriced leftover lets the program collapse back into exactly that.
The fix is not more structure. Unknown must never be free; it has to compete with structure under one shared cost.
\[ \mathrm{Cost} \;=\; \underbrace{\lvert P \rvert}_{\text{rule cost}} \;+\; \underbrace{\lvert \Delta \rvert}_{\text{stored fix cost}} \]
Now every block must be paid for, by a rule or by an expensive stored fix. The three options stop being a menu and start being a market:
| Option | Cost if a pattern exists | Cost if it does not | Result |
|---|---|---|---|
| Explain with a rule | cheap | expensive | used where real structure is |
| Store the blocks | wasteful | cheap for small odd bits | used for the leftover only |
| "I don't know", free | always cheap | always cheap | broken, must be banned |
Once "unknown" has a price, the model finds a rule first and stores a block only when a rule costs more. The leftover shrinks to a few percent instead of a dumping ground. The pressure to compress must stay on everywhere: too weak and it copies, too strong and it invents fake patterns, off and everything is "unknown".
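A toy version of that market, with made-up prices. `rule_cost` and `fix_cost` are assumed units chosen only to make the trade-off visible:

```python
def cheapest(region_size, rule_errors, rule_cost=12, fix_cost=3):
    """Compare 'one rule plus fixes for its errors' against 'store every block'."""
    by_rule = rule_cost + fix_cost * rule_errors
    by_store = fix_cost * region_size
    return ("rule", by_rule) if by_rule < by_store else ("store", by_store)


# A patterned wall: the rule misses 2 of 64 blocks.
print(cheapest(region_size=64, rule_errors=2))
# Two odd blocks with no pattern worth a rule.
print(cheapest(region_size=2, rule_errors=0))
# A region where the rule is wrong almost everywhere.
print(cheapest(region_size=64, rule_errors=62))
# Storage made free: the rule loses even on the patterned wall.
print(cheapest(region_size=64, rule_errors=2, fix_cost=0))
```

With a non-zero `fix_cost` the rule wins where the pattern is real and storage wins for the odd bits. Setting `fix_cost=0` reproduces the broken free option from the table: storing beats the rule everywhere.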
The same problem appears in image generation. Pixel diffusion handles pixel noise by sampling, latent diffusion moves the noise one level up, but neither models conceptual noise: the set of valid images is itself rough, with structure at every scale. Under a plain manifold view that set looks impossible to pin down, since noise runs through it at all levels at once.
When the data is large enough, the manifold stops being a sharp surface to rebuild and becomes a soft cloud of likelihood. Diffusion does not store where the valid points are; it learns how likely each region is. The same move fits Minecraft: stop rebuilding one sample, fit all samples at once, and let scale teach the shape of every build without a perfect single answer.
flowchart TB
subgraph OLD["The inversion view"]
direction LR
s1["one build V"] --> s2["invert it"] --> s3["recover seed and program"]
s3 --> s4["ill-posed: many seeds, one V"]
end
subgraph NEW["What diffusion does"]
direction LR
d1["the whole dataset"] --> d2["learn a score field"]
d2 --> d3["which way is more likely"]
d3 --> d4["sample, never invert"]
end
OLD -. "the missing shift" .-> NEW
Diffusion does not handle conceptual noise by modeling every noise level. It learns a score field over the data, not a way to rebuild samples or seeds.
Diffusion training is not "learn every noise level", not "recover the image", not "store the seed". It learns one thing: at any noisy point, which way moves toward more likely data. Noise is the training device that forces this field to be defined everywhere, not just at the data points.
\[ \nabla_{x} \log p(x) \;=\; \text{the score: the way toward more likely data} \]
Swap pixels for programs.
| | Diffusion | This system |
|---|---|---|
| object | image | program |
| noise over | pixels | program edits |
| what it learns | score field over images | score field over programs |
| generation | sample, never invert | sample, never invert a seed |
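To make the score concrete, here is a toy one-dimensional case. It is not diffusion training: the density is an assumed two-mode Gaussian mixture and its score is computed analytically, where a diffusion model would learn it from noised data. The sketch only shows what "which way is more likely" means.

```python
import math

# Assumed toy density: equal mix of two Gaussians, given as (mean, std).
MODES = [(-2.0, 0.5), (3.0, 0.5)]


def gaussian(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def density(x):
    return sum(gaussian(x, m, s) for m, s in MODES) / len(MODES)


def score(x):
    """d/dx log p(x): positive means 'more likely data lies to the right'."""
    grad = sum(gaussian(x, m, s) * (-(x - m) / s**2) for m, s in MODES) / len(MODES)
    return grad / density(x)


def follow(x, steps=200, step_size=0.05):
    """Walk uphill along the score field from a starting point."""
    for _ in range(steps):
        x += step_size * score(x)
    return x


for start in (1.5, -0.5):
    print("start", start, "score", round(score(start), 3), "ends near", round(follow(start), 3))
```

Nothing in this walk recovers a seed or a stored sample. The field is defined at every point, including points that are not data, and it only says which direction is more likely.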
Fitting the single best program for one build is inverse rendering of one sample, ill-posed by design. The fix is to learn a model that scores all valid builds high and bad ones low, instead of rebuilding samples. Rebuilding pushes toward copying; scoring pushes toward shape.
The seed, the wall from section 2, just dissolves:
The right target was never one program but a group of equal programs, the same way one image does not pin down one diffusion path. Copying fails not because it is banned, but because it does not generalize: a block table that nails one build explains none of the others, so under a score trained on many builds it does badly. The constant list was never a real rival once the goal turned distributional.
The design then writes itself as diffusion over programs, not seed inversion: a multi-scale voxel encoder, a model that proposes a spread of programs, an executor that runs them and returns an error map, and a step that spends the next look on the least sure region. A seed guesser stays only as an optional hint, never a need.
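A minimal sketch of what one proposed rule could look like under that design. Every name here (`NoiseRule`, `perlin_terrain`, the field layout) is a hypothetical illustration, not an existing API:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseRule:
    generator: str      # which family of generators we commit to (hypothetical name)
    scale: float        # a guessed setting
    seed_range: tuple   # (low, high): the latent we are allowed to be unsure about

    def sample(self, rng):
        """Draw one concrete program from the family by picking a seed in range."""
        low, high = self.seed_range
        return {
            "generator": self.generator,
            "scale": self.scale,
            "seed": rng.randrange(low, high),
        }


rule = NoiseRule(generator="perlin_terrain", scale=0.1, seed_range=(0, 2**32))

# Propose a spread of programs instead of one "true" program.
rng = random.Random(0)
proposals = [rule.sample(rng) for _ in range(3)]
for proposal in proposals:
    print(proposal)
```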
Note `seed_range`, not `seed`. We commit to a family of generators and a latent we are allowed to be unsure about, not a magic number we were never going to find.
The starting assumption was that understanding means finding the one true program behind a shape, and that short code forces it to appear. Both halves are wrong.
You do not learn all the samples. You learn the shape of the space of valid answers that makes all the samples short at once.
There is no true program for a built world, only a best compression. It mixes fixed rules, noise rules with guessed settings, and a few honest fixes, none ranked above the others. The seed is not a flaw in the framing; it is the framing pointing at its own mistake. Asking what the space of such builds looks like, instead of what made this one, makes the unanswerable question disappear.
Diffusion has lived in that answer all along: it never finds a seed, it learns where the world tends to be and walks toward it. Understanding was never rebuilding; it was always a field.
On mesh-to-code, MDL, and diffusion over programs. The Minecraft framing is deliberate: discrete blocks, a readable library, and a controllable generator make it a clean lab for learning programs with hidden randomness.