Writings

Why tiny changes are hard for Generative AIs


A generative image model does not move across a smooth, perfectly engineered map where every coordinate has a reliable, evenly-spaced neighbouring coordinate.

Instead, it moves through a learned landscape of image-likelihood.

That landscape was shaped during training. At the beginning, the terrain is basically random: hills, valleys, ridges, and channels all over the place, with no useful structure. Then the model is shown enormous numbers of images and captions. Each time it tries to connect an image with the right idea — “portrait,” “man facing camera,” “sunlit kitchen,” “three-quarter profile,” “soft lighting” — the training process slightly reshapes the terrain. (Note the emphasis on ‘captions’ - this is foreshadowing.)

Over time, repeated patterns carve out stronger paths.

So the model develops regions like:

“man facing camera”

“man in three-quarter profile”

“man looking down”

“fashion editorial pose”

Those regions are strong because they appeared again and again in the training data.

But a request like:

“Turn his head 5 degrees to the right and tilt it down 17 degrees”

assumes the model has a 3D representation of the body and a measurable system of angles.

Most generative image models don’t have that. They have a landscape of visual probability, shaped by examples.

So you are not really saying:

“Rotate this object by exactly 5 degrees.”

You are saying something closer to:

“Move the image slightly toward the region of images that look more right-turned and downward, while preserving everything else.”

And that is much harder, because the landscape is not uniformly detailed everywhere. It is more like a mountain range shaped by all the images the model has absorbed. Some areas are broad and easy to move through, because that content was well represented in the training data. Others are jagged, blurry, and under-sampled, because the model saw fewer useful examples, or because the examples were inconsistent.
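That “nudge toward a region” framing can be sketched in a few lines. Everything here is a toy, not a real model API: the four-dimensional vectors stand in for a model’s learned representations, and `alpha` stands in for however strongly an edit pulls the image toward the target region.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical embeddings: where the image sits now, and the region
# of images that "look more right-turned".
current_image = normalize(rng.normal(size=4))
more_right_turned = normalize(rng.normal(size=4))

# "Move slightly toward that region": a small interpolation step.
# The model has no notion of "5 degrees" here -- only step size.
alpha = 0.1
nudged = normalize((1 - alpha) * current_image + alpha * more_right_turned)

# The step shrinks the distance to the target region, but only somewhat,
# and everything encoded in the vector moves together.
before = np.linalg.norm(current_image - more_right_turned)
after = np.linalg.norm(nudged - more_right_turned)
```

The key point of the sketch: there is no parameter in it that means “exactly 5 degrees,” only a direction and a step size.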

A model may have a stable path from:

“front-facing portrait”

to:

“three-quarter portrait”

But it may not have a high-quality, controllable path through every microscopic increment between them. That is because training does not carve every possibility into the terrain. It does not build a uniform slider for every possible attribute. It strengthens the paths that appeared often, clearly, and consistently enough in the data.

So when you ask for a very specific adjustment, the model may slide into a nearby familiar valley instead of taking the exact tiny step you imagined. It might interpret “slightly right” as a more right-facing pose. Or it may change the jawline, expression, lighting, hairline, shoulders, or identity, because in its learned landscape those elements are tangled together.

tldr: The model does not necessarily understand the image as a set of separately adjustable parts. Head angle, facial expression, lighting, hair shape, camera position, and identity may all be connected in the same region of the landscape, because they were learned together from examples.
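That entanglement can be illustrated with a toy readout matrix (the numbers are invented): if attributes are mixed together in the latent code, no step along it changes one attribute without dragging the others along.

```python
import numpy as np

# Hypothetical "readout" of attributes from a 2-dim latent code.
# Because the rows are not orthogonal, no latent direction moves
# head angle alone.
W = np.array([
    [0.9, 0.4],   # head angle: depends mostly on latent dim 0
    [0.5, 0.8],   # lighting:   also depends on latent dim 0
    [0.3, 0.7],   # hair shape: likewise
])

z = np.array([0.0, 0.0])       # starting latent code
step = np.array([0.2, 0.0])    # try to adjust "head angle" only

# Change in (head angle, lighting, hair shape) after the step:
delta = W @ (z + step) - W @ z
# delta is roughly [0.18, 0.10, 0.06] -- the head turns,
# but lighting and hair shift too.
```

In a real model the mixing is far messier than a fixed matrix, but the failure mode is the same: attributes learned together move together.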

To make things worse, all of this is navigated through language, and increments of visual nuance far outstrip linguistic precision, especially at the level of detail at which image captions are written. If and when companies add richer captions to future models (or you train a LoRA on that additional information in a specific area), you fill in more of the terrain and get more ‘quality paths’.
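The caption-precision point can be made concrete with a toy captioner (the thresholds and phrases are invented): many distinct head angles collapse to the same caption, so training gets no signal that would distinguish them.

```python
# Toy sketch: captions quantize continuous pose into a few coarse phrases.
def caption_for(yaw_degrees):
    if abs(yaw_degrees) < 15:
        return "facing camera"
    if abs(yaw_degrees) < 60:
        return "three-quarter profile"
    return "profile"

angles = [0, 5, 10, 22, 40, 75]
labels = [caption_for(a) for a in angles]
# 0, 5, and 10 degrees all caption as "facing camera": the terrain
# between them is flat, because the labels cannot tell them apart.
```

Richer captions (or a LoRA trained with pose annotations) would effectively add more thresholds, carving finer paths into that flat stretch.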

So broad creative direction works much better than specific mechanical corrections.

Thus prompts like this will generally result in ‘higher quality’:

“Make him look slightly more toward camera-right, with his chin a little lower.”

And harder prompts like this can result in ‘lower quality’:

“Turn his head exactly 5 degrees right and 17 degrees down, while preserving identity, lighting, clothing, expression, hair, lens, and composition.”

The second request treats the model like it’s Cinema 4D or Maya.

But the model is closer to a probability landscape: powerful, rich, and expressive, but not naturally precise at the micro-control level.
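For contrast, here is what the Cinema 4D framing assumes: an exact, deterministic angle system. In a 3D package the second request really is just linear algebra (the sign conventions for “right” and “down” below are an assumption; they vary by package):

```python
import numpy as np

def rot_y(deg):
    # Yaw: turn left/right around the vertical axis.
    r = np.radians(deg)
    return np.array([[ np.cos(r), 0, np.sin(r)],
                     [ 0,         1, 0        ],
                     [-np.sin(r), 0, np.cos(r)]])

def rot_x(deg):
    # Pitch: tilt up/down around the horizontal axis.
    r = np.radians(deg)
    return np.array([[1, 0,          0         ],
                     [0, np.cos(r), -np.sin(r)],
                     [0, np.sin(r),  np.cos(r)]])

forward = np.array([0.0, 0.0, 1.0])        # where the head points now
turned = rot_x(-17) @ rot_y(5) @ forward   # exactly 5° and 17°, every time
```

The rotation is exact, repeatable, and touches nothing else in the scene. That is the guarantee a probability landscape does not give you.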