Diffusion models

Diffusion models are artificial intelligence systems designed to generate detailed images from meaningless statistical noise. By integrating an autoencoder, the CLIP model, and the UNET network, this technology transforms numerical chaos into coherent visual representations based on text descriptions.

Legend has it that when Michelangelo was asked how he had sculpted his unsurpassable David, the Renaissance genius offered an absurdly simple answer: “I didn’t do it; he was already there. I simply removed everything that wasn’t David.”

It’s a rather pragmatic, I’d even say slightly cheeky, way to explain one of the crowning masterpieces of art history. Deep down, according to the master’s own logic, he was merely applying a purely subtractive method, patiently chipping away at the excess material until the perfection hidden inside the stone was finally revealed.

It would be fascinating to see the look on good old Michelangelo’s face if he could travel through time and discover that, five centuries later, we’ve decided to shamelessly steal his idea. The only difference is that we’ve swapped marble dust for mathematical algorithms and the chisel for processors that heat up to unhealthy temperatures, much to the delight of our computer fans, which buzz like jet turbines about to blast off into the stratosphere (or scream for mercy so the graphics card doesn’t melt into a puddle).

And, instead of stone, we use pure, random chaos.

It is ironic that one of the most revolutionary artificial intelligence techniques uses this exact same logic to create art (and I hope nobody takes offense at me calling it art). I am referring to diffusion models for AI image generation, such as Stable Diffusion or Midjourney, systems capable of giving shape to statistical chaos and extracting a detailed visual work from what is, quite literally, nothing more than a noisy soup of nonsensical pixels.

In this post, we are going to break down how this methodology, capable of generating images out of thin air, actually works. Get ready to understand how a handful of formulas manages to sculpt “Darth Vader ironing a Hawaiian shirt” starting from a visual static so loud it would shatter anyone’s visual eardrums.

The three pillars of digital creation

For a machine to be able to draw what we ask for via a simple text prompt, one single brain just won’t cut it. A diffusion model actually requires a team of three hyper-specialized components working in a chain: a compressor to extract the image’s essence, a translator to act as a bridge between the text and the image, and a chisel to carve it out of digital marble. Let’s meet the team.

Convolutional autoencoder: extracting the essence

Processing a high-resolution image pixel by pixel is a nightmare that would make any computer cry, given that a standard image can contain hundreds of thousands of data points. Think, for example, of a moderately sized colour image, say 512 x 512 pixels. Since it has three colour channels (red, green, and blue), it totals 512 x 512 x 3 = 786,432 data points.

To lighten the computer’s workload, we turn to a type of neural network called an autoencoder.

An autoencoder works through two main blocks. The first is the encoder, which takes the original information (in this case, an image) and compresses it to obtain a much smaller representation known as a “latent.” This latent contains only the essential information, achieving a drastic reduction in data volume.

Subsequently, the decoder, the second component, performs the reverse task: it takes that compact latent and generates the reconstructed image at the output. The goal is for the autoencoder, during its training, to learn to rebuild the original image as faithfully as possible from the latent produced by the encoder.

Following our example, the 512 x 512 x 3 image is reduced to 64 x 64 x 4 = 16,384 data points after passing through the encoder. This reduction of roughly 98% in data volume (a factor of 48) makes the process much more manageable from a computational standpoint.
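If you want to check the arithmetic yourself, here is a minimal sketch. It only assumes the two shapes from the example (512 x 512 x 3 for the image and 64 x 64 x 4 for the latent); the function name is made up for illustration.

```python
def count_values(height, width, channels):
    """Number of individual numbers needed to store a block of this shape."""
    return height * width * channels

image_values = count_values(512, 512, 3)   # RGB image from the example
latent_values = count_values(64, 64, 4)    # latent produced by the encoder

# Share of the data volume that disappears after encoding.
reduction = 1 - latent_values / image_values

print(image_values, latent_values, f"{reduction:.1%}")
# prints: 786432 16384 97.9%
```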

CLIP: the digital Rosetta Stone

This is where things get a bit technical, but very interesting. By nature, images (matrices of numbers representing coloured pixels) and texts (sequences of letters) are completely incompatible mathematical universes.

If one model encodes a photo of a dog and another encodes the word “dog”, their resulting numerical representations (the so-called embeddings, vectors that live in a mathematical space) will end up in entirely different galaxies within the mathematical universe. The machine would have no way of knowing they mean the same thing.

To solve this, we turn to CLIP (Contrastive Language-Image Pre-training). The great achievement of this model is that it didn’t learn to read or see in isolation like the models mentioned above; instead, it was trained by combining an image encoder and a text encoder simultaneously.

During training, it was presented with hundreds of millions of pairs consisting of an image and its textual description. Using statistical brute force and contrastive training (where the network learns by constantly comparing which image fits which text and discarding those that don’t), CLIP adjusted its gears to push the image embedding and the text embedding of a matching pair as close together as possible in mathematical space, while pushing mismatched pairs apart.

Thanks to this, the numerical code for the phrase “Darth Vader” ends up very close to the code for a photograph of the Sith Lord. CLIP managed to unify two incompatible computer languages; the AI model can finally “see” what it “reads.”
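To make "close in mathematical space" a bit more tangible, here is a toy sketch. It does not run CLIP: the three-number embeddings below are invented for illustration (real ones have hundreds of dimensions), and cosine similarity is used as the closeness measure, which is the usual choice for this kind of comparison.

```python
import math

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way, 0.0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical text embeddings (made-up numbers, not real CLIP output).
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "Darth Vader": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a photograph of the Sith Lord.
image_embedding = [0.1, 0.1, 0.95]

def best_caption(image_vec, captions):
    """Pick the caption whose embedding lies closest to the image embedding."""
    return max(captions, key=lambda text: cosine_similarity(image_vec, captions[text]))

print(best_caption(image_embedding, text_embeddings))
# prints: Darth Vader
```

Contrastive training is, roughly speaking, what nudges the numbers so that this kind of lookup gives the right answer.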

UNET: the magic chisel

If the autoencoder prepares the canvas and CLIP translates our instructions, UNET is the true artist with the chisel in hand. It is the core model of the diffusion team.

Its sole and obsessive purpose in its digital life is to receive a latent (that essence of the image) completely filled with noise and visual grime, and learn to clean it progressively until it is spotless.

On a technical level, UNET is simply a convolutional neural network named after its peculiar U-shaped mathematical architecture. Its operation is fascinating: first, it compresses the input image information (going down one side of the U) to capture the global context of that chaotic jumble of pixels, and then it expands it (going up the other side of the U) to reconstruct the forms with pinpoint accuracy.
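The trip down and back up the U can be sketched with a toy one-dimensional signal. This is only an illustration of how the resolution shrinks and grows: the averaging and repeating used here are assumptions made for simplicity, whereas a real UNET uses learned convolutions and also passes information across the U through skip connections, which are left out.

```python
def downsample(values):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]

def upsample(values):
    """Double the resolution by repeating each value."""
    return [v for v in values for _ in range(2)]

def u_shape_sizes(signal, depth):
    """Return the signal length at every stage of the descent and the ascent.

    The signal length must be divisible by 2 ** depth.
    """
    sizes = [len(signal)]
    for _ in range(depth):      # going down one side of the U
        signal = downsample(signal)
        sizes.append(len(signal))
    for _ in range(depth):      # going up the other side of the U
        signal = upsample(signal)
        sizes.append(len(signal))
    return sizes

print(u_shape_sizes([1, 3, 5, 7, 2, 4, 6, 8], depth=2))
# prints: [8, 4, 2, 4, 8]
```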

This dual ability to see both the forest and the trees makes UNET a true pixel surgeon. Its real-world superpower generally shines in semantic image segmentation. Its main function is to take an input image and generate a “mask” that classifies every single pixel into a specific category, allowing it to identify shapes, edges, and objects with great precision: from detecting roads in satellite imagery to locating tumours in CT scans, among many other applications.

And it is precisely this meticulous talent that we leverage in the case of image generation. Here, it’s not looking for tumours or roads; instead, we ask it to iteratively predict and erase static noise until, from all that chaos, a work of art emerges.

The cybernetic art school: diffusion and training

Now that we’ve met our team’s main players, let’s see how they work together, starting with the training process, the art school for diffusion models.

To train a diffusion model, we first gather a massive training dataset with thousands or even millions of images paired with their text descriptions. Let’s pick one to understand the process. For instance: a photo of “a single, majestic emperor penguin on the Antarctic snow.”

But before we dive into the training details, we must set the ground rules: of our starring trio, only one is actually going to be learning in this phase. The encoder, the decoder, and the CLIP text encoder have already been trained independently and know their jobs; therefore, they are now completely frozen. This means their parameters (the model weights) are locked and won’t be trained or modified. The only one that’s going to break a sweat and update its parameters is the UNET.

The first step in training is the forward diffusion process. We take that beautiful penguin photo and compress it through the frozen encoder to get its latent or compact representation. Next, we intentionally start contaminating that latent with random noise.

We repeat this by applying various noise levels until the original image disappears completely, leaving nothing but static. We’ve destroyed the image. You can see this illustrated below, although for educational purposes we’re showing the bird itself rather than its latent representation.
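In code, contaminating a latent with noise can be sketched as follows. The post does not give the exact formula, so the square-root blend used here is an assumption borrowed from the common formulation of diffusion models, and the four-number latent and the `keep` parameter are hypothetical stand-ins.

```python
import math
import random

def add_noise(latent, keep, rng):
    """Blend a clean latent with Gaussian noise.

    keep=1.0 returns the latent untouched; keep=0.0 returns pure noise.
    (Assumed blend: sqrt(keep) * latent + sqrt(1 - keep) * noise.)
    """
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * n
             for x, n in zip(latent, noise)]
    return noisy, noise

rng = random.Random(0)
clean_latent = [1.0, -0.5, 0.25, 2.0]   # hypothetical stand-in for the penguin latent

# From an intact latent (keep=1.0) to nothing but static (keep=0.0).
for keep in (1.0, 0.7, 0.3, 0.0):
    noisy, _ = add_noise(clean_latent, keep, rng)
    print(keep, [round(v, 2) for v in noisy])
```

The function also returns the noise it added, because that is exactly the answer the UNET will later be asked to guess.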

Meanwhile, the text description passes through the (also frozen) CLIP encoder to generate its mathematical embedding, which we’ll need for the next step.

Next comes reverse diffusion, the heart of the learning process. We hand UNET (the only model currently training and tweaking its parameters) that noise-wrecked latent, along with the text embedding generated by CLIP. It’s as if we were telling it: “Here’s this pile of noisy garbage. Using the penguin text as your guide, figure out what doesn’t belong and strip it away to recover the penguin hidden inside.” You can see the task we’ve set for it in the following image.

Over millions of iterations, UNET patiently learns to strip the noise from the latent step by step, keeping only what matches the input text. Once it has removed the noise and produced a clean latent, we pass it through the (again, frozen) decoder to reconstruct an image that is, ideally, very close to our original photo.
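Here is a deliberately tiny sketch of that learning loop. Everything in it is a stand-in: the "UNET" is reduced to four trainable numbers, the clean latent plays the role of the frozen encoder's output and is never modified, the text embedding is left out to keep things short, and the noise blend and mean-squared-error loss are assumptions taken from the common formulation rather than from this post.

```python
import math
import random

rng = random.Random(0)
clean_latent = [1.0, -0.5, 0.25, 2.0]   # hypothetical output of the frozen encoder
estimate = [0.0] * len(clean_latent)    # the only trainable parameters in this toy
keep_levels = (0.9, 0.5, 0.1)           # assumed noise levels: share of signal kept
learning_rate = 0.1

def predict_noise(noisy, keep, params):
    """Toy stand-in for the UNET: guess the noise from its current estimate."""
    return [(x - math.sqrt(keep) * p) / math.sqrt(1.0 - keep)
            for x, p in zip(noisy, params)]

def training_step(params):
    # Forward diffusion: pick a noise level and wreck the latent.
    keep = rng.choice(keep_levels)
    noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]
    noisy = [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * n
             for x, n in zip(clean_latent, noise)]
    # Reverse diffusion: compare the guessed noise with the noise really added.
    predicted = predict_noise(noisy, keep, params)
    errors = [p - n for p, n in zip(predicted, noise)]
    loss = sum(e * e for e in errors) / len(errors)   # mean squared error
    # Gradient of the loss with respect to each trainable parameter.
    slope = -math.sqrt(keep) / math.sqrt(1.0 - keep)
    grads = [2.0 * e * slope / len(errors) for e in errors]
    new_params = [p - learning_rate * g for p, g in zip(params, grads)]
    return new_params, loss

for step in range(300):
    estimate, loss = training_step(estimate)

print([round(v, 3) for v in estimate], loss)
```

As the loss shrinks, the toy model's guess about what hides under the noise drifts towards the clean latent, which is the same pressure a real UNET feels across millions of images and prompts.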

We’ve taught the machine to sculpt by extracting the “truth” from within the statistical trash. To help you visualize it, I’ve summarized the training process in the diagram below.

The leap into magic: generating images out of thin air

Now, what happens when the model is already trained and we want to use it in the real world? This is where the process takes a radical turn.

When we, sitting comfortably at home, open the model and type the prompt “Darth Vader, intensely focused, ironing a Hawaiian shirt with a flamingo print”, the image generation is mind-blowing. First off, we don’t need to provide any input image. We can completely forget about uploading actual photos from the George Lucas films; since our text is the only input, we bypass the encoder entirely.

Instead, the software internally creates our raw block of marble: a latent composed of entirely random noise. In other words, pure mathematical chaos and visual static. Meanwhile, our absurd and brilliant phrase passes through CLIP’s text encoder to become an embedding.

Next, our UNET network, now a seasoned artist, receives that block of random noise on one side and the mathematical instructions from the text on the other. Applying everything it learned during its gruelling training phase, it gets to work. Step by step, it eliminates everything from that random noise that doesn’t look like a fearsome galactic villain wrestling with the wrinkles in his tropical attire.

Once it’s finished cleaning, it outputs a pristine, noise-free latent. All that’s left is to pass that clean latent through the decoder to transform it into the spectacular (and ridiculous) final image.
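The whole generation loop fits in a few lines if we fake the hard parts. In the sketch below, the dictionary is a hypothetical stand-in for "CLIP embedding plus trained UNET", the denoiser simply treats anything that differs from the learned latent as noise, and the step count, strength and seed are arbitrary values chosen for illustration. A real system would finish by running the decoder, which is omitted here.

```python
import random

# Hypothetical lookup standing in for the text encoder and the trained UNET:
# it maps a prompt to the clean latent the model has learned to favour.
LEARNED_LATENTS = {
    "darth vader ironing": [0.8, -1.2, 0.3, 1.5],
}

def predict_noise(latent, prompt):
    """Toy denoiser: whatever differs from the learned latent counts as noise."""
    target = LEARNED_LATENTS[prompt]
    return [x - t for x, t in zip(latent, target)]

def generate(prompt, steps=50, strength=0.2, seed=42):
    rng = random.Random(seed)
    # The raw block of marble: a latent made of random noise, no input image.
    latent = [rng.gauss(0.0, 1.0) for _ in LEARNED_LATENTS[prompt]]
    for _ in range(steps):
        noise = predict_noise(latent, prompt)
        # Chip away a small fraction of the predicted noise at each step.
        latent = [x - strength * n for x, n in zip(latent, noise)]
    return latent   # a real pipeline would now pass this to the decoder

print([round(v, 2) for v in generate("darth vader ironing")])
```

Note that the encoder never appears: the only inputs are the prompt and the random starting latent.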

And here lies the great secret and the genius of the whole thing: the resulting image is entirely new and does not exist in the real world. The AI isn’t searching Google for images of Darth Vader to clumsily cut and paste them onto a summer clothing catalogue; the origin of the image is simply that latent of random noise. It is literally extracting the composition from an ocean of statistical chaos, giving birth to a unique and unrepeatable creation.

We’re leaving…

So, there you have it. We’ve dissected the magic trick. From the bilingual dictionary that is CLIP, through the extreme compression of the autoencoder, to the UNET, that cybernetic sculptor capable of staring into a sea of static and extracting a masterpiece.

It is fascinating to see that the most cutting-edge digital creation of our era is based, in essence, on the noble and ancient art of removing the trash until what is left makes sense. As it turns out, Michelangelo wasn’t that far off. He was just born five hundred years too early to see a computer prove him right.

However, before we finish this post, scientific honesty compels me to confess a small white lie I’ve slipped in throughout the text. I mentioned that the resulting image is “unique and unrepeatable” because it arises from chaos. Well, it’s not quite that unique, and the chaos isn’t all that chaotic.

Computers, by their very rigidly structured nature, are terrible at improvising. They are logical machines, and an algorithm on its own cannot produce true randomness. What they do instead is generate pseudo-random noise from a starting number called a seed.

This means that if we type the same Darth Vader prompt, keep the same parameters, and use the exact same numerical seed, the computer will generate the same initial noise pattern. And if we start with the same block of static, the UNET will mathematically chisel it point by point to return the exact same image we thought was unrepeatable. Pure deterministic math masquerading as artistic bohemianism.
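You can watch the trick happen with Python's built-in pseudo-random generator. This is a minimal sketch: the seed values and the four-number "latent" are arbitrary, and image generation tools use their own generators, but the principle is the same.

```python
import random

def initial_noise(seed, size=4):
    """Build the starting block of static from a given seed."""
    rng = random.Random(seed)   # pseudo-random generator tied to the seed
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

same_a = initial_noise(seed=1234)
same_b = initial_noise(seed=1234)
other = initial_noise(seed=1235)

print(same_a == same_b)   # True: same seed, same "chaos"
print(same_a == other)    # False: a different seed gives different noise
```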

Now that we’ve debunked the myth of absolute chaos and know how this technological marvel works, we must admit we’ve barely scratched the surface. We haven’t dived into how the deeper layers of these neural networks operate, nor their astonishing (and worrying) ability to invent reality with total confidence, a system failure that engineers elegantly call hallucinations.

For instance, we could try to explain why an AI capable of generating hyper-realistic reflections on a Sith Lord’s armour suffers from these very hallucinations and can collapse when trying to draw a simple human hand without giving it seven sausage-shaped fingers. But that’s another story…
