Diffusion Models Explained: The Math-Free Guide to How Stable Diffusion and DALL-E Work
Introduction
In the span of just a few years, AI-generated images went from a niche curiosity to a technology that genuinely fools the human eye. Type a sentence into a text box and seconds later you have a photorealistic oil painting, a surrealist fantasy landscape, or a product photograph that never existed. The technology making this possible, in almost every major system from Stable Diffusion to DALL-E 2 to Midjourney, is called a diffusion model.
The name sounds technical, and the original papers are dense with probability theory. But the underlying idea is one of the most intuitive in all of machine learning. This guide strips away the math and gives you a clear mental model of what is actually happening when you press "generate." You will understand why these systems produce such high-quality images, why they are slow, why your prompt wording matters so much, and why this approach beat a decade of competing research.
No prior knowledge of neural networks is required, though familiarity with the general idea of machine learning (a model learns from examples) will help.
Problem Statement: What Came Before, and Why It Was Hard
Before diffusion models dominated the field, two approaches shared the spotlight: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Both had real strengths, and both had frustrating, sometimes fundamental weaknesses.
GANs work through competition. A generator network tries to produce convincing fake images, and a discriminator network tries to catch them. They train together, each improving in response to the other, like a forger and an art detective locked in an arms race. When it works, the results are spectacular. GAN-generated faces reached photorealistic quality years before diffusion models existed. But training a GAN is notoriously fragile. The generator and discriminator can fall into unstable feedback loops. One of the most common failure modes is called mode collapse, where the generator learns to produce only a narrow range of outputs that reliably fool the discriminator, ignoring the full diversity of the real data. Getting a GAN to produce a wide variety of high-quality images across many categories, rather than a narrow slice of them, was a persistent unsolved problem.
VAEs take a different approach. They compress images into a compact numerical summary (a latent vector) and then learn to reconstruct them. Because they explicitly model uncertainty in that compression, you can sample from the learned space to generate new images. VAEs are stable to train and produce diverse outputs, but the images tend to be blurry. The compression step throws away detail, and the reconstruction step cannot recover it perfectly.
Autoregressive models, the kind that power text generation, were also applied to images by generating them one pixel (or patch) at a time. This produced high-quality results but was extremely slow, and scaling to high resolutions was computationally punishing.
In short: the field could get sharpness without diversity (GANs at their best), diversity without sharpness (VAEs), or quality at the cost of speed (autoregressive). Diffusion models, introduced in their modern form by Ho et al. in 2020, found a way to get sharpness and diversity together, with stable training, by reframing the problem entirely. Speed remained the price, as the limitations discussed later make clear.
Core Concepts and Terminology
| Term | Plain English Definition |
|---|---|
| Diffusion process | The overall framework: gradually destroy an image by adding noise, then train a model to reverse that destruction. |
| Forward process | The noise-adding direction. A real training image is progressively corrupted, step by step, until it is indistinguishable from random noise. The model does not learn this; it is a fixed mathematical procedure. |
| Reverse process | The direction the model learns. Starting from pure noise, the model predicts and removes a small amount of noise at each step, gradually revealing a coherent image. |
| Noise schedule | A plan that controls how much noise is added at each step of the forward process, typically starting with very little and ramping up until the image is completely destroyed. |
| Denoising | The act of predicting and subtracting noise from a partially noisy image. This is what the neural network learns to do. |
| U-Net | The architecture most commonly used for the denoising network. It has an encoder that compresses the noisy image and a decoder that rebuilds it, with shortcut connections that preserve fine-grained detail. |
| Latent diffusion | A faster variant where the diffusion process happens in a compressed latent space rather than on full-resolution pixels. Stable Diffusion uses this approach, which is why it is more efficient than operating directly on pixels. |
| CLIP | A model from OpenAI trained to understand the relationship between images and text. In text-to-image systems, CLIP (or a similar encoder) converts your text prompt into a numerical representation that guides the denoising network. |
| Conditioning | The mechanism by which external information, such as a text prompt, an edge map, or a reference image, is fed into the denoising network to steer what kind of image gets generated. |
| Classifier-free guidance | A technique that strengthens the influence of your prompt on the generated image. The model runs two denoising predictions at each step, one with the prompt and one without, and amplifies the difference. Higher guidance scale means stronger prompt adherence, but too high and quality suffers. |
How It Works: The Four Phases
Understanding diffusion models means understanding four distinct phases: forward corruption during training data preparation, the training objective itself, inference (generating new images), and text conditioning. Each phase builds on the last.
Phase 1: The Forward Process (Destroying Images to Build a Teacher)
Imagine you have a beautiful photograph of a mountain at dawn. Now imagine someone sprinkles a light dusting of television static over it. You can still make out the mountain, but there is noise. They add more static. More still. After hundreds of rounds, the original photograph is completely buried. You are left with nothing but static.
This is the forward process. For every image in the training dataset, the system can produce a long sequence of progressively noisier versions of that image, from the original all the way to pure random noise. The crucial insight is that every step of this destruction is precisely known. At step 50 out of 1,000, you know exactly how much noise was added and exactly what the partially noisy image looks like. This is not learned; it is a fixed recipe. In practice the whole sequence never needs to be stored: the recipe lets training code jump straight to any noise level in a single calculation.
This gives the training process an enormous, free supply of labelled examples: for every image at every noise level, we know exactly what noise was added.
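For readers who like to see ideas as code, the forward process can be sketched in a few lines. This is a minimal illustration, not any system's actual implementation: the function names are invented for this example, the four-number "image" is a stand-in for real pixels, and the schedule values (1,000 steps, rising linearly from 0.0001 to 0.02) are assumed here because they match a commonly used configuration.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise amounts, rising linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """For each step, the fraction of the original image's signal still left."""
    kept, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        kept.append(running)
    return kept

def add_noise(pixels, t, signal_kept, rng):
    """Jump straight to noise level t by mixing the image with fresh Gaussian noise."""
    keep = math.sqrt(signal_kept[t])
    spoil = math.sqrt(1.0 - signal_kept[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [keep * p + spoil * n for p, n in zip(pixels, noise)]
    return noisy, noise  # the noise is the "label" the model later learns to predict

signal_kept = cumulative_signal(linear_beta_schedule())
rng = random.Random(0)
image = [0.8, -0.2, 0.5, 0.1]  # stand-in for pixel values scaled to [-1, 1]

# Index 49 is step 50 of 1,000: most of the image survives at this point.
noisy, noise = add_noise(image, 49, signal_kept, rng)
print("signal kept at step 50:", round(signal_kept[49], 3))
print("signal kept at step 1,000:", signal_kept[-1])
```

Because both the noisy image and the exact noise are returned together, every call produces one labelled training example for free.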
Phase 2: Training (Teaching the Model to See Through Noise)
The neural network, typically a U-Net, is handed a noisy image and told what noise level it is at. Its job is to predict the noise that was added. If it can do that accurately, it can subtract the noise and recover a cleaner version.
Think of it like an art restoration expert who has seen thousands of damaged paintings. They have learned the patterns of how deterioration works, what canvas looks like under grime, what brush strokes suggest beneath a layer of varnish. Given a damaged painting and told roughly how degraded it is, they can make educated guesses about what to clean away.
Because we have millions of training images and hundreds of noise levels per image, the model sees hundreds of millions of training examples and builds a deep, rich understanding of what makes images look coherent. Importantly, there is no adversary, no discriminator, and no fragile balancing act. The loss function is straightforward: how close was the model's noise prediction to the actual noise? This stability is one reason diffusion models train so reliably.
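The loss described above is simple enough to write out. The sketch below is only an illustration under stated assumptions: `lazy_model` is a hypothetical stand-in for an untrained network, the four-number image is made up, and a real system would compute the same kind of error over batches of full images.

```python
import math
import random

def mse(predicted, actual):
    """Mean squared error between the predicted noise and the noise actually added."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def training_example(image, signal_kept, rng):
    """Corrupt an image at one noise level and return (noisy image, true noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    keep, spoil = math.sqrt(signal_kept), math.sqrt(1.0 - signal_kept)
    noisy = [keep * p + spoil * n for p, n in zip(image, noise)]
    return noisy, noise

def lazy_model(noisy, signal_kept):
    """Hypothetical untrained network: always guesses that no noise was added."""
    return [0.0 for _ in noisy]

rng = random.Random(0)
image = [0.8, -0.2, 0.5, 0.1]
signal_kept = rng.uniform(0.05, 0.95)  # a randomly chosen noise level
noisy, true_noise = training_example(image, signal_kept, rng)

loss_lazy = mse(lazy_model(noisy, signal_kept), true_noise)
loss_perfect = mse(true_noise, true_noise)  # what a flawless prediction would score
print("loss for the lazy guess:", loss_lazy)
print("loss for a perfect guess:", loss_perfect)
```

Training consists of nudging the network so that this single number goes down, which is why no second, competing network is needed.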
Phase 3: Inference (Sculpting from Static)
Once trained, the model can generate new images from scratch. You start with an image of pure random noise, the equivalent of a block of unmarked marble. You ask the model: "if this were a noisy image at step 1,000, what noise would you predict?" The model makes a prediction, you subtract a small amount of that predicted noise, and you have a slightly less noisy image. Repeat this for all 1,000 steps and, like a sculptor progressively revealing a form, a coherent image emerges.
At first the image still looks like noise, then like a blurry blob with vague structure. By the midpoint you might see rough shapes and colour zones. In the final steps, fine details snap into focus: textures, edges, facial features. The process is like developing a photograph in a darkroom, where the image gradually materialises out of the chemical solution.
This sequential nature is why diffusion models are slow: each step needs the output of the one before it. Recent research has cut step counts dramatically, from 1,000 to a few dozen with faster samplers, and to as few as 4 or 8 with models specially distilled for the purpose.
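The generation loop can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: the "training set" is a single four-number image, so the ideal noise prediction can be computed directly and `toy_denoiser` stands in for a trained network; the schedule values are assumed; and the update rule is a deterministic, DDIM-style step rather than the original 1,000-step procedure.

```python
import math
import random

# Fixed noise schedule (assumed values: 1,000 steps, linear from 0.0001 to 0.02).
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
signal_kept, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    signal_kept.append(running)

TARGET = [0.8, -0.2, 0.5, 0.1]  # the only "image" this toy model has ever seen

def toy_denoiser(noisy, t):
    """Hypothetical stand-in for a trained network: predicts the noise in `noisy`.
    With a one-image training set the ideal answer can be computed exactly."""
    keep, spoil = math.sqrt(signal_kept[t]), math.sqrt(1.0 - signal_kept[t])
    return [(x - keep * x0) / spoil for x, x0 in zip(noisy, TARGET)]

def sample(num_steps, rng):
    """Start from pure noise and denoise in `num_steps` deterministic steps.
    Assumes num_steps divides 1,000 evenly."""
    stride = 1000 // num_steps
    timesteps = list(range(999, -1, -stride))
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for i, t in enumerate(timesteps):
        eps = toy_denoiser(x, t)
        keep, spoil = math.sqrt(signal_kept[t]), math.sqrt(1.0 - signal_kept[t])
        clean_guess = [(xi - spoil * e) / keep for xi, e in zip(x, eps)]
        # Noise level to land on next; after the final step no noise remains.
        nxt = signal_kept[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = [math.sqrt(nxt) * c + math.sqrt(1.0 - nxt) * e
             for c, e in zip(clean_guess, eps)]
    return x

result = sample(num_steps=50, rng=random.Random(0))
print([round(v, 3) for v in result])
```

A real model has seen millions of images, so its predictions are imperfect guesses rather than exact answers, and the loop lands on a new image instead of reproducing a memorised one. The structure of the loop is the same.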
Phase 4: Text Conditioning (How Your Prompt Steers the Process)
A diffusion model trained only on images without any guidance will generate images at random. To steer it toward a specific subject, you need conditioning.
In systems like Stable Diffusion and DALL-E 2, your text prompt is passed through a text encoder, most often one trained with CLIP, which converts the words into a rich numerical representation. This representation is fed into the U-Net at every denoising step, nudging the predicted noise in a direction that makes the emerging image more consistent with the prompt.
Think of it as the sculptor having a reference photograph on the table while they work. Each time they pick up the chisel, they glance at the reference and make sure the form they are revealing is moving toward the intended subject. The guidance scale controls how tightly they follow that reference. At a low guidance scale, the sculptor feels free to improvise. At a high guidance scale, they stick closely to the reference, sometimes at the cost of a natural, flowing finish.
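The guidance scale has a compact form in code. The sketch below follows the usual description of classifier-free guidance (take the prediction made without the prompt and move toward the prediction made with it); the function name and the three-number predictions are made up for illustration.

```python
def guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale):
    """Start from the unprompted prediction and move `guidance_scale` times
    the distance toward the prompted prediction."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_without_prompt, noise_with_prompt)]

unprompted = [0.10, -0.30, 0.50]  # made-up prediction with an empty prompt
prompted = [0.20, -0.10, 0.40]    # made-up prediction with the real prompt

ignore_prompt = guided_noise(unprompted, prompted, 0.0)  # scale 0: prompt has no effect
follow_prompt = guided_noise(unprompted, prompted, 1.0)  # scale 1: plain prompted prediction
amplified = guided_noise(unprompted, prompted, 7.5)      # scale 7.5: difference exaggerated
print(amplified)
```

At scale 1 the result is simply the prompted prediction. Larger values overshoot it, which is the sculptor following the reference more rigidly than the reference itself demands.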
Practical Example: "A Red Fox Sitting in a Snowy Forest at Sunset"
Let us walk through exactly what happens, step by step, when you type this prompt into a system like Stable Diffusion.
- Prompt encoding. Your text is tokenised and passed through the CLIP text encoder. The output is a sequence of vectors, each capturing the meaning and relationships between the words: red, fox, sitting, snowy, forest, sunset, and the relationships between them.
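- Sampling the starting noise. The system draws a random sample of pure Gaussian noise. This is your blank canvas. Every value in it is an independent random number (in Stable Diffusion, these are values in the compressed latent grid rather than visible pixels). There is no image here yet.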
- First denoising step. The U-Net receives the noisy canvas, the CLIP encoding of your prompt, and the current timestep (1,000 out of 1,000). It predicts the noise component. Because of the prompt conditioning, the predicted noise is not neutral; it is biased toward removing noise in ways that would move the remaining signal toward a fox in a snowy setting.
- Gradual refinement. Over many steps (say, 50 steps with a modern sampler), the same process repeats. By step 15 or so, you might see an orange-tinged blob against a pale background. By step 30, the shape of an animal begins to distinguish itself. By step 45, fur texture, snow detail, and the warm glow of a low sun start to appear.
- Latent to pixel space. In Stable Diffusion specifically, all of the above happens in a compressed latent space (roughly 64x64 for a 512x512 output). Once denoising is complete, a separate decoder network (the VAE decoder) expands this compressed representation back to full-resolution pixels, recovering fine texture and colour detail.
- Final image. You see a 512x512 (or higher) image of a fox in a winter forest, lit by sunset light, that did not exist before you pressed generate.
The entire process, from random noise to rendered image, typically takes one to ten seconds on modern hardware, depending on step count and resolution.
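The saving from working in latent space can be checked with simple arithmetic. The figures below assume the commonly cited Stable Diffusion v1 layout (each side shrunk by a factor of 8, with 4 latent channels); other model versions may differ, and the function names are invented for this sketch.

```python
def latent_shape(width, height, downsample=8, latent_channels=4):
    """Shape of the compressed grid the denoiser works on (assumed SD v1 layout)."""
    return (latent_channels, height // downsample, width // downsample)

def count_values(shape):
    """Total number of values in a grid of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_shape = (3, 512, 512)         # red, green and blue channels
latent = latent_shape(512, 512)     # (4, 64, 64)
ratio = count_values(pixel_shape) / count_values(latent)
print(latent, "is", ratio, "times fewer values than the pixel image")
```

Every denoising step therefore touches far fewer numbers than it would on raw pixels, and only the final decoding step works at full resolution.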
Advantages: Why Diffusion Models Beat GANs
For years, GANs held the crown for image generation quality. Diffusion models displaced them for several interconnected reasons.
Training Stability
GANs require careful balancing of two networks that are in direct competition. If the discriminator gets too strong too fast, the generator receives no useful gradient signal and stops learning. If the generator improves too quickly, the discriminator collapses. Practitioners spend enormous effort tuning learning rates, regularisation techniques, and architectural choices just to keep training from diverging.
Diffusion models have none of this. The training objective, predict the noise that was added, is a straightforward supervised learning problem. There is a single network, a single loss, and gradients flow cleanly. Training a diffusion model is about as stable as training a standard image classifier.
Mode Coverage and Diversity
Because GANs optimise for fooling a discriminator, they are prone to finding and exploiting gaps in the discriminator's knowledge, rather than learning a complete model of the data distribution. Mode collapse, where the generator produces only a subset of the possible outputs, is a persistent problem.
Diffusion models learn to model the full data distribution by training on all noise levels simultaneously. They must learn what coherent images look like across all scales, from broad composition to fine texture. The result is dramatically better diversity: ask for "a dog" and you might get a poodle, a labrador, a terrier, a cartoon dog, or a painterly dog, not the same GAN-optimal dog face every time.
Image Quality and Resolution
When combined with latent diffusion (operating in compressed space) and large-scale training, diffusion models produce images that surpass the sharpest GANs on standard benchmarks and, perhaps more importantly, hold up to close human inspection. The iterative refinement process allows the model to add detail progressively, without having to commit to fine structure before the broader composition is established.
Controllability
Because conditioning is built into the architecture at a fundamental level, diffusion models accept a rich variety of guidance signals: text prompts, reference images, depth maps, edge maps, pose skeletons. ControlNet extensions, for example, allow you to specify the exact pose of a figure while letting the model freely generate the appearance. This kind of fine-grained control was significantly harder to achieve with GANs.
Limitations and Trade-offs
Diffusion models are not without significant costs and weaknesses.
Slow Inference
Generating one image requires running the neural network many times, once per denoising step (and often twice per step when classifier-free guidance is used). Compare this to a GAN, which makes a single forward pass. Even with modern fast samplers (DDIM, DPM-Solver) and distilled models such as LCM that reduce step counts from 1,000 to 20 or fewer, diffusion models are still fundamentally sequential. Each step depends on the result of the previous one, so the steps for a single image cannot run in parallel.
Compute Cost
Training a large diffusion model requires enormous computational resources. Stable Diffusion's original training run reportedly cost hundreds of thousands of dollars in GPU time. Running inference, while cheap per image on consumer hardware, becomes expensive when generating thousands of images for commercial applications.
Prompt Sensitivity
Small changes in wording can produce dramatically different outputs. Adding or removing a single word, reordering phrases, or using synonyms can shift the image significantly. This makes diffusion models powerful but somewhat unpredictable for users who have not developed intuition for prompt engineering. The relationship between prompt and output is not always transparent or consistent.
Memorisation Concerns
Research has shown that diffusion models can, in certain conditions, reproduce near-exact copies of training images, particularly for images that appeared many times in the training set. This raises intellectual property and privacy concerns, especially for models trained on internet-scraped data without explicit consent from image creators. The legal and ethical landscape around this remains unsettled.
Compositionality Failures
Diffusion models sometimes struggle with prompts that require precise spatial relationships or counting. "Three red balls on a blue shelf with a green lamp to the left" may produce something that captures the gist but misplaces elements. Compositional reasoning, which comes naturally to language models, does not translate perfectly to the image generation process.
Common Mistakes
Misunderstanding What "Steps" Means
Many new users assume that more steps always means better quality, without limit. In practice, returns diminish quickly. Going from 10 to 30 steps makes a large visual difference. Going from 50 to 200 steps in most samplers makes almost no perceptible difference and just wastes time. The right step count depends on the sampler being used: DDIM and DPM-Solver converge faster than the original DDPM sampler.
Over-Prompting and Under-Prompting
Over-prompting means stuffing your prompt with every adjective and style keyword you can think of, hoping more instructions equals better results. In practice, overly long prompts can cause the model to pay uneven attention to different parts, sometimes ignoring important elements entirely. Under-prompting means giving so little information that the model defaults to its most average interpretation. Effective prompts are specific where it matters and concise where detail is not needed.
Treating Guidance Scale as "Quality"
Guidance scale is often described as a "quality" or "prompt adherence" slider, which leads users to push it to extreme values. Very high guidance scale (above 15 or 20, depending on the model) tends to produce over-saturated, artificial-looking images with distorted details, because the model is being pushed too hard away from naturalness and toward prompt matching. A guidance scale between 7 and 12 is a reasonable starting range for most models.
Using the Wrong Model for the Task
Different models have different strengths. A model fine-tuned for photorealism will produce poor anime-style images. A model fine-tuned for concept art may not produce accurate text overlays. Using the base Stable Diffusion model for a task that a specialised fine-tune handles much better is a common mistake when starting out.
Ignoring the Negative Prompt
The negative prompt field in most UIs tells the model what to avoid generating. Ignoring it means accepting whatever artifacts, watermarks, or compositional issues the model defaults to. Using a basic negative prompt like "blurry, low quality, deformed hands, watermark" can substantially improve output quality with no extra effort.
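One common way front ends implement the negative prompt, assumed here rather than guaranteed for every tool, is to use the negative prompt's prediction in place of the empty-prompt prediction inside classifier-free guidance. The numbers and names below are made up for illustration.

```python
def guided_noise(baseline, prompted, guidance_scale):
    """Move away from the baseline prediction and toward the prompted one."""
    return [b + guidance_scale * (p - b) for b, p in zip(baseline, prompted)]

prompted = [0.20, -0.10]         # made-up prediction for the main prompt
empty_prompt = [0.00, 0.00]      # made-up prediction for an empty prompt
negative_prompt = [0.10, 0.10]   # made-up prediction for "blurry, watermark"

plain = guided_noise(empty_prompt, prompted, 7.5)
with_negative = guided_noise(negative_prompt, prompted, 7.5)
print("without a negative prompt:", plain)
print("with a negative prompt:   ", with_negative)
```

Because guidance amplifies the gap between the two predictions, swapping in the negative prompt makes each step push away from whatever that prompt describes.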
Best Practices
Choosing Step Count
Start with 20 to 30 steps for rapid iteration when exploring prompts. Increase to 40 to 50 for final outputs. With LCM (Latent Consistency Models) or Turbo variants, 4 to 8 steps can produce surprisingly strong results. Avoid spending compute budget on step counts above 50 unless you are using a specific sampler known to benefit from them.
Setting Guidance Scale
For photorealistic models, try guidance scale 7 to 9 as a default. For artistic or stylised models, 5 to 7 often feels more natural. If your image looks plastic, oversaturated, or has strange edge artifacts, lower the guidance scale before trying anything else.
Model Selection: Stable Diffusion vs DALL-E vs Midjourney
| System | Best For | Key Strength | Key Weakness |
|---|---|---|---|
| Stable Diffusion (open-source) | Custom workflows, fine-tuning, local use | Fully open, extensible, large community ecosystem of fine-tunes | Requires technical setup; quality varies widely by model version |
| DALL-E 3 (OpenAI) | Prompt-accurate generation, text in images | Very strong prompt-following; handles complex instructions well | Closed (hosted service and API only); less stylistic flexibility |
| Midjourney | Aesthetic, editorial, and artistic images | Consistently beautiful default outputs; strong stylistic coherence | Less controllable; Discord-based interface; closed |
| Adobe Firefly | Commercial use with IP safety | Trained on licensed content; safe for commercial projects | More conservative outputs; less cutting-edge quality |
Using ControlNet for Compositional Control
If you need control over the layout of an image rather than just the content, ControlNet extensions for Stable Diffusion let you provide a skeleton, depth map, or edge map that the model must respect. This is the most reliable way to specify exact spatial arrangement without fighting the model's own compositional tendencies.
Seeding for Reproducibility
Every image generation starts from a random noise sample. Setting a fixed seed lets you reproduce a result, provided the model, sampler, settings, and software stay the same, or vary just one element (the prompt, the guidance scale) while keeping everything else constant. This is invaluable for iterative refinement.
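The effect of a seed can be shown with Python's built-in random number generator. This is only a sketch of the idea: `starting_noise` is a hypothetical helper, and real tools draw a much larger grid of values using their own generators.

```python
import random

def starting_noise(seed, size=4):
    """Draw the initial noise 'canvas' from a generator fixed by the seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

first_run = starting_noise(seed=42)
second_run = starting_noise(seed=42)
other_seed = starting_noise(seed=43)

print("same seed, same canvas:", first_run == second_run)
print("different seed, same canvas:", first_run == other_seed)
```

Since every later step is computed from this starting canvas, fixing it removes the main source of run-to-run variation.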
Comparison: Diffusion vs GAN vs VAE vs Autoregressive
| Property | Diffusion Model | GAN | VAE | Autoregressive (e.g. DALL-E 1) |
|---|---|---|---|---|
| Image Quality | Very high; can be photorealistic and holds up to close inspection | High; best GANs are photorealistic | Moderate; tends toward blurriness | High for its era; can be sharp |
| Diversity | Very high; covers the full data distribution well | Low to moderate; mode collapse is common | High; samples from a well-defined latent space | High; sequential generation naturally explores diversity |
| Training Stability | High; single supervised objective, no adversarial games | Low; adversarial balance is fragile | High; straightforward reconstruction loss | High; standard cross-entropy training |
| Inference Speed | Slow; tens to hundreds of sequential neural network calls | Fast; single forward pass | Fast; single forward pass | Very slow; generates one token at a time |
| Controllability | Very high; rich conditioning (text, image, depth, pose) | Moderate; conditioning possible but complex | Moderate; latent space interpolation works well | Moderate; token-level control of attributes |
| Notable Systems | Stable Diffusion, DALL-E 2/3, Imagen, Midjourney (architecture unconfirmed) | StyleGAN, BigGAN, CycleGAN | VQ-VAE, early image synthesis experiments | DALL-E 1, ImageGPT, PixelCNN |
Frequently Asked Questions
Is Midjourney a diffusion model?
Midjourney has not published technical details about its architecture, so we cannot say with certainty. However, the behaviour of its outputs (the iterative refinement visible when you watch a generation, the response to prompt guidance, and the general output characteristics) is consistent with a diffusion-based approach. The overwhelming majority of production text-to-image systems built after 2022 use diffusion as their core mechanism, and Midjourney almost certainly does too, possibly with proprietary modifications.
Why do more steps improve quality up to a point?
Each denoising step is an approximation. The model predicts the noise at the current noise level, removes a portion of it, and hands off to the next step. With very few steps, each approximation is large and can accumulate errors, leading to artifacts and incoherence. With more steps, each individual approximation is smaller and more accurate. Beyond a certain threshold, the approximations are already accurate enough that adding more steps does not meaningfully reduce error, which is why quality plateaus. The exact threshold depends on the sampler: some samplers are mathematically designed to converge faster and require fewer steps.
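The plateau can be seen in a much simpler setting. The sketch below is an analogy, not a diffusion sampler: it follows a smooth, known curve in a fixed number of straight-line steps (Euler's method) and measures how far the endpoint lands from the true answer. The equation and step counts are chosen purely for illustration.

```python
import math

def euler_estimate(num_steps):
    """Follow dx/dt = -x from x = 1 over one unit of time in equal straight-line steps."""
    x, step_size = 1.0, 1.0 / num_steps
    for _ in range(num_steps):
        x += step_size * (-x)
    return x

exact = math.exp(-1.0)  # the true endpoint of the curve
errors = {n: abs(euler_estimate(n) - exact) for n in (5, 10, 50, 200)}

gain_10_to_50 = errors[10] - errors[50]
gain_50_to_200 = errors[50] - errors[200]

for n, err in errors.items():
    print(f"{n:>3} steps: error {err:.5f}")
```

The error keeps shrinking as steps are added, but the improvement from 50 to 200 steps is far smaller than the improvement from 10 to 50. Faster samplers play the role of smarter stepping rules that reach a small error with fewer steps.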
What is LoRA for image models?
LoRA stands for Low-Rank Adaptation. It is a fine-tuning technique that allows you to teach a pre-trained model new concepts (a specific person's face, a particular art style, a custom object) without retraining the entire model. Instead of updating all of a model's parameters, which can number in the hundreds of millions or billions, LoRA adds a small set of new parameters that modify specific layers. The resulting LoRA file is small compared to the full model (often a few megabytes to a few hundred megabytes, against several gigabytes). You can download community-created LoRAs to add a character, a painting style, or a photography aesthetic to an otherwise general-purpose base model.
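The size saving comes from a simple trick: instead of storing a full grid of adjustments for a layer, LoRA stores two thin grids whose product gives the adjustment. The sketch below uses an assumed layer size of 1,024 by 1,024 and an assumed rank of 8; real layers and ranks vary, and the helper names are invented.

```python
def full_update_params(rows, cols):
    """Numbers needed to adjust every weight in a layer directly."""
    return rows * cols

def lora_params(rows, cols, rank):
    """Numbers needed for two thin matrices: (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

full = full_update_params(1024, 1024)
small = lora_params(1024, 1024, rank=8)
print("full update:", full, "values; LoRA:", small, "values;", full // small, "times fewer")

# Tiny demonstration: 6 stored numbers expand into a 3 x 3 grid of adjustments.
down = [[1.0], [2.0], [3.0]]   # 3 x 1
up = [[0.5, 0.0, -0.5]]        # 1 x 3
adjustment = matmul(down, up)  # 3 x 3, added to the layer's frozen weights
print(adjustment)
```

The base model's weights stay frozen, so the same base can be combined with many different LoRA files.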
Can diffusion models generate video?
Yes. Extending diffusion models to video is an active and fast-moving research area. Systems like Sora (OpenAI), Stable Video Diffusion, and others treat video frames as sequences and apply diffusion across both the spatial (pixel) and temporal (frame) dimensions. The core mechanism, learn to reverse a noising process, applies directly. The main challenge is the vastly increased computational cost: generating even a few seconds of video requires orders of magnitude more compute than a single image.
Are the images generated by diffusion models copyrightable?
This is an active legal question with no definitive global answer as of mid-2026. In the United States, the Copyright Office has held that purely AI-generated content without meaningful human authorship is not copyrightable, but that images where a human made substantial creative choices in the process may be eligible for some protection. The situation varies by jurisdiction. Additionally, lawsuits are ongoing in multiple countries regarding whether training on copyrighted images without consent constitutes infringement. Anyone using AI-generated images commercially should consult legal advice specific to their jurisdiction and intended use.
References
- Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS) 33. The original paper establishing the modern DDPM framework.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). The paper introducing latent diffusion, the foundation of Stable Diffusion.
- Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125. The DALL-E 2 paper describing the use of CLIP embeddings for text-conditioned diffusion.
- Song, J., Meng, C., and Ermon, S. (2020). Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502. Introduced DDIM, a faster sampler that reduced required inference steps from thousands to dozens.
- Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Karagol Ayan, B., Mahdavi, S. S., Gontijo Lopes, R., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS. The Imagen paper from Google Brain, demonstrating the importance of large language models for text understanding in image generation.
- Ho, J., and Salimans, T. (2022). Classifier-Free Diffusion Guidance. arXiv preprint arXiv:2207.12598. Introduced the guidance technique that most production systems use to balance prompt adherence and image quality.
Key Takeaways
- Diffusion models generate images by learning to reverse a carefully structured noise-adding process. The core loop is simple: destroy images with noise during training, learn to undo that destruction, then apply that knowledge starting from pure noise at inference time.
- The training stability of diffusion models, rooted in a straightforward supervised objective rather than an adversarial game, is a primary reason they outpaced GANs in quality, diversity, and reliability.
- Text prompts guide generation through a text encoder, such as CLIP, that translates language into a numerical representation. Classifier-free guidance amplifies the influence of this representation, and the guidance scale controls that amplification.
- Latent diffusion, used in Stable Diffusion, dramatically reduces compute by running the denoising process in a compressed space and only expanding to full resolution at the final step.
- The main trade-off is inference speed: sequential denoising steps cannot be parallelised, making image generation fundamentally slower than GAN alternatives, though modern samplers have reduced this cost significantly.
- Understanding step count, guidance scale, model selection, and negative prompting gives you practical leverage over outputs and helps you diagnose quality issues when they arise.