Artificial Intelligence (AI) has leapt from writing stories to painting pictures, unlocking a new dimension of creativity. With just a few words, you can summon images of almost anything your imagination can conjure: a “pirate penguin playing poker” or “a llama at a disco wearing sunglasses.” These are not stock photos or premade assets but entirely original creations. The result feels like wizardry, yet beneath the surface lies a remarkable synthesis of art and science.

Let’s start on the ground level – how can computers create images on demand?

Imagine a person who’s spent years studying pictures – millions of them. They’ve looked at countless photos, each labelled with descriptions like, “This is a duck,” “This is a duck in a hat,” or “This is a duck reading a book in a café.” Over time, they begin to notice patterns. Ducks have webbed feet. Hats usually go on heads. Books are often near people sitting down. They don’t just memorise the pictures; they internalise what these objects and scenes typically look like. 

Now, if someone were to ask them to draw “a duck reading a newspaper on a roller coaster,” they wouldn’t copy an image from memory. Instead, they’d mix the patterns they’ve learned together – drawing on what they know about ducks, roller coasters, and newspapers – to create something entirely new. It’s an educated guess based on patterns they’ve observed. 

But here’s the catch: they don’t truly understand what a duck or a roller coaster is. They don’t know why the duck shouldn’t be holding the newspaper upside down (unless someone specifically asked for that). Nor do they understand why roller coasters can’t just shoot into the sky infinitely or do 15 loop-the-loops back-to-back. Their approach is more about connecting words and visuals in a way that feels right, rather than comprehending the deeper meaning behind them.

This is also why AI can sometimes generate weird images of humans, animals, or objects – like hands with extra fingers, clocks with warped faces, or text that looks like an alien language. The root cause often lies in the data these models are trained on. The old adage “garbage in, garbage out” applies here. AI learns from massive datasets of images, but those datasets aren’t perfect. 

For instance, while there are countless photos of hands in these datasets, very few clearly show all five digits in natural poses. As a result, the AI develops a fragmented understanding of what a hand looks like. It knows hands have fingers, but not always how many or how they align.

Ultimately, this is how AI models like DALL-E, Stable Diffusion, and Midjourney operate. They don’t truly “know” what ducks, roller coasters, or hands are – they process millions of examples and learn to predict how elements of a prompt interact based on patterns in the data. When the data is incomplete or inconsistent, the quirks and limitations of these patterns show through.

Now that we’ve covered the basics, let’s look at the process in more detail. To truly understand how an AI image generator works, we’ll break it into distinct steps. Each stage plays a crucial role, transforming a simple text prompt into a fully realised visual creation. 

Parsing the Prompt 

The process begins with the input prompt – a description of the desired image. The AI parses this text to extract its meaning, breaking it into interpretable components like objects, relationships, styles, and contexts. 

For example, if you prompt the model with “a city floating on clouds in a cyberpunk style,” the AI identifies key elements: the objects (“city,” “clouds”), their relationships (“floating on”), and stylistic cues (“cyberpunk”).

Here’s the Techie Bits

To process the prompt, the text is tokenised into smaller units such as words or sub-words. These tokens are then converted into numerical vectors using a pre-trained text encoder – in modern image generators this is typically a transformer-based model such as CLIP’s text encoder, building on the same embedding idea behind earlier models like Word2Vec and BERT. These vectors encapsulate the semantic relationships between words, allowing the AI to understand, for example, that “city” is conceptually closer to “town” than to “forest.” 

A transformer-based architecture uses attention mechanisms to evaluate the importance of each word within the context of the entire prompt. For instance, “cyberpunk” modifies the visual attributes of “city” without influencing the appearance of “clouds.” This process results in a structured representation of the prompt, forming the foundation for subsequent image generation steps.
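
As a rough illustration of what this looks like in code, here is a minimal sketch using the open-source Hugging Face transformers library and OpenAI’s public CLIP checkpoint – one of several text encoders used by real image generators, chosen here purely as an example:

```python
# Minimal sketch of prompt tokenisation and encoding using CLIP's text
# encoder (an illustrative choice; different generators use different encoders).
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a city floating on clouds in a cyberpunk style"

# Tokenise the prompt into sub-word IDs.
inputs = tokenizer(prompt, return_tensors="pt", padding=True)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

# Encode the tokens; attention layers let every token "look at" the others,
# so each output vector is context-aware (e.g. "city" is shaded by "cyberpunk").
embeddings = text_encoder(**inputs).last_hidden_state
print(embeddings.shape)  # (1, number_of_tokens, hidden_size)
```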

From Words to Wonders 

Once the components of the prompt have been mapped, the model begins the process of generating the image itself. Historically, the first major breakthrough in image generation came from Generative Adversarial Networks (GANs), which laid the foundation for what we now see in modern tools. However, they’ve since been largely superseded by Diffusion Models, which offer greater flexibility and precision. 

The Era of GANs

GANs operate as a competition between two neural networks: the generator and the discriminator. The generator creates images starting from random noise, while the discriminator evaluates these images and determines whether they look real or fake. Based on the discriminator’s feedback, the generator refines its output over multiple iterations, improving with each step until the discriminator can no longer distinguish between real and generated images. 

For example, if tasked with generating human faces, the generator might initially produce crude, blurry shapes resembling a face. The discriminator would identify flaws – perhaps the eyes are mismatched, or the skin texture looks unnatural – and provide feedback. In the next iteration, the generator adjusts, improving features like symmetry or texture. After thousands of iterations, the generator produces faces so realistic that even the discriminator struggles to differentiate them from actual photographs.

Here’s the Techie Bits 

At the core of GANs is the adversarial relationship between the generator and discriminator. The generator transforms random noise into images using convolutional and deconvolutional layers, while the discriminator compares these generated images to real ones from the training data. 

This interaction is driven by adversarial loss functions, with the generator learning to “fool” the discriminator, and the discriminator learning to detect fakes. 
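
To make the adversarial loop concrete, here is a heavily simplified training step sketched in PyTorch. The tiny fully connected networks and flattened 28×28 images are illustrative placeholders rather than a realistic GAN architecture:

```python
import torch
import torch.nn as nn

latent_dim = 100
generator = nn.Sequential(          # turns random noise into a fake image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),             # 784 = a flattened 28x28 image
)
discriminator = nn.Sequential(      # scores an image as real or fake
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):
    """One adversarial update. real_images: (batch, 784) tensor scaled to [-1, 1]."""
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Train the discriminator: real images should score "real",
    #    generated images should score "fake".
    noise = torch.randn(batch, latent_dim)
    fakes = generator(noise)
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fakes.detach()), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator: it improves by making the discriminator
    #    label its fakes as real.
    g_loss = bce(discriminator(fakes), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```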

One significant challenge is mode collapse, where the generator produces repetitive outputs, such as creating only cats from a dataset of animals, neglecting the diversity of other species. To address this, techniques like mini-batch discrimination encourage variety by evaluating groups of generated images collectively, ensuring the generator explores a broader range of possibilities.  

Meanwhile, feature matching refines the generator’s objective, shifting its focus from merely deceiving the discriminator to accurately reproducing the patterns and textures that define real images.
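
Feature matching itself is compact enough to write down directly. In the sketch below, feature_extractor is assumed to be the discriminator with its final classification layer removed – a hypothetical helper used only to illustrate the idea:

```python
import torch

def feature_matching_loss(feature_extractor, real_images, fake_images):
    # Compare the *average* intermediate features of real and generated batches,
    # rather than the discriminator's final real-vs-fake verdict.
    real_feats = feature_extractor(real_images).mean(dim=0)
    fake_feats = feature_extractor(fake_images).mean(dim=0)
    return torch.mean((real_feats - fake_feats) ** 2)
```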

The Rise of Diffusion Models

Diffusion Models offer a more robust approach to image generation compared to GANs. Instead of relying on competition, they refine images step by step. The process begins with random noise, resembling static, which is gradually removed over many iterations until a coherent image emerges. 

During training, noise is added to images in small increments, and the model learns how visual features degrade – and, crucially, how to undo that degradation. When generating an image, the process runs in reverse: starting with pure noise, the model predicts and removes it step by step. This allows for precise refinement, closely aligning the image with the input prompt. 

For instance, generating “a city floating on clouds” might begin with blurry shapes hinting at buildings and clouds. As the process continues, details like glowing windows and soft cloud textures are added, culminating in a highly detailed and stylistically accurate final image.

Here’s the Techie Bits

Diffusion Models rely on a probabilistic framework to model the process of adding and removing noise. During training, images are gradually degraded by adding Gaussian noise in small increments. This forward process teaches the model how noise impacts visual features. The reverse process – used during generation – starts with noise and iteratively refines the image by predicting and removing this noise. 
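
In code, the forward (noising) process is short. The sketch below uses the standard trick from denoising diffusion models of jumping straight to any timestep t with a closed-form formula; the linear noise schedule is one common choice among several:

```python
import torch

T = 1000                                   # total number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # how much noise each step adds
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Noise a clean image x0 straight to timestep t:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps"""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise   # during training the model learns to predict `noise` from x_t
```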

The model employs a denoising network – typically a U-Net – to predict the noise present at each step, gradually revealing details as it is removed. Techniques like classifier-free guidance enhance alignment with prompts by amplifying features described in the input text. This ensures that key aspects, such as glowing windows or soft cloud edges, are preserved throughout the refinement process. 
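
Classifier-free guidance boils down to a single line of arithmetic: run the denoiser twice, once with the prompt and once without, then push the prediction towards the prompted version. In this sketch, denoiser, prompt_emb and empty_emb are hypothetical stand-ins for the model and its text embeddings:

```python
def guided_noise_prediction(denoiser, x_t, t, prompt_emb, empty_emb, guidance_scale=7.5):
    eps_cond = denoiser(x_t, t, prompt_emb)    # noise prediction conditioned on the prompt
    eps_uncond = denoiser(x_t, t, empty_emb)   # unconditional noise prediction
    # Amplify whatever the prompt changed; larger scales follow the text
    # more literally, at some cost to variety.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```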

The iterative nature of Diffusion Models not only improves precision but also allows errors introduced early in the process to be corrected in later steps, making them particularly effective for generating complex or stylistically rich images.
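
Putting it all together, this is roughly what running a diffusion model looks like with the open-source diffusers library, assuming a GPU and the publicly available Stable Diffusion 2.1 weights:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a city floating on clouds in a cyberpunk style",
    num_inference_steps=50,   # how many denoising iterations to run
    guidance_scale=7.5,       # strength of classifier-free guidance
).images[0]
image.save("cyberpunk_city.png")
```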

The Final Touches 

Once the image is generated, the process isn’t necessarily complete. In some cases, additional steps are taken to validate and refine the output. These involve checking the image for consistency with the prompt, enhancing quality, and sometimes introducing human feedback for high-stakes applications like advertising or entertainment. 

For example, if the output is “a cyberpunk city floating on clouds,” post-processing tools might upscale the resolution, refine textures, or adjust lighting to make the image more visually striking.

Here’s the Techie Bits

Validation often involves additional neural networks designed to critique and enhance the generated image. For upscaling, models like ESRGAN (Enhanced Super-Resolution GAN) are commonly used. These models analyse low-resolution outputs and predict high-frequency details, effectively adding sharpness and clarity to fine textures like neon lights or cloud edges. 
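
The upscaling step itself amounts to a single forward pass through a super-resolution network. In this sketch, sr_model is a hypothetical stand-in for an ESRGAN-style model loaded from whichever super-resolution library you use:

```python
import torch

def upscale(sr_model, low_res):
    """low_res: (1, 3, H, W) tensor in [0, 1]; returns e.g. (1, 3, 4*H, 4*W)."""
    sr_model.eval()
    with torch.no_grad():
        high_res = sr_model(low_res)   # the network predicts the missing high-frequency detail
    return high_res.clamp(0.0, 1.0)
```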

Fine-tuning may also include feedback loops, where an additional model (or human) evaluates the generated image against the original prompt. This is especially useful for ensuring stylistic alignment or correcting small inconsistencies.
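
One common way to automate the “does this match the prompt?” check is to score the image against the prompt with CLIP. A minimal sketch, assuming the Hugging Face transformers library and a locally saved image:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cyberpunk_city.png")          # the generated output
prompt = "a cyberpunk city floating on clouds"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
score = model(**inputs).logits_per_image.item()   # higher = closer match to the prompt
print(f"Prompt-image similarity: {score:.2f}")
```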

A Final Word 

AI image generation has opened a door to unprecedented creative possibilities, from GANs that brought lifelike realism to Diffusion Models that enable intricate and imaginative outputs. 

What’s striking is that while these systems lack human understanding of the world, they excel at crafting visuals that resonate, inspire, and communicate. They may not comprehend the emotions or stories behind their creations, but they provide tools that amplify our ability to express them. 

The future of image generation isn’t just about how AI sees the world but how it helps us see it in entirely new ways. 

About the author 

Nathan Marlor leads the development and implementation of data and AI strategies at Version 1, driving innovation and business value. With experience at a leading Global Integrator and Thales, he leveraged ML and AI in several digital products, including solutions for capital markets, logistics optimisation, predictive maintenance, and quantum computing. Nathan has a passion for simplifying concepts, focussing on addressing real-world challenges to support businesses in harnessing data and AI for growth and for good. 
