[HiDream, Flux.1 Krea] Three Years Since the Dawn of Image Generation AI! A Deep Dive into Running the Latest Models at High Speed.

An animated female character stands on a lit city street at night, sporting long brown hair, purple eyes, a black shirt, and gold earrings. A red car is parked nearby.
  • Load the Text Encoder into RAM
  • Use high-precision models and leverage caching
  • Boldly reduce the number of steps

Introduction

Hello, I’m Easygoing!

In this post, we’ll explore how to run the latest image generation AI models at high speed.

Anime-style illustration of a brown-haired woman smiling in a rainy alley with a red classic car

Image Generation AI Models Are Getting Bigger

The journey of local image generation AI began exactly three years ago on August 22, 2022, with the release of Stable Diffusion 1.


gantt
    title Local Image Generation AI
    dateFormat YYYY-MM-DD
    tickInterval 12month
    axisFormat %Y

        Stable Diffusion 1 :done, a1, 2022-08-22, 2025-08-23
        Stable Diffusion XL 1.0 :done, c2, 2023-07-27, 2025-08-23
        Stable Diffusion 3 : d1, 2024-06-12, 2025-08-23
        Flux.1   : d2, 2024-08-01, 2025-08-23
        HiDream   : d3, 2025-04-06, 2025-08-23
        Qwen-Image   : d4, 2025-08-04, 2025-08-23

Over the past three years, image generation AI has evolved significantly, but with that progress, model sizes have also grown massively.

Models That Don’t Fit in VRAM!

Currently, the NVIDIA RTX 5090, the highest-performing consumer GPU, has 32GB of VRAM. However, the full weights of Flux.1 and later models (Transformer plus Text Encoders) exceed 32GB, so even the RTX 5090 cannot hold an entire model in VRAM.

Image generation depends heavily on how VRAM is used, so the way a model is split and loaded into memory is the key to generating images efficiently.
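As a rough back-of-the-envelope check, the arithmetic below estimates the footprint of Flux.1 in FP16. The parameter counts are approximate public figures, so treat them as assumptions rather than exact numbers.

```python
# Rough VRAM estimate: parameter count x bytes per parameter.
# The parameter counts are approximate public figures (assumptions, not exact).
def model_size_gb(params_billion: float, bytes_per_param: int) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

flux_transformer = model_size_gb(12, 2)    # Flux.1 Transformer, ~12B parameters in FP16
t5_xxl_encoder   = model_size_gb(4.7, 2)   # T5-XXL Text Encoder, ~4.7B parameters in FP16

print(f"Transformer:  {flux_transformer:.1f} GB")    # ~22 GB
print(f"Text Encoder: {t5_xxl_encoder:.1f} GB")      # ~9 GB
print(f"Total:        {flux_transformer + t5_xxl_encoder:.1f} GB")
# Add CLIP-L, the VAE, and activations on top of this and the total passes 32 GB.
```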

Testing Environment

For this experiment, we’ll measure generation times in the following setup:

  • Windows 11
  • RTX 4060 Ti 16GB
  • RAM 64GB
Anime-style illustration of a blonde woman smiling in a rainy alley, wearing a black shirt

First and Second Generation Times Are the Same

Let’s start by generating an illustration using HiDream, the largest image generation AI model.

We’ll generate an image once, then change the prompt and generate a second image.

HiDream-I1-Dev, 1024 x 1024, 8 Steps

Night scene illustration of a computer monitor displaying the HiDream logo

In ComfyUI’s default workflow, both the Text Encoder and Transformer are loaded into VRAM and processed by the GPU.

  • Text Encoder ➡️ VRAM (contention)
  • Transformer ➡️ VRAM (contention)

Since HiDream’s model size exceeds VRAM capacity, contention occurs between the Text Encoder and Transformer for VRAM space.

Once a model is used, it’s retained in RAM, so the second generation doesn’t require loading from storage.

However, swapping the Text Encoder and Transformer between VRAM and RAM still occurs, resulting in the second generation taking the same amount of time as the first.

Loading the Text Encoder into RAM Doubles the Speed!

To avoid VRAM contention, let’s try explicitly loading the Text Encoder into RAM and processing it on the CPU.

Screenshot of QuadrupleCLIPLoaderMultiGPU node with device set to CPU
Set device to CPU

To switch encoding from GPU to CPU, we use the QuadrupleCLIPLoaderMultiGPU node from the ComfyUI-MultiGPU custom node pack, setting the device to CPU.

  • Text Encoder ➡️ RAM
  • Transformer ➡️ VRAM

Loading the Text Encoder into RAM and processing it on the CPU doubles the prompt encoding time, but image generation becomes faster, cutting the second generation time in half.

HiDream’s large model size means that moving the model takes far longer than encoding the prompt.

Anime-style illustration of a black-haired woman smiling in a rainy alley, wearing a black T-shirt

With this setup, the Text Encoder stays in RAM and the Transformer in VRAM, eliminating model swapping, which speeds up image generation.
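Outside ComfyUI, the same device split looks like this in plain PyTorch. This is only a minimal sketch with placeholder layers standing in for the real models, and it assumes a CUDA-capable GPU:

```python
import torch

# Minimal sketch of the device split: the Text Encoder stays in RAM and runs on the CPU,
# while the Transformer stays in VRAM. The Linear layers are placeholders, not the real models.
text_encoder = torch.nn.Linear(4096, 4096)           # stand-in for the Text Encoder (CPU / RAM)
transformer  = torch.nn.Linear(4096, 4096).cuda()    # stand-in for the Transformer (GPU / VRAM)

prompt_embedding = text_encoder(torch.randn(1, 4096))   # prompt encoding happens on the CPU
latents = transformer(prompt_embedding.to("cuda"))      # only the small embedding is copied to VRAM
```

Only the small conditioning tensor crosses over to the GPU; gigabytes of model weights no longer get swapped back and forth.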

CPU Processing Uses FP32 Format

Let’s dive deeper into CPU processing.

AI workloads are built on floating-point (FP) arithmetic, and the format used determines both precision and speed.

Types and Precision of Floating-Point Calculations

| Format | Sign | Exponent | Mantissa | Significant Digits | Precision |
| --- | --- | --- | --- | --- | --- |
| FP32 | 1 bit | 8 bits | 23 bits | 6–7 digits | 🌟 |
| FP16 | 1 bit | 5 bits | 10 bits | 3–4 digits | |
| BF16 | 1 bit | 8 bits | 7 bits | 3 digits | |
| FP8 (e4m3) | 1 bit | 4 bits | 3 bits | 1–2 digits | ⚠️ |
| FP8 (e5m2) | 1 bit | 5 bits | 2 bits | 1–2 digits | ⚠️ |
| FP4 | 1 bit | 2 bits | 1 bit | Less than 1 digit | |
  • FP32: High precision, high computational load
  • FP16: Lower precision, faster processing
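The difference in significant digits is easy to see in PyTorch by rounding a single constant through each format (a small illustrative experiment, not part of any workflow):

```python
import torch

# Round one constant through each format and see how many digits survive.
x = 0.1234567
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, torch.tensor(x, dtype=dtype).item())
# torch.float32  -> 0.1234567...   (~7 significant digits)
# torch.float16  -> 0.12347...     (3-4 significant digits)
# torch.bfloat16 -> 0.12353...     (~3 significant digits)
```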

GPU and Floating-Point Calculation Support

| | FP32 | FP16 | BF16 | FP8 | FP4 |
| --- | --- | --- | --- | --- | --- |
| NVIDIA RTX 5000 Series (2025–) | ✅ | ✅ | ✅ | ✅ | ✅ |
| NVIDIA RTX 4000 Series (2022–) | ✅ | ✅ | ✅ | ✅ | |
| NVIDIA RTX 3000 Series (2020–) | ✅ | ✅ | ✅ | | |
| NVIDIA RTX 2000 Series (2018–) | ✅ | ✅ | | | |
| NVIDIA GTX 1000 Series (2016–) | ✅ | ⚠️ | | | |
| AMD Radeon | ✅ | ⚠️ | | | |
| Intel Arc | ✅ | ⚠️ | | | |

CPU and Floating-Point Calculation Support

| | FP32 | FP16 | BF16 | FP8 | FP4 |
| --- | --- | --- | --- | --- | --- |
| AMD Ryzen (up to Zen3) | ✅ | | | | |
| AMD Ryzen (Zen4 and later) | ✅ | | ✅ | | |
| Intel Core i Series | ✅ | ⚠️ | ⚠️ | | |
| Intel Xeon (Sapphire Rapids and later) | ✅ | ✅ | ✅ | ⚠️ | |

FP16 and lower-precision formats allow faster processing and have become popular in recent years, but hardware support for them is limited outside NVIDIA's RTX-series GPUs. On the CPU side, most chips other than the latest flagship models compute in FP32.

ComfyUI Defaults to FP16 Format

By default, ComfyUI processes the Text Encoder in FP16 format.

If the device doesn't support FP16 natively (as is the case for most CPUs, including ours), the FP16 work is emulated on FP32 hardware, so it takes just as long as FP32.

Screenshot of Stability Matrix with ComfyUI’s --fp32-text-enc startup setting
Stability Matrix’s --fp32-text-enc startup setting

When encoding on the CPU, if an FP32 Text Encoder is available, adding --fp32-text-enc to ComfyUI's startup options runs the encoder in FP32 format, improving precision without increasing processing time.
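If you want to check what your own CPU does, a quick matmul benchmark is enough. The numbers depend entirely on your hardware, and some PyTorch builds reject half-precision matmul on the CPU outright, which the try/except covers:

```python
import time
import torch

def bench(dtype, n=8, size=2048):
    """Average time of one size x size matmul on the CPU, in milliseconds."""
    x = torch.randn(size, size).to(dtype)
    t0 = time.perf_counter()
    for _ in range(n):
        x @ x
    return (time.perf_counter() - t0) / n * 1000

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    try:
        print(f"{dtype}: {bench(dtype):.1f} ms")
    except RuntimeError as err:   # e.g. no native FP16 matmul on this CPU build
        print(f"{dtype}: not supported ({err})")
# On CPUs without native FP16, the FP16 timing is no better than FP32.
```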

Using FP32 Format Reduces Steps!

Now, let’s compare illustrations generated with FP16 and FP32 Text Encoders using the Flux.1 Krea [dev] model.

Screenshot of DualCLIPLoader node with device set to CPU
Set device to CPU in DualCLIPLoader node

FP16 Text Encoder, 18 Steps

Desk monitor illustration generated with Flux.1 Krea [dev] FP16 Text Encoder, 18 steps

The Flux.1 Krea [dev] model adds more noise when generating high-resolution illustrations, so with an FP16 Text Encoder it takes 18 steps to converge.

FP16 Text Encoder, 16 Steps

Incomplete desk monitor illustration generated with Flux.1 Krea [dev] FP16 Text Encoder, 16 steps
Icons and background are incomplete

At 16 steps with FP16, the icons on the monitor and the background remain incomplete.

FP32 Text Encoder, 16 Steps

Completed desk monitor illustration generated with Flux.1 Krea [dev] FP32 Text Encoder, 16 steps
Improved quality, illustration completed in 16 steps

Using an FP32 Text Encoder not only improves image quality but also speeds up convergence, completing the illustration in 16 steps.

Speeding Up with Caching!

When generating multiple images with the same prompt, the prompt’s encoding result (conditioning) is cached, allowing subsequent generations to skip encoding entirely.

Generate the high-precision FP32 conditioning once, and it can be reused indefinitely afterward.
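Conceptually, the cache is just a lookup keyed on the prompt: as long as the prompt and encoder settings don't change, the stored conditioning is returned instead of re-running the encoder. Here is a toy sketch of the idea with a placeholder encoder, not ComfyUI's actual implementation:

```python
import functools

# Toy sketch of conditioning caching. encode_prompt stands in for the real (slow) Text Encoder.
@functools.lru_cache(maxsize=32)
def encode_prompt(prompt: str) -> tuple:
    print(f"encoding: {prompt!r}")                 # the expensive FP32 CPU encoding happens here
    return tuple(float(ord(c)) for c in prompt)    # placeholder for the real conditioning tensors

encode_prompt("a red classic car in a rainy alley")   # first generation: the encoder runs
encode_prompt("a red classic car in a rainy alley")   # second generation: served from the cache
```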

FP16 Text Encoder, 18 Steps

Second generation takes 324 seconds.

FP32 Text Encoder, 16 Steps

Second generation is reduced to 287 seconds, with improved image quality.

The FP32 Text Encoder improves quality and reduces the number of steps, making it faster than FP16.

High-Precision Models Converge Faster!

This experiment focused on Text Encoder precision and generation time, but the same principle applies to Transformers: higher-precision models converge faster.

When using next-generation image generation AI, allocate all VRAM to the Transformer and use the highest-precision model possible.

Minimum Split Sizes for Transformers in ComfyUI

In low-VRAM environments, model quantization like FP8 or GGUF is effective, but in environments with sufficient VRAM, quantization can slow down convergence. Keep this in mind.
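To get a feel for that trade-off, you can round-trip some random weights through each format and compare the error. This needs a recent PyTorch (roughly 2.1 or later) for the float8 dtypes, and the random tensor is only a stand-in for real Transformer weights:

```python
import torch

# Round-trip quantization error on a block of stand-in weights.
w = torch.randn(1024, 1024)   # placeholder for a slice of Transformer weights

for dtype in (torch.bfloat16, torch.float8_e4m3fn, torch.float8_e5m2):
    err = (w - w.to(dtype).to(torch.float32)).abs().mean().item()
    print(f"{dtype}: mean absolute error {err:.5f}")
# The FP8 formats lose noticeably more information per weight than BF16,
# which is the extra error the sampler has to work against at each step.
```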

Workflows

Here are the workflows used in this experiment.

HiDream-I1-Dev, 8 Steps

ComfyUI workflow diagram for HiDream-I1-Dev, 8 steps

Models Used

About HiDream

Flux.1 Krea [dev], 16 Steps

ComfyUI workflow diagram for Flux.1 Krea [dev], 16 steps

Models Used

Custom Nodes

Here are the custom nodes used in this experiment.

ComfyUI-Dev-Utils

Screenshot of ComfyUI Manager searching for ComfyUI-Dev-Utils

ComfyUI-Dev-Utils is a custom node that displays the processing time and VRAM usage of each node during ComfyUI execution.

Screenshot of a node showing processing time and VRAM usage with ComfyUI-Dev-Utils
Table listing processing time and VRAM usage output by ComfyUI-Dev-Utils

With ComfyUI-Dev-Utils, processing times are displayed in the top-left corner of nodes, and a summary table is also generated.

ComfyUI-MultiGPU

Screenshot of ComfyUI Manager searching for ComfyUI-MultiGPU

ComfyUI-MultiGPU is a custom node that enables the use of multiple GPUs’ VRAM in ComfyUI and allows switching processing to the CPU.

About ComfyUI-MultiGPU

Conclusion: Load the Text Encoder into RAM

  • Load the Text Encoder into RAM
  • Use high-precision models and leverage caching
  • Boldly reduce the number of steps

Next-generation image generation AI models are large and take time to generate images.

However, image quality and generation speed are not mutually exclusive. With optimal settings tailored to your PC’s performance and model size, you can achieve both.

Anime-style illustration of a woman smiling head-on in a rainy alley, wearing a black T-shirt

There are various techniques for speeding up image generation, but those that compromise precision may extend generation time. It’s essential to understand their advantages and drawbacks.

Why not try generating beautiful illustrations with next-generation models using high-speed settings without drawbacks?

Thank you for reading to the end!


Optimizing VRAM Management in ComfyUI

When using next-generation models in ComfyUI, optimizing VRAM management is highly recommended.