[HiDream, Flux.1 Krea] Three Years Since the Dawn of Image Generation AI! A Deep Dive into Running the Latest Models at High Speed.

An animated female character stands on a lit city street at night, sporting long brown hair, purple eyes, a black shirt, and gold earrings. A red car is parked nearby.
  • Load the Text Encoder into RAM
  • Use high-precision models and leverage caching
  • Boldly reduce the number of steps

Introduction

Hello, I’m Easygoing!

In this post, we’ll explore how to run the latest image generation AI models at high speed.

Anime-style illustration of a brown-haired woman smiling in a rainy alley with a red classic car

Image Generation AI Models Are Getting Bigger

The journey of local image generation AI began exactly three years ago on August 22, 2022, with the release of Stable Diffusion 1.


gantt
    title Local Image Generation AI
    dateFormat YYYY-MM-DD
    tickInterval 12month
    axisFormat %Y

        Stable Diffusion 1 :done, a1, 2022-08-22, 2025-08-23
        Stable Diffusion XL 1.0 :done, c2, 2023-07-27, 2025-08-23
        Stable Diffusion 3 : d1, 2024-06-12, 2025-08-23
        Flux.1   : d2, 2024-08-01, 2025-08-23
        HiDream   : d3, 2025-04-06, 2025-08-23
        Qwen-Image   : d4, 2025-08-04, 2025-08-23

Over the past three years, image generation AI has evolved significantly, but with that progress, model sizes have also grown massively.

Models That Don’t Fit in VRAM!

Currently, the NVIDIA RTX 5090, the highest-performing consumer GPU, has 32GB of VRAM. However, the full weights of Flux.1 and later models (Transformer plus Text Encoders) exceed 32GB, so even the RTX 5090 cannot hold an entire model in VRAM.

Image generation depends heavily on how VRAM is used, so the way a model is split and loaded into memory is the key to generating images efficiently.
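As a rough back-of-the-envelope check, the arithmetic below estimates the footprint of Flux.1 in FP16. The parameter counts are approximate public figures, so treat them as assumptions rather than exact numbers.

```python
# Rough VRAM estimate: parameter count x bytes per parameter.
# The parameter counts are approximate public figures (assumptions, not exact).
def model_size_gb(params_billion: float, bytes_per_param: int) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

flux_transformer = model_size_gb(12, 2)    # Flux.1 Transformer, ~12B parameters in FP16
t5_xxl_encoder   = model_size_gb(4.7, 2)   # T5-XXL Text Encoder, ~4.7B parameters in FP16

print(f"Transformer:  {flux_transformer:.1f} GB")    # ~22 GB
print(f"Text Encoder: {t5_xxl_encoder:.1f} GB")      # ~9 GB
print(f"Total:        {flux_transformer + t5_xxl_encoder:.1f} GB")
# Add CLIP-L, the VAE, and activations on top of this and the total passes 32 GB.
```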

Testing Environment

For this experiment, we’ll measure generation times in the following setup:

  • Windows 11
  • RTX 4060 Ti 16GB
  • RAM 64GB
Anime-style illustration of a blonde woman smiling in a rainy alley, wearing a black shirt

First and Second Generation Times Are the Same

Let’s start by generating an illustration using HiDream, the largest image generation AI model.

We’ll generate an image once, then change the prompt and generate a second image.

HiDream-I1-Dev, 1024 x 1024, 8 Steps

Night scene illustration of a computer monitor displaying the HiDream logo

In ComfyUI’s default workflow, both the Text Encoder and Transformer are loaded into VRAM and processed by the GPU.

  • Text Encoder ➡️ VRAM (contention)
  • Transformer ➡️ VRAM (contention)

Since HiDream’s model size exceeds VRAM capacity, contention occurs between the Text Encoder and Transformer for VRAM space.

Once a model is used, it’s retained in RAM, so the second generation doesn’t require loading from storage.

However, swapping the Text Encoder and Transformer between VRAM and RAM still occurs, resulting in the second generation taking the same amount of time as the first.

Loading the Text Encoder into RAM Doubles the Speed!

To avoid VRAM contention, let’s try explicitly loading the Text Encoder into RAM and processing it on the CPU.

Screenshot of QuadrupleCLIPLoaderMultiGPU node with device set to CPU
Set device to CPU

To switch encoding from GPU to CPU, we use the QuadrupleCLIPLoaderMultiGPU node from the ComfyUI-MultiGPU custom node pack, setting the device to CPU.

  • Text Encoder ➡️ RAM
  • Transformer ➡️ VRAM

Loading the Text Encoder into RAM and processing it on the CPU doubles the prompt encoding time, but image generation becomes faster, cutting the second generation time in half.

HiDream’s large model size means that moving the model takes far longer than encoding the prompt.

Anime-style illustration of a black-haired woman smiling in a rainy alley, wearing a black T-shirt

With this setup, the Text Encoder stays in RAM and the Transformer in VRAM, eliminating model swapping, which speeds up image generation.
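Outside ComfyUI, the same device split looks like this in plain PyTorch. This is only a minimal sketch with placeholder layers standing in for the real models, and it assumes a CUDA-capable GPU:

```python
import torch

# Minimal sketch of the device split: the Text Encoder stays in RAM and runs on the CPU,
# while the Transformer stays in VRAM. The Linear layers are placeholders, not the real models.
text_encoder = torch.nn.Linear(4096, 4096)           # stand-in for the Text Encoder (CPU / RAM)
transformer  = torch.nn.Linear(4096, 4096).cuda()    # stand-in for the Transformer (GPU / VRAM)

prompt_embedding = text_encoder(torch.randn(1, 4096))   # prompt encoding happens on the CPU
latents = transformer(prompt_embedding.to("cuda"))      # only the small embedding is copied to VRAM
```

Only the small conditioning tensor crosses over to the GPU; gigabytes of model weights no longer get swapped back and forth.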

CPU Processing Uses FP32 Format

Let’s dive deeper into CPU processing.

AI workloads are built on floating-point (FP) arithmetic, and the format used determines both precision and speed.

Types and Precision of Floating-Point Calculations

| Format | Sign | Exponent | Mantissa | Significant Digits | Precision |
| --- | --- | --- | --- | --- | --- |
| FP32 | 1 bit | 8 bits | 23 bits | 6–7 digits | 🌟 |
| FP16 | 1 bit | 5 bits | 10 bits | 3–4 digits | |
| BF16 | 1 bit | 8 bits | 7 bits | 3 digits | |
| FP8 (e4m3) | 1 bit | 4 bits | 3 bits | 1–2 digits | ⚠️ |
| FP8 (e5m2) | 1 bit | 5 bits | 2 bits | 1–2 digits | ⚠️ |
| FP4 | 1 bit | 2 bits | 1 bit | Less than 1 digit | |
  • FP32: High precision, high computational load
  • FP16: Lower precision, faster processing
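The difference in significant digits is easy to see in PyTorch by rounding a single constant through each format (a small illustrative experiment, not part of any workflow):

```python
import torch

# Round one constant through each format and see how many digits survive.
x = 0.1234567
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, torch.tensor(x, dtype=dtype).item())
# torch.float32  -> 0.1234567...   (~7 significant digits)
# torch.float16  -> 0.12347...     (3-4 significant digits)
# torch.bfloat16 -> 0.12353...     (~3 significant digits)
```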

GPU and Floating-Point Calculation Support

| | FP32 | FP16 | BF16 | FP8 | FP4 |
| --- | --- | --- | --- | --- | --- |
| NVIDIA RTX 5000 Series (2025–) | ✅ | ✅ | ✅ | ✅ | ✅ |
| NVIDIA RTX 4000 Series (2022–) | ✅ | ✅ | ✅ | ✅ | |
| NVIDIA RTX 3000 Series (2020–) | ✅ | ✅ | ✅ | | |
| NVIDIA RTX 2000 Series (2018–) | ✅ | ✅ | | | |
| NVIDIA GTX 1000 Series (2016–) | ✅ | ⚠️ | | | |
| AMD Radeon | ✅ | ⚠️ | | | |
| Intel Arc | ✅ | ⚠️ | | | |

CPU and Floating-Point Calculation Support

| | FP32 | FP16 | BF16 | FP8 | FP4 |
| --- | --- | --- | --- | --- | --- |
| AMD Ryzen (up to Zen3) | ✅ | | | | |
| AMD Ryzen (Zen4 and later) | ✅ | | ✅ | | |
| Intel Core i Series | ✅ | ⚠️ | ⚠️ | | |
| Intel Xeon (Sapphire Rapids and later) | ✅ | ✅ | ✅ | ⚠️ | |

FP16 and lower-precision formats allow faster processing and have become popular in recent years, but hardware support for them is limited outside NVIDIA's RTX-series GPUs. On the CPU side, most chips other than the latest flagship models compute in FP32.

ComfyUI Defaults to FP16 Format

By default, ComfyUI processes the Text Encoder in FP16 format.

If the device doesn't support FP16 natively (as is the case for most CPUs, including ours), the FP16 work is emulated on FP32 hardware, so it takes just as long as FP32.

Screenshot of Stability Matrix with ComfyUI’s --fp32-text-enc startup setting
Stability Matrix’s --fp32-text-enc startup setting

When encoding on the CPU, if an FP32 Text Encoder is available, adding --fp32-text-enc to ComfyUI's startup options runs the encoder in FP32 format, improving precision without increasing processing time.
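If you want to check what your own CPU does, a quick matmul benchmark is enough. The numbers depend entirely on your hardware, and some PyTorch builds reject half-precision matmul on the CPU outright, which the try/except covers:

```python
import time
import torch

def bench(dtype, n=8, size=2048):
    """Average time of one size x size matmul on the CPU, in milliseconds."""
    x = torch.randn(size, size).to(dtype)
    t0 = time.perf_counter()
    for _ in range(n):
        x @ x
    return (time.perf_counter() - t0) / n * 1000

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    try:
        print(f"{dtype}: {bench(dtype):.1f} ms")
    except RuntimeError as err:   # e.g. no native FP16 matmul on this CPU build
        print(f"{dtype}: not supported ({err})")
# On CPUs without native FP16, the FP16 timing is no better than FP32.
```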

Using FP32 Format Reduces Steps!

Now, let’s compare illustrations generated with FP16 and FP32 Text Encoders using the Flux.1 Krea [dev] model.

Screenshot of DualCLIPLoader node with device set to CPU
Set device to CPU in DualCLIPLoader node

FP16 Text Encoder, 18 Steps

Desk monitor illustration generated with Flux.1 Krea [dev] FP16 Text Encoder, 18 steps

The Flux.1 Krea [dev] model adds more noise when generating high-resolution illustrations, so with an FP16 Text Encoder it takes 18 steps to converge.

FP16 Text Encoder, 16 Steps

Incomplete desk monitor illustration generated with Flux.1 Krea [dev] FP16 Text Encoder, 16 steps
Icons and background are incomplete

At 16 steps with FP16, the icons on the monitor and the background remain incomplete.

FP32 Text Encoder, 16 Steps

Completed desk monitor illustration generated with Flux.1 Krea [dev] FP32 Text Encoder, 16 steps
Improved quality, illustration completed in 16 steps

Using an FP32 Text Encoder not only improves image quality but also speeds up convergence, completing the illustration in 16 steps.

Speeding Up with Caching!

When generating multiple images with the same prompt, the prompt’s encoding result (conditioning) is cached, allowing subsequent generations to skip encoding entirely.

Generate the high-precision FP32 conditioning once, and it can be reused indefinitely afterward.
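Conceptually, the cache is just a lookup keyed on the prompt: as long as the prompt and encoder settings don't change, the stored conditioning is returned instead of re-running the encoder. Here is a toy sketch of the idea with a placeholder encoder, not ComfyUI's actual implementation:

```python
import functools

# Toy sketch of conditioning caching. encode_prompt stands in for the real (slow) Text Encoder.
@functools.lru_cache(maxsize=32)
def encode_prompt(prompt: str) -> tuple:
    print(f"encoding: {prompt!r}")                 # the expensive FP32 CPU encoding happens here
    return tuple(float(ord(c)) for c in prompt)    # placeholder for the real conditioning tensors

encode_prompt("a red classic car in a rainy alley")   # first generation: the encoder runs
encode_prompt("a red classic car in a rainy alley")   # second generation: served from the cache
```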

FP16 Text Encoder, 18 Steps

Second generation takes 324 seconds.

FP32 Text Encoder, 16 Steps

Second generation is reduced to 287 seconds, with improved image quality.

The FP32 Text Encoder improves quality and reduces the number of steps, making it faster than FP16.

High-Precision Models Converge Faster!

This experiment focused on Text Encoder precision and generation time, but the same principle applies to Transformers: higher-precision models converge faster.

When using next-generation image generation AI, allocate all VRAM to the Transformer and use the highest-precision model possible.

Minimum Split Sizes for Transformers in ComfyUI

In low-VRAM environments, model quantization like FP8 or GGUF is effective, but in environments with sufficient VRAM, quantization can slow down convergence. Keep this in mind.
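To get a feel for that trade-off, you can round-trip some random weights through each format and compare the error. This needs a recent PyTorch (roughly 2.1 or later) for the float8 dtypes, and the random tensor is only a stand-in for real Transformer weights:

```python
import torch

# Round-trip quantization error on a block of stand-in weights.
w = torch.randn(1024, 1024)   # placeholder for a slice of Transformer weights

for dtype in (torch.bfloat16, torch.float8_e4m3fn, torch.float8_e5m2):
    err = (w - w.to(dtype).to(torch.float32)).abs().mean().item()
    print(f"{dtype}: mean absolute error {err:.5f}")
# The FP8 formats lose noticeably more information per weight than BF16,
# which is the extra error the sampler has to work against at each step.
```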

Workflows

Here are the workflows used in this experiment.

HiDream-I1-Dev, 8 Steps

ComfyUI workflow diagram for HiDream-I1-Dev, 8 steps

Models Used

About HiDream

Flux.1 Krea [dev], 16 Steps

ComfyUI workflow diagram for Flux.1 Krea [dev], 16 steps

Models Used

Custom Nodes

Here are the custom nodes used in this experiment.

ComfyUI-Dev-Utils

Screenshot of ComfyUI Manager searching for ComfyUI-Dev-Utils

ComfyUI-Dev-Utils is a custom node that displays the processing time and VRAM usage of each node during ComfyUI execution.

Screenshot of a node showing processing time and VRAM usage with ComfyUI-Dev-Utils
Table listing processing time and VRAM usage output by ComfyUI-Dev-Utils

With ComfyUI-Dev-Utils, processing times are displayed in the top-left corner of nodes, and a summary table is also generated.

ComfyUI-MultiGPU

Screenshot of ComfyUI Manager searching for ComfyUI-MultiGPU

ComfyUI-MultiGPU is a custom node that enables the use of multiple GPUs’ VRAM in ComfyUI and allows switching processing to the CPU.

About ComfyUI-MultiGPU

Conclusion: Load the Text Encoder into RAM

  • Load the Text Encoder into RAM
  • Use high-precision models and leverage caching
  • Boldly reduce the number of steps

Next-generation image generation AI models are large and take time to generate images.

However, image quality and generation speed are not mutually exclusive. With optimal settings tailored to your PC’s performance and model size, you can achieve both.

Anime-style illustration of a woman smiling head-on in a rainy alley, wearing a black T-shirt

There are various techniques for speeding up image generation, but those that compromise precision may extend generation time. It’s essential to understand their advantages and drawbacks.

Why not try generating beautiful illustrations with next-generation models using high-speed settings without drawbacks?

Thank you for reading to the end!


Optimizing VRAM Management in ComfyUI

When using next-generation models in ComfyUI, optimizing VRAM management is highly recommended.