[HiDream, Flux.1 Krea] Three Years Since the Dawn of Image Generation AI! A Deep Dive into Running the Latest Models at High Speed.


- Load the Text Encoder into RAM
- Use high-precision models and leverage caching
- Boldly reduce the number of steps
Introduction
Hello, I’m Easygoing!
In this post, we’ll explore how to run the latest image generation AI models at high speed.

Image Generation AI Models Are Getting Bigger
The journey of local image generation AI began exactly three years ago on August 22, 2022, with the release of Stable Diffusion 1.
gantt
    title Local Image Generation AI
    dateFormat YYYY-MM-DD
    tickInterval 12month
    axisFormat %Y
    Stable Diffusion 1      :done, a1, 2022-08-22, 2025-08-23
    Stable Diffusion XL 1.0 :done, c2, 2023-07-27, 2025-08-23
    Stable Diffusion 3      :d1, 2024-06-12, 2025-08-23
    Flux.1                  :d2, 2024-08-01, 2025-08-23
    HiDream                 :d3, 2025-04-06, 2025-08-23
    Qwen-Image              :d4, 2025-08-04, 2025-08-23
Over the past three years, image generation AI has evolved significantly, but with that progress, model sizes have also grown massively.
Models That Don’t Fit in VRAM!
Currently, the NVIDIA RTX 5090, the highest-performing consumer GPU, has 32GB of VRAM. However, the full set of model files for Flux.1 and later models (Transformer, Text Encoders, and VAE) exceeds 32GB, so even the RTX 5090 cannot hold everything in VRAM at once.
Image generation depends heavily on how efficiently VRAM is used, so the way a model is split between VRAM and RAM is key to generating images efficiently.
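As a rough way to check this constraint on your own machine, here is a minimal Python sketch (assuming PyTorch and an NVIDIA GPU) that compares a checkpoint's on-disk size with the free VRAM reported by the driver. The file path is a hypothetical placeholder, and real workloads also need headroom for activations and the VAE decode.

```python
import os
import torch

# Hypothetical path -- point this at one of your own checkpoints.
MODEL_PATH = "models/diffusion_models/flux1-dev.safetensors"

def fits_in_vram(model_path: str, headroom_gb: float = 2.0) -> bool:
    """Rough estimate of whether a checkpoint can sit entirely in VRAM."""
    model_gb = os.path.getsize(model_path) / 1024**3
    free_bytes, _total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    free_gb = free_bytes / 1024**3
    print(f"model: {model_gb:.1f} GB, free VRAM: {free_gb:.1f} GB")
    # Leave headroom for activations, the VAE decode, and CUDA overhead.
    return model_gb + headroom_gb <= free_gb

if __name__ == "__main__":
    print("fits entirely in VRAM:", fits_in_vram(MODEL_PATH))
```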
Testing Environment
For this experiment, we’ll measure generation times in the following setup:
- Windows 11
- RTX 4060 Ti 16GB
- RAM 64GB

First and Second Generation Times Are the Same
Let’s start by generating an illustration using HiDream, the largest image generation AI model.
We’ll generate an image once, then change the prompt and generate a second image.
HiDream-I1-Dev, 1024 x 1024, 8 Steps

In ComfyUI’s default workflow, both the Text Encoder and Transformer are loaded into VRAM and processed by the GPU.
- Text Encoder ➡️ VRAM (contention)
- Transformer ➡️ VRAM (contention)
Since HiDream’s model size exceeds VRAM capacity, contention occurs between the Text Encoder and Transformer for VRAM space.
Once a model is used, it’s retained in RAM, so the second generation doesn’t require loading from storage.
However, swapping the Text Encoder and Transformer between VRAM and RAM still occurs, resulting in the second generation taking the same amount of time as the first.
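To get a feel for why the swap dominates, here is a minimal PyTorch timing sketch (not ComfyUI's actual code): it allocates a roughly 4 GB half-precision tensor as a stand-in for a large Transformer and times the transfer in each direction. The tensor size is an arbitrary assumption; real checkpoints are even larger.

```python
import time
import torch

# Stand-in for a multi-gigabyte Transformer: 2048 x 2048 x 512 FP16 values ~= 4 GB.
weights = torch.randn(2048, 2048, 512, dtype=torch.float16)

start = time.perf_counter()
weights_gpu = weights.to("cuda")          # RAM -> VRAM (what "loading the model" costs)
torch.cuda.synchronize()
print(f"RAM -> VRAM: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
weights_back = weights_gpu.to("cpu")      # VRAM -> RAM (what every swap-out costs)
torch.cuda.synchronize()
print(f"VRAM -> RAM: {time.perf_counter() - start:.2f} s")
```

Each transfer takes a noticeable amount of time even over a fast PCIe link, and the default workflow pays this cost repeatedly when the Text Encoder and Transformer keep evicting each other.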
Loading the Text Encoder into RAM Doubles the Speed!
To avoid VRAM contention, let’s try explicitly loading the Text Encoder into RAM and processing it on the CPU.

To switch encoding from GPU to CPU, we use the QuadrupleCLIPLoaderMultiGPU node from the ComfyUI-MultiGPU custom node pack, setting the device to CPU.
- Text Encoder ➡️ RAM
- Transformer ➡️ VRAM
Loading the Text Encoder into RAM and processing it on the CPU doubles the prompt encoding time, but image generation becomes faster, cutting the second generation time in half.
HiDream’s large model size means that moving the model takes far longer than encoding the prompt.

With this setup, the Text Encoder stays in RAM and the Transformer in VRAM, eliminating model swapping, which speeds up image generation.
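Outside ComfyUI, the same split can be sketched in plain PyTorch with the Hugging Face transformers library: run a text encoder on the CPU in RAM and move only the small conditioning tensor to the GPU. The CLIP checkpoint below is just an illustrative public model, not HiDream's actual four-encoder set.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# The Text Encoder stays on the CPU (system RAM), leaving VRAM free for the Transformer.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained(
    "openai/clip-vit-large-patch14",
    torch_dtype=torch.float32,  # CPUs compute natively in FP32 (see the next section)
).to("cpu")

prompt = "a desk with a monitor, soft morning light"
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    conditioning = text_encoder(**tokens).last_hidden_state  # computed entirely in RAM

# Only this small conditioning tensor crosses over to the GPU, where the
# Transformer (not shown) can keep all of the VRAM to itself.
conditioning = conditioning.to(device)
print(conditioning.shape, conditioning.device)
```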
CPU Processing Uses FP32 Format
Let’s dive deeper into CPU processing.
AI tasks rely on floating-point calculations (FP) for computations.
Types and Precision of Floating-Point Calculations
| Format | Sign | Exponent | Mantissa | Significant Digits | Precision |
|---|---|---|---|---|---|
| FP32 | 1 bit | 8 bits | 23 bits | 6–7 digits | 🌟 |
| FP16 | 1 bit | 5 bits | 10 bits | 3–4 digits | ✅ |
| BF16 | 1 bit | 8 bits | 7 bits | 3 digits | ✅ |
| FP8 (e4m3) | 1 bit | 4 bits | 3 bits | 1–2 digits | ⚠️ |
| FP8 (e5m2) | 1 bit | 5 bits | 2 bits | 1–2 digits | ⚠️ |
| FP4 | 1 bit | 2 bits | 1 bit | Less than 1 digit | ❌ |
- FP32: High precision, high computational load
- FP16: Lower precision, faster processing
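You can check the range and resolution of each format in the table above yourself with torch.finfo, as in the sketch below; the FP8 entries assume a recent PyTorch build (roughly 2.1 or later) and are skipped otherwise.

```python
import torch

dtypes = [torch.float32, torch.float16, torch.bfloat16]
# FP8 dtypes only exist in newer PyTorch builds, so add them defensively.
for name in ("float8_e4m3fn", "float8_e5m2"):
    if hasattr(torch, name):
        dtypes.append(getattr(torch, name))

for dt in dtypes:
    info = torch.finfo(dt)
    # eps = smallest step above 1.0, a rough proxy for the number of significant digits.
    print(f"{str(dt):22s} bits={info.bits:2d} max={info.max:.3e} eps={info.eps:.3e}")
```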
GPU and Floating-Point Calculation Support
| GPU | FP32 | FP16 | BF16 | FP8 | FP4 |
|---|---|---|---|---|---|
| NVIDIA RTX 5000 Series (2025–) | ✅ | ✅ | ✅ | ✅ | ✅ |
| NVIDIA RTX 4000 Series (2022–) | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA RTX 3000 Series (2020–) | ✅ | ✅ | ✅ | ❌ | ❌ |
| NVIDIA RTX 2000 Series (2018–) | ✅ | ✅ | ❌ | ❌ | ❌ |
| NVIDIA GTX 1000 Series (2016–) | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| AMD Radeon | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| Intel Arc | ✅ | ⚠️ | ❌ | ❌ | ❌ |
CPU and Floating-Point Calculation Support
| CPU | FP32 | FP16 | BF16 | FP8 | FP4 |
|---|---|---|---|---|---|
| AMD Ryzen (up to Zen 3) | ✅ | ❌ | ❌ | ❌ | ❌ |
| AMD Ryzen (Zen 4 and later) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Intel Core i Series | ✅ | ⚠️ | ⚠️ | ❌ | ❌ |
| Intel Xeon (Sapphire Rapids and later) | ✅ | ✅ | ⚠️ | ❌ | ❌ |
FP16 and lower-precision formats enable faster processing and have become popular in recent years, but hardware support outside NVIDIA's RTX-series GPUs is limited. Most CPUs, apart from the latest flagship models, compute natively only in FP32.
ComfyUI Defaults to FP16 Format
By default, ComfyUI processes the Text Encoder in FP16 format.
If the device doesn't support FP16 natively (as is the case with our CPU), FP16 is emulated on top of FP32, so processing takes just as long as it would in FP32.

When encoding runs on the CPU and an FP32 Text Encoder is available, adding --fp32-text-enc to ComfyUI's startup options switches encoding to FP32, improving precision without increasing processing time.
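A quick, hedged way to see this on your own CPU is to time the same matrix multiplication in FP32 and FP16 with PyTorch, as in the sketch below (recent PyTorch builds implement FP16 matmul on the CPU; very old builds may raise an error instead). Without native FP16 hardware, the half-precision path is no faster, which is why switching the CPU-side encoder to FP32 costs essentially nothing.

```python
import time
import torch

x32 = torch.randn(2048, 2048, dtype=torch.float32)
x16 = x32.to(torch.float16)

def bench(x: torch.Tensor, runs: int = 5) -> float:
    """Average seconds per matrix multiplication."""
    start = time.perf_counter()
    for _ in range(runs):
        _ = x @ x
    return (time.perf_counter() - start) / runs

# On CPUs without native FP16 support, the FP16 result will not be faster than FP32.
print(f"FP32 matmul: {bench(x32) * 1000:.1f} ms")
print(f"FP16 matmul: {bench(x16) * 1000:.1f} ms")
```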
Using FP32 Format Reduces Steps!
Now, let’s compare illustrations generated with FP16 and FP32 Text Encoders using the Flux.1 Krea [dev] model.

FP16 Text Encoder, 18 Steps
![Desk monitor illustration generated with Flux.1 Krea [dev] FP16 Text Encoder, 18 steps](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF8apNmxjUgGp3mBM47XayTn1BBiPuXtITg4mVlXxihTA5s5XovhwnKNZ0cl5wu1FPpy5RzZxlpFjeJvvZIlFy_AMhSLFxG1HVmQJIwQ54Z76Q60HWkLjsrvXRomEpX9XoBZzZnOwPVuzfssTGDKbM-WaZjxE8lTfUrPAUtDNgwdWf3Q/w800-e90-rw/FP16TE_heumpp2_18steps_00001_.png)
The Flux.1 Krea [dev] model uses more noise for high-resolution illustrations, requiring 18 steps to converge in FP16 format.
FP16 Text Encoder, 16 Steps
![Incomplete desk monitor illustration generated with Flux.1 Krea [dev] FP16 Text Encoder, 16 steps](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjQ9ZLUnZf8n9U5koPH7c0moXAFcknDdQSCVA1e-Y6O0vWrOpQWA3Izk03_dVaAv0kmvRN47u-2BT53P0eyBDOoh9M3u0J69qcvjrvOQW11CfJ2-TwhinyAHZF1D5D9UYQOh2bej8f8JzhIhyphenhyphenPz_-PDIPRn40ZGWaNQcLMMMBNIxLFYQ/w800-e90-rw/FP16TE_heumpp2_16steps_00001_.png)
At 16 steps with FP16, the icons on the monitor and the background remain incomplete.
FP32 Text Encoder, 16 Steps
![Completed desk monitor illustration generated with Flux.1 Krea [dev] FP32 Text Encoder, 16 steps](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY_1AJi8U7a9XHmzQGoD74bl0Tp1gPQjPzJiAHt41473-xxLdT06041i88cQTNjn-ZwNA6CPQzi87xV9soRwA1I1rLynI0NB1k1PZ-qikAPI5HY82cedbYF52cUBJ86rF6-dF1P6Uz8NeBARaoFMbHMiaxqt3M5hqwq6HL6LXcTQrVBg/w800-e90-rw/FP32TE_heumpp2_16steps_00003_.png)
Using an FP32 Text Encoder not only improves image quality but also speeds up convergence, completing the illustration in 16 steps.
Speeding Up with Caching!
When generating multiple images with the same prompt, the prompt’s encoding result (conditioning) is cached, allowing subsequent generations to skip encoding entirely.
Generate the high-precision FP32 conditioning once, and the cached result can be reused for every subsequent generation.
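Conceptually, the cache is just a lookup table keyed by the prompt text, as in the minimal sketch below. ComfyUI does this automatically whenever a node's inputs are unchanged; the encoder here is a stand-in, and in practice encode_fn would be the FP32 Text Encoder running on the CPU.

```python
import hashlib
import torch

_conditioning_cache: dict[str, torch.Tensor] = {}

def encode_cached(prompt: str, encode_fn) -> torch.Tensor:
    """Encode a prompt once and reuse the stored conditioning afterwards."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _conditioning_cache:
        _conditioning_cache[key] = encode_fn(prompt)  # slow: runs the Text Encoder
    return _conditioning_cache[key]                   # fast: cache hit on reruns

# Stand-in encoder producing a CLIP-shaped conditioning tensor.
fake_encoder = lambda p: torch.randn(1, 77, 768)
first = encode_cached("a desk with a monitor", fake_encoder)   # encodes
second = encode_cached("a desk with a monitor", fake_encoder)  # reuses the cache
print(torch.equal(first, second))  # True
```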
FP16 Text Encoder, 18 Steps
FP32 Text Encoder, 16 Steps
The FP32 Text Encoder improves quality and reduces the number of steps, making it faster than FP16.
High-Precision Models Converge Faster!
This experiment focused on Text Encoder precision and generation time, but the same principle applies to Transformers: higher-precision models converge faster.
When using next-generation image generation AI, allocate all VRAM to the Transformer and use the highest-precision model possible.
Minimum Split Sizes for Transformers in ComfyUI
In low-VRAM environments, model quantization like FP8 or GGUF is effective, but in environments with sufficient VRAM, quantization can slow down convergence. Keep this in mind.
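To get a rough sense of how much information FP8 quantization throws away, the sketch below round-trips random BF16 weights through FP8 (e4m3) and reports the mean relative error. It assumes a PyTorch build with float8 dtypes and uses random weights, so the number is only indicative.

```python
import torch

weights = torch.randn(4096, 4096, dtype=torch.bfloat16)

if hasattr(torch, "float8_e4m3fn"):
    fp8 = weights.to(torch.float8_e4m3fn)    # what an FP8 checkpoint stores
    restored = fp8.to(torch.bfloat16)        # what the sampler actually computes with
    rel_error = ((weights - restored).abs().mean() / weights.abs().mean()).item()
    print(f"mean relative error after the FP8 round-trip: {rel_error:.1%}")
else:
    print("This PyTorch build has no float8 dtype; skipping the comparison.")
```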
Workflows
Here are the workflows used in this experiment.
HiDream-I1-Dev, 8 Steps

Models Used
- HiDream_I1_Dev_BF16.safetensors
- Llama-3.1-8b-instruct-BF16.safetensors
- flan-t5-xxl_TE-only_FP32.safetensors
- CLIP-ViT-bigG-14-laion2B-39B-b160k-FP32.safetensors
- CLIP-SAE-ViT-L-14-FP32.safetensors
- FLUX1-schnell-AE-FP32.safetensors
About HiDream
Flux.1 Krea [dev], 16 Steps
![ComfyUI workflow diagram for Flux.1 Krea [dev], 16 steps](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1WATTemQBThfqcQ3oBEoXbqzTDzkpaJfhcfdkPcE_ukBe5oN1IRK9gY4Z5S62rR63E_HNAcMvDogmZl-K0a7jRyft_Taf8xkgr63zINmoQlmwccQqiV5WQkYVOAetsxm9fkGOtBWGYpSYzpiDi7O2DaCTk9Nx43sUOf2oZe2oUpyWug/w800-e90-rw/Flux1_Krea_dev_20250822.png)
Models Used
Custom Nodes
Here are the custom nodes used in this experiment.
ComfyUI-Dev-Utils

ComfyUI-Dev-Utils is a custom node that displays the processing time and VRAM usage of each node during ComfyUI execution.


With ComfyUI-Dev-Utils, processing times are displayed in the top-left corner of nodes, and a summary table is also generated.
ComfyUI-MultiGPU

ComfyUI-MultiGPU is a custom node that enables the use of multiple GPUs’ VRAM in ComfyUI and allows switching processing to the CPU.
About ComfyUI-MultiGPU
Conclusion: Load the Text Encoder into RAM
- Load the Text Encoder into RAM
- Use high-precision models and leverage caching
- Boldly reduce the number of steps
Next-generation image generation AI models are large and take time to generate images.
However, image quality and generation speed are not mutually exclusive. With optimal settings tailored to your PC’s performance and model size, you can achieve both.

There are various techniques for speeding up image generation, but those that compromise precision may extend generation time. It’s essential to understand their advantages and drawbacks.
Why not try generating beautiful illustrations with next-generation models, using fast settings that come with no downsides?
Thank you for reading to the end!
Related Articles
Optimizing VRAM Management in ComfyUI
When using next-generation models in ComfyUI, optimizing VRAM management is highly recommended.