Goodbye Latent Space? How HiDream-O1-Image is Revolutionizing General-Purpose AI Drawing
- HiDream-O1-Image uses a general-purpose UiT architecture
- The “work is visible” to humans
- Potential for dramatic performance improvements in Qwen-VL models
Introduction
Hello, this is Easygoing.
This time, I’d like to introduce the new image generation AI HiDream-O1-Image, which was released on May 8, 2026.
HiDream.ai is a Chinese AI Startup
HiDream.ai is an AI startup company headquartered in Beijing, China.
```mermaid
gantt
title HiDream.ai models
dateFormat YYYY-MM-DD
tickInterval 12month
axisFormat %Y
HiDream-I1-Full : 2025-04-06, 2026-05-15
HiDream-I1-Dev : 2025-04-06, 2026-05-15
HiDream-I1-Fast : 2025-04-06, 2026-05-15
HiDream-E1-Full : 2025-04-27, 2026-05-15
HiDream-E1.1 : 2025-07-16, 2026-05-15
HiDream-O1-Image : crit, 2026-05-08, 2026-05-15
```
The HiDream-I1 model released by HiDream.ai in April 2025 was a high-performance image generation AI equipped with four text encoders. It caused a global sensation because it was released under the MIT license, allowing free development and commercial use.
About HiDream-I1
While the HiDream-I1 model had excellent prompt understanding, it was computationally heavy to run locally. Additionally, because it was trained on images that had been converted to JPEG, it tended to reproduce JPEG compression artifacts. Unfortunately, it did not gain widespread adoption among general users.
However, the HiDream-O1-Image model that appeared on May 8, 2026, brings an even greater impact than the previous HiDream-I1, so I’d like to introduce its features to you all.
Image Generation is Handled by Three Specialized AIs Working Together
Image generation is performed through the division of labor among three AIs.
```mermaid
flowchart TB
subgraph Checkpoint
A1(Text Encoder)
B1(Unet / Transformer)
C1(VAE)
end
```
- Text Encoder: Analyzes the prompt
- UNet / Transformer: Generates the image
- VAE: Compresses images into latent space and decodes them back into pixels
First, when the user says “Draw a picture,” an AI that understands human language (mainly English and Chinese) converts the instruction into machine language (vectors).
Image Generation Workflow
```mermaid
flowchart TD
A1(User)
B1(Text Encoder)
C1(UNet / Transformer)
D1(VAE)
A1 -- "Draw a picture" --> B1
B1 -- "[1, 0, 128, 2, 4, 6, 0, 2]" --> C1
C1 -- "[4, 2, 0, 64, 8, 2, 1, 4]" --> D1
D1 -- "Here you go!" --> A1
subgraph Latent Space
C1
end
```
Then, based on that vector, an AI specialized in drawing retreats into its own dedicated studio (latent space) and works 12 times more efficiently, diligently creating the image.
The finished painting is in a format only the drawing AI can understand (it looks like static noise to humans), so it is decoded (VAE decode) to produce an image visible to humans.
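The three-stage relay described above can be sketched in code. The following is a toy NumPy illustration of the data flow only: the shapes follow common conventions (a 4-channel 64×64 latent for a 512×512 RGB image), and the "encoder," "denoiser," and "decoder" here are trivial stand-in functions, not HiDream's or anyone's actual models.

```python
import numpy as np

# Toy stand-ins for the three specialists, to make the data flow visible.

def text_encoder(prompt: str) -> np.ndarray:
    """Turn the prompt into a vector (here: a trivial byte embedding)."""
    codes = np.frombuffer(prompt.encode("utf-8"), dtype=np.uint8)
    return codes.astype(np.float32) / 255.0

def denoiser(cond: np.ndarray, steps: int = 4) -> np.ndarray:
    """Refine random noise into a latent, guided by the prompt vector.
    The latent is small: 4 channels of 64x64 instead of 512x512 RGB."""
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((4, 64, 64)).astype(np.float32)
    for _ in range(steps):
        latent = latent - 0.25 * (latent - cond.mean())  # fake denoising step
    return latent

def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Blow the latent back up into a human-visible 512x512 RGB image."""
    upscaled = latent[:3].repeat(8, axis=1).repeat(8, axis=2)  # 8x per side
    return np.clip((upscaled + 1.0) / 2.0, 0.0, 1.0)

vec = text_encoder("Draw a picture")
latent = denoiser(vec)
image = vae_decode(latent)
print(latent.shape, image.shape)  # the latent has far fewer values than the image
```

The point of the sketch is the hand-off: each stage only understands its neighbor's output, which is exactly why the VAE decode step at the end is unavoidable in this design.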
```mermaid
gantt
title Image Generative AI Roadmap
dateFormat YYYY-MM-DD
tickInterval 12month
axisFormat %Y
section Stability AI
Stable Diffusion 1 : 2022-08-22, 2026-05-15
Stable Diffusion XL : 2023-07-26, 2026-05-15
Stable Diffusion 3 : 2024-06-12, 2026-05-15
section Fal.ai
AuraFlow : 2024-07-12, 2026-05-15
section Black Forest Labs
Flux.1 : 2024-08-01, 2026-05-15
Flux.2 : 2025-11-25, 2026-05-15
section DeepSeek.ai
janus-pro : 2025-01-25, 2026-05-15
section Zhipu AI
CogVideoX : 2024-08-06, 2026-05-15
GLM-Image : 2026-01-12, 2026-05-15
section Rhymes AI
Allegro : 2024-10-22, 2026-05-15
section Genmo
Mochi : 2024-10-25, 2026-05-15
section Tencent
Hunyuan video : 2024-12-03, 2026-05-15
Hunyuan image : 2025-09-09, 2026-05-15
section lllyasviel
Framepack : 2025-04-17, 2026-05-15
section Lightricks
LTX : 2024-12-11, 2026-05-15
section StepFun
Step-Video-T2V : 2025-02-17, 2026-05-15
section Alibaba
Wan : 2025-02-25, 2026-05-15
Qwen-Image : 2025-08-04, 2026-05-15
Z-Image : 2025-11-25, 2026-05-15
section NVIDIA
Cosmos-Predict2 : 2025-04-30, 2026-05-15
section CircleStone Labs
Anima : 2026-01-26, 2026-05-15
section Baidu
ERNIE-Image : 2026-04-07, 2026-05-15
section HiDream.ai
HiDream-I1 : 2025-04-06, 2026-05-15
HiDream-O1-Image : crit, 2026-05-08, 2026-05-15
```
This has been the method used by all image generation AIs since the release of Stable Diffusion 1.
HiDream-O1-Image Does Everything by Itself!
Now, let’s take a look at how HiDream-O1-Image processes images.
HiDream-O1-Image is a model that extends the large language model (chat AI) called Qwen3-VL by adding image generation capabilities.
HiDream-O1-Image Workflow
```mermaid
flowchart TB
A1(User)
B1(HiDream-O1-Image)
A1--"Draw a picture"-->B1
B1--"Here you go!"-->A1
```
HiDream-O1-Image understands language and images in the same dimension (UiT: Pixel-level Unified Transformer architecture) and draws pictures by itself.
Since it doesn’t retreat into its own dedicated studio, humans can sequentially check what parts it is modifying.
Furthermore, because it draws human-visible images directly, there is no need for decoding, and thus no image quality degradation caused by decoding.
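To make the "same dimension" idea concrete, here is a toy sketch of a unified transformer: text tokens and image patches share one embedding space, and the model's output is pixel values directly, with no latent space and no VAE decode. All of the sizes, weights, and the single fake "attention" pass are illustrative assumptions, not HiDream-O1-Image's actual architecture.

```python
import numpy as np

D = 16      # shared embedding width for text AND image tokens
PATCH = 8   # each image token covers an 8x8 RGB patch
GRID = 4    # 4x4 patches -> a 32x32 image

rng = np.random.default_rng(0)
W_txt = rng.standard_normal((256, D)) * 0.1                 # byte vocab -> shared space
W_out = rng.standard_normal((D, PATCH * PATCH * 3)) * 0.1   # token -> raw pixels

def embed_text(prompt: str) -> np.ndarray:
    ids = np.frombuffer(prompt.encode("utf-8"), dtype=np.uint8)
    return W_txt[ids]                                       # (n_text, D)

def generate(prompt: str) -> np.ndarray:
    txt = embed_text(prompt)
    img_tokens = rng.standard_normal((GRID * GRID, D))      # start from noise
    # One fake "attention" pass: every image token mixes in the text tokens.
    img_tokens = img_tokens + txt.mean(axis=0)
    patches = img_tokens @ W_out                            # (16, 8*8*3) pixel values
    patches = patches.reshape(GRID, GRID, PATCH, PATCH, 3)
    image = patches.transpose(0, 2, 1, 3, 4).reshape(GRID * PATCH, GRID * PATCH, 3)
    return np.clip(image * 0.5 + 0.5, 0.0, 1.0)             # already human-visible

image = generate("Draw a picture")
print(image.shape)  # (32, 32, 3): pixels come straight out of the model
```

Because the output of every step is already an ordinary pixel grid, intermediate states can be displayed to a human at any point, which is exactly the "work is visible" property described above.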
About Image Quality Degradation Caused by VAE
HiDream-O1-Image is a groundbreaking AI model that proves a general-purpose AI can generate illustrations without needing a specialized drawing AI.
General-Purpose AI Has Many Advantages
General-purpose AI offers numerous benefits.
Simplified Workflow
As you can see from the diagram above, using a general-purpose model greatly simplifies the workflow.
A simpler workflow naturally leads to faster processing. It also eliminates the need to carefully manage information passed between AIs, making adjustments much easier.
Additionally, there have been frequent cases recently of malware being injected into libraries used by AI. With fewer models to use, fewer libraries are required, which also reduces security risks.
Lighter than Z-Image
The common belief in the four years since image generation AI emerged has been that “latent space is necessary for AI to generate images efficiently.” HiDream-O1-Image shattered this belief with the UiT architecture, and did so with a model only one-third the size of its predecessor, which is truly astonishing.
Processing Language and Images in the Same Dimension
By processing language and images in the same dimension, HiDream-O1-Image can perform image editing much more naturally than before.
Image Editing Examples with HiDream-O1-Image
※ The author has not yet been able to reproduce image editing.
AI acquires desired functions by ingesting massive amounts of information, but the details remain a black box.
With the arrival of the HiDream-O1-Image model, which processes language and images in the same dimension without using latent space, we can expect progress in the understanding of image generation AI models themselves.
What Will Happen Next?
HiDream-O1-Image is built on Alibaba’s Qwen3-VL model.
As of May 2026, the Qwen series has established itself as the de facto standard text encoder for image generation thanks to its high performance and open license.
Text Encoders for Image Generation AI Models
| Developer | Model | Text Encoder | Encoder Developer |
|---|---|---|---|
| — CLIP Generation (~2023) — | | | |
| Stability AI | Stable Diffusion 1.x | CLIP-L (0.1B) | OpenAI |
| | Stable Diffusion XL | CLIP-L (0.1B), OpenCLIP-G (0.7B) | OpenAI, LAION |
| — T5 Generation (2024~) — | | | |
| Stability AI | Stable Diffusion 3 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B) | OpenAI, LAION, Google |
| Fal.ai | AuraFlow | pile-T5-XL (3B) | EleutherAI / Google |
| Black Forest Labs | Flux.1 [schnell / dev] | CLIP-L (0.1B), T5-XXL-v1.1 (11B) | OpenAI, Google |
| DeepSeek | Janus-Pro | SigLIP-L (0.4B), DeepSeek-LLM (7B) | Google, DeepSeek |
| Zhipu AI | CogVideoX | T5-XXL (11B) | Google |
| | GLM-Image | GLM-4-9B (9B), Glyph Encoder | Zhipu AI |
| Genmo | Mochi | T5-XXL-v1.1 (11B) | Google |
| Rhymes AI | Allegro | T5-XXL (11B) | Google |
| Lightricks | LTX-Video | T5-XXL-v1.1 (11B) | Google |
| NVIDIA | Cosmos-Predict2 | T5-XXL (11B) | Google |
| — Proprietary LLM Generation (2025~) — | | | |
| HiDream.ai | HiDream-I1 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B), Llama-3.1-Instruct (8B) | OpenAI, LAION, Google, Meta |
| Tencent | Hunyuan Video | LLaVA-LLaMA-3 (8B), CLIP-L (0.1B) | Xtuner / Meta, OpenAI |
| | Hunyuan Image | Proprietary MLLM | Tencent |
| StepFun | Step-Video-T2V | Hunyuan-CLIP, Step-LLM | Tencent, StepFun |
| Alibaba | Wan (2.1 / 2.2) | UMT5-XXL (13B) | Google |
| | Qwen-Image | Qwen2.5-VL (7B) | Alibaba |
| | Z-Image | Qwen3 (4B) | Alibaba |
| Black Forest Labs | Flux.2 [dev] | Mistral Small 3.2 / Pixtral (24B) | Mistral AI |
| | Flux.2 [klein] 9B | Qwen3 (8B) | Alibaba |
| | Flux.2 [klein] 4B | Qwen3 (4B) | Alibaba |
| CircleStone Labs | Anima | Qwen3-Base (0.6B) | Alibaba |
| Baidu | ERNIE-Image | Mistral3 Pixtral (3.3B) | Mistral AI |
| HiDream.ai | HiDream-O1-Image | Qwen3-VL (8B) | Alibaba |
When it comes to image generation, it’s Qwen.
HiDream-O1-Image’s technology can naturally flow back into the main Qwen project, so the image recognition and generation capabilities of the Qwen-VL series can be expected to improve dramatically.
Furthermore, it has already been revealed that a high-performance model with more than 200B parameters based on the HiDream-O1-Image model exists.
The UiT architecture released under the MIT license with HiDream-O1-Image is likely to become the new standard for image generation. We can foresee a future in which current image generation AIs will be restructured under the UiT architecture.
And if the UiT architecture is introduced into ultra-large-scale cloud AIs such as ChatGPT or Gemini, it is beyond the author’s imagination what will become possible.
How to Use HiDream-O1-Image!
Let me show you how to use the HiDream-O1-Image model in ComfyUI.
As of May 15, 2026, ComfyUI does not yet have official support for HiDream-O1-Image, so we use a community custom node.
HiDream_O1-ComfyUI Custom Node
It cannot be found when searching in ComfyUI-Manager v3 (custom nodes), so install it manually.
Two models are available for HiDream-O1-Image:
- HiDream-O1-Image: Base model
- HiDream-O1-Image-Dev: Distilled model for faster execution
Since HiDream-O1-Image does not use VAE, colors will not degrade even if you use the base model as-is.
ComfyUI is optimized for latent space rather than UiT architecture, so the standard Sampler and Scheduler cannot be used at this time. The Sampler and Scheduler are fixed for each model as follows:
- HiDream-O1-Image: FlowUniPCMultistepScheduler
- HiDream-O1-Image-Dev: FlashFlowMatchEulerDiscreteScheduler (28 steps fixed)
HiDream O1 Sampler Node Settings
- guidance_scale: Equivalent to CFG
- shift: Whether to concentrate steps in the first half (<1) or the second half (>1). Default is -1, Dev: 1, Full: 3
- noise_scale_start: 7.5 (initial noise strength)
- noise_scale_end: 7.5 (final noise strength)
- noise_clip_std: 2.5 (noise change threshold)
The items below “shift” allow manual scheduler adjustment, but manually adjusting the scheduler is quite difficult in practice.
In the future, easy-to-use UiT architecture presets will likely appear, but for now, it is best to use the initial settings that were used during training.
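To see what a shift parameter actually does to a sampler, here is a small demonstration using the timestep-shift formula from SD3-style flow matching, sigma' = shift · sigma / (1 + (shift − 1) · sigma). Whether HiDream-O1-Image's shift uses exactly this formula is an assumption; the point is only to show how shift redistributes a fixed budget of steps across noise levels.

```python
import numpy as np

def shifted_schedule(steps: int, shift: float) -> np.ndarray:
    """Warp a uniform noise-level schedule with an SD3-style shift."""
    sigmas = np.linspace(1.0, 1.0 / steps, steps)   # uniform noise levels, high to low
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

for s in (1.0, 3.0):
    sched = shifted_schedule(8, s)
    print(f"shift={s}: {np.round(sched, 2)}")
# shift=1 leaves the schedule uniform; shift=3 keeps sigma high for longer,
# so more of the run is spent at the noisy end of the process.
```

Running this side by side makes it easy to see why a distilled model with few steps and a full model with many steps ship with different fixed shift values.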
Summary: The Future Where General-Purpose AI Freely Draws Images
- HiDream-O1-Image uses a general-purpose UiT architecture
- The “work is visible” to humans
- Potential for dramatic performance improvements in Qwen-VL models
This time, I introduced the HiDream-O1 model.
HiDream is one of my favorite model series, and I am delighted that they have released a new model that directly challenges the fundamental propositions of image generation AI and breaks common sense.
Two years have passed since Alibaba began releasing the Qwen models under open licenses, and one year since the HiDream models. Based on their track record so far, the author trusts both companies’ commitment to open source.
The field of image generation AI is still full of potential for major transformation, and I am excited to see what kind of future awaits us.
Thank you for reading until the end!