Goodbye Latent Space? How HiDream-O1-Image is Revolutionizing General-Purpose AI Drawing
- HiDream-O1-Image uses a general-purpose UiT architecture
- The “work is visible” to humans
- Potential for dramatic performance improvements in Qwen-VL models
Introduction
Hello, this is Easygoing.
This time, I’d like to introduce the new image generation AI HiDream-O1-Image, which was released on May 8, 2026.
HiDream.ai is a Chinese AI Startup
HiDream.ai is an AI startup company headquartered in Beijing, China.
```mermaid
gantt
title HiDream.ai models
dateFormat YYYY-MM-DD
tickInterval 12month
axisFormat %Y
HiDream-I1-Full : 2025-04-06, 2026-05-15
HiDream-I1-Dev : 2025-04-06, 2026-05-15
HiDream-I1-Fast : 2025-04-06, 2026-05-15
HiDream-E1-Full : 2025-04-27, 2026-05-15
HiDream-E1.1 : 2025-07-16, 2026-05-15
HiDream-O1-Image : crit, 2026-05-08, 2026-05-15
```
The HiDream-I1 model released by HiDream.ai in April 2025 was a high-performance image generation AI equipped with four text encoders. It caused a global sensation because it was released under the MIT license, allowing free development and commercial use.
About HiDream-I1
While the HiDream-I1 model had excellent prompt understanding, it was computationally heavy to run locally. Additionally, because it was trained on images that had been converted to JPEG, it tended to reproduce JPEG compression artifacts. Unfortunately, it did not gain widespread adoption among general users.
However, the HiDream-O1-Image model that appeared on May 8, 2026, brings an even greater impact than the previous HiDream-I1, so I’d like to introduce its features to you all.
Image Generation is Handled by Three Specialized AIs Working Together
Image generation is performed through the division of labor among three AIs.
```mermaid
flowchart TB
subgraph Checkpoint
A1(Text Encoder)
B1(Unet / Transformer)
C1(VAE)
end
```
- Text Encoder: Analyzes the prompt
- UNet / Transformer: Generates the image
- VAE: Compresses images into latent space and decodes them back into pixels
First, when the user says “Draw a picture,” an AI that understands human language (mainly English and Chinese) converts the instruction into machine language (vectors).
Image Generation Workflow
```mermaid
flowchart TD
A1(User)
B1(Text Encoder)
C1(UNet / Transformer)
D1(VAE)
A1 -- "Draw a picture" --> B1
B1 -- "[1, 0, 128, 2, 4, 6, 0, 2]" --> C1
C1 -- "[4, 2, 0, 64, 8, 2, 1, 4]" --> D1
D1 -- "Here you go!" --> A1
subgraph Latent Space
C1
end
```
Then, based on that vector, an AI specialized in drawing retreats into its own dedicated studio (latent space) and works 12 times more efficiently, diligently creating the image.
The finished painting is in a format only the drawing AI can understand (it looks like static noise to humans), so it is decoded (VAE decode) to produce an image visible to humans.
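The three-stage relay described above can be sketched in code. The following is a toy NumPy illustration of the data flow only: the shapes follow common conventions (a 4-channel 64×64 latent for a 512×512 RGB image), and the "encoder," "denoiser," and "decoder" here are trivial stand-in functions, not HiDream's or anyone's actual models.

```python
import numpy as np

# Toy stand-ins for the three specialists, to make the data flow visible.

def text_encoder(prompt: str) -> np.ndarray:
    """Turn the prompt into a vector (here: a trivial byte embedding)."""
    codes = np.frombuffer(prompt.encode("utf-8"), dtype=np.uint8)
    return codes.astype(np.float32) / 255.0

def denoiser(cond: np.ndarray, steps: int = 4) -> np.ndarray:
    """Refine random noise into a latent, guided by the prompt vector.
    The latent is small: 4 channels of 64x64 instead of 512x512 RGB."""
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((4, 64, 64)).astype(np.float32)
    for _ in range(steps):
        latent = latent - 0.25 * (latent - cond.mean())  # fake denoising step
    return latent

def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Blow the latent back up into a human-visible 512x512 RGB image."""
    upscaled = latent[:3].repeat(8, axis=1).repeat(8, axis=2)  # 8x per side
    return np.clip((upscaled + 1.0) / 2.0, 0.0, 1.0)

vec = text_encoder("Draw a picture")
latent = denoiser(vec)
image = vae_decode(latent)
print(latent.shape, image.shape)  # the latent has far fewer values than the image
```

The point of the sketch is the hand-off: each stage only understands its neighbor's output, which is exactly why the VAE decode step at the end is unavoidable in this design.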
```mermaid
gantt
title Image Generative AI Roadmap
dateFormat YYYY-MM-DD
tickInterval 12month
axisFormat %Y
section Stability AI
Stable Diffusion 1 : 2022-08-22, 2026-05-15
Stable Diffusion XL : 2023-07-26, 2026-05-15
Stable Diffusion 3 : 2024-06-12, 2026-05-15
section Fal.ai
AuraFlow : 2024-07-12, 2026-05-15
section Black Forest Labs
Flux.1 : 2024-08-01, 2026-05-15
Flux.2 : 2025-11-25, 2026-05-15
section DeepSeek.ai
janus-pro : 2025-01-25, 2026-05-15
section Zhipu AI
CogVideoX : 2024-08-06, 2026-05-15
GLM-Image : 2026-01-12, 2026-05-15
section Rhymes AI
Allegro : 2024-10-22, 2026-05-15
section Genmo
Mochi : 2024-10-25, 2026-05-15
section Tencent
Hunyuan video : 2024-12-03, 2026-05-15
Hunyuan image : 2025-09-09, 2026-05-15
section lllyasviel
Framepack : 2025-04-17, 2026-05-15
section Lightricks
LTX : 2024-12-11, 2026-05-15
section StepFun
Step-Video-T2V : 2025-02-17, 2026-05-15
section Alibaba
Wan : 2025-02-25, 2026-05-15
Qwen-Image : 2025-08-04, 2026-05-15
Z-Image : 2025-11-25, 2026-05-15
section NVIDIA
Cosmos-Predict2 : 2025-04-30, 2026-05-15
section CircleStone Labs
Anima : 2026-01-26, 2026-05-15
section Baidu
ERNIE-Image : 2026-04-07, 2026-05-15
section HiDream.ai
HiDream-I1 : 2025-04-06, 2026-05-15
HiDream-O1-Image : crit, 2026-05-08, 2026-05-15
```
This has been the method used by all image generation AIs since the release of Stable Diffusion 1.
HiDream-O1-Image Does Everything by Itself!
Now, let’s take a look at how HiDream-O1-Image processes images.
HiDream-O1-Image is a model that extends the large language model (chat AI) called Qwen3-VL by adding image generation capabilities.
HiDream-O1-Image Workflow
```mermaid
flowchart TB
A1(User)
B1(HiDream-O1-Image)
A1--"Draw a picture"-->B1
B1--"Here you go!"-->A1
```
HiDream-O1-Image understands language and images in the same dimension (UiT: Pixel-level Unified Transformer architecture) and draws pictures by itself.
Since it doesn’t retreat into its own dedicated studio, humans can sequentially check what parts it is modifying.
Furthermore, because it draws human-visible images directly, there is no need for decoding, and thus no image quality degradation caused by decoding.
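To make the "same dimension" idea concrete, here is a toy sketch of a unified transformer: text tokens and image patches share one embedding space, and the model's output is pixel values directly, with no latent space and no VAE decode. All of the sizes, weights, and the single fake "attention" pass are illustrative assumptions, not HiDream-O1-Image's actual architecture.

```python
import numpy as np

D = 16      # shared embedding width for text AND image tokens
PATCH = 8   # each image token covers an 8x8 RGB patch
GRID = 4    # 4x4 patches -> a 32x32 image

rng = np.random.default_rng(0)
W_txt = rng.standard_normal((256, D)) * 0.1                 # byte vocab -> shared space
W_out = rng.standard_normal((D, PATCH * PATCH * 3)) * 0.1   # token -> raw pixels

def embed_text(prompt: str) -> np.ndarray:
    ids = np.frombuffer(prompt.encode("utf-8"), dtype=np.uint8)
    return W_txt[ids]                                       # (n_text, D)

def generate(prompt: str) -> np.ndarray:
    txt = embed_text(prompt)
    img_tokens = rng.standard_normal((GRID * GRID, D))      # start from noise
    # One fake "attention" pass: every image token mixes in the text tokens.
    img_tokens = img_tokens + txt.mean(axis=0)
    patches = img_tokens @ W_out                            # (16, 8*8*3) pixel values
    patches = patches.reshape(GRID, GRID, PATCH, PATCH, 3)
    image = patches.transpose(0, 2, 1, 3, 4).reshape(GRID * PATCH, GRID * PATCH, 3)
    return np.clip(image * 0.5 + 0.5, 0.0, 1.0)             # already human-visible

image = generate("Draw a picture")
print(image.shape)  # (32, 32, 3): pixels come straight out of the model
```

Because the output of every step is already an ordinary pixel grid, intermediate states can be displayed to a human at any point, which is exactly the "work is visible" property described above.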
About Image Quality Degradation Caused by VAE
HiDream-O1-Image is a groundbreaking AI model that proves a general-purpose AI can generate illustrations without needing a specialized drawing AI.
General-Purpose AI Has Many Advantages
General-purpose AI offers numerous benefits.
Simplified Workflow
As you can see from the diagram above, using a general-purpose model greatly simplifies the workflow.
A simpler workflow naturally leads to faster processing. It also eliminates the need to carefully manage information passed between AIs, making adjustments much easier.
Additionally, there have been frequent cases recently of malware being injected into libraries used by AI. With fewer models to use, fewer libraries are required, which also reduces security risks.
Lighter than Z-Image
The common belief in the four years since image generation AI emerged has been that “latent space is necessary for AI to generate images efficiently.” HiDream-O1-Image shattered this belief with the UiT architecture, and did so with a model only one-third the size of its predecessor, which is truly astonishing.
Processing Language and Images in the Same Dimension
By processing language and images in the same dimension, HiDream-O1-Image can perform image editing much more naturally than before.
Image Editing Examples with HiDream-O1-Image
※ The author has not yet been able to reproduce image editing.
AI acquires desired functions by ingesting massive amounts of information, but the details remain a black box.
With the arrival of the HiDream-O1-Image model, which processes language and images in the same dimension without using latent space, we can expect progress in the understanding of image generation AI models themselves.
What Will Happen Next?
HiDream-O1-Image is built on Alibaba’s Qwen3-VL model.
As of May 2026, the Qwen series has established itself as the de facto standard text encoder for image generation thanks to its high performance and open license.
Text Encoders for Image Generation AI Models
| Developer | Model | Text Encoder | Encoder Developer |
|---|---|---|---|
| — CLIP Generation (~2023) — | | | |
| Stability AI | Stable Diffusion 1.x | CLIP-L (0.1B) | OpenAI |
| | Stable Diffusion XL | CLIP-L (0.1B), OpenCLIP-G (0.7B) | OpenAI, LAION |
| — T5 Generation (2024~) — | | | |
| Stability AI | Stable Diffusion 3 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B) | OpenAI, LAION, Google |
| Fal.ai | AuraFlow | pile-T5-XL (3B) | EleutherAI / Google |
| Black Forest Labs | Flux.1 [schnell / dev] | CLIP-L (0.1B), T5-XXL-v1.1 (11B) | OpenAI, Google |
| DeepSeek | Janus-Pro | SigLIP-L (0.4B), DeepSeek-LLM (7B) | Google, DeepSeek |
| Zhipu AI | CogVideoX | T5-XXL (11B) | Google |
| | GLM-Image | GLM-4-9B (9B), Glyph Encoder | Zhipu AI |
| Genmo | Mochi | T5-XXL-v1.1 (11B) | Google |
| Rhymes AI | Allegro | T5-XXL (11B) | Google |
| Lightricks | LTX-Video | T5-XXL-v1.1 (11B) | Google |
| NVIDIA | Cosmos-Predict2 | T5-XXL (11B) | Google |
| — Proprietary LLM Generation (2025~) — | | | |
| HiDream.ai | HiDream-I1 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B), Llama-3.1-Instruct (8B) | OpenAI, LAION, Google, Meta |
| Tencent | Hunyuan Video | LLaVA-LLaMA-3 (8B), CLIP-L (0.1B) | Xtuner / Meta, OpenAI |
| | Hunyuan Image | Proprietary MLLM | Tencent |
| StepFun | Step-Video-T2V | Hunyuan-CLIP, Step-LLM | Tencent, StepFun |
| Alibaba | Wan (2.1 / 2.2) | UMT5-XXL (13B) | Google |
| | Qwen-Image | Qwen2.5-VL (7B) | Alibaba |
| | Z-Image | Qwen3 (4B) | Alibaba |
| Black Forest Labs | Flux.2 [dev] | Mistral Small 3.2 / Pixtral (24B) | Mistral AI |
| | Flux.2 [klein] 9B | Qwen3 (8B) | Alibaba |
| | Flux.2 [klein] 4B | Qwen3 (4B) | Alibaba |
| CircleStone Labs | Anima | Qwen3-Base (0.6B) | Alibaba |
| Baidu | ERNIE-Image | Mistral3 Pixtral (3.3B) | Mistral AI |
| HiDream.ai | HiDream-O1-Image | Qwen3-VL (8B) | Alibaba |
When it comes to image generation, it’s Qwen.
HiDream-O1-Image’s technology can naturally flow back into the main Qwen project, so the image recognition and generation capabilities of the Qwen-VL series can be expected to improve dramatically.
Furthermore, it has already been revealed that a high-performance model with more than 200B parameters based on the HiDream-O1-Image model exists.
The UiT architecture released under the MIT license with HiDream-O1-Image is likely to become the new standard for image generation. We can foresee a future in which current image generation AIs will be restructured under the UiT architecture.
And if the UiT architecture is introduced into ultra-large-scale cloud AIs such as ChatGPT or Gemini, it is beyond the author’s imagination what will become possible.
How to Use HiDream-O1-Image!
Let me show you how to use the HiDream-O1-Image model in ComfyUI.
As of May 15, 2026, ComfyUI does not yet have official support for HiDream-O1-Image, so we use a community custom node.
HiDream_O1-ComfyUI Custom Node
It cannot be found when searching in ComfyUI-Manager v3 (custom nodes), so install it manually.
Two models are available for HiDream-O1-Image:
- HiDream-O1-Image: Base model
- HiDream-O1-Image-Dev: Distilled model for faster execution
Since HiDream-O1-Image does not use VAE, colors will not degrade even if you use the base model as-is.
ComfyUI is optimized for latent space rather than UiT architecture, so the standard Sampler and Scheduler cannot be used at this time. The Sampler and Scheduler are fixed for each model as follows:
- HiDream-O1-Image: FlowUniPCMultistepScheduler
- HiDream-O1-Image-Dev: FlashFlowMatchEulerDiscreteScheduler (28 steps fixed)
HiDream O1 Sampler Node Settings
- guidance_scale: Equivalent to CFG
- shift: Whether to concentrate steps in the first half (<1) or the second half (>1). Default is -1, Dev: 1, Full: 3
- noise_scale_start: 7.5 (initial noise strength)
- noise_scale_end: 7.5 (final noise strength)
- noise_clip_std: 2.5 (noise change threshold)
The items below “shift” allow manual scheduler adjustment, but manually adjusting the scheduler is quite difficult in practice.
In the future, easy-to-use UiT architecture presets will likely appear, but for now, it is best to use the initial settings that were used during training.
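To see what a shift parameter actually does to a sampler, here is a small demonstration using the timestep-shift formula from SD3-style flow matching, sigma' = shift · sigma / (1 + (shift − 1) · sigma). Whether HiDream-O1-Image's shift uses exactly this formula is an assumption; the point is only to show how shift redistributes a fixed budget of steps across noise levels.

```python
import numpy as np

def shifted_schedule(steps: int, shift: float) -> np.ndarray:
    """Warp a uniform noise-level schedule with an SD3-style shift."""
    sigmas = np.linspace(1.0, 1.0 / steps, steps)   # uniform noise levels, high to low
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

for s in (1.0, 3.0):
    sched = shifted_schedule(8, s)
    print(f"shift={s}: {np.round(sched, 2)}")
# shift=1 leaves the schedule uniform; shift=3 keeps sigma high for longer,
# so more of the run is spent at the noisy end of the process.
```

Running this side by side makes it easy to see why a distilled model with few steps and a full model with many steps ship with different fixed shift values.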
Summary: The Future Where General-Purpose AI Freely Draws Images
- HiDream-O1-Image uses a general-purpose UiT architecture
- The “work is visible” to humans
- Potential for dramatic performance improvements in Qwen-VL models
This time, I introduced the HiDream-O1 model.
HiDream is one of my favorite model series, and I am delighted that they have released a new model that directly challenges the fundamental propositions of image generation AI and breaks common sense.
Two years have passed since Alibaba began releasing the Qwen models under open licenses, and one year since the HiDream models. Based on their track record so far, the author trusts both companies’ commitment to open source.
The field of image generation AI is still full of potential for major transformation, and I am excited to see what kind of future awaits us.
Thank you for reading until the end!