Goodbye Latent Space? How HiDream-O1-Image is Revolutionizing General-Purpose AI Drawing

  • HiDream-O1-Image uses a general-purpose UiT architecture
  • The “work is visible” to humans
  • Potential for dramatic performance improvements in Qwen-VL models

Introduction

Hello, this is Easygoing.

This time, I’d like to introduce the new image generation AI HiDream-O1-Image, which was released on May 8, 2026.

LED-lit colorful PC interior
HiDream-O1-Image_clear_v01_alpha

HiDream.ai is a Chinese AI Startup

HiDream.ai is an AI startup company headquartered in Beijing, China.


gantt
    title HiDream.ai models
    dateFormat YYYY-MM-DD
    tickInterval 12month
	axisFormat %Y

        HiDream-I1-Full :  2025-04-06, 2026-05-15
        HiDream-I1-Dev :  2025-04-06, 2026-05-15
        HiDream-I1-Fast :  2025-04-06, 2026-05-15
        HiDream-E1-Full :  2025-04-27, 2026-05-15
        HiDream-E1.1 :  2025-07-16, 2026-05-15
        HiDream-O1-Image :  crit, 2026-05-08, 2026-05-15

The HiDream-I1 model released by HiDream.ai in April 2025 was a high-performance image generation AI equipped with four text encoders. It caused a global sensation because it was released under the MIT license, allowing free development and commercial use.

About HiDream-I1

While the HiDream-I1 model had excellent prompt understanding, it was computationally heavy for local execution. Additionally, because it was trained on images converted to JPEG, it had the drawback of reproducing JPEG noise. Unfortunately, it did not gain widespread adoption among general users.

However, the HiDream-O1-Image model that appeared on May 8, 2026, brings an even greater impact than the previous HiDream-I1, so I’d like to introduce its features to you all.

Image Generation is Handled by Three Specialized AIs Working Together

Conventional image generation AIs split the work among three specialized AIs.


flowchart TB

subgraph Checkpoint

A1(Text Encoder)

B1(UNet / Transformer)

C1(VAE)

end

  • Text Encoder: Analyzes the prompt
  • UNet / Transformer: Generates the image
  • VAE: Compresses images into latent space and decodes them back into pixels

First, when the user says “Draw a picture,” an AI that understands human language (mainly English and Chinese), the text encoder, converts the instruction into numerical vectors that the machine can work with.

Image Generation Workflow


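Then, based on that vector, an AI specialized in drawing retreats into its own dedicated studio (latent space) and diligently creates the image there. The latent space holds far fewer numbers than the final image (about 12 times fewer with the 16-channel VAEs of Stable Diffusion 3 and Flux.1), so the drawing AI can work much more efficiently.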

The finished painting is in a format only the drawing AI can understand (it looks like static noise to humans), so it is decoded (VAE decode) to produce an image visible to humans.
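The “12 times” figure can be checked with a quick back-of-the-envelope calculation. This is a minimal sketch assuming the common VAE setting of 8x downsampling in both height and width; the 4-channel case corresponds to the SD1/SDXL VAEs and the 16-channel case to the SD3/Flux.1 VAEs. It only counts stored values, not actual compute cost.

```python
def pixel_values(h: int, w: int, channels: int = 3) -> int:
    """Number of values in an RGB image of size h x w."""
    return h * w * channels


def latent_values(h: int, w: int, downsample: int = 8, latent_channels: int = 16) -> int:
    """Number of values in a VAE latent that shrinks each side by `downsample`."""
    if h % downsample or w % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (h // downsample) * (w // downsample) * latent_channels


for name, ch in [("4-channel VAE (SD1 / SDXL)", 4), ("16-channel VAE (SD3 / Flux.1)", 16)]:
    ratio = pixel_values(1024, 1024) / latent_values(1024, 1024, latent_channels=ch)
    print(f"{name}: {ratio:g}x fewer values")
# 4-channel VAE (SD1 / SDXL): 48x fewer values
# 16-channel VAE (SD3 / Flux.1): 12x fewer values
```

The 16-channel VAEs trade some of that compression for more detail, which is why newer models land at about 12x rather than 48x.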

gantt
    title Image Generative AI Roadmap
    dateFormat YYYY-MM-DD
    tickInterval 12month
    axisFormat %Y

    section Stability AI
        Stable Diffusion 1 : 2022-08-22, 2026-05-15
        Stable Diffusion XL : 2023-07-26, 2026-05-15
        Stable Diffusion 3 : 2024-06-12, 2026-05-15

    section Fal.ai
        AuraFlow : 2024-07-12, 2026-05-15

    section Black Forest Labs
        Flux.1 : 2024-08-01, 2026-05-15
        Flux.2 : 2025-11-25, 2026-05-15

	section DeepSeek.ai
		janus-pro : 2025-01-25,  2026-05-15

	section Zhipu AI
        CogVideoX : 2024-08-06, 2026-05-15
        GLM-Image : 2026-01-12, 2026-05-15

    section Rhymes AI
        Allegro : 2024-10-22, 2026-05-15

	section Genmo
        Mochi : 2024-10-25, 2026-05-15

    section Tencent
		Hunyuan video : 2024-12-03, 2026-05-15
		Hunyuan image : 2025-09-09, 2026-05-15

    section lllyasviel
		Framepack : 2025-04-17, 2026-05-15

	section Lightricks
		LTX : 2024-12-11,  2026-05-15

    section StepFun
        Step-Video-T2V : 2025-02-17, 2026-05-15

    section Alibaba
        Wan : 2025-02-25, 2026-05-15
        Qwen-Image : 2025-08-04, 2026-05-15
		Z-Image : 2025-11-25, 2026-05-15

	section NVIDIA
        Cosmos-Predict2 : 2025-04-30, 2026-05-15

    section CircleStone Labs
        Anima : 2026-01-26, 2026-05-15

    section Baidu
		ERNIE-Image : 2026-04-07, 2026-05-15

    section HiDream.ai
        HiDream-I1 : 2025-04-06, 2026-05-15
		HiDream-O1-Image :  crit, 2026-05-08, 2026-05-15

This has been the standard approach for mainstream image generation AIs ever since Stable Diffusion 1 was released.

HiDream-O1-Image Does Everything by Itself!

Now, let’s take a look at how HiDream-O1-Image processes images.

HiDream-O1-Image is a model that extends Qwen3-VL, a vision-language model (a chat AI that can also read images), by adding image generation capabilities.

HiDream-O1-Image Workflow

flowchart TB
A1(User)
B1(HiDream-O1-Image)
A1--"Draw a picture"-->B1
B1--"Here you go!"-->A1

HiDream-O1-Image handles language and images in the same dimension, inside a single Transformer (UiT: Pixel-level Unified Transformer architecture), and draws pictures by itself.
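To make “the same dimension” a little more concrete, here is a minimal sketch of how a pixel-space unified Transformer can place text and image into one token sequence. It is a generic ViT-style patch embedding written with NumPy, not HiDream's actual code: the patch size, model width, vocabulary size and token IDs are hypothetical, and the random matrices stand in for learned weights.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches, one flat token per patch."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh * gw, patch * patch * c)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens: np.ndarray, patch: int, h: int, w: int, c: int) -> np.ndarray:
    """Inverse of patchify: put the flat patch tokens back into an (H, W, C) image."""
    gh, gw = h // patch, w // patch
    grid = tokens.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(h, w, c)


rng = np.random.default_rng(0)
PATCH, D_MODEL, VOCAB = 16, 64, 1000            # hypothetical sizes, not HiDream's config

image = rng.random((256, 256, 3))               # stand-in RGB image in [0, 1)
pixel_tokens = patchify(image, PATCH)           # 256 tokens of 768 values each

# Random matrices stand in for learned weights.
w_in = rng.standard_normal((PATCH * PATCH * 3, D_MODEL)) * 0.02
embed_table = rng.standard_normal((VOCAB, D_MODEL)) * 0.02

text_ids = np.array([12, 57, 300, 7])            # hypothetical token IDs for a short prompt
text_emb = embed_table[text_ids]                 # (4, 64)
image_emb = pixel_tokens @ w_in                  # (256, 64)

# One sequence: text tokens followed by image tokens, all with the same width.
sequence = np.concatenate([text_emb, image_emb], axis=0)
print(sequence.shape)  # (260, 64)
```

Because the image tokens are just rearranged pixels, `unpatchify` recovers the exact image. On the output side, a model like this only has to map tokens back to patches and reshape them, rather than run a separate VAE decoder.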

Since it doesn’t retreat into its own dedicated studio (latent space), every intermediate step is already an image, so humans can check step by step which parts it is modifying.

Furthermore, because it draws human-visible images directly, there is no need for decoding, and thus no image quality degradation caused by decoding.
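The “visible work” point can be sketched with a toy sampling loop. Many recent image generators, including Stable Diffusion 3 and Flux.1, start from noise and follow a learned velocity field step by step (rectified flow); whether HiDream-O1-Image uses exactly this kind of sampler is an assumption here. In the sketch, `toy_velocity` is a hypothetical stand-in that already knows the target image, whereas a real model would predict the velocity from the prompt. Because every state `x` lives in pixel space, each step can be saved as a preview without any VAE decode.

```python
import numpy as np


def toy_velocity(x: np.ndarray, t: float, target: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model: the straight-line (rectified-flow) velocity
    that moves x toward a known target image. A real model predicts this from the prompt."""
    return (target - x) / (1.0 - t)


def sample_with_previews(target: np.ndarray, steps: int = 8, seed: int = 0) -> list:
    """Euler sampling in pixel space, keeping a viewable preview after every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)          # pure noise, already image-shaped
    ts = np.linspace(0.0, 1.0, steps + 1)
    previews = [np.clip(x, 0.0, 1.0)]              # clip to a displayable [0, 1] range
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * toy_velocity(x, t0, target)
        previews.append(np.clip(x, 0.0, 1.0))      # no VAE decode needed to look at it
    return previews


target = np.random.default_rng(1).random((32, 32, 3))  # stand-in "finished picture"
previews = sample_with_previews(target, steps=8)
print(len(previews))                                 # 9: the starting noise plus 8 steps
print(np.allclose(previews[-1], target))             # True
```

In a latent-space model, the same loop would produce latents, and each preview would need an extra VAE decode before a human could look at it.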


HiDream-O1-Image is a groundbreaking AI model that proves a general-purpose AI can generate illustrations without needing a specialized drawing AI.

General-Purpose AI Has Many Advantages

General-purpose AI offers numerous benefits.

Simplified Workflow

As you can see from the diagram above, using a general-purpose model greatly simplifies the workflow.

A simpler workflow naturally leads to faster processing. It also eliminates the need to carefully manage information passed between AIs, making adjustments much easier.

Additionally, there have been frequent cases recently of malware being injected into libraries used by AI. With fewer models to use, fewer libraries are required, which also reduces security risks.

HiDream-O1-Image uses only UiT
Lighter than Z-Image

For the roughly four years since image generation AI emerged, the common belief has been that “latent space is necessary for AI to generate images efficiently.” HiDream-O1-Image shattered this belief with its UiT architecture, and it did so with a model only one-third the size of its predecessor, which is truly astonishing.

Processing Language and Images in the Same Dimension

By processing language and images in the same dimension, HiDream-O1-Image can perform image editing much more naturally than before.

Image Editing Examples with HiDream-O1-Image

High-precision image editing

AI models acquire the abilities we want by learning from massive amounts of data, but how they work internally remains largely a black box.

With the arrival of the HiDream-O1-Image model, which processes language and images in the same dimension without using latent space, we can expect progress in the understanding of image generation AI models themselves.

What Will Happen Next?

HiDream-O1-Image is built on Alibaba’s Qwen3-VL model.

As of May 2026, the Qwen series has established itself as the de facto standard text encoder for image generation thanks to its high performance and open license.

Text Encoders for Image Generation AI Models

| Developer | Model | Text Encoder | Encoder Developer |
|---|---|---|---|
| CLIP generation (~2023) | | | |
| Stability AI | Stable Diffusion 1.x | CLIP-L (0.1B) | OpenAI |
| Stability AI | Stable Diffusion XL | CLIP-L (0.1B), OpenCLIP-G (0.7B) | OpenAI, LAION |
| T5 generation (2024~) | | | |
| Stability AI | Stable Diffusion 3 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B) | OpenAI, LAION, Google |
| Fal.ai | AuraFlow | pile-T5-XL (3B) | EleutherAI / Google |
| Black Forest Labs | Flux.1 [schnell / dev] | CLIP-L (0.1B), T5-XXL-v1.1 (11B) | OpenAI, Google |
| DeepSeek | Janus-Pro | SigLIP-L (0.4B), DeepSeek-LLM (7B) | Google, DeepSeek |
| Zhipu AI | CogVideoX | T5-XXL (11B) | Google |
| Zhipu AI | GLM-Image | GLM-4-9B (9B), Glyph Encoder | Zhipu AI |
| Genmo | Mochi | T5-XXL-v1.1 (11B) | Google |
| Rhymes AI | Allegro | T5-XXL (11B) | Google |
| Lightricks | LTX-Video | T5-XXL-v1.1 (11B) | Google |
| NVIDIA | Cosmos-Predict2 | T5-XXL (11B) | Google |
| LLM generation (2025~) | | | |
| HiDream.ai | HiDream-I1 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B), Llama-3.1-Instruct (8B) | OpenAI, LAION, Google, Meta |
| Tencent | Hunyuan Video | LLaVA-LLaMA-3 (8B), CLIP-L (0.1B) | Xtuner / Meta, OpenAI |
| Tencent | Hunyuan Image | Proprietary MLLM | Tencent |
| StepFun | Step-Video-T2V | Hunyuan-CLIP, Step-LLM | Tencent, StepFun |
| Alibaba | Wan (2.1 / 2.2) | UMT5-XXL (13B) | Google |
| Alibaba | Qwen-Image | Qwen2.5-VL (7B) | Alibaba |
| Alibaba | Z-Image | Qwen3 (4B) | Alibaba |
| Black Forest Labs | Flux.2 [dev] | Mistral Small 3.2 / Pixtral (24B) | Mistral AI |
| Black Forest Labs | Flux.2 [klein] 9B | Qwen3 (8B) | Alibaba |
| Black Forest Labs | Flux.2 [klein] 4B | Qwen3 (4B) | Alibaba |
| CircleStone Labs | Anima | Qwen3-Base (0.6B) | Alibaba |
| Baidu | ERNIE-Image | Mistral3, Pixtral (3.3B) | Mistral AI |
| HiDream.ai | HiDream-O1-Image | Qwen3-VL (8B) | Alibaba |

HiDream-O1-Image’s technology can naturally be fed back into the main Qwen project, so the image recognition and generation capabilities of the Qwen-VL series are likely to improve dramatically.

Furthermore, it has already been revealed that a high-performance model with more than 200B parameters based on the HiDream-O1-Image model exists.

The UiT architecture released under the MIT license with HiDream-O1-Image is likely to become the new standard for image generation. We can foresee a future in which current image generation AIs will be restructured under the UiT architecture.

And if the UiT architecture is introduced into ultra-large-scale cloud AIs such as ChatGPT or Gemini, what will become possible is beyond the author’s imagination.

Anime illustration of a silver-haired girl smiling in front of a colorful LED PC
Will GPT-Image or Nano Banana become 10 times more powerful?

How to Use HiDream-O1-Image!

Let me show you how to use the HiDream-O1-Image model in ComfyUI.

ComfyUI added support for HiDream-O1-Image on May 13, 2026, but the default workflow is quite complex. Here, I’ll introduce my own custom node that implements HiDream-O1-Image in a simple way.

ComfyUI-uit-hidream-o1 Custom Node

Install from ComfyUI-Manager

Models

Text to Image

HiDream-O1-Image text-to-image workflow

Image to Image

HiDream-O1-Image image-to-image workflow

Image Edit

HiDream-O1-Image image-edit workflow

uit-hidream-o1-image-edit_20260520.json
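If you prefer to run a finished workflow from a script, ComfyUI's built-in HTTP server accepts jobs at its `/prompt` endpoint. The sketch below assumes a local ComfyUI at the default address `http://127.0.0.1:8188` and a workflow that has been exported in ComfyUI's API format. The regular workflow JSON used for drag-and-drop is a different format, and the name of the API export menu differs between ComfyUI versions.

```python
import json
import urllib.request
import uuid
from typing import Optional

COMFYUI_URL = "http://127.0.0.1:8188"  # default address of a local ComfyUI server


def build_prompt_request(workflow: dict, server: str = COMFYUI_URL,
                         client_id: Optional[str] = None) -> urllib.request.Request:
    """Wrap an API-format workflow in the JSON body that ComfyUI's /prompt endpoint expects."""
    payload = {"prompt": workflow, "client_id": client_id or str(uuid.uuid4())}
    return urllib.request.Request(
        f"{server}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str, server: str = COMFYUI_URL) -> dict:
    """Load an API-format workflow file and queue it on a running ComfyUI server."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_prompt_request(workflow, server)) as resp:
        return json.load(resp)
```

For example, `queue_workflow("hidream_o1_edit_api.json")` (a hypothetical file name) queues the job and returns ComfyUI's response, which includes a `prompt_id`. If the workflow contains a Save Image node, the results land in ComfyUI's output folder as usual.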

Summary: The Future Where General-Purpose AI Freely Draws Images

  • HiDream-O1-Image uses a general-purpose UiT architecture
  • The “work is visible” to humans
  • Potential for dramatic performance improvements in Qwen-VL models

This time, I introduced the HiDream-O1-Image model.

HiDream is one of my favorite model series, and I am delighted that they have released a new model that directly challenges the fundamental propositions of image generation AI and breaks common sense.

Anime illustration of a silver-haired girl smiling with a colorful LED PC in the background
Qwen and HiDream are open source

Two years have passed since Alibaba began releasing the Qwen models under open licenses, and one year since the HiDream models. Based on their track record so far, the author trusts both companies’ commitment to open source.

The field of image generation AI is still full of potential for major transformation, and I am excited to see what kind of future awaits us.

Thank you for reading until the end!


Update History

2026.5.31

Replaced the custom node with ComfyUI-uit-hidream-o1

2026.5.17

Corrected the section regarding ComfyUI support for HiDream-O1-Image