Goodbye Latent Space? How HiDream-O1-Image is Revolutionizing General-Purpose AI Drawing

  • HiDream-O1-Image uses a general-purpose UiT architecture
  • The model’s “work in progress” is visible to humans
  • Potential for dramatic performance improvements in Qwen-VL models

Introduction

Hello, this is Easygoing.

This time, I’d like to introduce the new image generation AI HiDream-O1-Image, which was released on May 8, 2026.

LED-lit colorful PC interior
HiDream-O1-Image_clear_v01_alpha

HiDream.ai is a Chinese AI Startup

HiDream.ai is an AI startup headquartered in Beijing, China.


gantt
    title HiDream.ai models
    dateFormat YYYY-MM-DD
    tickInterval 12month
    axisFormat %Y

    HiDream-I1-Full : 2025-04-06, 2026-05-15
    HiDream-I1-Dev : 2025-04-06, 2026-05-15
    HiDream-I1-Fast : 2025-04-06, 2026-05-15
    HiDream-E1-Full : 2025-04-27, 2026-05-15
    HiDream-E1.1 : 2025-07-16, 2026-05-15
    HiDream-O1-Image : crit, 2026-05-08, 2026-05-15

The HiDream-I1 model released by HiDream.ai in April 2025 was a high-performance image generation AI equipped with four text encoders. It caused a global sensation because it was released under the MIT license, allowing free development and commercial use.

About HiDream-I1

While the HiDream-I1 model had excellent prompt understanding, it was computationally heavy for local execution. Additionally, because it was trained on images converted to JPEG, it had the drawback of reproducing JPEG noise. Unfortunately, it did not gain widespread adoption among general users.

However, the HiDream-O1-Image model that appeared on May 8, 2026, brings an even greater impact than the previous HiDream-I1, so I’d like to introduce its features to you all.

Image Generation is Handled by Three Specialized AIs Working Together

Image generation is performed through the division of labor among three AIs.


flowchart TB
    subgraph Checkpoint
        A1(Text Encoder)
        B1(UNet / Transformer)
        C1(VAE)
    end

  • Text Encoder: Analyzes the prompt
  • UNet / Transformer: Generates the image
  • VAE: Compresses images into latent space and decodes them back into pixels

First, when the user says “Draw a picture,” an AI that understands human language (mainly English and Chinese) converts the instruction into machine language (vectors).

Image Generation Workflow


flowchart TD
    A1(User)
    B1(Text Encoder)
    C1(UNet / Transformer)
    D1(VAE)

    A1 -- "Draw a picture" --> B1
    B1 -- "[1, 0, 128, 2, 4, 6, 0, 2]" --> C1
    C1 -- "[4, 2, 0, 64, 8, 2, 1, 4]" --> D1
    D1 -- "Here you go!" --> A1

    subgraph Latent Space
        C1
    end

Then, based on that vector, an AI specialized in drawing retreats into its own dedicated studio (latent space) and works 12 times more efficiently, diligently creating the image.

The finished painting is in a format only the drawing AI can understand (it looks like static noise to humans), so it is decoded (VAE decode) to produce an image visible to humans.
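The three-stage flow above can be sketched as toy Python. Every class here is an illustrative stand-in with fake math (no real weights, and not any actual library’s API):

```python
# Toy sketch of the classic three-stage pipeline. All classes are
# illustrative stand-ins, not a real model or library API.

class TextEncoder:
    def encode(self, prompt: str) -> list[float]:
        # Real encoders map tokens to embedding vectors; this just fakes one.
        return [float(ord(c) % 8) for c in prompt[:8]]

class DenoisingTransformer:
    def generate(self, cond: list[float], steps: int = 28) -> list[float]:
        # Iteratively refines a latent, working entirely inside latent space.
        latent = [0.0] * len(cond)
        for _ in range(steps):
            latent = [l + c / steps for l, c in zip(latent, cond)]
        return latent

class VAE:
    def decode(self, latent: list[float]) -> list[int]:
        # Decompresses the latent back into human-visible pixel values.
        return [round(x * 32) for x in latent]

def txt2img(prompt: str) -> list[int]:
    cond = TextEncoder().encode(prompt)             # 1. prompt -> vectors
    latent = DenoisingTransformer().generate(cond)  # 2. draw in latent space
    return VAE().decode(latent)                     # 3. VAE decode to pixels
```

Until the final `decode`, everything the middle model produces is latent data that looks like static noise to a human.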

gantt
    title Image Generative AI Roadmap
    dateFormat YYYY-MM-DD
    tickInterval 12month
    axisFormat %Y

    section Stability AI
        Stable Diffusion 1 : 2022-08-22, 2026-05-15
        Stable Diffusion XL : 2023-07-26, 2026-05-15
        Stable Diffusion 3 : 2024-06-12, 2026-05-15

    section Fal.ai
        AuraFlow : 2024-07-12, 2026-05-15

    section Black Forest Labs
        Flux.1 : 2024-08-01, 2026-05-15
        Flux.2 : 2025-11-25, 2026-05-15

    section DeepSeek.ai
        Janus-Pro : 2025-01-25, 2026-05-15

    section Zhipu AI
        CogVideoX : 2024-08-06, 2026-05-15
        GLM-Image : 2026-01-12, 2026-05-15

    section Rhymes AI
        Allegro : 2024-10-22, 2026-05-15

    section Genmo
        Mochi : 2024-10-25, 2026-05-15

    section Tencent
        Hunyuan Video : 2024-12-03, 2026-05-15
        Hunyuan Image : 2025-09-09, 2026-05-15

    section lllyasviel
        FramePack : 2025-04-17, 2026-05-15

    section Lightricks
        LTX : 2024-12-11, 2026-05-15

    section StepFun
        Step-Video-T2V : 2025-02-17, 2026-05-15

    section Alibaba
        Wan : 2025-02-25, 2026-05-15
        Qwen-Image : 2025-08-04, 2026-05-15
        Z-Image : 2025-11-25, 2026-05-15

    section NVIDIA
        Cosmos-Predict2 : 2025-04-30, 2026-05-15

    section CircleStone Labs
        Anima : 2026-01-26, 2026-05-15

    section Baidu
        ERNIE-Image : 2026-04-07, 2026-05-15

    section HiDream.ai
        HiDream-I1 : 2025-04-06, 2026-05-15
        HiDream-O1-Image : crit, 2026-05-08, 2026-05-15

This has been the method used by all image generation AIs since the release of Stable Diffusion 1.

HiDream-O1-Image Does Everything by Itself!

Now, let’s take a look at how HiDream-O1-Image processes images.

HiDream-O1-Image is a model that extends the large language model (chat AI) called Qwen3-VL by adding image generation capabilities.

HiDream-O1-Image Workflow

flowchart TB
    A1(User)
    B1(HiDream-O1-Image)

    A1 -- "Draw a picture" --> B1
    B1 -- "Here you go!" --> A1

HiDream-O1-Image understands language and images in the same dimension (UiT: Pixel-level Unified Transformer architecture) and draws pictures by itself.

Since it doesn’t retreat into its own dedicated studio, humans can sequentially check what parts it is modifying.

Furthermore, because it draws human-visible images directly, there is no need for decoding, and thus no image quality degradation caused by decoding.
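To illustrate why the work is visible: every refinement step of a pixel-level model already outputs pixels, so intermediates can be previewed with no decode at all. Here is a toy sketch; `uit_step` is an invented stand-in, not HiDream-O1’s actual update rule:

```python
# Toy sketch of pixel-level refinement: every intermediate step is already
# a human-visible image, so no VAE decode is needed to preview progress.
# `uit_step` is an invented stand-in, not the model's real update rule.

def uit_step(image, prompt, t, steps):
    # Fake refinement: nudge each pixel toward a prompt-derived target.
    target = [ord(c) % 256 for c in (prompt * len(image))[: len(image)]]
    return [px + (tg - px) // (steps - t) for px, tg in zip(image, target)]

def unified_generate(prompt, size=4, steps=8):
    image = [0] * size
    previews = []
    for t in range(steps):
        image = uit_step(image, prompt, t, steps)
        previews.append(list(image))  # directly viewable, no decode step
    return image, previews
```

In a latent-space pipeline, producing the same previews would require running the VAE decoder on every intermediate step.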

About Image Quality Degradation Caused by VAE

HiDream-O1-Image is a groundbreaking AI model that proves a general-purpose AI can generate illustrations without needing a specialized drawing AI.

General-Purpose AI Has Many Advantages

General-purpose AI offers numerous benefits.

Simplified Workflow

As you can see from the diagram above, using a general-purpose model greatly simplifies the workflow.

A simpler workflow naturally leads to faster processing. It also eliminates the need to carefully manage information passed between AIs, making adjustments much easier.

Additionally, there have been frequent cases recently of malware being injected into libraries used by AI. With fewer models to use, fewer libraries are required, which also reduces security risks.

HiDream-O1-Image uses only UiT
Lighter than Z-Image

For the four years since image generation AI emerged, the common belief has been that “latent space is necessary for AI to generate images efficiently.” HiDream-O1-Image shattered this belief with its UiT architecture, using a model only one-third the size of its predecessor. It is truly astonishing.

Processing Language and Images in the Same Dimension

By processing language and images in the same dimension, HiDream-O1-Image can perform image editing much more naturally than before.

Image Editing Examples with HiDream-O1-Image

※ The author has not yet been able to reproduce image editing.

HiDream-O1-Image image editing examples: original, anime, text board, car toy
HiDream-O1-Image image editing examples: aging, anime character next to it, two-shot with a weird person from afar, sitting in a car driver’s seat
High-precision image editing

AI acquires desired functions by ingesting massive amounts of information, but the details remain a black box.

With the arrival of the HiDream-O1-Image model, which processes language and images in the same dimension without using latent space, we can expect progress in the understanding of image generation AI models themselves.

What Will Happen Next?

HiDream-O1-Image is built on Alibaba’s Qwen3-VL model.

As of May 2026, the Qwen series has established itself as the de facto standard text encoder for image generation thanks to its high performance and open license.

Text Encoders for Image Generation AI Models

| Developer | Model | Text Encoder | Encoder Developer |
| --- | --- | --- | --- |
| **CLIP Generation (~2023)** | | | |
| Stability AI | Stable Diffusion 1.x | CLIP-L (0.1B) | OpenAI |
| Stability AI | Stable Diffusion XL | CLIP-L (0.1B), OpenCLIP-G (0.7B) | OpenAI, LAION |
| **T5 Generation (2024~)** | | | |
| Stability AI | Stable Diffusion 3 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B) | OpenAI, LAION, Google |
| Fal.ai | AuraFlow | pile-T5-XL (3B) | EleutherAI / Google |
| Black Forest Labs | Flux.1 [schnell / dev] | CLIP-L (0.1B), T5-XXL-v1.1 (11B) | OpenAI, Google |
| DeepSeek | Janus-Pro | SigLIP-L (0.4B), DeepSeek-LLM (7B) | Google, DeepSeek |
| Zhipu AI | CogVideoX | T5-XXL (11B) | Google |
| Zhipu AI | GLM-Image | GLM-4-9B (9B), Glyph Encoder | Zhipu AI |
| Genmo | Mochi | T5-XXL-v1.1 (11B) | Google |
| Rhymes AI | Allegro | T5-XXL (11B) | Google |
| Lightricks | LTX-Video | T5-XXL-v1.1 (11B) | Google |
| NVIDIA | Cosmos-Predict2 | T5-XXL (11B) | Google |
| **Proprietary LLM Generation (2025~)** | | | |
| HiDream.ai | HiDream-I1 | CLIP-L (0.1B), OpenCLIP-G (0.7B), T5-XXL-v1.1 (11B), Llama-3.1-Instruct (8B) | OpenAI, LAION, Google, Meta |
| Tencent | Hunyuan Video | LLaVA-LLaMA-3 (8B), CLIP-L (0.1B) | Xtuner / Meta, OpenAI |
| Tencent | Hunyuan Image | Proprietary MLLM | Tencent |
| StepFun | Step-Video-T2V | Hunyuan-CLIP, Step-LLM | Tencent, StepFun |
| Alibaba | Wan (2.1 / 2.2) | UMT5-XXL (13B) | Google |
| Alibaba | Qwen-Image | Qwen2.5-VL (7B) | Alibaba |
| Alibaba | Z-Image | Qwen3 (4B) | Alibaba |
| Black Forest Labs | Flux.2 [dev] | Mistral Small 3.2 / Pixtral (24B) | Mistral AI |
| Black Forest Labs | Flux.2 [klein] 9B | Qwen3 (8B) | Alibaba |
| Black Forest Labs | Flux.2 [klein] 4B | Qwen3 (4B) | Alibaba |
| CircleStone Labs | Anima | Qwen3-Base (0.6B) | Alibaba |
| Baidu | ERNIE-Image | Mistral3 / Pixtral (3.3B) | Mistral AI |
| HiDream.ai | HiDream-O1-Image | Qwen3-VL (8B) | Alibaba |

When it comes to image generation, it’s Qwen.

HiDream-O1-Image’s technology can naturally be fed back into the main Qwen project, so the image recognition and generation capabilities of the Qwen-VL series can be expected to improve dramatically.

Furthermore, it has already been revealed that a high-performance model with more than 200B parameters based on the HiDream-O1-Image model exists.

The UiT architecture released under the MIT license with HiDream-O1-Image is likely to become the new standard for image generation. We can foresee a future in which current image generation AIs will be restructured under the UiT architecture.

And if the UiT architecture is introduced into ultra-large-scale cloud AIs such as ChatGPT or Gemini, what becomes possible is beyond the author’s imagination.

Anime illustration of a silver-haired girl smiling in front of a colorful LED PC
Will GPT-Image or Nano Banana become 10 times more powerful?

How to Use HiDream-O1-Image!

Let me show you how to use the HiDream-O1-Image model in ComfyUI.

As of May 15, 2026, ComfyUI does not yet have official support for HiDream-O1-Image, so we use a community custom node.

HiDream_O1-ComfyUI Custom Node

ComfyUI-Manager v4 HiDream_O1 custom node search screen
Search example in ComfyUI-Manager v4. (built into ComfyUI)
It cannot be found when searching in ComfyUI-Manager v3 (custom nodes), so install it manually.

Two models are available for HiDream-O1-Image:

  • HiDream-O1-Image (Full)
  • HiDream-O1-Image-Dev

Since HiDream-O1-Image does not use VAE, colors will not degrade even if you use the base model as-is.

ComfyUI is optimized for latent space rather than UiT architecture, so the standard Sampler and Scheduler cannot be used at this time. The Sampler and Scheduler are fixed for each model as follows:

  • HiDream-O1-Image: FlowUniPCMultistepScheduler
  • HiDream-O1-Image-Dev: FlashFlowMatchEulerDiscreteScheduler (28 steps fixed)

HiDream O1 Sampler Node Settings

HiDream O1 Sampler node settings screen (guidance_scale, shift, noise_scale, etc.)
  • guidance_scale: Equivalent to CFG
  • shift: Whether to concentrate steps in the first half (<1) or the second half (>1)
    • Default is -1, Dev: 1, Full: 3
  • noise_scale_start: 7.5 (initial noise strength)
  • noise_scale_end: 7.5 (final noise strength)
  • noise_clip_std: 2.5 (noise change threshold)

The items below “shift” allow manual scheduler adjustment, but manually adjusting the scheduler is quite difficult in practice.
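For intuition about the shift parameter, here is the time-shift formula used by flow-matching samplers such as Stable Diffusion 3’s; whether HiDream-O1’s fixed schedulers use exactly this form is an assumption:

```python
# Illustration of how a "shift" value bends a noise schedule. This is the
# time-shift formula from flow-matching samplers such as SD3's; whether
# HiDream-O1's fixed schedulers use exactly this form is an assumption.

def shifted_sigmas(steps: int, shift: float) -> list[float]:
    sigmas = [1 - i / steps for i in range(steps + 1)]  # linear 1 -> 0
    # shift = 1 leaves the schedule unchanged; other values bend the curve,
    # redistributing the step budget toward one end of the schedule.
    return [shift * s / (1 + (shift - 1) * s) for s in sigmas]
```

For example, with shift = 3 the midpoint sigma 0.5 is remapped to 0.75, while shift = 1 reproduces the linear schedule unchanged.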

In the future, easy-to-use UiT architecture presets will likely appear, but for now, it is best to use the initial settings that were used during training.

Summary: The Future Where General-Purpose AI Freely Draws Images

  • HiDream-O1-Image uses a general-purpose UiT architecture
  • The model’s “work in progress” is visible to humans
  • Potential for dramatic performance improvements in Qwen-VL models

This time, I introduced the HiDream-O1 model.

HiDream is one of my favorite model series, and I am delighted that they have released a new model that directly challenges the fundamental propositions of image generation AI and breaks common sense.

Anime illustration of a silver-haired girl smiling with a colorful LED PC in the background
Qwen and HiDream are open source

Two years have passed since Alibaba began releasing the Qwen models under open licenses, and one year since the HiDream models. Based on their track record so far, the author trusts both companies’ commitment to open source.

The field of image generation AI is still full of potential for major transformation, and I am excited to see what kind of future awaits us.

Thank you for reading until the end!