Qwen-Image: High-Parameter Model with Chinese Language Understanding and Recommended Settings

  • Qwen-Image excels at understanding and rendering Chinese
  • English prompt understanding is somewhat inferior
  • Requires 16GB VRAM

Introduction

Hello, I'm Easygoing.

Today, I'd like to introduce Qwen, a chat AI released by China's Alibaba Group, and its image generation model Qwen-Image.

Anime illustration generated with Qwen-Image. A woman in blue clothes stands against the backdrop of a theme park castle illuminated at night.
SDXL -> Qwen-Image

Qwen is a Chat AI That Understands Chinese

Qwen is a large language model (chat AI) released by China's Alibaba Group.

Major Large Language Models

| Model | Launch Date | Developer | Country |
|---|---|---|---|
| ChatGPT | November 2022 | OpenAI | USA |
| Claude | December 2022 | Anthropic | USA |
| Llama | February 2023 | Meta | USA |
| Qwen | August 2023 | Alibaba | China |
| Mistral | September 2023 | Mistral AI | France |
| DeepSeek | October 2023 | DeepSeek AI | China |
| Grok | November 2023 | xAI | USA |
| Gemini | December 2023 | Google | USA |
| Phi | December 2023 | Microsoft | USA |

Qwen has been trained in both English and Chinese, giving it native-level Chinese comprehension abilities.

Since Chinese is the second most spoken language in the world, the Qwen model, which understands both English and Chinese, has access to an enormous user base.

Qwen-Image is the Highest-Parameter AI Image Generation Model

Qwen-Image is an AI image generation model newly released by Alibaba on August 4, 2025.

```mermaid
gantt
    title Stable Diffusion 3 and Derivative Models
    dateFormat YYYY-MM-DD
    tickInterval 6month
    Stable Diffusion 3 : done, d1, 2024-06-12, 2025-08-16
    AuraFlow : done, d2, 2024-07-12, 2025-08-16
    Flux.1 : d3, 2024-08-01, 2025-08-16
    HiDream : d5, 2025-04-06, 2025-08-16
    Qwen-Image : crit, d4, 2025-08-04, 2025-08-16
```

| Model | Country | Parameters | Open Source | Commercial Use |
|---|---|---|---|---|
| Stable Diffusion 3 | UK | 8 billion | ✅ | ✅ |
| Flux.1 | Germany | 12 billion | ⚠️ | ⚠️ |
| HiDream | China | 17 billion | ✅ | ✅ |
| Qwen-Image | China | 20 billion | ✅ | ✅ |

Qwen-Image uses Qwen2.5-7B-VL, a Qwen variant with image recognition capabilities, as its text encoder, and a new MMDiT-based model evolved from Stable Diffusion 3 as its transformer.

  • Text Encoder: The part that analyzes prompts
  • Transformer: The part that actually generates images

AI Image Generation Model Sizes

Qwen-Image's overall size is comparable to HiDream's, but its transformer component is larger, which makes Qwen-Image's parameter count the largest of any local model.

16GB VRAM Required!

Because Qwen-Image has a large transformer capacity, it requires more VRAM than conventional models to run.

Minimum Split Capacity for Models in ComfyUI

When running in ComfyUI, the minimum split size of the Qwen-Image BF16 model is 14.2 GB, so 16 GB of VRAM is required.
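As a rough sanity check on these requirements, the full BF16 weights of each model can be estimated at 2 bytes per parameter. This is only a back-of-envelope sketch: ComfyUI's split loading keeps the actual VRAM footprint well below the full weight size by offloading part of the model to system RAM.

```python
# Back-of-envelope estimate of BF16 weight sizes (2 bytes per parameter).
# Illustrative only; actual VRAM usage in ComfyUI is lower because the
# model can be split between VRAM and system RAM.

def bf16_weight_size_gb(params_billions: float) -> float:
    """Approximate size of BF16 weights in decimal GB."""
    bytes_total = params_billions * 1e9 * 2  # 2 bytes per BF16 parameter
    return bytes_total / 1e9

models = {
    "Stable Diffusion 3": 8,
    "Flux.1": 12,
    "HiDream": 17,
    "Qwen-Image": 20,
}

for name, params in models.items():
    print(f"{name}: ~{bf16_weight_size_gb(params):.0f} GB in BF16")
```

Even at 2 bytes per parameter, Qwen-Image's 20-billion-parameter transformer is roughly 40 GB of weights, which is why split loading and 16 GB of VRAM are the practical minimum rather than loading the whole model at once.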

Single Text Encoder Only

While recent AI image generation models commonly incorporate multiple text encoders to improve prompt understanding, Qwen-Image only has a single Qwen2.5-7B-VL encoder, making it more compact compared to other AI image generation models.

Text Encoders Used in AI Image Generation

| Text Encoder | Release Date | Language | Size | Parameters |
|---|---|---|---|---|
| CLIP-L | January 2021 | English | 0.9 GB | 430 million |
| CLIP-G | January 2023 | English | 5.2 GB | 1.37 billion |
| T5-XXL (v1.1) | March 2022 | Multilingual | 22.8 GB | 11 billion |
| Llama-3.1-8B-Instruct | July 2024 | Multilingual | 16.1 GB | 8 billion |
| Qwen2.5-7B-VL | January 2025 | Multilingual (especially Chinese) | 19.3 GB | 8.29 billion |

Text Encoders in Flux.1, HiDream, and Qwen-Image

| Text Encoder | Flux.1 | HiDream | Qwen-Image |
|---|---|---|---|
| CLIP-L | ✅ | ✅ | |
| CLIP-G | | ✅ | |
| T5-XXL | ✅ | ✅ | |
| Llama-3.1-8B-Instruct | | ✅ | |
| Qwen2.5-7B-VL | | | ✅ |

While Qwen-Image has high Chinese comprehension, its prompt understanding in English and other languages is somewhat inferior to other models.

Qwen-Image Excels at Chinese Text Rendering!

Qwen-Image's greatest strength is its Chinese text rendering capability.

English uses 26 alphabetic characters that are reused frequently, but Chinese characters are far more numerous with each appearing less frequently, making them difficult for AI image generation models to learn.

HiDream

Image of Chinese cuisine and menu generated with HiDream. The Chinese characters on the menu are inaccurately rendered.
Cannot reproduce Chinese characters

Qwen-Image

Image of Chinese cuisine and menu generated with Qwen-Image. The Chinese characters on the menu are accurately rendered.
Accurate Chinese characters

In addition to native Chinese comprehension, Qwen-Image has undergone specialized additional training on Chinese characters, enabling it to express Chinese text quite accurately.

Japanese Performance is Lackluster

Image generated with Qwen-Image showing a metal desk with a nameplate reading "Tokyo Patent Permission Office Director" and a tablet, but the hiragana and katakana characters are inaccurately rendered.
Tokyo Patent Permission Office Director

While Qwen-Image appears to have learned hiragana and katakana similarly to Chinese characters, its Japanese comprehension is lacking.

Although its Japanese accuracy is higher compared to other AI image generation models, using generated images with Japanese text at a practical level remains challenging.

Comparing Anime Illustrations with Other Models!

Let's compare Qwen-Image's anime illustrations with other models.

The illustrations are generated using image-to-image from SDXL and compared with other AI image generation models.

SDXL (1024 x 1024)

Anime illustration generated with SDXL (1024x1024). A woman in blue kimono standing inside a Japanese house.

This is the original illustration generated with SDXL.

Qwen-Image (1328 x 1328)

This is the illustration redrawn with Qwen-Image from the previous image.

Anime illustration redrawn with Qwen-Image (1328x1328). A woman in blue kimono standing inside a Japanese house.
denoise: 0.7

Since Qwen-Image supports high resolutions up to 1328 x 1328, it produces cleaner illustrations with more refined details than SDXL.

True to its high-parameter nature, Qwen-Image renders hands accurately.

Flux.1 [dev] (1440 x 1440)

Next, let's try redrawing the same SDXL illustration using the Flux.1 [dev] model.

Anime illustration redrawn with Flux.1 [dev] (1440x1440). A woman in blue kimono standing inside a Japanese house.
denoise: 0.5

Flux.1 [dev] significantly stylizes the character's face.

At 1440 x 1440, the highest resolution of the models compared, Flux.1 [dev] produces clear illustrations with a sense of transparency.

HiDream-I1-Dev (1024 x 1024)

Anime illustration redrawn with HiDream-I1-Dev (1024x1024). A woman in blue kimono standing inside a Japanese house.
denoise: 0.5

Using HiDream results in a significantly different expression from the original SDXL illustration.

HiDream employs a MoE (Mixture of Experts) architecture combining multiple models, and for anime illustrations, it likely operates with a dedicated anime model.

This illustration was generated from a long natural-language prompt containing keywords such as "Japanese house" and "blur", and HiDream reproduces the prompt most accurately.

Can CFG Scale be Disabled?

Flux.1 and HiDream offer distilled models that disable CFG scale for faster generation, but Qwen-Image has no such official model, so its generation time is roughly twice that of those distilled models.

About CFG Scale and Disabling It

Let's try generating illustrations with Qwen-Image by setting CFG scale to 1 to disable it.

CFG scale: 1

Anime illustration generated with Qwen-Image with CFG scale set to 1. The illustration collapses.
Illustration breaks down

CFG scale: 2.5

Anime illustration generated with Qwen-Image with CFG scale set to 2.5. Clean finish.
Normal generation

Unfortunately, setting CFG scale to 1 causes the illustration to collapse.

Qwen-Image appears unsuitable for high-speed operation with disabled CFG.

CFG scale stabilizes around 2.5, so this is a good baseline for regular use.
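Why does setting CFG scale to 1 "disable" guidance? Classifier-free guidance blends an unconditional and a conditional prediction at every sampling step. The sketch below (plain Python with illustrative values, not any model's actual outputs) shows that at scale 1 the result is simply the conditional prediction — which is why a model needs to be distilled to skip the unconditional pass and halve the work per step.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional (prompt-following) one."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

# Illustrative per-element predictions from the two passes.
uncond = [0.10, 0.20]
cond = [0.30, 0.60]

# At cfg_scale = 1 the unconditional term cancels out entirely, but a
# non-distilled model still runs both passes, so nothing is saved.
print(apply_cfg(uncond, cond, 1.0))  # ≈ the conditional prediction
print(apply_cfg(uncond, cond, 2.5))  # stronger pull toward the prompt
```

With CFG scale above 1, every step requires two forward passes (conditional and unconditional), which is the source of the doubled generation time mentioned above.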

Steps Should be Around 8-12!

Let's now test the step count required for illustration generation.

Using the high-precision heunpp2 sampler, we aim to generate illustrations with the fewest possible steps.

4 steps

Anime illustration generated with Qwen-Image in 4 steps. Incomplete with rough details.
Incomplete

8 steps

Anime illustration generated with Qwen-Image in 8 steps. A woman in blue clothes standing against a theme park castle backdrop, with a complete finish.
Nearly complete

12 steps

Anime illustration generated with Qwen-Image in 12 steps. Almost identical to 8 steps, with a complete finish.
No difference from 8 steps

New-generation MMDiT models converge in fewer steps than older architectures.

Qwen-Image also converges around 8 steps, with little change beyond that.

For image-to-image, 8 steps is sufficient.

For text-to-image, around 12 steps provides adequate computation.

While ComfyUI's default workflow setting is 20 steps, reducing this to a necessary and sufficient step count can shorten rendering time.

Increasing Noise with ModelSamplingAuraFlow

The ModelSamplingAuraFlow node used in Qwen-Image's default workflow adjusts the noise amount during illustration generation.

ModelSamplingAuraFlow Node

Screenshot of ComfyUI's ModelSamplingAuraFlow node showing the noise adjustment settings screen.

Noise Changes by ModelSamplingAuraFlow Shift Values

| Step | bypass | shift: 3.1 | shift: 6 |
|---|---|---|---|
| 1 | 0.9260 | 0.9249 | 0.9598 |
| 2 | 0.8732 | 0.8713 | 0.9291 |
| 3 | 0.8027 | 0.7998 | 0.8855 |
| 4 | 0.7103 | 0.7073 | 0.8238 |
| 5 | 0.5917 | 0.5877 | 0.7340 |
| 6 | 0.4438 | 0.4397 | 0.6030 |
| 7 | 0.2714 | 0.2688 | 0.4157 |
| 8 | 0.0998 | 0.1011 | 0.1787 |
| 9 | 0.0000 | 0.0000 | 0.0000 |

The default shift: 3.1 setting in the ModelSamplingAuraFlow node is nearly identical to bypassing the node.

Setting higher shift values increases noise, resulting in rich illustrations with strong Qwen-Image influence.

shift: 3.1

Anime illustration generated with Qwen-Image using ModelSamplingAuraFlow shift value 3.1. Expression close to the original illustration.
Expression following the original illustration

shift: 6

Anime illustration generated with Qwen-Image using ModelSamplingAuraFlow shift value 6. Significantly changed from the original illustration with strong Qwen-Image influence.
Strong Qwen-Image influence

The ModelSamplingAuraFlow shift value should be set as high as possible within the range that maintains illustration integrity.
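For reference, samplers in the SD3/AuraFlow family of flow-matching models commonly remap the noise schedule as σ' = shift·σ / (1 + (shift − 1)·σ). The sketch below assumes that mapping and applies it to a raw linear schedule (not to the table's bypass column, which already includes the model's built-in shift) to show how a larger shift value keeps each step at a higher noise level.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Flow-matching time shift as commonly used by SD3/AuraFlow-style
    samplers: remaps sigma in [0, 1] so that more of the schedule is
    spent at high noise levels when shift > 1."""
    return shift * sigma / (1 + (shift - 1) * sigma)

raw = [i / 8 for i in range(8, -1, -1)]  # 9-point linear schedule, 1 -> 0
for s in (1.0, 3.1, 6.0):
    row = ", ".join(f"{shift_sigma(t, s):.3f}" for t in raw)
    print(f"shift {s}: {row}")
```

The endpoints (σ = 1 and σ = 0) are unchanged for any shift value; only the middle of the schedule moves toward higher noise, which matches the table's pattern of shift: 6 producing larger sigmas at every intermediate step.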

High-Speed Generation with Distilled Models and LoRA

While Qwen-Image is computationally heavy, there are community-developed CFG-disabled distilled models and acceleration LoRAs.

Anime illustration generated with Qwen-Image.
Qwen-Image

Qwen-Image-distill (CFG-Disabled High-Speed Distilled Model)

Qwen-Image-lightning (Acceleration LoRA)

Both can accelerate rendering in exchange for some image quality degradation.

Workflow

Here's the workflow used in this verification and recommended Qwen-Image settings.

This workflow is configured for high precision at 8-12 steps, so none of the acceleration techniques are used.

Workflow

Screenshot of the ComfyUI workflow used for Qwen-Image image generation, showing node connections and settings.

Model Downloads (Hugging Face Comfy-Org Repository)

Changes from ComfyUI Official Workflow

  • Uses BF16 format models
  • Loads text encoder to RAM
  • Uses ConditioningZeroOut instead of Negative Prompt
  • Added color correction

About Color Correction

  • sampler: heunpp2
  • scheduler: beta
  • steps: 12 (8 steps for image-to-image)
  • ModelSamplingAuraFlow shift value: 6

Conclusion: Qwen-Image Excels with Chinese

  • Qwen-Image excels at understanding and rendering Chinese
  • English prompt understanding is somewhat inferior
  • Requires 16GB VRAM

Qwen is an AI model released by China's Alibaba Group.

Qwen-Image is exceptionally superior in Chinese comprehension compared to previous AI image generation models.

However, it has usability drawbacks compared to other models, including the requirement for 16GB VRAM and the lack of an official distilled model with disabled CFG.

Anime illustration generated with Qwen-Image. A woman in blue kimono gazing at the viewer with a simple background setting.

Nevertheless, while the image generation models built into other chat AIs, such as OpenAI's GPT-Image-1, Google's Imagen, and xAI's Aurora, remain proprietary and unreleased, Alibaba's release of Qwen-Image under a free license holds significant meaning for AI open-sourcing.

I look forward to continuing to explore new expressions as new AI image generation models are released.

Thank you for reading to the end!
