Qwen-Image: High-Parameter Model with Chinese Language Understanding and Recommended Settings


- Qwen-Image excels at understanding and rendering Chinese
- English prompt understanding is somewhat inferior
- Requires 16GB VRAM
Introduction
Hello, I'm Easygoing.
Today, I'd like to introduce Qwen, a chat AI released by China's Alibaba Group, and its image generation model Qwen-Image.

Qwen is a Chat AI That Understands Chinese
Qwen is a large language model (chat AI) released by China's Alibaba Group.
Major Large Language Models
Model Name | Launch Date | Developer | Country |
---|---|---|---|
ChatGPT | November 2022 | OpenAI | USA |
Claude | December 2022 | Anthropic | USA |
Llama | February 2023 | Meta | USA |
Qwen | August 2023 | Alibaba | China |
Mistral | September 2023 | Mistral AI | France |
DeepSeek | October 2023 | DeepSeek AI | China |
Grok | November 2023 | xAI | USA |
Gemini | December 2023 | Google | USA |
Phi | December 2023 | Microsoft | USA |
Qwen has been trained in both English and Chinese, giving it native-level Chinese comprehension abilities.
Since Chinese is the second most spoken language in the world, the Qwen model, which understands both English and Chinese, has access to an enormous user base.
Qwen-Image is the Highest-Parameter AI Image Generation Model
Qwen-Image is an AI image generation model newly released by Alibaba on August 4, 2025.
```mermaid
gantt
    title Stable Diffusion 3 and Derivative Models
    dateFormat YYYY-MM-DD
    tickInterval 6month
    Stable Diffusion 3 : done, d1, 2024-06-12, 2025-08-16
    AuraFlow : done, d2, 2024-07-12, 2025-08-16
    Flux.1 : d3, 2024-08-01, 2025-08-16
    HiDream : d5, 2025-04-06, 2025-08-16
    Qwen-Image : crit, d4, 2025-08-04, 2025-08-16
```
Model | Country | Parameters | Open Source | Commercial Use |
---|---|---|---|---|
Stable Diffusion 3 | UK | 8 billion | ✅ | ✅ |
Flux.1 | Germany | 12 billion | ⚠️ | ⚠️ |
HiDream | China | 17 billion | ✅ | ✅ |
Qwen-Image | China | 20 billion | ✅ | ✅ |
Qwen-Image uses Qwen2.5-7B-VL, a Qwen variant with image recognition capabilities, as its text encoder, and a new MMDiT-based model evolved from Stable Diffusion 3 as its transformer.
- Text Encoder: The part that analyzes prompts
- Transformer: The part that actually generates images
AI Image Generation Model Sizes
Qwen-Image's overall file size is comparable to HiDream's, but its transformer component is larger, giving Qwen-Image the highest parameter count of any local model.
16GB VRAM Required!
Because Qwen-Image has a large transformer capacity, it requires more VRAM than conventional models to run.
Minimum Split Capacity for Models in ComfyUI
The minimum split capacity for the Qwen-Image-BF16 model when running in ComfyUI is 14.2 GB, requiring 16 GB VRAM to operate.
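To see where these numbers come from, here is some rough weight-footprint arithmetic. This is illustrative only: actual VRAM use in ComfyUI is lower than the raw weight size because the model can be split between RAM and VRAM, which is why the minimum split figure above (14.2 GB) is well below the full model size.

```python
# Rough weight-footprint arithmetic (illustrative only; real file sizes
# also include non-weight data, and ComfyUI can split models across
# RAM and VRAM, so actual VRAM use is lower than the raw weight size).

def weight_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Raw size of the weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# BF16 stores each parameter in 2 bytes.
print(weight_size_gb(20, 2))    # 20B-parameter Qwen-Image in BF16 -> 40.0 GB
print(weight_size_gb(8.29, 2))  # 8.29B-parameter Qwen2.5-7B-VL -> 16.58 GB
```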
Single Text Encoder Only
While recent AI image generation models commonly incorporate multiple text encoders to improve prompt understanding, Qwen-Image only has a single Qwen2.5-7B-VL encoder, making it more compact compared to other AI image generation models.
Text Encoders Used in AI Image Generation
Encoder | Release Date | Language | Size | Parameters |
---|---|---|---|---|
CLIP-L | January 2021 | English | 0.9 GB | 430 million |
CLIP-G | January 2023 | English | 5.2 GB | 1.37 billion |
T5-XXL (v1.1) | March 2022 | Multilingual | 22.8 GB | 11 billion |
Llama-3.1-8B-Instruct | July 2024 | Multilingual | 16.1 GB | 8 billion |
Qwen2.5-7B-VL | January 2025 | Multilingual (especially Chinese) | 19.3 GB | 8.29 billion |
Text Encoders in Flux.1, HiDream, and Qwen-Image
Encoder | Flux.1 | HiDream | Qwen-Image |
---|---|---|---|
CLIP-L | ✅ | ✅ | ❌ |
CLIP-G | ❌ | ✅ | ❌ |
T5-XXL | ✅ | ✅ | ❌ |
Llama-3.1-8B-Instruct | ❌ | ✅ | ❌ |
Qwen2.5-7B-VL | ❌ | ❌ | ✅ |
While Qwen-Image has high Chinese comprehension, its prompt understanding in English and other languages is somewhat inferior to other models.
Qwen-Image Excels at Chinese Text Rendering!
Qwen-Image's greatest strength is its Chinese text rendering capability.
English reuses just 26 alphabetic characters, so each appears constantly in training data, whereas Chinese has thousands of characters that each appear far less often, making them difficult for AI image generation models to learn.
HiDream

Qwen-Image

In addition to native Chinese comprehension, Qwen-Image has undergone specialized additional training on Chinese characters, enabling it to express Chinese text quite accurately.
Japanese Performance is Lackluster

While Qwen-Image appears to have learned hiragana and katakana similarly to Chinese characters, its Japanese comprehension is lacking.
Although its Japanese accuracy is higher compared to other AI image generation models, using generated images with Japanese text at a practical level remains challenging.
Comparing Anime Illustrations with Other Models!
Let's compare Qwen-Image's anime illustrations with other models.
The illustrations are generated using image-to-image from SDXL and compared with other AI image generation models.
SDXL (1024 x 1024)

This is the original illustration generated with SDXL.
Qwen-Image (1328 x 1328)
This is the illustration redrawn with Qwen-Image from the previous image.

Since Qwen-Image supports high resolutions up to 1328 x 1328, it produces cleaner illustrations with more refined details than SDXL.
True to its high-parameter nature, Qwen-Image renders hands accurately.
Flux.1 [dev] (1440 x 1440)
Next, let's try redrawing the same SDXL illustration using the Flux.1 [dev] model.
![Anime illustration redrawn with Flux.1 [dev] (1024x1024). A woman in blue kimono standing inside a Japanese house.](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju0f-OyigY_mpPKwGFnKZcfHR23Fo0JiSOlOeMmCCEWR8vARHDSLh7MX6O0Zb3uM6ERv6moEfLQN71rZ0Kv1E6lDXUGPJFwpf5aJfhqrrrcGH4RQm_GAgXKKL_d3CH6WrJWiUNUsgHthAHYo2k5cN1uTo75kvbALOsh4wYvcl3eoHS2w/w800-e90-rw/Flux.1_dev_00001_.png)
Flux.1 [dev] significantly stylizes the character's face.
Supporting the highest resolution here at 1440 x 1440, Flux.1 [dev] produces clear illustrations with a transparent, airy quality.
HiDream-I1-Dev (1024 x 1024)

Using HiDream results in a significantly different expression from the original SDXL illustration.
HiDream employs a MoE (Mixture of Experts) architecture combining multiple models, and for anime illustrations, it likely operates with a dedicated anime model.
This illustration uses a long natural-language prompt containing keywords like "Japanese house" and "blur", and HiDream reproduces the prompt most accurately.
Can CFG Scale be Disabled?
Flux.1 and HiDream offer distilled models that disable CFG scale for faster generation, but Qwen-Image has no such official model, so generation takes roughly twice as long as with those models.
About CFG Scale and Disabling It
Let's try generating illustrations with Qwen-Image by setting CFG scale to 1 to disable it.
CFG scale: 1

CFG scale: 2.5

Unfortunately, setting CFG scale to 1 causes the illustration to collapse.
Qwen-Image appears unsuitable for high-speed operation with disabled CFG.
CFG scale stabilizes around 2.5, so this is a good baseline for regular use.
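The CFG mechanism explains both results. Below is a minimal sketch of classifier-free guidance at a single denoising step, using toy numbers rather than real model outputs: with CFG enabled the transformer runs twice per step, and a scale of 1 collapses to the conditional prediction alone, which is what distilled models exploit.

```python
from math import isclose

# Sketch of classifier-free guidance (CFG) at a single denoising step.
# With CFG enabled, the model runs twice per step: once conditioned on
# the prompt and once unconditioned, then extrapolates between the two.
def cfg_combine(uncond, cond, scale):
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.1, 0.2]  # toy noise predictions, not real model outputs
cond = [0.3, 0.6]

# scale = 1 collapses to the conditional prediction alone, which is why a
# model distilled for CFG 1 can skip the uncond pass and run ~2x faster.
assert all(isclose(a, b) for a, b in zip(cfg_combine(uncond, cond, 1.0), cond))

# Qwen-Image's base model expects scale around 2.5, so both passes are needed.
print(cfg_combine(uncond, cond, 2.5))
```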
Steps Should be Around 8-12!
Let's now test the step count required for illustration generation.
Using the high-precision heunpp2 sampler, we aim to generate illustrations with the fewest possible steps.
4 steps

8 steps

12 steps

New-generation MMDiT models converge in fewer steps.
Qwen-Image also converges around 8 steps, with little change beyond that.
For image-to-image, 8 steps is sufficient.
For text-to-image, around 12 steps provides adequate computation.
While ComfyUI's default workflow setting is 20 steps, reducing this to a necessary and sufficient step count can shorten rendering time.
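Combining these step counts with the CFG doubling discussed earlier gives a rough sense of the compute budget. The sketch below counts only transformer forward passes (ignoring the text encoder and VAE, which run once per image):

```python
# Rough forward-pass count per generated image: with CFG enabled, every
# sampling step runs the transformer twice (conditional + unconditional).
def forward_passes(steps, cfg_enabled=True):
    return steps * (2 if cfg_enabled else 1)

print(forward_passes(12))                    # text-to-image at CFG 2.5 -> 24
print(forward_passes(8))                     # image-to-image -> 16
print(forward_passes(8, cfg_enabled=False))  # CFG-distilled model -> 8
print(forward_passes(20))                    # ComfyUI's 20-step default -> 40
```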
Increasing Noise with ModelSamplingAuraFlow
The ModelSamplingAuraFlow node used in Qwen-Image's default workflow adjusts the noise amount during illustration generation.
ModelSamplingAuraFlow Node

Noise Changes by ModelSamplingAuraFlow Shift Values
Step | bypass | shift: 3.1 | shift: 6 |
---|---|---|---|
1 | 0.9260 | 0.9249 | 0.9598 |
2 | 0.8732 | 0.8713 | 0.9291 |
3 | 0.8027 | 0.7998 | 0.8855 |
4 | 0.7103 | 0.7073 | 0.8238 |
5 | 0.5917 | 0.5877 | 0.7340 |
6 | 0.4438 | 0.4397 | 0.6030 |
7 | 0.2714 | 0.2688 | 0.4157 |
8 | 0.0998 | 0.1011 | 0.1787 |
9 | 0.0000 | 0.0000 | 0.0000 |
The default shift: 3.1 setting in the ModelSamplingAuraFlow node is nearly identical to bypassing the node.
Setting higher shift values increases noise, resulting in rich illustrations with strong Qwen-Image influence.
shift: 3.1

shift: 6

The ModelSamplingAuraFlow shift value should be set as high as possible within the range that maintains illustration integrity.
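For intuition, here is one common formulation of the flow-matching time shift used by SD3-style models; I have not verified that ComfyUI's ModelSamplingAuraFlow implements exactly this formula, so treat it as a sketch of the general mechanism rather than the node's internals.

```python
# A common flow-matching time-shift formula (ComfyUI's ModelSamplingAuraFlow
# internals may differ in detail):
#   sigma' = shift * sigma / (1 + (shift - 1) * sigma)
def shifted_sigma(sigma, shift):
    return shift * sigma / (1 + (shift - 1) * sigma)

# shift > 1 pushes every intermediate sigma upward, leaving more noise at
# each step, the same trend the table above shows for shift: 6.
for sigma in (0.9, 0.6, 0.3):
    print(sigma, "->", round(shifted_sigma(sigma, 6), 4))
```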
High-Speed Generation with Distilled Models and LoRA
While Qwen-Image is computationally heavy, there are community-developed CFG-disabled distilled models and acceleration LoRAs.

Qwen-Image-distill (CFG-Disabled High-Speed Distilled Model)
Qwen-Image-lightning (Acceleration LoRA)
Both can accelerate rendering in exchange for some image quality degradation.
Workflow
Here's the workflow used in this verification and recommended Qwen-Image settings.
This workflow is configured for high precision at 8-12 steps, so none of the acceleration techniques are used.
Workflow

Model Downloads (Hugging Face Comfy-Org Repository)
Changes from ComfyUI Official Workflow
- Uses BF16 format models
- Loads text encoder to RAM
- Uses ConditioningZeroOut instead of Negative Prompt
- Added color correction
About Color Correction
Recommended Settings
- sampler: heunpp2
- scheduler: beta
- steps: 12 (8 steps for image-to-image)
- ModelSamplingAuraFlow shift value: 6
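For quick reference, the recommended settings above collected as a plain dictionary (the key names are my own informal shorthand, not ComfyUI node identifiers):

```python
# Recommended Qwen-Image settings from this article, as a plain dictionary.
# Key names are informal shorthand, not ComfyUI node identifiers.
recommended = {
    "sampler": "heunpp2",
    "scheduler": "beta",
    "steps_text_to_image": 12,
    "steps_image_to_image": 8,
    "model_sampling_aura_flow_shift": 6,
    "cfg_scale": 2.5,  # stable baseline found in the CFG test above
}
for key, value in recommended.items():
    print(f"{key}: {value}")
```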
Conclusion: Qwen-Image Excels with Chinese
- Qwen-Image excels at understanding and rendering Chinese
- English prompt understanding is somewhat inferior
- Requires 16GB VRAM
Qwen is an AI model released by China's Alibaba Group.
Qwen-Image is exceptionally superior in Chinese comprehension compared to previous AI image generation models.
However, it has usability drawbacks compared to other models, including the requirement for 16GB VRAM and the lack of an official distilled model with disabled CFG.

Nevertheless, while the image generation models integrated with proprietary chat AIs, such as OpenAI's GPT-Image-1, Google's Imagen, and xAI's Aurora, remain closed and unreleased, Alibaba's release of Qwen-Image under a free license is a significant step for open-source AI.
I look forward to continuing to explore new expressions as new AI image generation models are released.
Thank you for reading to the end!