Qwen-Image: High-Parameter Model with Chinese Language Understanding and Recommended Settings

  • Qwen-Image excels at understanding and rendering Chinese
  • English prompt understanding is somewhat inferior
  • Requires 16GB VRAM

Introduction

Hello, I'm Easygoing.

Today, I'd like to introduce Qwen, a chat AI released by China's Alibaba Group, and its image generation model Qwen-Image.

Anime illustration generated with Qwen-Image. A woman in blue clothes stands against the backdrop of a theme park castle illuminated at night.
SDXL -> Qwen-Image

Qwen is a Chat AI That Understands Chinese

Qwen is a large language model (chat AI) released by China's Alibaba Group.

Major Large Language Models

| Model | Launch Date | Developer | Country |
|---|---|---|---|
| ChatGPT | November 2022 | OpenAI | USA |
| Claude | December 2022 | Anthropic | USA |
| Llama | February 2023 | Meta | USA |
| Qwen | August 2023 | Alibaba | China |
| Mistral | September 2023 | Mistral AI | France |
| DeepSeek | October 2023 | DeepSeek AI | China |
| Grok | November 2023 | xAI | USA |
| Gemini | December 2023 | Google | USA |
| Phi | December 2023 | Microsoft | USA |

Qwen has been trained in both English and Chinese, giving it native-level Chinese comprehension abilities.

Since Chinese is the second most spoken language in the world, the Qwen model, which understands both English and Chinese, has access to an enormous user base.

Qwen-Image is the Highest-Parameter AI Image Generation Model

Qwen-Image is an AI image generation model newly released by Alibaba on August 4, 2025.

```mermaid
gantt
    title Stable Diffusion 3 and Derivative Models
    dateFormat YYYY-MM-DD
    tickInterval 6month
    Stable Diffusion 3 : done, d1, 2024-06-12, 2025-08-16
    AuraFlow : done, d2, 2024-07-12, 2025-08-16
    Flux.1 : d3, 2024-08-01, 2025-08-16
    HiDream : d5, 2025-04-06, 2025-08-16
    Qwen-Image : crit, d4, 2025-08-04, 2025-08-16
```

| Model | Country | Parameters | Open Source | Commercial Use |
|---|---|---|---|---|
| Stable Diffusion 3 | UK | 8 billion | ✅ | ✅ |
| Flux.1 | Germany | 12 billion | ⚠️ | ⚠️ |
| HiDream | China | 17 billion | ✅ | ✅ |
| Qwen-Image | China | 20 billion | ✅ | ✅ |

Qwen-Image uses Qwen2.5-7B-VL, a Qwen variant with image recognition capabilities, as its text encoder, and a new MMDiT-based model evolved from Stable Diffusion 3 as its transformer.

  • Text Encoder: The part that analyzes prompts
  • Transformer: The part that actually generates images

AI Image Generation Model Sizes

Qwen-Image's overall size is comparable to HiDream's, but its transformer component is larger, which makes Qwen-Image's parameter count the largest of any local model.

16GB VRAM Required!

Because Qwen-Image has a large transformer capacity, it requires more VRAM than conventional models to run.

Minimum Split Capacity for Models in ComfyUI

When running in ComfyUI, the minimum split size of the Qwen-Image BF16 model is 14.2 GB, so 16 GB of VRAM is required.
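As a rough sanity check on these requirements, the full BF16 weights of each model can be estimated at 2 bytes per parameter. This is only a back-of-envelope sketch: ComfyUI's split loading keeps the actual VRAM footprint well below the full weight size by offloading part of the model to system RAM.

```python
# Back-of-envelope estimate of BF16 weight sizes (2 bytes per parameter).
# Illustrative only; actual VRAM usage in ComfyUI is lower because the
# model can be split between VRAM and system RAM.

def bf16_weight_size_gb(params_billions: float) -> float:
    """Approximate size of BF16 weights in decimal GB."""
    bytes_total = params_billions * 1e9 * 2  # 2 bytes per BF16 parameter
    return bytes_total / 1e9

models = {
    "Stable Diffusion 3": 8,
    "Flux.1": 12,
    "HiDream": 17,
    "Qwen-Image": 20,
}

for name, params in models.items():
    print(f"{name}: ~{bf16_weight_size_gb(params):.0f} GB in BF16")
```

Even at 2 bytes per parameter, Qwen-Image's 20-billion-parameter transformer is roughly 40 GB of weights, which is why split loading and 16 GB of VRAM are the practical minimum rather than loading the whole model at once.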

Single Text Encoder Only

While recent AI image generation models commonly incorporate multiple text encoders to improve prompt understanding, Qwen-Image only has a single Qwen2.5-7B-VL encoder, making it more compact compared to other AI image generation models.

Text Encoders Used in AI Image Generation

| Text Encoder | Release Date | Language | Size | Parameters |
|---|---|---|---|---|
| CLIP-L | January 2021 | English | 0.9 GB | 430 million |
| CLIP-G | January 2023 | English | 5.2 GB | 1.37 billion |
| T5-XXL (v1.1) | March 2022 | Multilingual | 22.8 GB | 11 billion |
| Llama-3.1-8B-Instruct | July 2024 | Multilingual | 16.1 GB | 8 billion |
| Qwen2.5-7B-VL | January 2025 | Multilingual (especially Chinese) | 19.3 GB | 8.29 billion |

Text Encoders in Flux.1, HiDream, and Qwen-Image

| Text Encoder | Flux.1 | HiDream | Qwen-Image |
|---|---|---|---|
| CLIP-L | ✅ | ✅ | |
| CLIP-G | | ✅ | |
| T5-XXL | ✅ | ✅ | |
| Llama-3.1-8B-Instruct | | ✅ | |
| Qwen2.5-7B-VL | | | ✅ |

While Qwen-Image has high Chinese comprehension, its prompt understanding in English and other languages is somewhat inferior to other models.

Qwen-Image Excels at Chinese Text Rendering!

Qwen-Image's greatest strength is its Chinese text rendering capability.

English uses 26 alphabetic characters that are reused frequently, but Chinese characters are far more numerous with each appearing less frequently, making them difficult for AI image generation models to learn.

HiDream

Image of Chinese cuisine and menu generated with HiDream. The Chinese characters on the menu are inaccurately rendered.
Cannot reproduce Chinese characters

Qwen-Image

Image of Chinese cuisine and menu generated with Qwen-Image. The Chinese characters on the menu are accurately rendered.
Accurate Chinese characters

In addition to native Chinese comprehension, Qwen-Image has undergone specialized additional training on Chinese characters, enabling it to express Chinese text quite accurately.

Japanese Performance is Lackluster

Image generated with Qwen-Image showing a metal desk with a nameplate reading "Tokyo Patent Permission Office Director" and a tablet, but the hiragana and katakana characters are inaccurately rendered.
Tokyo Patent Permission Office Director

While Qwen-Image appears to have learned hiragana and katakana similarly to Chinese characters, its Japanese comprehension is lacking.

Although its Japanese accuracy is higher compared to other AI image generation models, using generated images with Japanese text at a practical level remains challenging.

Comparing Anime Illustrations with Other Models!

Let's compare Qwen-Image's anime illustrations with other models.

The illustrations are generated using image-to-image from SDXL and compared with other AI image generation models.

SDXL (1024 x 1024)

Anime illustration generated with SDXL (1024x1024). A woman in blue kimono standing inside a Japanese house.

This is the original illustration generated with SDXL.

Qwen-Image (1328 x 1328)

This is the illustration redrawn with Qwen-Image from the previous image.

Anime illustration redrawn with Qwen-Image (1328x1328). A woman in blue kimono standing inside a Japanese house.
denoise: 0.7

Since Qwen-Image supports high resolutions up to 1328 x 1328, it produces cleaner illustrations with more refined details than SDXL.

True to its high-parameter nature, Qwen-Image renders hands accurately.

Flux.1 [dev] (1440 x 1440)

Next, let's try redrawing the same SDXL illustration using the Flux.1 [dev] model.

Anime illustration redrawn with Flux.1 [dev] (1440x1440). A woman in blue kimono standing inside a Japanese house.
denoise: 0.5

Flux.1 [dev] significantly stylizes the character's face.

At 1440 x 1440, the highest resolution of the models compared, Flux.1 [dev] produces clear illustrations with a sense of transparency.

HiDream-I1-Dev (1024 x 1024)

Anime illustration redrawn with HiDream-I1-Dev (1024x1024). A woman in blue kimono standing inside a Japanese house.
denoise: 0.5

Using HiDream results in a significantly different expression from the original SDXL illustration.

HiDream employs a MoE (Mixture of Experts) architecture combining multiple models, and for anime illustrations, it likely operates with a dedicated anime model.

This illustration was generated from a long natural-language prompt containing keywords such as "Japanese house" and "blur", and HiDream reproduces the prompt most accurately.

Can CFG Scale be Disabled?

Flux.1 and HiDream offer distilled models that disable CFG scale for faster generation, but Qwen-Image has no such official model, so its generation time is roughly twice that of those distilled models.

About CFG Scale and Disabling It

Let's try generating illustrations with Qwen-Image by setting CFG scale to 1 to disable it.

CFG scale: 1

Anime illustration generated with Qwen-Image with CFG scale set to 1. The illustration collapses.
Illustration breaks down

CFG scale: 2.5

Anime illustration generated with Qwen-Image with CFG scale set to 2.5. Clean finish.
Normal generation

Unfortunately, setting CFG scale to 1 causes the illustration to collapse.

Qwen-Image appears unsuitable for high-speed operation with disabled CFG.

CFG scale stabilizes around 2.5, so this is a good baseline for regular use.
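Why does setting CFG scale to 1 "disable" guidance? Classifier-free guidance blends an unconditional and a conditional prediction at every sampling step. The sketch below (plain Python with illustrative values, not any model's actual outputs) shows that at scale 1 the result is simply the conditional prediction — which is why a model needs to be distilled to skip the unconditional pass and halve the work per step.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional (prompt-following) one."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

# Illustrative per-element predictions from the two passes.
uncond = [0.10, 0.20]
cond = [0.30, 0.60]

# At cfg_scale = 1 the unconditional term cancels out entirely, but a
# non-distilled model still runs both passes, so nothing is saved.
print(apply_cfg(uncond, cond, 1.0))  # ≈ the conditional prediction
print(apply_cfg(uncond, cond, 2.5))  # stronger pull toward the prompt
```

With CFG scale above 1, every step requires two forward passes (conditional and unconditional), which is the source of the doubled generation time mentioned above.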

Steps Should be Around 8-12!

Let's now test the step count required for illustration generation.

Using the high-precision heunpp2 sampler, we aim to generate illustrations with the fewest possible steps.

4 steps

Anime illustration generated with Qwen-Image in 4 steps. Incomplete with rough details.
Incomplete

8 steps

Anime illustration generated with Qwen-Image in 8 steps. A woman in blue clothes standing against a theme park castle backdrop, with a complete finish.
Nearly complete

12 steps

Anime illustration generated with Qwen-Image in 12 steps. Almost identical to 8 steps, with a complete finish.
No difference from 8 steps

New-generation MMDiT models converge in fewer steps than older architectures.

Qwen-Image also converges around 8 steps, with little change beyond that.

For image-to-image, 8 steps is sufficient.

For text-to-image, around 12 steps provides adequate computation.

While ComfyUI's default workflow setting is 20 steps, reducing this to a necessary and sufficient step count can shorten rendering time.

Increasing Noise with ModelSamplingAuraFlow

The ModelSamplingAuraFlow node used in Qwen-Image's default workflow adjusts the noise amount during illustration generation.

ModelSamplingAuraFlow Node

Screenshot of ComfyUI's ModelSamplingAuraFlow node showing the noise adjustment settings screen.

Noise Changes by ModelSamplingAuraFlow Shift Values

| Step | bypass | shift: 3.1 | shift: 6 |
|---|---|---|---|
| 1 | 0.9260 | 0.9249 | 0.9598 |
| 2 | 0.8732 | 0.8713 | 0.9291 |
| 3 | 0.8027 | 0.7998 | 0.8855 |
| 4 | 0.7103 | 0.7073 | 0.8238 |
| 5 | 0.5917 | 0.5877 | 0.7340 |
| 6 | 0.4438 | 0.4397 | 0.6030 |
| 7 | 0.2714 | 0.2688 | 0.4157 |
| 8 | 0.0998 | 0.1011 | 0.1787 |
| 9 | 0.0000 | 0.0000 | 0.0000 |

The default shift: 3.1 setting in the ModelSamplingAuraFlow node is nearly identical to bypassing the node.

Setting higher shift values increases noise, resulting in rich illustrations with strong Qwen-Image influence.

shift: 3.1

Anime illustration generated with Qwen-Image using ModelSamplingAuraFlow shift value 3.1. Expression close to the original illustration.
Expression following the original illustration

shift: 6

Anime illustration generated with Qwen-Image using ModelSamplingAuraFlow shift value 6. Significantly changed from the original illustration with strong Qwen-Image influence.
Strong Qwen-Image influence

The ModelSamplingAuraFlow shift value should be set as high as possible within the range that maintains illustration integrity.
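For reference, samplers in the SD3/AuraFlow family of flow-matching models commonly remap the noise schedule as σ' = shift·σ / (1 + (shift − 1)·σ). The sketch below assumes that mapping and applies it to a raw linear schedule (not to the table's bypass column, which already includes the model's built-in shift) to show how a larger shift value keeps each step at a higher noise level.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Flow-matching time shift as commonly used by SD3/AuraFlow-style
    samplers: remaps sigma in [0, 1] so that more of the schedule is
    spent at high noise levels when shift > 1."""
    return shift * sigma / (1 + (shift - 1) * sigma)

raw = [i / 8 for i in range(8, -1, -1)]  # 9-point linear schedule, 1 -> 0
for s in (1.0, 3.1, 6.0):
    row = ", ".join(f"{shift_sigma(t, s):.3f}" for t in raw)
    print(f"shift {s}: {row}")
```

The endpoints (σ = 1 and σ = 0) are unchanged for any shift value; only the middle of the schedule moves toward higher noise, which matches the table's pattern of shift: 6 producing larger sigmas at every intermediate step.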

High-Speed Generation with Distilled Models and LoRA

While Qwen-Image is computationally heavy, there are community-developed CFG-disabled distilled models and acceleration LoRAs.

Anime illustration generated with Qwen-Image.
Qwen-Image

Qwen-Image-distill (CFG-Disabled High-Speed Distilled Model)

Qwen-Image-lightning (Acceleration LoRA)

Both can accelerate rendering in exchange for some image quality degradation.

Workflow

Here's the workflow used in this verification and recommended Qwen-Image settings.

This workflow is configured for high precision at 8-12 steps, so none of the acceleration techniques are used.

Workflow

Screenshot of the ComfyUI workflow used for Qwen-Image image generation, showing node connections and settings.

Model Downloads (Hugging Face Comfy-Org Repository)

Changes from ComfyUI Official Workflow

  • Uses BF16 format models
  • Loads text encoder to RAM
  • Uses ConditioningZeroOut instead of Negative Prompt
  • Added color correction

About Color Correction

  • sampler: heunpp2
  • scheduler: beta
  • steps: 12 (8 steps for image-to-image)
  • ModelSamplingAuraFlow shift value: 6

Conclusion: Qwen-Image Excels with Chinese

  • Qwen-Image excels at understanding and rendering Chinese
  • English prompt understanding is somewhat inferior
  • Requires 16GB VRAM

Qwen is an AI model released by China's Alibaba Group.

Qwen-Image is exceptionally superior in Chinese comprehension compared to previous AI image generation models.

However, it has usability drawbacks compared to other models, including the requirement for 16GB VRAM and the lack of an official distilled model with disabled CFG.

Anime illustration generated with Qwen-Image. A woman in blue kimono gazing at the viewer with a simple background setting.

Nevertheless, while the image generation models built into other chat AIs, such as OpenAI's GPT-Image-1, Google's Imagen, and xAI's Aurora, remain proprietary and unreleased, Alibaba's release of Qwen-Image under a free license holds significant meaning for AI open-sourcing.

I look forward to continuing to explore new expressions as new AI image generation models are released.

Thank you for reading to the end!
