What Are CLIP and T5xxl? How Text Encoders Can Make Illustrations Stunning!


- CLIP converts text to vectors
- T5xxl understands context
- Enhanced text encoders are publicly available
Introduction
Hello, I'm Easygoing!
Today, we're diving into text encoders - the "brains" that help AI understand your prompts in image generation.
Text Encoders: The Universal Translator
AI systems need to convert the text we input into a format machines can understand and process.
flowchart LR
subgraph Prompt
A1(Text)
end
subgraph "Text Encoder"
B1(Words)
B2(Tokens)
B3(Vectors)
end
subgraph "Transformer / UNET"
C1(Generate Image)
end
A1-->B1
B1-->B2
B2-->B3
B3-->C1
Text encoders handle this translation, essentially functioning as dictionaries that convert human language into machine language.
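If you are curious what this Words → Tokens → Vectors flow looks like in actual code, here is a minimal sketch using the Hugging Face transformers library and the public openai/clip-vit-large-patch14 checkpoint (the prompt is just an example):

```python
# Minimal sketch: how a text encoder turns a prompt into vectors.
# Assumes the Hugging Face transformers library and the public
# openai/clip-vit-large-patch14 checkpoint (CLIP-L).
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a castle on a hill at sunset"

# Words -> Tokens: the tokenizer splits the prompt into integer token IDs.
tokens = tokenizer(prompt, return_tensors="pt")
print(tokens.input_ids)          # tensor of integer token IDs

# Tokens -> Vectors: the encoder maps each token to a 768-dimensional vector.
vectors = text_encoder(**tokens).last_hidden_state
print(vectors.shape)             # torch.Size([1, sequence_length, 768])

# These vectors are what the UNET / Transformer receives to generate the image.
```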
So how does changing text encoders affect image quality in AI art generation?
Real Image Comparisons!
Let's examine how images actually change when we swap text encoders using the latest Flux.1 model.
Flux.1 comes equipped with two different text encoders:
flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D2(CLIP-L)
D1(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D3
D1-->D3
- T5xxl: Primarily understands prompt context
- CLIP-L: Primarily understands short phrases and individual words
Today, we'll upgrade both T5xxl and CLIP-L to higher-precision versions.
T5xxl-FP16 + CLIP-L-FP16 (original)

Flan-T5xxl-FP16 + CLIP-L-FP16

Flan-T5xxl-FP32 + CLIP-L-FP32

Flan-T5xxl-FP32 + CLIP-GmP-ViT-L-14-FP32

Flan-T5xxl-FP32 + Long-CLIP-GmP-ViT-L-14-FP32

The text encoders listed become progressively more powerful as you go down the list.
You can clearly see how changing text encoders improves image quality, particularly in the architectural details on the right side of the images.
Note that the bottom Long-CLIP-L model can be used in ComfyUI but is not compatible with Stable Diffusion webUI Forge.
Also, using FP32-format text encoders requires the launch options described later in this article.
Deep Dive into Text Encoders
Let's take a closer look at text encoders in more detail.
Here are the main text encoders used in popular image generation AI systems:
flowchart LR
subgraph Input
X1(Prompt)
end
subgraph "Image Generation"
Y1(UNET)
Y2(Transformer)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
end
subgraph "Stable Diffusion 3"
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
end
subgraph "Stable Diffusion XL"
B1(CLIP-L)
B2(CLIP-G)
end
subgraph "Stable Diffusion 1"
A1(CLIP-L)
end
X1-->A1
X1-->B1
X1-->B2
X1-->C1
X1-->C2
X1-->C3
X1-->D1
X1-->D2
A1-->Y1
B1-->Y1
B2-->Y1
C1-->Y2
C2-->Y2
C3-->Y2
D1-->Y2
D2-->Y2
T5xxl and CLIP are the text encoders, while UNET and Transformer handle the actual image generation based on the analyzed information.
CLIP: The Foundation of Everything
CLIP is an AI developed by OpenAI that can understand both text and images simultaneously.
Open-CLIP is an open-source reimplementation of CLIP, and nowadays "CLIP" in image generation commonly refers to the Open-CLIP implementation.
CLIP comes in several variants based on performance:
Model Name | Release | Parameters | Max Tokens | Comprehensible Text |
---|---|---|---|---|
CLIP-B (Base) | November 2021 | 149 million | 77 | Words & Short Sentences |
CLIP-L (Large) | January 2022 | 355 million | 77 | Words & Short Sentences |
Long-CLIP-L | April 2024 | 355 million | 248 | Long Sentences |
CLIP-G (Giant) | January 2023 | 750 million | 77 | Long Sentences |
CLIP-L is an improved version of CLIP-B, and most image generation AI systems use CLIP-L.
Long-CLIP-L is an enhanced version of CLIP-L designed to understand longer text passages.
CLIP-G increases the parameter count over CLIP-L to improve overall performance. While its token limit is still 77, it handles prompts of over 200 words more effectively by emphasizing and reproducing the important elements.
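To get a feel for what the 77-token limit actually means, here is a small sketch (assuming the transformers library and a deliberately repetitive throwaway prompt) that counts how many tokens a long prompt needs and shows how everything beyond 77 gets cut off:

```python
# Sketch: CLIP-L's 77-token limit in practice.
# Assumes the Hugging Face transformers library; the long prompt is illustrative.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

long_prompt = "a detailed painting of a medieval castle " * 20  # deliberately long

# Without truncation we can count how many tokens the prompt really needs.
all_ids = tokenizer(long_prompt).input_ids
print(len(all_ids))  # well over 77

# With the model's limit applied, everything past 77 tokens is simply cut off.
truncated = tokenizer(
    long_prompt,
    truncation=True,
    max_length=tokenizer.model_max_length,  # 77 for CLIP-L and CLIP-G
    return_tensors="pt",
)
print(truncated.input_ids.shape)  # torch.Size([1, 77])
```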
T5xxl: Understanding Context
T5xxl is a text-to-text generation model developed by Google that forms the foundational technology behind AI chat systems and translation AI services we use today.
While T5xxl can theoretically handle very long texts, accuracy still decreases as text length increases.
Model Name | Release | Parameters | Vocabulary Size | Comprehensible Text |
---|---|---|---|---|
T5xxl | October 2020 | 11 billion | 32,000 | Long Sentences & Context |
T5xxl v1.1 | June 2021 | 11 billion | 32,000 | Long Sentences & Context |
Flan-T5xxl | October 2022 | 11 billion | 32,000 | Long Sentences & Context |
T5xxl v1.1 and Flan-T5xxl maintain the same parameter count but achieve improved overall accuracy through efficient additional training.
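For comparison with the CLIP example above, here is a sketch of encoding a longer natural-language prompt with T5xxl, assuming the transformers library and the public google/flan-t5-xxl checkpoint (the full model is enormous, so treat this as an illustration rather than something to run casually):

```python
# Sketch: encoding a long prompt with the T5xxl text encoder.
# Assumes the transformers library and the public google/flan-t5-xxl checkpoint.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl", torch_dtype=torch.float16)

prompt = (
    "A watercolor illustration of an old stone castle on a cliff at sunset, "
    "with warm light on the walls, seagulls circling the towers, and a small "
    "village with red roofs in the valley below."
)

tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    vectors = encoder(**tokens).last_hidden_state

# One 4096-dimensional vector per token, with no hard 77-token ceiling.
print(vectors.shape)
```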
The Rise of Multiple Text Encoders
Modern image generation AI systems incorporate multiple text encoders to improve prompt understanding accuracy.
Stable Diffusion 1: Understanding Words and Short Phrases
flowchart LR
subgraph Input
X1(Prompt)
end
subgraph "Stable Diffusion 1"
A1(CLIP-L)
Y1(UNET)
end
X1-->A1
A1-->Y1
Stable Diffusion 1, released in July 2022, used CLIP-L as its text encoder.
Since CLIP-L could understand only a limited number of tokens, prompts needed to be written as short, comma-separated words and phrases, with the most important keywords placed at the beginning, for example `masterpiece, 1girl, silver hair, castle, sunset` rather than a full descriptive sentence.
Stable Diffusion XL: Understanding Long Text
flowchart LR
subgraph Input
X1(Prompt)
end
subgraph "Stable Diffusion XL"
B1(CLIP-L)
B2(CLIP-G)
Y1(UNET)
end
X1-->B1
X1-->B2
B1-->Y1
B2-->Y1
Stable Diffusion XL, released in July 2023, added CLIP-G alongside the existing CLIP-L text encoder.
CLIP-G offers superior performance to CLIP-L and can understand longer passages, enabling prompts to be written in natural language long-form text.

While SDXL models are 7GB in size, text encoders account for 1.8GB of that, demonstrating SDXL's emphasis on prompt understanding.
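As a quick illustration, assuming the diffusers library and the public SDXL base repository, you can see both encoders sitting inside a single SDXL pipeline:

```python
# Sketch: SDXL bundles two text encoders inside one pipeline / checkpoint.
# Assumes the diffusers library and the public SDXL base repository.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

print(type(pipe.text_encoder).__name__)    # CLIPTextModel (CLIP-L)
print(type(pipe.text_encoder_2).__name__)  # CLIPTextModelWithProjection (CLIP-G)
```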
Stable Diffusion 3: Understanding Context
Stable Diffusion 3, released in June 2024, added T5xxl alongside CLIP-L and CLIP-G, significantly improving text comprehension.
While T5xxl is high-performing, it's also large - the encoder alone is 9GB in size.
flowchart LR
subgraph Input
X1(Prompt)
end
subgraph "Stable Diffusion 3"
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
Y2(Transformer)
end
X1-->C1
X1-->C2
X1-->C3
C1-->Y2
C2-->Y2
C3-->Y2
Because these text encoders are so large, distributing and running them separately from the main model became standard practice starting with Stable Diffusion 3.
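In ComfyUI this separation appears as dedicated text encoder loader nodes. As a rough diffusers-side illustration (assuming the diffusers library and access to the stabilityai/stable-diffusion-3-medium-diffusers repository), the separated encoders can even be dropped individually to save memory:

```python
# Sketch: because SD3's text encoders ship separately, they can be loaded
# (or skipped) independently of the main model.
import torch
from diffusers import StableDiffusion3Pipeline

# Dropping the 9GB T5xxl encoder is a common memory-saving option;
# CLIP-L and CLIP-G alone still produce usable results.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,   # skip T5xxl
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe("a castle on a hill at sunset", num_inference_steps=28).images[0]
image.save("castle.png")
```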
Flux.1: No CLIP-G Included
flowchart LR
subgraph Input
X1(Prompt)
end
subgraph "Flux.1"
D1(CLIP-L)
D2(T5xxl)
Y2(Transformer)
end
X1-->D1
X1-->D2
D1-->Y2
D2-->Y2
Flux.1, released in August 2024, uses two text encoders - CLIP-L and T5xxl - but notably excludes CLIP-G.
This is likely because T5xxl adequately covers CLIP-G's functionality.

Pointing to this absence of CLIP-G, Stability AI claims that Stable Diffusion 3.5 has superior language understanding to Flux.1. In practice, however, there are rarely situations where Flux.1's prompt understanding feels insufficient.
Even the previous-generation CLIP-G already provides practically sufficient understanding of long prompts.
Enhanced Text Encoders!
Let's look at the links for the improved text encoders used in today's comparison.
Enhanced CLIP-L
CLIP-GmP-ViT-L-14
CLIP-GmP-ViT-L-14 is an improved CLIP-L model developed and freely shared by individual developer Zer0int.
The development motivation was simply "because I love CLIP" - they train models at home using an RTX 4090.
CLIP-GmP-ViT-L-14 uses Geometric Parametrization (GmP) to achieve higher accuracy than standard CLIP-L. On ImageNet/ObjectNet benchmarks, it achieves 90% accuracy compared to the original CLIP-L's 85%, a significant performance improvement.
According to Zer0int, CLIP-GmP-ViT-L-14 also addresses the original CLIP-L's tendency to fixate excessively on certain elements when interpreting an image.

The CLIP-GmP-ViT-L-14 download page offers multiple files, including the original FP32 version plus a further improved ViT-L-14-BEST-smooth-GmP-TE-only-HF-format.safetensors FP16 version.

When in doubt, download this FP16 version.
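In ComfyUI or Forge, the downloaded file generally just goes into the text encoder (clip) models folder and gets selected in the loader. If you work with diffusers instead, one possible way to swap it in looks roughly like this (the strict=False load and the dtype handling are assumptions on my part, not an official recipe):

```python
# Sketch: swapping the enhanced CLIP-L into a Flux.1 pipeline with diffusers.
# Assumes the downloaded TE-only HF-format safetensors file and access to the
# gated black-forest-labs/FLUX.1-dev repository.
import torch
from safetensors.torch import load_file
from transformers import CLIPTextModel
from diffusers import FluxPipeline

# Start from the standard CLIP-L text encoder, then overwrite its weights
# with the GmP fine-tune (the TE-only HF-format file should match these keys).
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
gmp_weights = load_file("ViT-L-14-BEST-smooth-GmP-TE-only-HF-format.safetensors")
text_encoder.load_state_dict(gmp_weights, strict=False)
text_encoder = text_encoder.to(torch.bfloat16)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder,        # enhanced CLIP-L replaces the default one
    torch_dtype=torch.bfloat16,
)
```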
Long-CLIP-GmP-ViT-L-14 (ComfyUI Only)
Long-CLIP-L extends the standard CLIP-L model's 77-token limitation to support up to 248 tokens, enabling CLIP-L to handle longer prompts.
Currently, Long-CLIP-L only works with ComfyUI and cannot be used with Stable Diffusion webUI Forge.

The download page offers both the original FP32 version and an enhanced FP16 version: Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors.
December 31, 2024 Update
I compared the effects of enhanced CLIP-L models with actual images.
Flan-T5xxl (Enhanced T5xxl)
Next up is the enhanced T5xxl. Flan-T5xxl underwent additional training on top of regular T5xxl to improve accuracy.
Original Flan-T5xxl (Split Version)
Google's original Flan-T5xxl is distributed in split files due to its large size (44GB for FP32 format).
Flan-T5xxl Merged Version
These are merged files based on the original, prepared for use in image generation AI.
Beyond the simple merged version, I also distribute a TE-only version that extracts only the text encoder components used by Flux.1 / SD 3.5 / HiDream.
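If you are wondering what the TE-only extraction involves, it essentially boils down to merging the shards and keeping only the encoder weights. The sketch below is an illustration of that idea, not the exact script behind the distributed files, and the shard and output file names are placeholders:

```python
# Sketch of a TE-only extraction: load the full Flan-T5xxl weights and keep
# only the encoder half that image generation actually uses.
# File names are placeholders; key prefixes may differ between releases.
from safetensors.torch import load_file, save_file

full_state = {}
# The original Google release is split across several shards; merge them first.
for shard in ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]:
    full_state.update(load_file(shard))

# Keep only encoder weights (plus the shared token embedding); drop the decoder,
# which text-to-image models never call.
te_only = {
    key: tensor
    for key, tensor in full_state.items()
    if key.startswith("encoder.") or key.startswith("shared.")
}

save_file(te_only, "flan-t5xxl_TE-only.safetensors")
```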
How to Use Flan-T5xxl Models
The models are distributed in FP32, FP16, and GGUF formats.
Using FP32 Format Text Encoders
Text encoders normally run in FP16.
To use FP32-format text encoders, a launch option must be added.
Here's how to set this up in Stability Matrix.
ComfyUI
Add --fp32-text-enc to launch options.

Stable Diffusion webUI Forge
Add --clip-in-fp32 to launch options.

SDXL Text Encoder Upgrades Are Challenging
Today we upgraded text encoders in Flux.1.
Since Flux.1 and SD 3.5 run text encoders separately, upgrades are straightforward. However, SDXL and SD 1.5 have integrated encoders, making upgrades more complex.
I'll cover this topic in a future article!
Summary: Try Upgrading Your Text Encoders!
- CLIP converts text to vectors
- T5xxl understands context
- Enhanced text encoders are publicly available

When considering image generation quality, we often focus on the transformer components that generate images, leaving text encoders as an afterthought.
This investigation shows that text encoders significantly impact image quality as well.

The enhanced CLIP-L models introduced today offer clear image quality improvements despite their compact size, so I recommend everyone give them a try.
Thank you for reading to the end!
Update History
August 12, 2025
Minor content revisions made to the article
April 20, 2025
Added launch instructions for FP32 format text encoders in Stable Diffusion webUI Forge
March 9, 2025
Partial article revisions following the release of Flan-T5xxl_TE-only
December 15, 2024
Added usage instructions for FP32 format text encoders in ComfyUI