[ComfyUI Intermediate] Settings to Control VRAM and Unlock Peak Performance!


- Use multiple GPUs or the CPU’s built-in GPU.
- Set `--reserve-vram` to 0.
- Use `--use-pytorch-cross-attention`.
Introduction
Hello, I’m Easygoing.
This time, I’m diving into the intermediate level of ComfyUI to explore the best settings for maximizing its performance.

ComfyUI Is Fast!
When I switched from Stable Diffusion webUI Forge to ComfyUI, the first thing that blew me away was how fast it runs.
Stable Diffusion webUI Forge was already pretty snappy, but ComfyUI displays processes in real-time, letting you monitor VRAM and RAM usage at all times. This makes it easier to fine-tune and optimize your setup.
With the right settings, ComfyUI can run Flux.1, SD 3.5, and AuraFlow with just 12GB of VRAM, and SDXL with only 6GB of VRAM!
Let’s dive into the best practices for configuring ComfyUI.
VRAM Is Key
When generating images, the most critical factor is VRAM management.
If VRAM usage exceeds your GPU’s capacity during image generation, processing slows down significantly, and tasks can take several times longer to complete.

With ComfyUI, you can monitor VRAM usage and adjust settings to ensure it doesn’t exceed your GPU’s capacity.
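If you want to check these numbers yourself outside of ComfyUI's display, PyTorch can report them directly. A minimal sketch, assuming an NVIDIA GPU and a CUDA build of PyTorch:

```python
# Query VRAM from the driver and from PyTorch's allocator.
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()        # bytes, as reported by the driver
    allocated = torch.cuda.memory_allocated()      # bytes currently held by PyTorch tensors
    print(f"Total : {total / 2**30:.1f} GiB")
    print(f"Free  : {free / 2**30:.1f} GiB")
    print(f"In use by PyTorch: {allocated / 2**30:.2f} GiB")
else:
    print("No CUDA GPU detected.")
```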

Using Two GPUs!
A GPU’s primary role is outputting video to your monitor.
Monitor output typically uses 300 MB to 1GB of VRAM. By offloading this to a second GPU, you can free up that VRAM for ComfyUI.
For this secondary GPU, an entry-level model from any manufacturer will do the trick.
Entry-Level GPUs
| Manufacturer | GPU | VRAM | Used Price |
|---|---|---|---|
| NVIDIA | GeForce GT 1030 | 2 GB | Around 5,000 yen |
| AMD | Radeon RX 550 | 2-4 GB | Around 5,000 yen |
You can often find these at rock-bottom prices in second-hand shops like Hard Off.
Switching GPU output is a breeze—just swap the monitor cable to the secondary GPU.
Using the CPU’s Built-in GPU
For small PCs or laptops where multiple GPUs aren’t an option, you might still be able to use the CPU’s built-in GPU for monitor output.
If you’re using an NVIDIA GPU, right-click the NVIDIA icon in the system tray and open the NVIDIA Control Panel to check which GPU is handling the display.


If the CPU’s built-in GPU is handling monitor output, your dedicated GPU can allocate all its VRAM to ComfyUI.
Note that depending on your CPU model or motherboard settings, the built-in GPU might not be available for output. If it doesn’t work, try searching your CPU model online to confirm.
Specifying the Main GPU in Stability Matrix
When using multiple GPUs, you can set the primary GPU in Stability Matrix.

From the settings menu in the bottom left of Stability Matrix, go to System Settings and select your highest-performing GPU as the Default GPU for ComfyUI.
This ensures that when you install packages, Stability Matrix automatically picks a build optimized for that GPU.
ComfyUI’s Memory Management Logic
Now, let’s take a closer look at how ComfyUI handles memory.
The details are tucked away in the model_management.py file located at StabilityMatrix/Packages/ComfyUI/comfy.
Startup: GPU Detection
ComfyUI automatically detects GPUs at startup.
```mermaid
flowchart TB
subgraph GPU Detection
A01{GPU Detection}
A11("NVIDIA (CUDA)")
A12("Radeon (ROCm)")
A13("Intel Arc (DirectML)")
A14("Apple Silicon (MPS)")
A15(CPU)
end
A01-->A11
A01-->A12
A01-->A13
A01-->A14
A01-->|No GPU detected<br>or --cpu|A15
```
It checks for NVIDIA (CUDA), Radeon (ROCm), Intel Arc (DirectML), and Apple Silicon (MPS), or falls back to the CPU if no GPU is detected or if you use the `--cpu` option.
The detected GPU shows up in ComfyUI’s startup log.
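For reference, here is a rough standalone equivalent of that detection step, assuming PyTorch is installed. ComfyUI's own logic in model_management.py covers more cases (DirectML, for example, requires the separate torch-directml package), so treat this as an illustration only:

```python
# Simplified device detection, loosely following the flowchart above.
import torch

def detect_device() -> torch.device:
    if torch.cuda.is_available():          # NVIDIA (CUDA) and ROCm builds both report here
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")             # no GPU found, or forced with --cpu

print(detect_device())
```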

VRAM State Settings
ComfyUI offers presets to determine how it uses VRAM.
```mermaid
flowchart TB
subgraph VRAM State
AA1{VRAM State}
AB1(Max Performance)
AB2(Normal Mode)
AB3(Low Memory)
AB4(Almost No VRAM)
end
AA1-->|--highvram<br>--gpu-only|AB1
AA1-->|--normalvram<br>or unset|AB2
AA1-->|--lowvram|AB3
AA1-->|--novram|AB4
```
- `--highvram`: Max performance; doesn't unload models once loaded.
- `--normalvram` or unset: Standard operation.
- `--lowvram`: Saves VRAM but slows things down.
- `--novram`: Barely uses VRAM.
(Chart: processing times when VRAM is sufficient)
`--highvram` is fast but keeps models loaded, often triggering Out Of Memory (OOM) errors.

Normally, leaving it unset at `NORMAL_VRAM` works fine.
`--lowvram` saves 200–300 MB of VRAM, but the speed trade-off isn't worth it in most cases.
`--novram` squeezes VRAM usage to the extreme, taking 1.5–2x longer, but that is still much faster than exceeding your VRAM capacity.
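As a mental model, these flags simply select a memory-management policy. The sketch below is loosely modeled on model_management.py but is illustrative only, not a copy of ComfyUI's code:

```python
# Map startup flags to a VRAM policy (illustrative, not ComfyUI's actual code).
from enum import Enum

class VRAMState(Enum):
    NO_VRAM = 1      # --novram: offload almost everything; slowest, tiny footprint
    LOW_VRAM = 2     # --lowvram: split/offload models aggressively
    NORMAL_VRAM = 3  # default: load what fits, unload when space runs out
    HIGH_VRAM = 4    # --highvram / --gpu-only: keep models resident once loaded

def pick_vram_state(novram=False, lowvram=False, highvram=False, gpu_only=False):
    if novram:
        return VRAMState.NO_VRAM
    if lowvram:
        return VRAMState.LOW_VRAM
    if highvram or gpu_only:
        return VRAMState.HIGH_VRAM
    return VRAMState.NORMAL_VRAM  # --normalvram or nothing set

print(pick_vram_state())  # VRAMState.NORMAL_VRAM
```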
FP32 and FP16 Format Decisions
Computers use floating-point (FP) calculations for processing. Here are the types:
| Format | Sign | Exponent | Mantissa | Precision | Accuracy |
|---|---|---|---|---|---|
| FP32 | 1 bit | 8 bits | 23 bits | 6–7 digits | Excellent |
| FP16 | 1 bit | 5 bits | 10 bits | 3–4 digits | Good |
| BF16 | 1 bit | 8 bits | 7 bits | 3 digits | Good |
| FP8e4m3 | 1 bit | 4 bits | 3 bits | 1–2 digits | Fair |
| FP8e5m2 | 1 bit | 5 bits | 2 bits | 1–2 digits | Fair |
- FP32: High precision, high load.
- FP16/BF16: Fast, memory-efficient.
- FP8: Even faster, lower precision.
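To make the precision column concrete, here is a small experiment (assuming PyTorch is installed) that stores the same number in each format and reads it back:

```python
# How much of a value survives in each format. FP8 is shown only if your
# PyTorch build exposes the float8 dtypes.
import torch

value = 123.456
dtypes = [torch.float32, torch.float16, torch.bfloat16]
if hasattr(torch, "float8_e4m3fn"):
    dtypes += [torch.float8_e4m3fn, torch.float8_e5m2]

for dtype in dtypes:
    x = torch.tensor(value).to(dtype)
    print(f"{str(dtype):22} {x.element_size()} byte(s)/value -> {x.float().item():.4f}")
# torch.float32   4 bytes -> 123.4560  (6–7 significant digits)
# torch.float16   2 bytes -> 123.4375  (3–4 digits)
# torch.bfloat16  2 bytes -> 123.5000  (about 3 digits)
# FP8, if present, lands even farther away (around 120 for e4m3, 128 for e5m2).
```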
FP Format by GPU
```mermaid
flowchart TB
subgraph FP Format by GPU
B01("GTX 1000+<br>RX 5000+ (ROCm)<br>Intel Arc<br>Apple Silicon")
B02(GTX 1600 Series<br>RX Vega or earlier<br>CPU)
B11(FP16)
B12(FP32)
end
B01-->B11
B02-->B12
```
NVIDIA GTX 1000 series and later support FP16, but the GTX 1600 series runs FP32 due to unstable FP16 performance. AMD Radeon cards use FP16 under ROCm, though Windows support is limited.
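If you are unsure what your own card supports, PyTorch can tell you. A small check, assuming an NVIDIA GPU and a CUDA build of PyTorch:

```python
# Report what the detected GPU and PyTorch build can handle.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("FP8 dtypes in this PyTorch build:", hasattr(torch, "float8_e4m3fn"))
else:
    print("No CUDA GPU detected; ComfyUI would fall back to the CPU (FP32).")
```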
Extra Reserved Memory (extra_reserved_memory) Settings
ComfyUI reserves some VRAM for non-ComfyUI tasks, mainly monitor output. If you’ve offloaded that to another GPU, you can free up this reserved space.
```mermaid
flowchart TB
subgraph minimum_inference_memory Settings
CA1{"extra_reserved_memory<br>(Extra Reserved Memory)"}
CB1("600 MB<br>(Windows)")
CB2("400 MB<br>(Other OS)")
CB3("Custom Amount<br>(Stability Matrix default: 900 MB)")
CC1("minimum_inference_memory<br>(Actual Reserved Memory)<br>=<br>extra_reserved_memory + 800 MB")
end
CA1-->|Windows|CB1
CA1-->|Other OS|CB2
CA1-->|Manual setting<br>--reserve-vram|CB3
CB1-->CC1
CB2-->CC1
CB3-->CC1
```
By default, extra reserved memory is 600 MB on Windows, 400 MB on other OSes, and 900 MB when installed through Stability Matrix. Setting `--reserve-vram 0` frees all of it, leaving only the 800 MB base that ComfyUI always adds as the actual reserved memory (minimum_inference_memory).
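Put as arithmetic, the reservation works out like this. The sketch below only reproduces the numbers from the diagram (the 800 MB base and the per-OS defaults); the real values live in model_management.py and may change between versions:

```python
# Rough model of the reserved-VRAM math described above (not ComfyUI's code).
MB = 1024 * 1024

def minimum_inference_memory(reserve_vram_gb=None, windows=True, stability_matrix=False):
    if reserve_vram_gb is not None:                 # --reserve-vram <gigabytes>
        extra_reserved = reserve_vram_gb * 1024 * MB
    elif stability_matrix:
        extra_reserved = 900 * MB                   # Stability Matrix default
    else:
        extra_reserved = (600 if windows else 400) * MB
    return extra_reserved + 800 * MB                # plus the 800 MB inference base

print(minimum_inference_memory() // MB)                   # 1400 MB (Windows default)
print(minimum_inference_memory(reserve_vram_gb=0) // MB)  # 800 MB  (--reserve-vram 0)
```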

Cross Attention Method
The cross attention method determines how attention calculations, including their memory management, are carried out.

- `--use-quad-cross-attention`: Always splits the data into four; slowest.
- `--use-split-cross-attention`: Splits data to save VRAM; slower.
- `--use-pytorch-cross-attention`: Fastest.
If unset, compatible GPUs default to `--use-pytorch-cross-attention`. Start with that, and switch if you hit an Out Of Memory (OOM) error.
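The "pytorch" option uses PyTorch's built-in fused attention, which dispatches to a memory-efficient or flash kernel when one is available. A self-contained illustration (the shapes here are arbitrary, not from ComfyUI):

```python
# PyTorch's fused attention call, available since PyTorch 2.0.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, tokens, head_dim)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```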
Model Loading and Unloading in ComfyUI
ComfyUI keeps loaded models in VRAM and unloads them based on scoring when VRAM runs low.
```mermaid
flowchart TB
subgraph Model Loading
DB1{Compare model VRAM usage<br>with available VRAM}
DC1(Load entire model)
DD1{Split model}
DE1(Normal inference)
DE2(Use RAM too<br>Model swaps between VRAM/RAM, slowing down)
end
DB1-->|VRAM > Model|DC1
DB1-->|VRAM < Model|DD1
DD1-->|Split model fits VRAM|DE1
DD1-->|Cannot split|DE2
subgraph Model Unloading
DF1{Loading a new model}
DH1(Load)
DG1(Unload based on scoring:<br>1. Model size<br>2. Used in other steps<br>3. Last loaded)
DI1(Inference)
end
DE1-->DF1
DE2-->DF1
DF1-->|Enough VRAM|DH1
DF1-->|Low VRAM|DG1
DG1-->DH1
DH1-->DI1
```
Scoring for Unloading Models
- Model size
- Used in other steps
- Last loaded
When VRAM capacity is insufficient, models are unloaded based on this scoring, but the model that gets unloaded is not always the ideal choice.
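As a toy illustration of the idea (not ComfyUI's actual algorithm), a scorer along these lines would free the least valuable models first when space runs out:

```python
# Score-based unloading sketch: prefer dropping models that are large,
# not needed by other steps, and were loaded longest ago.
from dataclasses import dataclass

@dataclass
class LoadedModel:
    name: str
    vram_bytes: int
    used_elsewhere: bool
    load_order: int  # higher = loaded more recently

def free_vram_for(new_model_bytes, free_bytes, loaded):
    # Sort so the "cheapest to lose" models come first.
    candidates = sorted(loaded, key=lambda m: (m.used_elsewhere, -m.vram_bytes, m.load_order))
    for m in candidates:
        if free_bytes >= new_model_bytes:
            break
        loaded.remove(m)
        free_bytes += m.vram_bytes
        print(f"Unloading {m.name} ({m.vram_bytes // 2**20} MB)")
    return free_bytes

models = [LoadedModel("text_encoder", 5_000 * 2**20, True, 1),
          LoadedModel("unet", 10_000 * 2**20, False, 2)]
free_vram_for(new_model_bytes=8_000 * 2**20, free_bytes=2_000 * 2**20, loaded=models)
# Unloads "unet" first: it is the largest and not used in other steps.
```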
For large models like Flux.1 or SD 3.5, using a custom node like ComfyUI-MultiGPU to explicitly assign model loading can smooth things out.
FP Formats for Text Encoder, UNET/Transformer, and VAE
Next, let’s check the FP formats for the text encoder, UNET/Transformer, and VAE.
Text Encoder
```mermaid
flowchart TB
subgraph Text Encoder Inference Format
GA1{Check manual settings}
GB1(FP32)
GB2(FP16)
GB3("FP8e4m3<br>(RTX 4000 series only)")
GB4("FP8e5m2<br>(RTX 4000 series only)")
end
GA1-->|--fp32-text-enc|GB1
GA1-->|--fp16-text-enc or unset|GB2
GA1-->|--fp8_e4m3fn-text-enc|GB3
GA1-->|--fp8_e5m2-text-enc|GB4
```
When using a text encoder other than FP16, pick the format matching your model. To use the FP32 text encoder I've mentioned before, enter `--fp32-text-enc` in the bottom input field of the startup settings.

UNET / Transformer
```mermaid
flowchart TB
subgraph UNET / Transformer Inference Format
EA1{Check manual settings}
EB1(FP64)
EB2(FP32)
EB3(FP16)
EB4(BF16)
EB5("FP8e4m3<br>(RTX 4000 series only)")
EB6("FP8e5m2<br>(RTX 4000 series only)")
EC1{Determine data type}
ED1(FP16)
ED2(BF16)
ED3(FP32)
end
EA1--->|--fp64-unet|EB1
EA1--->|--fp32-unet|EB2
EA1--->|--fp16-unet|EB3
EA1--->|--bf16-unet|EB4
EA1--->|--fp8_e4m3fn-unet|EB5
EA1--->|--fp8_e5m2-unet|EB6
EA1-->|Unset|EC1
EC1-->|FP16|ED1
EC1-->|BF16|ED2
EC1-->|Other|ED3
```
UNET/Transformer automatically detects FP32, FP16, or BF16 models. For FP8 models, select the format in the startup settings or the model-loading node.
On RTX 4000 series, FP8 is fast but reduces image quality.
FP8e4m3 offers better precision than FP8e5m2.
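If your PyTorch build exposes the FP8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2, added in recent releases), you can see the precision gap directly. This only illustrates rounding error on random numbers, not ComfyUI's FP8 inference path:

```python
# Compare mean rounding error of the two FP8 formats on random values.
import torch

if hasattr(torch, "float8_e4m3fn"):
    w = torch.randn(10_000)  # stand-in for a slice of UNET weights
    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        roundtrip = w.to(dtype).to(torch.float32)
        err = (w - roundtrip).abs().mean().item()
        print(f"{dtype}: mean absolute error {err:.4f}")
    # e4m3 keeps one extra mantissa bit, so its error is roughly half of e5m2's.
else:
    print("This PyTorch build has no FP8 dtypes.")
```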
VAE
```mermaid
flowchart TB
subgraph VAE Inference Format
HA1{Check manual settings}
HB1(FP32)
HB2(FP16)
HB3(BF16)
HB4(CPU)
HC1{GPU check}
HD1(FP32)
HD2(BF16)
end
HA1-->|--fp32-vae|HB1
HA1-->|--fp16-vae|HB2
HA1-->|--bf16-vae|HB3
HA1-->|--cpu-vae|HB4
HA1-->|Unset|HC1
HC1-->|NVIDIA RTX 3000+<br>or Apple Silicon|HD2
HC1-->|Other|HD1
```
The VAE selects BF16 format on RTX 3000 series or later, or Apple Silicon, and FP32 format for everything else. There’s almost no difference in image quality between the two.
Minimum Split Size by Model
Finally, let's look at the minimum split sizes for each AI model. The largest component that cannot be split is the UNET/Transformer. Here's what I measured:
Test Environment
- RTX 4060 Ti 16GB
- `--use-pytorch-cross-attention`
- `--normalvram`
Results

- Flux.1 (BF16): 10.1 GB
- SD 3.5 Large (FP16 + BF16): 10.1 GB
- AuraFlow (FP16): 10.1 GB
- SDXL (FP16): 4.3 GB
This shows that with the right startup settings, Flux.1, SD 3.5, and AuraFlow run on 12GB VRAM, while SDXL needs just 6GB VRAM in FP16/BF16 formats.
Other Startup Options
Startup options beyond what I’ve covered here are summarized on the next page.
Conclusion: ComfyUI Best Practices
- Use multiple GPUs or the CPU’s built-in GPU.
- Set `--reserve-vram` to 0.
- Use `--use-pytorch-cross-attention`.
When you install ComfyUI via Stability Matrix, it defaults to VRAM-saving, slower settings to ensure broad compatibility.
This is a necessary design choice for many users to avoid issues, but tweaking it for your setup can boost performance significantly.

To find the sweet spot, I recommend turning off all VRAM-saving options first, then reverting if you hit an Out Of Memory error.
Just applying the startup settings I’ve covered here will speed up ComfyUI a ton—give it a shot!
Thanks for reading to the end!