[ComfyUI Intermediate] Settings to Control VRAM and Unlock Peak Performance!


- Use multiple GPUs or the CPU’s built-in GPU.
- Set `--reserve-vram` to 0.
- Use `--use-pytorch-cross-attention`.
Introduction
Hello, I’m Easygoing.
This time, I’m diving into the intermediate level of ComfyUI to explore the best settings for maximizing its performance.

ComfyUI Is Fast!
When I switched from Stable Diffusion webUI Forge to ComfyUI, the first thing that blew me away was how fast it runs.
Stable Diffusion webUI Forge was already pretty snappy, but ComfyUI displays processes in real-time, letting you monitor VRAM and RAM usage at all times. This makes it easier to fine-tune and optimize your setup.
With the right settings, ComfyUI can run Flux.1, SD 3.5, and AuraFlow with just 12GB of VRAM, and SDXL with only 6GB of VRAM!
Let’s dive into the best practices for configuring ComfyUI.
VRAM Is Key
When generating images, the most critical factor is VRAM management.
If VRAM usage exceeds your GPU’s capacity during image generation, processing slows down significantly, and tasks can take several times longer to complete.

With ComfyUI, you can monitor VRAM usage and adjust settings to ensure it doesn’t exceed your GPU’s capacity.
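If you want to check these numbers yourself outside of ComfyUI's display, PyTorch can report them directly. A minimal sketch, assuming an NVIDIA GPU and a CUDA build of PyTorch:

```python
# Query VRAM from the driver and from PyTorch's allocator.
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()        # bytes, as reported by the driver
    allocated = torch.cuda.memory_allocated()      # bytes currently held by PyTorch tensors
    print(f"Total : {total / 2**30:.1f} GiB")
    print(f"Free  : {free / 2**30:.1f} GiB")
    print(f"In use by PyTorch: {allocated / 2**30:.2f} GiB")
else:
    print("No CUDA GPU detected.")
```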

Using Two GPUs!
A GPU’s primary role is outputting video to your monitor.
Monitor output typically uses 300 MB to 1GB of VRAM. By offloading this to a second GPU, you can free up that VRAM for ComfyUI.
For this secondary GPU, an entry-level model from any manufacturer will do the trick.
Entry-Level GPUs
| Manufacturer | GPU | VRAM | Used Price |
|---|---|---|---|
| NVIDIA | GeForce GT 1030 | 2 GB | Around 5,000 yen |
| AMD | Radeon RX 550 | 2-4 GB | Around 5,000 yen |
You can often find these at rock-bottom prices in second-hand shops like Hard Off.
Switching GPU output is a breeze—just swap the monitor cable to the secondary GPU.
Using the CPU’s Built-in GPU
For small PCs or laptops where multiple GPUs aren’t an option, you might still be able to use the CPU’s built-in GPU for monitor output.
If you’re using an NVIDIA GPU, right-click the NVIDIA icon in the system tray and open the NVIDIA Control Panel to check which GPU is handling the display.


If the CPU’s built-in GPU is handling monitor output, your dedicated GPU can allocate all its VRAM to ComfyUI.
Note that depending on your CPU model or motherboard settings, the built-in GPU might not be available for output. If it doesn’t work, try searching your CPU model online to confirm.
Specifying the Main GPU in Stability Matrix
When using multiple GPUs, you can set the primary GPU in Stability Matrix.

From the settings menu in the bottom left of Stability Matrix, go to System Settings and select your highest-performing GPU as the Default GPU for ComfyUI.
This ensures that when you install packages, Stability Matrix automatically picks a build optimized for that GPU.
ComfyUI’s Memory Management Logic
Now, let’s take a closer look at how ComfyUI handles memory.
The details are tucked away in the model_management.py file located at StabilityMatrix/Packages/ComfyUI/comfy.
Startup: GPU Detection
ComfyUI automatically detects GPUs at startup.
```mermaid
flowchart TB
subgraph GPU Detection
A01{GPU Detection}
A11("NVIDIA (CUDA)")
A12("Radeon (ROCm)")
A13("Intel Arc (DirectML)")
A14("Apple Silicon (MPS)")
A15(CPU)
end
A01-->A11
A01-->A12
A01-->A13
A01-->A14
A01-->|No GPU detected<br>or --cpu|A15
```
It checks for NVIDIA (CUDA), Radeon (ROCm), Intel Arc (DirectML), and Apple Silicon (MPS), or falls back to the CPU if no GPU is detected or if you use the `--cpu` option.
The detected GPU shows up in ComfyUI’s startup log.
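For reference, here is a rough standalone equivalent of that detection step, assuming PyTorch is installed. ComfyUI's own logic in model_management.py covers more cases (DirectML, for example, requires the separate torch-directml package), so treat this as an illustration only:

```python
# Simplified device detection, loosely following the flowchart above.
import torch

def detect_device() -> torch.device:
    if torch.cuda.is_available():          # NVIDIA (CUDA) and ROCm builds both report here
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")             # no GPU found, or forced with --cpu

print(detect_device())
```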

VRAM State Settings
ComfyUI offers presets to determine how it uses VRAM.
```mermaid
flowchart TB
subgraph VRAM State
AA1{VRAM State}
AB1(Max Performance)
AB2(Normal Mode)
AB3(Low Memory)
AB4(Almost No VRAM)
end
AA1-->|--highvram<br>--gpu-only|AB1
AA1-->|--normalvram<br>or unset|AB2
AA1-->|--lowvram|AB3
AA1-->|--novram|AB4
```
- `--highvram`: Max performance; doesn't unload models once loaded.
- `--normalvram` or unset: Standard operation.
- `--lowvram`: Saves VRAM but slows things down.
- `--novram`: Barely uses VRAM.
(Chart: processing times when VRAM is sufficient)
`--highvram` is fast but keeps models loaded, often triggering Out Of Memory (OOM) errors.

Normally, leaving it unset at `NORMAL_VRAM` works fine.
`--lowvram` saves 200–300 MB of VRAM, but the speed trade-off isn't worth it in most cases.
`--novram` squeezes VRAM usage to the extreme, taking 1.5–2x longer, but that is still much faster than exceeding your VRAM capacity.
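As a mental model, these flags simply select a memory-management policy. The sketch below is loosely modeled on model_management.py but is illustrative only, not a copy of ComfyUI's code:

```python
# Map startup flags to a VRAM policy (illustrative, not ComfyUI's actual code).
from enum import Enum

class VRAMState(Enum):
    NO_VRAM = 1      # --novram: offload almost everything; slowest, tiny footprint
    LOW_VRAM = 2     # --lowvram: split/offload models aggressively
    NORMAL_VRAM = 3  # default: load what fits, unload when space runs out
    HIGH_VRAM = 4    # --highvram / --gpu-only: keep models resident once loaded

def pick_vram_state(novram=False, lowvram=False, highvram=False, gpu_only=False):
    if novram:
        return VRAMState.NO_VRAM
    if lowvram:
        return VRAMState.LOW_VRAM
    if highvram or gpu_only:
        return VRAMState.HIGH_VRAM
    return VRAMState.NORMAL_VRAM  # --normalvram or nothing set

print(pick_vram_state())  # VRAMState.NORMAL_VRAM
```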
FP32 and FP16 Format Decisions
Computers use floating-point (FP) calculations for processing. Here are the types:
| Format | Sign | Exponent | Mantissa | Precision | Accuracy |
|---|---|---|---|---|---|
| FP32 | 1 bit | 8 bits | 23 bits | 6–7 digits | Excellent |
| FP16 | 1 bit | 5 bits | 10 bits | 3–4 digits | Good |
| BF16 | 1 bit | 8 bits | 7 bits | 3 digits | Good |
| FP8e4m3 | 1 bit | 4 bits | 3 bits | 1–2 digits | Fair |
| FP8e5m2 | 1 bit | 5 bits | 2 bits | 1–2 digits | Fair |
- FP32: High precision, high load.
- FP16/BF16: Fast, memory-efficient.
- FP8: Even faster, lower precision.
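To make the precision column concrete, here is a small experiment (assuming PyTorch is installed) that stores the same number in each format and reads it back:

```python
# How much of a value survives in each format. FP8 is shown only if your
# PyTorch build exposes the float8 dtypes.
import torch

value = 123.456
dtypes = [torch.float32, torch.float16, torch.bfloat16]
if hasattr(torch, "float8_e4m3fn"):
    dtypes += [torch.float8_e4m3fn, torch.float8_e5m2]

for dtype in dtypes:
    x = torch.tensor(value).to(dtype)
    print(f"{str(dtype):22} {x.element_size()} byte(s)/value -> {x.float().item():.4f}")
# torch.float32   4 bytes -> 123.4560  (6–7 significant digits)
# torch.float16   2 bytes -> 123.4375  (3–4 digits)
# torch.bfloat16  2 bytes -> 123.5000  (about 3 digits)
# FP8, if present, lands even farther away (around 120 for e4m3, 128 for e5m2).
```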
FP Format by GPU
```mermaid
flowchart TB
subgraph FP Format by GPU
B01("GTX 1000+<br>RX 5000+ (ROCm)<br>Intel Arc<br>Apple Silicon")
B02(GTX 1600 Series<br>RX Vega or earlier<br>CPU)
B11(FP16)
B12(FP32)
end
B01-->B11
B02-->B12
```
NVIDIA GTX 1000 series and later support FP16, but the GTX 1600 series runs FP32 due to unstable FP16 performance. AMD Radeon cards use FP16 under ROCm, though Windows support is limited.
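If you are unsure what your own card supports, PyTorch can tell you. A small check, assuming an NVIDIA GPU and a CUDA build of PyTorch:

```python
# Report what the detected GPU and PyTorch build can handle.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("FP8 dtypes in this PyTorch build:", hasattr(torch, "float8_e4m3fn"))
else:
    print("No CUDA GPU detected; ComfyUI would fall back to the CPU (FP32).")
```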
Extra Reserved Memory (extra_reserved_memory) Settings
ComfyUI reserves some VRAM for non-ComfyUI tasks, mainly monitor output. If you’ve offloaded that to another GPU, you can free up this reserved space.
```mermaid
flowchart TB
subgraph minimum_inference_memory Settings
CA1{"extra_reserved_memory<br>(Extra Reserved Memory)"}
CB1("600 MB<br>(Windows)")
CB2("400 MB<br>(Other OS)")
CB3("Custom Amount<br>(Stability Matrix default: 900 MB)")
CC1("minimum_inference_memory<br>(Actual Reserved Memory)<br>=<br>extra_reserved_memory + 800 MB")
end
CA1-->|Windows|CB1
CA1-->|Other OS|CB2
CA1-->|Manual setting<br>--reserve-vram|CB3
CB1-->CC1
CB2-->CC1
CB3-->CC1
```
By default, extra reserved memory is 600 MB on Windows, 400 MB on other OSes, and 900 MB when installed through Stability Matrix. Setting `--reserve-vram 0` frees all of it, leaving only the 800 MB base that ComfyUI always adds as the actual reserved memory (minimum_inference_memory).
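Put as arithmetic, the reservation works out like this. The sketch below only reproduces the numbers from the diagram (the 800 MB base and the per-OS defaults); the real values live in model_management.py and may change between versions:

```python
# Rough model of the reserved-VRAM math described above (not ComfyUI's code).
MB = 1024 * 1024

def minimum_inference_memory(reserve_vram_gb=None, windows=True, stability_matrix=False):
    if reserve_vram_gb is not None:                 # --reserve-vram <gigabytes>
        extra_reserved = reserve_vram_gb * 1024 * MB
    elif stability_matrix:
        extra_reserved = 900 * MB                   # Stability Matrix default
    else:
        extra_reserved = (600 if windows else 400) * MB
    return extra_reserved + 800 * MB                # plus the 800 MB inference base

print(minimum_inference_memory() // MB)                   # 1400 MB (Windows default)
print(minimum_inference_memory(reserve_vram_gb=0) // MB)  # 800 MB  (--reserve-vram 0)
```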

Cross Attention Method
The cross attention method determines how attention calculations, including their memory management, are carried out.

- `--use-quad-cross-attention`: Always splits the data into four; slowest.
- `--use-split-cross-attention`: Splits data to save VRAM; slower.
- `--use-pytorch-cross-attention`: Fastest.
If unset, compatible GPUs default to `--use-pytorch-cross-attention`. Start with that, and switch if you hit an Out Of Memory (OOM) error.
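The "pytorch" option uses PyTorch's built-in fused attention, which dispatches to a memory-efficient or flash kernel when one is available. A self-contained illustration (the shapes here are arbitrary, not from ComfyUI):

```python
# PyTorch's fused attention call, available since PyTorch 2.0.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, tokens, head_dim)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```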
Model Loading and Unloading in ComfyUI
ComfyUI keeps loaded models in VRAM and unloads them based on scoring when VRAM runs low.
```mermaid
flowchart TB
subgraph Model Loading
DB1{Compare model VRAM usage<br>with available VRAM}
DC1(Load entire model)
DD1{Split model}
DE1(Normal inference)
DE2(Use RAM too<br>Model swaps between VRAM/RAM, slowing down)
end
DB1-->|VRAM > Model|DC1
DB1-->|VRAM < Model|DD1
DD1-->|Split model fits VRAM|DE1
DD1-->|Cannot split|DE2
subgraph Model Unloading
DF1{Loading a new model}
DH1(Load)
DG1(Unload based on scoring:<br>1. Model size<br>2. Used in other steps<br>3. Last loaded)
DI1(Inference)
end
DE1-->DF1
DE2-->DF1
DF1-->|Enough VRAM|DH1
DF1-->|Low VRAM|DG1
DG1-->DH1
DH1-->DI1
```
Scoring for Unloading Models
- Model size
- Used in other steps
- Last loaded
When VRAM capacity is insufficient, models are unloaded based on this scoring, but the model that gets unloaded is not always the ideal choice.
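As a toy illustration of the idea (not ComfyUI's actual algorithm), a scorer along these lines would free the least valuable models first when space runs out:

```python
# Score-based unloading sketch: prefer dropping models that are large,
# not needed by other steps, and were loaded longest ago.
from dataclasses import dataclass

@dataclass
class LoadedModel:
    name: str
    vram_bytes: int
    used_elsewhere: bool
    load_order: int  # higher = loaded more recently

def free_vram_for(new_model_bytes, free_bytes, loaded):
    # Sort so the "cheapest to lose" models come first.
    candidates = sorted(loaded, key=lambda m: (m.used_elsewhere, -m.vram_bytes, m.load_order))
    for m in candidates:
        if free_bytes >= new_model_bytes:
            break
        loaded.remove(m)
        free_bytes += m.vram_bytes
        print(f"Unloading {m.name} ({m.vram_bytes // 2**20} MB)")
    return free_bytes

models = [LoadedModel("text_encoder", 5_000 * 2**20, True, 1),
          LoadedModel("unet", 10_000 * 2**20, False, 2)]
free_vram_for(new_model_bytes=8_000 * 2**20, free_bytes=2_000 * 2**20, loaded=models)
# Unloads "unet" first: it is the largest and not used in other steps.
```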
For large models like Flux.1 or SD 3.5, using a custom node like ComfyUI-MultiGPU to explicitly assign model loading can smooth things out.
FP Formats for Text Encoder, UNET/Transformer, and VAE
Next, let’s check the FP formats for the text encoder, UNET/Transformer, and VAE.
Text Encoder
```mermaid
flowchart TB
subgraph Text Encoder Inference Format
GA1{Check manual settings}
GB1(FP32)
GB2(FP16)
GB3("FP8e4m3<br>(RTX 4000 series only)")
GB4("FP8e5m2<br>(RTX 4000 series only)")
end
GA1-->|--fp32-text-enc|GB1
GA1-->|--fp16-text-enc or unset|GB2
GA1-->|--fp8_e4m3fn-text-enc|GB3
GA1-->|--fp8_e5m2-text-enc|GB4
```
When using a text encoder other than FP16, pick the format matching your model. To use the FP32 text encoder I've mentioned before, enter `--fp32-text-enc` in the bottom input field of the startup settings.

UNET / Transformer
```mermaid
flowchart TB
subgraph UNET / Transformer Inference Format
EA1{Check manual settings}
EB1(FP64)
EB2(FP32)
EB3(FP16)
EB4(BF16)
EB5("FP8e4m3<br>(RTX 4000 series only)")
EB6("FP8e5m2<br>(RTX 4000 series only)")
EC1{Determine data type}
ED1(FP16)
ED2(BF16)
ED3(FP32)
end
EA1--->|--fp64-unet|EB1
EA1--->|--fp32-unet|EB2
EA1--->|--fp16-unet|EB3
EA1--->|--bf16-unet|EB4
EA1--->|--fp8_e4m3fn-unet|EB5
EA1--->|--fp8_e5m2-unet|EB6
EA1-->|Unset|EC1
EC1-->|FP16|ED1
EC1-->|BF16|ED2
EC1-->|Other|ED3
```
UNET/Transformer automatically detects FP32, FP16, or BF16 models. For FP8 models, select the format in the startup settings or the model-loading node.
On RTX 4000 series, FP8 is fast but reduces image quality.
FP8e4m3 offers better precision than FP8e5m2.
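If your PyTorch build exposes the FP8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2, added in recent releases), you can see the precision gap directly. This only illustrates rounding error on random numbers, not ComfyUI's FP8 inference path:

```python
# Compare mean rounding error of the two FP8 formats on random values.
import torch

if hasattr(torch, "float8_e4m3fn"):
    w = torch.randn(10_000)  # stand-in for a slice of UNET weights
    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        roundtrip = w.to(dtype).to(torch.float32)
        err = (w - roundtrip).abs().mean().item()
        print(f"{dtype}: mean absolute error {err:.4f}")
    # e4m3 keeps one extra mantissa bit, so its error is roughly half of e5m2's.
else:
    print("This PyTorch build has no FP8 dtypes.")
```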
VAE
```mermaid
flowchart TB
subgraph VAE Inference Format
HA1{Check manual settings}
HB1(FP32)
HB2(FP16)
HB3(BF16)
HB4(CPU)
HC1{GPU check}
HD1(FP32)
HD2(BF16)
end
HA1-->|--fp32-vae|HB1
HA1-->|--fp16-vae|HB2
HA1-->|--bf16-vae|HB3
HA1-->|--cpu-vae|HB4
HA1-->|Unset|HC1
HC1-->|NVIDIA RTX 3000+<br>or Apple Silicon|HD2
HC1-->|Other|HD1
```
The VAE selects BF16 format on RTX 3000 series or later, or Apple Silicon, and FP32 format for everything else. There’s almost no difference in image quality between the two.
Minimum Split Size by Model
Finally, let's look at the minimum split sizes for each AI model. The largest component that cannot be split is the UNET/Transformer. Here's what I measured:
Test Environment
- RTX 4060 Ti 16GB
- `--use-pytorch-cross-attention`
- `--normalvram`
Results

- Flux.1 (BF16): 10.1 GB
- SD 3.5 Large (FP16 + BF16): 10.1 GB
- AuraFlow (FP16): 10.1 GB
- SDXL (FP16): 4.3 GB
This shows that with the right startup settings, Flux.1, SD 3.5, and AuraFlow run on 12GB VRAM, while SDXL needs just 6GB VRAM in FP16/BF16 formats.
Other Startup Options
Startup options beyond what I’ve covered here are summarized on the next page.
Conclusion: ComfyUI Best Practices
- Use multiple GPUs or the CPU’s built-in GPU.
- Set `--reserve-vram` to 0.
- Use `--use-pytorch-cross-attention`.
When you install ComfyUI via Stability Matrix, it defaults to VRAM-saving, slower settings to ensure broad compatibility.
This is a necessary design choice for many users to avoid issues, but tweaking it for your setup can boost performance significantly.

To find the sweet spot, I recommend turning off all VRAM-saving options first, then reverting if you hit an Out Of Memory error.
Just applying the startup settings I’ve covered here will speed up ComfyUI a ton—give it a shot!
Thanks for reading to the end!