"CUDA out of memory" is the most frustrating error in deep learning. This guide covers every solution, from quick fixes to advanced techniques.
Understanding the Error
What the Error Looks Like
PyTorch:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 24.00 GiB total capacity; 21.58 GiB already allocated;
1.45 GiB free; 22.00 GiB reserved in total by PyTorch)
TensorFlow:
ResourceExhaustedError: OOM when allocating tensor with shape [32, 512, 512, 3]
Stable Diffusion:
torch.cuda.OutOfMemoryError: CUDA out of memory.
What Causes CUDA OOM?
| Cause | Description | Solution |
|---|---|---|
| Model too large | Model parameters exceed VRAM | Use smaller model or quantization |
| Batch size too high | Each sample uses memory | Reduce batch size |
| Input too large | High-resolution images/long sequences | Reduce input size |
| Memory leak | Tensors not freed properly | Clear cache, fix code |
| Activation storage | Activations kept for the backward pass | Use gradient checkpointing |
| Multiple models | Loading several models simultaneously | Unload unused models |
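If you would rather handle the failure in code than read a traceback, recent PyTorch releases expose torch.cuda.OutOfMemoryError as its own exception class (a subclass of RuntimeError). A minimal sketch, assuming a hypothetical run_step function that performs one forward/backward pass:
import torch

def run_with_oom_fallback(run_step, batch, smaller_batch):
    """Try one step; on CUDA OOM, free the cache and retry with a smaller batch."""
    try:
        return run_step(batch)
    except torch.cuda.OutOfMemoryError:  # available in PyTorch 1.13+
        torch.cuda.empty_cache()
        return run_step(smaller_batch)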
Quick Diagnosis
Step 1: Check Current GPU Memory Usage
# Real-time monitoring
watch -n 1 nvidia-smi
# One-time check
nvidia-smi
Output explanation:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA RTX 4090 On | 00000000:01:00.0 Off | Off |
| 30% 45C P2 75W / 450W | 18432MiB / 24564MiB | 85% Default |
+-------------------------------+----------------------+----------------------+
Key metric: 18432MiB / 24564MiB = 75% memory used
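You can read the same device-level numbers from Python with torch.cuda.mem_get_info, which reports free and total memory as the driver sees them, so the result lines up with nvidia-smi rather than with PyTorch's allocator counters:
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # (free, total) for GPU 0, in bytes
used_bytes = total_bytes - free_bytes
print(f"GPU 0: {used_bytes / 1e9:.2f} GB used of {total_bytes / 1e9:.2f} GB "
      f"({100 * used_bytes / total_bytes:.0f}%)")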
Step 2: Check Memory in Python
import torch
# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
# Memory info
print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")Step 3: Find Memory-Hungry Operations
# Enable memory tracking
torch.cuda.memory._record_memory_history()
# Run your code here
# ...
# Get memory snapshot
snapshot = torch.cuda.memory._snapshot()
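# Optionally write the snapshot to disk and drop the file into PyTorch's
# memory visualizer; note that _record_memory_history/_dump_snapshot are
# private APIs (PyTorch 2.1+) and may change between releases
torch.cuda.memory._dump_snapshot("oom_snapshot.pickle")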
# Analyze with PyTorch's memory visualization tools
Quick Fixes (Try These First)
Fix 1: Reduce Batch Size
The simplest and most effective solution:
# Before
train_loader = DataLoader(dataset, batch_size=32) # OOM!
# After
train_loader = DataLoader(dataset, batch_size=8) # Works!
Rule of thumb: If OOM occurs, halve the batch size until it works.
Fix 2: Clear GPU Cache
import torch
import gc
# Clear PyTorch cache
torch.cuda.empty_cache()
# Force garbage collection
gc.collect()
When to use: After loading/unloading models, between training runs.
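Keep in mind that empty_cache() can only return blocks that nothing references any more, so drop your Python references first. A minimal sketch for fully releasing a model you have finished with (model being whatever you loaded earlier):
import gc
import torch

del model                  # drop the Python reference
gc.collect()               # collect any lingering reference cycles
torch.cuda.empty_cache()   # hand the freed blocks back to the driver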
Fix 3: Use Mixed Precision (FP16/BF16)
Reduces memory usage by ~50% with minimal accuracy loss:
# PyTorch native AMP
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    with autocast():  # FP16 forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # Scaled backward pass
    scaler.step(optimizer)
    scaler.update()
For Hugging Face Transformers:
from transformers import TrainingArguments
training_args = TrainingArguments(
fp16=True, # Enable FP16
# or bf16=True for newer GPUs (RTX 3090+, A100, H100)
)
Fix 4: Enable Gradient Checkpointing
Trades compute for memory by recomputing activations:
# Hugging Face Transformers models
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
model.gradient_checkpointing_enable()
# Plain PyTorch: wrap expensive submodules with torch.utils.checkpoint.checkpoint()
Memory savings: 50-70% for transformer models
Fix 5: Use Smaller Data Types
# Load model in lower precision
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
torch_dtype=torch.float16, # Half precision
# or torch_dtype=torch.bfloat16 for A100/H100
)
Advanced Solutions
Solution 1: 8-bit and 4-bit Quantization
8-bit Quantization (bitsandbytes):
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
load_in_8bit=True,
device_map="auto"
)
4-bit Quantization (QLoRA):
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
quantization_config=bnb_config,
device_map="auto"
)
Memory comparison:
| Model | FP32 | FP16 | 8-bit | 4-bit |
|---|---|---|---|---|
| Llama 2 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| Llama 2 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| Llama 2 70B | 280 GB | 140 GB | 70 GB | 35 GB |
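The table is just parameter count times bytes per parameter. A quick sketch that reproduces it for the weights alone (activations and KV cache come on top of this):
def weight_memory_gb(n_params, bits_per_param):
    """Approximate VRAM for model weights only, in GB."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("Llama 2 7B", 7e9), ("Llama 2 70B", 70e9)]:
    for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name} @ {label}: {weight_memory_gb(n_params, bits):.1f} GB")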
Solution 2: Gradient Accumulation
Simulate larger batch sizes without more memory:
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    inputs, targets = batch
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize the loss so gradients average correctly
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Effective batch size: actual_batch_size × accumulation_steps
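If you already use Hugging Face Accelerate, it can manage the step counting for you; a minimal sketch, assuming model, optimizer, criterion, and dataloader are defined as above:
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # weights update only every 4th iteration
        inputs, targets = batch
        loss = criterion(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()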
Solution 3: CPU Offloading
Move parts of the model to CPU:
For inference, Hugging Face Accelerate can keep the weights in CPU RAM and copy them to the GPU one layer at a time:
from accelerate import cpu_offload
cpu_offload(model, execution_device="cuda:0")
DeepSpeed ZeRO-Offload (moves optimizer states and parameters to the CPU during training):
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    }
  }
}
Point your trainer at this file, e.g. TrainingArguments(deepspeed="ds_config.json") with the Hugging Face Trainer.
Solution 4: Model Parallelism
Split model across multiple GPUs:
from accelerate import Accelerator
accelerator = Accelerator()
model = accelerator.prepare(model)
# Or manually with device_map
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
device_map="auto" # Automatically splits across GPUs
)
Solution 5: Efficient Attention (Flash Attention)
# Install flash-attn
# pip install flash-attn --no-build-isolation
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16
)
Memory savings: 20-40% for long sequences
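If you cannot install flash-attn (for example on Windows or on older GPUs), PyTorch 2.x's built-in scaled-dot-product attention is a reasonable fallback that needs no extra package; recent Transformers versions select it the same way:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    attn_implementation="sdpa",  # PyTorch-native memory-efficient attention
    torch_dtype=torch.bfloat16
)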
Framework-Specific Solutions
PyTorch
# 1. Disable gradient for inference
with torch.no_grad():
    outputs = model(inputs)
# 2. Delete tensors when done
del tensor
torch.cuda.empty_cache()
# 3. Use inference mode (faster than no_grad)
with torch.inference_mode():
    outputs = model(inputs)
# 4. Limit memory fragmentation (set this before the first CUDA allocation,
#    or export it in your shell)
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
TensorFlow
import tensorflow as tf
# 1. Enable memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
# 2. Limit GPU memory
tf.config.set_logical_device_configuration(
gpus[0],
[tf.config.LogicalDeviceConfiguration(memory_limit=8192)] # 8GB limit
)
# 3. Clear session
tf.keras.backend.clear_session()
Stable Diffusion / ComfyUI
# 1. Use attention slicing
pipe.enable_attention_slicing()
# 2. Use VAE slicing for large images
pipe.enable_vae_slicing()
# 3. Use sequential CPU offload
pipe.enable_sequential_cpu_offload()
# 4. Use model CPU offload (less aggressive)
pipe.enable_model_cpu_offload()
# 5. Use xformers memory efficient attention
pipe.enable_xformers_memory_efficient_attention()
Automatic1111 WebUI flags:
python launch.py --medvram # Medium VRAM optimization
python launch.py --lowvram # Low VRAM optimization (slower)
python launch.py --xformers # Enable xformers
Hugging Face Transformers
from transformers import TrainingArguments
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
gradient_checkpointing=True,
fp16=True,
optim="adamw_8bit", # 8-bit optimizer
)
Memory Requirements Reference
LLM Training Memory Formula
Memory ≈ Model Parameters × 4 bytes (FP32)
       + Optimizer States × 8-12 bytes per parameter (Adam)
       + Gradients × 4 bytes per parameter
       + Activations (depends on batch size and sequence length)
Approximate VRAM needed:
| Model | Inference (FP16) | LoRA Fine-tuning (FP16) | Full Training (FP32, Adam) |
|---|---|---|---|
| 7B | 14 GB | ~20 GB | ~112 GB |
| 13B | 26 GB | ~40 GB | ~208 GB |
| 70B | 140 GB | ~200 GB | ~1,120 GB |
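As a back-of-the-envelope check, here is the formula applied to a 7B model, assuming ~8 bytes per parameter for Adam's two moments and ignoring activations:
n_params = 7e9
weights_fp32 = n_params * 4   # 28 GB
adam_states = n_params * 8    # 56 GB (two FP32 moments)
gradients = n_params * 4      # 28 GB
print(f"{(weights_fp32 + adam_states + gradients) / 1e9:.0f} GB before activations")  # ~112 GB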
Image Generation Memory
| Model | Resolution | Minimum VRAM | Recommended |
|---|---|---|---|
| SD 1.5 | 512ร512 | 4 GB | 8 GB |
| SD 1.5 | 768ร768 | 6 GB | 10 GB |
| SDXL | 1024ร1024 | 8 GB | 12 GB |
| SDXL + ControlNet | 1024ร1024 | 12 GB | 16 GB |
| Flux.1 | 1024ร1024 | 16 GB | 24 GB |
Debugging Memory Leaks
Identify Leaks
import torch
def check_memory():
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
# Check at different points
check_memory() # Before
output = model(input)
check_memory() # After forward
loss.backward()
check_memory() # After backward
Common Leak Patterns
1. Storing tensors in lists:
# BAD - accumulates GPU memory
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss)  # Keeps tensor on GPU!
# GOOD - store a plain Python number instead
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss.item())  # .item() returns a float and releases the graph
2. Not clearing gradients:
# BAD - gradients accumulate
for batch in dataloader:
    loss = model(batch)
    loss.backward()
# GOOD - zero gradients
for batch in dataloader:
    optimizer.zero_grad()  # Clear previous gradients
    loss = model(batch)
    loss.backward()
    optimizer.step()
3. Keeping computation graph:
# BAD - keeps entire computation graph
total_loss = 0
for batch in dataloader:
    loss = model(batch)
    total_loss += loss  # Keeps graph!
# GOOD - detach from graph
total_loss = 0
for batch in dataloader:
    loss = model(batch)
    total_loss += loss.item()  # Just the number
Prevention Strategies
1. Estimate Memory Before Running
def estimate_model_memory(model, precision='fp32'):
    """Rough training-memory estimate: parameters, gradients, and Adam states.
    Activations (which scale with batch size and sequence length) are not included."""
    param_size = sum(p.numel() for p in model.parameters())
    bytes_per_param = {'fp32': 4, 'fp16': 2, 'int8': 1}[precision]
    # Parameters + gradients + two Adam moments ≈ 4x the parameter memory
    training_memory = param_size * bytes_per_param * 4
    return training_memory / 1e9  # GB
# Usage
memory_needed = estimate_model_memory(model)
print(f"Estimated memory: {memory_needed:.2f} GB")
2. Start Small, Scale Up
# Start with minimum settings
batch_size = 1
# Gradually increase until OOM
for bs in [1, 2, 4, 8, 16, 32]:
    try:
        train_step(batch_size=bs)
        print(f"Batch size {bs}: OK")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"Batch size {bs}: OOM! Use {bs // 2}")
            break
        raise
3. Monitor During Training
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
for step, batch in enumerate(dataloader):
    # ... training code ...
    # Log memory usage
    writer.add_scalar('memory/allocated',
                      torch.cuda.memory_allocated(0) / 1e9, step)
    writer.add_scalar('memory/reserved',
                      torch.cuda.memory_reserved(0) / 1e9, step)
Quick Reference: OOM Solutions by Situation
| Situation | First Try | If Still OOM | Last Resort |
|---|---|---|---|
| Training | Reduce batch size | Gradient checkpointing | 8-bit optimizer |
| Fine-tuning | Use LoRA/QLoRA | 4-bit quantization | CPU offload |
| Inference | FP16/BF16 | 4-bit quantization | CPU offload |
| Image Gen | --medvram flag | --lowvram flag | Reduce resolution |
| Long sequences | Flash Attention | Reduce seq length | Chunked processing |
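"Chunked processing" in the last row means splitting one long input into pieces that fit in memory and running them through the model one at a time. A minimal inference sketch, assuming a Hugging Face-style model whose output has a .logits attribute (chunk_len is a hypothetical knob you tune to your VRAM):
import torch

@torch.no_grad()
def run_in_chunks(model, input_ids, chunk_len=2048):
    """Process a long token sequence in fixed-size chunks to bound peak memory."""
    outputs = []
    for start in range(0, input_ids.shape[1], chunk_len):
        chunk = input_ids[:, start:start + chunk_len]
        outputs.append(model(chunk).logits.cpu())  # move results off the GPU right away
    return torch.cat(outputs, dim=1)
Note that this simple version processes chunks independently, so the model sees no context across chunk boundaries; add task-specific overlap or caching if that matters.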
Summary
- Diagnose first: Use nvidia-smi and PyTorch memory functions
- Quick fixes: Reduce batch size, clear cache, use FP16
- Advanced: Quantization, gradient checkpointing, CPU offload
- Prevent: Estimate memory beforehand, monitor during training
- When all else fails: Get more VRAM (upgrade GPU or use cloud)
Remember: CUDA OOM is usually solvable. Start with the simplest solutions and work your way up.
Need more VRAM? Browse SynpixCloud's marketplace for RTX 4090 (24GB), A100 (80GB), and H100 (80GB) instances.
