
CUDA Out of Memory: Complete Troubleshooting Guide

Jan 17, 2026

"CUDA out of memory" is the most frustrating error in deep learning. This guide covers every solution, from quick fixes to advanced techniques.

Understanding the Error

What the Error Looks Like

PyTorch:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 24.00 GiB total capacity; 21.58 GiB already allocated;
1.45 GiB free; 22.00 GiB reserved in total by PyTorch)

TensorFlow:

ResourceExhaustedError: OOM when allocating tensor with shape [32, 512, 512, 3]

Stable Diffusion:

torch.cuda.OutOfMemoryError: CUDA out of memory.

What Causes CUDA OOM?

Cause               | Description                             | Solution
Model too large     | Model parameters exceed VRAM            | Use smaller model or quantization
Batch size too high | Each sample uses memory                 | Reduce batch size
Input too large     | High-resolution images / long sequences | Reduce input size
Memory leak         | Tensors not freed properly              | Clear cache, fix code
Activation storage  | Activations kept for the backward pass  | Use gradient checkpointing
Multiple models     | Loading several models simultaneously   | Unload unused models

Quick Diagnosis

Step 1: Check Current GPU Memory Usage

# Real-time monitoring
watch -n 1 nvidia-smi

# One-time check
nvidia-smi

Output explanation:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 4090     On   | 00000000:01:00.0 Off |                  Off |
| 30%   45C    P2    75W / 450W |  18432MiB / 24564MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+

Key metric: 18432MiB / 24564MiB = 75% memory used

Step 2: Check Memory in Python

import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")

# Memory info
print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")

Step 3: Find Memory-Hungry Operations

# Enable memory tracking
torch.cuda.memory._record_memory_history()

# Run your code here
# ...

# Get memory snapshot
snapshot = torch.cuda.memory._snapshot()

# Or dump it to a file and inspect it with PyTorch's memory visualizer
# (open the pickle at https://pytorch.org/memory_viz)
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
Quick Fixes (Try These First)

Fix 1: Reduce Batch Size

The simplest and most effective solution:

# Before
train_loader = DataLoader(dataset, batch_size=32)  # OOM!

# After
train_loader = DataLoader(dataset, batch_size=8)   # Works!

Rule of thumb: If OOM occurs, halve the batch size until it works.
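
If you want to automate the halving, a minimal sketch could look like the following (run_one_step is a hypothetical stand-in for your own forward/backward pass):

import torch
from torch.utils.data import DataLoader

def find_max_batch_size(dataset, run_one_step, start=32):
    """Halve the batch size until a single step fits in VRAM."""
    bs = start
    while bs >= 1:
        try:
            loader = DataLoader(dataset, batch_size=bs)
            run_one_step(next(iter(loader)))  # one forward/backward pass
            return bs                         # this batch size fits
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()          # release the failed allocation
            bs //= 2                          # try half
    raise RuntimeError("Even batch_size=1 does not fit in VRAM")

# Usage: best_bs = find_max_batch_size(dataset, run_one_step)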

Fix 2: Clear GPU Cache

import torch
import gc

# Drop unreferenced Python objects first...
gc.collect()

# ...then return PyTorch's cached blocks to the driver
torch.cuda.empty_cache()

When to use: After loading/unloading models, between training runs.
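
Note that empty_cache() can only release memory that is no longer referenced anywhere in Python. A typical pattern for unloading a model between runs (assuming a model variable you are finished with) is:

del model                   # drop the last Python reference
gc.collect()                # free the wrapper objects
torch.cuda.empty_cache()    # return the cached blocks to the driver
print(f"Still allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")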

Fix 3: Use Mixed Precision (FP16/BF16)

Reduces memory usage by ~50% with minimal accuracy loss:

# PyTorch native AMP
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()

    with autocast():  # FP16 forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    scaler.scale(loss).backward()  # Scaled backward pass
    scaler.step(optimizer)
    scaler.update()

For Hugging Face Transformers:

from transformers import TrainingArguments

training_args = TrainingArguments(
    fp16=True,  # Enable FP16
    # or bf16=True for newer GPUs (RTX 3090+, A100, H100)
)

Fix 4: Enable Gradient Checkpointing

Trades compute for memory by recomputing activations:

# Hugging Face Transformers: enable it on any pretrained model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
model.gradient_checkpointing_enable()

# With the Trainer, set TrainingArguments(gradient_checkpointing=True) instead.
# In plain PyTorch, wrap expensive blocks with torch.utils.checkpoint.checkpoint.

Memory savings: 50-70% for transformer models

Fix 5: Use Smaller Data Types

# Load model in lower precision
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    torch_dtype=torch.float16,  # Half precision
    # or torch_dtype=torch.bfloat16 for A100/H100
)

Advanced Solutions

Solution 1: 8-bit and 4-bit Quantization

8-bit Quantization (bitsandbytes):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    load_in_8bit=True,
    device_map="auto"
)

4-bit Quantization (QLoRA):

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    quantization_config=bnb_config,
    device_map="auto"
)

Memory comparison:

Model       | FP32   | FP16   | 8-bit | 4-bit
Llama 2 7B  | 28 GB  | 14 GB  | 7 GB  | 3.5 GB
Llama 2 13B | 52 GB  | 26 GB  | 13 GB | 6.5 GB
Llama 2 70B | 280 GB | 140 GB | 70 GB | 35 GB
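
These figures are essentially the parameter count multiplied by the bytes stored per weight (activations and KV cache come on top). A quick back-of-the-envelope check:

def weights_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

for bits, label in [(32, "FP32"), (16, "FP16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"Llama 2 7B in {label}: {weights_memory_gb(7e9, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB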

Solution 2: Gradient Accumulation

Simulate larger batch sizes without more memory:

accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Effective batch size: actual_batch_size × accumulation_steps

Solution 3: CPU Offloading

Keep model weights in CPU RAM and copy them to the GPU only when they are needed. For inference, Accelerate can do this per layer:

from accelerate import cpu_offload

# Weights stay on the CPU and are moved to the GPU layer by layer during the forward pass
cpu_offload(model, execution_device="cuda:0")

For training, DeepSpeed ZeRO-Offload moves optimizer states and parameters to CPU memory:

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}
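
If you train with the Hugging Face Trainer, save the config above to a file (here a hypothetical ds_config.json) and point the training arguments at it (requires the deepspeed package):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # path to the ZeRO-Offload config
    per_device_train_batch_size=4,
    fp16=True,
)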

Solution 4: Model Parallelism

Split model across multiple GPUs:

from transformers import AutoModelForCausalLM

# device_map="auto" uses Accelerate to place layers across every available GPU
# (and spills over to CPU RAM if the GPUs alone are not enough)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    device_map="auto"
)
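
You can also cap how much memory the automatic placement may use per device, which keeps headroom for activations (the limits below are illustrative, assuming two 24 GB GPUs):

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "100GiB"},
)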

Solution 5: Efficient Attention (Flash Attention)

# Install flash-attn
# pip install flash-attn --no-build-isolation

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

Memory savings: 20-40% for long sequences
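
If flash-attn does not build or is not supported on your GPU, PyTorch's built-in scaled-dot-product attention is a reasonable fallback in recent transformers versions:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    attn_implementation="sdpa",  # torch.nn.functional.scaled_dot_product_attention
    torch_dtype=torch.bfloat16
)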

Framework-Specific Solutions

PyTorch

# 1. Disable gradient for inference
with torch.no_grad():
    outputs = model(inputs)

# 2. Delete tensors when done
del tensor
torch.cuda.empty_cache()

# 3. Use inference mode (faster than no_grad)
with torch.inference_mode():
    outputs = model(inputs)

# 4. Limit memory fragmentation (must be set before the first CUDA allocation;
#    alternatively export it in your shell before launching Python)
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

TensorFlow

import tensorflow as tf

# 1. Enable memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# 2. Limit GPU memory
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=8192)]  # 8GB limit
)

# 3. Clear session
tf.keras.backend.clear_session()

Stable Diffusion / ComfyUI

# 1. Use attention slicing
pipe.enable_attention_slicing()

# 2. Use VAE slicing for large images
pipe.enable_vae_slicing()

# 3. Use sequential CPU offload
pipe.enable_sequential_cpu_offload()

# 4. Use model CPU offload (less aggressive)
pipe.enable_model_cpu_offload()

# 5. Use xformers memory efficient attention
pipe.enable_xformers_memory_efficient_attention()

Automatic1111 WebUI flags:

python launch.py --medvram  # Medium VRAM optimization
python launch.py --lowvram  # Low VRAM optimization (slower)
python launch.py --xformers # Enable xformers

Hugging Face Transformers

from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    fp16=True,
    optim="adamw_bnb_8bit",  # 8-bit Adam optimizer from bitsandbytes
)

Memory Requirements Reference

LLM Training Memory Formula

Memory ≈ Model Parameters × 4 bytes (FP32)
       + Optimizer States × 8-12 bytes per parameter (Adam)
       + Gradients × 4 bytes per parameter
       + Activations (depends on batch size and sequence length)
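
Plugging a 7B-parameter model into this formula (FP32 weights, Adam, activations excluded) shows why full-precision training rarely fits on a single GPU:

n_params  = 7e9                       # e.g. a 7B-parameter model
weights   = n_params * 4 / 1e9        # 28 GB of FP32 weights
optimizer = n_params * 8 / 1e9        # 56 GB of Adam states (lower bound)
gradients = n_params * 4 / 1e9        # 28 GB of FP32 gradients
print(f"~{weights + optimizer + gradients:.0f} GB before activations")  # ~112 GB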

Approximate VRAM needed:

Model | Inference (FP16) | Fine-tuning (FP16) | Full Training
7B    | 14 GB            | 20 GB              | 56 GB
13B   | 26 GB            | 40 GB              | 104 GB
70B   | 140 GB           | 200 GB             | 560 GB

Image Generation Memory

Model             | Resolution | Minimum VRAM | Recommended
SD 1.5            | 512×512    | 4 GB         | 8 GB
SD 1.5            | 768×768    | 6 GB         | 10 GB
SDXL              | 1024×1024  | 8 GB         | 12 GB
SDXL + ControlNet | 1024×1024  | 12 GB        | 16 GB
Flux.1            | 1024×1024  | 16 GB        | 24 GB

Debugging Memory Leaks

Identify Leaks

import torch

def check_memory():
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")

# Check at different points
check_memory()  # Before
output = model(input)
check_memory()  # After forward
loss.backward()
check_memory()  # After backward

Common Leak Patterns

1. Storing tensors in lists:

# BAD - accumulates GPU memory
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss)  # Keeps tensor on GPU!

# GOOD - detach and move to CPU
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss.detach().cpu().item())

2. Not clearing gradients:

# BAD - gradients accumulate
for batch in dataloader:
    loss = model(batch)
    loss.backward()

# GOOD - zero gradients
for batch in dataloader:
    optimizer.zero_grad()  # Clear previous gradients
    loss = model(batch)
    loss.backward()
    optimizer.step()

3. Keeping computation graph:

# BAD - keeps entire computation graph
total_loss = 0
for batch in dataloader:
    loss = model(batch)
    total_loss += loss  # Keeps graph!

# GOOD - detach from graph
total_loss = 0
for batch in dataloader:
    loss = model(batch)
    total_loss += loss.item()  # Just the number

Prevention Strategies

1. Estimate Memory Before Running

def estimate_model_memory(model, precision='fp32'):
    """Rough training-memory estimate: parameters, gradients and Adam states.

    Activations are not included, so the real figure also grows with batch
    size and sequence length.
    """
    param_count = sum(p.numel() for p in model.parameters())

    bytes_per_param = {'fp32': 4, 'fp16': 2, 'int8': 1}[precision]

    # Parameters + gradients + optimizer states (Adam adds roughly 2-3x the parameter size)
    training_memory = param_count * bytes_per_param * 4

    return training_memory / 1e9  # GB

# Usage
memory_needed = estimate_model_memory(model)
print(f"Estimated memory: {memory_needed:.2f} GB")

2. Start Small, Scale Up

# Start with the minimum and gradually increase the batch size until OOM
# (train_step is a placeholder for your own forward/backward step)
for bs in [1, 2, 4, 8, 16, 32]:
    try:
        train_step(batch_size=bs)
        print(f"Batch size {bs}: OK")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"Batch size {bs}: OOM! Use {bs // 2}")
            break
        raise

3. Monitor During Training

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

for step, batch in enumerate(dataloader):
    # ... training code ...

    # Log memory usage
    writer.add_scalar('memory/allocated',
                      torch.cuda.memory_allocated(0) / 1e9, step)
    writer.add_scalar('memory/reserved',
                      torch.cuda.memory_reserved(0) / 1e9, step)

Quick Reference: OOM Solutions by Situation

Situation      | First Try         | If Still OOM           | Last Resort
Training       | Reduce batch size | Gradient checkpointing | 8-bit optimizer
Fine-tuning    | Use LoRA/QLoRA    | 4-bit quantization     | CPU offload
Inference      | FP16/BF16         | 4-bit quantization     | CPU offload
Image Gen      | --medvram flag    | --lowvram flag         | Reduce resolution
Long sequences | Flash Attention   | Reduce seq length      | Chunked processing

Summary

  1. Diagnose first: Use nvidia-smi and PyTorch memory functions
  2. Quick fixes: Reduce batch size, clear cache, use FP16
  3. Advanced: Quantization, gradient checkpointing, CPU offload
  4. Prevent: Estimate memory beforehand, monitor during training
  5. When all else fails: Get more VRAM (upgrade GPU or use cloud)

Remember: CUDA OOM is usually solvable. Start with the simplest solutions and work your way up.


Need more VRAM? Browse SynpixCloud's marketplace for RTX 4090 (24GB), A100 (80GB), and H100 (80GB) instances.

SynpixCloud Team