"CUDA out of memory" is the most frustrating error in deep learning. This guide covers every solution, from quick fixes to advanced techniques.
Understanding the Error
What the Error Looks Like
PyTorch:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 24.00 GiB total capacity; 21.58 GiB already allocated;
1.45 GiB free; 22.00 GiB reserved in total by PyTorch)
TensorFlow:
ResourceExhaustedError: OOM when allocating tensor with shape [32, 512, 512, 3]
Stable Diffusion:
torch.cuda.OutOfMemoryError: CUDA out of memory.
What Causes CUDA OOM?
| Cause | Description | Solution |
|---|---|---|
| Model too large | Model parameters exceed VRAM | Use smaller model or quantization |
| Batch size too high | Each sample uses memory | Reduce batch size |
| Input too large | High-resolution images/long sequences | Reduce input size |
| Memory leak | Tensors not freed properly | Clear cache, fix code |
| Activation storage | Activations kept for the backward pass | Use gradient checkpointing |
| Multiple models | Loading several models simultaneously | Unload unused models |
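If you would rather handle the failure in code than read a traceback, recent PyTorch releases expose torch.cuda.OutOfMemoryError as its own exception class (a subclass of RuntimeError). A minimal sketch, assuming a hypothetical run_step function that performs one forward/backward pass:
import torch

def run_with_oom_fallback(run_step, batch, smaller_batch):
    """Try one step; on CUDA OOM, free the cache and retry with a smaller batch."""
    try:
        return run_step(batch)
    except torch.cuda.OutOfMemoryError:  # available in PyTorch 1.13+
        torch.cuda.empty_cache()
        return run_step(smaller_batch)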
Quick Diagnosis
Step 1: Check Current GPU Memory Usage
# Real-time monitoring
watch -n 1 nvidia-smi
# One-time check
nvidia-smi
Output explanation:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA RTX 4090 On | 00000000:01:00.0 Off | Off |
| 30% 45C P2 75W / 450W | 18432MiB / 24564MiB | 85% Default |
+-------------------------------+----------------------+----------------------+
Key metric: 18432MiB / 24564MiB = 75% memory used
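You can read the same device-level numbers from Python with torch.cuda.mem_get_info, which reports free and total memory as the driver sees them, so the result lines up with nvidia-smi rather than with PyTorch's allocator counters:
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # (free, total) for GPU 0, in bytes
used_bytes = total_bytes - free_bytes
print(f"GPU 0: {used_bytes / 1e9:.2f} GB used of {total_bytes / 1e9:.2f} GB "
      f"({100 * used_bytes / total_bytes:.0f}%)")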
Step 2: Check Memory in Python
import torch
# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
# Memory info
print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")Step 3: Find Memory-Hungry Operations
# Enable memory tracking
torch.cuda.memory._record_memory_history()
# Run your code here
# ...
# Get memory snapshot
snapshot = torch.cuda.memory._snapshot()
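# Optionally write the snapshot to disk and drop the file into PyTorch's
# memory visualizer; note that _record_memory_history/_dump_snapshot are
# private APIs (PyTorch 2.1+) and may change between releases
torch.cuda.memory._dump_snapshot("oom_snapshot.pickle")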
# Analyze with PyTorch's memory visualization tools
Quick Fixes (Try These First)
Fix 1: Reduce Batch Size
The simplest and most effective solution:
# Before
train_loader = DataLoader(dataset, batch_size=32) # OOM!
# After
train_loader = DataLoader(dataset, batch_size=8) # Works!
Rule of thumb: If OOM occurs, halve the batch size until it works.
Fix 2: Clear GPU Cache
import torch
import gc
# Clear PyTorch cache
torch.cuda.empty_cache()
# Force garbage collection
gc.collect()
When to use: After loading/unloading models, between training runs.
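Keep in mind that empty_cache() can only return blocks that nothing references any more, so drop your Python references first. A minimal sketch for fully releasing a model you have finished with (model being whatever you loaded earlier):
import gc
import torch

del model                  # drop the Python reference
gc.collect()               # collect any lingering reference cycles
torch.cuda.empty_cache()   # hand the freed blocks back to the driver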
Fix 3: Use Mixed Precision (FP16/BF16)
Reduces memory usage by ~50% with minimal accuracy loss:
# PyTorch native AMP
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    with autocast():  # FP16 forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # Scaled backward pass
    scaler.step(optimizer)
    scaler.update()
For Hugging Face Transformers:
from transformers import TrainingArguments
training_args = TrainingArguments(
fp16=True, # Enable FP16
# or bf16=True for newer GPUs (RTX 3090+, A100, H100)
)
Fix 4: Enable Gradient Checkpointing
Trades compute for memory by recomputing activations:
# Hugging Face Transformers models
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
model.gradient_checkpointing_enable()
# Plain PyTorch: wrap expensive submodules with torch.utils.checkpoint.checkpoint()
Memory savings: 50-70% for transformer models
Fix 5: Use Smaller Data Types
# Load model in lower precision
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
torch_dtype=torch.float16, # Half precision
# or torch_dtype=torch.bfloat16 for A100/H100
)
Advanced Solutions
Solution 1: 8-bit and 4-bit Quantization
8-bit Quantization (bitsandbytes):
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
load_in_8bit=True,
device_map="auto"
)
4-bit Quantization (QLoRA):
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
quantization_config=bnb_config,
device_map="auto"
)
Memory comparison:
| Model | FP32 | FP16 | 8-bit | 4-bit |
|---|---|---|---|---|
| Llama 2 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| Llama 2 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| Llama 2 70B | 280 GB | 140 GB | 70 GB | 35 GB |
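The table is just parameter count times bytes per parameter. A quick sketch that reproduces it for the weights alone (activations and KV cache come on top of this):
def weight_memory_gb(n_params, bits_per_param):
    """Approximate VRAM for model weights only, in GB."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("Llama 2 7B", 7e9), ("Llama 2 70B", 70e9)]:
    for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name} @ {label}: {weight_memory_gb(n_params, bits):.1f} GB")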
Solution 2: Gradient Accumulation
Simulate larger batch sizes without more memory:
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    inputs, targets = batch
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize the loss so gradients average correctly
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Effective batch size: actual_batch_size × accumulation_steps
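If you already use Hugging Face Accelerate, it can manage the step counting for you; a minimal sketch, assuming model, optimizer, criterion, and dataloader are defined as above:
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # weights update only every 4th iteration
        inputs, targets = batch
        loss = criterion(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()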
Solution 3: CPU Offloading
Move parts of the model to CPU:
For inference, Hugging Face Accelerate can keep the weights in CPU RAM and copy them to the GPU one layer at a time:
from accelerate import cpu_offload
cpu_offload(model, execution_device="cuda:0")
DeepSpeed ZeRO-Offload (moves optimizer states and parameters to the CPU during training):
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    }
  }
}
Point your trainer at this file, e.g. TrainingArguments(deepspeed="ds_config.json") with the Hugging Face Trainer.
Solution 4: Model Parallelism
Split model across multiple GPUs:
from accelerate import Accelerator
accelerator = Accelerator()
model = accelerator.prepare(model)
# Or manually with device_map
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
device_map="auto" # Automatically splits across GPUs
)
Solution 5: Efficient Attention (Flash Attention)
# Install flash-attn
# pip install flash-attn --no-build-isolation
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16
)
Memory savings: 20-40% for long sequences
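If you cannot install flash-attn (for example on Windows or on older GPUs), PyTorch 2.x's built-in scaled-dot-product attention is a reasonable fallback that needs no extra package; recent Transformers versions select it the same way:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    attn_implementation="sdpa",  # PyTorch-native memory-efficient attention
    torch_dtype=torch.bfloat16
)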
Framework-Specific Solutions
PyTorch
# 1. Disable gradient for inference
with torch.no_grad():
    outputs = model(inputs)
# 2. Delete tensors when done
del tensor
torch.cuda.empty_cache()
# 3. Use inference mode (faster than no_grad)
with torch.inference_mode():
    outputs = model(inputs)
# 4. Limit memory fragmentation (set this before the first CUDA allocation,
#    or export it in your shell)
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
TensorFlow
import tensorflow as tf
# 1. Enable memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
# 2. Limit GPU memory
tf.config.set_logical_device_configuration(
gpus[0],
[tf.config.LogicalDeviceConfiguration(memory_limit=8192)] # 8GB limit
)
# 3. Clear session
tf.keras.backend.clear_session()
Stable Diffusion / ComfyUI
# 1. Use attention slicing
pipe.enable_attention_slicing()
# 2. Use VAE slicing for large images
pipe.enable_vae_slicing()
# 3. Use sequential CPU offload
pipe.enable_sequential_cpu_offload()
# 4. Use model CPU offload (less aggressive)
pipe.enable_model_cpu_offload()
# 5. Use xformers memory efficient attention
pipe.enable_xformers_memory_efficient_attention()
Automatic1111 WebUI flags:
python launch.py --medvram # Medium VRAM optimization
python launch.py --lowvram # Low VRAM optimization (slower)
python launch.py --xformers # Enable xformers
Hugging Face Transformers
from transformers import TrainingArguments
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
gradient_checkpointing=True,
fp16=True,
optim="adamw_8bit", # 8-bit optimizer
)
Memory Requirements Reference
LLM Training Memory Formula
Memory ≈ Model Parameters × 4 bytes (FP32)
       + Optimizer States × 8-12 bytes per parameter (Adam)
       + Gradients × 4 bytes per parameter
       + Activations (depends on batch size and sequence length)
Approximate VRAM needed:
| Model | Inference (FP16) | LoRA Fine-tuning (FP16) | Full Training (FP32, Adam) |
|---|---|---|---|
| 7B | 14 GB | ~20 GB | ~112 GB |
| 13B | 26 GB | ~40 GB | ~208 GB |
| 70B | 140 GB | ~200 GB | ~1,120 GB |
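As a back-of-the-envelope check, here is the formula applied to a 7B model, assuming ~8 bytes per parameter for Adam's two moments and ignoring activations:
n_params = 7e9
weights_fp32 = n_params * 4   # 28 GB
adam_states = n_params * 8    # 56 GB (two FP32 moments)
gradients = n_params * 4      # 28 GB
print(f"{(weights_fp32 + adam_states + gradients) / 1e9:.0f} GB before activations")  # ~112 GB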
Image Generation Memory
| Model | Resolution | Minimum VRAM | Recommended |
|---|---|---|---|
| SD 1.5 | 512ร512 | 4 GB | 8 GB |
| SD 1.5 | 768ร768 | 6 GB | 10 GB |
| SDXL | 1024ร1024 | 8 GB | 12 GB |
| SDXL + ControlNet | 1024ร1024 | 12 GB | 16 GB |
| Flux.1 | 1024ร1024 | 16 GB | 24 GB |
Debugging Memory Leaks
Identify Leaks
import torch
def check_memory():
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
# Check at different points
check_memory() # Before
output = model(input)
check_memory() # After forward
loss.backward()
check_memory() # After backward
Common Leak Patterns
1. Storing tensors in lists:
# BAD - accumulates GPU memory
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss)  # Keeps tensor on GPU!
# GOOD - store a plain Python number instead
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss.item())  # .item() returns a float and releases the graph
2. Not clearing gradients:
# BAD - gradients accumulate
for batch in dataloader:
    loss = model(batch)
    loss.backward()
# GOOD - zero gradients
for batch in dataloader:
    optimizer.zero_grad()  # Clear previous gradients
    loss = model(batch)
    loss.backward()
    optimizer.step()
3. Keeping computation graph:
# BAD - keeps entire computation graph
total_loss = 0
for batch in dataloader:
    loss = model(batch)
    total_loss += loss  # Keeps graph!
# GOOD - detach from graph
total_loss = 0
for batch in dataloader:
    loss = model(batch)
    total_loss += loss.item()  # Just the number
Prevention Strategies
1. Estimate Memory Before Running
def estimate_model_memory(model, precision='fp32'):
    """Rough training-memory estimate: parameters, gradients, and Adam states.
    Activations (which scale with batch size and sequence length) are not included."""
    param_size = sum(p.numel() for p in model.parameters())
    bytes_per_param = {'fp32': 4, 'fp16': 2, 'int8': 1}[precision]
    # Parameters + gradients + two Adam moments ≈ 4x the parameter memory
    training_memory = param_size * bytes_per_param * 4
    return training_memory / 1e9  # GB
# Usage
memory_needed = estimate_model_memory(model)
print(f"Estimated memory: {memory_needed:.2f} GB")
2. Start Small, Scale Up
# Start with minimum settings
batch_size = 1
# Gradually increase until OOM
for bs in [1, 2, 4, 8, 16, 32]:
    try:
        train_step(batch_size=bs)
        print(f"Batch size {bs}: OK")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"Batch size {bs}: OOM! Use {bs // 2}")
            break
        raise
3. Monitor During Training
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
for step, batch in enumerate(dataloader):
    # ... training code ...
    # Log memory usage
    writer.add_scalar('memory/allocated',
                      torch.cuda.memory_allocated(0) / 1e9, step)
    writer.add_scalar('memory/reserved',
                      torch.cuda.memory_reserved(0) / 1e9, step)
Quick Reference: OOM Solutions by Situation
| Situation | First Try | If Still OOM | Last Resort |
|---|---|---|---|
| Training | Reduce batch size | Gradient checkpointing | 8-bit optimizer |
| Fine-tuning | Use LoRA/QLoRA | 4-bit quantization | CPU offload |
| Inference | FP16/BF16 | 4-bit quantization | CPU offload |
| Image Gen | --medvram flag | --lowvram flag | Reduce resolution |
| Long sequences | Flash Attention | Reduce seq length | Chunked processing |
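"Chunked processing" in the last row means splitting one long input into pieces that fit in memory and running them through the model one at a time. A minimal inference sketch, assuming a Hugging Face-style model whose output has a .logits attribute (chunk_len is a hypothetical knob you tune to your VRAM):
import torch

@torch.no_grad()
def run_in_chunks(model, input_ids, chunk_len=2048):
    """Process a long token sequence in fixed-size chunks to bound peak memory."""
    outputs = []
    for start in range(0, input_ids.shape[1], chunk_len):
        chunk = input_ids[:, start:start + chunk_len]
        outputs.append(model(chunk).logits.cpu())  # move results off the GPU right away
    return torch.cat(outputs, dim=1)
Note that this simple version processes chunks independently, so the model sees no context across chunk boundaries; add task-specific overlap or caching if that matters.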
Summary
- Diagnose first: Use nvidia-smi and PyTorch memory functions
- Quick fixes: Reduce batch size, clear cache, use FP16
- Advanced: Quantization, gradient checkpointing, CPU offload
- Prevent: Estimate memory beforehand, monitor during training
- When all else fails: Get more VRAM (upgrade GPU or use cloud)
Remember: CUDA OOM is usually solvable. Start with the simplest solutions and work your way up.
Need more VRAM? Browse SynpixCloud's marketplace for RTX 4090 (24GB), A100 (80GB), and H100 (80GB) instances.
