GPU Acceleration Setup
This guide covers setting up GPU acceleration for Scrapalot, including CUDA support for PyTorch and llama-cpp-python, with a special focus on universal Vulkan GPU support and the NVIDIA Blackwell architecture.
Overview
Scrapalot supports multiple GPU acceleration backends:
- Vulkan - Universal GPU support (NVIDIA, AMD, Intel, Apple) - Recommended
- CUDA - NVIDIA GPUs only
- ROCm - AMD GPUs (experimental)
- Metal - Apple Silicon (macOS)
Quick Start: Vulkan Universal GPU Support
Why Vulkan?
Vulkan provides the best cross-platform GPU support for LLM inference:
- Universal Compatibility: Single solution for all GPU vendors
- Strong Performance: 90-102% of CUDA performance on NVIDIA GPUs
- Simplified Deployment: No vendor-specific driver management
- Future-Proof: Industry standard with active development
Performance Comparison
| Backend | NVIDIA RTX 4090 | AMD RX 7900 XTX | Intel Arc A770 | Apple M2 Pro |
|---|---|---|---|---|
| Vulkan | 100% (baseline) | 94% of RTX 3090 Ti | 85% of RTX 3060 | 90% of native |
| CUDA | 102% (reference) | N/A | N/A | N/A |
| ROCm | N/A | 88% of RTX 3090 Ti | N/A | N/A |
| Metal | N/A | N/A | N/A | 100% (native) |
Installation Steps
Step 1: Install Vulkan Drivers
Windows:
# Update GPU drivers (includes Vulkan automatically)
# Verify installation
vulkaninfo --summary
Linux (Ubuntu/Debian):
# Install Vulkan runtime and drivers
sudo apt update && sudo apt install vulkan-tools mesa-vulkan-drivers
# For NVIDIA GPUs:
sudo apt install nvidia-driver-535 nvidia-utils-535
# Or auto-install: sudo ubuntu-drivers autoinstall
# For AMD GPUs:
sudo apt install mesa-vulkan-drivers vulkan-utils
# Verify installation
vulkaninfo --summary
macOS:
# Install Vulkan SDK
brew install --cask vulkan-sdk
# Verify installation
vulkaninfo --summary
Step 2: Install llama-cpp-python with Vulkan Support
# Activate your conda environment
conda activate scrapalot-chat
# Install with Vulkan support
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python==0.3.8 --no-cache-dir --force-reinstall
Step 3: Enable Vulkan in Configuration
Edit configs/config.yaml:
llm:
  advanced:
    vulkan:
      enabled: true
      prefer_vulkan: true     # Prefer Vulkan over vendor-specific backends
      device_index: 0         # GPU device to use (0 = primary)
      memory_budget_mb: null  # Auto-detect available memory
Step 4: Verify Installation
# Test Vulkan detection
vulkaninfo --summary
# Start Scrapalot and check logs for:
# "Detected GPU via Vulkan: [GPU_NAME] ([VENDOR])"
# "Using Vulkan backend for universal GPU acceleration"
Production Deployment with Vulkan
Docker Compose Configuration:
# docker-compose.prod.yml
version: '3.8'
services:
  scrapalot-backend:
    image: scrapalot/backend:latest
    environment:
      - LLM_VULKAN_ENABLED=true
      - LLM_VULKAN_PREFER=true
    volumes:
      - /dev/dri:/dev/dri  # Linux GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Troubleshooting Vulkan
Issue: "vulkaninfo: no vulkan available"
Solution:
- Update GPU drivers to latest version
- Install Vulkan SDK/runtime
- Verify hardware supports Vulkan 1.1+
Issue: "No Vulkan GPUs detected"
Solution:
- Run vulkaninfo --summary to check the output
- Verify GPU supports Vulkan 1.1 or higher
- Update to latest drivers
Issue: Performance slower than expected
Solution:
- Increase gpu_layers in config.yaml
- Check memory usage with nvidia-smi or radeontop
- Verify the Vulkan backend is being used in the logs (see the check below)
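If it is unclear whether your llama-cpp-python wheel was actually built with GPU support, a quick command-line check can help. This is a minimal sketch; llama_supports_gpu_offload is a low-level binding whose availability can vary between llama-cpp-python versions:
# Print the installed llama-cpp-python version and whether the build can offload layers to a GPU
python -c "import llama_cpp; print('llama-cpp-python:', llama_cpp.__version__); print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"
# False means the wheel was built CPU-only - reinstall with the CMAKE_ARGS shown in Step 2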
CUDA Setup for NVIDIA GPUs
Prerequisites
Hardware:
- NVIDIA GPU (RTX series recommended)
- Minimum 4GB VRAM, 8GB+ recommended
- System RAM: 16GB+ recommended
Software:
- Latest NVIDIA drivers supporting CUDA 12.1+
- Conda or virtualenv
- Python 3.10+
Verify GPU Compatibility
# Check NVIDIA driver and CUDA support
nvidia-smi
# Should show your GPU and CUDA version 12.1+
Quick Setup
Step 1: Install PyTorch with CUDA
# Activate environment
conda activate scrapalot-chat
# Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu121 \
--force-reinstall
Step 2: Install llama-cpp-python with CUDA
# Uninstall CPU version
pip uninstall -y llama-cpp-python
# Install pre-compiled CUDA wheel
pip install llama-cpp-python==0.3.16 \
--extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121
Step 3: Verify Installation
# Verify PyTorch CUDA support
python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
# Expected output:
# PyTorch version: 2.x.x+cu121
# CUDA available: True
# CUDA version: 12.1
Configuration
Update configs/config.yaml for optimal GPU performance:
llm:
  advanced:
    # GPU Configuration
    gpu_layers: 50      # Number of layers to offload to GPU (0 = CPU only)
    context_size: 8192  # Context window size
    batch_size: 512     # Batch size for processing
    threads: 4          # CPU threads for non-GPU operations
    use_mlock: true     # Keep models in memory
    use_mmap: true      # Enable memory mapping
Special Case: Blackwell GPU Users
Important Notice for Blackwell Architecture
If you have an NVIDIA Blackwell GPU (RTX 4500 PRO, RTX 5080/5090), please note:
Current Status:
- PyTorch stable releases (2.8.0) do NOT support Blackwell (sm_120) architecture yet
- Our codebase automatically detects this and falls back to CPU mode
- Document processing still works via PyMuPDF4LLM (CPU, very fast)
- Embedding models work (HuggingFace sentence-transformers)
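To see how your installed PyTorch build relates to a Blackwell card, you can compare the GPU's compute capability against the architectures the build was compiled for. A minimal sketch (output depends on your driver and PyTorch build):
# Show the GPU's compute capability and the CUDA architectures this PyTorch build was compiled for
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Compiled for:', torch.cuda.get_arch_list()); print('Device capability:', torch.cuda.get_device_capability(0) if torch.cuda.is_available() else 'n/a')"
# Blackwell reports compute capability (12, 0); if sm_120 is missing from the compiled list, expect CPU fallback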
What Works on Blackwell
| Component | Status | Notes |
|---|---|---|
| Document Processing | Working | Auto-falls back to PyMuPDF4LLM (CPU) |
| Embedding Models | Working | sentence-transformers models work |
| llama-cpp-python | Partial | May use CPU fallback, still functional |
| Docling GPU | Not Working | Auto-falls back to CPU mode |
| PyTorch CUDA | Not Working | Incompatible with sm_120 |
Solutions for Blackwell Users
Option 1: Use Current Setup (Recommended)
Our codebase includes automatic Blackwell detection and CPU fallback:
- Document processing works via PyMuPDF4LLM
- All features functional, just without GPU acceleration for some components
- This is the safest and most stable option
Option 2: Try PyTorch Nightly with CUDA 13.0 (Experimental)
PyTorch nightly builds with CUDA 13.0 have full Blackwell sm_120 support:
# Uninstall stable PyTorch
pip uninstall torch torchvision torchaudio -y
# Install nightly build with CUDA 13.0 (Blackwell support!)
pip install --pre torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/nightly/cu130
Warning: Nightly builds may be unstable.
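After installing a nightly build, a quick smoke test confirms the GPU is actually usable (the exact version strings will differ on your machine):
# Check the nightly version, its CUDA toolkit, and run a small matrix multiply on the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda); x = torch.rand(1024, 1024, device='cuda'); print('GPU matmul OK:', (x @ x).shape)"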
Option 3: Wait for Official Support (Coming Soon)
PyTorch developers are working on official sm_120 support:
- Expected in PyTorch 2.9+ stable release
- CUDA 13.0 required
- Tracking: https://github.com/pytorch/pytorch/issues/159779
Remote GPU Access: Cloud-to-Home Setup
Overview
Connect your cloud server to your home GPU setup using Tailscale VPN, enabling cloud-hosted applications to leverage local GPU power.
Use Case:
- Cloud Server: Hetzner/AWS (limited GPU)
- Home GPUs: Powerful local GPUs (e.g., 2x RTX 4500 PRO)
- Goal: Leverage home GPU power for cloud applications
Quick Setup (30 Minutes)
Step 1: Install Tailscale on Home GPU Machine
Windows:
# Install Tailscale
winget install tailscale.tailscale
# Start Tailscale (opens browser for authentication)
# Sign in with your preferred account
# Verify installation
tailscale status
# Should show your assigned IP: 100.x.x.x
Linux:
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
# Start and authenticate
sudo tailscale up
# Verify installation
sudo tailscale status
Step 2: Install Tailscale on Cloud Server
# SSH into your cloud server
ssh your-server
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
# Start and authenticate (use SAME account as home machine)
sudo tailscale up
# Verify both machines are connected
sudo tailscale status
Step 3: Configure LM Studio on Home Machine
- Launch LM Studio on your GPU machine
- Go to Settings → Server
- Configure:
Host: 0.0.0.0 # Listen on all interfaces
Port: 1234 # Default LM Studio port
CORS: Enabled # Allow cross-origin requests
- Start Server
- Test locally:
curl http://localhost:1234/v1/models
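LM Studio's server exposes an OpenAI-compatible API, so the request above should return a JSON list of the loaded models. To pull out just the model IDs (assuming jq is installed), a minimal sketch:
# List only the model IDs from the OpenAI-compatible /v1/models response
curl -s http://localhost:1234/v1/models | jq -r '.data[].id'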
Step 4: Update Provider Endpoint
In Scrapalot UI:
- Navigate to Settings → Model Providers
- Find LM Studio provider
- Update API Base to http://100.x.x.x:1234 (replace with your home machine's Tailscale IP)
- Set Status to Active
- Save
Step 5: Test Connection
# Test from cloud server
curl http://100.x.x.x:1234/v1/models
# Monitor GPU usage on home machine
nvidia-smi -l 1
Security Best Practices
Enable Tailscale ACLs:
{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:cloud-server"],
      "dst": ["tag:home-gpu:1234"]
    }
  ]
}
Add API Key Authentication:
Configure API key in LM Studio settings and add to .env:
REMOTE_GPU_API_KEY=your-secret-key
Performance Optimization
Dual GPU Configuration:
With multiple GPUs, you can:
Option A: Single Large Model (Recommended)
- Load massive models across both GPUs
- LM Studio supports tensor parallelism
- Example: Llama 3.1 70B across 2x RTX 4500 PRO
Option B: Multiple Providers
- Run different models on each GPU
- Create separate providers in Scrapalot
- Manual load balancing
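One way to set up Option B is to pin each model server process to a single GPU with CUDA_VISIBLE_DEVICES and give each its own port, then register each endpoint as a separate provider in Scrapalot. A hedged sketch using llama-cpp-python's bundled OpenAI-compatible server instead of LM Studio (model paths and ports are placeholders):
# Requires the server extra: pip install "llama-cpp-python[server]"
# GPU 0 serves the first model on port 1234
CUDA_VISIBLE_DEVICES=0 python -m llama_cpp.server --model /models/model-a.gguf --n_gpu_layers -1 --port 1234 &
# GPU 1 serves the second model on port 1235
CUDA_VISIBLE_DEVICES=1 python -m llama_cpp.server --model /models/model-b.gguf --n_gpu_layers -1 --port 1235 &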
Network Optimization:
# On cloud server - enable TCP BBR congestion control
echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Troubleshooting
Common Issues
Issue: CUDA not available
# Check PyTorch installation
python -c "import torch; print(torch.__version__)"
# Should show: 2.x.x+cu121 (not +cpu)
# If showing +cpu, reinstall with --force-reinstall flag
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu121 \
--force-reinstall
Issue: Out of Memory
# Reduce GPU layers in config.yaml
gpu_layers: 20 # Instead of 50
# Use smaller batch sizes
batch_size: 256 # Instead of 512
# Close other GPU applications
Issue: llama-cpp-python not using GPU
# Verify CUDA version is installed
pip show llama-cpp-python
# Should show version with +cu121 suffix
# If not, reinstall:
pip install llama-cpp-python==0.3.16 \
--extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121 \
--force-reinstall
Issue: Cannot connect to remote GPU
# Check Tailscale status on both machines
tailscale status
# Test ping from cloud to home
ping 100.x.x.x
# Check firewall on home machine (Windows)
New-NetFirewallRule -DisplayName "LM Studio" -Direction Inbound -LocalPort 1234 -Protocol TCP -Action Allow
Debug Commands
# Check installed packages
pip list | grep -E "(torch|llama)"
# Verify CUDA toolkit
nvcc --version
# Check GPU utilization during model loading
nvidia-smi -l 1
# Monitor memory usage
watch -n 1 nvidia-smi
Performance Monitoring
Key Metrics to Monitor
- Model Loading Times: Track initialization performance
- GPU Memory Utilization: Monitor VRAM usage
- Inference Latency: Measure response times
- Temperature: Ensure GPU stays within safe limits
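The commands in the next subsection give live views of the GPU-side metrics; to keep a record over time (for example, across a long ingestion run), nvidia-smi can append CSV samples to a log file. A minimal sketch, sampling every 5 seconds until interrupted:
# Append one CSV sample of the key GPU metrics every 5 seconds
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv,noheader -l 5 >> gpu_metrics.csv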
Monitoring Commands
# Real-time GPU monitoring
nvidia-smi -l 1
# Detailed GPU information
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
# Monitor specific processes
nvidia-smi pmon -i 0
Alternative GPU Support
AMD GPUs (ROCm)
For AMD GPUs, use ROCm backend:
# Install ROCm (Ubuntu)
sudo apt install rocm-hip-runtime
# Install PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
Apple Silicon (Metal)
For Apple M1/M2/M3 chips:
# PyTorch with Metal support (included by default on macOS)
pip install torch torchvision torchaudio
# Verify Metal support
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
Intel GPUs
For Intel Arc GPUs:
# Use Vulkan backend (recommended)
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --force-reinstall
Next Steps
- Model Management - Deploy and manage local models
- Deployment Guide - Production deployment strategies
- Cloud Infrastructure - Cloud deployment with GPU
For detailed GPU configuration and advanced optimization, consult the model management documentation.