GPU Acceleration Setup

This guide covers setting up GPU acceleration for Scrapalot, including CUDA support for PyTorch and llama-cpp-python, with special focus on Vulkan universal GPU support and NVIDIA Blackwell architecture.

Overview

Scrapalot supports multiple GPU acceleration backends:

  • Vulkan - Universal GPU support (NVIDIA, AMD, Intel, Apple) - Recommended
  • CUDA - NVIDIA GPUs only
  • ROCm - AMD GPUs (experimental)
  • Metal - Apple Silicon (macOS)
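
Before choosing a backend, it helps to confirm what the machine can actually see. The vendor tools below are standard diagnostic commands (each is only available once the corresponding drivers or SDK are installed):

bash
# NVIDIA: driver version, CUDA version, and attached GPUs
nvidia-smi

# Any vendor: Vulkan-capable devices (requires vulkan-tools)
vulkaninfo --summary

# AMD: ROCm-visible devices (requires ROCm)
rocminfo

# macOS: GPUs recognized by the system
system_profiler SPDisplaysDataType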

Quick Start: Vulkan Universal GPU Support

Why Vulkan?

Vulkan provides the best cross-platform GPU support for LLM inference:

  • Universal Compatibility: Single solution for all GPU vendors
  • Competitive Performance: roughly 90-102% of CUDA performance on NVIDIA GPUs
  • Simplified Deployment: No vendor-specific driver management
  • Future-Proof: Industry standard with active development

Performance Comparison

| Backend | NVIDIA RTX 4090 | AMD RX 7900 XTX | Intel Arc A770 | Apple M2 Pro |
|---------|-----------------|-----------------|----------------|--------------|
| Vulkan  | 100% (baseline) | 94% of RTX 3090 Ti | 85% of RTX 3060 | 90% of native |
| CUDA    | 102% (reference) | N/A | N/A | N/A |
| ROCm    | N/A | 88% of RTX 3090 Ti | N/A | N/A |
| Metal   | N/A | N/A | N/A | 100% (native) |

Installation Steps

Step 1: Install Vulkan Drivers

Windows:

powershell
# Update GPU drivers (includes Vulkan automatically)
# Verify installation
vulkaninfo --summary

Linux (Ubuntu/Debian):

bash
# Install Vulkan runtime and drivers
sudo apt update && sudo apt install vulkan-tools mesa-vulkan-drivers

# For NVIDIA GPUs:
sudo apt install nvidia-driver-535 nvidia-utils-535
# Or auto-install: sudo ubuntu-drivers autoinstall

# For AMD GPUs:
sudo apt install mesa-vulkan-drivers vulkan-tools

# Verify installation
vulkaninfo --summary

macOS:

bash
# Install Vulkan SDK
brew install --cask vulkan-sdk

# Verify installation
vulkaninfo --summary

Step 2: Install llama-cpp-python with Vulkan Support

bash
# Activate your conda environment
conda activate scrapalot-chat

# Install with Vulkan support
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python==0.3.8 --no-cache-dir --force-reinstall

Step 3: Enable Vulkan in Configuration

Edit configs/config.yaml:

yaml
llm:
  advanced:
    vulkan:
      enabled: true
      prefer_vulkan: true      # Prefer Vulkan over vendor-specific backends
      device_index: 0          # GPU device to use (0 = primary)
      memory_budget_mb: null   # Auto-detect available memory

Step 4: Verify Installation

bash
# Test Vulkan detection
vulkaninfo --summary

# Start Scrapalot and check logs for:
# "Detected GPU via Vulkan: [GPU_NAME] ([VENDOR])"
# "Using Vulkan backend for universal GPU acceleration"

Production Deployment with Vulkan

Docker Compose Configuration:

yaml
# docker-compose.prod.yml
version: '3.8'
services:
  scrapalot-backend:
    image: scrapalot/backend:latest
    environment:
      - LLM_VULKAN_ENABLED=true
      - LLM_VULKAN_PREFER=true
    devices:
      - /dev/dri:/dev/dri  # Linux GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
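
A minimal usage sketch, assuming the file above is saved as docker-compose.prod.yml and that the backend image ships the usual diagnostic tools (an assumption, not something the compose file guarantees):

bash
# Start the stack with the production compose file
docker compose -f docker-compose.prod.yml up -d

# Confirm the container can see the GPU (adjust to whatever tools the image includes)
docker compose -f docker-compose.prod.yml exec scrapalot-backend vulkaninfo --summary
docker compose -f docker-compose.prod.yml exec scrapalot-backend nvidia-smi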

Troubleshooting Vulkan

Issue: "vulkaninfo: no vulkan available"

Solution:

  • Update GPU drivers to latest version
  • Install Vulkan SDK/runtime
  • Verify hardware supports Vulkan 1.1+

Issue: "No Vulkan GPUs detected"

Solution:

  • Run vulkaninfo --summary to check output
  • Verify GPU supports Vulkan 1.1 or higher
  • Update to latest drivers

Issue: Performance slower than expected

Solution:

  • Increase gpu_layers in config.yaml
  • Check memory usage: nvidia-smi or radeontop
  • Verify Vulkan backend is being used in logs

CUDA Setup for NVIDIA GPUs

Prerequisites

Hardware:

  • NVIDIA GPU (RTX series recommended)
  • Minimum 4GB VRAM, 8GB+ recommended
  • System RAM: 16GB+ recommended

Software:

  • Latest NVIDIA drivers supporting CUDA 12.1+
  • Conda or virtualenv
  • Python 3.10+

Verify GPU Compatibility

bash
# Check NVIDIA driver and CUDA support
nvidia-smi

# Should show your GPU and CUDA version 12.1+

Quick Setup

Step 1: Install PyTorch with CUDA

bash
# Activate environment
conda activate scrapalot-chat

# Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu121 \
  --force-reinstall

Step 2: Install llama-cpp-python with CUDA

bash
# Uninstall CPU version
pip uninstall -y llama-cpp-python

# Install pre-compiled CUDA wheel
pip install llama-cpp-python==0.3.16 \
  --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121

Step 3: Verify Installation

bash
# Verify PyTorch CUDA support
python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"

# Expected output:
# PyTorch version: 2.x.x+cu121
# CUDA available: True
# CUDA version: 12.1
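
The check above only covers PyTorch. For llama-cpp-python, recent versions expose llama_supports_gpu_offload() in the low-level bindings; if the symbol is missing in your version, inspecting the build/install log for CUDA/cuBLAS references is the fallback:

bash
# Confirm the installed llama-cpp-python wheel was built with GPU offload support
python -c "import llama_cpp; print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"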

Configuration

Update configs/config.yaml for optimal GPU performance:

yaml
llm:
  advanced:
    # GPU Configuration
    gpu_layers: 50          # Number of layers to offload to GPU (0 = CPU only)
    context_size: 8192      # Context window size
    batch_size: 512         # Batch size for processing
    threads: 4              # CPU threads for non-GPU operations
    use_mlock: true         # Keep models in memory
    use_mmap: true          # Enable memory mapping
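
There is no single correct gpu_layers value; it depends on model size, quantization, and free VRAM. A practical approach is to check free VRAM, raise gpu_layers until the model no longer fits, then back off a few layers:

bash
# Check free vs. total VRAM before loading a model
nvidia-smi --query-gpu=memory.free,memory.total --format=csv

# Watch VRAM usage live while experimenting with gpu_layers values
watch -n 1 nvidia-smi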

Special Case: Blackwell GPU Users

Important Notice for Blackwell Architecture

If you have an NVIDIA Blackwell GPU (RTX 4500 PRO, RTX 5080/5090), please note:

Current Status:

  • PyTorch stable releases (2.8.0) do NOT support Blackwell (sm_120) architecture yet
  • Our codebase automatically detects this and falls back to CPU mode
  • Document processing still works via PyMuPDF4LLM (CPU, very fast)
  • Embedding models work (HuggingFace sentence-transformers)
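
To see whether a given PyTorch build ships sm_120 kernels for your GPU, here is a minimal, illustrative check (a sketch of the same idea, not Scrapalot's internal detection code):

bash
# Illustrative only: report the GPU's compute capability and the kernel
# architectures compiled into this PyTorch build (look for 'sm_120')
python - <<'EOF'
import torch
if torch.cuda.is_available():
    print("Compute capability:", torch.cuda.get_device_capability(0))  # (12, 0) on Blackwell
    print("Compiled architectures:", torch.cuda.get_arch_list())
else:
    print("CUDA not available; CPU fallback will be used")
EOF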

What Works on Blackwell

| Component | Status | Notes |
|-----------|--------|-------|
| Document Processing | Working | Auto-falls back to PyMuPDF4LLM (CPU) |
| Embedding Models | Working | sentence-transformers models work |
| llama-cpp-python | Partial | May use CPU fallback, still functional |
| Docling GPU | Not Working | Auto-falls back to CPU mode |
| PyTorch CUDA | Not Working | Incompatible with sm_120 |

Solutions for Blackwell Users

Option 1: Use Current Setup (Recommended)

Our codebase includes automatic Blackwell detection and CPU fallback:

  • Document processing works via PyMuPDF4LLM
  • All features functional, just without GPU acceleration for some components
  • This is the safest and most stable option

Option 2: Try PyTorch Nightly with CUDA 13.0 (Experimental)

PyTorch nightly builds with CUDA 13.0 have full Blackwell sm_120 support:

bash
# Uninstall stable PyTorch
pip uninstall torch torchvision torchaudio -y

# Install nightly build with CUDA 13.0 (Blackwell support!)
pip install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu130

Warning: Nightly builds may be unstable.
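
After switching to the nightly build, the same architecture check confirms whether Blackwell kernels are present:

bash
# Should report a CUDA 13.x runtime and include 'sm_120' in the architecture list
python -c "import torch; print(torch.version.cuda, torch.cuda.get_arch_list())"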

Option 3: Wait for Official Support (Coming Soon)

PyTorch developers are working on official sm_120 support in upcoming stable releases.

Remote GPU Access: Cloud-to-Home Setup

Overview

Connect your cloud server to your home GPU setup using Tailscale VPN, enabling cloud-hosted applications to leverage local GPU power.

Use Case:

  • Cloud Server: Hetzner/AWS (limited GPU)
  • Home GPUs: Powerful local GPUs (e.g., 2x RTX 4500 PRO)
  • Goal: Leverage home GPU power for cloud applications

Quick Setup (30 Minutes)

Step 1: Install Tailscale on Home GPU Machine

Windows:

powershell
# Install Tailscale
winget install tailscale.tailscale

# Start Tailscale (opens browser for authentication)
# Sign in with your preferred account

# Verify installation
tailscale status
# Should show your assigned IP: 100.x.x.x

Linux:

bash
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh

# Start and authenticate
sudo tailscale up

# Verify installation
sudo tailscale status

Step 2: Install Tailscale on Cloud Server

bash
# SSH into your cloud server
ssh your-server

# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh

# Start and authenticate (use SAME account as home machine)
sudo tailscale up

# Verify both machines are connected
sudo tailscale status

Step 3: Configure LM Studio on Home Machine

  1. Launch LM Studio on your GPU machine
  2. Go to Settings → Server
  3. Configure:
yaml
Host: 0.0.0.0          # Listen on all interfaces
Port: 1234             # Default LM Studio port
CORS: Enabled          # Allow cross-origin requests
  4. Start Server
  5. Test locally: curl http://localhost:1234/v1/models

Step 4: Update Provider Endpoint

In Scrapalot UI:

  1. Navigate to Settings → Model Providers
  2. Find LM Studio provider
  3. Update API Base:
http://100.x.x.x:1234

(Replace with your home machine's Tailscale IP)

  4. Set Status to Active
  5. Save

Step 5: Test Connection

bash
# Test from cloud server
curl http://100.x.x.x:1234/v1/models

# Monitor GPU usage on home machine
nvidia-smi -l 1
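
For an end-to-end check that inference actually runs on the home GPU, LM Studio also serves an OpenAI-compatible chat endpoint. The model name below is a placeholder; use one of the names returned by /v1/models:

bash
# End-to-end inference test from the cloud server (placeholder model name)
curl http://100.x.x.x:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello from the cloud"}]}'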

Security Best Practices

Enable Tailscale ACLs:

json
{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:cloud-server"],
      "dst": ["tag:home-gpu:1234"]
    }
  ]
}
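
The tags used above are not applied automatically: they must be declared under tagOwners in the tailnet policy, and each machine has to advertise its tag. A sketch of the corresponding commands, assuming the tag names from the ACL example:

bash
# On the home GPU machine
sudo tailscale up --advertise-tags=tag:home-gpu

# On the cloud server
sudo tailscale up --advertise-tags=tag:cloud-server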

Add API Key Authentication:

Configure API key in LM Studio settings and add to .env:

bash
REMOTE_GPU_API_KEY=your-secret-key

Performance Optimization

Dual GPU Configuration:

With multiple GPUs, you can:

Option A: Single Large Model (Recommended)

  • Load massive models across both GPUs
  • LM Studio supports tensor parallelism
  • Example: Llama 3.1 70B across 2x RTX 4500 PRO

Option B: Multiple Providers

  • Run different models on each GPU
  • Create separate providers in Scrapalot
  • Manual load balancing

Network Optimization:

bash
# On cloud server - enable TCP BBR congestion control
echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
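
To confirm the congestion-control change took effect:

bash
# Should print: net.ipv4.tcp_congestion_control = bbr
sysctl net.ipv4.tcp_congestion_control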

Troubleshooting

Common Issues

Issue: CUDA not available

bash
# Check PyTorch installation
python -c "import torch; print(torch.__version__)"

# Should show: 2.x.x+cu121 (not +cpu)

# If showing +cpu, reinstall with --force-reinstall flag
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu121 \
  --force-reinstall

Issue: Out of Memory

yaml
# Reduce GPU layers in config.yaml
gpu_layers: 20  # Instead of 50

# Use smaller batch sizes
batch_size: 256  # Instead of 512

# Close other GPU applications

Issue: llama-cpp-python not using GPU

bash
# Verify CUDA version is installed
pip show llama-cpp-python

# Should show version with +cu121 suffix
# If not, reinstall:
pip install llama-cpp-python==0.3.16 \
  --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121 \
  --force-reinstall

Issue: Cannot connect to remote GPU

bash
# Check Tailscale status on both machines
tailscale status

# Test ping from cloud to home
ping 100.x.x.x

# Check firewall on home machine (run in PowerShell on Windows)
New-NetFirewallRule -DisplayName "LM Studio" -Direction Inbound -LocalPort 1234 -Protocol TCP -Action Allow

Debug Commands

bash
# Check installed packages
pip list | grep -E "(torch|llama)"

# Verify CUDA toolkit
nvcc --version

# Check GPU utilization during model loading
nvidia-smi -l 1

# Monitor memory usage
watch -n 1 nvidia-smi

Performance Monitoring

Key Metrics to Monitor

  • Model Loading Times: Track initialization performance
  • GPU Memory Utilization: Monitor VRAM usage
  • Inference Latency: Measure response times
  • Temperature: Ensure GPU stays within safe limits

Monitoring Commands

bash
# Real-time GPU monitoring
nvidia-smi -l 1

# Detailed GPU information
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Monitor specific processes
nvidia-smi pmon -i 0

Alternative GPU Support

AMD GPUs (ROCm)

For AMD GPUs, use ROCm backend:

bash
# Install ROCm (Ubuntu; requires AMD's ROCm apt repository to be configured first)
sudo apt install rocm-hip-runtime

# Install PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
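
As with CUDA, it is worth verifying that the ROCm (HIP) build of PyTorch is the one in use; ROCm builds report a HIP version instead of a CUDA toolkit version:

bash
# Verify the ROCm build of PyTorch sees the GPU
python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('HIP version:', torch.version.hip)"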

Apple Silicon (Metal)

For Apple M1/M2/M3 chips:

bash
# PyTorch with Metal support (included by default on macOS)
pip install torch torchvision torchaudio

# Verify Metal support
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"

Intel GPUs

For Intel Arc GPUs:

bash
# Use Vulkan backend (recommended)
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --force-reinstall

Next Steps


For detailed GPU configuration and advanced optimization, consult the model management documentation.
