GPU Acceleration Setup
This guide covers setting up GPU acceleration for Scrapalot, including CUDA support for PyTorch and llama-cpp-python, with a special focus on universal Vulkan GPU support and the NVIDIA Blackwell architecture.
Overview
Scrapalot supports multiple GPU acceleration backends:
- Vulkan - Universal GPU support (NVIDIA, AMD, Intel, Apple) - Recommended
- CUDA - NVIDIA GPUs only
- ROCm - AMD GPUs (experimental)
- Metal - Apple Silicon (macOS)
Quick Start: Vulkan Universal GPU Support
Why Vulkan?
Vulkan provides the best cross-platform GPU support for LLM inference:
- Universal Compatibility: Single solution for all GPU vendors
- Strong Performance: 90-102% of CUDA performance on NVIDIA GPUs
- Simplified Deployment: No vendor-specific driver management
- Future-Proof: Industry standard with active development
Performance Comparison
| Backend | NVIDIA RTX 4090 | AMD RX 7900 XTX | Intel Arc A770 | Apple M2 Pro |
|---|---|---|---|---|
| Vulkan | 100% (baseline) | 94% of RTX 3090 Ti | 85% of RTX 3060 | 90% of native |
| CUDA | 102% (reference) | N/A | N/A | N/A |
| ROCm | N/A | 88% of RTX 3090 Ti | N/A | N/A |
| Metal | N/A | N/A | N/A | 100% (native) |
Installation Steps
Step 1: Install Vulkan Drivers
Windows:
# Update GPU drivers (includes Vulkan automatically)
# Verify installation
vulkaninfo --summary
Linux (Ubuntu/Debian):
# Install Vulkan runtime and drivers
sudo apt update && sudo apt install vulkan-tools mesa-vulkan-drivers
# For NVIDIA GPUs:
sudo apt install nvidia-driver-535 nvidia-utils-535
# Or auto-install: sudo ubuntu-drivers autoinstall
# For AMD GPUs:
sudo apt install mesa-vulkan-drivers vulkan-utils
# Verify installation
vulkaninfo --summary
macOS:
# Install Vulkan SDK
brew install --cask vulkan-sdk
# Verify installation
vulkaninfo --summary
Step 2: Install llama-cpp-python with Vulkan Support
# Activate your conda environment
conda activate scrapalot-chat
# Install with Vulkan support
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python==0.3.8 --no-cache-dir --force-reinstall
Step 3: Enable Vulkan in Configuration
Edit configs/config.yaml:
llm:
  advanced:
    vulkan:
      enabled: true
      prefer_vulkan: true     # Prefer Vulkan over vendor-specific backends
      device_index: 0         # GPU device to use (0 = primary)
      memory_budget_mb: null  # Auto-detect available memory
Step 4: Verify Installation
# Test Vulkan detection
vulkaninfo --summary
# Start Scrapalot and check logs for:
# "Detected GPU via Vulkan: [GPU_NAME] ([VENDOR])"
# "Using Vulkan backend for universal GPU acceleration"
Production Deployment with Vulkan
Docker Compose Configuration:
# docker-compose.prod.yml
version: '3.8'
services:
  scrapalot-backend:
    image: scrapalot/backend:latest
    environment:
      - LLM_VULKAN_ENABLED=true
      - LLM_VULKAN_PREFER=true
    volumes:
      - /dev/dri:/dev/dri  # Linux GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Troubleshooting Vulkan
Issue: "vulkaninfo: no vulkan available"
Solution:
- Update GPU drivers to latest version
- Install Vulkan SDK/runtime
- Verify hardware supports Vulkan 1.1+
Issue: "No Vulkan GPUs detected"
Solution:
- Run vulkaninfo --summary to check the output
- Verify GPU supports Vulkan 1.1 or higher
- Update to latest drivers
Issue: Performance slower than expected
Solution:
- Increase gpu_layers in config.yaml
- Check memory usage with nvidia-smi or radeontop
- Verify the Vulkan backend is being used in the logs (see the check below)
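If it is unclear whether your llama-cpp-python wheel was actually built with GPU support, a quick command-line check can help. This is a minimal sketch; llama_supports_gpu_offload is a low-level binding whose availability can vary between llama-cpp-python versions:
# Print the installed llama-cpp-python version and whether the build can offload layers to a GPU
python -c "import llama_cpp; print('llama-cpp-python:', llama_cpp.__version__); print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"
# False means the wheel was built CPU-only - reinstall with the CMAKE_ARGS shown in Step 2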
CUDA Setup for NVIDIA GPUs
Prerequisites
Hardware:
- NVIDIA GPU (RTX series recommended)
- Minimum 4GB VRAM, 8GB+ recommended
- System RAM: 16GB+ recommended
Software:
- Latest NVIDIA drivers supporting CUDA 12.1+
- Conda or virtualenv
- Python 3.10+
Verify GPU Compatibility
# Check NVIDIA driver and CUDA support
nvidia-smi
# Should show your GPU and CUDA version 12.1+
Quick Setup
Step 1: Install PyTorch with CUDA
# Activate environment
conda activate scrapalot-chat
# Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu121 \
--force-reinstall
Step 2: Install llama-cpp-python with CUDA
# Uninstall CPU version
pip uninstall -y llama-cpp-python
# Install pre-compiled CUDA wheel
pip install llama-cpp-python==0.3.16 \
--extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121
Step 3: Verify Installation
# Verify PyTorch CUDA support
python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
# Expected output:
# PyTorch version: 2.x.x+cu121
# CUDA available: True
# CUDA version: 12.1
Configuration
Update configs/config.yaml for optimal GPU performance:
llm:
  advanced:
    # GPU Configuration
    gpu_layers: 50      # Number of layers to offload to GPU (0 = CPU only)
    context_size: 8192  # Context window size
    batch_size: 512     # Batch size for processing
    threads: 4          # CPU threads for non-GPU operations
    use_mlock: true     # Keep models in memory
    use_mmap: true      # Enable memory mapping
Special Case: Blackwell GPU Users
Important Notice for Blackwell Architecture
If you have an NVIDIA Blackwell GPU (RTX 4500 PRO, RTX 5080/5090), please note:
Current Status:
- PyTorch stable releases (2.8.0) do NOT support Blackwell (sm_120) architecture yet
- Our codebase automatically detects this and falls back to CPU mode
- Document processing still works via PyMuPDF4LLM (CPU, very fast)
- Embedding models work (HuggingFace sentence-transformers)
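To see how your installed PyTorch build relates to a Blackwell card, you can compare the GPU's compute capability against the architectures the build was compiled for. A minimal sketch (output depends on your driver and PyTorch build):
# Show the GPU's compute capability and the CUDA architectures this PyTorch build was compiled for
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Compiled for:', torch.cuda.get_arch_list()); print('Device capability:', torch.cuda.get_device_capability(0) if torch.cuda.is_available() else 'n/a')"
# Blackwell reports compute capability (12, 0); if sm_120 is missing from the compiled list, expect CPU fallback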
What Works on Blackwell
| Component | Status | Notes |
|---|---|---|
| Document Processing | Working | Auto-falls back to PyMuPDF4LLM (CPU) |
| Embedding Models | Working | sentence-transformers models work |
| llama-cpp-python | Partial | May use CPU fallback, still functional |
| Docling GPU | Not Working | Auto-falls back to CPU mode |
| PyTorch CUDA | Not Working | Incompatible with sm_120 |
Solutions for Blackwell Users
Option 1: Use Current Setup (Recommended)
Our codebase includes automatic Blackwell detection and CPU fallback:
- Document processing works via PyMuPDF4LLM
- All features functional, just without GPU acceleration for some components
- This is the safest and most stable option
Option 2: Try PyTorch Nightly with CUDA 13.0 (Experimental)
PyTorch nightly builds with CUDA 13.0 have full Blackwell sm_120 support:
# Uninstall stable PyTorch
pip uninstall torch torchvision torchaudio -y
# Install nightly build with CUDA 13.0 (Blackwell support!)
pip install --pre torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/nightly/cu130
Warning: Nightly builds may be unstable.
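After installing a nightly build, a quick smoke test confirms the GPU is actually usable (the exact version strings will differ on your machine):
# Check the nightly version, its CUDA toolkit, and run a small matrix multiply on the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda); x = torch.rand(1024, 1024, device='cuda'); print('GPU matmul OK:', (x @ x).shape)"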
Option 3: Wait for Official Support (Coming Soon)
PyTorch developers are working on official sm_120 support:
- Expected in PyTorch 2.9+ stable release
- CUDA 13.0 required
- Tracking: https://github.com/pytorch/pytorch/issues/159779
Remote GPU Access: Cloud-to-Home Setup
Overview
Connect your cloud server to your home GPU setup using Tailscale VPN, enabling cloud-hosted applications to leverage local GPU power.
Use Case:
- Cloud Server: Hetzner/AWS (limited GPU)
- Home GPUs: Powerful local GPUs (e.g., 2x RTX 4500 PRO)
- Goal: Leverage home GPU power for cloud applications
Quick Setup (30 Minutes)
Step 1: Install Tailscale on Home GPU Machine
Windows:
# Install Tailscale
winget install tailscale.tailscale
# Start Tailscale (opens browser for authentication)
# Sign in with your preferred account
# Verify installation
tailscale status
# Should show your assigned IP: 100.x.x.x
Linux:
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
# Start and authenticate
sudo tailscale up
# Verify installation
sudo tailscale status
Step 2: Install Tailscale on Cloud Server
# SSH into your cloud server
ssh your-server
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
# Start and authenticate (use SAME account as home machine)
sudo tailscale up
# Verify both machines are connected
sudo tailscale status
Step 3: Configure LM Studio on Home Machine
- Launch LM Studio on your GPU machine
- Go to Settings → Server
- Configure:
Host: 0.0.0.0 # Listen on all interfaces
Port: 1234 # Default LM Studio port
CORS: Enabled # Allow cross-origin requests
- Start Server
- Test locally:
curl http://localhost:1234/v1/models
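LM Studio's server exposes an OpenAI-compatible API, so the request above should return a JSON list of the loaded models. To pull out just the model IDs (assuming jq is installed), a minimal sketch:
# List only the model IDs from the OpenAI-compatible /v1/models response
curl -s http://localhost:1234/v1/models | jq -r '.data[].id'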
Step 4: Update Provider Endpoint
In Scrapalot UI:
- Navigate to Settings → Model Providers
- Find LM Studio provider
- Update API Base to http://100.x.x.x:1234 (replace with your home machine's Tailscale IP)
- Set Status to Active
- Save
Step 5: Test Connection
# Test from cloud server
curl http://100.x.x.x:1234/v1/models
# Monitor GPU usage on home machine
nvidia-smi -l 1
Security Best Practices
Enable Tailscale ACLs:
{
  "acls": [
    {
      "action": "accept",
      "src": ["tag:cloud-server"],
      "dst": ["tag:home-gpu:1234"]
    }
  ]
}
Add API Key Authentication:
Configure API key in LM Studio settings and add to .env:
REMOTE_GPU_API_KEY=your-secret-key
Performance Optimization
Dual GPU Configuration:
With multiple GPUs, you can:
Option A: Single Large Model (Recommended)
- Load massive models across both GPUs
- LM Studio supports tensor parallelism
- Example: Llama 3.1 70B across 2x RTX 4500 PRO
Option B: Multiple Providers
- Run different models on each GPU
- Create separate providers in Scrapalot
- Manual load balancing
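One way to set up Option B is to pin each model server process to a single GPU with CUDA_VISIBLE_DEVICES and give each its own port, then register each endpoint as a separate provider in Scrapalot. A hedged sketch using llama-cpp-python's bundled OpenAI-compatible server instead of LM Studio (model paths and ports are placeholders):
# Requires the server extra: pip install "llama-cpp-python[server]"
# GPU 0 serves the first model on port 1234
CUDA_VISIBLE_DEVICES=0 python -m llama_cpp.server --model /models/model-a.gguf --n_gpu_layers -1 --port 1234 &
# GPU 1 serves the second model on port 1235
CUDA_VISIBLE_DEVICES=1 python -m llama_cpp.server --model /models/model-b.gguf --n_gpu_layers -1 --port 1235 &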
Network Optimization:
# On cloud server - enable TCP BBR congestion control
echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Troubleshooting
Common Issues
Issue: CUDA not available
# Check PyTorch installation
python -c "import torch; print(torch.__version__)"
# Should show: 2.x.x+cu121 (not +cpu)
# If showing +cpu, reinstall with --force-reinstall flag
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu121 \
--force-reinstall
Issue: Out of Memory
# Reduce GPU layers in config.yaml
gpu_layers: 20 # Instead of 50
# Use smaller batch sizes
batch_size: 256 # Instead of 512
# Close other GPU applications
Issue: llama-cpp-python not using GPU
# Verify CUDA version is installed
pip show llama-cpp-python
# Should show version with +cu121 suffix
# If not, reinstall:
pip install llama-cpp-python==0.3.16 \
--extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121 \
--force-reinstall
Issue: Cannot connect to remote GPU
# Check Tailscale status on both machines
tailscale status
# Test ping from cloud to home
ping 100.x.x.x
# Check firewall on home machine (Windows)
New-NetFirewallRule -DisplayName "LM Studio" -Direction Inbound -LocalPort 1234 -Protocol TCP -Action Allow
Debug Commands
# Check installed packages
pip list | grep -E "(torch|llama)"
# Verify CUDA toolkit
nvcc --version
# Check GPU utilization during model loading
nvidia-smi -l 1
# Monitor memory usage
watch -n 1 nvidia-smi
Performance Monitoring
Key Metrics to Monitor
- Model Loading Times: Track initialization performance
- GPU Memory Utilization: Monitor VRAM usage
- Inference Latency: Measure response times
- Temperature: Ensure GPU stays within safe limits
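The commands in the next subsection give live views of the GPU-side metrics; to keep a record over time (for example, across a long ingestion run), nvidia-smi can append CSV samples to a log file. A minimal sketch, sampling every 5 seconds until interrupted:
# Append one CSV sample of the key GPU metrics every 5 seconds
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv,noheader -l 5 >> gpu_metrics.csv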
Monitoring Commands
# Real-time GPU monitoring
nvidia-smi -l 1
# Detailed GPU information
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
# Monitor specific processes
nvidia-smi pmon -i 0
Alternative GPU Support
AMD GPUs (ROCm)
For AMD GPUs, use ROCm backend:
# Install ROCm (Ubuntu)
sudo apt install rocm-hip-runtime
# Install PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
Apple Silicon (Metal)
For Apple M1/M2/M3 chips:
# PyTorch with Metal support (included by default on macOS)
pip install torch torchvision torchaudio
# Verify Metal support
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
Intel GPUs
For Intel Arc GPUs:
# Use Vulkan backend (recommended)
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --force-reinstall
Next Steps
- Model Management - Deploy and manage local models
- Deployment Guide - Production deployment strategies
- Cloud Infrastructure - Cloud deployment with GPU
For detailed GPU configuration and advanced optimization, consult the model management documentation.