# Production Deployment Guide
This comprehensive guide covers deploying Scrapalot in production environments, from single-server deployments to scalable cloud architectures.
## Deployment Overview

### Deployment Options
Choose the deployment method that best fits your needs:
- **Docker Compose**: Single-server deployment (recommended for most use cases)
- **Cloud Platforms**: Cloud deployment examples (experimental)
- **VPS Deployment**: Traditional server deployment
- **Edge Deployment**: Local/on-premises deployment
### Architecture Components
## Docker Compose Deployment

### Cloud Deployment Quick Start
For cloud deployment with CI/CD, the complete workflow combines the following key features:
- External Supabase PostgreSQL (no local DB)
- GitHub Actions CI/CD
- Nginx Proxy Manager for SSL
- Docker Compose with automated deployments
- Optional GPU/Vulkan support
For detailed cloud deployment instructions, see the Cloud Infrastructure Guide.
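As an illustration of that feature set, a minimal compose fragment might wire the backend to an external Supabase PostgreSQL instance. The image name, service name, and `SUPABASE_*` variables below are assumptions, so treat this as a sketch rather than the shipped `docker-compose.yaml`:

```yaml
# Sketch only: names and variables are illustrative, not the shipped compose file
services:
  scrapalot-chat:
    image: ghcr.io/your-org/scrapalot-chat:latest   # image pushed by the CI/CD pipeline
    environment:
      # Point at external Supabase PostgreSQL instead of a local DB container
      POSTGRES_HOST: ${SUPABASE_DB_HOST}
      POSTGRES_PORT: ${SUPABASE_DB_PORT:-5432}
      POSTGRES_PASSWORD: ${SUPABASE_DB_PASSWORD}
    ports:
      - "8090:8090"   # fronted by Nginx Proxy Manager, which terminates SSL
```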
### Production Docker Compose Setup

#### Step 1: Environment Configuration
Create a production environment file based on the template:
```bash
cp docker-scrapalot/example.env docker-scrapalot/.env
# Edit docker-scrapalot/.env with your production values
```

**Key variables to configure:**
- Database credentials (`POSTGRES_PASSWORD`, `REDIS_PASSWORD`)
- LLM provider settings and API keys
- Model directory configuration
- Neo4j credentials (optional)
- GPU and Vulkan support settings
See `docker-scrapalot/example.env` for the complete template.
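For orientation, an excerpt of such a file might look like the following. Apart from `POSTGRES_PASSWORD`, `REDIS_PASSWORD`, `LLM_MODELS_DIRECTORY`, and `LLM_VULKAN_ENABLED`, which appear elsewhere in this guide, the variable names are assumptions:

```bash
# Illustrative excerpt only; see example.env for the authoritative template
POSTGRES_PASSWORD=change-me-in-production
REDIS_PASSWORD=change-me-too
LLM_MODELS_DIRECTORY=/app/data/models
LLM_VULKAN_ENABLED=false
# NEO4J_PASSWORD=...   # assumed name; only needed if Neo4j is enabled
```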
#### Step 2: Production Dockerfile

The main configuration is in `docker-scrapalot/Dockerfile`, with the following features:
- Python FastAPI backend with all dependencies
- Integrated GPU acceleration support (Vulkan/CUDA)
- Vulkan support enabled via build arguments
- Model directory mounting for persistent storage
- GPU-aware health checks and proper logging
### Deployment Commands

#### Development Deployment
```bash
# Navigate to docker directory
cd docker-scrapalot

# 1. Prepare environment
cp example.env .env
# Edit .env with your values

# 2. Build and start services
docker-compose up -d

# 3. Verify deployment
curl -f http://localhost:8090/health
```

#### Production Deployment
```bash
# Navigate to docker directory
cd docker-scrapalot

# 1. Prepare environment
cp example.env .env
# Edit .env with production values

# 2. Build and start
docker-compose -f docker-compose.yaml up -d

# 3. Check service health
docker-compose ps
docker-compose logs scrapalot-chat

# 4. Verify all services are running
curl -f http://localhost:8090/health
curl -f http://localhost:8091/health  # LLM service
```

#### GPU-Accelerated Deployment (Vulkan)
```bash
# Navigate to docker directory
cd docker-scrapalot

# 1. Build with Vulkan support
docker build \
  --build-arg CMAKE_ARGS="-DLLAMA_VULKAN=ON" \
  -f Dockerfile \
  -t scrapalot-chat:latest ..

# 2. Enable Vulkan in environment
export LLM_VULKAN_ENABLED=true
export LLM_VULKAN_PREFER=true

# 3. Deploy with docker-compose
docker-compose -f docker-compose.yaml up -d

# 4. Verify GPU acceleration
docker-compose logs scrapalot-chat | grep -i vulkan
```

## Configuration Management
### Advanced Configuration System

Scrapalot uses a comprehensive YAML-based configuration system located at `configs/config.yaml`. Rather than relying on bare environment variables, the YAML file centralizes configuration, with environment variables acting as overrides.

### Key Configuration Sections

**Server & Infrastructure:**
```yaml
# Server Configuration
host: "0.0.0.0"
port: 8090
workers: 4
log_level: "info"

# Redis Configuration
redis:
  host: ${REDIS_HOST:-localhost}
  port: ${REDIS_PORT:-6479}
  password: ${REDIS_PASSWORD:-""}
  db: 0

# PostgreSQL Configuration
postgres:
  host: ${POSTGRES_HOST:-localhost}
  port: ${POSTGRES_PORT:-15432}
  db: ${POSTGRES_DB:-scrapalot}
  user: ${POSTGRES_USER:-scrapalot}
  password: ${POSTGRES_PASSWORD:-scrapalot}
```

**LLM & Model Management:**
```yaml
llm:
  models_directory: ${LLM_MODELS_DIRECTORY:-models}
  max_parallel_chats: ${LLM_MAX_PARALLEL_CHATS:-1}
  max_loaded_models: ${LLM_MAX_LOADED_MODELS:-1}
  # Advanced model configuration
  advanced:
    gpu_layers: ${LLM_GPU_LAYERS:-auto}
    context_size: ${LLM_CONTEXT_SIZE:-32768}
    batch_size: ${LLM_BATCH_SIZE:-1024}
    threads: ${LLM_THREADS:-4}
```

**Document Processing:**
```yaml
documents:
  max_concurrent_jobs_per_user: 3
  batch_size: 10
  timeout: 300
  max_file_size_mb: 10
  upload_path: ${UPLOAD_PATH:-data/upload}
```

### Model Directory Structure
The system automatically organizes models by type:
```
models/
├── gguf/             # LLM models in GGUF format
├── huggingface/      # Non-embedding HuggingFace models
└── embeddings/       # All embedding models
    ├── gguf/         # GGUF embedding models
    └── huggingface/  # HuggingFace embedding models
```

**Automatic model type detection:** The system automatically routes downloaded models to the correct directory based on model name patterns, as sketched below.
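A minimal sketch of what such pattern-based routing could look like; the function name and the name patterns are illustrative, not Scrapalot's actual detection code:

```python
from pathlib import Path

def route_model(model_name: str, models_dir: Path) -> Path:
    """Pick a target directory from name patterns (illustrative heuristics only)."""
    name = model_name.lower()
    is_gguf = name.endswith(".gguf") or "gguf" in name
    # Hypothetical embedding-name markers; the real detector may use different patterns
    is_embedding = any(tag in name for tag in ("embed", "bge-", "e5-", "minilm"))
    if is_embedding:
        return models_dir / "embeddings" / ("gguf" if is_gguf else "huggingface")
    return models_dir / ("gguf" if is_gguf else "huggingface")

print(route_model("bge-small-en.gguf", Path("models")))  # models/embeddings/gguf
```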
### Environment Variable Integration

The configuration system supports environment variable overrides using the `${VAR_NAME:-default}` syntax:
```bash
# Override specific settings
export LLM_MODELS_DIRECTORY="/custom/models/path"
export POSTGRES_PASSWORD="secure_password"
export LLM_GPU_LAYERS="50"
```
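To make the syntax concrete, here is a minimal sketch of how `${VAR_NAME:-default}` expansion can be implemented; it illustrates the semantics only and is not Scrapalot's actual config loader:

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(raw: str) -> str:
    """Replace each ${NAME:-default} with the env value, falling back to the default."""
    return _PATTERN.sub(lambda m: os.environ.get(m.group(1), m.group(2) or ""), raw)

print(expand_env("port: ${REDIS_PORT:-6479}"))  # "port: 6479" unless REDIS_PORT is set
```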
### Production Configuration Tips

- Security: Always override default passwords in production
- Performance: Adjust `gpu_layers`, `context_size`, and `batch_size` based on hardware
- Storage: Configure `models_directory` and `upload_path` for persistent storage
- Scaling: Set `max_parallel_chats` and `workers` based on expected load
## Cloud Platform Deployments

**Experimental:** The AWS and GCP deployment configurations below are experimental and have not been fully tested in production. They are provided as starting templates for users who wish to deploy on these platforms. For a production-ready path, see the Cloud Infrastructure Guide, which covers the tested Docker Compose deployment with CI/CD.
### AWS Deployment with ECS

#### ECS Task Definition
Example ECS task definition for production deployment:
```json
{
  "family": "scrapalot-backend",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "2048",
  "memory": "4096",
  "executionRoleArn": "arn:aws:iam::account:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::account:role/ecsTaskRole",
  "containerDefinitions": [
    {
      "name": "scrapalot-backend",
      "image": "your-account.dkr.ecr.region.amazonaws.com/scrapalot-backend:latest",
      "portMappings": [
        {
          "containerPort": 8090,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "ENVIRONMENT",
          "value": "prod"
        },
        {
          "name": "DATABASE_URL",
          "value": "postgresql://user:pass@rds-endpoint:5432/scrapalot"
        }
      ],
      "secrets": [
        {
          "name": "SECRET_KEY",
          "valueFrom": "arn:aws:secretsmanager:region:account:secret:scrapalot/secret-key"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/scrapalot-backend",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8090/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```

### Google Cloud Platform Deployment
**Complete configuration:** `gcp/` directory
| Component | File | Description |
|---|---|---|
| Cloud Run Service | gcp/cloudrun.yaml | Serverless backend deployment |
| Cloud SQL | gcp/cloudsql.yaml | Managed PostgreSQL database |
| Deployment Script | gcp/deploy.sh | Automated deployment script |
```bash
# Quick deployment
chmod +x gcp/deploy.sh && ./gcp/deploy.sh
```

## SSL/TLS Configuration
### Automated SSL Setup
Use the automated SSL setup script with Let's Encrypt:
```bash
# Quick SSL setup with Let's Encrypt
sudo ./scripts/setup_ssl.sh yourdomain.com api.yourdomain.com
```

**Files referenced:**
- `scripts/setup_ssl.sh`: Automated SSL configuration
- Configuration integrates with Nginx Proxy Manager
## Monitoring and Observability

### Application Metrics
Monitor key application metrics with Prometheus:
| Metric | Type | Description |
|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests by method, endpoint, status |
| `http_request_duration_seconds` | Histogram | HTTP request duration |
| `websocket_connections_active` | Gauge | Active WebSocket connections |
| `document_processing_seconds` | Histogram | Document processing time |
**Configuration:**

- `monitoring/prometheus.yml`: Metrics collection and alerting
- `monitoring/grafana/dashboards/`: Pre-built dashboards
- `monitoring/alert_rules.yml`: Production alerting rules
- `monitoring/exporters/`: Database and service exporters
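A minimal scrape job for these metrics might look like the sketch below; the job name and the `/metrics` path are assumptions, so check `monitoring/prometheus.yml` for the real settings:

```yaml
# Illustrative scrape job; defer to monitoring/prometheus.yml
scrape_configs:
  - job_name: "scrapalot-backend"
    metrics_path: /metrics            # assumed exposition path
    static_configs:
      - targets: ["scrapalot-chat:8090"]
```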
### Example Metrics Implementation
```python
from prometheus_client import Counter, Histogram, Gauge

# Metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds',
                             'HTTP request duration')
ACTIVE_CONNECTIONS = Gauge('websocket_connections_active',
                           'Active WebSocket connections')
DOCUMENT_PROCESSING_TIME = Histogram('document_processing_seconds',
                                     'Document processing time')
```
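One way these metrics could be recorded is a FastAPI middleware like the sketch below; it builds on the metric objects defined above but is not Scrapalot's actual instrumentation:

```python
import time

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    # Time the request and record count/duration using the metrics defined above
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_DURATION.observe(time.perf_counter() - start)
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=str(response.status_code),
    ).inc()
    return response
```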
## Backup and Recovery

### Database Backup Strategy
```bash
#!/bin/bash
# scripts/backup_database.sh

BACKUP_DIR="/backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="scrapalot_backup_${DATE}.sql"

# Create backup directory
mkdir -p $BACKUP_DIR

# Perform backup
pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB > $BACKUP_DIR/$BACKUP_FILE

# Compress backup
gzip $BACKUP_DIR/$BACKUP_FILE

# Upload to S3 (optional)
aws s3 cp $BACKUP_DIR/${BACKUP_FILE}.gz s3://your-backup-bucket/postgres/

# Clean up old backups (keep last 30 days)
find $BACKUP_DIR -name "*.gz" -mtime +30 -delete

echo "Backup completed: ${BACKUP_FILE}.gz"
```

### Disaster Recovery Plan
**Recovery procedures:**

**Database failure:**

1. Stop application services
2. Restore from the latest backup (see the sketch below)
3. Run database migrations if needed
4. Restart services
5. Verify functionality
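A restore might mirror the backup script's conventions, as in this sketch; the backup filename is a placeholder:

```bash
# Illustrative restore, the inverse of scripts/backup_database.sh
BACKUP_FILE="scrapalot_backup_YYYYMMDD_HHMMSS.sql.gz"   # substitute the latest backup
gunzip -c /backups/postgres/$BACKUP_FILE | \
  psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB
```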
**Complete system failure:**

1. Deploy infrastructure from IaC
2. Restore database from backup
3. Restore Redis data if available
4. Deploy application containers
5. Restore uploaded files from backup
6. Update DNS if needed
7. Verify all services
**Backup schedule:**
- Database: Daily at 2 AM UTC
- Files: Daily at 3 AM UTC
- Configuration: On every change
**Recovery objectives:**
- Recovery Time Objective (RTO): 4 hours
- Recovery Point Objective (RPO): 1 hour
## Local Model Deployment

### Model Deployment Architecture

Scrapalot implements a local model deployment system with dual activation pathways, each designed for a different production use case.
#### Deployment Flow Overview

### Production Deployment Strategies

#### Container-Based Model Deployment

**Recommended for:** Production environments with standardized model requirements
```dockerfile
# Dockerfile.models - Specialized container for model serving
FROM python:3.12-slim

# Install model dependencies
RUN pip install llama-cpp-python==0.3.8

# Create models directory with proper permissions
RUN mkdir -p /app/data/models/gguf && \
    chmod 755 /app/data/models

# Copy pre-downloaded models
COPY models/ /app/data/models/

# Set model service configuration
ENV LLM_MODELS_DIRECTORY=/app/data/models
ENV LLM_PROVIDER=local

WORKDIR /app
CMD ["python", "-m", "src.main.service.local_models.model_service"]
```
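Building and running it could look like this; the image tag and build context are illustrative:

```bash
# Build the model-serving image and run it with GPU access
# (--gpus all requires the NVIDIA Container Toolkit)
docker build -f Dockerfile.models -t scrapalot-models:latest .
docker run -d --gpus all scrapalot-models:latest
```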
#### Hardware Resource Planning

| Model Size | CPU RAM | GPU VRAM | Container Memory Limit |
|---|---|---|---|
| 1-3B | 8GB | 4GB | 12GB |
| 7B | 16GB | 8GB | 24GB |
| 13B | 32GB | 16GB | 48GB |
| 70B+ | 128GB | 40GB+ | 160GB+ |
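Translating a row of this table into a compose-level memory cap might look like the sketch below; whether you use `deploy.resources.limits` or the legacy `mem_limit` key depends on your Compose version:

```yaml
# Sketch: cap the backend at the 7B row's 24GB recommendation
services:
  scrapalot-chat:
    deploy:
      resources:
        limits:
          memory: 24g
```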
### Production Configuration

#### Optimized Model Service Configuration
```yaml
# config/production.yaml
llm:
  provider: "local"
  models_directory: "/app/data/models"
  max_loaded_models: 2
  advanced:
    gpu_layers: 40      # Conservative GPU usage
    context_size: 4096  # Balanced context size
    batch_size: 256     # Optimized for throughput
    threads: 6          # Leave cores for other processes
    use_mlock: true     # Keep models in memory
    use_mmap: true      # Enable memory mapping
```

#### Health Checks for Model Services
```yaml
services:
  scrapalot-backend:
    healthcheck:
      test: [
        "CMD", "curl", "-f",
        "http://localhost:8090/llm-inference/system-capabilities"
      ]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # Allow time for model loading
```

### Key Deployment Endpoints
**Model activation pathways:**

- `POST /llm-inference/models/{model_id}/start-gpu`: Direct GPU activation
- `POST /llm-inference/deploy-model`: Service-based deployment
- `GET /llm-inference/system-capabilities`: Hardware capabilities check
- `GET /llm-inference/deployment-status`: Deployment status monitoring
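A typical service-based deployment sequence could look like the sketch below; the GET endpoints appear elsewhere in this guide, but the POST request body is a guess, so consult the API reference for the real schema:

```bash
# Check hardware capabilities before deploying
curl http://localhost:8090/llm-inference/system-capabilities

# Trigger a service-based deployment (request body is illustrative)
curl -X POST http://localhost:8090/llm-inference/deploy-model \
  -H "Content-Type: application/json" \
  -d '{"model_id": "my-model"}'

# Poll deployment progress
curl http://localhost:8090/llm-inference/deployment-status
```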
### Monitoring and Troubleshooting

#### Key Metrics to Monitor
- Model Loading Times: Track initialization performance
- GPU Memory Utilization: Monitor VRAM usage patterns
- Inference Latency: Measure response times
- Model Switching Frequency: Optimize for usage patterns
- Threading Health: Monitor background thread performance
#### Common Issues and Solutions

**Model loading failures:**

```bash
# Check model file permissions
docker-compose exec scrapalot-backend ls -la /app/data/models/

# Verify model service threading
docker-compose logs scrapalot-backend | grep -i "load_model_thread"

# Check deployment status
curl http://localhost:8090/llm-inference/deployment-status
```

**Memory issues:**
```bash
# Monitor container memory usage
docker stats scrapalot-backend

# Check GPU memory
docker-compose exec scrapalot-backend nvidia-smi

# Review model configuration
docker-compose exec scrapalot-backend cat /app/config.yaml
```

### Best Practices Summary
- Dual Pathway Understanding: Choose between direct GPU activation and service-based deployment based on use case
- Resource Planning: Allocate sufficient memory and GPU resources based on model requirements
- Health Monitoring: Implement comprehensive health checks with appropriate timeouts for model loading
- Threading Awareness: Monitor background thread performance and resource isolation
- Configuration Management: Use `config.yaml` for standardized model service deployments
- Performance Monitoring: Track model loading times, inference latency, and resource utilization
For detailed model management information, see: Model Management Guide
## Windows Conda Environment GPU Setup

### CUDA 12.1 Installation for Windows

Based on real-world deployment experience, this is a complete guide to setting up GPU acceleration in a Windows conda environment.

#### Prerequisites
- Windows 10/11 with an NVIDIA GPU
- A conda environment (e.g., `scrapalot-chat`)
- Latest NVIDIA drivers installed
#### Step-by-Step Installation
**1. Activate the conda environment**

```bash
conda activate scrapalot-chat
```

**2. Upgrade pip**
```bash
# Use full path if needed
C:\python\envs\scrapalot-chat\python.exe -m pip install --upgrade pip
```

**3. Install PyTorch with CUDA 12.1**
```bash
# CRITICAL: Use --force-reinstall to override the CPU version from requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
```

**4. Install llama-cpp-python with CUDA 12.1**
```bash
# Uninstall CPU version first
pip uninstall -y llama-cpp-python

# Install pre-compiled CUDA wheel (NO CMAKE_ARGS needed)
pip install llama-cpp-python==0.3.16 --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121
```

**5. Verify the installation**
```bash
python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
```

#### Common Issues and Solutions
**Issue:** PyTorch shows `2.5.1+cpu` instead of a CUDA build
```bash
# Solution: Force reinstall to override the requirements.txt CPU version
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
```

**Issue:** `CUDA available: False`
- Verify NVIDIA drivers are installed (check Device Manager)
- Ensure you used the `--force-reinstall` flag
- Check whether requirements.txt is overriding with CPU versions
**Issue:** `CMAKE_ARGS` confusion

- Pre-compiled wheels (CUDA): no `CMAKE_ARGS` needed
- Building from source (Vulkan): `CMAKE_ARGS` required
- Use pre-compiled wheels for a faster, easier installation
#### Installation Paths Comparison
| Method | CMAKE_ARGS | Build Time | Compatibility | Recommended |
|---|---|---|---|---|
| CUDA Pre-compiled Wheel | Not needed | Instant | NVIDIA only | Yes |
| Vulkan Build from Source | Required | 10-30 min | Universal GPU | For AMD/Intel |
| CPU Fallback | Not needed | Instant | All systems | Development only |
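For the source-build row, a Windows PowerShell invocation might look like this sketch; the `CMAKE_ARGS` flag mirrors the Docker Vulkan build earlier in this guide:

```powershell
# Build llama-cpp-python from source with Vulkan (the path that needs CMAKE_ARGS)
$env:CMAKE_ARGS = "-DLLAMA_VULKAN=ON"
pip install llama-cpp-python --no-binary llama-cpp-python --force-reinstall
```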
Troubleshooting Commands
# Check installed packages
pip list | findstr torch
pip list | findstr llama
# Verify GPU detection
python -c "import torch; print(f'GPU count: {torch.cuda.device_count()}'); print(f'GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
# Test llama-cpp-python
python -c "import llama_cpp; print('llama-cpp-python imported successfully')"Performance Verification
After successful installation, verify GPU acceleration:
```bash
# Start scrapalot-chat and check logs for:
# - "CUDA Available: True"
# - GPU detection messages
# - Model loading with GPU layers

# Monitor GPU usage during inference
# Use Task Manager > Performance > GPU, or nvidia-smi if available
```

## Next Steps
- GPU Setup Guide - Detailed Vulkan and GPU configuration
- Cloud Infrastructure - Complete cloud deployment with CI/CD
- Architecture Overview - Understand system architecture
- Model Management - Local model deployment strategies
This comprehensive deployment guide provides the foundation for deploying Scrapalot in production environments, from simple single-server deployments to complex, scalable cloud architectures.