# Production Deployment Guide
This comprehensive guide covers deploying Scrapalot in production environments, from single-server deployments to scalable cloud architectures.
## Deployment Overview

### Deployment Options
Choose the deployment method that best fits your needs:
- **Docker Compose**: Single-server deployment (recommended for most use cases)
- **Cloud Platforms**: Cloud deployment examples (experimental)
- **VPS Deployment**: Traditional server deployment
- **Edge Deployment**: Local/on-premises deployment
### Architecture Components
## Docker Compose Deployment

### Cloud Deployment Quick Start
For cloud deployment with CI/CD, the complete workflow combines the following key features:
- External Supabase PostgreSQL (no local DB)
- GitHub Actions CI/CD
- Nginx Proxy Manager for SSL
- Docker Compose with automated deployments
- Optional GPU/Vulkan support
For detailed cloud deployment instructions, see the Cloud Infrastructure Guide.
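As an illustration of that feature set, a minimal compose fragment might wire the backend to an external Supabase PostgreSQL instance. The image name, service name, and `SUPABASE_*` variables below are assumptions, so treat this as a sketch rather than the shipped `docker-compose.yaml`:

```yaml
# Sketch only: names and variables are illustrative, not the shipped compose file
services:
  scrapalot-chat:
    image: ghcr.io/your-org/scrapalot-chat:latest   # image pushed by the CI/CD pipeline
    environment:
      # Point at external Supabase PostgreSQL instead of a local DB container
      POSTGRES_HOST: ${SUPABASE_DB_HOST}
      POSTGRES_PORT: ${SUPABASE_DB_PORT:-5432}
      POSTGRES_PASSWORD: ${SUPABASE_DB_PASSWORD}
    ports:
      - "8090:8090"   # fronted by Nginx Proxy Manager, which terminates SSL
```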
### Production Docker Compose Setup

#### Step 1: Environment Configuration
Create a production environment file based on the template:
```bash
cp docker-scrapalot/example.env docker-scrapalot/.env
# Edit docker-scrapalot/.env with your production values
```

**Key variables to configure:**
- Database credentials (`POSTGRES_PASSWORD`, `REDIS_PASSWORD`)
- LLM provider settings and API keys
- Model directory configuration
- Neo4j credentials (optional)
- GPU and Vulkan support settings
See `docker-scrapalot/example.env` for the complete template.
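For orientation, an excerpt of such a file might look like the following. Apart from `POSTGRES_PASSWORD`, `REDIS_PASSWORD`, `LLM_MODELS_DIRECTORY`, and `LLM_VULKAN_ENABLED`, which appear elsewhere in this guide, the variable names are assumptions:

```bash
# Illustrative excerpt only; see example.env for the authoritative template
POSTGRES_PASSWORD=change-me-in-production
REDIS_PASSWORD=change-me-too
LLM_MODELS_DIRECTORY=/app/data/models
LLM_VULKAN_ENABLED=false
# NEO4J_PASSWORD=...   # assumed name; only needed if Neo4j is enabled
```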
#### Step 2: Production Dockerfile

The main configuration is in `docker-scrapalot/Dockerfile`, with the following features:
- Python FastAPI backend with all dependencies
- Integrated GPU acceleration support (Vulkan/CUDA)
- Vulkan support enabled via build arguments
- Model directory mounting for persistent storage
- GPU-aware health checks and proper logging
### Deployment Commands

#### Development Deployment
```bash
# Navigate to docker directory
cd docker-scrapalot

# 1. Prepare environment
cp example.env .env
# Edit .env with your values

# 2. Build and start services
docker-compose up -d

# 3. Verify deployment
curl -f http://localhost:8090/health
```

#### Production Deployment
```bash
# Navigate to docker directory
cd docker-scrapalot

# 1. Prepare environment
cp example.env .env
# Edit .env with production values

# 2. Build and start
docker-compose -f docker-compose.yaml up -d

# 3. Check service health
docker-compose ps
docker-compose logs scrapalot-chat

# 4. Verify all services are running
curl -f http://localhost:8090/health
curl -f http://localhost:8091/health  # LLM service
```

#### GPU-Accelerated Deployment (Vulkan)
```bash
# Navigate to docker directory
cd docker-scrapalot

# 1. Build with Vulkan support
docker build \
  --build-arg CMAKE_ARGS="-DLLAMA_VULKAN=ON" \
  -f Dockerfile \
  -t scrapalot-chat:latest ..

# 2. Enable Vulkan in environment
export LLM_VULKAN_ENABLED=true
export LLM_VULKAN_PREFER=true

# 3. Deploy with docker-compose
docker-compose -f docker-compose.yaml up -d

# 4. Verify GPU acceleration
docker-compose logs scrapalot-chat | grep -i vulkan
```

## Configuration Management
### Advanced Configuration System

Scrapalot uses a comprehensive YAML-based configuration system located at `configs/config.yaml`. Rather than relying on bare environment variables, the YAML file centralizes configuration, with environment variables acting as overrides.

### Key Configuration Sections

**Server & Infrastructure:**
```yaml
# Server Configuration
host: "0.0.0.0"
port: 8090
workers: 4
log_level: "info"

# Redis Configuration
redis:
  host: ${REDIS_HOST:-localhost}
  port: ${REDIS_PORT:-6479}
  password: ${REDIS_PASSWORD:-""}
  db: 0

# PostgreSQL Configuration
postgres:
  host: ${POSTGRES_HOST:-localhost}
  port: ${POSTGRES_PORT:-15432}
  db: ${POSTGRES_DB:-scrapalot}
  user: ${POSTGRES_USER:-scrapalot}
  password: ${POSTGRES_PASSWORD:-scrapalot}
```

**LLM & Model Management:**
```yaml
llm:
  models_directory: ${LLM_MODELS_DIRECTORY:-models}
  max_parallel_chats: ${LLM_MAX_PARALLEL_CHATS:-1}
  max_loaded_models: ${LLM_MAX_LOADED_MODELS:-1}
  # Advanced model configuration
  advanced:
    gpu_layers: ${LLM_GPU_LAYERS:-auto}
    context_size: ${LLM_CONTEXT_SIZE:-32768}
    batch_size: ${LLM_BATCH_SIZE:-1024}
    threads: ${LLM_THREADS:-4}
```

**Document Processing:**
```yaml
documents:
  max_concurrent_jobs_per_user: 3
  batch_size: 10
  timeout: 300
  max_file_size_mb: 10
  upload_path: ${UPLOAD_PATH:-data/upload}
```

### Model Directory Structure
The system automatically organizes models by type:
```
models/
├── gguf/             # LLM models in GGUF format
├── huggingface/      # Non-embedding HuggingFace models
└── embeddings/       # All embedding models
    ├── gguf/         # GGUF embedding models
    └── huggingface/  # HuggingFace embedding models
```

**Automatic model type detection:** The system automatically routes downloaded models to the correct directory based on model name patterns, as sketched below.
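A minimal sketch of what such pattern-based routing could look like; the function name and the name patterns are illustrative, not Scrapalot's actual detection code:

```python
from pathlib import Path

def route_model(model_name: str, models_dir: Path) -> Path:
    """Pick a target directory from name patterns (illustrative heuristics only)."""
    name = model_name.lower()
    is_gguf = name.endswith(".gguf") or "gguf" in name
    # Hypothetical embedding-name markers; the real detector may use different patterns
    is_embedding = any(tag in name for tag in ("embed", "bge-", "e5-", "minilm"))
    if is_embedding:
        return models_dir / "embeddings" / ("gguf" if is_gguf else "huggingface")
    return models_dir / ("gguf" if is_gguf else "huggingface")

print(route_model("bge-small-en.gguf", Path("models")))  # models/embeddings/gguf
```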
### Environment Variable Integration

The configuration system supports environment variable overrides using the `${VAR_NAME:-default}` syntax:
```bash
# Override specific settings
export LLM_MODELS_DIRECTORY="/custom/models/path"
export POSTGRES_PASSWORD="secure_password"
export LLM_GPU_LAYERS="50"
```
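To make the syntax concrete, here is a minimal sketch of how `${VAR_NAME:-default}` expansion can be implemented; it illustrates the semantics only and is not Scrapalot's actual config loader:

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(raw: str) -> str:
    """Replace each ${NAME:-default} with the env value, falling back to the default."""
    return _PATTERN.sub(lambda m: os.environ.get(m.group(1), m.group(2) or ""), raw)

print(expand_env("port: ${REDIS_PORT:-6479}"))  # "port: 6479" unless REDIS_PORT is set
```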
### Production Configuration Tips

- Security: Always override default passwords in production
- Performance: Adjust `gpu_layers`, `context_size`, and `batch_size` based on hardware
- Storage: Configure `models_directory` and `upload_path` for persistent storage
- Scaling: Set `max_parallel_chats` and `workers` based on expected load
## Cloud Platform Deployments

**Experimental:** The AWS and GCP deployment configurations below are experimental and have not been fully tested in production. They are provided as starting templates for users who wish to deploy on these platforms. For a production-ready path, see the Cloud Infrastructure Guide, which covers the tested Docker Compose deployment with CI/CD.
### AWS Deployment with ECS

#### ECS Task Definition
Example ECS task definition for production deployment:
```json
{
  "family": "scrapalot-backend",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "2048",
  "memory": "4096",
  "executionRoleArn": "arn:aws:iam::account:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::account:role/ecsTaskRole",
  "containerDefinitions": [
    {
      "name": "scrapalot-backend",
      "image": "your-account.dkr.ecr.region.amazonaws.com/scrapalot-backend:latest",
      "portMappings": [
        {
          "containerPort": 8090,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "ENVIRONMENT",
          "value": "prod"
        },
        {
          "name": "DATABASE_URL",
          "value": "postgresql://user:pass@rds-endpoint:5432/scrapalot"
        }
      ],
      "secrets": [
        {
          "name": "SECRET_KEY",
          "valueFrom": "arn:aws:secretsmanager:region:account:secret:scrapalot/secret-key"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/scrapalot-backend",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8090/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```

### Google Cloud Platform Deployment
**Complete configuration:** `gcp/` directory
| Component | File | Description |
|---|---|---|
| Cloud Run Service | gcp/cloudrun.yaml | Serverless backend deployment |
| Cloud SQL | gcp/cloudsql.yaml | Managed PostgreSQL database |
| Deployment Script | gcp/deploy.sh | Automated deployment script |
```bash
# Quick deployment
chmod +x gcp/deploy.sh && ./gcp/deploy.sh
```

## SSL/TLS Configuration
### Automated SSL Setup
Use the automated SSL setup script with Let's Encrypt:
```bash
# Quick SSL setup with Let's Encrypt
sudo ./scripts/setup_ssl.sh yourdomain.com api.yourdomain.com
```

**Files referenced:**
- `scripts/setup_ssl.sh`: Automated SSL configuration
- Configuration integrates with Nginx Proxy Manager
## Monitoring and Observability

### Application Metrics
Monitor key application metrics with Prometheus:
| Metric | Type | Description |
|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests by method, endpoint, status |
| `http_request_duration_seconds` | Histogram | HTTP request duration |
| `websocket_connections_active` | Gauge | Active WebSocket connections |
| `document_processing_seconds` | Histogram | Document processing time |
**Configuration:**

- `monitoring/prometheus.yml`: Metrics collection and alerting
- `monitoring/grafana/dashboards/`: Pre-built dashboards
- `monitoring/alert_rules.yml`: Production alerting rules
- `monitoring/exporters/`: Database and service exporters
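A minimal scrape job for these metrics might look like the sketch below; the job name and the `/metrics` path are assumptions, so check `monitoring/prometheus.yml` for the real settings:

```yaml
# Illustrative scrape job; defer to monitoring/prometheus.yml
scrape_configs:
  - job_name: "scrapalot-backend"
    metrics_path: /metrics            # assumed exposition path
    static_configs:
      - targets: ["scrapalot-chat:8090"]
```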
### Example Metrics Implementation
```python
from prometheus_client import Counter, Histogram, Gauge

# Metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests',
                        ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds',
                             'HTTP request duration')
ACTIVE_CONNECTIONS = Gauge('websocket_connections_active',
                           'Active WebSocket connections')
DOCUMENT_PROCESSING_TIME = Histogram('document_processing_seconds',
                                     'Document processing time')
```
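One way these metrics could be recorded is a FastAPI middleware like the sketch below; it builds on the metric objects defined above but is not Scrapalot's actual instrumentation:

```python
import time

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    # Time the request and record count/duration using the metrics defined above
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_DURATION.observe(time.perf_counter() - start)
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=str(response.status_code),
    ).inc()
    return response
```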
## Backup and Recovery

### Database Backup Strategy
```bash
#!/bin/bash
# scripts/backup_database.sh

BACKUP_DIR="/backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="scrapalot_backup_${DATE}.sql"

# Create backup directory
mkdir -p $BACKUP_DIR

# Perform backup
pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB > $BACKUP_DIR/$BACKUP_FILE

# Compress backup
gzip $BACKUP_DIR/$BACKUP_FILE

# Upload to S3 (optional)
aws s3 cp $BACKUP_DIR/${BACKUP_FILE}.gz s3://your-backup-bucket/postgres/

# Clean up old backups (keep last 30 days)
find $BACKUP_DIR -name "*.gz" -mtime +30 -delete

echo "Backup completed: ${BACKUP_FILE}.gz"
```

### Disaster Recovery Plan
**Recovery procedures:**

**Database failure:**

1. Stop application services
2. Restore from the latest backup (see the sketch below)
3. Run database migrations if needed
4. Restart services
5. Verify functionality
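A restore might mirror the backup script's conventions, as in this sketch; the backup filename is a placeholder:

```bash
# Illustrative restore, the inverse of scripts/backup_database.sh
BACKUP_FILE="scrapalot_backup_YYYYMMDD_HHMMSS.sql.gz"   # substitute the latest backup
gunzip -c /backups/postgres/$BACKUP_FILE | \
  psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB
```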
**Complete system failure:**

1. Deploy infrastructure from IaC
2. Restore database from backup
3. Restore Redis data if available
4. Deploy application containers
5. Restore uploaded files from backup
6. Update DNS if needed
7. Verify all services
**Backup schedule:**
- Database: Daily at 2 AM UTC
- Files: Daily at 3 AM UTC
- Configuration: On every change
**Recovery objectives:**
- Recovery Time Objective (RTO): 4 hours
- Recovery Point Objective (RPO): 1 hour
## Local Model Deployment

### Model Deployment Architecture

Scrapalot implements a local model deployment system with dual activation pathways, each designed for a different production use case.
#### Deployment Flow Overview

### Production Deployment Strategies

#### Container-Based Model Deployment

**Recommended for:** Production environments with standardized model requirements
```dockerfile
# Dockerfile.models - Specialized container for model serving
FROM python:3.12-slim

# Install model dependencies
RUN pip install llama-cpp-python==0.3.8

# Create models directory with proper permissions
RUN mkdir -p /app/data/models/gguf && \
    chmod 755 /app/data/models

# Copy pre-downloaded models
COPY models/ /app/data/models/

# Set model service configuration
ENV LLM_MODELS_DIRECTORY=/app/data/models
ENV LLM_PROVIDER=local

WORKDIR /app
CMD ["python", "-m", "src.main.service.local_models.model_service"]
```
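Building and running it could look like this; the image tag and build context are illustrative:

```bash
# Build the model-serving image and run it with GPU access
# (--gpus all requires the NVIDIA Container Toolkit)
docker build -f Dockerfile.models -t scrapalot-models:latest .
docker run -d --gpus all scrapalot-models:latest
```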
#### Hardware Resource Planning

| Model Size | CPU RAM | GPU VRAM | Container Memory Limit |
|---|---|---|---|
| 1-3B | 8GB | 4GB | 12GB |
| 7B | 16GB | 8GB | 24GB |
| 13B | 32GB | 16GB | 48GB |
| 70B+ | 128GB | 40GB+ | 160GB+ |
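Translating a row of this table into a compose-level memory cap might look like the sketch below; whether you use `deploy.resources.limits` or the legacy `mem_limit` key depends on your Compose version:

```yaml
# Sketch: cap the backend at the 7B row's 24GB recommendation
services:
  scrapalot-chat:
    deploy:
      resources:
        limits:
          memory: 24g
```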
### Production Configuration

#### Optimized Model Service Configuration
```yaml
# config/production.yaml
llm:
  provider: "local"
  models_directory: "/app/data/models"
  max_loaded_models: 2
  advanced:
    gpu_layers: 40      # Conservative GPU usage
    context_size: 4096  # Balanced context size
    batch_size: 256     # Optimized for throughput
    threads: 6          # Leave cores for other processes
    use_mlock: true     # Keep models in memory
    use_mmap: true      # Enable memory mapping
```

#### Health Checks for Model Services
```yaml
services:
  scrapalot-backend:
    healthcheck:
      test: [
        "CMD", "curl", "-f",
        "http://localhost:8090/llm-inference/system-capabilities"
      ]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # Allow time for model loading
```

### Key Deployment Endpoints
**Model activation pathways:**

- `POST /llm-inference/models/{model_id}/start-gpu`: Direct GPU activation
- `POST /llm-inference/deploy-model`: Service-based deployment
- `GET /llm-inference/system-capabilities`: Hardware capabilities check
- `GET /llm-inference/deployment-status`: Deployment status monitoring
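A typical service-based deployment sequence could look like the sketch below; the GET endpoints appear elsewhere in this guide, but the POST request body is a guess, so consult the API reference for the real schema:

```bash
# Check hardware capabilities before deploying
curl http://localhost:8090/llm-inference/system-capabilities

# Trigger a service-based deployment (request body is illustrative)
curl -X POST http://localhost:8090/llm-inference/deploy-model \
  -H "Content-Type: application/json" \
  -d '{"model_id": "my-model"}'

# Poll deployment progress
curl http://localhost:8090/llm-inference/deployment-status
```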
### Monitoring and Troubleshooting

#### Key Metrics to Monitor
- Model Loading Times: Track initialization performance
- GPU Memory Utilization: Monitor VRAM usage patterns
- Inference Latency: Measure response times
- Model Switching Frequency: Optimize for usage patterns
- Threading Health: Monitor background thread performance
#### Common Issues and Solutions

**Model loading failures:**

```bash
# Check model file permissions
docker-compose exec scrapalot-backend ls -la /app/data/models/

# Verify model service threading
docker-compose logs scrapalot-backend | grep -i "load_model_thread"

# Check deployment status
curl http://localhost:8090/llm-inference/deployment-status
```

**Memory issues:**
```bash
# Monitor container memory usage
docker stats scrapalot-backend

# Check GPU memory
docker-compose exec scrapalot-backend nvidia-smi

# Review model configuration
docker-compose exec scrapalot-backend cat /app/config.yaml
```

### Best Practices Summary
- Dual Pathway Understanding: Choose between direct GPU activation and service-based deployment based on use case
- Resource Planning: Allocate sufficient memory and GPU resources based on model requirements
- Health Monitoring: Implement comprehensive health checks with appropriate timeouts for model loading
- Threading Awareness: Monitor background thread performance and resource isolation
- Configuration Management: Use `config.yaml` for standardized model service deployments
- Performance Monitoring: Track model loading times, inference latency, and resource utilization
For detailed model management information, see: Model Management Guide
## Windows Conda Environment GPU Setup

### CUDA 12.1 Installation for Windows

Based on real-world deployment experience, this is a complete guide to setting up GPU acceleration in a Windows conda environment.

#### Prerequisites
- Windows 10/11 with an NVIDIA GPU
- A conda environment (e.g., `scrapalot-chat`)
- Latest NVIDIA drivers installed
#### Step-by-Step Installation
**1. Activate the conda environment**

```bash
conda activate scrapalot-chat
```

**2. Upgrade pip**
```bash
# Use full path if needed
C:\python\envs\scrapalot-chat\python.exe -m pip install --upgrade pip
```

**3. Install PyTorch with CUDA 12.1**
```bash
# CRITICAL: Use --force-reinstall to override the CPU version from requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
```

**4. Install llama-cpp-python with CUDA 12.1**
```bash
# Uninstall CPU version first
pip uninstall -y llama-cpp-python

# Install pre-compiled CUDA wheel (NO CMAKE_ARGS needed)
pip install llama-cpp-python==0.3.16 --extra-index-url https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/cu121
```

**5. Verify the installation**
```bash
python -c "import torch; print('PyTorch version:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
```

#### Common Issues and Solutions
**Issue:** PyTorch shows `2.5.1+cpu` instead of a CUDA build
```bash
# Solution: Force reinstall to override the requirements.txt CPU version
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
```

**Issue:** `CUDA available: False`
- Verify NVIDIA drivers are installed (check Device Manager)
- Ensure you used the `--force-reinstall` flag
- Check whether requirements.txt is overriding with CPU versions
**Issue:** `CMAKE_ARGS` confusion

- Pre-compiled wheels (CUDA): no `CMAKE_ARGS` needed
- Building from source (Vulkan): `CMAKE_ARGS` required
- Use pre-compiled wheels for a faster, easier installation
#### Installation Paths Comparison
| Method | CMAKE_ARGS | Build Time | Compatibility | Recommended |
|---|---|---|---|---|
| CUDA Pre-compiled Wheel | Not needed | Instant | NVIDIA only | Yes |
| Vulkan Build from Source | Required | 10-30 min | Universal GPU | For AMD/Intel |
| CPU Fallback | Not needed | Instant | All systems | Development only |
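For the source-build row, a Windows PowerShell invocation might look like this sketch; the `CMAKE_ARGS` flag mirrors the Docker Vulkan build earlier in this guide:

```powershell
# Build llama-cpp-python from source with Vulkan (the path that needs CMAKE_ARGS)
$env:CMAKE_ARGS = "-DLLAMA_VULKAN=ON"
pip install llama-cpp-python --no-binary llama-cpp-python --force-reinstall
```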
Troubleshooting Commands
# Check installed packages
pip list | findstr torch
pip list | findstr llama
# Verify GPU detection
python -c "import torch; print(f'GPU count: {torch.cuda.device_count()}'); print(f'GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
# Test llama-cpp-python
python -c "import llama_cpp; print('llama-cpp-python imported successfully')"Performance Verification
After successful installation, verify GPU acceleration:
```bash
# Start scrapalot-chat and check logs for:
# - "CUDA Available: True"
# - GPU detection messages
# - Model loading with GPU layers

# Monitor GPU usage during inference
# Use Task Manager > Performance > GPU, or nvidia-smi if available
```

## Next Steps
- GPU Setup Guide - Detailed Vulkan and GPU configuration
- Cloud Infrastructure - Complete cloud deployment with CI/CD
- Architecture Overview - Understand system architecture
- Model Management - Local model deployment strategies
This comprehensive deployment guide provides the foundation for deploying Scrapalot in production environments, from simple single-server deployments to complex, scalable cloud architectures.