Architecture Overview
Scrapalot is an open-source, enterprise-grade RAG platform built on a modern, scalable architecture for intelligent document processing and AI-powered knowledge retrieval.
For Open Source Users
This architecture is 100% open source (MIT License). You can self-host everything on your own infrastructure - no vendor lock-in, no hidden costs.
System Overview
Scrapalot connects your documents with AI to deliver accurate, cited answers. The sections below describe each part of the system.
Core Components
Web Interface
Modern, responsive React application
Features:
- Real-time streaming answers
- Document management
- Multi-language support (English, Croatian)
- Dark/light themes
- Mobile-friendly
Technology:
- React 18 with TypeScript
- Tailwind CSS for styling
- Real-time WebSocket updates
- Progressive web app ready
Backend Server
High-performance Python API
Capabilities:
- Advanced RAG strategies (13 different approaches)
- Multi-provider AI model support
- Background document processing
- Real-time progress updates
- Secure multi-tenant architecture
Technology:
- FastAPI for a high-performance API
- Async Python for efficiency
- WebSocket for real-time communication
- Security: JWT authentication, role-based permissions, and database-level tenant isolation
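As a rough illustration of how background processing and real-time progress updates can fit together in async Python, the sketch below uses only asyncio. The names `ProgressEvent`, `process_document`, and `collect_events` are hypothetical and are not Scrapalot's actual API. In a real deployment each event would be pushed to the client over a WebSocket rather than collected in a list.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class ProgressEvent:
    stage: str    # hypothetical stage name, e.g. "extract"
    percent: int  # overall progress, 0-100


async def process_document(stages: list[str]) -> AsyncIterator[ProgressEvent]:
    """Yield one progress event per completed stage (illustrative only)."""
    total = len(stages)
    for done, stage in enumerate(stages, start=1):
        await asyncio.sleep(0)  # stand-in for real asynchronous work
        yield ProgressEvent(stage=stage, percent=round(100 * done / total))


async def collect_events(stages: list[str]) -> list[ProgressEvent]:
    # A WebSocket handler would send each event to the client instead.
    return [event async for event in process_document(stages)]


events = asyncio.run(collect_events(["extract", "chunk", "embed", "index"]))
```

Because the worker is an async generator, the server can keep handling other requests while a document is being processed.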
Data Storage
Flexible, powerful databases
Required:
- PostgreSQL with pgvector - Document storage and vector search
Optional:
- Redis - Caching for speed
- Neo4j - Knowledge graph for Graph RAG
Benefits:
- Automatic backups (when hosted on Supabase)
- Scalable to millions of documents
- Fast semantic search
- Secure multi-tenant isolation
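To make the vector search part more concrete, here is a minimal sketch of how a nearest-neighbour query against pgvector could be built from Python. The table name `chunks` and the columns `workspace_id`, `content`, and `embedding` are assumptions made for illustration, not Scrapalot's actual schema. The `<=>` operator is pgvector's cosine distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
def to_pgvector_literal(vector: list[float]) -> str:
    """Format a Python list as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def build_search_query(query_embedding: list[float], workspace_id: str, limit: int = 5):
    """Return (sql, params) for a workspace-scoped cosine-distance search.

    Table and column names are hypothetical.
    """
    sql = (
        "SELECT id, content, embedding <=> %s::vector AS distance "
        "FROM chunks "
        "WHERE workspace_id = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_pgvector_literal(query_embedding)
    # Parameter order follows the order of the placeholders in the SQL text.
    params = (literal, workspace_id, literal, limit)
    return sql, params


sql, params = build_search_query([0.5, 0.25], "ws-1", limit=3)
```

Passing the values as parameters, instead of formatting them into the SQL string, keeps the query safe from injection.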
AI Models
Your choice of providers
Cloud Options:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
Local Options:
- Ollama (easy local server)
- LM Studio (GPU-accelerated)
- Direct GGUF models
- vLLM (production inference)
Flexibility:
- Mix and match providers
- Change models anytime
- No vendor lock-in
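Mixing providers and switching models usually comes down to hiding each provider behind a common interface. The sketch below shows that idea with a small registry. `ChatProvider`, `EchoProvider`, and `ProviderRegistry` are hypothetical names, and no real provider SDK is called.

```python
from typing import Optional, Protocol


class ChatProvider(Protocol):
    """Minimal interface every provider adapter is assumed to implement."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in for a real cloud or local provider client."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ProviderRegistry:
    def __init__(self) -> None:
        self._providers: dict[str, ChatProvider] = {}
        self._active: Optional[str] = None

    def register(self, name: str, provider: ChatProvider) -> None:
        self._providers[name] = provider
        if self._active is None:  # the first registered provider is the default
            self._active = name

    def use(self, name: str) -> None:
        if name not in self._providers:
            raise KeyError(f"unknown provider: {name}")
        self._active = name

    def complete(self, prompt: str) -> str:
        if self._active is None:
            raise RuntimeError("no provider registered")
        return self._providers[self._active].complete(prompt)


registry = ProviderRegistry()
registry.register("cloud", EchoProvider("cloud"))
registry.register("local", EchoProvider("local"))
first = registry.complete("hi")   # answered by the default provider
registry.use("local")             # switch without touching stored documents
second = registry.complete("hi")
```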
RAG Engine: How Search Works
Scrapalot uses Tri-Modal Fusion: three search methods working together.
Why Three Search Methods?
Semantic Search (Vector embeddings):
- Understands conceptual similarity
- "renewable energy" finds "solar power"
- Best for: Conceptual questions
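Semantic similarity is commonly measured as the cosine of the angle between two embedding vectors. The sketch below uses tiny made-up three-dimensional vectors. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
renewable_energy = [0.9, 0.8, 0.1]
solar_power = [0.8, 0.9, 0.2]
tax_law = [0.1, 0.0, 0.9]

related = cosine_similarity(renewable_energy, solar_power)
unrelated = cosine_similarity(renewable_energy, tax_law)
```

With these toy values, `related` is much closer to 1 than `unrelated`, which is how a query can match a document that shares no words with it.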
Keyword Search (BM25):
- Exact term matching
- "error 221" finds exactly that
- Best for: Technical terms, codes, specific phrases
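For comparison, here is a compact sketch of BM25 scoring over whitespace-tokenized text. It uses the common defaults k1 = 1.5 and b = 0.75 and a simplified tokenizer. These are assumptions of the example, not documented Scrapalot settings.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25 (simplified tokenizer)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    terms = query.lower().split()
    # Document frequency: in how many documents each query term appears.
    df = {term: sum(1 for tokens in tokenized if term in tokens) for term in terms}

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "Device returned error 221 during sync",
    "Error handling guide for the sync service",
    "Release notes for version 2.2.1",
]
scores = bm25_scores("error 221", docs)
```

Only the first document contains the exact token "221", so it scores highest. The third document shares no token with the query and scores zero.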
Graph Search (Neo4j, optional):
- Understands relationships
- Answers questions like "How are X and Y connected?"
- Best for: Relationship questions
Intelligent Routing: The system automatically chooses the best method or combination of methods for each question, so you don't have to select a search strategy yourself.
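This page does not specify how the three result lists are merged. One widely used technique is reciprocal rank fusion (RRF), sketched below as an assumption rather than as Scrapalot's confirmed method. The constant k = 60 is the conventional default, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["doc_a", "doc_b", "doc_c"]  # from vector search
keyword = ["doc_b", "doc_d"]            # from BM25
graph = ["doc_b", "doc_a"]              # from graph search; may be empty if Neo4j is off
fused = reciprocal_rank_fusion([semantic, keyword, graph])
```

RRF only needs ranks, not raw scores, which makes it convenient when the methods produce scores on different scales. An empty list simply contributes nothing, which suits an optional graph backend.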
How Documents Become Knowledge
Processing steps:
- Upload - You upload a document
- Extract - Text is extracted from PDF, Word, and other formats
- Chunk - The text is split intelligently into searchable segments
- Embed - Vector embeddings are generated for each segment
- Index - Segments are stored in the database with metadata
- Ready - The document is available for search as soon as indexing finishes
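The chunk step above can be sketched in its simplest form as fixed-size word windows with overlap. This is only one basic strategy, shown under the assumption of whitespace tokenization. The function name and parameters are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

Each chunk would then be embedded and stored with metadata such as the source document and position.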
Processing time:
- Small doc (10 pages): ~30 seconds
- Medium doc (100 pages): ~2 minutes
- Large doc (500 pages): ~10 minutes
Security & Privacy
Data Protection
Multi-tenant isolation:
- Your data is kept completely separate from other tenants' data
- Database-level security (Row Level Security)
- Workspace-based access control
- Enforced by the database rather than by application code alone
Encryption:
- All connections encrypted (TLS/SSL)
- API keys encrypted at rest
- Secure credential storage
Access control:
- JWT authentication
- Role-based permissions
- Workspace sharing controls
- Audit logging
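To show how JWT authentication and role-based permissions can work together, here is a standard-library sketch of HS256 token signing, verification, and a workspace role check. The `workspaces` claim, the role names, and all function names are assumptions made for illustration. A production system would normally rely on a maintained JWT library instead of hand-written verification.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Return the claims if the signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


ROLE_ORDER = ["viewer", "editor", "admin"]  # hypothetical roles, weakest first


def can_access(claims: dict, workspace_id: str, required_role: str) -> bool:
    """Check the (hypothetical) per-workspace role stored in the token."""
    role = claims.get("workspaces", {}).get(workspace_id)
    return role in ROLE_ORDER and ROLE_ORDER.index(role) >= ROLE_ORDER.index(required_role)


SECRET = b"example-secret"  # placeholder; never hard-code real secrets
token = sign_token(
    {"sub": "user-1", "exp": int(time.time()) + 3600, "workspaces": {"ws-1": "editor"}},
    SECRET,
)
claims = verify_token(token, SECRET)
```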
Privacy Options
Cloud deployment:
- Use managed services (Supabase, etc.)
- Your data is isolated in your account
- Encrypted in transit and at rest
Self-hosted:
- Complete data sovereignty
- Data never leaves your infrastructure
- Full control over everything
- Local AI models for zero external calls
Deployment Options
Quick Start (Recommended)
Perfect for getting started:
- Single server deployment
- Managed database (Supabase free tier)
- Cloud AI models (OpenAI, Anthropic)
- Up and running in about 10 minutes
Requirements:
- 4GB RAM minimum
- Docker installed
- Internet connection
Production Deployment
For production workloads:
- Load-balanced API servers
- Database with replicas
- Redis caching layer
- Background worker pool
- Monitoring and logging
Scaling:
- Horizontal API scaling
- Worker pool sizing
- Read replicas for database
- CDN for frontend
Privacy-First Deployment
For maximum data control:
- Self-hosted infrastructure
- Local AI models only
- Air-gapped if needed
- Complete audit trail
Technology Stack
Frontend
- React 18 + TypeScript
- Vite (ultra-fast builds)
- Tailwind CSS + Shadcn/ui
- WebSocket for real-time updates
- i18next for translations
Backend
- FastAPI (Python)
- SQLAlchemy ORM
- LangChain (RAG framework)
- Pydantic AI (intelligent routing)
- llama-cpp-python (local models)
Databases
- PostgreSQL 16+ with pgvector (required)
- Redis (optional, for caching)
- Neo4j (optional, for Graph RAG)
AI Providers (all optional)
- OpenAI, Anthropic, Google Gemini
- Ollama, vLLM, LM Studio
- Any OpenAI-compatible endpoint
What Makes Scrapalot Different
Tri-Modal Fusion Search
Most RAG systems use only vector search. Scrapalot combines semantic, keyword, and graph search so that each method covers cases the others miss.
Intelligent Routing
AI automatically selects the best search strategy for each question. No manual configuration needed.
Context Expansion
Understands document structure to provide complete context, not just isolated chunks.
Model Flexibility
Use any AI model - cloud, local, or mixed. Switch anytime without data migration.
True Open Source
MIT licensed, self-host everything, no vendor lock-in, active community.
Production Ready
Enterprise security, multi-tenancy, real-time streaming, comprehensive monitoring.
Next Steps
Explore More
Get Started:
- Quick Start Guide - Running in 10 minutes
- Deployment Guide - Production setup
- User Guide - Feature overview
Learn the Features:
- RAG Strategy - 13 search strategies explained
- Document Processing - 17+ chunking methods
- Context Expansion - Smart document understanding
- Graph RAG - Relationship-aware search
Advanced Topics:
- Model Management - Choosing and configuring AI models
- Database Design - Data storage and organization
- External Connectors - Auto-sync from cloud sources
- Security - Access control and data protection
Scrapalot makes advanced RAG accessible. Complex technology, simple experience.