Architecture Overview

Scrapalot is an open-source, enterprise-grade RAG platform built on a modern, scalable architecture for intelligent document processing and AI-powered knowledge retrieval.

For Open Source Users

This architecture is 100% open source (MIT License). You can self-host everything on your own infrastructure - no vendor lock-in, no hidden costs.

System Overview

Scrapalot connects your documents with AI to deliver accurate, cited answers. The components below show how it works.

Core Components

Web Interface

Modern, responsive React application

Features:

  • Real-time streaming answers
  • Document management
  • Multi-language support (English, Croatian)
  • Dark/light themes
  • Mobile-friendly

Technology:

  • React 18 with TypeScript
  • Tailwind CSS for consistent, responsive styling
  • Real-time WebSocket updates
  • Progressive web app ready

Backend Server

High-performance Python API

Capabilities:

  • Advanced RAG strategies (13 different approaches)
  • Multi-provider AI model support
  • Background document processing
  • Real-time progress updates
  • Secure multi-tenant architecture

Technology:

  • FastAPI for a fast, async-first API
  • Async Python for efficiency
  • WebSocket for real-time communication (sketched below)
  • Comprehensive security
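
The streaming piece can be illustrated with a short, hypothetical sketch. The route path (`/ws/ask`) and the token generator are placeholders, not Scrapalot's actual API; the point is the async WebSocket pattern that FastAPI enables:

```python
# A minimal sketch of an async FastAPI WebSocket endpoint that streams an
# answer token by token. Route path and token source are hypothetical.
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def answer_tokens(question: str):
    # Stand-in for the real RAG pipeline, which would stream model output.
    for token in ["Solar ", "power ", "is ", "renewable."]:
        yield token

@app.websocket("/ws/ask")
async def ask(ws: WebSocket):
    await ws.accept()
    question = await ws.receive_text()
    async for token in answer_tokens(question):
        await ws.send_text(token)  # each token reaches the browser immediately
    await ws.close()
```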

Data Storage

Flexible, powerful databases

Required:

  • PostgreSQL with pgvector - Document storage and vector search

Optional:

  • Redis - Caching for speed
  • Neo4j - Knowledge graph for Graph RAG

Benefits:

  • Automatic backups (with Supabase)
  • Scalable to millions of documents
  • Fast semantic search (query sketch below)
  • Secure multi-tenant isolation
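
To make semantic search concrete, here is a hedged sketch of a pgvector similarity query. The `chunks` table, its columns, and the parameter names are illustrative, not Scrapalot's actual schema:

```python
# Hypothetical pgvector query: nearest chunks by cosine distance (<=>),
# scoped to one workspace for tenant isolation.
import psycopg

QUERY = """
    SELECT id, content
    FROM chunks
    WHERE workspace_id = %(workspace)s
    ORDER BY embedding <=> %(query_vec)s::vector
    LIMIT 5;
"""

def semantic_search(conn: psycopg.Connection, workspace: str, query_vec: list[float]):
    with conn.cursor() as cur:
        # pgvector accepts a '[0.1, 0.2, ...]' literal cast to vector.
        cur.execute(QUERY, {"workspace": workspace, "query_vec": str(query_vec)})
        return cur.fetchall()
```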

AI Models

Your choice of providers

Cloud Options:

  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic (Claude)
  • Google (Gemini)

Local Options:

  • Ollama (easy local server)
  • LM Studio (GPU-accelerated)
  • Direct GGUF models
  • vLLM (production inference)

Flexibility (illustrated below):

  • Mix and match providers
  • Change models anytime
  • No vendor lock-in
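
Because local servers like Ollama expose OpenAI-compatible endpoints, switching providers can be as small as changing a base URL. A hedged sketch (the local URL is Ollama's documented default; model names vary and nothing here is Scrapalot-specific configuration):

```python
# Same client code, two backends: hosted OpenAI vs. a local Ollama server.
from openai import OpenAI

cloud = OpenAI(api_key="sk-...")  # hosted OpenAI API
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama

def ask(client: OpenAI, model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Same call, different backends; no data migration required to switch:
# ask(cloud, "gpt-4o-mini", "Summarize the report.")
# ask(local, "llama3", "Summarize the report.")
```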

RAG Engine: How Search Works

Scrapalot uses Tri-Modal Fusion - three search methods working together:

Why Three Search Methods?

Semantic Search (Vector embeddings):

  • Understands conceptual similarity
  • "renewable energy" finds "solar power"
  • Best for: Conceptual questions

Keyword Search (BM25):

  • Exact term matching
  • "error 221" finds exactly that
  • Best for: Technical terms, codes, specific phrases

Graph Search (Neo4j, optional):

  • Understands relationships
  • "How are X and Y connected?"
  • Best for: Relationship questions
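
Combining the three result lists is the "fusion" in Tri-Modal Fusion. As one illustration (Scrapalot's exact scoring formula isn't documented here), Reciprocal Rank Fusion rewards chunks that rank highly in several methods:

```python
# Reciprocal Rank Fusion (RRF) over ranked chunk-ID lists: an illustrative
# fusion strategy, not necessarily Scrapalot's exact formula.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["c3", "c1", "c7"]  # vector-similarity order (hypothetical IDs)
keyword = ["c7", "c2", "c3"]   # BM25 order
graph = ["c1", "c7"]           # graph-traversal order

print(reciprocal_rank_fusion([semantic, keyword, graph]))
# Chunks that several methods agree on (c7, c1, c3) float to the top.
```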

Intelligent Routing: The system automatically chooses the best method(s) for each question. You don't need to think about it - it just works.
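
Under the hood, routing can be sketched as a simple classifier. Scrapalot's stack lists Pydantic AI for this; the heuristic version below is a deliberately simplified illustration of method selection, not the real router:

```python
# Toy query router: picks retrieval methods from surface features of the
# question. The real system uses an AI-based router (Pydantic AI).
import re

def route(question: str) -> list[str]:
    methods = ["semantic"]  # conceptual vector search is always in play
    if re.search(r"\b(error|code|id)\s*\d+\b", question, re.IGNORECASE):
        methods.append("keyword")  # exact identifiers favor BM25
    if re.search(r"\b(related|connected|between|relationship)\b", question, re.IGNORECASE):
        methods.append("graph")  # relationship questions favor graph search
    return methods

print(route("How are error 221 and the retry policy connected?"))
# -> ['semantic', 'keyword', 'graph']
```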

How Documents Become Knowledge

Processing steps (sketched in code after this list):

  1. Upload - You upload a document
  2. Extract - Text is extracted from PDF, Word, and other formats
  3. Chunk - Intelligently split into searchable segments
  4. Embed - Generate vector embeddings
  5. Index - Store in database with metadata
  6. Ready - Available for search immediately
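
In simplified form, steps 2-4 might look like the sketch below. The helpers are stand-ins (real extraction, chunking, and embedding are far more involved, and the real pipeline runs in background workers):

```python
# Toy ingestion pipeline: extract -> chunk -> embed. Helper bodies are
# placeholders; indexing into PostgreSQL (step 5) is omitted.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict

def extract_text(path: str) -> str:
    # Stand-in: real code parses PDF, Word, and other formats.
    return open(path, encoding="utf-8").read()

def split_into_chunks(text: str, size: int = 1000) -> list[str]:
    # Stand-in: real chunking respects sentences and document structure.
    return [text[i : i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    # Stand-in: real code calls an embedding model.
    return [float(len(text))]

def process_document(path: str) -> list[Chunk]:
    text = extract_text(path)                 # 2. Extract
    pieces = split_into_chunks(text)          # 3. Chunk
    return [                                  # 4. Embed, with metadata
        Chunk(p, embed(p), {"source": path, "position": i})
        for i, p in enumerate(pieces)
    ]
```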

Processing time:

  • Small doc (10 pages): ~30 seconds
  • Medium doc (100 pages): ~2 minutes
  • Large doc (500 pages): ~10 minutes

Security & Privacy

Data Protection

Multi-tenant isolation:

  • Your data is kept completely separate from other tenants'
  • Database-level security (Row Level Security; sketched below)
  • Workspace-based access control
  • Enforced in the database, so application code cannot bypass it
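
The Row Level Security piece can be sketched as a Postgres policy applied once at migration time. Table, column, and setting names here are illustrative, not Scrapalot's actual schema:

```python
# Illustrative RLS setup: every query on `documents` is filtered to the
# workspace ID bound to the current connection, enforced by the database.
import psycopg

ENABLE_RLS = "ALTER TABLE documents ENABLE ROW LEVEL SECURITY;"

WORKSPACE_POLICY = """
CREATE POLICY workspace_isolation ON documents
    USING (workspace_id = current_setting('app.workspace_id')::uuid);
"""

def apply_policies(conn: psycopg.Connection) -> None:
    with conn.cursor() as cur:
        cur.execute(ENABLE_RLS)
        cur.execute(WORKSPACE_POLICY)
    conn.commit()
```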

Encryption:

  • All connections encrypted (TLS/SSL)
  • API keys encrypted at rest
  • Secure credential storage

Access control:

  • JWT authentication
  • Role-based permissions
  • Workspace sharing controls
  • Audit logging

Privacy Options

Cloud deployment:

  • Use managed services (Supabase, etc.)
  • Your data isolated in your account
  • Encrypted in transit and at rest

Self-hosted:

  • Complete data sovereignty
  • Never leaves your infrastructure
  • Full control over everything
  • Local AI models for zero external calls

Deployment Options

Quick Start Deployment

Perfect for getting started:

  • Single server deployment
  • Managed database (Supabase free tier)
  • Cloud AI models (OpenAI, Claude)
  • Up and running in about 10 minutes

Requirements:

  • 4GB RAM minimum
  • Docker installed
  • Internet connection

Production Deployment

For serious use:

  • Load-balanced API servers
  • Database with replicas
  • Redis caching layer
  • Background worker pool
  • Monitoring and logging

Scaling:

  • Horizontal API scaling
  • Worker pool sizing
  • Read replicas for database
  • CDN for frontend

Privacy-First Deployment

For maximum data control:

  • Self-hosted infrastructure
  • Local AI models only
  • Air-gapped if needed
  • Complete audit trail

Technology Stack

Frontend

  • React 18 + TypeScript
  • Vite (ultra-fast builds)
  • Tailwind CSS + Shadcn/ui
  • WebSocket for real-time updates
  • i18next for translations

Backend

  • FastAPI (Python)
  • SQLAlchemy ORM
  • LangChain (RAG framework)
  • Pydantic AI (intelligent routing)
  • llama-cpp-python (local models)

Databases

  • PostgreSQL 16+ with pgvector (required)
  • Redis (optional, for caching)
  • Neo4j (optional, for Graph RAG)

AI Providers (all optional)

  • OpenAI, Anthropic, Google Gemini
  • Ollama, vLLM, LM Studio
  • Any OpenAI-compatible endpoint

What Makes Scrapalot Different

Most RAG systems use only vector search. Scrapalot combines three methods for superior accuracy.

Intelligent Routing

AI automatically selects the best search strategy for each question. No manual configuration needed.

Context Expansion

Understands document structure to provide complete context, not just isolated chunks.
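
A minimal sketch of the idea (the neighbour window and chunk list are hypothetical; real context expansion also follows headings and sections):

```python
# Context expansion: return the matched chunk plus its neighbours, so the
# model sees a coherent passage instead of an isolated fragment.
def expand_context(chunks: list[str], hit_index: int, window: int = 1) -> str:
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])

doc_chunks = ["Section intro.", "Solar output doubled.", "Grid costs fell."]
print(expand_context(doc_chunks, hit_index=1))
# -> 'Section intro. Solar output doubled. Grid costs fell.'
```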

Model Flexibility

Use any AI model - cloud, local, or mixed. Switch anytime without data migration.

True Open Source

MIT licensed, self-host everything, no vendor lock-in, active community.

Production Ready

Enterprise security, multi-tenancy, real-time streaming, comprehensive monitoring.

Scrapalot makes advanced RAG accessible. Complex technology, simple experience.

Released under the MIT License.