Data Storage & Organization

Last Updated: March 2026

Scrapalot provides flexible, high-performance data storage built on a dual-database architecture that separates user data from AI data, for both performance and security.

Storage Strategy

Scrapalot uses multiple specialized databases to deliver the best performance for different types of data:

Primary Database: PostgreSQL with pgvector

Two-Database Architecture:

  • scrapalot_backend (Kotlin Backend): User data, auth, workspaces, collections, sessions, messages
  • scrapalot (Python Chat): AI data, document content, chunks, embeddings, research plans

Purpose: Separation of concerns for scalability and security

Benefits:

  • Reliable: ACID-compliant transactions protect against data loss
  • Fast Vector Search: Native support for semantic similarity search
  • Scalable: Handles millions of documents efficiently
  • Flexible: JSON support for custom metadata

Recommended for: All production deployments

Optional Components

Redis:

  • Purpose: Speed up repeat queries and maintain user sessions
  • Benefits: Instant responses for recently asked questions
  • When to use: High-traffic deployments where speed matters
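The caching behavior described above can be sketched with an in-memory stand-in for Redis. This is illustrative only: the class and function names are hypothetical, and in production the `get`/`set` calls would map to Redis commands such as GET and SETEX.

```python
import time

# Minimal in-memory stand-in for the Redis query cache described above.
# With redis-py this would be r.get(key) / r.setex(key, ttl, value).
class QueryCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.store[key]  # expired, like a Redis TTL
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

def answer_query(cache, question, run_llm):
    cached = cache.get(question)
    if cached is not None:
        return cached  # instant response for a recently asked question
    answer = run_llm(question)
    cache.set(question, answer)
    return answer
```

A repeat question returns from the cache without invoking the model again, which is exactly the "instant responses for recently asked questions" benefit above.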

Neo4j:

  • Purpose: Understand relationships between entities in your documents
  • Benefits: Answer questions about how concepts relate to each other
  • When to use: When you need Graph RAG capabilities

Deployment Options

Production deployments use a self-hosted PostgreSQL instance with pgvector running in Docker. This provides complete data sovereignty and no external dependencies.

Why Self-Hosted pgvector Docker?

  • Complete data sovereignty
  • No external service dependencies
  • Full control over configuration and backups
  • pgvector extension pre-installed in Docker image
  • No ongoing service costs

Setup Steps:

  1. Deploy pgvector Docker container (included in docker-compose)
  2. Two databases created automatically: scrapalot_backend and scrapalot
  3. Migrations run automatically on startup (Liquibase for Kotlin, Alembic for Python)
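The steps above correspond roughly to a compose service like the following. This fragment is illustrative only: the service name, image tag, credentials, and init-script path are assumptions, not the project's actual docker-compose file.

```yaml
# Illustrative fragment -- consult the project's own docker-compose for real values.
services:
  postgres:
    image: pgvector/pgvector:pg16   # PostgreSQL image with pgvector pre-installed
    environment:
      POSTGRES_USER: scrapalot
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}   # supplied via environment, never hardcoded
    volumes:
      - pgdata:/var/lib/postgresql/data
      # init script creating the scrapalot_backend and scrapalot databases
      - ./init-db.sql:/docker-entrypoint-initdb.d/init-db.sql:ro
    ports:
      - "5432:5432"
volumes:
  pgdata:
```

Scripts placed in `/docker-entrypoint-initdb.d/` run once on first startup, which is one common way the two databases get created automatically.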

Cross-Service Data Sync

Redis Streams SAGA ensures reliable cross-service data synchronization:

  • Kotlin and Python each own specific tables (see Data Ownership below)
  • Changes propagate via Redis Streams with consumer groups (XADD/XREADGROUP)
  • SAGA pattern: remote DB commits first, ACK via saga_ack stream, then local DB commits
  • Pending message recovery at startup for guaranteed delivery
  • Replaces legacy Redis Pub/Sub for data sync operations
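The commit ordering above can be sketched as follows, with plain Python lists standing in for Redis Streams (append for XADD, pop for XREADGROUP, and the `saga_ack` step for XACK). All names here are illustrative, not Scrapalot's actual code.

```python
# Sketch of the SAGA commit ordering, with lists standing in for Redis Streams.
def saga_sync(change, remote_db, local_db, sync_stream, ack_stream):
    # 1. The owning service publishes the change (XADD to the sync stream).
    sync_stream.append(change)

    # 2. The consumer on the other service commits to the remote DB first.
    pending = sync_stream.pop(0)
    remote_db.append(pending)

    # 3. The consumer acknowledges via the saga_ack stream (XADD).
    ack_stream.append({"id": pending["id"], "status": "committed"})

    # 4. Only after the ACK arrives does the originator commit locally,
    #    so the local DB never gets ahead of the remote one.
    ack = ack_stream.pop(0)
    if ack["status"] == "committed":
        local_db.append(change)
    return ack["status"]
```

The key property is the ordering: the local commit happens strictly after the remote commit has been acknowledged, so a crash in between leaves a pending message that can be recovered at startup.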

Alternative: Self-Hosted PostgreSQL

You can run PostgreSQL yourself without Docker for complete control:

Requirements:

  • PostgreSQL 18 or newer
  • pgvector extension installed
  • Minimum 2GB RAM allocated per database
  • Regular backup strategy
  • Two databases: scrapalot_backend and scrapalot

When to use:

  • Complete data sovereignty required
  • Custom performance tuning needed
  • Integration with existing infrastructure

Data Organization

Workspaces

What: Organizational containers for your content
Use for: Separating projects, teams, or clients
Features:

  • Share with team members
  • Set one as default
  • Control access per workspace

Collections

What: Groups of related documents within a workspace
Use for: Organizing documents by topic, project, or purpose
Features:

  • Choose embedding model per collection
  • Configure chunking settings
  • Add custom metadata
  • Control search scope

Documents

What: Your uploaded files
Supported formats: PDFs, Word documents, text files, and more
Features:

  • Automatic processing and indexing
  • Deduplication (the same file uploaded twice is detected)
  • Processing status tracking
  • Custom metadata support
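Deduplication is typically done by hashing file content, as in this sketch. The function names are hypothetical; the source does not describe Scrapalot's exact implementation, only that duplicates are detected.

```python
import hashlib

# One common way to detect a re-uploaded file: hash its content and
# compare against hashes already seen. Names here are illustrative.
def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def is_duplicate(data: bytes, known_hashes: set) -> bool:
    return content_hash(data) in known_hashes
```

Because the hash depends only on the bytes, the same file is flagged even if it is uploaded under a different filename.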

Document Chunks

What: Smart segments of your documents optimized for retrieval
Benefits:

  • Precise answers from exact document sections
  • Efficient semantic search
  • Citation with page numbers
  • Context-aware segmentation
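A minimal fixed-size chunker with overlap illustrates the basic idea of splitting a document into retrievable segments. Scrapalot's actual segmentation is context-aware, so treat this only as the simplest possible sketch; the parameter values are arbitrary.

```python
# Sliding-window chunking: fixed-size segments that overlap slightly so
# that sentences spanning a boundary still appear whole in some chunk.
def chunk_text(text, chunk_size=500, overlap=50):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The overlap is what lets a retrieved chunk carry enough surrounding context to support precise, citable answers.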

Security & Access Control

Multi-Tenant Isolation

Automatic Protection:

  • Your data is completely isolated from other users
  • Database-level security (Row Level Security)
  • Enforced in the database itself, so it cannot be bypassed by application code
  • Workspace-based access control

How it works:

  1. You only see your workspaces
  2. Your team members only see shared workspaces
  3. Access cascades automatically (workspace → collections → documents)

Sharing & Collaboration

Workspace Sharing:

  • Share entire workspace with team members
  • Control access levels (owner, editor, viewer)
  • Shared users automatically see all collections in workspace

Access Levels:

  • Owner: Full control, can delete and share
  • Editor: Add/edit documents, cannot delete workspace
  • Viewer: Read-only access
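The three access levels form a simple capability matrix, sketched below. The real enforcement lives in the database (Row Level Security); this mapping and its action names are only an illustration of the roles described above.

```python
# Capability matrix for the three workspace roles described above.
PERMISSIONS = {
    "owner":  {"read", "write", "share", "delete_workspace"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

def can(role: str, action: str) -> bool:
    # Unknown roles get no permissions at all.
    return action in PERMISSIONS.get(role, set())
```

Note that the sets are strictly nested (viewer ⊂ editor ⊂ owner), which matches the cascade from read-only access up to full control.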

Data Protection

Backup & Recovery

Backups:

  • Configure your own backup schedule for self-hosted deployments
  • Store backups securely off-site
  • Test restoration regularly

Encryption

Data in Transit:

  • All connections use TLS/SSL encryption
  • API calls encrypted automatically
  • Database connections secured

Data at Rest:

  • Database encryption (configurable per deployment)
  • API keys encrypted in storage
  • Secure credential management

Compliance

Access Logging:

  • Track who accessed what data
  • Audit trail for compliance
  • Security event monitoring

Data Retention:

  • Configure retention policies
  • Automatic cleanup options
  • Export capabilities for archival

Performance Features

Vector Search Optimization

Fast Similarity Search:

  • Optimized indexes for semantic search
  • Efficient nearest-neighbor algorithms
  • Configurable precision/speed trade-off
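To see what the optimized indexes are speeding up, here is the exact baseline they approximate: a brute-force cosine nearest-neighbor scan. pgvector's approximate indexes trade a little precision for large speedups over this exact scan, which is the configurable trade-off mentioned above.

```python
import math

# Exact nearest-neighbor search by cosine similarity. An ANN index gives
# approximately these results without scanning every vector.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, vectors, k=3):
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]  # indices of the k most similar vectors
```

On millions of chunks this linear scan becomes the bottleneck, which is why indexed approximate search is what makes sub-second results possible.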

Search Performance:

  • Sub-second search across millions of chunks
  • Parallel query execution
  • Smart caching for common queries

Connection Management

Automatic Pooling:

  • Efficient connection reuse
  • Handles high concurrent users
  • Automatic scaling within limits
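The reuse pattern behind pooling can be shown with a toy pool. Real deployments rely on the pooling built into their database drivers; this sketch (all names hypothetical) only demonstrates why reuse avoids the cost of reconnecting per request.

```python
# Toy connection pool: idle connections are handed back out instead of
# opening a new one for every request.
class ConnectionPool:
    def __init__(self, factory, max_size=5):
        self.factory = factory      # callable that opens a new connection
        self.max_size = max_size
        self.idle = []
        self.in_use = 0

    def acquire(self):
        if self.idle:
            self.in_use += 1
            return self.idle.pop()  # reuse an existing connection
        if self.in_use >= self.max_size:
            raise RuntimeError("pool exhausted")  # real pools block or queue here
        self.in_use += 1
        return self.factory()

    def release(self, conn):
        self.in_use -= 1
        self.idle.append(conn)      # keep it open for the next caller
```

The `max_size` cap is the "automatic scaling within limits" above: concurrency grows up to the cap, after which requests wait rather than overwhelming the database.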

Performance Monitoring:

  • Track query performance
  • Identify slow operations
  • Optimize based on usage patterns

Migration & Updates

Automatic Migrations

Zero-Downtime Updates:

  • Database schema updates run automatically on startup
  • No manual intervention needed
  • Rollback capability if issues occur

What gets migrated:

  • New features and improvements
  • Security patches
  • Performance optimizations
  • Bug fixes

Cross-Platform Support

Works Everywhere:

  • Development: SQLite for local testing
  • Production: PostgreSQL for scale and reliability
  • Automatic adaptation to environment

Troubleshooting

Common Issues

Cannot Connect to Database:

  • Verify connection parameters are correct
  • Check if database is running/accessible
  • Try a pooled connection if direct connections fail
  • Ensure firewall allows connections

Slow Queries:

  • Check if indexes are built
  • Review query complexity
  • Consider enabling Redis cache
  • Monitor database resource usage

Out of Storage:

  • Review document retention policies
  • Delete unused collections
  • Optimize chunk sizes
  • Check for duplicate uploads

Performance Degradation:

  • Monitor connection pool usage
  • Check for long-running queries
  • Review index health
  • Consider read replicas for scale

Best Practices

Document Management

Organize Effectively:

  • Use workspaces to separate major projects
  • Group related documents in collections
  • Add meaningful metadata
  • Use consistent naming conventions

Performance Optimization:

  • Choose appropriate chunk sizes for your content
  • Select embedding models based on accuracy vs. speed needs
  • Archive old collections when no longer needed
  • Monitor storage usage

Security

Access Control:

  • Use workspace sharing for team collaboration
  • Review access permissions regularly
  • Remove inactive users
  • Use viewer role when edit access not needed

Data Safety:

  • Verify backups are working
  • Test restoration procedures
  • Keep connection credentials secure
  • Use environment variables, never hardcode
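Reading credentials from the environment looks like this. `DATABASE_URL` is a conventional variable name used for illustration, not necessarily the one Scrapalot itself reads.

```python
import os

# Load the connection string from the environment instead of the source tree.
def database_url() -> str:
    url = os.environ.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set")
    return url
```

Failing fast when the variable is missing surfaces configuration mistakes at startup rather than as a confusing connection error later.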

Scaling

Growing Your Deployment:

  • Start with self-hosted pgvector Docker
  • Enable Redis as query volume increases
  • Add Neo4j when graph features needed
  • Consider read replicas for high traffic

Scrapalot handles all database complexity automatically. Just upload your documents and start asking questions; the rest happens behind the scenes.

Released under the MIT License.