Data Storage & Organization
Last Updated: March 2026
Scrapalot provides flexible, high-performance data storage with a dual-database architecture that separates user data from AI data.
Storage Strategy
Scrapalot uses multiple specialized databases to deliver the best performance for different types of data:
Primary Database: PostgreSQL with pgvector
Two-Database Architecture:
- scrapalot_backend (Kotlin backend): user data, auth, workspaces, collections, sessions, messages
- scrapalot (Python chat): AI data, document content, chunks, embeddings, research plans
Purpose: Separation of concerns for scalability and security
Benefits:
- Reliable: ACID-compliant transactions ensure your data is never lost
- Fast Vector Search: Native support for semantic similarity search
- Scalable: Handles millions of documents efficiently
- Flexible: JSON support for custom metadata
Recommended for: All production deployments
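Under the hood, pgvector ranks chunks by vector distance. A minimal Python sketch of the cosine-distance ordering behind its <=> operator (the chunk IDs and two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> operator computes it: 1 - cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query: list[float], rows: dict[str, list[float]], k: int = 2) -> list[str]:
    """Rank stored embeddings against a query, nearest first, mirroring:
    SELECT id FROM chunks ORDER BY embedding <=> :query LIMIT :k;"""
    return sorted(rows, key=lambda rid: cosine_distance(query, rows[rid]))[:k]

rows = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.9, 0.1],
    "chunk-c": [0.0, 1.0],
}
print(top_k([1.0, 0.0], rows))  # chunk-a is closest, then chunk-b
```

The database does the same ranking with an approximate index instead of a full scan, which is what keeps search sub-second at scale.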
Optional Components
Redis:
- Purpose: Speed up repeat queries and maintain user sessions
- Benefits: Instant responses for recently asked questions
- When to use: High-traffic deployments where speed matters
Neo4j:
- Purpose: Understand relationships between entities in your documents
- Benefits: Answer questions about how concepts relate to each other
- When to use: When you need Graph RAG capabilities
Deployment Options
Production: Self-Hosted pgvector Docker (Recommended)
Production uses a self-hosted PostgreSQL with pgvector running in Docker. This provides complete data sovereignty and no external dependencies.
Why Self-Hosted pgvector Docker?
- Complete data sovereignty
- No external service dependencies
- Full control over configuration and backups
- pgvector extension pre-installed in Docker image
- No ongoing service costs
Setup Steps:
- Deploy pgvector Docker container (included in docker-compose)
- Two databases created automatically: scrapalot_backend and scrapalot
- Migrations run automatically on startup (Liquibase for Kotlin, Alembic for Python)
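A docker-compose service for this setup might look roughly like the fragment below. This is an illustrative sketch, not Scrapalot's actual compose file: the image tag, credentials, and init-script path are assumptions you should replace with your own.

```yaml
services:
  postgres:
    image: pgvector/pgvector:pg18        # Postgres image with pgvector pre-installed
    environment:
      POSTGRES_USER: scrapalot           # placeholder credentials; use your own
      POSTGRES_PASSWORD: change-me
    volumes:
      - pgdata:/var/lib/postgresql/data  # persist data across container restarts
      - ./init-db.sql:/docker-entrypoint-initdb.d/init.sql  # creates both databases
volumes:
  pgdata:
```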
Cross-Service Data Sync
Redis Streams SAGA ensures reliable cross-service data synchronization:
- Kotlin and Python each own specific tables (see Data Ownership below)
- Changes propagate via Redis Streams with consumer groups (XADD/XREADGROUP)
- SAGA pattern: the remote DB commits first, an ACK arrives on the saga_ack stream, then the local DB commits
- Pending-message recovery at startup guarantees delivery
- Replaces legacy Redis Pub/Sub for data sync operations
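The commit ordering above can be sketched with in-memory stand-ins for the Redis streams (the stream name saga_ack follows the text; the helper functions and table names are illustrative, not Scrapalot's API):

```python
# In-memory stand-ins for the two databases and the Redis streams.
remote_db, local_db = [], []
streams = {"data_sync": [], "saga_ack": []}

def xadd(stream: str, message: dict) -> None:
    """Minimal stand-in for Redis XADD: append a message to a stream."""
    streams[stream].append(message)

def sync_change(change: dict) -> None:
    """SAGA ordering: remote commit -> ACK on saga_ack -> local commit."""
    xadd("data_sync", change)               # publish the change to the other service
    remote_db.append(change)                # consumer commits to the remote DB first
    xadd("saga_ack", {"id": change["id"]})  # consumer acknowledges the remote commit
    if any(m["id"] == change["id"] for m in streams["saga_ack"]):
        local_db.append(change)             # only then does the local DB commit

sync_change({"id": 1, "table": "documents", "op": "insert"})
print(local_db == remote_db)  # both sides converged
```

If the ACK never arrives, the local commit is withheld and the pending message is replayed at startup, which is what makes the delivery guarantee hold.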
Alternative: Self-Hosted PostgreSQL
You can run PostgreSQL yourself without Docker for complete control:
Requirements:
- PostgreSQL 18 or newer
- pgvector extension installed
- Minimum 2GB RAM allocated per database
- Regular backup strategy
- Two databases: scrapalot_backend and scrapalot
When to use:
- Complete data sovereignty required
- Custom performance tuning needed
- Integration with existing infrastructure
Data Organization
Workspaces
What: Organizational containers for your content
Use for: Separating projects, teams, or clients
Features:
- Share with team members
- Set one as default
- Control access per workspace
Collections
What: Groups of related documents within a workspace
Use for: Organizing documents by topic, project, or purpose
Features:
- Choose embedding model per collection
- Configure chunking settings
- Add custom metadata
- Control search scope
Documents
What: Your uploaded files
Supports: PDFs, Word docs, text files, and more
Features:
- Automatic processing and indexing
- Deduplication (the same file uploaded twice is detected)
- Processing status tracking
- Custom metadata support
Document Chunks
What: Smart segments of your documents optimized for retrieval
Benefits:
- Precise answers from exact document sections
- Efficient semantic search
- Citation with page numbers
- Context-aware segmentation
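Chunking can be pictured as splitting text into overlapping windows so that context spans chunk boundaries. A minimal sketch; the sizes below are illustrative, not Scrapalot's defaults, and real segmentation is context-aware rather than fixed-width:

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks; each chunk overlaps the next by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Scrapalot stores each document as retrievable chunks for semantic search."
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks for this example string
```

Each chunk is embedded and indexed separately, which is why answers can cite the exact section (and page) they came from.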
Security & Access Control
Multi-Tenant Isolation
Automatic Protection:
- Your data is completely isolated from other users
- Database-level security (Row Level Security)
- Enforced in the database itself, so application code cannot bypass it
- Workspace-based access control
How it works:
- You only see your workspaces
- Your team members only see shared workspaces
- Access cascades automatically (workspace → collections → documents)
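Row Level Security of this kind is declared in Postgres DDL. The fragment below is an illustrative sketch only: the table, column, and setting names are assumptions, not Scrapalot's actual schema.

```sql
-- Hypothetical table/column names; the real schema may differ.
ALTER TABLE workspaces ENABLE ROW LEVEL SECURITY;

-- Each user sees only workspaces they own or that are shared with them.
CREATE POLICY workspace_isolation ON workspaces
  USING (
    owner_id = current_setting('app.current_user_id')::uuid
    OR id IN (SELECT workspace_id FROM workspace_shares
              WHERE user_id = current_setting('app.current_user_id')::uuid)
  );
```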
Sharing & Collaboration
Workspace Sharing:
- Share entire workspace with team members
- Control access levels (owner, editor, viewer)
- Shared users automatically see all collections in workspace
Access Levels:
- Owner: Full control, can delete and share
- Editor: Add/edit documents, cannot delete workspace
- Viewer: Read-only access
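The three levels form a simple ordering, so a permission check reduces to a comparison. A minimal sketch using the role names from the list above (the function and action names are hypothetical):

```python
from enum import IntEnum

class Role(IntEnum):
    VIEWER = 1  # read-only access
    EDITOR = 2  # add/edit documents
    OWNER = 3   # full control, can delete and share

def can(role: Role, action: str) -> bool:
    """Allow an action when the user's role meets the minimum role it requires."""
    required = {
        "read": Role.VIEWER,
        "edit": Role.EDITOR,
        "share": Role.OWNER,
        "delete_workspace": Role.OWNER,
    }
    return role >= required[action]

assert can(Role.EDITOR, "edit")
assert not can(Role.EDITOR, "delete_workspace")
```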
Data Protection
Backup & Recovery
Backups:
- Configure your own backup schedule for self-hosted deployments
- Store backups securely off-site
- Test restoration regularly
Encryption
Data in Transit:
- All connections use TLS/SSL encryption
- API calls encrypted automatically
- Database connections secured
Data at Rest:
- Database encryption (configurable per deployment)
- API keys encrypted in storage
- Secure credential management
Compliance
Access Logging:
- Track who accessed what data
- Audit trail for compliance
- Security event monitoring
Data Retention:
- Configure retention policies
- Automatic cleanup options
- Export capabilities for archival
Performance Features
Vector Search Optimization
Fast Similarity Search:
- Optimized indexes for semantic search
- Efficient nearest-neighbor algorithms
- Configurable precision/speed trade-off
Search Performance:
- Sub-second search across millions of chunks
- Parallel query execution
- Smart caching for common queries
Connection Management
Automatic Pooling:
- Efficient connection reuse
- Handles high concurrent users
- Automatic scaling within limits
Performance Monitoring:
- Track query performance
- Identify slow operations
- Optimize based on usage patterns
Migration & Updates
Automatic Migrations
Zero-Downtime Updates:
- Database schema updates run automatically on startup
- No manual intervention needed
- Rollback capability if issues occur
What gets migrated:
- New features and improvements
- Security patches
- Performance optimizations
- Bug fixes
Cross-Platform Support
Works Everywhere:
- Development: SQLite for local testing
- Production: PostgreSQL for scale and reliability
- Automatic adaptation to environment
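Environment adaptation like this is commonly done by reading a connection URL with a local fallback. A minimal sketch; the variable name DATABASE_URL and the fallback path are assumptions, not Scrapalot's configuration:

```python
import os

def database_url() -> str:
    """Use PostgreSQL when configured, fall back to SQLite for local development."""
    return os.environ.get("DATABASE_URL", "sqlite:///./scrapalot-dev.db")

url = database_url()
print("postgres" if url.startswith("postgresql") else "sqlite")
```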
Troubleshooting
Common Issues
Cannot Connect to Database:
- Verify connection parameters are correct
- Check if database is running/accessible
- Try a transaction pooler if direct connections fail
- Ensure firewall allows connections
Slow Queries:
- Check if indexes are built
- Review query complexity
- Consider enabling Redis cache
- Monitor database resource usage
Out of Storage:
- Review document retention policies
- Delete unused collections
- Optimize chunk sizes
- Check for duplicate uploads
Performance Degradation:
- Monitor connection pool usage
- Check for long-running queries
- Review index health
- Consider read replicas for scale
Best Practices
Document Management
Organize Effectively:
- Use workspaces to separate major projects
- Group related documents in collections
- Add meaningful metadata
- Use consistent naming conventions
Performance Optimization:
- Choose appropriate chunk sizes for your content
- Select embedding models based on accuracy vs. speed needs
- Archive old collections when no longer needed
- Monitor storage usage
Security
Access Control:
- Use workspace sharing for team collaboration
- Review access permissions regularly
- Remove inactive users
- Use viewer role when edit access not needed
Data Safety:
- Verify backups are working
- Test restoration procedures
- Keep connection credentials secure
- Use environment variables, never hardcode
Scaling
Growing Your Deployment:
- Start with self-hosted pgvector Docker
- Enable Redis as query volume increases
- Add Neo4j when graph features needed
- Consider read replicas for high traffic
Related Documentation
- RAG Strategy - How vector search powers retrieval
- Model Management - Choosing embedding models
- Security - Access control details
- Deployment Guide - Production setup
Scrapalot handles all database complexity automatically. Just upload your documents and start asking questions - the rest happens behind the scenes.