Data Storage & Organization
Scrapalot provides flexible, powerful data storage designed for optimal performance and security.
Storage Strategy
Scrapalot uses multiple specialized databases to deliver the best performance for different types of data:
Primary Database: PostgreSQL with pgvector
Purpose: Your documents, user data, and vector embeddings
Benefits:
- Reliable: ACID-compliant transactions protect your data against loss
- Fast Vector Search: Native support for semantic similarity search
- Scalable: Handles millions of documents efficiently
- Flexible: JSON support for custom metadata
Recommended for: All production deployments
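To make the vector search concrete, the sketch below runs a cosine-distance query against pgvector from Python. The table and column names (document_chunks, content, embedding) and the connection string are placeholders for illustration, not Scrapalot's actual schema.

```python
import psycopg2

# Example only: table/column names and credentials are placeholders, not Scrapalot's schema.
conn = psycopg2.connect("postgresql://user:password@localhost:5432/scrapalot")

def top_chunks(query_embedding, k=5):
    """Return the k chunks closest to the query embedding (cosine distance)."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, embedding <=> %s::vector AS distance
            FROM document_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vector_literal, vector_literal, k),
        )
        return cur.fetchall()
```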
Optional Components
Redis:
- Purpose: Speed up repeat queries and maintain user sessions
- Benefits: Instant responses for recently asked questions
- When to use: High-traffic deployments where speed matters
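As a sketch of what this caching looks like in practice, using the standard redis-py client; the key prefix and TTL are illustrative, not Scrapalot's internals:

```python
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_answer(question: str, compute_answer) -> str:
    """Serve a previously computed answer if the same question was asked recently."""
    key = "qa:" + hashlib.sha256(question.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit                      # instant response for a repeat query
    answer = compute_answer(question)   # fall back to the full RAG pipeline
    r.setex(key, 3600, answer)          # cache for one hour
    return answer
```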
Neo4j:
- Purpose: Understand relationships between entities in your documents
- Benefits: Answer questions about how concepts relate to each other
- When to use: When you need Graph RAG capabilities
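For a feel of the graph side, here is a minimal sketch with the official neo4j Python driver; the Entity label and RELATED_TO relationship are hypothetical examples, not Scrapalot's graph schema.

```python
from neo4j import GraphDatabase

# Labels and relationship types below are illustrative only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def related_concepts(entity_name: str):
    """Find concepts directly related to a named entity extracted from documents."""
    with driver.session() as session:
        result = session.run(
            "MATCH (e:Entity {name: $name})-[r:RELATED_TO]->(c:Entity) "
            "RETURN c.name AS concept, type(r) AS relation",
            name=entity_name,
        )
        return [record.data() for record in result]
```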
SQLite:
- Purpose: Simple local development and testing
- Benefits: Zero configuration needed
- When to use: Development only, not for production
Deployment Options
Quick Start: Supabase (Recommended)
Supabase provides a fully-managed PostgreSQL database with pgvector already configured.
Why Supabase?
- Free tier available for getting started
- Automatic backups included
- Built-in connection pooling
- pgvector extension ready to use
- No database maintenance required
Setup Steps:
1. Create an account at supabase.com
2. Create a new project (choose a region close to your users)
3. Enable the pgvector extension in Database → Extensions
4. Copy the connection details to your environment configuration
5. Start Scrapalot; migrations run automatically
Connection Options:
Option A: Transaction Pooler (Recommended)
- More reliable connection handling
- Works even when project is paused
- Slight connection overhead
Option B: Direct Connection
- Lower latency
- Direct database access
- May not work if project pauses
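For reference, the two styles typically differ only in host and port. The strings below are illustrative placeholders; copy the exact values from your Supabase project dashboard, and note that the DATABASE_URL variable name here is an assumption - check your Scrapalot configuration for the exact key it expects.

```python
import os

# Illustrative formats only - take the real strings from your Supabase project dashboard.
POOLER_URL = "postgresql://postgres.<project-ref>:<password>@<pooler-host>:6543/postgres"
DIRECT_URL = "postgresql://postgres:<password>@db.<project-ref>.supabase.co:5432/postgres"

# Variable name is an assumption; use whatever key your Scrapalot configuration expects.
os.environ.setdefault("DATABASE_URL", POOLER_URL)
```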
Self-Hosted PostgreSQL
You can run PostgreSQL yourself for complete control:
Requirements:
- PostgreSQL 16 or newer
- pgvector extension installed
- Minimum 2GB RAM allocated
- Regular backup strategy
When to use:
- Complete data sovereignty required
- Custom performance tuning needed
- Integration with existing infrastructure
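One setup step worth doing by hand on a self-hosted server is enabling the pgvector extension, since installing extensions usually requires elevated privileges. A minimal sketch, assuming a superuser (or extension-privileged) role and example credentials:

```python
import psycopg2

# Run as a superuser, or a role allowed to create extensions, on your own server.
conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/scrapalot")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector version:", cur.fetchone()[0])
```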
Data Organization
Workspaces
What: Organizational containers for your content
Use for: Separating projects, teams, or clients
Features:
- Share with team members
- Set one as default
- Control access per workspace
Collections
What: Groups of related documents within a workspace
Use for: Organizing documents by topic, project, or purpose
Features:
- Choose embedding model per collection
- Configure chunking settings
- Add custom metadata
- Control search scope
Documents
What: Your uploaded files
Support: PDFs, Word docs, text files, and more
Features:
- Automatic processing and indexing
- Deduplication (re-uploads of the same file are detected)
- Processing status tracking
- Custom metadata support
Document Chunks
What: Smart segments of your documents optimized for retrieval
Benefits:
- Precise answers from exact document sections
- Efficient semantic search
- Citation with page numbers
- Context-aware segmentation
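Scrapalot's segmentation is context-aware, but the underlying idea can be sketched with a simple overlapping splitter; this illustrates the concept rather than the exact algorithm used:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into overlapping windows so no answer is cut off at a boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```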
Security & Access Control
Multi-Tenant Isolation
Automatic Protection:
- Your data is completely isolated from that of other users
- Database-level security (Row Level Security)
- Enforced inside the database itself, so it cannot be bypassed by application code
- Workspace-based access control
How it works:
- You only see your workspaces
- Your team members only see shared workspaces
- Access cascades automatically (workspace → collections → documents)
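To show what database-level security means concretely, this is the general shape of a PostgreSQL Row Level Security policy. The table and column names are placeholders; Scrapalot's migrations define the real policies.

```python
# Illustrative RLS pattern only - the real policies are created by Scrapalot's migrations.
RLS_EXAMPLE_SQL = """
ALTER TABLE workspaces ENABLE ROW LEVEL SECURITY;

CREATE POLICY workspace_owner_only ON workspaces
    USING (owner_id = current_setting('app.current_user_id')::uuid);
"""
```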
Sharing & Collaboration
Workspace Sharing:
- Share entire workspace with team members
- Control access levels (owner, editor, viewer)
- Shared users automatically see all collections in the workspace
Access Levels:
- Owner: Full control, including deleting and sharing the workspace
- Editor: Can add and edit documents, but cannot delete the workspace
- Viewer: Read-only access
Data Protection
Backup & Recovery
Automatic Backups (Supabase):
- Daily automatic backups included
- Point-in-time recovery available
- Testing recovery procedures is recommended
Self-Hosted:
- Configure your own backup schedule
- Store backups securely off-site
- Test restoration regularly
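A nightly dump is a reasonable starting point for self-hosted databases. Here is a minimal sketch that shells out to pg_dump; credentials, paths, and schedule are examples to adapt.

```python
import datetime
import subprocess

# Example backup script - adjust connection details, paths, and retention to your setup.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
subprocess.run(
    [
        "pg_dump",
        "--format=custom",               # compressed, restorable with pg_restore
        "--file", f"/backups/scrapalot-{stamp}.dump",
        "postgresql://scrapalot:password@localhost:5432/scrapalot",
    ],
    check=True,
)
```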
Encryption
Data in Transit:
- All connections use TLS/SSL encryption
- API calls encrypted automatically
- Database connections secured
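To enforce TLS explicitly rather than rely on server defaults, append an sslmode parameter to the connection string; this is standard libpq behaviour, not Scrapalot-specific, and the credentials below are examples.

```python
import psycopg2

# 'require' enforces TLS; 'verify-full' additionally validates the server certificate.
conn = psycopg2.connect(
    "postgresql://scrapalot:password@db.example.com:5432/scrapalot?sslmode=require"
)
```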
Data at Rest:
- Database encryption (Supabase provides this)
- API keys encrypted in storage
- Secure credential management
Compliance
Access Logging:
- Track who accessed what data
- Audit trail for compliance
- Security event monitoring
Data Retention:
- Configure retention policies
- Automatic cleanup options
- Export capabilities for archival
Performance Features
Vector Search Optimization
Fast Similarity Search:
- Optimized indexes for semantic search
- Efficient nearest-neighbor algorithms
- Configurable precision/speed trade-off
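That precision/speed trade-off typically lives in the approximate-nearest-neighbour index. As an illustration, creating an HNSW index with pgvector looks like this; the table name is a placeholder and Scrapalot's migrations may already create an equivalent index.

```python
import psycopg2

# Placeholder table/column names; tune m and ef_construction for recall vs. build time.
conn = psycopg2.connect("postgresql://scrapalot:password@localhost:5432/scrapalot")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        """
        CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx
        ON document_chunks
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
        """
    )
```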
Search Performance:
- Sub-second search across millions of chunks
- Parallel query execution
- Smart caching for common queries
Connection Management
Automatic Pooling:
- Efficient connection reuse
- Handles high concurrent users
- Automatic scaling within the configured pool limits
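If you connect your own scripts or integrations to the same database, the equivalent pooling pattern with SQLAlchemy might look like this; the pool sizes are illustrative, so tune them to your provider's connection limits.

```python
from sqlalchemy import create_engine, text

# Illustrative pool sizes - tune to your provider's connection limits.
engine = create_engine(
    "postgresql+psycopg2://scrapalot:password@localhost:5432/scrapalot",
    pool_size=5,          # connections kept open
    max_overflow=10,      # extra connections allowed under burst load
    pool_pre_ping=True,   # drop dead connections before reuse
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```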
Performance Monitoring:
- Track query performance
- Identify slow operations
- Optimize based on usage patterns
Migration & Updates
Automatic Migrations
Zero-Downtime Updates:
- Database schema updates run automatically on startup
- No manual intervention needed
- Rollback capability if issues occur
What gets migrated:
- New features and improvements
- Security patches
- Performance optimizations
- Bug fixes
Cross-Platform Support
Works Everywhere:
- Development: SQLite for local testing
- Production: PostgreSQL for scale and reliability
- Automatic adaptation to environment
Troubleshooting
Common Issues
Cannot Connect to Database:
- Verify connection parameters are correct
- Check if database is running/accessible
- Try transaction pooler if direct connection fails
- Ensure firewall allows connections
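A bare connection test helps separate "database unreachable" from "application misconfigured"; the sketch below needs only psycopg2 and your connection string.

```python
import sys
import psycopg2

# Usage: python check_db.py "postgresql://user:password@host:5432/dbname"
try:
    conn = psycopg2.connect(sys.argv[1], connect_timeout=5)
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print("Connected:", cur.fetchone()[0])
except Exception as exc:
    print("Connection failed:", exc)
```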
Slow Queries:
- Check if indexes are built
- Review query complexity
- Consider enabling Redis cache
- Monitor database resource usage
Out of Storage:
- Review document retention policies
- Delete unused collections
- Optimize chunk sizes
- Check for duplicate uploads
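Before deleting anything, it helps to see where the space is actually going. This size query uses standard PostgreSQL catalogs and works on Supabase and self-hosted databases alike; the connection string is an example.

```python
import psycopg2

conn = psycopg2.connect("postgresql://scrapalot:password@localhost:5432/scrapalot")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname AS table_name,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_catalog.pg_statio_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 10;
        """
    )
    for name, size in cur.fetchall():
        print(f"{name}: {size}")
```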
Performance Degradation:
- Monitor connection pool usage
- Check for long-running queries
- Review index health
- Consider read replicas for scale
Best Practices
Document Management
Organize Effectively:
- Use workspaces to separate major projects
- Group related documents in collections
- Add meaningful metadata
- Use consistent naming conventions
Performance Optimization:
- Choose appropriate chunk sizes for your content
- Select embedding models based on accuracy vs. speed needs
- Archive old collections when no longer needed
- Monitor storage usage
Security
Access Control:
- Use workspace sharing for team collaboration
- Review access permissions regularly
- Remove inactive users
- Use viewer role when edit access not needed
Data Safety:
- Verify backups are working
- Test restoration procedures
- Keep connection credentials secure
- Store credentials in environment variables; never hardcode them
Scaling
Growing Your Deployment:
- Start with Supabase free tier
- Enable Redis as query volume increases
- Add Neo4j when graph features needed
- Consider read replicas for high traffic
Related Documentation
- RAG Strategy - How vector search powers retrieval
- Model Management - Choosing embedding models
- Security - Access control details
- Deployment Guide - Production setup
Scrapalot handles all database complexity automatically. Just upload your documents and start asking questions - the rest happens behind the scenes.