Data Storage & Organization

Scrapalot provides flexible, powerful data storage designed for optimal performance and security.

Storage Strategy

Scrapalot uses multiple specialized databases to deliver the best performance for different types of data:

Primary Database: PostgreSQL with pgvector

Purpose: Your documents, user data, and vector embeddings

Benefits:

  • Reliable: ACID-compliant transactions protect committed data from partial writes and crashes
  • Fast Vector Search: Native support for semantic similarity search
  • Scalable: Handles millions of documents efficiently
  • Flexible: JSON support for custom metadata

Recommended for: All production deployments
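
To make "semantic similarity search" concrete: it ranks stored embeddings by their distance to a query embedding. pgvector exposes cosine distance through its `<=>` operator. The sketch below reproduces that ranking in plain Python on toy three-dimensional vectors. The function names and data are illustrative assumptions, not Scrapalot's code.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (what pgvector's <=> computes)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_chunks(query, chunks, k=2):
    """Return the ids of the k chunks closest to the query embedding."""
    ranked = sorted(chunks, key=lambda item: cosine_distance(query, item[1]))
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy data: (chunk id, embedding). Real embeddings have hundreds of dimensions.
CHUNKS = [
    ("intro", [1.0, 0.0, 0.0]),
    ("pricing", [0.0, 1.0, 0.0]),
    ("overview", [0.9, 0.1, 0.0]),
]
```

In PostgreSQL the same ranking would be an `ORDER BY embedding <=> query LIMIT k` clause over a vector column; the table and column names depend on the actual schema.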

Optional Components

Redis:

  • Purpose: Speed up repeat queries and maintain user sessions
  • Benefits: Instant responses for recently asked questions
  • When to use: High-traffic deployments where speed matters

Neo4j:

  • Purpose: Understand relationships between entities in your documents
  • Benefits: Answer questions about how concepts relate to each other
  • When to use: When you need Graph RAG capabilities

SQLite:

  • Purpose: Simple local development and testing
  • Benefits: Zero configuration needed
  • When to use: Development only, not for production
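
The Redis use case above (fast answers to repeated questions) is commonly built as a cache-aside lookup with an expiry time. The following minimal sketch uses an in-memory dictionary as a stand-in for Redis so that it runs without a server. `TTLCache` and `cached_answer` are hypothetical names, and Scrapalot's actual caching logic may differ.

```python
import time

class TTLCache:
    """In-memory stand-in for Redis GET / SET-with-expiry semantics."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)

def cached_answer(question, cache, compute):
    """Cache-aside: return a cached answer, or compute and store it."""
    key = question.strip().lower()  # normalize so trivial variants share an entry
    answer = cache.get(key)
    if answer is None:
        answer = compute(question)
        cache.set(key, answer)
    return answer
```

The expiry keeps cached answers from outliving changes to the underlying documents; the right TTL depends on how often your content changes.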

Deployment Options

Managed PostgreSQL (Supabase)

Supabase provides a fully managed PostgreSQL database with the pgvector extension available out of the box.

Why Supabase?

  • Free tier available for getting started
  • Automatic backups (availability depends on your Supabase plan)
  • Built-in connection pooling
  • pgvector extension ready to use
  • No database maintenance required

Setup Steps:

  1. Create account at supabase.com
  2. Create new project (choose region close to your users)
  3. Enable pgvector extension in Database → Extensions
  4. Copy connection details to your environment configuration
  5. Start Scrapalot - migrations run automatically
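
Step 4 usually means turning the copied connection details into a single PostgreSQL URL. The sketch below shows one way to do that, assuming the details are stored in environment variables. The variable names (`POSTGRES_USER` and so on) are hypothetical; check Scrapalot's configuration reference for the names it actually reads.

```python
import os
from urllib.parse import quote

def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    user = env["POSTGRES_USER"]
    password = env["POSTGRES_PASSWORD"]
    host = env["POSTGRES_HOST"]
    port = env.get("POSTGRES_PORT", "5432")
    database = env.get("POSTGRES_DB", "postgres")
    # Percent-encode credentials so characters like '@' or '/' stay valid in a URL.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )
```

Step 3 can also be done from a SQL console with `CREATE EXTENSION IF NOT EXISTS vector;`, which is the standard way to enable pgvector.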

Connection Options:

Option A: Transaction Pooler (Recommended)

  • More reliable connection handling
  • Works even when project is paused
  • Slight connection overhead

Option B: Direct Connection

  • Lower latency
  • Direct database access
  • May not work if project pauses
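
If you want to try one connection option and fall back to the other, the pattern can be sketched as below. `connect_with_fallback` is a hypothetical helper, and `connect` stands for whatever function your database driver provides. A real driver may raise its own exception type instead of `OSError`, so adjust the `except` clause accordingly.

```python
def connect_with_fallback(candidates, connect):
    """Try each (label, url) pair in order; return the first that connects.

    `connect` is any callable that opens a connection and raises OSError on
    failure (a real driver's connect function may raise its own error type).
    """
    errors = []
    for label, url in candidates:
        try:
            return label, connect(url)
        except OSError as exc:
            errors.append(f"{label}: {exc}")
    raise ConnectionError("all connection options failed: " + "; ".join(errors))
```

Put the option you prefer first in `candidates`. The returned label tells you which option ended up being used, which is worth logging.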

Self-Hosted PostgreSQL

You can run PostgreSQL yourself for complete control:

Requirements:

  • PostgreSQL 16 or newer
  • pgvector extension installed
  • Minimum 2GB RAM allocated
  • Regular backup strategy
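
The first two requirements can be checked programmatically before pointing Scrapalot at a server. The sketch below assumes you have already fetched two values with your own client: the integer from `SHOW server_version_num` and the extension names from `SELECT extname FROM pg_extension`. `check_requirements` is a hypothetical helper.

```python
def check_requirements(server_version_num, installed_extensions):
    """Return a list of unmet requirements (empty when the server qualifies).

    server_version_num: integer from `SHOW server_version_num` (160002 = 16.2).
    installed_extensions: names from `SELECT extname FROM pg_extension`.
    """
    problems = []
    major = server_version_num // 10000
    if major < 16:
        problems.append(f"PostgreSQL 16 or newer required, found major version {major}")
    # pgvector registers itself under the extension name "vector".
    if "vector" not in installed_extensions:
        problems.append("pgvector extension ('vector') is not installed")
    return problems
```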

When to use:

  • Complete data sovereignty required
  • Custom performance tuning needed
  • Integration with existing infrastructure

Data Organization

Workspaces

What: Organizational containers for your content

Use for: Separating projects, teams, or clients

Features:

  • Share with team members
  • Set one as default
  • Control access per workspace

Collections

What: Groups of related documents within a workspace

Use for: Organizing documents by topic, project, or purpose

Features:

  • Choose embedding model per collection
  • Configure chunking settings
  • Add custom metadata
  • Control search scope

Documents

What: Your uploaded files

Support: PDFs, Word docs, text files, and more

Features:

  • Automatic processing and indexing
  • Deduplication (uploading the same file twice is detected)
  • Processing status tracking
  • Custom metadata support
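
The page does not say how duplicates are detected. A common approach, shown here purely as an assumption, is to fingerprint the file's bytes with SHA-256 and look the digest up before storing. `UploadRegistry` is a hypothetical in-memory helper; a real system would keep the fingerprints in the database.

```python
import hashlib

def content_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()

class UploadRegistry:
    """Tracks fingerprints of files already stored (hypothetical helper)."""

    def __init__(self):
        self._by_fingerprint = {}

    def register(self, filename: str, data: bytes):
        """Return (document_name, is_duplicate) for an upload."""
        fingerprint = content_fingerprint(data)
        existing = self._by_fingerprint.get(fingerprint)
        if existing is not None:
            return existing, True  # same bytes seen before, even under a new name
        self._by_fingerprint[fingerprint] = filename
        return filename, False
```

Hashing content rather than comparing file names catches renamed copies, but it treats any byte-level change (for example, a re-exported PDF) as a new document.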

Document Chunks

What: Smart segments of your documents optimized for retrieval

Benefits:

  • Precise answers from exact document sections
  • Efficient semantic search
  • Citation with page numbers
  • Context-aware segmentation
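
As a rough illustration of how chunks can carry page numbers for citations, here is the simplest possible chunker: fixed-size character windows with overlap, cut page by page. This is a baseline sketch, not the context-aware segmentation described above. The function name and default sizes are assumptions.

```python
def chunk_pages(pages, chunk_size=200, overlap=50):
    """Split page texts into overlapping character windows.

    pages: list of strings, one per page (page numbers start at 1).
    Returns dicts with the chunk text and the page it came from.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        for start in range(0, len(text), step):
            chunks.append({"page": page_number, "text": text[start:start + chunk_size]})
            if start + chunk_size >= len(text):
                break  # the window already reached the end of the page
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.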

Security & Access Control

Multi-Tenant Isolation

Automatic Protection:

  • Your data is completely isolated from other users
  • Database-level security (Row Level Security)
  • Enforced by the database itself, so application code cannot skip the check
  • Workspace-based access control

How it works:

  1. You only see your workspaces
  2. Your team members only see shared workspaces
  3. Access cascades automatically (workspace → collections → documents)
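
The cascade in step 3 can be pictured as walking from any resource up to its workspace and checking membership there. The classes below are a hypothetical in-memory model for illustration only. In Scrapalot the check is enforced by Row Level Security policies in the database, whose definitions are not shown on this page.

```python
from dataclasses import dataclass, field

@dataclass
class Workspace:
    owner: str
    shared_with: set = field(default_factory=set)

@dataclass
class Collection:
    workspace: Workspace

@dataclass
class Document:
    collection: Collection

def can_view(user: str, resource) -> bool:
    """Walk up to the owning workspace and check membership there."""
    if isinstance(resource, Document):
        resource = resource.collection
    if isinstance(resource, Collection):
        resource = resource.workspace
    return user == resource.owner or user in resource.shared_with
```

Because every check resolves to the workspace, sharing a workspace is enough to expose its collections and documents; no per-document grants are needed.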

Sharing & Collaboration

Workspace Sharing:

  • Share entire workspace with team members
  • Control access levels (owner, editor, viewer)
  • Shared users automatically see all collections in workspace

Access Levels:

  • Owner: Full control, can delete and share
  • Editor: Add/edit documents, cannot delete workspace
  • Viewer: Read-only access
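
The three access levels map naturally onto a role-to-permissions table. In the sketch below the role names come from the list above, while the action names (`edit_documents`, `share_workspace`, and so on) are illustrative assumptions.

```python
from enum import Enum

class Role(Enum):
    OWNER = "owner"
    EDITOR = "editor"
    VIEWER = "viewer"

# Action names are illustrative; the role descriptions come from the list above.
PERMISSIONS = {
    Role.OWNER: {"read", "edit_documents", "share_workspace", "delete_workspace"},
    Role.EDITOR: {"read", "edit_documents"},
    Role.VIEWER: {"read"},
}

def is_allowed(role: Role, action: str) -> bool:
    """True when the role's permission set contains the action."""
    return action in PERMISSIONS[role]
```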

Data Protection

Backup & Recovery

Automatic Backups (Supabase):

  • Daily automatic backups (availability depends on your Supabase plan)
  • Point-in-time recovery available (may require a paid plan or add-on)
  • Testing recovery procedures is recommended

Self-Hosted:

  • Configure your own backup schedule
  • Store backups securely off-site
  • Test restoration regularly
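
For self-hosted deployments, a scheduled `pg_dump` is one common way to implement the backup schedule. The sketch below only builds the command line, so it can be inspected or tested without a database. The `scrapalot-` file prefix and the helper name are assumptions.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def pg_dump_command(database_url, backup_dir, now=None):
    """Build (but do not run) a pg_dump invocation with a timestamped file name."""
    now = now or datetime.now(timezone.utc)
    filename = f"scrapalot-{now:%Y%m%dT%H%M%SZ}.dump"
    target = PurePosixPath(backup_dir) / filename
    return [
        "pg_dump",
        "--format=custom",  # compressed archive, restorable with pg_restore
        f"--file={target}",
        f"--dbname={database_url}",
    ]
```

To run it, pass the list to `subprocess.run(cmd, check=True)` from a scheduler such as cron. Restoring a dump into a scratch database with `pg_restore` is a direct way to test restoration.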

Encryption

Data in Transit:

  • All connections use TLS/SSL encryption
  • API calls encrypted automatically
  • Database connections secured
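
On the client side, PostgreSQL's `sslmode` connection parameter controls TLS. A value of `require` refuses unencrypted connections, and `verify-full` additionally checks the server certificate and host name. The helper below is a hypothetical convenience for adding that parameter to a connection URL; whether Scrapalot sets it for you is not stated here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def require_tls(database_url, mode="require"):
    """Return the URL with sslmode set, replacing any existing value."""
    parts = urlsplit(database_url)
    # Keep every other query parameter, drop any previous sslmode.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "sslmode"]
    query.append(("sslmode", mode))
    return urlunsplit(parts._replace(query=urlencode(query)))
```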

Data at Rest:

  • Database encryption (Supabase provides this)
  • API keys encrypted in storage
  • Secure credential management

Compliance

Access Logging:

  • Track who accessed what data
  • Audit trail for compliance
  • Security event monitoring

Data Retention:

  • Configure retention policies
  • Automatic cleanup options
  • Export capabilities for archival

Performance Features

Vector Search Optimization

Fast Similarity Search:

  • Optimized indexes for semantic search
  • Efficient nearest-neighbor algorithms
  • Configurable precision/speed trade-off
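
With pgvector, the precision/speed trade-off is typically exposed through HNSW index parameters: `m` and `ef_construction` at build time, and the `hnsw.ef_search` setting at query time. The sketch below generates the corresponding SQL. The table and column names are placeholders, and the page does not confirm which index type Scrapalot creates.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64):
    """CREATE INDEX statement for a pgvector HNSW index using cosine distance.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw_idx "
        f"ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )

def ef_search_sql(ef_search=40):
    """Session setting: higher values improve recall at the cost of speed."""
    return f"SET hnsw.ef_search = {int(ef_search)};"
```

The defaults shown (16, 64, and 40) are pgvector's own defaults. Raising them generally improves recall while making index builds or queries slower.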

Search Performance:

  • Sub-second search across millions of chunks once vector indexes are built
  • Parallel query execution
  • Smart caching for common queries

Connection Management

Automatic Pooling:

  • Efficient connection reuse
  • Handles many concurrent users
  • Automatic scaling within limits

Performance Monitoring:

  • Track query performance
  • Identify slow operations
  • Optimize based on usage patterns

Migration & Updates

Automatic Migrations

Zero-Downtime Updates:

  • Database schema updates run automatically on startup
  • No manual intervention needed
  • Rollback capability if issues occur

What gets migrated:

  • New features and improvements
  • Security patches
  • Performance optimizations
  • Bug fixes
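
Running migrations automatically on startup generally follows one pattern: record which versions have been applied, then apply the rest in order. The sketch below shows that pattern with Python's built-in `sqlite3` module so it runs anywhere. The migration list and table names are hypothetical, and Scrapalot's own migration tooling is not described on this page.

```python
import sqlite3

# Hypothetical migrations as (version, SQL) pairs.
MIGRATIONS = [
    (1, "CREATE TABLE workspaces (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE workspaces ADD COLUMN is_default INTEGER NOT NULL DEFAULT 0"),
]

def run_migrations(conn, migrations=MIGRATIONS):
    """Apply pending migrations in order; return the versions applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    applied = []
    for version, sql in sorted(migrations):
        if version in done:
            continue  # already applied on an earlier startup
        with conn:  # commits on success; rolls back open changes on error
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        applied.append(version)
    return applied
```

Calling `run_migrations` a second time applies nothing, which is what makes it safe to run on every startup.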

Cross-Platform Support

Works Everywhere:

  • Development: SQLite for local testing
  • Production: PostgreSQL for scale and reliability
  • Automatic adaptation to environment

Troubleshooting

Common Issues

Cannot Connect to Database:

  • Verify connection parameters are correct
  • Check if database is running/accessible
  • Try the transaction pooler if the direct connection fails
  • Ensure firewall allows connections
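
A quick way to separate network problems (firewall, wrong host or port) from credential problems is to test plain TCP reachability first. The helper below is a hypothetical diagnostic that uses only the standard library. It does not speak the PostgreSQL protocol, so a `True` result only means the port is open.

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

If this returns `False` for your database host, check the firewall and host name before looking at passwords or connection strings.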

Slow Queries:

  • Check if indexes are built
  • Review query complexity
  • Consider enabling Redis cache
  • Monitor database resource usage

Out of Storage:

  • Review document retention policies
  • Delete unused collections
  • Optimize chunk sizes
  • Check for duplicate uploads

Performance Degradation:

  • Monitor connection pool usage
  • Check for long-running queries
  • Review index health
  • Consider read replicas for scale

Best Practices

Document Management

Organize Effectively:

  • Use workspaces to separate major projects
  • Group related documents in collections
  • Add meaningful metadata
  • Use consistent naming conventions

Performance Optimization:

  • Choose appropriate chunk sizes for your content
  • Select embedding models based on accuracy vs. speed needs
  • Archive old collections when no longer needed
  • Monitor storage usage

Security

Access Control:

  • Use workspace sharing for team collaboration
  • Review access permissions regularly
  • Remove inactive users
  • Use the viewer role when edit access is not needed

Data Safety:

  • Verify backups are working
  • Test restoration procedures
  • Keep connection credentials secure
  • Use environment variables; never hardcode credentials

Scaling

Growing Your Deployment:

  • Start with Supabase free tier
  • Enable Redis as query volume increases
  • Add Neo4j when graph features are needed
  • Consider read replicas for high traffic

Scrapalot handles all database complexity automatically. Just upload your documents and start asking questions - the rest happens behind the scenes.

Released under the MIT License.