Data Storage & Organization
Last Updated: March 2026
Scrapalot provides flexible, high-performance data storage with a dual-database architecture that separates user data from AI data.
Storage Strategy
Scrapalot uses multiple specialized databases to deliver the best performance for different types of data:
Primary Database: PostgreSQL with pgvector
Two-Database Architecture:
- scrapalot_backend (Kotlin backend): user data, auth, workspaces, collections, sessions, messages
- scrapalot (Python chat): AI data, document content, chunks, embeddings, research plans
Purpose: Separation of concerns for scalability and security
Benefits:
- Reliable: ACID-compliant transactions ensure your data is never lost
- Fast Vector Search: Native support for semantic similarity search
- Scalable: Handles millions of documents efficiently
- Flexible: JSON support for custom metadata
Recommended for: All production deployments
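Under the hood, pgvector ranks chunks by vector distance. A minimal Python sketch of the cosine-distance ordering behind its <=> operator (the chunk IDs and two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> operator computes it: 1 - cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query: list[float], rows: dict[str, list[float]], k: int = 2) -> list[str]:
    """Rank stored embeddings against a query, nearest first, mirroring:
    SELECT id FROM chunks ORDER BY embedding <=> :query LIMIT :k;"""
    return sorted(rows, key=lambda rid: cosine_distance(query, rows[rid]))[:k]

rows = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.9, 0.1],
    "chunk-c": [0.0, 1.0],
}
print(top_k([1.0, 0.0], rows))  # chunk-a is closest, then chunk-b
```

The database does the same ranking with an approximate index instead of a full scan, which is what keeps search sub-second at scale.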
Optional Components
Redis:
- Purpose: Speed up repeat queries and maintain user sessions
- Benefits: Instant responses for recently asked questions
- When to use: High-traffic deployments where speed matters
Neo4j:
- Purpose: Understand relationships between entities in your documents
- Benefits: Answer questions about how concepts relate to each other
- When to use: When you need Graph RAG capabilities
Deployment Options
Production: Self-Hosted pgvector Docker (Recommended)
Production uses a self-hosted PostgreSQL with pgvector running in Docker. This provides complete data sovereignty and no external dependencies.
Why Self-Hosted pgvector Docker?
- Complete data sovereignty
- No external service dependencies
- Full control over configuration and backups
- pgvector extension pre-installed in Docker image
- No ongoing service costs
Setup Steps:
- Deploy pgvector Docker container (included in docker-compose)
- Two databases created automatically: scrapalot_backend and scrapalot
- Migrations run automatically on startup (Liquibase for Kotlin, Alembic for Python)
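A docker-compose service for this setup might look roughly like the fragment below. This is an illustrative sketch, not Scrapalot's actual compose file: the image tag, credentials, and init-script path are assumptions you should replace with your own.

```yaml
services:
  postgres:
    image: pgvector/pgvector:pg18        # Postgres image with pgvector pre-installed
    environment:
      POSTGRES_USER: scrapalot           # placeholder credentials; use your own
      POSTGRES_PASSWORD: change-me
    volumes:
      - pgdata:/var/lib/postgresql/data  # persist data across container restarts
      - ./init-db.sql:/docker-entrypoint-initdb.d/init.sql  # creates both databases
volumes:
  pgdata:
```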
Cross-Service Data Sync
Redis Streams SAGA ensures reliable cross-service data synchronization:
- Kotlin and Python each own specific tables (see Data Ownership below)
- Changes propagate via Redis Streams with consumer groups (XADD/XREADGROUP)
- SAGA pattern: the remote DB commits first, an ACK arrives on the saga_ack stream, then the local DB commits
- Pending-message recovery at startup guarantees delivery
- Replaces legacy Redis Pub/Sub for data sync operations
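The commit ordering above can be sketched with in-memory stand-ins for the Redis streams (the stream name saga_ack follows the text; the helper functions and table names are illustrative, not Scrapalot's API):

```python
# In-memory stand-ins for the two databases and the Redis streams.
remote_db, local_db = [], []
streams = {"data_sync": [], "saga_ack": []}

def xadd(stream: str, message: dict) -> None:
    """Minimal stand-in for Redis XADD: append a message to a stream."""
    streams[stream].append(message)

def sync_change(change: dict) -> None:
    """SAGA ordering: remote commit -> ACK on saga_ack -> local commit."""
    xadd("data_sync", change)               # publish the change to the other service
    remote_db.append(change)                # consumer commits to the remote DB first
    xadd("saga_ack", {"id": change["id"]})  # consumer acknowledges the remote commit
    if any(m["id"] == change["id"] for m in streams["saga_ack"]):
        local_db.append(change)             # only then does the local DB commit

sync_change({"id": 1, "table": "documents", "op": "insert"})
print(local_db == remote_db)  # both sides converged
```

If the ACK never arrives, the local commit is withheld and the pending message is replayed at startup, which is what makes the delivery guarantee hold.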
Alternative: Self-Hosted PostgreSQL
You can run PostgreSQL yourself without Docker for complete control:
Requirements:
- PostgreSQL 18 or newer
- pgvector extension installed
- Minimum 2GB RAM allocated per database
- Regular backup strategy
- Two databases: scrapalot_backend and scrapalot
When to use:
- Complete data sovereignty required
- Custom performance tuning needed
- Integration with existing infrastructure
Data Organization
Workspaces
What: Organizational containers for your content
Use for: Separating projects, teams, or clients
Features:
- Share with team members
- Set one as default
- Control access per workspace
Collections
What: Groups of related documents within a workspace
Use for: Organizing documents by topic, project, or purpose
Features:
- Choose embedding model per collection
- Configure chunking settings
- Add custom metadata
- Control search scope
Documents
What: Your uploaded files
Supports: PDFs, Word docs, text files, and more
Features:
- Automatic processing and indexing
- Deduplication (the same file uploaded twice is detected)
- Processing status tracking
- Custom metadata support
Document Chunks
What: Smart segments of your documents optimized for retrieval
Benefits:
- Precise answers from exact document sections
- Efficient semantic search
- Citation with page numbers
- Context-aware segmentation
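Chunking can be pictured as splitting text into overlapping windows so that context spans chunk boundaries. A minimal sketch; the sizes below are illustrative, not Scrapalot's defaults, and real segmentation is context-aware rather than fixed-width:

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks; each chunk overlaps the next by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Scrapalot stores each document as retrievable chunks for semantic search."
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks for this example string
```

Each chunk is embedded and indexed separately, which is why answers can cite the exact section (and page) they came from.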
Security & Access Control
Multi-Tenant Isolation
Automatic Protection:
- Your data is completely isolated from other users
- Database-level security (Row Level Security)
- Enforced in the database itself, so application code cannot bypass it
- Workspace-based access control
How it works:
- You only see your workspaces
- Your team members only see shared workspaces
- Access cascades automatically (workspace → collections → documents)
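Row Level Security of this kind is declared in Postgres DDL. The fragment below is an illustrative sketch only: the table, column, and setting names are assumptions, not Scrapalot's actual schema.

```sql
-- Hypothetical table/column names; the real schema may differ.
ALTER TABLE workspaces ENABLE ROW LEVEL SECURITY;

-- Each user sees only workspaces they own or that are shared with them.
CREATE POLICY workspace_isolation ON workspaces
  USING (
    owner_id = current_setting('app.current_user_id')::uuid
    OR id IN (SELECT workspace_id FROM workspace_shares
              WHERE user_id = current_setting('app.current_user_id')::uuid)
  );
```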
Sharing & Collaboration
Workspace Sharing:
- Share entire workspace with team members
- Control access levels (owner, editor, viewer)
- Shared users automatically see all collections in workspace
Access Levels:
- Owner: Full control, can delete and share
- Editor: Add/edit documents, cannot delete workspace
- Viewer: Read-only access
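The three levels form a simple ordering, so a permission check reduces to a comparison. A minimal sketch using the role names from the list above (the function and action names are hypothetical):

```python
from enum import IntEnum

class Role(IntEnum):
    VIEWER = 1  # read-only access
    EDITOR = 2  # add/edit documents
    OWNER = 3   # full control, can delete and share

def can(role: Role, action: str) -> bool:
    """Allow an action when the user's role meets the minimum role it requires."""
    required = {
        "read": Role.VIEWER,
        "edit": Role.EDITOR,
        "share": Role.OWNER,
        "delete_workspace": Role.OWNER,
    }
    return role >= required[action]

assert can(Role.EDITOR, "edit")
assert not can(Role.EDITOR, "delete_workspace")
```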
Data Protection
Backup & Recovery
Backups:
- Configure your own backup schedule for self-hosted deployments
- Store backups securely off-site
- Test restoration regularly
Encryption
Data in Transit:
- All connections use TLS/SSL encryption
- API calls encrypted automatically
- Database connections secured
Data at Rest:
- Database encryption (configurable per deployment)
- API keys encrypted in storage
- Secure credential management
Compliance
Access Logging:
- Track who accessed what data
- Audit trail for compliance
- Security event monitoring
Data Retention:
- Configure retention policies
- Automatic cleanup options
- Export capabilities for archival
Performance Features
Vector Search Optimization
Fast Similarity Search:
- Optimized indexes for semantic search
- Efficient nearest-neighbor algorithms
- Configurable precision/speed trade-off
Search Performance:
- Sub-second search across millions of chunks
- Parallel query execution
- Smart caching for common queries
Connection Management
Automatic Pooling:
- Efficient connection reuse
- Handles high concurrent users
- Automatic scaling within limits
Performance Monitoring:
- Track query performance
- Identify slow operations
- Optimize based on usage patterns
Migration & Updates
Automatic Migrations
Zero-Downtime Updates:
- Database schema updates run automatically on startup
- No manual intervention needed
- Rollback capability if issues occur
What gets migrated:
- New features and improvements
- Security patches
- Performance optimizations
- Bug fixes
Cross-Platform Support
Works Everywhere:
- Development: SQLite for local testing
- Production: PostgreSQL for scale and reliability
- Automatic adaptation to environment
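Environment adaptation like this is commonly done by reading a connection URL with a local fallback. A minimal sketch; the variable name DATABASE_URL and the fallback path are assumptions, not Scrapalot's configuration:

```python
import os

def database_url() -> str:
    """Use PostgreSQL when configured, fall back to SQLite for local development."""
    return os.environ.get("DATABASE_URL", "sqlite:///./scrapalot-dev.db")

url = database_url()
print("postgres" if url.startswith("postgresql") else "sqlite")
```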
Troubleshooting
Common Issues
Cannot Connect to Database:
- Verify connection parameters are correct
- Check if database is running/accessible
- Try a transaction pooler if direct connections fail
- Ensure firewall allows connections
Slow Queries:
- Check if indexes are built
- Review query complexity
- Consider enabling Redis cache
- Monitor database resource usage
Out of Storage:
- Review document retention policies
- Delete unused collections
- Optimize chunk sizes
- Check for duplicate uploads
Performance Degradation:
- Monitor connection pool usage
- Check for long-running queries
- Review index health
- Consider read replicas for scale
Best Practices
Document Management
Organize Effectively:
- Use workspaces to separate major projects
- Group related documents in collections
- Add meaningful metadata
- Use consistent naming conventions
Performance Optimization:
- Choose appropriate chunk sizes for your content
- Select embedding models based on accuracy vs. speed needs
- Archive old collections when no longer needed
- Monitor storage usage
Security
Access Control:
- Use workspace sharing for team collaboration
- Review access permissions regularly
- Remove inactive users
- Use viewer role when edit access not needed
Data Safety:
- Verify backups are working
- Test restoration procedures
- Keep connection credentials secure
- Use environment variables, never hardcode
Scaling
Growing Your Deployment:
- Start with self-hosted pgvector Docker
- Enable Redis as query volume increases
- Add Neo4j when graph features needed
- Consider read replicas for high traffic
Related Documentation
- RAG Strategy - How vector search powers retrieval
- Model Management - Choosing embedding models
- Security - Access control details
- Deployment Guide - Production setup
Scrapalot handles all database complexity automatically. Just upload your documents and start asking questions - the rest happens behind the scenes.