Background Processing
Scrapalot processes documents in the background so you never have to wait. Upload files and continue working - you'll be notified when processing completes.
How Background Processing Works
When you upload a document, Scrapalot immediately queues it for processing and returns control to you. Behind the scenes, specialized workers handle the heavy lifting.
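The queue-and-return pattern described above can be sketched in a few lines. This is only an illustration built on Python's standard library. The names `jobs`, `processed`, `worker`, and `upload` are hypothetical and do not reflect Scrapalot's actual internals.

```python
import queue
import threading

# Hypothetical in-memory job queue; a real deployment would likely use a broker.
jobs: "queue.Queue[str]" = queue.Queue()
processed: list[str] = []


def worker() -> None:
    """Pull document names off the queue and process them one at a time."""
    while True:
        name = jobs.get()
        try:
            # Stand-in for extraction, chunking, embedding, and indexing.
            processed.append(name)
        finally:
            jobs.task_done()


def upload(name: str) -> str:
    """Queue the document and return immediately, without waiting for the worker."""
    jobs.put(name)
    return "queued"


# A daemon thread plays the role of a background worker in this sketch.
threading.Thread(target=worker, daemon=True).start()
```

Calling `upload("report.pdf")` returns right away, and `jobs.join()` blocks until the worker has finished everything that was queued.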
What Gets Processed
Document Processing
When you upload files:
- Text Extraction - Content extracted from PDFs, Word docs, etc.
- Smart Chunking - Documents split into optimal segments
- Embedding Generation - Vector embeddings created for semantic search
- Indexing - Chunks stored and indexed for fast retrieval
Processing time:
- Small documents (1-10 pages): 10-30 seconds
- Medium documents (11-100 pages): 30-120 seconds
- Large documents (more than 100 pages): 2-5 minutes
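To make the chunking step more concrete, here is a minimal sketch of splitting text into overlapping segments. It assumes simple word-based splitting. The function name `chunk_text` and the `chunk_size` and `overlap` defaults are illustrative assumptions, not Scrapalot's actual chunking strategy.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks that share `overlap` words (illustrative)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a little shared context between neighboring chunks, which tends to help retrieval when a relevant passage straddles a chunk boundary.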
External Document Fetching
When you connect external sources:
- Automatic synchronization on schedule
- Downloads from Google Drive, web pages, APIs
- Queued processing for each fetched document
- Error handling and retry logic
Schedule options:
- Manual (on-demand only)
- Hourly
- Daily
- Weekly
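As a rough illustration of how these schedule options could translate into run times, the sketch below computes the next sync from the last one. The `INTERVALS` table and the `next_sync` function are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed mapping from schedule option to interval; "manual" has no interval.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_sync(schedule: str, last_run: datetime) -> Optional[datetime]:
    """Return when the next sync is due, or None for manual (on-demand) sources."""
    if schedule == "manual":
        return None
    return last_run + INTERVALS[schedule]
```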
Real-Time Progress
Progress Tracking
You always know what's happening:
- Upload stage (0-10%)
- Validation (10-20%)
- Text extraction (20-50%)
- Chunking (50-70%)
- Embedding generation (70-95%)
- Final indexing (95-100%)
Visual feedback:
- Progress bars in UI
- Status messages
- Estimated time remaining
- Error notifications if issues occur
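The stage percentages listed above can be combined into a single overall figure. The sketch below shows one way to do that, assuming each stage reports its own completion as a fraction between 0 and 1. The stage keys and the `overall_progress` function are illustrative names, not part of a documented API.

```python
# Stage boundaries taken from the percentages listed in this section.
STAGE_RANGES = {
    "upload": (0, 10),
    "validation": (10, 20),
    "text_extraction": (20, 50),
    "chunking": (50, 70),
    "embedding": (70, 95),
    "indexing": (95, 100),
}


def overall_progress(stage: str, fraction_done: float) -> float:
    """Map progress within one stage onto the overall 0-100% scale."""
    start, end = STAGE_RANGES[stage]
    # Clamp so that a misreported fraction cannot push the bar out of range.
    fraction_done = min(max(fraction_done, 0.0), 1.0)
    return start + (end - start) * fraction_done
```

For example, a document halfway through text extraction sits at 35% overall.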
Notifications
Get notified when:
- Documents finish processing
- Processing errors occur
- External fetches complete
- Batch operations finish
Deployment Options
Minimal Setup (Default)
For small teams and getting started:
- Essential workers only
- Low memory footprint
- Handles typical workloads
- Good for 1-10 concurrent users
Requirements:
- 4GB RAM minimum
- Processes one document at a time
- Good for most use cases
Enhanced Setup
For larger teams and high volume:
- Multiple specialized workers
- Parallel document processing
- Faster turnaround times
- Handles more than 10 concurrent users
Requirements:
- 8GB+ RAM recommended
- Processes multiple documents simultaneously
- Better for production deployments
Resource Allocation
Workers adapt to your hardware:
- Lightweight workers for small servers
- Heavy workers for powerful machines
- Automatic memory management
- Graceful degradation under load
Error Handling
Automatic Retry
If processing fails:
- Automatic retry with increasing delays
- Up to 3 attempts per document
- Clear error messages if all attempts fail
- Queue continues with other documents
Common issues handled:
- Temporary network failures
- Rate limit exceeded (external sources)
- Corrupted file detection
- Timeout on very large files
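The retry behavior described above (increasing delays, up to 3 attempts) is commonly implemented as exponential backoff. The following is a minimal sketch under that assumption. The function name `retry_with_backoff`, the doubling factor, and the one-second base delay are illustrative, since the exact delays Scrapalot uses are not specified here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on failure with delays of base_delay * 1, 2, 4, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # all attempts used up: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch easy to test. In a real worker, permanent errors such as an unsupported file format would likely be excluded from retries, because retrying cannot fix them.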
Error Recovery
When things go wrong:
- You receive a clear error notification
- Other documents continue processing
- The failed document can be re-uploaded
- Detailed logs for troubleshooting
Error types:
- File format not supported
- Document too large
- Corrupted file
- External service unavailable
Performance Features
Smart Queuing
Priority handling:
- User-triggered uploads get priority
- Scheduled syncs run during low activity
- System balances load automatically
- No queue starvation
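One simple way to give user uploads priority without starving scheduled syncs is to cap how many user jobs may run in a row while a sync job is waiting. The sketch below illustrates that idea. The `JobQueue` class and its `max_user_streak` parameter are hypothetical and are not a description of Scrapalot's actual scheduler.

```python
from collections import deque


class JobQueue:
    """Two-lane queue: user uploads first, with a cap to avoid starving syncs."""

    def __init__(self, max_user_streak: int = 3) -> None:
        self._user: deque[str] = deque()
        self._sync: deque[str] = deque()
        self._max_user_streak = max_user_streak
        self._streak = 0  # user jobs served since the last sync job

    def put(self, job: str, user_triggered: bool) -> None:
        (self._user if user_triggered else self._sync).append(job)

    def get(self) -> str:
        # Serve a sync job when no user job waits or the streak cap is reached.
        if self._sync and (not self._user or self._streak >= self._max_user_streak):
            self._streak = 0
            return self._sync.popleft()
        if self._user:
            self._streak += 1
            return self._user.popleft()
        raise IndexError("queue is empty")
```

With `max_user_streak=2`, a waiting sync job is served after at most two user uploads, so it cannot be postponed indefinitely.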
Resource Management
Efficient processing:
- Memory limits prevent server overload
- Automatic worker restart on memory buildup
- CPU throttling for better responsiveness
- Disk space monitoring
Scalability
Grows with your needs:
- Add more workers as usage increases
- Horizontal scaling supported
- Load balancing across workers
- No downtime for scaling
Monitoring & Health
System Health
Track processing performance:
- Active jobs count
- Queue depth
- Processing times
- Error rates
- Resource utilization
Access via:
- Admin dashboard
- Health check endpoint
- System logs
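To show how the metrics above might be combined into a single health report, here is a small sketch. The `HealthSnapshot` class, its field names, and the 10% error-rate threshold are assumptions made for illustration. The actual health check endpoint may expose different fields.

```python
from dataclasses import asdict, dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical point-in-time view of the processing system."""

    active_jobs: int
    queue_depth: int
    completed: int
    failed: int
    avg_processing_seconds: float

    @property
    def error_rate(self) -> float:
        total = self.completed + self.failed
        return self.failed / total if total else 0.0

    def to_payload(self, max_error_rate: float = 0.1) -> dict:
        """Build a dictionary that a health endpoint could serialize as JSON."""
        payload = asdict(self)
        payload["error_rate"] = self.error_rate
        healthy = self.error_rate <= max_error_rate
        payload["status"] = "healthy" if healthy else "degraded"
        return payload
```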
Scheduled Tasks
Automatic maintenance:
- Cleanup of temporary files
- Old session removal
- Job history archival
- Health checks
Schedule:
- Runs during low activity periods
- Configurable timing
- Minimal performance impact
Configuration Options
Worker Tuning
Adjust for your environment:
- Low Memory: Reduce concurrent processing
- High Memory: Increase parallel workers
- CPU Limited: Reduce worker concurrency
- Fast Storage: Increase batch sizes
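The tuning guidance above boils down to limiting concurrency by whichever resource is scarcer. A rough sketch of that rule follows. The function `suggested_concurrency` and the 4 GB-per-worker budget are assumptions, chosen so that 4 GB of RAM yields one document at a time, in line with the minimal setup.

```python
def suggested_concurrency(ram_gb: float, cpu_cores: int, gb_per_worker: float = 4.0) -> int:
    """Suggest how many documents to process in parallel (illustrative heuristic)."""
    by_memory = int(ram_gb // gb_per_worker)  # workers that fit in memory
    # Never exceed the core count, and always allow at least one worker.
    return max(1, min(by_memory, cpu_cores))
```

Treat the result as a starting point and adjust it based on observed memory usage and processing times.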
Processing Behavior
Customize processing:
- Chunk size preferences
- Embedding model selection
- Retry attempt limits
- Timeout durations
Troubleshooting
Slow Processing
If documents take too long:
- Check system resource usage
- Verify worker health
- Review document size/complexity
- Consider adding more workers
Typical solutions:
- Reduce concurrent uploads
- Increase worker memory allocation
- Split very large documents
- Use faster embedding models
Processing Stuck
If progress stops:
- Check worker status
- Review error logs
- Restart workers if needed
- Re-upload problematic documents
Prevention:
- Monitor queue depth
- Set appropriate timeouts
- Regular health checks
High Resource Usage
If the system runs hot:
- Reduce worker concurrency
- Increase restart frequency
- Monitor for memory leaks
- Review processing limits
Best Practices
Efficient Uploads
Optimize your workflow:
- Upload related documents together
- Use appropriate file formats
- Pre-process very large files
- Remove unnecessary pages
Scheduled Syncs
Configure wisely:
- Schedule during off-peak hours
- Set reasonable sync frequencies
- Monitor quota usage (external sources)
- Review and clean old syncs
Resource Planning
Plan for growth:
- Start with minimal setup
- Monitor actual usage patterns
- Scale workers as needed
- Review performance metrics regularly
Related Documentation
- External Connectors - Automatic document fetching
- Database Design - Where processed data goes
- Document Processing - Chunking strategies
- Deployment Guide - Production configuration
Background processing is automatic and requires no user intervention. Just upload documents and Scrapalot handles the rest.