
Background Processing

Scrapalot processes documents in the background so you never have to wait. Upload files and keep working; you'll be notified when processing completes.

How Background Processing Works

When you upload a document, Scrapalot immediately queues it for processing and returns control to you. Behind the scenes, specialized workers handle the heavy lifting.

What Gets Processed

Document Processing

When you upload files:

  1. Text Extraction - Content extracted from PDFs, Word docs, etc.
  2. Smart Chunking - Documents split into optimal segments
  3. Embedding Generation - Vector embeddings created for semantic search
  4. Indexing - Chunks stored and indexed for fast retrieval
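
The stages above can be sketched as a small pipeline. This is a minimal illustration, not Scrapalot's actual implementation: it assumes text extraction has already produced a plain string, uses fixed-size overlapping windows as a stand-in for smart chunking, and replaces the embedding model with a toy hashed word count. The names `chunk_text`, `embed`, and `process_document` are hypothetical.

```python
import zlib


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text: str, dims: int = 8) -> list[float]:
    """Toy stand-in for an embedding model: bucket word counts by a stable hash."""
    vector = [0.0] * dims
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vector


def process_document(doc_id: str, raw_text: str) -> dict:
    """Chunk, embed, and index already-extracted text in an in-memory dict."""
    index = {}
    for i, chunk in enumerate(chunk_text(raw_text)):
        index[(doc_id, i)] = {"text": chunk, "vector": embed(chunk)}
    return index
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of storing some text twice.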

Processing time:

  • Small documents (1-10 pages): 10-30 seconds
  • Medium documents (11-100 pages): 30-120 seconds
  • Large documents (more than 100 pages): 2-5 minutes

External Document Fetching

When you connect external sources:

  • Automatic synchronization on schedule
  • Downloads from Google Drive, web pages, APIs
  • Queued processing for each fetched document
  • Error handling and retry logic

Schedule options:

  • Manual (on-demand only)
  • Hourly
  • Daily
  • Weekly
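
As a rough illustration of how these schedule options could translate into a next run time, the sketch below maps each option to a fixed interval. The function name `next_sync` and the lowercase option strings are assumptions made for this example, not Scrapalot configuration values.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical option names mirroring the list above.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_sync(schedule: str, last_run: datetime) -> Optional[datetime]:
    """Return when the next sync is due, or None for manual (on-demand) sources."""
    if schedule == "manual":
        return None
    return last_run + INTERVALS[schedule]
```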

Real-Time Progress

Progress Tracking

You always know what's happening:

  • Upload stage (0-10%)
  • Validation (10-20%)
  • Text extraction (20-50%)
  • Chunking (50-70%)
  • Embedding generation (70-95%)
  • Final indexing (95-100%)
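
One way to read these ranges: each stage owns a slice of the overall progress bar, and progress within a stage is scaled into that slice. The sketch below shows the arithmetic; the stage keys and the `overall_progress` function are illustrative names, not part of Scrapalot's API.

```python
# Percentage ranges taken from the list above; the keys are hypothetical.
STAGES = {
    "upload": (0, 10),
    "validation": (10, 20),
    "text_extraction": (20, 50),
    "chunking": (50, 70),
    "embedding": (70, 95),
    "indexing": (95, 100),
}


def overall_progress(stage: str, fraction: float) -> float:
    """Map progress within a stage (0.0 to 1.0) onto the overall 0-100 scale."""
    start, end = STAGES[stage]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range input
    return start + (end - start) * fraction


print(overall_progress("text_extraction", 0.5))  # 35.0
```

Halfway through text extraction therefore shows as 35%, because that stage spans 20% to 50% of the bar.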

Visual feedback:

  • Progress bars in UI
  • Status messages
  • Estimated time remaining
  • Error notifications if issues occur

Notifications

Get notified when:

  • Documents finish processing
  • Processing errors occur
  • External fetches complete
  • Batch operations finish

Deployment Options

Minimal Setup (Default)

For small teams and getting started:

  • Essential workers only
  • Low memory footprint
  • Handles typical workloads
  • Good for 1-10 concurrent users

Requirements:

  • 4GB RAM minimum
  • Processes one document at a time
  • Good for most use cases

Enhanced Setup

For larger teams and high volume:

  • Multiple specialized workers
  • Parallel document processing
  • Faster turnaround times
  • Handles more than 10 concurrent users

Requirements:

  • 8GB+ RAM recommended
  • Processes multiple documents simultaneously
  • Better for production deployments

Resource Allocation

Workers adapt to your hardware:

  • Lightweight workers for small servers
  • Heavy workers for powerful machines
  • Automatic memory management
  • Graceful degradation under load

Error Handling

Automatic Retry

If processing fails:

  • Automatic retry with increasing delays
  • Up to 3 attempts per document
  • Clear error messages if all attempts fail
  • Queue continues with other documents
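
The retry behavior described above can be sketched as follows. This is an assumption-laden illustration: the source only says delays increase, so the doubling schedule, the one-second base delay, and the `process_with_retry` name are choices made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def process_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on failure with delays of base_delay, 2x, 4x, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # all attempts used: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Because the final failure is re-raised rather than swallowed, the caller can record a clear error for that document and move on to the next item in the queue.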

Common issues handled:

  • Temporary network failures
  • Rate limit exceeded (external sources)
  • Corrupted file detection
  • Timeout on very large files

Error Recovery

When things go wrong:

  1. You receive a clear error notification
  2. Other documents continue processing
  3. The failed document can be re-uploaded
  4. Detailed logs are available for troubleshooting

Error types:

  • File format not supported
  • Document too large
  • Corrupted file
  • External service unavailable

Performance Features

Smart Queuing

Priority handling:

  • User-triggered uploads get priority
  • Scheduled syncs run during low activity
  • System balances load automatically
  • No queue starvation
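
To make the idea of prioritizing uploads without starving scheduled syncs concrete, here is a minimal sketch. It is one possible policy, not necessarily the one Scrapalot uses: uploads are served first, but after a fixed number of consecutive uploads a waiting sync job gets a turn. The `JobQueue` class and the `max_streak` parameter are hypothetical.

```python
from collections import deque


class JobQueue:
    """Two FIFO lanes: user uploads first, with a cap on consecutive uploads."""

    def __init__(self, max_streak: int = 3):
        self.uploads = deque()
        self.syncs = deque()
        self.max_streak = max_streak
        self._streak = 0  # uploads served since the last sync job

    def put(self, job, user_triggered: bool) -> None:
        (self.uploads if user_triggered else self.syncs).append(job)

    def get(self):
        """Return the next job, or None if both lanes are empty."""
        sync_turn = not self.uploads or self._streak >= self.max_streak
        if self.syncs and sync_turn:
            self._streak = 0
            return self.syncs.popleft()
        if self.uploads:
            self._streak += 1
            return self.uploads.popleft()
        return None
```

With `max_streak=3`, a sync job queued behind five uploads is served fourth instead of last.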

Resource Management

Efficient processing:

  • Memory limits prevent server overload
  • Automatic worker restart on memory buildup
  • CPU throttling for better responsiveness
  • Disk space monitoring

Scalability

Grows with your needs:

  • Add more workers as usage increases
  • Horizontal scaling supported
  • Load balancing across workers
  • No downtime for scaling

Monitoring & Health

System Health

Track processing performance:

  • Active jobs count
  • Queue depth
  • Processing times
  • Error rates
  • Resource utilization

Access via:

  • Admin dashboard
  • Health check endpoint
  • System logs
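
The metrics listed above could be derived from job records along these lines. The sketch is illustrative only: the `JobRecord` fields, the status strings, and the shape of the returned dictionary are assumptions, not the format of Scrapalot's health check endpoint.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    status: str  # assumed values: "queued", "active", "done", "failed"
    seconds: float = 0.0  # processing time, meaningful once finished


def health_snapshot(jobs: list[JobRecord]) -> dict:
    """Summarize active jobs, queue depth, processing time, and error rate."""
    finished = [j for j in jobs if j.status in ("done", "failed")]
    failed = [j for j in finished if j.status == "failed"]
    return {
        "active_jobs": sum(j.status == "active" for j in jobs),
        "queue_depth": sum(j.status == "queued" for j in jobs),
        "avg_processing_seconds": mean(j.seconds for j in finished) if finished else 0.0,
        "error_rate": len(failed) / len(finished) if finished else 0.0,
    }
```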

Scheduled Tasks

Automatic maintenance:

  • Cleanup of temporary files
  • Old session removal
  • Job history archival
  • Health checks

Schedule:

  • Runs during low activity periods
  • Configurable timing
  • Minimal performance impact

Configuration Options

Worker Tuning

Adjust for your environment:

  • Low Memory: Reduce concurrent processing
  • High Memory: Increase parallel workers
  • CPU Limited: Reduce worker concurrency
  • Fast Storage: Increase batch sizes
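
As a rough starting point for the tuning advice above, the sketch below picks a worker concurrency from available memory and CPU cores. The 4 GB per worker figure is an assumption inferred from the deployment requirements on this page (4GB for one document at a time), and `suggest_concurrency` is a hypothetical helper, not a Scrapalot setting.

```python
def suggest_concurrency(memory_gb: float, cpu_cores: int, gb_per_worker: float = 4.0) -> int:
    """Suggest how many documents to process in parallel, never fewer than one."""
    by_memory = int(memory_gb // gb_per_worker)  # workers that fit in RAM
    return max(1, min(by_memory, cpu_cores))  # also capped by CPU cores
```

For example, a 16 GB machine with 2 cores would be limited to 2 workers by its CPU, while an 8 GB machine with 4 cores would be limited to 2 by memory. Treat the result as a first guess and adjust it against observed resource usage.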

Processing Behavior

Customize processing:

  • Chunk size preferences
  • Embedding model selection
  • Retry attempt limits
  • Timeout durations

Troubleshooting

Slow Processing

If documents take too long:

  • Check system resource usage
  • Verify worker health
  • Review document size/complexity
  • Consider adding more workers

Typical solutions:

  • Reduce concurrent uploads
  • Increase worker memory allocation
  • Split very large documents
  • Use faster embedding models

Processing Stuck

If progress stops:

  • Check worker status
  • Review error logs
  • Restart workers if needed
  • Re-upload problematic documents

Prevention:

  • Monitor queue depth
  • Set appropriate timeouts
  • Regular health checks

High Resource Usage

If the system runs hot:

  • Reduce worker concurrency
  • Increase restart frequency
  • Monitor for memory leaks
  • Review processing limits

Best Practices

Efficient Uploads

Optimize your workflow:

  • Upload related documents together
  • Use appropriate file formats
  • Pre-process very large files
  • Remove unnecessary pages

Scheduled Syncs

Configure wisely:

  • Schedule during off-peak hours
  • Set reasonable sync frequencies
  • Monitor quota usage (external sources)
  • Review and clean old syncs

Resource Planning

Plan for growth:

  • Start with minimal setup
  • Monitor actual usage patterns
  • Scale workers as needed
  • Review performance metrics regularly

Background processing is automatic and requires no user intervention. Just upload documents and Scrapalot handles the rest.

Released under the MIT License.