
Uploading Documents

Learn how to upload, process, and manage documents in Scrapalot for optimal AI-powered search and analysis.

Upload Methods

Supported File Formats

Documents

  • PDF (.pdf) - Recommended, best support
  • Word (.docx, .doc)
  • Text (.txt)
  • Markdown (.md)
  • Rich Text (.rtf)

Books & Publications

  • EPUB (.epub) - E-books
  • MOBI (.mobi) - Kindle format (via conversion)

Data Files

  • CSV (.csv) - Tabular data
  • JSON (.json) - Structured data
  • XML (.xml) - Structured documents

Presentations & Spreadsheets

  • PowerPoint (.pptx, .ppt)
  • Excel (.xlsx, .xls)

Code & Technical

  • Jupyter Notebooks (.ipynb)
  • HTML (.html, .htm)

File Size Limits

  • Maximum file size: 100 MB per file
  • Recommended size: Under 50 MB for optimal processing
  • Large files: May take 2-5 minutes to process

Document Processing Pipeline

Processing Stages Explained

Stage 1: File Validation (0-10%)

What happens:

  • File type verification
  • Size check
  • Corruption detection
  • Virus scanning (if enabled)

Common errors:

  • "Unsupported file type"
  • "File too large"
  • "Corrupted file"
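The type and size checks in this stage can be reproduced locally before uploading. The following is a minimal sketch, not Scrapalot's actual validator: the function name `precheck` is hypothetical, the extension set is copied from the Supported File Formats list, and treating 100 MB as 100 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the "Supported File Formats" list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".md", ".rtf",
    ".epub", ".mobi",
    ".csv", ".json", ".xml",
    ".pptx", ".ppt", ".xlsx", ".xls",
    ".ipynb", ".html", ".htm",
}

# 100 MB per-file limit; interpreting MB as 1024 * 1024 bytes is an assumption.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def precheck(filename: str, size_bytes: int) -> Optional[str]:
    """Return the error a file would likely trigger, or None if it passes.

    Hypothetical helper: it only mirrors the type and size checks and
    cannot detect corruption or viruses.
    """
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        return "Unsupported file type"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return "File too large"
    return None
```

Running this over a folder before a bulk upload surfaces files that would be rejected during Stage 1.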

Stage 2: Text Extraction (10-40%)

What happens:

  • Extract text from document
  • Preserve formatting and structure
  • Handle images (OCR if enabled)
  • Extract metadata

Processing time:

  • Small documents (<5 MB): 5-15 seconds
  • Medium documents (5-20 MB): 15-45 seconds
  • Large documents (20-100 MB): 45-120 seconds

Stage 3: Chunking (40-60%)

What happens:

  • Split document into smaller chunks
  • Apply selected chunking strategy
  • Maintain context and coherence
  • Create hierarchical structure

Chunking strategies:

  • Contextual Retrieval (default) - Best for most documents
  • Late Chunking - Better for semantic relationships
  • Hierarchical - Multi-level document structure
  • Semantic - Meaning-based splitting
  • Agentic - AI-powered intelligent chunking

See: Document Processing Architecture
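To make "splitting a document into chunks" concrete, here is a deliberately naive sketch: fixed-size word windows with overlap. This is not one of the strategies listed above and is not how Scrapalot chunks documents; the function name and default sizes are illustrative assumptions.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears with some surrounding context in the next chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The listed strategies refine this basic idea by choosing boundaries from meaning or document structure rather than from a fixed word count.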

Stage 4: Embedding Generation (60-90%)

What happens:

  • Convert chunks to vector embeddings
  • Use selected embedding model
  • Store in vector database

Embedding models:

  • text-embedding-3-large (OpenAI) - High quality, 3072 dimensions
  • text-embedding-3-small (OpenAI) - Faster, 1536 dimensions
  • all-MiniLM-L6-v2 (Local) - Free, runs locally
  • nomic-embed-text (Local via Ollama) - Free, high quality
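The dimension counts matter because they drive vector storage. The sketch below estimates raw vector size per model, assuming each dimension is stored as a 4-byte float32 and ignoring index and metadata overhead (both assumptions; the actual storage layout is not described here). The 384 and 768 figures for the two local models are their published output sizes, not values from this page.

```python
# Output dimensions per model. The two OpenAI values come from the list
# above; 384 and 768 are the published sizes of the two local models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "all-MiniLM-L6-v2": 384,
    "nomic-embed-text": 768,
}

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def raw_vector_bytes(model: str, chunk_count: int) -> int:
    """Estimate raw vector storage for `chunk_count` chunks, without index overhead."""
    return EMBEDDING_DIMENSIONS[model] * BYTES_PER_FLOAT32 * chunk_count
```

Under these assumptions, 1,000 chunks take about 12.3 MB of raw vectors with text-embedding-3-large and about 6.1 MB with text-embedding-3-small.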

Stage 5: Database Storage (90-100%)

What happens:

  • Store document metadata
  • Index vector embeddings
  • Create search indexes
  • Enable document for querying

Result: Document is ready for search!
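The percentage ranges of the five stages can be turned into a small lookup, which helps when interpreting a progress bar that appears stuck. This is a sketch based only on the ranges listed above; assigning a boundary value such as 10% to the later stage is an assumption.

```python
# (upper bound in percent, stage name), taken from the stage headings above.
STAGES = [
    (10, "File Validation"),
    (40, "Text Extraction"),
    (60, "Chunking"),
    (90, "Embedding Generation"),
    (100, "Database Storage"),
]


def stage_for_progress(percent: float) -> str:
    """Map a progress percentage to the processing stage it falls in."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for upper_bound, name in STAGES:
        if percent < upper_bound:
            return name
    return STAGES[-1][1]  # exactly 100%: the final stage has finished
```

For example, a bar stalled at 25% is in Text Extraction, and one stalled at 85% is in Embedding Generation.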

Document States

How to Upload Documents

Web Interface Upload

  1. Navigate to your collection

    • Open the collection where you want to add documents
    • Or create a new collection first
  2. Click the Upload button

    • Located in the top-right corner
    • Or drag files directly into the collection area
  3. Select your file(s)

    • Browse to your file
    • Or drag and drop files
  4. Monitor progress

    • Watch the progress bar
    • See real-time status updates
    • Processing typically takes 10-60 seconds
  5. Document ready

    • You'll see a notification when complete
    • Document appears in your collection
    • Ready to search immediately

Bulk Upload

For multiple documents:

  1. Click "Bulk Upload"
  2. Select folder or multiple files
  3. Files are queued for processing
  4. Monitor batch progress
  5. All documents processed in parallel

Bulk Upload Tips

  • Process up to 50 files simultaneously
  • Total batch size limit: 1 GB
  • Processing time: ~1-2 minutes per 10 documents
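When a folder exceeds these limits, it can be split into several batches first. The sketch below groups files greedily so that each batch stays within 50 files and 1 GB. The function name is hypothetical, 1 GB is taken as 1024³ bytes, and reading "up to 50 files simultaneously" as a per-batch cap is an assumption.

```python
from typing import List, Tuple

MAX_FILES_PER_BATCH = 50
MAX_BATCH_BYTES = 1024 ** 3  # 1 GB; the binary interpretation is an assumption


def plan_batches(files: List[Tuple[str, int]]) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches within both limits."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0

    for name, size in files:
        if size > MAX_BATCH_BYTES:
            raise ValueError(f"{name} alone exceeds the batch size limit")
        batch_is_full = (
            len(current) >= MAX_FILES_PER_BATCH
            or current_bytes + size > MAX_BATCH_BYTES
        )
        if current and batch_is_full:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size

    if current:
        batches.append(current)
    return batches
```

Files keep their original order, so sorting by size beforehand is one way to get tighter batches.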

API Upload

For programmatic uploads:

```bash
curl -X POST http://localhost:8090/api/v1/documents/upload_stream \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@document.pdf" \
  -F "collection_id=<collection_uuid>" \
  --no-buffer
```

See: API Reference for full details
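For scripted uploads, the same request can be driven from Python by invoking curl. The sketch below is one possible wrapper, not an official client: the function names are hypothetical, and because the format of the streamed response is not documented on this page, the code simply prints each line as it arrives.

```python
import subprocess
from typing import List


def build_upload_command(
    file_path: str,
    collection_id: str,
    access_token: str,
    base_url: str = "http://localhost:8090",
) -> List[str]:
    """Build the curl argument list for the upload_stream endpoint."""
    return [
        "curl", "-X", "POST", f"{base_url}/api/v1/documents/upload_stream",
        "-H", f"Authorization: Bearer {access_token}",
        "-F", f"file=@{file_path}",
        "-F", f"collection_id={collection_id}",
        "--no-buffer",
    ]


def upload(file_path: str, collection_id: str, access_token: str) -> int:
    """Run the upload, echo streamed output line by line, and return curl's exit code."""
    command = build_upload_command(file_path, collection_id, access_token)
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            print(line.rstrip())  # response format is not assumed; printed as-is
    return process.returncode
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.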

External Connectors

Auto-sync documents from cloud storage and academic databases:

Supported connectors:

  • Google Drive
  • Dropbox
  • OneDrive
  • Academic databases (PubMed, arXiv, JSTOR)

Setup:

  1. Go to Settings → Integrations
  2. Click "Connect" for your service
  3. Authorize Scrapalot
  4. Select folders to sync
  5. Documents auto-sync every hour

See: Integrations Guide

Best Practices

Organizing Documents

Create topic-based collections:

Research Papers/
├── AI Safety/
├── Climate Science/
└── Medical Research/

Work Documents/
├── Q1 2025 Reports/
├── Product Specs/
└── Meeting Notes/

Benefits:

  • Faster, more accurate searches
  • Better context understanding
  • Easier management
  • Improved team collaboration

File Naming

Good naming:

  • 2024_Climate_Report_IPCC.pdf
  • Product_Spec_v2.3_Final.docx
  • Research_AI_Safety_Smith_2025.pdf

Avoid:

  • document.pdf
  • final_FINAL_v3_REAL.docx
  • untitled.txt

Why it matters: good names help with:

  • Finding documents later
  • Understanding content at a glance
  • Better search results
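A rough heuristic can flag weak file names before upload. The sketch below counts the informative words in a name, ignoring generic words and version or date tokens. The word list, the two-word minimum, and the function name are all illustrative assumptions, not a Scrapalot rule.

```python
import re
from pathlib import Path

# Illustrative list of words that say nothing about a document's content.
GENERIC_WORDS = {"document", "doc", "file", "untitled", "final", "real", "new", "copy", "draft"}

# Matches pure numbers and version tokens such as "2024", "3", or "v2".
VERSION_TOKEN = re.compile(r"v?\d+")


def is_descriptive_name(filename: str) -> bool:
    """Return True if the name contains at least two informative words."""
    stem = Path(filename).stem
    tokens = [token.lower() for token in re.split(r"[\W_]+", stem) if token]
    informative = [
        token for token in tokens
        if token not in GENERIC_WORDS and not VERSION_TOKEN.fullmatch(token)
    ]
    return len(informative) >= 2
```

With this rule, all three "Good naming" examples pass and all three "Avoid" examples are flagged.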

Optimize for Processing

Before uploading:

  1. Remove password protection

    • Encrypted PDFs cannot be processed
    • Remove passwords first
  2. Ensure text is selectable

    • Scanned PDFs need OCR
    • Prefer native PDFs over scans
  3. Check file integrity

    • Verify file opens correctly
    • Confirm it is not corrupted
  4. Consider file size

    • Compress large PDFs if possible
    • Split very large documents (>100 MB)
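Password protection can often be spotted without opening the file in a viewer. The sketch below uses a simple heuristic: an encrypted PDF references an `/Encrypt` dictionary in its trailer, so the check looks for that marker in the raw bytes. The function name is hypothetical, and the check can misfire (for example, if the literal text "/Encrypt" appears elsewhere in the file), so treat the result as a hint rather than a verdict.

```python
from pathlib import Path


def looks_encrypted_pdf(path: str) -> bool:
    """Heuristically detect an encrypted PDF by its /Encrypt trailer entry.

    Reads the whole file into memory, which is acceptable for files under
    the 100 MB upload limit.
    """
    data = Path(path).read_bytes()
    return data.startswith(b"%PDF-") and b"/Encrypt" in data
```

A dedicated PDF library gives a more reliable answer when one is available.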

Choosing Chunking Strategy

For most documents: Use Contextual Retrieval (default)

For academic papers: Use Late Chunking or Hierarchical

For code/technical docs: Use Semantic chunking

For mixed content: Use Agentic chunking (AI-powered)

Configure in: Settings → Documents → Chunking Strategy
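The recommendations above amount to a simple lookup. The sketch below encodes them for use in a script that prepares uploads. The category keys (`general`, `academic`, `code`, `mixed`) and the function name are hypothetical labels, not Scrapalot setting values.

```python
# Recommended strategy per document kind, following the guidance above.
RECOMMENDED_STRATEGY = {
    "general": "Contextual Retrieval",
    "academic": "Late Chunking",  # Hierarchical is the listed alternative
    "code": "Semantic",
    "mixed": "Agentic",
}

DEFAULT_STRATEGY = "Contextual Retrieval"


def recommend_strategy(document_kind: str) -> str:
    """Return the suggested chunking strategy, falling back to the default."""
    return RECOMMENDED_STRATEGY.get(document_kind, DEFAULT_STRATEGY)
```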

Troubleshooting

Upload Fails

Error: "File too large"

  • Maximum size is 100 MB
  • Compress PDF or split file
  • Use external connector for very large files

Error: "Unsupported file type"

  • Check supported formats above
  • Convert to PDF if possible
  • Contact support for format requests

Error: "Upload interrupted"

  • Check internet connection
  • Try again
  • Use smaller batches

Processing Stuck

Stuck at "Extracting" (25%)

  • Document may have complex formatting
  • Wait up to 5 minutes for large files
  • Cancel and retry if no progress

Stuck at "Embedding" (85%)

  • Embedding model may be slow
  • Check Settings → AI Providers
  • Switch to faster embedding model

Failed with "Processing error"

  • Check backend logs
  • File may be corrupted
  • Try re-uploading

Document Not Searchable

Possible causes:

  1. Document still processing (check status)
  2. Processing failed (check for errors)
  3. Wrong collection selected
  4. Document contains images only (needs OCR)

Solutions:

  1. Wait for processing to complete
  2. Re-upload if failed
  3. Verify collection selection
  4. Enable OCR in Settings

Poor Search Results

If answers don't reference your document:

  1. Check similarity threshold

    • Lower to 0.6-0.7 for broader matches
  2. Verify chunking strategy

    • Try Contextual Retrieval or Late Chunking
  3. Check question phrasing

    • Use terms from your document
  4. Inspect document chunks

    • View processed chunks in document details
    • Verify text extracted correctly
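To see why lowering the similarity threshold broadens results, consider the sketch below. It assumes the threshold is applied to cosine similarity between the query embedding and each chunk embedding, which is a common choice but is not confirmed on this page; the function names and toy vectors are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_chunks(
    query: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Return the names of chunks whose similarity to the query meets the threshold."""
    return [
        name for name, vector in chunks.items()
        if cosine_similarity(query, vector) >= threshold
    ]


# Toy two-dimensional embeddings; real embeddings have hundreds of dimensions.
toy_chunks = {"intro": [1.0, 0.5], "methods": [1.0, 1.0], "appendix": [0.5, 1.0]}
strict = matching_chunks([1.0, 0.0], toy_chunks, threshold=0.8)
broad = matching_chunks([1.0, 0.0], toy_chunks, threshold=0.65)
```

Here `strict` contains only `intro` (similarity about 0.89), while `broad` also admits `methods` (about 0.71); `appendix` (about 0.45) is excluded in both cases.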

Managing Documents

View Document Details

Click on any document to see:

  • Processing status
  • Metadata (title, author, date)
  • Chunk count and strategy
  • Storage size
  • Upload date
  • Last accessed

Edit Document Metadata

  1. Open document details
  2. Click "Edit"
  3. Update title, description, tags
  4. Save changes

Reprocess Document

If processing failed or used wrong settings:

  1. Open document
  2. Click "Reprocess"
  3. Select new chunking strategy (optional)
  4. Select new embedding model (optional)
  5. Confirm

Delete Document

  1. Open document
  2. Click "Delete"
  3. Confirm deletion
  4. The document and all of its chunks are removed

Deletion is Permanent

Deleted documents cannot be recovered. Download a copy first if needed.

Storage & Quotas

Storage Limits (Desktop App)

  • No hard limit (limited by disk space)
  • Recommended: Reserve 10 GB for moderate use
  • 1 GB ≈ 500-1000 typical PDFs
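The rule of thumb above implies an average of roughly 1 to 2 MB of storage per PDF. The sketch below scales it to any amount of disk space. Whether the figure includes embeddings and indexes is not stated here, so the result is only a coarse planning estimate; the function name is hypothetical.

```python
from typing import Tuple

# Rule of thumb from the list above: 1 GB holds about 500-1000 typical PDFs.
PDFS_PER_GB_LOW = 500
PDFS_PER_GB_HIGH = 1000


def estimated_pdf_capacity(storage_gb: float) -> Tuple[int, int]:
    """Return a (low, high) estimate of how many typical PDFs fit in `storage_gb`."""
    return (int(storage_gb * PDFS_PER_GB_LOW), int(storage_gb * PDFS_PER_GB_HIGH))
```

For the recommended 10 GB reservation this gives a range of 5,000 to 10,000 documents.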

Storage Limits (Cloud Plans)

| Plan | Storage | Documents |
|------|---------|-----------|
| Researcher (Free) | 10 GB | ~5,000 docs |
| Professional | 100 GB | ~50,000 docs |
| Enterprise | Custom | Unlimited |

Check Your Usage

Settings → Account → Storage

  • Current usage
  • Remaining quota
  • Breakdown by collection

Advanced Features

OCR (Optical Character Recognition)

For scanned PDFs and images:

  1. Enable in Settings → Documents → OCR
  2. Upload scanned document
  3. Scrapalot extracts the text from the page images
  4. The document becomes searchable like any other

Supported languages: English, Spanish, French, German, Chinese

Document Versioning

Keep track of document versions:

  1. Upload new version with same name
  2. The system asks: "Replace or keep both?"
  3. Choose "keep both" to store the upload as a new version, or "replace" to overwrite the existing document
  4. Access version history in document details

Batch Metadata Editing

Edit multiple documents at once:

  1. Select documents (checkbox)
  2. Click "Bulk Edit"
  3. Add tags, change collection, etc.
  4. Apply to all selected

Next: Once your documents are uploaded, learn how to ask effective questions to get the best answers.

Released under the MIT License.