Uploading Documents

Learn how to upload, process, and manage documents in Scrapalot for optimal AI-powered search and analysis.

Upload Methods

Supported File Formats

Documents

PDF (.pdf) - Recommended, best support
Word (.docx, .doc)
Text (.txt)
Markdown (.md)
Rich Text (.rtf)

Books & Publications

EPUB (.epub) - E-books
MOBI (.mobi) - Kindle format (via conversion)

Data Files

CSV (.csv) - Tabular data
JSON (.json) - Structured data
XML (.xml) - Structured documents

Presentations & Spreadsheets

PowerPoint (.pptx, .ppt)
Excel (.xlsx, .xls)

Code & Technical

Jupyter Notebooks (.ipynb)
HTML (.html, .htm)

File Size Limits

Maximum file size: 100 MB per file
Recommended size: Under 50 MB for optimal processing
Large files: May take 2-5 minutes to process
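
The format list and size limits above can be checked locally before uploading. The following is a minimal sketch, not Scrapalot's own validation code: `check_before_upload` is a hypothetical helper, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Extensions and limits copied from the lists above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".md", ".rtf",
    ".epub", ".mobi",
    ".csv", ".json", ".xml",
    ".pptx", ".ppt", ".xlsx", ".xls",
    ".ipynb", ".html", ".htm",
}
MAX_BYTES = 100 * 1024 * 1024         # 100 MB hard limit (1 MB = 1024 * 1024 bytes is assumed)
RECOMMENDED_BYTES = 50 * 1024 * 1024  # 50 MB recommended ceiling


def check_before_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks fine."""
    problems = []
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("Unsupported file type")
    if size_bytes > MAX_BYTES:
        problems.append("File too large")
    elif size_bytes > RECOMMENDED_BYTES:
        problems.append("Over the recommended 50 MB; processing may be slow")
    return problems
```

Calling `check_before_upload("report.pdf", 10 * 1024 * 1024)` returns an empty list, while a 101 MB PDF yields `["File too large"]`.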

Document Processing Pipeline

Processing Stages Explained

Stage 1: File Validation (0-10%)

What happens:

File type verification
Size check
Corruption detection
Virus scanning (if enabled)

Common errors:

"Unsupported file type"
"File too large"
"Corrupted file"

Stage 2: Text Extraction (10-40%)

What happens:

Extract text from document
Preserve formatting and structure
Handle images (OCR if enabled)
Extract metadata

Processing time:

Small documents (<5MB): 5-15 seconds
Medium documents (5-20MB): 15-45 seconds
Large documents (20-100MB): 45-120 seconds
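
As a rough illustration, the size brackets above can be turned into a lookup. This is a sketch only: `extraction_time_range` is a hypothetical name, and because the listed ranges share their boundary values, assigning a file of exactly 5 MB or 20 MB to the larger bracket is an assumption.

```python
def extraction_time_range(size_mb: float) -> tuple[int, int]:
    """Return the (min, max) expected text extraction time in seconds."""
    if size_mb <= 0 or size_mb > 100:
        raise ValueError("size must be above 0 and at most 100 MB")
    if size_mb < 5:
        return (5, 15)      # small documents
    if size_mb < 20:
        return (15, 45)     # medium documents
    return (45, 120)        # large documents, up to the 100 MB limit


print(extraction_time_range(12))
```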

Stage 3: Chunking (40-60%)

What happens:

Split document into smaller chunks
Apply selected chunking strategy
Maintain context and coherence
Create hierarchical structure

Chunking strategies:

Contextual Retrieval (default) - Best for most documents
Late Chunking - Better for semantic relationships
Hierarchical - Multi-level document structure
Semantic - Meaning-based splitting
Agentic - AI-powered intelligent chunking

See: Document Processing Architecture

Stage 4: Embedding Generation (60-90%)

What happens:

Convert chunks to vector embeddings
Use selected embedding model
Store in vector database

Embedding models:

text-embedding-3-large (OpenAI) - High quality, 3072 dimensions
text-embedding-3-small (OpenAI) - Faster, 1536 dimensions
all-MiniLM-L6-v2 (Local) - Free, runs locally
nomic-embed-text (Local via Ollama) - Free, high quality
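
The dimension counts matter for storage as well as quality. The sketch below estimates the raw vector size for the two models whose dimensions are listed above. It assumes each value is stored as a 32-bit float and ignores index and metadata overhead, so the real footprint in the vector database will differ.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

# Dimensions as listed above.
DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


def raw_vector_bytes(model: str, chunk_count: int) -> int:
    """Raw size of the embeddings for `chunk_count` chunks, without any index overhead."""
    return DIMENSIONS[model] * BYTES_PER_FLOAT32 * chunk_count


for model in DIMENSIONS:
    mib = raw_vector_bytes(model, 1000) / (1024 * 1024)
    print(f"{model}: {mib:.1f} MiB for 1,000 chunks")
```

Under these assumptions, 1,000 chunks take about 11.7 MiB with the large model and about 5.9 MiB with the small one.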

Stage 5: Database Storage (90-100%)

What happens:

Store document metadata
Index vector embeddings
Create search indexes
Enable document for querying

Result: Document is ready for search!
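
The percentage ranges in the stage headings can be read as a simple lookup from progress value to stage, which helps when interpreting a stalled progress bar. A minimal sketch follows; `stage_for_progress` is a hypothetical helper, and assigning a boundary value such as 10% to the later stage is an assumption.

```python
# Upper bound of each stage's progress range, taken from the stage headings above.
STAGES = [
    (10, "File Validation"),
    (40, "Text Extraction"),
    (60, "Chunking"),
    (90, "Embedding Generation"),
    (100, "Database Storage"),
]


def stage_for_progress(percent: float) -> str:
    """Map a progress percentage to the pipeline stage it falls in."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for upper_bound, name in STAGES:
        if percent < upper_bound:
            return name
    return STAGES[-1][1]  # exactly 100%: the final stage has finished


print(stage_for_progress(25))
print(stage_for_progress(85))
```

A bar that stops at 25% is in Text Extraction, and one that stops at 85% is in Embedding Generation.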

Document States

How to Upload Documents

Web Interface Upload

1. Navigate to your collection
   - Open the collection where you want to add documents
   - Or create a new collection first
2. Click the Upload button
   - Located in the top-right corner
   - Or drag files directly into the collection area
3. Select your file(s)
   - Browse to your file
   - Or drag and drop files
4. Monitor progress
   - Watch the progress bar
   - See real-time status updates
   - Processing typically takes 10-60 seconds
5. Document ready
   - You'll see a notification when complete
   - Document appears in your collection
   - Ready to search immediately

Bulk Upload

For multiple documents:

Click "Bulk Upload"
Select folder or multiple files
Files are queued for processing
Monitor batch progress
All documents processed in parallel

Bulk Upload Tips

Process up to 50 files simultaneously
Total batch size limit: 1GB
Processing time: ~1-2 minutes per 10 documents
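
If you have more files than one batch allows, you can split them before uploading. The sketch below groups files so that each batch respects the 50-file and 1 GB limits above. `plan_batches` is a hypothetical helper, not part of Scrapalot, and reading 1 GB as 1024³ bytes is an assumption.

```python
MAX_FILES_PER_BATCH = 50
MAX_BATCH_BYTES = 1024 ** 3  # 1 GB total per batch (1024^3 bytes is assumed)


def plan_batches(files: list[tuple[str, int]]) -> list[list[str]]:
    """Group (name, size_bytes) pairs into batches, keeping the original order."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for name, size in files:
        if size > MAX_BATCH_BYTES:
            raise ValueError(f"{name} alone exceeds the batch size limit")
        too_many = len(current) >= MAX_FILES_PER_BATCH
        too_big = current_bytes + size > MAX_BATCH_BYTES
        if current and (too_many or too_big):
            # Close the current batch and start a new one.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

For example, 120 files of 1 MB each are split into batches of 50, 50, and 20 files.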

API Upload

For programmatic uploads:

```bash
curl -X POST http://localhost:8090/api/v1/documents/upload_stream \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@document.pdf" \
  -F "collection_id=<collection_uuid>" \
  --no-buffer
```

See: API Reference for full details
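
The same request can be assembled in Python with the standard library alone. This sketch mirrors the curl call above (same URL, bearer token header, and `file` and `collection_id` form fields) and builds the request without sending it. `build_upload_request` is a hypothetical helper, the `application/octet-stream` part type is an assumption, and the format of the streamed response is not described here, so the sketch does not try to parse it.

```python
import uuid
from pathlib import Path
from urllib.request import Request

UPLOAD_URL = "http://localhost:8090/api/v1/documents/upload_stream"  # from the curl example


def build_upload_request(file_path, collection_id, access_token, url=UPLOAD_URL):
    """Build a multipart/form-data POST request equivalent to the curl call."""
    path = Path(file_path)
    boundary = uuid.uuid4().hex  # random separator between form parts

    field_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="collection_id"\r\n'
        "\r\n"
        f"{collection_id}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n"  # assumed part type
        "\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")

    body = field_part + file_header + path.read_bytes() + closing
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the upload. The whole file is read into memory, which is acceptable under the 100 MB limit but worth keeping in mind for large batches.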

External Connectors

Sync documents automatically from cloud storage and academic databases.

Supported connectors:

Google Drive
Dropbox
OneDrive
Academic databases (PubMed, arXiv, JSTOR)

Setup:

Go to Settings → Integrations
Click "Connect" for your service
Authorize Scrapalot
Select folders to sync
Documents auto-sync every hour

See: Integrations Guide

Best Practices

Organizing Documents

Create topic-based collections:

Research Papers/
├── AI Safety/
├── Climate Science/
└── Medical Research/

Work Documents/
├── Q1 2025 Reports/
├── Product Specs/
└── Meeting Notes/

Benefits:

Faster, more accurate searches
Better context understanding
Easier management
Improved team collaboration

File Naming

Good naming:

✅ 2024_Climate_Report_IPCC.pdf
✅ Product_Spec_v2.3_Final.docx
✅ Research_AI_Safety_Smith_2025.pdf

Avoid:

❌ document.pdf
❌ final_FINAL_v3_REAL.docx
❌ untitled.txt

Good names help with:

Finding documents later
Understanding content at a glance
Better search results

Optimize for Processing

Before uploading:

1. Remove password protection
   - Encrypted PDFs cannot be processed
   - Remove passwords first
2. Ensure text is selectable
   - Scanned PDFs need OCR
   - Prefer native PDFs over scans
3. Check file integrity
   - Verify the file opens correctly
   - Confirm it is not corrupted
4. Consider file size
   - Compress large PDFs if possible
   - Split very large documents (>100MB)

Choosing Chunking Strategy

For most documents: Use Contextual Retrieval (default)

For academic papers: Use Late Chunking or Hierarchical

For code/technical docs: Use Semantic chunking

For mixed content: Use Agentic chunking (AI-powered)

Configure in: Settings → Documents → Chunking Strategy
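
The recommendations above amount to a small lookup with a default. The sketch below restates them as data. The content labels (`"academic"`, `"code"`, `"mixed"`) are hypothetical, and the strategy names are the display names used in this guide, not confirmed configuration values.

```python
DEFAULT_STRATEGY = "Contextual Retrieval"

# Hypothetical content labels mapped to the strategies recommended above.
STRATEGY_BY_CONTENT = {
    "academic": ["Late Chunking", "Hierarchical"],
    "code": ["Semantic"],
    "mixed": ["Agentic"],
}


def recommended_strategies(content_kind: str) -> list[str]:
    """Return the suggested strategies, falling back to the default."""
    return STRATEGY_BY_CONTENT.get(content_kind, [DEFAULT_STRATEGY])


print(recommended_strategies("academic"))
print(recommended_strategies("meeting notes"))
```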

Troubleshooting

Upload Fails

Error: "File too large"

Maximum size is 100MB
Compress PDF or split file
Use external connector for very large files

Error: "Unsupported file type"

Check supported formats above
Convert to PDF if possible
Contact support for format requests

Error: "Upload interrupted"

Check internet connection
Try again
Use smaller batches

Processing Stuck

Stuck at "Extracting" (25%)

Document may have complex formatting
Wait up to 5 minutes for large files
Cancel and retry if no progress

Stuck at "Embedding" (85%)

Embedding model may be slow
Check Settings → AI Providers
Switch to faster embedding model

Failed with "Processing error"

Check backend logs
File may be corrupted
Try re-uploading

Document Not Searchable

Possible causes:

Document still processing (check status)
Processing failed (check for errors)
Wrong collection selected
Document contains images only (needs OCR)

Solutions:

Wait for processing to complete
Re-upload if failed
Verify collection selection
Enable OCR in Settings

Poor Search Results

If answers don't reference your document:

1. Check similarity threshold
   - Lower to 0.6-0.7 for broader matches
2. Verify chunking strategy
   - Try Contextual Retrieval or Late Chunking
3. Check question phrasing
   - Use terms from your document
4. Inspect document chunks
   - View processed chunks in document details
   - Verify the text was extracted correctly
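
To see why lowering the threshold broadens matches, consider the sketch below. It assumes the threshold is a minimum similarity score where higher means more similar. The chunk names and scores are invented for illustration and do not come from a real query.

```python
def chunks_above(scored_chunks, threshold):
    """Keep only the chunks whose similarity score meets the threshold."""
    return [chunk for chunk, score in scored_chunks if score >= threshold]


# Hypothetical (chunk, similarity) pairs for one question.
scored = [
    ("chunk-a", 0.82),
    ("chunk-b", 0.71),
    ("chunk-c", 0.64),
    ("chunk-d", 0.41),
]

print(len(chunks_above(scored, 0.8)))  # strict threshold
print(len(chunks_above(scored, 0.6)))  # lowered threshold
```

With these sample scores, a threshold of 0.8 keeps one chunk and a threshold of 0.6 keeps three, so more of the document can contribute to the answer.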

Managing Documents

View Document Details

Click on any document to see:

Processing status
Metadata (title, author, date)
Chunk count and strategy
Storage size
Upload date
Last accessed

Edit Document Metadata

Open document details
Click "Edit"
Update title, description, tags
Save changes

Reprocess Document

If processing failed or used wrong settings:

Open document
Click "Reprocess"
Select new chunking strategy (optional)
Select new embedding model (optional)
Confirm

Delete Document

Open document
Click "Delete"
Confirm deletion
The document and all of its chunks are removed

Deletion is Permanent

Deleted documents cannot be recovered. Download a copy first if needed.

Storage & Quotas

Storage Limits (Desktop App)

No hard limit (limited by disk space)
Recommended: Reserve 10GB for moderate use
1GB ≈ 500-1000 typical PDFs

Storage Limits (Cloud Plans)

Plan	Storage	Documents
Researcher (Free)	10 GB	~5,000 docs
Professional	100 GB	~50,000 docs
Enterprise	Custom	Unlimited
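
The document counts in this table line up with the low end of the rule of thumb from the desktop section (1 GB ≈ 500-1000 typical PDFs). The sketch below applies that ratio to any storage size. `estimated_document_capacity` is a hypothetical helper, and the result is only as good as the "typical PDF" assumption.

```python
DOCS_PER_GB_LOW = 500    # low end of "1GB is roughly 500-1000 typical PDFs"
DOCS_PER_GB_HIGH = 1000  # high end of the same rule of thumb


def estimated_document_capacity(storage_gb: float) -> tuple[int, int]:
    """Return the (low, high) estimate of how many typical PDFs fit."""
    return (int(storage_gb * DOCS_PER_GB_LOW), int(storage_gb * DOCS_PER_GB_HIGH))


print(estimated_document_capacity(10))
print(estimated_document_capacity(100))
```

For 10 GB the estimate is 5,000 to 10,000 documents, and for 100 GB it is 50,000 to 100,000, so the plan figures above correspond to the conservative end.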

Check Your Usage

Settings → Account → Storage

Current usage
Remaining quota
Breakdown by collection

Advanced Features

OCR (Optical Character Recognition)

For scanned PDFs and images:

Enable in Settings → Documents → OCR
Upload scanned document
Text is extracted from images
Searchable like any document

Supported: English, Spanish, French, German, Chinese

Document Versioning

Keep track of document versions:

Upload new version with same name
System asks: "Replace or keep both?"
Choose to version or replace
Access version history in document details

Batch Metadata Editing

Edit multiple documents at once:

Select documents (checkbox)
Click "Bulk Edit"
Add tags, change collection, etc.
Apply to all selected

Related Topics

Asking Questions - Search your uploaded documents
Collections Management - Organize your documents
Document Processing Architecture - Technical details
API Reference - API documentation
Integrations - External connectors

Next: Once your documents are uploaded, learn how to ask effective questions to get the best answers.
