Uploading Documents
Learn how to upload, process, and manage documents in Scrapalot for optimal AI-powered search and analysis.
Upload Methods
Supported File Formats
Documents
- PDF (.pdf) - Recommended, best support
- Word (.docx, .doc)
- Text (.txt)
- Markdown (.md)
- Rich Text (.rtf)
Books & Publications
- EPUB (.epub) - E-books
- MOBI (.mobi) - Kindle format (via conversion)
Data Files
- CSV (.csv) - Tabular data
- JSON (.json) - Structured data
- XML (.xml) - Structured documents
Presentations & Spreadsheets
- PowerPoint (.pptx, .ppt)
- Excel (.xlsx, .xls)
Code & Technical
- Jupyter Notebooks (.ipynb)
- HTML (.html, .htm)
File Size Limits
- Maximum file size: 100 MB per file
- Recommended size: Under 50 MB for optimal processing
- Large files: May take 2-5 minutes to process
Document Processing Pipeline
Processing Stages Explained
Stage 1: File Validation (0-10%)
What happens:
- File type verification
- Size check
- Corruption detection
- Virus scanning (if enabled)
Common errors:
- "Unsupported file type"
- "File too large"
- "Corrupted file"
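The type and size checks in this stage can be approximated locally before uploading. The sketch below is illustrative only: the function name `precheck_upload` is hypothetical, the extension set is copied from the Supported File Formats list, and treating 100 MB as binary megabytes is an assumption. Corruption detection and virus scanning are not covered.

```python
from pathlib import PurePath

# Extensions copied from the "Supported File Formats" list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".md", ".rtf",
    ".epub", ".mobi",
    ".csv", ".json", ".xml",
    ".pptx", ".ppt", ".xlsx", ".xls",
    ".ipynb", ".html", ".htm",
}
MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB limit; binary megabytes are an assumption


def precheck_upload(filename: str, size_bytes: int) -> list[str]:
    """Return the validation errors a file would likely trigger (empty list = OK)."""
    errors = []
    # Compare the extension case-insensitively, so "Report.PDF" passes.
    if PurePath(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        errors.append("Unsupported file type")
    if size_bytes > MAX_FILE_BYTES:
        errors.append("File too large")
    return errors
```

For a file on disk, the size can be read with `pathlib.Path(path).stat().st_size` and passed as `size_bytes`.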
Stage 2: Text Extraction (10-40%)
What happens:
- Extract text from document
- Preserve formatting and structure
- Handle images (OCR if enabled)
- Extract metadata
Processing time:
- Small documents (<5MB): 5-15 seconds
- Medium documents (5-20MB): 15-45 seconds
- Large documents (20-100MB): 45-120 seconds
Stage 3: Chunking (40-60%)
What happens:
- Split document into smaller chunks
- Apply selected chunking strategy
- Maintain context and coherence
- Create hierarchical structure
Chunking strategies:
- Contextual Retrieval (default) - Best for most documents
- Late Chunking - Better for semantic relationships
- Hierarchical - Multi-level document structure
- Semantic - Meaning-based splitting
- Agentic - AI-powered intelligent chunking
See: Document Processing Architecture
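To make the idea of chunking concrete, here is a deliberately naive splitter that cuts text into fixed-size word windows with some overlap, so that neighbouring chunks share context. It is not one of the strategies listed above and not Scrapalot's implementation; the function name and default sizes are assumptions chosen for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `chunk_size` words, repeating `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The listed strategies differ mainly in where they place chunk boundaries and what context they attach to each chunk, rather than in this basic windowing idea.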
Stage 4: Embedding Generation (60-90%)
What happens:
- Convert chunks to vector embeddings
- Use selected embedding model
- Store in vector database
Embedding models:
- text-embedding-3-large (OpenAI) - High quality, 3072 dimensions
- text-embedding-3-small (OpenAI) - Faster, 1536 dimensions
- all-MiniLM-L6-v2 (Local) - Free, runs locally
- nomic-embed-text (Local via Ollama) - Free, high quality
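The dimension counts above translate directly into storage cost. As a rough sketch, assuming each vector value is stored as a 4-byte float32 (an assumption; the actual storage format is not stated here) and ignoring index overhead:

```python
# Dimensions come from the model list above; 4 bytes per value assumes float32 storage.
EMBEDDING_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(model: str, chunk_count: int) -> int:
    """Raw size of the embeddings for `chunk_count` chunks, excluding index overhead."""
    return EMBEDDING_DIMS[model] * BYTES_PER_FLOAT32 * chunk_count


for model in EMBEDDING_DIMS:
    size_mib = raw_vector_bytes(model, 10_000) / (1024 * 1024)
    print(f"{model}: {size_mib:.1f} MiB for 10,000 chunks")
```

Under these assumptions, 10,000 chunks need about 117 MiB with the large model and about 59 MiB with the small one.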
Stage 5: Database Storage (90-100%)
What happens:
- Store document metadata
- Index vector embeddings
- Create search indexes
- Enable document for querying
Result: Document is ready for search!
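When interpreting a progress value, the percentage ranges in the stage headings can be turned into a small lookup. This is a sketch for illustration: the function name is hypothetical, and treating a shared boundary such as 10% as belonging to the next stage is an assumption.

```python
# (upper bound of the progress range, stage name), taken from the stage headings.
STAGES = [
    (10, "File Validation"),
    (40, "Text Extraction"),
    (60, "Chunking"),
    (90, "Embedding Generation"),
    (100, "Database Storage"),
]


def stage_for_progress(percent: float) -> str:
    """Map a progress percentage (0-100) to the pipeline stage it falls in."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for upper, name in STAGES:
        # Assumption: a shared boundary such as 10% counts as the next stage.
        if percent < upper:
            return name
    return STAGES[-1][1]  # exactly 100%
```

For example, a document that stalls at 25% is in Text Extraction, and one that stalls at 85% is in Embedding Generation.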
Document States
How to Upload Documents
Web Interface Upload
Navigate to your collection
- Open the collection where you want to add documents
- Or create a new collection first
Click the Upload button
- Located in the top-right corner
- Or drag files directly into the collection area
Select your file(s)
- Browse to your file
- Or drag and drop files
Monitor progress
- Watch the progress bar
- See real-time status updates
- Processing typically takes 10-60 seconds
Document ready
- You'll see a notification when complete
- Document appears in your collection
- Ready to search immediately
Bulk Upload
For multiple documents:
- Click "Bulk Upload"
- Select folder or multiple files
- Files are queued for processing
- Monitor batch progress
- All documents processed in parallel
Bulk Upload Tips
- Process up to 50 files simultaneously
- Total batch size limit: 1GB
- Processing time: ~1-2 minutes per 10 documents
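If you have more files than one batch allows, they can be grouped ahead of time. The following is a minimal greedy sketch that respects both limits above; the function name is hypothetical and reading 1GB as 1024³ bytes is an assumption.

```python
MAX_FILES_PER_BATCH = 50
MAX_BATCH_BYTES = 1024 ** 3  # 1 GB batch limit; binary gigabytes are an assumption


def plan_batches(file_sizes: list[int]) -> list[list[int]]:
    """Greedily group file sizes (in bytes) into batches that respect both limits."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for size in file_sizes:
        if size > MAX_BATCH_BYTES:
            raise ValueError("a single file exceeds the batch size limit")
        too_many = len(current) >= MAX_FILES_PER_BATCH
        too_big = current_bytes + size > MAX_BATCH_BYTES
        if current and (too_many or too_big):
            # Close the current batch and start a new one with this file.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

The grouping keeps files in their original order, so it does not necessarily produce the smallest possible number of batches.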
API Upload
For programmatic uploads:
curl -X POST http://localhost:8090/api/v1/documents/upload_stream \
-H "Authorization: Bearer <access_token>" \
-F "file=@document.pdf" \
-F "collection_id=<collection_uuid>" \
--no-buffer
See: API Reference for full details
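The same request can be built from Python. The sketch below uses only the standard library and mirrors the endpoint, header, and form fields of the curl command; the function name and the `application/octet-stream` content type are assumptions, and the request is built but not sent.

```python
import urllib.request
import uuid

# Endpoint and field names mirror the curl command; everything else is illustrative.
UPLOAD_URL = "http://localhost:8090/api/v1/documents/upload_stream"


def build_upload_request(
    filename: str, content: bytes, collection_id: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST equivalent to the curl call."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="collection_id"\r\n\r\n'
        f"{collection_id}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    body = head + content + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        UPLOAD_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Sending is left to the caller. Reading the response line by line is an
# assumption about the streamed format, which is not described in this guide:
#     with urllib.request.urlopen(request) as response:
#         for line in response:
#             print(line.decode().rstrip())
```

To upload a file from disk, pass `pathlib.Path("document.pdf").read_bytes()` as `content`. File names containing double quotes would need escaping, which this sketch does not handle.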
External Connectors
Auto-sync documents from cloud storage and academic databases:
Supported connectors:
- Google Drive
- Dropbox
- OneDrive
- Academic databases (PubMed, arXiv, JSTOR)
Setup:
- Go to Settings → Integrations
- Click "Connect" for your service
- Authorize Scrapalot
- Select folders to sync
- Documents auto-sync every hour
See: Integrations Guide
Best Practices
Organizing Documents
Create topic-based collections:
Research Papers/
├── AI Safety/
├── Climate Science/
└── Medical Research/
Work Documents/
├── Q1 2025 Reports/
├── Product Specs/
└── Meeting Notes/
Benefits:
- Faster, more accurate searches
- Better context understanding
- Easier management
- Improved team collaboration
File Naming
Good naming:
- ✅ 2024_Climate_Report_IPCC.pdf
- ✅ Product_Spec_v2.3_Final.docx
- ✅ Research_AI_Safety_Smith_2025.pdf
Avoid:
- ❌ document.pdf
- ❌ final_FINAL_v3_REAL.docx
- ❌ untitled.txt
Why: Good names help with:
- Finding documents later
- Understanding content at a glance
- Better search results
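A quick script can flag weak names before a large upload. The rules below are hypothetical rules of thumb derived from the good and bad examples above, not checks that Scrapalot performs.

```python
from pathlib import PurePath

# Hypothetical rules of thumb derived from the naming examples; adjust to taste.
GENERIC_STEMS = {"document", "untitled", "file", "scan", "new"}


def naming_issues(filename: str) -> list[str]:
    """Return reasons a file name may be hard to find or search for later."""
    stem = PurePath(filename).stem.lower()
    issues = []
    if stem in GENERIC_STEMS:
        issues.append("generic name: add topic, source, or date")
    if stem.count("final") > 1:
        issues.append("repeated 'final': use a version number instead")
    return issues
```

With these rules, the three recommended names produce no issues, while `document.pdf`, `untitled.txt`, and `final_FINAL_v3_REAL.docx` are each flagged once.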
Optimize for Processing
Before uploading:
Remove password protection
- Encrypted PDFs cannot be processed
- Remove passwords first
Ensure text is selectable
- Scanned PDFs need OCR
- Prefer native PDFs over scans
Check file integrity
- Verify file opens correctly
- No corruption
Consider file size
- Compress large PDFs if possible
- Split very large documents (>100MB)
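Some of these checks can be scripted for PDFs. The sketch below is a byte-level heuristic, not a real PDF parser: it looks for the `%PDF-` header near the start of the file and for an `/Encrypt` entry, which encrypted PDFs carry in their trailer. The function name is hypothetical, and both false positives and false negatives are possible.

```python
def pdf_precheck(data: bytes) -> list[str]:
    """Heuristic warnings for a PDF's raw bytes; not a substitute for a PDF parser."""
    warnings = []
    # Many readers accept the header anywhere in the first 1024 bytes.
    if b"%PDF-" not in data[:1024]:
        warnings.append("missing %PDF- header: file may be corrupted or not a PDF")
    # Encrypted PDFs reference an encryption dictionary via an /Encrypt entry.
    if b"/Encrypt" in data:
        warnings.append("encryption entry found: remove password protection first")
    return warnings
```

Pass the file contents with `pathlib.Path("document.pdf").read_bytes()`. Opening the file in a PDF viewer remains the more reliable check for integrity and selectable text.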
Choosing Chunking Strategy
For most documents: Use Contextual Retrieval (default)
For academic papers: Use Late Chunking or Hierarchical
For code/technical docs: Use Semantic chunking
For mixed content: Use Agentic chunking (AI-powered)
Configure in: Settings → Documents → Chunking Strategy
Troubleshooting
Upload Fails
Error: "File too large"
- Maximum size is 100MB
- Compress PDF or split file
- Use external connector for very large files
Error: "Unsupported file type"
- Check supported formats above
- Convert to PDF if possible
- Contact support for format requests
Error: "Upload interrupted"
- Check internet connection
- Try again
- Use smaller batches
Processing Stuck
Stuck at "Extracting" (25%)
- Document may have complex formatting
- Wait up to 5 minutes for large files
- Cancel and retry if no progress
Stuck at "Embedding" (85%)
- Embedding model may be slow
- Check Settings → AI Providers
- Switch to faster embedding model
Failed with "Processing error"
- Check backend logs
- File may be corrupted
- Try re-uploading
Document Not Searchable
Possible causes:
- Document still processing (check status)
- Processing failed (check for errors)
- Wrong collection selected
- Document contains images only (needs OCR)
Solutions:
- Wait for processing to complete
- Re-upload if failed
- Verify collection selection
- Enable OCR in Settings
Poor Search Results
If answers don't reference your document:
Check similarity threshold
- Lower to 0.6-0.7 for broader matches
Verify chunking strategy
- Try Contextual Retrieval or Late Chunking
Check question phrasing
- Use terms from your document
Inspect document chunks
- View processed chunks in document details
- Verify text extracted correctly
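To see why lowering the similarity threshold broadens matches, consider this toy example. It assumes the score is cosine similarity between the query embedding and each chunk embedding, which is a common choice but is not confirmed by this guide; the function names and two-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filter_by_threshold(query, chunks, threshold=0.7):
    """Keep (chunk_id, score) pairs whose similarity to the query meets the threshold."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in chunks.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
chunks = {"a": [1.0, 0.0], "b": [2.0, 2.2], "c": [0.0, 1.0]}
query = [1.0, 0.0]
print([cid for cid, _ in filter_by_threshold(query, chunks, 0.7)])
print([cid for cid, _ in filter_by_threshold(query, chunks, 0.6)])
```

Chunk `b` scores roughly 0.67, so it is dropped at a threshold of 0.7 and kept at 0.6. A lower threshold admits more loosely related chunks, at the cost of more noise.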
Managing Documents
View Document Details
Click on any document to see:
- Processing status
- Metadata (title, author, date)
- Chunk count and strategy
- Storage size
- Upload date
- Last accessed
Edit Document Metadata
- Open document details
- Click "Edit"
- Update title, description, tags
- Save changes
Reprocess Document
If processing failed or used wrong settings:
- Open document
- Click "Reprocess"
- Select new chunking strategy (optional)
- Select new embedding model (optional)
- Confirm
Delete Document
- Open document
- Click "Delete"
- Confirm deletion
- Document and all chunks removed
Deletion is Permanent
Deleted documents cannot be recovered. Download a copy first if needed.
Storage & Quotas
Storage Limits (Desktop App)
- No hard limit (limited by disk space)
- Recommended: Reserve 10GB for moderate use
- 1GB ≈ 500-1000 typical PDFs
Storage Limits (Cloud Plans)
| Plan | Storage | Documents |
|---|---|---|
| Researcher (Free) | 10 GB | ~5,000 docs |
| Professional | 100 GB | ~50,000 docs |
| Enterprise | Custom | Unlimited |
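The document counts in the table imply an average document size, which you can use to estimate how far your own quota will go. This is back-of-the-envelope arithmetic assuming decimal units (1 GB = 1000 MB); the helper name is hypothetical, and whether quota usage also counts chunks and embeddings is not stated here.

```python
# Plan figures come from the table above: (storage in GB, approximate documents).
PLANS = {
    "Researcher (Free)": (10, 5_000),
    "Professional": (100, 50_000),
}

for plan, (storage_gb, approx_docs) in PLANS.items():
    avg_mb = storage_gb * 1000 / approx_docs  # decimal units: 1 GB = 1000 MB (assumption)
    print(f"{plan}: about {avg_mb:.0f} MB per document")


def docs_that_fit(free_gb: float, avg_doc_mb: float = 2.0) -> int:
    """Rough count of documents of `avg_doc_mb` that fit in `free_gb` of quota."""
    return int(free_gb * 1000 / avg_doc_mb)
```

Both plans work out to about 2 MB per document. With documents averaging 1 to 2 MB, `docs_that_fit` gives 500 to 1000 documents per GB, matching the desktop estimate above.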
Check Your Usage
Settings → Account → Storage
- Current usage
- Remaining quota
- Breakdown by collection
Advanced Features
OCR (Optical Character Recognition)
For scanned PDFs and images:
- Enable in Settings → Documents → OCR
- Upload scanned document
- Text is extracted from images
- Searchable like any document
Supported languages: English, Spanish, French, German, Chinese
Document Versioning
Keep track of document versions:
- Upload new version with same name
- System asks: "Replace or keep both?"
- Choose to version or replace
- Access version history in document details
Batch Metadata Editing
Edit multiple documents at once:
- Select documents (checkbox)
- Click "Bulk Edit"
- Add tags, change collection, etc.
- Apply to all selected
Related Topics
- Asking Questions - Search your uploaded documents
- Collections Management - Organize your documents
- Document Processing Architecture - Technical details
- API Reference - API documentation
- Integrations - External connectors
Next: Once your documents are uploaded, learn how to ask effective questions to get the best answers.