Document Processing & Intelligent Chunking
Last Updated: March 2026
When you upload documents to Scrapalot, the system intelligently breaks them down into meaningful pieces using 15 chunking strategies that preserve context and relationships. This ensures accurate search results and comprehensive answers to your questions.
Why Document Processing Matters
Large documents need to be broken into smaller, searchable pieces. How this is done dramatically impacts search quality:
Poor Chunking Example
Imagine a book about psychology split mid-sentence:
Chunk 1 ends: "Jung's concept of the collective unconscious differs from Freud's personal..." Chunk 2 starts: "...unconscious. The collective unconscious contains archetypes—universal patterns."
Problem: Split content loses meaning, making it hard to find complete information.
Smart Chunking Example
The same content intelligently split:
Chunk 1: "Jung's concept of the collective unconscious differs from Freud's personal unconscious. The collective unconscious contains archetypes—universal patterns like the anima, animus, and shadow."
Result: Complete, meaningful information that's easy to search and understand.
How Scrapalot Processes Your Documents
Processing Approaches
Scrapalot uses different processing strategies based on your content type. The system automatically selects the best approach.
For Scientific and Academic Content
Best for: Research papers, philosophical texts, scientific books
How it works:
- Detects when topics shift
- Keeps related concepts together
- Preserves argument flow
- Maintains mathematical formulas with explanations
Example: A quantum physics paper discussing wave-particle duality
- Keeps mathematical equations with their conceptual explanations
- Preserves the progression from theory to experiment to conclusion
- Maintains connections between related concepts
Benefits:
- Complete ideas aren't split apart
- Context is preserved
- Related concepts stay together
- Accurate answers to complex questions
For Specialized Domains
Best for: Technical documentation, domain-specific knowledge, specialized fields
How it works:
- Recognizes important terminology
- Understands field-specific concepts
- Preserves definitions with their context
- Tracks concept relationships
Example: Psychology text about Jung's archetypes
- Keeps "collective unconscious" with its full definition
- Preserves relationships: "archetypes include the anima, animus, and shadow"
- Maintains connections between related terms
Benefits:
- Technical terms kept with explanations
- Concept relationships preserved
- Domain knowledge accessible
- Precise search results
For Factual Content
Best for: Encyclopedias, reference material, factual databases
How it works:
- Extracts individual facts
- Creates self-contained statements
- Preserves who, what, when, where details
- Enables precise fact retrieval
Example: Historical information
- "Einstein completed the general theory of relativity in 1915"
- "The theory explains gravity as spacetime curvature"
- Each fact stands alone and is precisely searchable
Benefits:
- Exact fact retrieval
- No unnecessary context
- Quick answers to specific questions
- High precision results
For Structured Documents
Best for: Manuals, documentation with headings, organized content
How it works:
- Recognizes document structure
- Preserves hierarchy (chapters, sections, subsections)
- Maintains headings with their content
- Uses structure for better search
Example: User manual with table of contents
- Chapter 3 > Section 3.2 > Subsection 3.2.1
- Heading "Troubleshooting Authentication" stays with its instructions
- Search can filter by section
Benefits:
- Preserves document organization
- Easy to navigate results
- Structured search (e.g., "in chapter 3")
- Clear content hierarchy
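To make the idea of structure-aware chunking concrete, here is a minimal sketch. It assumes the document uses Markdown-style `#` headings, and the function name `chunk_by_headings` and the output shape are illustrative only; this is not Scrapalot's actual implementation.

```python
import re

# Matches Markdown-style headings such as "## Section 3.2".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(text):
    """Split text at headings; each chunk records its full heading path."""
    chunks = []
    path = []   # current heading stack, e.g. ["Chapter 3", "Section 3.2"]
    body = []   # lines collected under the current heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]   # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Because every chunk carries its heading path, a search layer could filter on a prefix such as "Chapter 3" to answer queries scoped to one part of the manual.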
For Long-Form Narratives
Best for: Books, biographies, historical accounts
How it works:
- Maintains narrative flow
- Preserves temporal sequences
- Keeps character development intact
- Respects story progression
Example: Biography of a scientist
- Chronological events stay in order
- Development of ideas tracked over time
- Cause-and-effect relationships preserved
Benefits:
- Story flow maintained
- Timeline preserved
- Context from earlier sections available
- Comprehensive understanding
Key Features
Concept Preservation
What it does: Keeps related ideas together
Example: When processing a text about consciousness:
- "Phenomenal experience has subjective qualities" stays with
- "These qualities, called qualia, are the 'what it's like' aspect"
- Both pieces together provide complete understanding
Why it matters: You get complete explanations, not fragments
Relationship Mapping
What it does: Understands how concepts connect
Example: In a systems theory document:
- Maps "emergence" → "causes" → "complex behavior"
- Links "feedback loops" → "enable" → "emergence"
- Connects "systems" → "exhibit" → "both properties"
Why it matters: Answers questions about how things relate, not just what they are
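One common way to represent mappings like these is as subject, relation, object triples in a small graph. The sketch below is an assumption about how such a structure might look; the class name `ConceptGraph` and its methods are hypothetical and not part of Scrapalot's API.

```python
from collections import defaultdict, deque


class ConceptGraph:
    """Stores (subject, relation, object) triples and finds chains between concepts."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self._edges[subject].append((relation, obj))

    def path(self, start, goal):
        """Breadth-first search; returns the list of triples linking start to goal, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, steps = queue.popleft()
            if node == goal:
                return steps
            for relation, nxt in self._edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [(node, relation, nxt)]))
        return None


graph = ConceptGraph()
graph.add("feedback loops", "enable", "emergence")
graph.add("emergence", "causes", "complex behavior")
chain = graph.path("feedback loops", "complex behavior")
```

Following the chain from "feedback loops" to "complex behavior" is what lets a question about how two concepts relate be answered even when no single sentence states the link directly.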
Context Enrichment
What it does: Adds surrounding information to each piece
Example: A chunk about "quantum decoherence" includes:
- The concept itself
- How it relates to the broader measurement problem
- Its position in quantum mechanics theory
Why it matters: Each search result provides enough context to understand the answer
Smart Boundaries
What it does: Splits content at natural transitions, not arbitrary lengths
Example: Recognizing topic shifts:
- "...thus proving the theorem." [Natural end point]
- "In the next chapter, we explore consciousness." [Natural start point]
Why it matters: Chunks contain complete thoughts, not mid-sentence cuts
Multimodal Element Extraction
What it does: Pulls images, tables, and equations out of the PDF as first-class queryable items, not just text fragments
Example: A scientific paper with:
- A bar chart on page 3 → extracted, captioned, described, and indexed so a question like "what does the chart on page 3 show?" surfaces it
- A results table on page 8 → parsed into rows / columns with statistical pre-analysis (min, max, mean per numeric column)
- An equation E = mc² → stored with a symbol map (E → energy, m → mass, c → speed of light) so a query for "speed of light" hits the same chunk as a query for "c"
Why it matters: Visual content is no longer invisible to the AI — it answers questions about figures and tables, not only the surrounding prose
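The table pre-analysis and the equation symbol map can be illustrated with a short sketch. It assumes a table arrives as a list of row dictionaries and a symbol map as a plain dictionary; the function names `column_stats` and `matching_symbols` are hypothetical.

```python
from statistics import mean


def column_stats(rows):
    """Return min, max and mean for every column whose values are all numeric."""
    stats = {}
    if not rows:
        return stats
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if numeric:
            stats[column] = {"min": min(values), "max": max(values), "mean": mean(values)}
    return stats


def matching_symbols(query, symbol_map):
    """Return symbols matched by the query, either by symbol or by meaning."""
    wanted = query.strip()
    return sorted(
        symbol
        for symbol, meaning in symbol_map.items()
        if wanted == symbol or wanted.lower() == meaning.lower()
    )


rows = [
    {"group": "A", "score": 2},
    {"group": "B", "score": 4},
    {"group": "C", "score": 6},
]
symbol_map = {"E": "energy", "m": "mass", "c": "speed of light"}
```

Symbols are compared case-sensitively on purpose, since upper and lower case letters often name different quantities in an equation, while meanings are compared case-insensitively.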
Citation Highlighting
What it does: When you click a citation in the chat, the PDF viewer scrolls to the cited page and draws a transient pulsing highlight over the exact passage
Example:
- AI answers: "Sun Tzu defines five constant factors [1]"
- You click [1] → PDF opens at page 30, yellow border pulses around the cited paragraph for 3 seconds
Why it matters: You can verify any AI claim in one click without skimming the page hunting for the source
Content Type Matching
The system automatically chooses the best processing approach based on your content:
Academic and Scientific
Content includes: Research papers, scientific books, theoretical texts
Processing approach: Semantic understanding
- Preserves complex arguments
- Keeps equations with explanations
- Maintains research flow from hypothesis to conclusion
Result: Deep, comprehensive answers to complex questions
Technical Documentation
Content includes: API docs, code examples, configuration guides
Processing approach: Structure and syntax awareness
- Preserves code blocks intact
- Keeps configuration examples complete
- Maintains technical accuracy
Result: Exact technical answers with proper syntax
Reference Material
Content includes: Encyclopedias, dictionaries, fact databases
Processing approach: Atomic fact extraction
- Individual facts become searchable units
- Precise retrieval of specific information
- Minimal extraneous context
Result: Quick, accurate fact lookup
Specialized Knowledge
Content includes: Medical texts, legal documents, philosophical works
Processing approach: Domain-aware processing
- Recognizes field-specific terminology
- Preserves specialized concept relationships
- Maintains domain-specific structure
Result: Accurate domain knowledge retrieval
Quality Features
Overlap for Continuity
What it does: Includes a small portion of adjacent chunks
Why it helps:
- Ensures no information is lost at boundaries
- Provides context when a concept spans the boundary
- Improves search accuracy
Example: If a concept is explained across a natural boundary, the overlap ensures both chunks have enough context
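Overlap is usually implemented as a sliding window. The sketch below groups sentences into chunks and repeats the tail of each chunk at the start of the next; the function name and the `size` and `overlap` parameters are illustrative assumptions, not Scrapalot settings.

```python
def chunk_with_overlap(sentences, size, overlap):
    """Group sentences into chunks of `size`, repeating the last `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break   # the last window already reached the end of the text
    return chunks


windows = chunk_with_overlap(["s1", "s2", "s3", "s4", "s5"], size=3, overlap=1)
```

Here sentence "s3" appears in both chunks, so a concept explained across the boundary is still complete in at least one of them.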
Size Optimization
What it does: Balances chunk size for best results
Considerations:
- Large enough to preserve meaning
- Small enough for precise retrieval
- Optimized for your content type
Result: Accurate search without information overload
Quality Validation
What it does: Checks processed content for completeness
Ensures:
- No orphaned fragments
- Complete sentences
- Sufficient context
- Meaningful units
Result: Reliable, usable search results
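Checks of this kind can be approximated with a few simple heuristics. The sketch below is an assumption about what such validation might look like; the `min_words` threshold and the problem labels are made up for illustration.

```python
TERMINATORS = (".", "!", "?")
ELLIPSES = ("...", "\u2026")


def validate_chunk(text, min_words=5):
    """Return a list of problems found in a chunk; an empty list means it passed."""
    problems = []
    stripped = text.strip()
    if len(stripped.split()) < min_words:
        problems.append("orphaned fragment")
    if stripped.startswith(ELLIPSES) or stripped[:1].islower():
        problems.append("starts mid-sentence")
    closing = stripped.rstrip("\"')")   # ignore trailing quotes and brackets
    if closing.endswith(ELLIPSES) or not closing.endswith(TERMINATORS):
        problems.append("ends mid-sentence")
    return problems
```

Applied to the poorly split psychology example from earlier, the first chunk is flagged as ending mid-sentence and the second as starting mid-sentence, while the intelligently split chunk passes.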
What You Experience
Upload
- Drop your document into Scrapalot
- System analyzes content type
- Processing happens automatically
- Progress indicator shows status
Behind the Scenes
While processing:
- Content is intelligently analyzed
- Optimal processing approach is selected
- Chunks are created with smart boundaries
- Concepts and relationships are mapped
- Everything is made searchable
Ready to Use
After processing:
- Ask questions naturally
- Get accurate, complete answers
- Results cite specific sources
- Related concepts are connected
Performance Benefits
Better Search Results
- Relevance: Find exactly what you're looking for
- Completeness: Answers include full context
- Accuracy: No fragmented or incomplete information
- Speed: Optimized chunk sizes for fast retrieval
Understanding Relationships
- Connections: Discover how concepts relate
- Dependencies: Understand what depends on what
- Hierarchies: Navigate from general to specific
- Networks: Explore interconnected knowledge
Comprehensive Knowledge
- Multi-source: Synthesize information from multiple sections
- Contextual: Understand information in proper context
- Progressive: Build understanding from foundational to advanced
- Integrated: See how different parts connect
Use Case Examples
Research Paper Analysis
Document: 50-page quantum physics paper
Processing:
- Preserves mathematical proofs intact
- Keeps methodology with results
- Maintains theoretical framework
- Links related equations
Result: Ask "How does the paper address the measurement problem?" → Get complete, sourced explanation
Technical Manual
Document: 200-page API documentation
Processing:
- Preserves code examples
- Keeps endpoints with their descriptions
- Maintains parameter tables
- Respects document hierarchy
Result: Ask "How do I authenticate API requests?" → Get exact code example with context
Philosophical Text
Document: Book on consciousness and phenomenology
Processing:
- Keeps arguments with their justifications
- Preserves terminology with definitions
- Maintains concept relationships
- Respects philosophical progression
Result: Ask "What is Heidegger's concept of Dasein?" → Get comprehensive explanation with proper context
Historical Biography
Document: Life story of a scientist
Processing:
- Preserves chronological flow
- Maintains cause-effect relationships
- Keeps discoveries with their context
- Respects narrative structure
Result: Ask "What led to Einstein's development of general relativity?" → Get chronological account with context
Tips for Best Results
Document Preparation
Good: Clean, well-formatted PDFs or text files
Why: Easier to process accurately
Less optimal: Scanned images, poorly formatted files
Why: May require extra processing or OCR
Content Organization
Good: Documents with clear structure (headings, paragraphs)
Why: Structure helps the system understand content organization
Less optimal: Wall of text without breaks
Why: Harder to identify natural boundaries
File Formats
Best supported:
- PDF (text-based)
- Text files
- Markdown
- Word documents
- EPUB
Result: Accurate processing with preserved formatting
What Makes This Effective
Intelligent vs. Simple Splitting
Simple approach:
- Split every 500 words
- Results in mid-sentence cuts
- Loses context
- Fragments concepts
Scrapalot's approach:
- Understand content meaning
- Split at natural boundaries
- Preserve context
- Keep concepts together
Difference: Complete, accurate answers vs. fragmented, incomplete information
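The contrast can be seen in a few lines of code. The sketch below compares a naive fixed-size splitter with one that only cuts at sentence boundaries. It is a deliberately simplified stand-in: it uses punctuation rather than the meaning-based boundary detection described above, and both function names are hypothetical.

```python
import re


def split_fixed(text, words_per_chunk):
    """Naive splitter: cut every N words, regardless of sentence boundaries."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def split_on_sentences(text, max_words):
    """Pack whole sentences into chunks of at most max_words.

    A single sentence longer than max_words still gets its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Jung's concept of the collective unconscious differs from Freud's "
    "personal unconscious. The collective unconscious contains archetypes."
)
naive = split_fixed(text, 10)
aware = split_on_sentences(text, 12)
```

With a 10-word limit the naive splitter cuts the first chunk off at "personal", reproducing the poor chunking example, while the sentence-aware splitter returns two chunks that each end on a full stop.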
Automatic Optimization
You don't need to:
- Choose processing settings
- Specify document type
- Configure chunk sizes
- Manage technical details
The system:
- Analyzes your content
- Selects optimal approach
- Processes intelligently
- Delivers accurate results
Continuous Learning
The system improves over time:
- Adapts to your document types
- Learns domain-specific patterns
- Optimizes for your use cases
- Refines processing quality
This intelligent document processing ensures your knowledge base is searchable, accessible, and delivers accurate answers. Upload your documents and start asking questions—the system handles the complexity automatically.
Annotations and Reading Tools
Beyond the search-and-answer flow, every PDF in Scrapalot is also a workspace you can annotate by hand and share with collaborators.
Annotation toolbar
Select any text in the PDF viewer to bring up the on-selection toolbar with five one-click tools:
- Highlight — paints a coloured rectangle over the selection. Eight colours available.
- Underline — single blue underline (definition / reference).
- Strikethrough — single grey line through the text. When applied, it auto-replaces any prior highlight or underline on the same passage so the "discard" intent is clear.
- Note — same as highlight but with a comment field. Surfaces a small notepad pin in the page gutter you can click to read or edit the note.
- Area capture — draw a bounding box around a figure or table to capture it as a region.
The same row also exposes:
- AI tools dropdown — Cite this in my notes, Explain this passage, Find similar passages. All run against your active provider.
- Colour picker — opens a swatch grid plus a compact legend. Workspace admins can override the legend label per colour ("red = critical", "yellow = insight", "green = cite this") via Settings → Documents → Annotation colour labels; the label flows into every swatch tooltip across the workspace.
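The strikethrough rule in the toolbar list above (it replaces any prior highlight or underline on the same passage) can be modelled as below. This is only a sketch under stated assumptions: the `Annotation` fields are hypothetical, and "same passage" is taken to mean an identical page and character range, which may differ from how the viewer actually matches selections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    kind: str    # "highlight", "underline", "strikethrough" or "note"
    page: int
    start: int   # character offset where the selection begins
    end: int     # character offset where the selection ends


REPLACED_BY_STRIKETHROUGH = {"highlight", "underline"}


def same_passage(a, b):
    # Assumption: "same passage" means identical page and offsets.
    return (a.page, a.start, a.end) == (b.page, b.start, b.end)


def apply_annotation(existing, new):
    """Return the annotation list after adding `new`."""
    if new.kind == "strikethrough":
        existing = [
            a for a in existing
            if not (a.kind in REPLACED_BY_STRIKETHROUGH and same_passage(a, new))
        ]
    return [*existing, new]
```

Annotations on other passages are left untouched, so striking through one sentence never removes a highlight elsewhere on the page.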
Margin notes from chat
Every chat citation row carries two icons next to the citation text:
- 📓 Insert into Note — pastes the citation into your notes editor.
- 🔖 Save as margin note — creates a sticky-note annotation directly on the cited document, with the AI's answer as the note's comment. The next time you open that document, you see a gutter pin on the cited page and can read the AI's commentary inline.
Sharing annotations
Annotations are owned by the user who created them but can be shared with workspace members. Right-click an annotation → Share → pick a colleague from the workspace member list. Permissions:
- View — recipient sees the annotation in their copy of the document.
- Edit — recipient can also amend the comment and colour. The highlight position and selected text remain owner-only.
Real-time sync over WebSocket: when one user adds, edits, or deletes a shared annotation, every other user viewing the same PDF sees the change within ~1 second without refreshing.
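The View and Edit rules amount to a small permission check. The sketch below assumes hypothetical field names (`position`, `selected_text`, `comment`, `colour`) and a simple share table; it illustrates the rule, not Scrapalot's data model.

```python
from dataclasses import dataclass, field

OWNER_ONLY_FIELDS = {"position", "selected_text"}
EDITABLE_FIELDS = {"comment", "colour"}


@dataclass
class SharedAnnotation:
    owner: str
    shares: dict = field(default_factory=dict)   # user -> "view" or "edit"


def can_change(annotation, user, field_name):
    """Return True if `user` may modify `field_name` on this annotation."""
    if user == annotation.owner:
        return field_name in OWNER_ONLY_FIELDS | EDITABLE_FIELDS
    if annotation.shares.get(user) == "edit":
        return field_name in EDITABLE_FIELDS
    return False   # view-only recipients and everyone else cannot change anything


note = SharedAnnotation(owner="ana", shares={"ben": "edit", "cy": "view"})
```

In this example "ben" can amend the comment or colour but not move the highlight, and "cy" can only read the annotation.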
Visual entities side panel
A toolbar button in the PDF viewer (image icon) opens a side panel listing every image, table, and equation extracted from the document during ingest. A row of filter chips at the top narrows the list by type and shows a count for each type. Click a row to jump to that page.
Use this when scanning a long paper for "where are the figures?" without opening the PDF outline. Counts are zero on documents that were uploaded before multimodal extraction shipped — reprocess the document from the document menu to backfill.