Document Processing & Intelligent Chunking

Last Updated: March 2026

When you upload documents to Scrapalot, the system intelligently breaks them down into meaningful pieces using 15 chunking strategies that preserve context and relationships. This ensures accurate search results and comprehensive answers to your questions.

Why Document Processing Matters

Large documents need to be broken into smaller, searchable pieces. How this is done dramatically impacts search quality:

Poor Chunking Example

Imagine a book about psychology split mid-sentence:

Chunk 1 ends: "Jung's concept of the collective unconscious differs from Freud's personal..."
Chunk 2 starts: "...unconscious. The collective unconscious contains archetypes—universal patterns."

Problem: Split content loses meaning, making it hard to find complete information.

Smart Chunking Example

The same content intelligently split:

Chunk 1: "Jung's concept of the collective unconscious differs from Freud's personal unconscious. The collective unconscious contains archetypes—universal patterns like the anima, animus, and shadow."

Result: Complete, meaningful information that's easy to search and understand.
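
The contrast between the two examples can be sketched in a few lines of Python. This is only an illustration, not Scrapalot's implementation: the function names, the regular-expression sentence splitter, and the size limits are all assumptions chosen for the example.

```python
import re

def split_fixed(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_sentences(text: str) -> list[str]:
    """Rough heuristic: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(text: str, max_chars: int) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk first.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "Jung's concept of the collective unconscious differs from Freud's personal unconscious. "
    "The collective unconscious contains archetypes. "
    "Archetypes are universal patterns."
)
fixed_chunks = split_fixed(text, 80)      # first chunk stops in the middle of a word
smart_chunks = pack_sentences(text, 150)  # every chunk ends at a sentence boundary
```

With the fixed-size splitter the first chunk is cut inside the word "unconscious", while the sentence-aware version keeps the first two sentences together and starts a new chunk only at a sentence boundary.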

How Scrapalot Processes Your Documents

Processing Approaches

Scrapalot uses different processing strategies based on your content type. The system automatically selects the best approach.

For Scientific and Academic Content

Best for: Research papers, philosophical texts, scientific books

How it works:

Detects when topics shift
Keeps related concepts together
Preserves argument flow
Maintains mathematical formulas with explanations

Example: A quantum physics paper discussing wave-particle duality

Keeps mathematical equations with their conceptual explanations
Preserves the progression from theory to experiment to conclusion
Maintains connections between related concepts

Benefits:

Complete ideas aren't split apart
Context is preserved
Related concepts stay together
Accurate answers to complex questions

For Specialized Domains

Best for: Technical documentation, domain-specific knowledge, specialized fields

How it works:

Recognizes important terminology
Understands field-specific concepts
Preserves definitions with their context
Tracks concept relationships

Example: Psychology text about Jung's archetypes

Keeps "collective unconscious" with its full definition
Preserves relationships: "archetypes include the anima, animus, and shadow"
Maintains connections between related terms

Benefits:

Technical terms kept with explanations
Concept relationships preserved
Domain knowledge accessible
Precise search results

For Factual Content

Best for: Encyclopedias, reference material, factual databases

How it works:

Extracts individual facts
Creates self-contained statements
Preserves who, what, when, where details
Enables precise fact retrieval

Example: Historical information

"Einstein completed the general theory of relativity in 1915"
"General relativity explains gravity as spacetime curvature"
Each fact stands alone and is precisely searchable

Benefits:

Exact fact retrieval
No unnecessary context
Quick answers to specific questions
High precision results

For Structured Documents

Best for: Manuals, documentation with headings, organized content

How it works:

Recognizes document structure
Preserves hierarchy (chapters, sections, subsections)
Maintains headings with their content
Uses structure for better search

Example: User manual with table of contents

Chapter 3 > Section 3.2 > Subsection 3.2.1
Heading "Troubleshooting Authentication" stays with its instructions
Search can filter by section

Benefits:

Preserves document organization
Easy to navigate results
Structured search (e.g., "in chapter 3")
Clear content hierarchy
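
As a rough sketch of how a heading hierarchy can travel with each chunk, the example below assumes Markdown-style `#` headings and a hypothetical `SectionChunk` type. Neither is confirmed as Scrapalot's internal format, and the body text in the sample manual is invented for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class SectionChunk:
    path: list[str]  # heading breadcrumb, outermost heading first
    text: str        # body text that belongs to the innermost heading

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(markdown: str) -> list[SectionChunk]:
    chunks: list[SectionChunk] = []
    stack: list[tuple[int, str]] = []  # (level, title) of the headings currently open
    body: list[str] = []

    def flush() -> None:
        # Emit the collected body under the breadcrumb that was active for it.
        text = "\n".join(body).strip()
        if text:
            chunks.append(SectionChunk([title for _, title in stack], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks

manual = """# Chapter 3
## Section 3.2
### Troubleshooting Authentication
Check that the API key is active.
## Section 3.3
Other content.
"""
chunks = chunk_by_headings(manual)
# A structured query such as "in chapter 3" becomes a filter on the breadcrumb.
in_chapter_3 = [c for c in chunks if c.path[:1] == ["Chapter 3"]]
```

Because every chunk stores its full breadcrumb, the heading "Troubleshooting Authentication" stays attached to its instructions, and section filters need no second pass over the document.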

For Long-Form Narratives

Best for: Books, biographies, historical accounts

How it works:

Maintains narrative flow
Preserves temporal sequences
Keeps character development intact
Respects story progression

Example: Biography of a scientist

Chronological events stay in order
Development of ideas tracked over time
Cause-and-effect relationships preserved

Benefits:

Story flow maintained
Timeline preserved
Context from earlier sections available
Comprehensive understanding

Key Features

Concept Preservation

What it does: Keeps related ideas together

Example: When processing a text about consciousness:

"Phenomenal experience has subjective qualities" stays with
"These qualities, called qualia, are the 'what it's like' aspect"
Both pieces together provide complete understanding

Why it matters: You get complete explanations, not fragments

Relationship Mapping

What it does: Understands how concepts connect

Example: In a systems theory document:

Maps "emergence" → "causes" → "complex behavior"
Links "feedback loops" → "enable" → "emergence"
Connects "systems" → "exhibit" → "both properties"

Why it matters: Answers questions about how things relate, not just what they are
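
One common way to represent such mappings is as subject, relation, object triples in a small directed graph. The sketch below uses the three relationships from the example; the helper names `build_graph` and `find_path` are hypothetical and do not describe Scrapalot's actual storage.

```python
from collections import defaultdict, deque
from typing import Optional

Triple = tuple[str, str, str]  # (subject, relation, object)

triples: list[Triple] = [
    ("emergence", "causes", "complex behavior"),
    ("feedback loops", "enable", "emergence"),
    ("systems", "exhibit", "both properties"),
]

def build_graph(items: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Index triples by subject so outgoing relations are cheap to look up."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subject, relation, obj in items:
        graph[subject].append((relation, obj))
    return graph

def find_path(graph: dict[str, list[tuple[str, str]]],
              start: str, goal: str) -> Optional[list[Triple]]:
    """Breadth-first search: return the chain of triples from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + [(node, relation, neighbour)]))
    return None

graph = build_graph(triples)
path = find_path(graph, "feedback loops", "complex behavior")
```

Here `path` chains the two stored relationships (feedback loops enable emergence, emergence causes complex behavior), which is the kind of "how does X relate to Y" answer that a flat list of chunks cannot give.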

Context Enrichment

What it does: Adds surrounding information to each piece

Example: A chunk about "quantum decoherence" includes:

The concept itself
How it relates to the broader measurement problem
Its position in quantum mechanics theory

Why it matters: Each search result provides enough context to understand the answer

Smart Boundaries

What it does: Splits content at natural transitions, not arbitrary lengths

Example: Recognizing topic shifts:

"...thus proving the theorem." [Natural end point]
"In the next chapter, we explore consciousness." [Natural start point]

Why it matters: Chunks contain complete thoughts, not mid-sentence cuts
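
To make "natural transition" concrete, here is a deliberately simple stand-in: it starts a new chunk whenever two adjacent sentences share too few words. Production systems typically compare embeddings rather than raw word overlap; the Jaccard measure, the threshold value, and the sample sentences are all assumptions made for this sketch.

```python
import re

def words(sentence: str) -> set[str]:
    """Lower-cased alphabetic tokens of a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by all distinct words; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def split_on_topic_shift(sentences: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Keep a sentence in the current group while it overlaps enough with the previous one."""
    groups: list[list[str]] = []
    for sentence in sentences:
        if groups and jaccard(words(groups[-1][-1]), words(sentence)) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return groups

sentences = [
    "The collective unconscious contains archetypes.",
    "Archetypes in the collective unconscious include the shadow.",
    "Quantum decoherence relates to measurement.",
]
groups = split_on_topic_shift(sentences)
```

The first two sentences share four of eight distinct words and stay together, while the third shares none and opens a new group. Common words such as "the" inflate the overlap, which is one reason a real system would use a semantic similarity measure instead.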

Multimodal Element Extraction

What it does: Pulls images, tables, and equations out of the PDF as first-class queryable items, not just text fragments

Example: A scientific paper with:

A bar chart on page 3 → extracted, captioned, described, and indexed so a question like "what does the chart on page 3 show?" surfaces it
A results table on page 8 → parsed into rows / columns with statistical pre-analysis (min, max, mean per numeric column)
An equation E = mc² → stored with a symbol map (E → energy, m → mass, c → speed of light) so a query for "speed of light" hits the same chunk as a query for "c"

Why it matters: Visual content is no longer invisible to the AI — it answers questions about figures and tables, not only the surrounding prose
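
The table pre-analysis and the symbol map described above can be approximated as follows. This is a sketch under stated assumptions: the row format (a list of dictionaries with string cells), the function names, and the sample table values are invented for illustration and are not Scrapalot's schema.

```python
from statistics import mean

def column_stats(rows: list[dict[str, str]]) -> dict[str, dict[str, float]]:
    """Return min, max and mean for every column whose cells all parse as numbers."""
    stats: dict[str, dict[str, float]] = {}
    if not rows:
        return stats
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (KeyError, ValueError):
            continue  # text column or ragged row: no numeric summary
        stats[column] = {"min": min(values), "max": max(values), "mean": mean(values)}
    return stats

def equation_index_terms(symbol_map: dict[str, str]) -> set[str]:
    """Index an equation under both its symbols and their plain-language meanings."""
    terms: set[str] = set()
    for symbol, meaning in symbol_map.items():
        terms.add(symbol)
        terms.add(meaning.lower())
    return terms

# Hypothetical results table, as it might look after being parsed into rows.
table = [
    {"condition": "control", "n": "20", "score": "1.5"},
    {"condition": "treated", "n": "22", "score": "2.5"},
]
stats = column_stats(table)

# Symbol map for E = mc², taken from the example above.
terms = equation_index_terms({"E": "energy", "m": "mass", "c": "speed of light"})
```

The text column `condition` is skipped, the numeric columns each get a summary, and the equation becomes findable by "c" as well as by "speed of light".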

Citation Highlighting

What it does: When you click a citation in the chat, the PDF viewer scrolls to the cited page and draws a transient pulsing highlight over the exact passage

Example:

AI answers: "Sun Tzu defines five constant factors [1]"
You click [1] → PDF opens at page 30, yellow border pulses around the cited paragraph for 3 seconds

Why it matters: You can verify any AI claim in one click without skimming the page hunting for the source

Content Type Matching

The system automatically chooses the best processing approach based on your content:

Academic and Scientific

Content includes: Research papers, scientific books, theoretical texts

Processing approach: Semantic understanding

Preserves complex arguments
Keeps equations with explanations
Maintains research flow from hypothesis to conclusion

Result: Deep, comprehensive answers to complex questions

Technical Documentation

Content includes: API docs, code examples, configuration guides

Processing approach: Structure and syntax awareness

Preserves code blocks intact
Keeps configuration examples complete
Maintains technical accuracy

Result: Exact technical answers with proper syntax

Reference Material

Content includes: Encyclopedias, dictionaries, fact databases

Processing approach: Atomic fact extraction

Individual facts become searchable units
Precise retrieval of specific information
Minimal extraneous context

Result: Quick, accurate fact lookup

Specialized Knowledge

Content includes: Medical texts, legal documents, philosophical works

Processing approach: Domain-aware processing

Recognizes field-specific terminology
Preserves specialized concept relationships
Maintains domain-specific structure

Result: Accurate domain knowledge retrieval

Quality Features

Overlap for Continuity

What it does: Includes a small portion of adjacent chunks

Why it helps:

Ensures no information is lost at boundaries
Provides context when a concept spans the boundary
Improves search accuracy

Example: If a concept is explained across a natural boundary, the overlap ensures both chunks have enough context
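
A sliding window is the usual way to produce this kind of overlap. The sketch below works on a list of sentences; the function name and the `size` and `overlap` parameters are assumptions for illustration, and the values Scrapalot actually uses are not stated here.

```python
def chunk_with_overlap(sentences: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` sentences, repeating the last `overlap` in the next chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the window already reached the end of the text
    return chunks

sentences = ["s1", "s2", "s3", "s4", "s5"]
chunks = chunk_with_overlap(sentences, size=3, overlap=1)
```

With these inputs the sentence `s3` appears at the end of the first chunk and at the start of the second, so a concept that straddles the boundary is readable from either side.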

Size Optimization

What it does: Balances chunk size for best results

Considerations:

Large enough to preserve meaning
Small enough for precise retrieval
Optimized for your content type

Result: Accurate search without information overload

Quality Validation

What it does: Checks processed content for completeness

Ensures:

No orphaned fragments
Complete sentences
Sufficient context
Meaningful units

Result: Reliable, usable search results
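
Checks of this kind can be expressed as simple heuristics. The sketch below is an assumption about what such a validator might look like, not a description of Scrapalot's checks: the function name, the five-word minimum, and the punctuation rules are all illustrative.

```python
def validate_chunk(text: str, min_words: int = 5) -> list[str]:
    """Return a list of problems found in a chunk; an empty list means it passed."""
    problems: list[str] = []
    stripped = text.strip()
    if len(stripped.split()) < min_words:
        problems.append("orphaned fragment")
    if stripped and stripped[0].islower():
        problems.append("starts mid-sentence")
    if not stripped.endswith((".", "!", "?")):
        problems.append("ends mid-sentence")
    return problems

# The two halves of the poorly split example from the start of this page.
bad_end = validate_chunk(
    "Jung's concept of the collective unconscious differs from Freud's personal"
)
bad_start = validate_chunk("unconscious. The collective unconscious contains archetypes.")
good = validate_chunk(
    "Jung's concept of the collective unconscious differs from Freud's personal unconscious."
)
```

Both halves of the badly split chunk are flagged, one for ending mid-sentence and one for starting mid-sentence, while the complete sentence passes with no problems.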

What You Experience

Upload

Drop your document into Scrapalot
System analyzes content type
Processing happens automatically
Progress indicator shows status

Behind the Scenes

While processing:

Content is intelligently analyzed
Optimal processing approach is selected
Chunks are created with smart boundaries
Concepts and relationships are mapped
Everything is made searchable

Ready to Use

After processing:

Ask questions naturally
Get accurate, complete answers
Results cite specific sources
Related concepts are connected

Performance Benefits

Better Search Results

Relevance: Find exactly what you're looking for
Completeness: Answers include full context
Accuracy: No fragmented or incomplete information
Speed: Optimized chunk sizes for fast retrieval

Understanding Relationships

Connections: Discover how concepts relate
Dependencies: Understand what depends on what
Hierarchies: Navigate from general to specific
Networks: Explore interconnected knowledge

Comprehensive Knowledge

Multi-source: Synthesize information from multiple sections
Contextual: Understand information in proper context
Progressive: Build understanding from foundational to advanced
Integrated: See how different parts connect

Use Case Examples

Research Paper Analysis

Document: 50-page quantum physics paper

Processing:

Preserves mathematical proofs intact
Keeps methodology with results
Maintains theoretical framework
Links related equations

Result: Ask "How does the paper address the measurement problem?" → Get complete, sourced explanation

Technical Manual

Document: 200-page API documentation

Processing:

Preserves code examples
Keeps endpoints with their descriptions
Maintains parameter tables
Respects document hierarchy

Result: Ask "How do I authenticate API requests?" → Get exact code example with context

Philosophical Text

Document: Book on consciousness and phenomenology

Processing:

Keeps arguments with their justifications
Preserves terminology with definitions
Maintains concept relationships
Respects philosophical progression

Result: Ask "What is Heidegger's concept of Dasein?" → Get comprehensive explanation with proper context

Historical Biography

Document: Life story of a scientist

Processing:

Preserves chronological flow
Maintains cause-effect relationships
Keeps discoveries with their context
Respects narrative structure

Result: Ask "What led to Einstein's development of general relativity?" → Get chronological account with context

Tips for Best Results

Document Preparation

Good: Clean, well-formatted PDFs or text files
Why: Easier to process accurately

Less optimal: Scanned images, poorly formatted files
Why: May require extra processing or OCR

Content Organization

Good: Documents with clear structure (headings, paragraphs)
Why: Structure helps the system understand content organization

Less optimal: Wall of text without breaks
Why: Harder to identify natural boundaries

File Formats

Best supported:

PDF (text-based)
Text files
Markdown
Word documents
EPUB

Result: Accurate processing with preserved formatting

What Makes This Effective

Intelligent vs. Simple Splitting

Simple approach:

Split every 500 words
Results in mid-sentence cuts
Loses context
Fragments concepts

Scrapalot's approach:

Understand content meaning
Split at natural boundaries
Preserve context
Keep concepts together

Difference: Complete, accurate answers vs. fragmented, incomplete information

Automatic Optimization

You don't need to:

Choose processing settings
Specify document type
Configure chunk sizes
Manage technical details

The system:

Analyzes your content
Selects optimal approach
Processes intelligently
Delivers accurate results

Continuous Learning

The system improves over time:

Adapts to your document types
Learns domain-specific patterns
Optimizes for your use cases
Refines processing quality

This intelligent document processing ensures your knowledge base is searchable, accessible, and delivers accurate answers. Upload your documents and start asking questions—the system handles the complexity automatically.

Annotations and Reading Tools

Beyond the search-and-answer flow, every PDF in Scrapalot is also a workspace you can annotate by hand and share with collaborators.

Annotation toolbar

Select any text in the PDF viewer to bring up the on-selection toolbar with five one-click tools:

Highlight — paints a coloured rectangle over the selection. Eight colours available.
Underline — single blue underline (definition / reference).
Strikethrough — single grey line through the text. When applied, it auto-replaces any prior highlight or underline on the same passage so the "discard" intent is clear.
Note — same as highlight but with a comment field. Surfaces a small notepad pin in the page gutter you can click to read or edit the note.
Area capture — draw a bounding box around a figure or table to capture it as a region.

The same row also exposes:

AI tools dropdown — Cite this in my notes, Explain this passage, Find similar passages. All run against your active provider.
Colour picker — opens a swatch grid plus a compact legend. Workspace admins can override the legend label per colour ("red = critical", "yellow = insight", "green = cite this") via Settings → Documents → Annotation colour labels; the label flows into every swatch tooltip across the workspace.

Margin notes from chat

Every chat citation row carries two icons next to the citation text:

📓 Insert into Note — pastes the citation into your notes editor.
🔖 Save as margin note — creates a sticky-note annotation directly on the cited document, with the AI's answer as the note's comment. The next time you open that document, you see a gutter pin on the cited page and can read the AI's commentary inline.

Sharing annotations

Annotations are owned by the user who created them but can be shared with workspace members. Right-click an annotation → Share → pick a colleague from the workspace member list. Permissions:

View — recipient sees the annotation in their copy of the document.
Edit — recipient can also amend the comment and colour. The highlight position and selected text remain owner-only.
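
The two permission levels amount to a small field-level access rule. The sketch below is one possible model, with hypothetical names throughout (`Annotation`, `can_view`, `can_edit_field`, the field names, and the sample users); it is not Scrapalot's data model.

```python
from dataclasses import dataclass, field

OWNER_ONLY_FIELDS = {"position", "selected_text"}  # never editable by recipients
SHARED_EDITABLE_FIELDS = {"comment", "colour"}     # editable with the Edit permission

@dataclass
class Annotation:
    owner: str
    # Maps a workspace member to the permission they were granted: "view" or "edit".
    shares: dict[str, str] = field(default_factory=dict)

def can_view(annotation: Annotation, user: str) -> bool:
    """The owner and anyone the annotation was shared with can see it."""
    return user == annotation.owner or user in annotation.shares

def can_edit_field(annotation: Annotation, user: str, field_name: str) -> bool:
    """The owner may change any field; Edit recipients only comment and colour."""
    if user == annotation.owner:
        return field_name in OWNER_ONLY_FIELDS | SHARED_EDITABLE_FIELDS
    if annotation.shares.get(user) == "edit":
        return field_name in SHARED_EDITABLE_FIELDS
    return False

note = Annotation(owner="ana", shares={"ben": "view", "cho": "edit"})
```

In this model `ben` can see the annotation but change nothing, `cho` can amend the comment and colour, and only `ana` can move the highlight or change the selected text.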

Real-time sync over WebSocket: when one user adds, edits, or deletes a shared annotation, every other user viewing the same PDF sees the change within ~1 second without refreshing.

Visual entities side panel

A toolbar button in the PDF viewer (image icon) opens a side panel listing every image, table, and equation extracted from the document during ingest. A row of filter chips at the top scopes the list by type and shows per-type counts. Click a row to jump to that page.

Use this to find the figures in a long paper without opening the PDF outline. Counts are zero for documents uploaded before multimodal extraction shipped; reprocess the document from the document menu to backfill them.
