Document Processing & Intelligent Chunking

When you upload documents to Scrapalot, the system intelligently breaks them down into meaningful pieces that preserve context and relationships. This ensures accurate search results and comprehensive answers to your questions.

Why Document Processing Matters

Large documents need to be broken into smaller, searchable pieces. How this is done dramatically impacts search quality:

Poor Chunking Example

Imagine a book about psychology split mid-sentence:

Chunk 1 ends: "Jung's concept of the collective unconscious differs from Freud's personal..."

Chunk 2 starts: "...unconscious. The collective unconscious contains archetypes: universal patterns."

Problem: Split content loses meaning, making it hard to find complete information.

Smart Chunking Example

The same content intelligently split:

Chunk 1: "Jung's concept of the collective unconscious differs from Freud's personal unconscious. The collective unconscious contains archetypes—universal patterns like the anima, animus, and shadow."

Result: Complete, meaningful information that's easy to search and understand.
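The contrast between the two examples can be sketched in a few lines of Python. This is an illustration only, not Scrapalot's actual implementation: the function names `fixed_size_chunks` and `sentence_chunks` are hypothetical, and sentence detection here is a simple punctuation-based regular expression.

```python
import re

def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, even mid-sentence."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

TEXT = ("Jung's concept of the collective unconscious differs from Freud's "
        "personal unconscious. The collective unconscious contains archetypes.")

naive = fixed_size_chunks(TEXT, 80)    # first chunk stops inside a word
smart = sentence_chunks(TEXT, 100)     # every chunk ends on a full sentence
```

The naive splitter cuts wherever the character count runs out, while the sentence-aware version only closes a chunk at a sentence end.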

How Scrapalot Processes Your Documents

Processing Approaches

Scrapalot uses different processing strategies based on your content type. The system automatically selects the best approach.

For Scientific and Academic Content

Best for: Research papers, philosophical texts, scientific books

How it works:

  • Detects when topics shift
  • Keeps related concepts together
  • Preserves argument flow
  • Maintains mathematical formulas with explanations

Example: A quantum physics paper discussing wave-particle duality

  • Keeps mathematical equations with their conceptual explanations
  • Preserves the progression from theory to experiment to conclusion
  • Maintains connections between related concepts

Benefits:

  • Complete ideas aren't split apart
  • Context is preserved
  • Related concepts stay together
  • Accurate answers to complex questions
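As a rough sketch of how topic-shift detection might work, the example below starts a new chunk whenever two adjacent sentences share very little vocabulary. It assumes a plain word-overlap (cosine) score as a stand-in for whatever semantic model the real system uses; the function name `split_on_topic_shift` and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter

def _vector(sentence: str) -> Counter:
    """Bag-of-words counts for one sentence (lowercased)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_on_topic_shift(sentences: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Start a new chunk when adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if _cosine(_vector(previous), _vector(current)) < threshold:
            chunks.append([current])      # topic shift: open a new chunk
        else:
            chunks[-1].append(current)    # same topic: keep together
    return chunks

SENTENCES = [
    "Photons show wave behavior in the double slit experiment.",
    "The double slit experiment also shows particle behavior.",
    "Funding came from a national grant.",
]
groups = split_on_topic_shift(SENTENCES)
```

Word overlap is crude (common words such as "the" inflate the score), so a production system would more plausibly compare sentence embeddings; the boundary logic would stay the same.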

For Specialized Domains

Best for: Technical documentation, domain-specific knowledge, specialized fields

How it works:

  • Recognizes important terminology
  • Understands field-specific concepts
  • Preserves definitions with their context
  • Tracks concept relationships

Example: Psychology text about Jung's archetypes

  • Keeps "collective unconscious" with its full definition
  • Preserves relationships: "archetypes include the anima, animus, and shadow"
  • Maintains connections between related terms

Benefits:

  • Technical terms kept with explanations
  • Concept relationships preserved
  • Domain knowledge accessible
  • Precise search results

For Factual Content

Best for: Encyclopedias, reference material, factual databases

How it works:

  • Extracts individual facts
  • Creates self-contained statements
  • Preserves who, what, when, where details
  • Enables precise fact retrieval

Example: Historical information

  • "Einstein published the general theory of relativity in 1915"
  • "General relativity explains gravity as spacetime curvature"
  • Each fact stands alone and is precisely searchable

Benefits:

  • Exact fact retrieval
  • No unnecessary context
  • Quick answers to specific questions
  • High precision results
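One simple way to picture atomic fact extraction is to store each sentence as its own record, tagged with where it came from. The sketch below assumes exactly that; the `Fact` record, `extract_facts`, and `find_facts` are hypothetical names, and real fact extraction would need more than sentence splitting (for example, resolving pronouns so each statement stands alone).

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str       # one self-contained statement
    source: str     # document the fact came from
    position: int   # order of the fact within that document

def extract_facts(text: str, source: str) -> list[Fact]:
    """Treat every sentence as one searchable fact."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [Fact(s, source, i) for i, s in enumerate(sentences) if s]

def find_facts(facts: list[Fact], *terms: str) -> list[Fact]:
    """Return facts whose text contains every search term (case-insensitive)."""
    wanted = [t.lower() for t in terms]
    return [f for f in facts if all(t in f.text.lower() for t in wanted)]

facts = extract_facts(
    "Einstein published general relativity in 1915. "
    "General relativity explains gravity as spacetime curvature.",
    source="physics.txt",
)
hits = find_facts(facts, "gravity")
```

Because each record carries its source and position, a result can cite exactly where the fact appeared.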

For Structured Documents

Best for: Manuals, documentation with headings, organized content

How it works:

  • Recognizes document structure
  • Preserves hierarchy (chapters, sections, subsections)
  • Maintains headings with their content
  • Uses structure for better search

Example: User manual with table of contents

  • Chapter 3 > Section 3.2 > Subsection 3.2.1
  • Heading "Troubleshooting Authentication" stays with its instructions
  • Search can filter by section

Benefits:

  • Preserves document organization
  • Easy to navigate results
  • Structured search (e.g., "in chapter 3")
  • Clear content hierarchy
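The sketch below shows one way heading-aware chunking could look for a Markdown file: each chunk remembers the trail of headings above it, which is what makes "search within a section" possible. It assumes `#`-style Markdown headings; `chunk_by_headings` and the `path`/`text` keys are hypothetical names, not Scrapalot's API.

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown at '#' headings and record each chunk's heading path."""
    path: list[str] = []     # current heading trail, outermost first
    chunks: list[dict] = []
    body: list[str] = []

    def flush() -> None:
        """Store the text collected so far under the current heading path."""
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]              # drop deeper or sibling headings
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

DOC = "\n".join([
    "# Chapter 3",
    "Intro text.",
    "## Section 3.2",
    "### Troubleshooting Authentication",
    "Reset the token.",
    "## Section 3.3",
    "Other text.",
])
chunks = chunk_by_headings(DOC)
in_section = [c for c in chunks if c["path"].startswith("Chapter 3 > Section 3.2")]
```

Filtering on the stored path is a minimal version of structured search such as "in chapter 3".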

For Long-Form Narratives

Best for: Books, biographies, historical accounts

How it works:

  • Maintains narrative flow
  • Preserves temporal sequences
  • Keeps character development intact
  • Respects story progression

Example: Biography of a scientist

  • Chronological events stay in order
  • Development of ideas tracked over time
  • Cause-and-effect relationships preserved

Benefits:

  • Story flow maintained
  • Timeline preserved
  • Context from earlier sections available
  • Comprehensive understanding

Key Features

Concept Preservation

What it does: Keeps related ideas together

Example: When processing a text about consciousness:

  • "Phenomenal experience has subjective qualities" stays with
  • "These qualities, called qualia, are the 'what it's like' aspect"
  • Both pieces together provide complete understanding

Why it matters: You get complete explanations, not fragments

Relationship Mapping

What it does: Understands how concepts connect

Example: In a systems theory document:

  • Maps "emergence" → "causes" → "complex behavior"
  • Links "feedback loops" → "enable" → "emergence"
  • Connects "systems" → "exhibit" → "emergence" and "feedback loops"

Why it matters: Answers questions about how things relate, not just what they are
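Relationships like these are commonly stored as subject, relation, object triples. The following is a minimal sketch under that assumption; the `ConceptGraph` class and its methods are hypothetical and only illustrate how "how does X lead to Y?" questions could be answered by following links.

```python
from collections import defaultdict, deque

class ConceptGraph:
    """Tiny in-memory store of (subject, relation, object) triples."""

    def __init__(self) -> None:
        self._edges = defaultdict(list)   # subject -> [(relation, object), ...]

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._edges[subject].append((relation, obj))

    def related(self, subject: str) -> list[tuple[str, str]]:
        """Direct relations that start at `subject`."""
        return list(self._edges.get(subject, []))

    def path(self, start: str, goal: str):
        """Breadth-first search for a chain of triples from start to goal."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, trail = queue.popleft()
            if node == goal:
                return trail
            for relation, obj in self._edges.get(node, []):
                if obj not in seen:
                    seen.add(obj)
                    queue.append((obj, trail + [(node, relation, obj)]))
        return None   # no chain of relations connects the two concepts

graph = ConceptGraph()
graph.add("feedback loops", "enable", "emergence")
graph.add("emergence", "causes", "complex behavior")
chain = graph.path("feedback loops", "complex behavior")
```

The returned chain reads as an explanation: feedback loops enable emergence, and emergence causes complex behavior.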

Context Enrichment

What it does: Adds surrounding information to each piece

Example: A chunk about "quantum decoherence" includes:

  • The concept itself
  • How it relates to the broader measurement problem
  • Its position in quantum mechanics theory

Why it matters: Each search result provides enough context to understand the answer
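To make the idea concrete, the sketch below attaches the document title, section name, and a short excerpt of each neighbor to every chunk. The field names, the `enrich` helper, and the 60-character hint length are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class EnrichedChunk:
    text: str            # the chunk itself
    document: str        # title of the source document
    section: str         # section the chunk belongs to
    previous_hint: str   # tail of the preceding chunk, if any
    next_hint: str       # head of the following chunk, if any

    def index_text(self) -> str:
        """Text to index: a context header followed by the chunk."""
        return f"[{self.document} / {self.section}]\n{self.text}"

def enrich(chunks: list[str], document: str, section: str,
           hint_chars: int = 60) -> list[EnrichedChunk]:
    """Wrap each chunk with its document, section, and neighbor excerpts."""
    enriched = []
    for i, text in enumerate(chunks):
        previous_hint = chunks[i - 1][-hint_chars:] if i > 0 else ""
        next_hint = chunks[i + 1][:hint_chars] if i + 1 < len(chunks) else ""
        enriched.append(EnrichedChunk(text, document, section, previous_hint, next_hint))
    return enriched

CHUNKS = [
    "Decoherence explains the loss of interference.",
    "It is part of the measurement problem.",
]
enriched = enrich(CHUNKS, document="Quantum Mechanics", section="Measurement")
```

Indexing the header together with the chunk lets a search for the section or document name surface the chunk even when the chunk text never mentions it.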

Smart Boundaries

What it does: Splits content at natural transitions, not arbitrary lengths

Example: Recognizing topic shifts:

  • "...thus proving the theorem." [Natural end point]
  • "In the next chapter, we explore consciousness." [Natural start point]

Why it matters: Chunks contain complete thoughts, not mid-sentence cuts

Content Type Matching

The system automatically chooses the best processing approach based on your content:

Academic and Scientific

Content includes: Research papers, scientific books, theoretical texts

Processing approach: Semantic understanding

  • Preserves complex arguments
  • Keeps equations with explanations
  • Maintains research flow from hypothesis to conclusion

Result: Deep, comprehensive answers to complex questions

Technical Documentation

Content includes: API docs, code examples, configuration guides

Processing approach: Structure and syntax awareness

  • Preserves code blocks intact
  • Keeps configuration examples complete
  • Maintains technical accuracy

Result: Exact technical answers with proper syntax
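Keeping code blocks intact mostly means refusing to split inside a fence. The sketch below assumes Markdown input with triple-backtick fences and splits on blank lines everywhere else; `split_preserving_code` is a hypothetical helper name.

```python
FENCE = "`" * 3   # Markdown code fence marker (three backticks)

def split_preserving_code(markdown: str) -> list[str]:
    """Split on blank lines, except inside code fences, which stay in one piece."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence           # entering or leaving a code block
        if line.strip() == "" and not in_fence:
            if current:                       # blank line outside code: close block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)              # blank lines inside code are kept
    if current:
        blocks.append("\n".join(current))
    return blocks

DOC = "\n".join([
    "Call the endpoint:",
    "",
    FENCE + "python",
    "import requests",
    "",
    "requests.get(url)",
    FENCE,
    "",
    "Done.",
])
blocks = split_preserving_code(DOC)
```

The blank line inside the example's code block does not cause a split, so the snippet is returned whole.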

Reference Material

Content includes: Encyclopedias, dictionaries, fact databases

Processing approach: Atomic fact extraction

  • Individual facts become searchable units
  • Precise retrieval of specific information
  • Minimal extraneous context

Result: Quick, accurate fact lookup

Specialized Knowledge

Content includes: Medical texts, legal documents, philosophical works

Processing approach: Domain-aware processing

  • Recognizes field-specific terminology
  • Preserves specialized concept relationships
  • Maintains domain-specific structure

Result: Accurate domain knowledge retrieval
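How the system decides between these approaches is not documented here, so the following is purely a hypothetical illustration of rule-based selection. The strategy labels, the thresholds (5% domain terms, 10 words per sentence), and the `domain_terms` glossary parameter are all assumptions.

```python
import re

FENCE = "`" * 3   # Markdown code fence marker (three backticks)

def choose_strategy(text: str, domain_terms: frozenset[str] = frozenset()) -> str:
    """Pick a processing approach from simple surface features of the text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if FENCE in text:
        return "structure_and_syntax"     # code blocks present: technical docs
    if words and sum(w in domain_terms for w in words) / len(words) >= 0.05:
        return "domain_aware"             # dense in glossary terminology
    if sentences and len(words) / len(sentences) <= 10:
        return "atomic_facts"             # short, self-contained statements
    return "semantic"                     # default: long-form argument

LEGAL_TERMS = frozenset({"tort", "plaintiff", "negligence"})
technical = choose_strategy("Use this:\n" + FENCE + "\nx = 1\n" + FENCE)
legal = choose_strategy("The plaintiff filed a tort claim. The tort was negligence.", LEGAL_TERMS)
reference = choose_strategy("Paris is the capital of France. Berlin is the capital of Germany.")
academic = choose_strategy(
    "Although the collective unconscious is often contrasted with the personal "
    "unconscious, the two notions share a common origin in early depth psychology."
)
```

A real classifier would likely weigh many more signals, but the shape is the same: inspect the content, then route it to one approach.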

Quality Features

Overlap for Continuity

What it does: Includes a small portion of adjacent chunks

Why it helps:

  • Ensures no information is lost at boundaries
  • Provides context when a concept spans the boundary
  • Improves search accuracy

Example: If a concept is explained across a natural boundary, the overlap ensures both chunks have enough context
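Overlap is typically implemented as a sliding window. The sketch below assumes chunks are measured in sentences and that `overlap` sentences are repeated at the start of each following chunk; the function name and parameters are hypothetical.

```python
def chunk_with_overlap(sentences: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding window of `size` sentences; consecutive chunks share `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break                         # the last window reached the end
    return chunks

SENTENCES = ["s1", "s2", "s3", "s4", "s5"]
windows = chunk_with_overlap(SENTENCES, size=3, overlap=1)
```

Here "s3" appears in both chunks, so a concept explained across the boundary is visible from either side.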

Size Optimization

What it does: Balances chunk size for best results

Considerations:

  • Large enough to preserve meaning
  • Small enough for precise retrieval
  • Optimized for your content type

Result: Accurate search without information overload
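One part of size balancing is absorbing fragments that are too small to be useful. The sketch below merges a chunk into its predecessor when either one is under a minimum size and the combined text still fits a maximum. The limits and the name `merge_small_chunks` are assumptions; the actual targets are not stated here.

```python
def merge_small_chunks(chunks: list[str], min_chars: int, max_chars: int) -> list[str]:
    """Merge neighbors when one is below `min_chars` and the result fits `max_chars`."""
    merged: list[str] = []
    for chunk in chunks:
        too_small = merged and (len(merged[-1]) < min_chars or len(chunk) < min_chars)
        if too_small and len(merged[-1]) + 1 + len(chunk) <= max_chars:
            merged[-1] = f"{merged[-1]} {chunk}"   # absorb into the previous chunk
        else:
            merged.append(chunk)
    return merged

CHUNKS = ["Short.", "A" * 50, "Tiny.", "B" * 95]
balanced = merge_small_chunks(CHUNKS, min_chars=20, max_chars=100)
```

The two small fragments are folded into their neighbor, while the chunk that is already near the upper limit is left alone.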

Quality Validation

What it does: Checks processed content for completeness

Ensures:

  • No orphaned fragments
  • Complete sentences
  • Sufficient context
  • Meaningful units

Result: Reliable, usable search results
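Checks like these can be approximated with a few surface heuristics. The sketch below is an assumption about what such validation might look like, not the system's actual rules; `validate_chunk` and the five-word minimum are hypothetical.

```python
def validate_chunk(text: str, min_words: int = 5) -> list[str]:
    """Return a list of problems found; an empty list means the chunk passed."""
    problems = []
    stripped = text.strip()
    if len(stripped.split()) < min_words:
        problems.append("too short to carry context")
    # A complete sentence normally starts with a capital, a digit, or an opening quote/bracket.
    if stripped and not (stripped[0].isupper() or stripped[0].isdigit()
                         or stripped[0] in "\"'(["):
        problems.append("starts mid-sentence")
    # Ignore closing quotes/brackets, then require sentence-ending punctuation.
    if not stripped.rstrip("\"')]").endswith((".", "!", "?")):
        problems.append("ends mid-sentence")
    return problems

good = validate_chunk(
    "Jung's concept of the collective unconscious differs from Freud's personal unconscious."
)
orphan = validate_chunk("...unconscious. The collective unconscious contains archetypes.")
fragment = validate_chunk("differs from Freud's personal")
```

A chunk that fails could be merged with a neighbor or re-split before it is made searchable.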

What You Experience

Upload

  1. Drop your document into Scrapalot
  2. System analyzes content type
  3. Processing happens automatically
  4. Progress indicator shows status

Behind the Scenes

While processing:

  • Content is intelligently analyzed
  • Optimal processing approach is selected
  • Chunks are created with smart boundaries
  • Concepts and relationships are mapped
  • Everything is made searchable

Ready to Use

After processing:

  • Ask questions naturally
  • Get accurate, complete answers
  • Results cite specific sources
  • Related concepts are connected

Performance Benefits

Better Search Results

  • Relevance: Find exactly what you're looking for
  • Completeness: Answers include full context
  • Accuracy: No fragmented or incomplete information
  • Speed: Optimized chunk sizes for fast retrieval

Understanding Relationships

  • Connections: Discover how concepts relate
  • Dependencies: Understand what depends on what
  • Hierarchies: Navigate from general to specific
  • Networks: Explore interconnected knowledge

Comprehensive Knowledge

  • Multi-source: Synthesize information from multiple sections
  • Contextual: Understand information in proper context
  • Progressive: Build understanding from foundational to advanced
  • Integrated: See how different parts connect

Use Case Examples

Research Paper Analysis

Document: 50-page quantum physics paper

Processing:

  • Preserves mathematical proofs intact
  • Keeps methodology with results
  • Maintains theoretical framework
  • Links related equations

Result: Ask "How does the paper address the measurement problem?" → Get complete, sourced explanation

Technical Manual

Document: 200-page API documentation

Processing:

  • Preserves code examples
  • Keeps endpoints with their descriptions
  • Maintains parameter tables
  • Respects document hierarchy

Result: Ask "How do I authenticate API requests?" → Get exact code example with context

Philosophical Text

Document: Book on consciousness and phenomenology

Processing:

  • Keeps arguments with their justifications
  • Preserves terminology with definitions
  • Maintains concept relationships
  • Respects philosophical progression

Result: Ask "What is Heidegger's concept of Dasein?" → Get comprehensive explanation with proper context

Historical Biography

Document: Life story of a scientist

Processing:

  • Preserves chronological flow
  • Maintains cause-effect relationships
  • Keeps discoveries with their context
  • Respects narrative structure

Result: Ask "What led to Einstein's development of general relativity?" → Get chronological account with context

Tips for Best Results

Document Preparation

Good: Clean, well-formatted PDFs or text files

Why: Easier to process accurately

Less optimal: Scanned images, poorly formatted files

Why: May require extra processing or OCR

Content Organization

Good: Documents with clear structure (headings, paragraphs)

Why: Structure helps the system understand content organization

Less optimal: Wall of text without breaks

Why: Harder to identify natural boundaries

File Formats

Best supported:

  • PDF (text-based)
  • Text files
  • Markdown
  • Word documents
  • EPUB

Result: Accurate processing with preserved formatting

What Makes This Effective

Intelligent vs. Simple Splitting

Simple approach:

  • Split every 500 words
  • Results in mid-sentence cuts
  • Loses context
  • Fragments concepts

Scrapalot's approach:

  • Understand content meaning
  • Split at natural boundaries
  • Preserve context
  • Keep concepts together

Difference: Complete, accurate answers vs. fragmented, incomplete information

Automatic Optimization

You don't need to:

  • Choose processing settings
  • Specify document type
  • Configure chunk sizes
  • Manage technical details

The system:

  • Analyzes your content
  • Selects optimal approach
  • Processes intelligently
  • Delivers accurate results

Continuous Learning

The system improves over time:

  • Adapts to your document types
  • Learns domain-specific patterns
  • Optimizes for your use cases
  • Refines processing quality

This intelligent document processing ensures your knowledge base is searchable, accessible, and delivers accurate answers. Upload your documents and start asking questions—the system handles the complexity automatically.

Released under the MIT License.