API Reference
Comprehensive REST API documentation for Scrapalot with real request/response examples.
Interactive Documentation
For live, interactive API docs with "Try it out" functionality, visit /docs when running Scrapalot:
- Local: http://localhost:8090/docs
- Production: https://your-domain.com/docs
Quick Start
All API endpoints require authentication via Bearer token or API key.
Base URL: http://localhost:8090/api/v1
Authentication Headers:
# Option 1: Bearer Token (from login)
Authorization: Bearer <access_token>
# Option 2: API Key (for integrations)
X-API-Key: scp-1a2b3c4d5e6f7g8h9i0j...
Authentication
Authentication Flow
Get Access Token
Log in to obtain a JWT access token.
Endpoint: POST /auth/login
Request:
curl -X POST http://localhost:8090/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{
"username": "user@example.com",
"password": "your-password"
}'
Response:
{
"access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"refresh_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"token_type": "bearer",
"expires_in": 3600
}
Create API Key
Create a long-lived API key for integrations.
Endpoint: POST /auth/api-keys
Request:
curl -X POST http://localhost:8090/api/v1/auth/api-keys \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Production Integration",
"expires_in_days": 90
}'
Response:
{
"id": "123e4567-e89b-12d3-a456-426614174000",
"name": "Production Integration",
"key": "scp-1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p",
"key_prefix": "scp-1a2b3c",
"created_at": "2025-12-18T10:30:00Z",
"expires_at": "2026-03-18T10:30:00Z"
}
Store Securely
The full API key is shown only once, at creation time. Store it securely; it cannot be retrieved later.
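To tie the two authentication options from the Quick Start together, here is a minimal Python sketch using the requests library (an assumption of this example; any HTTP client works). The helper names and BASE_URL constant are illustrative, not part of the API.

```python
import requests

BASE_URL = "http://localhost:8090/api/v1"  # assumed local deployment

# Option 1: short-lived Bearer token obtained from POST /auth/login
def login_session(username: str, password: str) -> requests.Session:
    resp = requests.post(
        f"{BASE_URL}/auth/login",
        json={"username": username, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    token = resp.json()["access_token"]
    session = requests.Session()
    session.headers.update({"Authorization": f"Bearer {token}"})
    return session

# Option 2: long-lived API key for server-to-server integrations
def api_key_session(api_key: str) -> requests.Session:
    session = requests.Session()
    session.headers.update({"X-API-Key": api_key})
    return session
```

Either session can then be reused for the collection, document, and chat calls shown in the rest of this reference.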
List API Keys
Endpoint: GET /auth/api-keys
Response:
[
{
"id": "123e4567-e89b-12d3-a456-426614174000",
"name": "Production Integration",
"key_prefix": "scp-1a2b3c",
"is_active": true,
"last_used_at": "2025-12-18T10:30:00Z",
"created_at": "2025-12-01T08:00:00Z",
"expires_at": "2026-03-18T10:30:00Z"
}
]
Collections
Collections organize documents into knowledge stacks.
List Collections
Endpoint: GET /collections/
Query Parameters:
- page (integer, default: 1)
- limit (integer, default: 20)
Request:
curl "http://localhost:8090/api/v1/collections/?page=1&limit=20" \
-H "Authorization: Bearer <access_token>"Response:
{
"collections": [
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"name": "Research Papers",
"workspace_id": "w1b2c3d4-e5f6-7890-abcd-ef1234567890",
"created_at": "2025-12-01T10:00:00Z",
"updated_at": "2025-12-18T14:30:00Z",
"documentCount": 42
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 42,
"has_more": true
}
}
Create Collection
Endpoint: POST /collections/
Request:
curl -X POST http://localhost:8090/api/v1/collections/ \
-H "Authorization: Bearer <access_token>" \
-F "name=My Research" \
-F "workspace_id=w1b2c3d4-e5f6-7890-abcd-ef1234567890"Response (201 Created):
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"name": "My Research",
"workspace_id": "w1b2c3d4-e5f6-7890-abcd-ef1234567890",
"created_at": "2025-12-18T15:00:00Z",
"updated_at": "2025-12-18T15:00:00Z"
}
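The curl example above sends the fields as multipart form data (-F). A sketch of the same request in Python, assuming the requests library; passing plain fields through files= (with a None filename) is one way to force multipart encoding:

```python
import requests

BASE_URL = "http://localhost:8090/api/v1"  # assumed local deployment

def create_collection(session: requests.Session, name: str, workspace_id: str) -> dict:
    # Mirror curl's -F fields as multipart/form-data parts.
    resp = session.post(
        f"{BASE_URL}/collections/",
        files={
            "name": (None, name),
            "workspace_id": (None, workspace_id),
        },
        timeout=30,
    )
    resp.raise_for_status()  # expect 201 Created
    return resp.json()
```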
Delete Collection
Endpoint: DELETE /collections/{collection_id}
Request:
curl -X DELETE http://localhost:8090/api/v1/collections/a1b2c3d4-... \
-H "Authorization: Bearer <access_token>"Response:
{
"message": "Collection deleted successfully"
}
Documents
Upload, manage, and retrieve documents.
Document Upload Flow
Upload Document (Streaming)
Upload with real-time progress updates.
Endpoint: POST /documents/upload_stream
Request:
curl -X POST http://localhost:8090/api/v1/documents/upload_stream \
-H "Authorization: Bearer <access_token>" \
-F "file=@document.pdf" \
-F "collection_id=a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
--no-buffer
Response (Streaming NDJSON):
{"type":"status","content":{"status":"processing","progress":10,"message":"File uploaded"}}
{"type":"status","content":{"status":"processing","progress":25,"message":"Extracting text..."}}
{"type":"status","content":{"status":"processing","progress":50,"message":"Generating embeddings..."}}
{"type":"status","content":{"status":"processing","progress":85,"message":"Storing embeddings..."}}
{"type":"status","content":{"status":"completed","progress":100,"message":"Complete","document_id":"d1234567-..."}}List Documents
List Documents
Endpoint: GET /documents/collection/{collection_id}
Query Parameters:
- page (integer, default: 1)
- page_size (integer, default: 20)
Response:
{
"documents": [
{
"id": "d1234567-e89b-12d3-a456-426614174000",
"title": "AI Research Paper",
"filename": "ai_research.pdf",
"file_metadata": {
"size": 1048576,
"content_type": "application/pdf"
},
"collection_id": "a1b2c3d4-...",
"created_at": "2025-12-15T10:00:00Z"
}
],
"pagination": {
"page": 1,
"page_size": 20,
"total": 42
}
}
Get Processing Status
Endpoint: GET /documents/processing_status/{job_id}
Response:
{
"job_id": "j1234567-e89b-12d3-a456-426614174000",
"document_id": "d1234567-...",
"status": "processing",
"progress": 65,
"message": "Generating embeddings..."
}
Status values: pending, processing, completed, failed
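If you only hold a job_id, this endpoint can be polled until the job reaches a terminal state. A minimal Python sketch, assuming the requests library and an illustrative polling interval:

```python
import time
import requests

BASE_URL = "http://localhost:8090/api/v1"  # assumed local deployment

def wait_for_job(session: requests.Session, job_id: str, poll_seconds: float = 2.0) -> dict:
    # Poll until the job status becomes "completed" or "failed".
    while True:
        resp = session.get(
            f"{BASE_URL}/documents/processing_status/{job_id}", timeout=30
        )
        resp.raise_for_status()
        job = resp.json()
        print(f"{job['status']}: {job['progress']}% - {job['message']}")
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
```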
Delete Document
Endpoint: DELETE /documents/{document_id}
Response:
{
"status": "success",
"message": "Document deleted successfully"
}
Chat & Queries
Ask questions with RAG, deep research, or direct chat.
RAG Query Flow
Query Documents
Ask questions about your documents with streaming responses.
Endpoint: POST /chat/generate
Request (Basic RAG):
curl -X POST http://localhost:8090/api/v1/chat/generate \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" \
-d '{
"prompt": "What are the key findings?",
"collection_ids": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
"model_name": "gpt-4o",
"provider_type": "openai",
"similarity_threshold": 0.7,
"top_k": 10
}' \
--no-buffer
Request Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Your question |
| collection_ids | UUID[] | [] | Collections to search |
| model_name | string | user default | AI model to use |
| provider_type | string | user default | openai, anthropic, local, etc. |
| similarity_threshold | float | 0.5 | RAG relevance cutoff (0.0-1.0) |
| top_k | integer | 15 | Number of chunks (1-30) |
| deep_research_enabled | boolean | false | Enable 5-phase research |
| research_depth | integer | 2 | Research depth (1-5) |
| research_breadth | integer | 4 | Number of sources (1-10) |
| web_search_enabled | boolean | false | Search the web |
| agentic_rag_enabled | boolean | false | Auto-select strategy |
Response (Streaming NDJSON):
{"type":"message_start","content":{"message_id":"m1234567-...","session_id":"s1234567-..."}}
{"type":"message_delta","content":{"delta":"The","index":0}}
{"type":"message_delta","content":{"delta":" key","index":1}}
{"type":"citation_start","content":{"citation_id":"c1","document_id":"d1234567-..."}}
{"type":"citation_info","content":{"citation_id":"c1","title":"Research Paper","page":5,"relevance_score":0.92}}
{"type":"message_delta","content":{"delta":" findings [1]","index":2}}
{"type":"stream_end","content":{"reason":"complete","token_count":{"input":150,"output":500}}}Packet Types:
- message_start - Response begins
- message_delta - Text chunk
- citation_start - Citation marker
- citation_info - Citation details
- status - Progress update
- stream_end - Response complete
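Putting the packet types together, here is a Python sketch of a streaming chat client, assuming the requests library; the function name and parameter defaults are illustrative. It accumulates message_delta text, records citation_info packets, and stops at stream_end:

```python
import json
import requests

BASE_URL = "http://localhost:8090/api/v1"  # assumed local deployment

def ask(session: requests.Session, prompt: str, collection_ids: list):
    resp = session.post(
        f"{BASE_URL}/chat/generate",
        json={
            "prompt": prompt,
            "collection_ids": collection_ids,
            "similarity_threshold": 0.7,  # recommended starting point (see Best Practices)
            "top_k": 10,
        },
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()

    answer_parts, citations = [], []
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue
        packet = json.loads(line)
        kind, content = packet["type"], packet["content"]
        if kind == "message_delta":
            answer_parts.append(content["delta"])
        elif kind == "citation_info":
            citations.append(content)
        elif kind == "stream_end":
            break  # per Best Practices, do not ignore stream_end
        # message_start, citation_start, and status packets can be logged or ignored
    return "".join(answer_parts), citations
```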
Deep Research Query
Enable comprehensive multi-source research.
Request:
curl -X POST http://localhost:8090/api/v1/chat/generate \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Comprehensive analysis of AI safety research",
"deep_research_enabled": true,
"research_depth": 4,
"research_breadth": 8,
"web_search_enabled": true,
"model_name": "gpt-4o"
}' \
--no-buffer
Additional Response Packets (Deep Research):
{"type":"research_plan","content":{"sections":[{"title":"Background","questions":["What is AI safety?"]}]}}
{"type":"planning_progress","content":{"stage":"generating_plan","progress":0.5}}
{"type":"search_strategy","content":{"provider":"firecrawl","query":"AI safety 2025"}}
{"type":"source_evaluation","content":{"url":"https://...","credibility":0.9}}
{"type":"synthesis_start","content":{"total_sources":25,"quality_score":0.87}}
{"type":"validation_result","content":{"passed":true,"quality_metrics":{"coherence":0.92}}}Models
Manage AI models and providers.
List Models
Endpoint: GET /llm-inference/list-models
Query Parameters:
- providers[] - Filter by provider
- model_type - Filter: NORMAL, EMBEDDING, VISION
- page, limit - Pagination
Response:
{
"data": [
{
"provider_id": "p1234567-...",
"provider_name": "OpenAI Production",
"provider_type": "openai",
"models": [
{
"id": "gpt-4o",
"model_name": "gpt-4o",
"display_name": "GPT-4o",
"model_type": "NORMAL",
"context_window": 128000,
"is_selected": true
}
]
}
],
"total": 50,
"page": 1
}
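A sketch of calling the list endpoint with the filters above, assuming the requests library; the helper name is illustrative, and repeated providers[] keys are one way to encode the array parameter:

```python
import requests

BASE_URL = "http://localhost:8090/api/v1"  # assumed local deployment

def list_models(session: requests.Session, provider=None, model_type=None) -> dict:
    params = {"page": 1, "limit": 50}
    if provider:
        params["providers[]"] = provider   # repeat the key for multiple providers
    if model_type:
        params["model_type"] = model_type  # NORMAL, EMBEDDING, or VISION
    resp = session.get(
        f"{BASE_URL}/llm-inference/list-models", params=params, timeout=30
    )
    resp.raise_for_status()
    return resp.json()
```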
List Embedding Models
Endpoint: GET /llm-inference/embedding-models
Response:
[
{
"id": "text-embedding-3-large",
"model_name": "text-embedding-3-large",
"display_name": "Text Embedding 3 Large",
"model_type": "EMBEDDING",
"provider_type": "openai",
"dimensions": 3072,
"context_window": 8191
}
]
Error Responses
All errors follow a consistent format:
400 Bad Request
{
"detail": "Invalid file type. Only PDF files are supported."
}
401 Unauthorized
{
"detail": "Could not validate credentials"
}
403 Forbidden
{
"detail": "You need editor or owner role to upload files."
}
404 Not Found
{
"detail": "Collection not found"
}
413 Quota Exceeded
{
"detail": "Storage quota exceeded. You have used 9.8GB of your 10GB limit.",
"quota_info": {
"current_usage_gb": 9.8,
"limit_gb": 10.0,
"percentage_used": 98.0
}
}
429 Rate Limited
{
"detail": "You have reached the maximum concurrent jobs limit (5)."
}
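Because every error carries a detail field, client-side handling can key off the status code. A minimal Python sketch, assuming the requests library; the exception choices are illustrative:

```python
import requests

def check(resp: requests.Response) -> dict:
    # Return the parsed body on success; raise a descriptive error otherwise.
    if resp.ok:
        return resp.json()
    try:
        body = resp.json()
    except ValueError:
        body = {}
    detail = body.get("detail", "Unknown error")
    if resp.status_code == 401:
        raise PermissionError(f"Authentication failed: {detail}")
    if resp.status_code == 403:
        raise PermissionError(f"Insufficient role: {detail}")
    if resp.status_code == 413:
        quota = body.get("quota_info", {})
        raise RuntimeError(
            f"Quota exceeded ({quota.get('percentage_used')}% used): {detail}"
        )
    if resp.status_code == 429:
        raise RuntimeError(f"Rate limited: {detail}")
    raise RuntimeError(f"{resp.status_code}: {detail}")
```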
Best Practices
Authentication
✅ Do:
- Use API keys for server-to-server integrations
- Store tokens securely in environment variables
- Refresh tokens before expiration
- Use Bearer tokens for frontend applications
❌ Don't:
- Commit API keys to version control
- Share API keys between environments
- Use expired tokens
File Uploads
✅ Do:
- Use streaming endpoint for real-time progress
- Handle all progress packet types
- Validate file size before upload
- Check storage quota
❌ Don't:
- Upload files larger than 100MB
- Ignore error responses
- Skip progress monitoring
RAG Queries
✅ Do:
- Start with similarity_threshold: 0.7 for accuracy
- Use top_k: 10-15 for balanced context
- Enable agentic_rag_enabled for automatic strategy selection
- Use Deep Research for complex multi-source queries
❌ Don't:
- Set similarity_threshold too low (< 0.4)
- Retrieve too many chunks (top_k > 30)
- Use Deep Research for simple factual queries
Streaming Responses
✅ Do:
- Use --no-buffer with curl
- Handle all packet types
- Implement reconnection logic
- Parse NDJSON line-by-line
❌ Don't:
- Buffer the entire response
- Ignore stream_end packets
- Skip error handling
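One way to implement the reconnection logic mentioned above, sketched in Python under the assumption of the requests library: retry the streaming request with backoff whenever the connection drops before a stream_end packet arrives. A production client would also de-duplicate packets it has already seen.

```python
import json
import time
import requests

def stream_with_retry(session: requests.Session, url: str, payload: dict, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            resp = session.post(url, json=payload, stream=True, timeout=300)
            resp.raise_for_status()
            finished = False
            for line in resp.iter_lines(decode_unicode=True):
                if not line:
                    continue
                packet = json.loads(line)
                yield packet
                if packet["type"] == "stream_end":
                    finished = True
            if finished:
                return  # clean end of stream
        except requests.RequestException as exc:
            print(f"attempt {attempt} failed: {exc}")
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError("stream did not complete after retries")
```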
Rate Limits
| Authentication | Limit |
|---|---|
| API Key | 1000 requests/hour |
| Bearer Token | 5000 requests/hour |
| Document Upload | 50 uploads/hour |
| Chat Requests | 200 requests/hour |
Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1610000000
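A sketch of respecting these headers in Python, assuming the requests library and that X-RateLimit-Reset is a Unix epoch timestamp; it sleeps until the reset time when the remaining budget is exhausted or a 429 is returned:

```python
import time
import requests

def rate_limited_get(session: requests.Session, url: str, **kwargs) -> requests.Response:
    resp = session.get(url, **kwargs)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", 1))
    if resp.status_code == 429 or remaining == 0:
        reset_at = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        wait = max(reset_at - time.time(), 1)
        print(f"Rate limit reached; sleeping {wait:.0f}s until reset")
        time.sleep(wait)
        resp = session.get(url, **kwargs)  # retry once after the window resets
    return resp
```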
Pagination
List endpoints return paginated results:
Request:
curl "http://localhost:8090/api/v1/documents?page=2&limit=50"Response:
{
"data": [...],
"pagination": {
"page": 2,
"limit": 50,
"total": 150,
"has_more": true
}
}
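Walking every page is a matter of incrementing page until has_more is false. A minimal Python sketch, assuming the requests library and the response shape shown above; the generator name is illustrative:

```python
import requests

def iter_pages(session: requests.Session, url: str, limit: int = 50):
    # Yield items from every page until the API reports has_more = false.
    page = 1
    while True:
        resp = session.get(url, params={"page": page, "limit": limit}, timeout=30)
        resp.raise_for_status()
        body = resp.json()
        yield from body["data"]
        if not body["pagination"]["has_more"]:
            break
        page += 1
```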
Related Documentation
- Streaming Protocol - WebSocket message formats
- Security - Authentication details
- Deep Research - Research features
- RAG Strategy - Search strategies
Interactive Docs
For the most up-to-date API reference with "Try it out" functionality, visit /docs when running Scrapalot locally.
Version: 1.0.0 Last Updated: December 18, 2025