API Reference

Comprehensive REST API documentation for Scrapalot with real request/response examples.

Interactive Documentation

For live, interactive API docs with "Try it out" functionality, visit /docs when running Scrapalot (for a local instance, http://localhost:8090/docs).

Quick Start

All API endpoints except POST /auth/login require authentication via a Bearer token or an API key.

Base URL: http://localhost:8090/api/v1

Authentication Headers:

bash
# Option 1: Bearer Token (from login)
Authorization: Bearer <access_token>

# Option 2: API Key (for integrations)
X-API-Key: scp-1a2b3c4d5e6f7g8h9i0j...
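
The same headers work from any HTTP client. A minimal Python sketch using the requests library (the SCRAPALOT_TOKEN and SCRAPALOT_API_KEY environment variable names are illustrative, not part of the API):

python
import os

import requests

BASE_URL = "http://localhost:8090/api/v1"

# Option 1: Bearer token obtained from /auth/login
bearer_headers = {"Authorization": f"Bearer {os.environ.get('SCRAPALOT_TOKEN', '')}"}

# Option 2: long-lived API key, kept out of source control
api_key_headers = {"X-API-Key": os.environ.get("SCRAPALOT_API_KEY", "")}

resp = requests.get(f"{BASE_URL}/collections/", headers=api_key_headers)
resp.raise_for_status()
print(resp.json())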

Authentication

Authentication Flow

Get Access Token

Log in to get a JWT access token.

Endpoint: POST /auth/login

Request:

bash
curl -X POST http://localhost:8090/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "username": "user@example.com",
    "password": "your-password"
  }'

Response:

json
{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "refresh_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "bearer",
  "expires_in": 3600
}

Create API Key

Create a long-lived API key for integrations.

Endpoint: POST /auth/api-keys

Request:

bash
curl -X POST http://localhost:8090/api/v1/auth/api-keys \
  -H "Authorization: Bearer <access_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Integration",
    "expires_in_days": 90
  }'

Response:

json
{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "name": "Production Integration",
  "key": "scp-1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p",
  "key_prefix": "scp-1a2b3c",
  "created_at": "2025-12-18T10:30:00Z",
  "expires_at": "2026-03-18T10:30:00Z"
}

Store Securely

The full API key is shown only once, at creation time. Store it securely; it cannot be retrieved later.

List API Keys

Endpoint: GET /auth/api-keys

Response:

json
[
  {
    "id": "123e4567-e89b-12d3-a456-426614174000",
    "name": "Production Integration",
    "key_prefix": "scp-1a2b3c",
    "is_active": true,
    "last_used_at": "2025-12-18T10:30:00Z",
    "created_at": "2025-12-01T08:00:00Z",
    "expires_at": "2026-03-18T10:30:00Z"
  }
]

Collections

Collections organize documents into knowledge stacks.

List Collections

Endpoint: GET /collections/

Query Parameters:

  • page (integer, default: 1)
  • limit (integer, default: 20)

Request:

bash
curl "http://localhost:8090/api/v1/collections/?page=1&limit=20" \
  -H "Authorization: Bearer <access_token>"

Response:

json
{
  "collections": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "name": "Research Papers",
      "workspace_id": "w1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "created_at": "2025-12-01T10:00:00Z",
      "updated_at": "2025-12-18T14:30:00Z",
      "documentCount": 42
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 42,
    "has_more": true
  }
}

Create Collection

Endpoint: POST /collections/

Request:

bash
curl -X POST http://localhost:8090/api/v1/collections/ \
  -H "Authorization: Bearer <access_token>" \
  -F "name=My Research" \
  -F "workspace_id=w1b2c3d4-e5f6-7890-abcd-ef1234567890"

Response (201 Created):

json
{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "name": "My Research",
  "workspace_id": "w1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "created_at": "2025-12-18T15:00:00Z",
  "updated_at": "2025-12-18T15:00:00Z"
}

Delete Collection

Endpoint: DELETE /collections/{collection_id}

Request:

bash
curl -X DELETE http://localhost:8090/api/v1/collections/a1b2c3d4-... \
  -H "Authorization: Bearer <access_token>"

Response:

json
{
  "message": "Collection deleted successfully"
}

Documents

Upload, manage, and retrieve documents.

Document Upload Flow

Upload Document (Streaming)

Upload with real-time progress updates.

Endpoint: POST /documents/upload_stream

Request:

bash
curl -X POST http://localhost:8090/api/v1/documents/upload_stream \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@document.pdf" \
  -F "collection_id=a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  --no-buffer

Response (Streaming NDJSON):

json
{"type":"status","content":{"status":"processing","progress":10,"message":"File uploaded"}}
{"type":"status","content":{"status":"processing","progress":25,"message":"Extracting text..."}}
{"type":"status","content":{"status":"processing","progress":50,"message":"Generating embeddings..."}}
{"type":"status","content":{"status":"processing","progress":85,"message":"Storing embeddings..."}}
{"type":"status","content":{"status":"completed","progress":100,"message":"Complete","document_id":"d1234567-..."}}

List Documents

Endpoint: GET /documents/collection/{collection_id}

Query Parameters:

  • page (integer, default: 1)
  • page_size (integer, default: 20)

Response:

json
{
  "documents": [
    {
      "id": "d1234567-e89b-12d3-a456-426614174000",
      "title": "AI Research Paper",
      "filename": "ai_research.pdf",
      "file_metadata": {
        "size": 1048576,
        "content_type": "application/pdf"
      },
      "collection_id": "a1b2c3d4-...",
      "created_at": "2025-12-15T10:00:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "page_size": 20,
    "total": 42
  }
}

Get Processing Status

Endpoint: GET /documents/processing_status/{job_id}

Response:

json
{
  "job_id": "j1234567-e89b-12d3-a456-426614174000",
  "document_id": "d1234567-...",
  "status": "processing",
  "progress": 65,
  "message": "Generating embeddings..."
}

Status values: pending, processing, completed, failed
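
A straightforward way to consume this endpoint is to poll until the job reaches a terminal state. A minimal Python sketch (the wait_for_job helper and the two-second interval are illustrative, not part of the API):

python
import time

import requests

BASE_URL = "http://localhost:8090/api/v1"

def wait_for_job(job_id: str, headers: dict, interval: float = 2.0) -> dict:
    """Poll the processing-status endpoint until the job completes or fails."""
    url = f"{BASE_URL}/documents/processing_status/{job_id}"
    while True:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        job = resp.json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)  # avoid hammering the API between checks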

Delete Document

Endpoint: DELETE /documents/{document_id}

Response:

json
{
  "status": "success",
  "message": "Document deleted successfully"
}

Chat & Queries

Ask questions with RAG, deep research, or direct chat.

RAG Query Flow

Query Documents

Ask questions about your documents with streaming responses.

Endpoint: POST /chat/generate

Request (Basic RAG):

bash
curl -X POST http://localhost:8090/api/v1/chat/generate \
  -H "Authorization: Bearer <access_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What are the key findings?",
    "collection_ids": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
    "model_name": "gpt-4o",
    "provider_type": "openai",
    "similarity_threshold": 0.7,
    "top_k": 10
  }' \
  --no-buffer

Request Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Your question |
| collection_ids | UUID[] | [] | Collections to search |
| model_name | string | user default | AI model to use |
| provider_type | string | user default | openai, anthropic, local, etc. |
| similarity_threshold | float | 0.5 | RAG relevance cutoff (0.0-1.0) |
| top_k | integer | 15 | Number of chunks (1-30) |
| deep_research_enabled | boolean | false | Enable 5-phase research |
| research_depth | integer | 2 | Research depth (1-5) |
| research_breadth | integer | 4 | Number of sources (1-10) |
| web_search_enabled | boolean | false | Search the web |
| agentic_rag_enabled | boolean | false | Auto-select strategy |

Response (Streaming NDJSON):

json
{"type":"message_start","content":{"message_id":"m1234567-...","session_id":"s1234567-..."}}
{"type":"message_delta","content":{"delta":"The","index":0}}
{"type":"message_delta","content":{"delta":" key","index":1}}
{"type":"citation_start","content":{"citation_id":"c1","document_id":"d1234567-..."}}
{"type":"citation_info","content":{"citation_id":"c1","title":"Research Paper","page":5,"relevance_score":0.92}}
{"type":"message_delta","content":{"delta":" findings [1]","index":2}}
{"type":"stream_end","content":{"reason":"complete","token_count":{"input":150,"output":500}}}

Packet Types:

  • message_start - Response begins
  • message_delta - Text chunk
  • citation_start - Citation marker
  • citation_info - Citation details
  • status - Progress update
  • stream_end - Response complete
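
A minimal consumer dispatches on the type field, printing message_delta text as it arrives and stopping at stream_end. A Python sketch using requests (it handles only a subset of the packet types above; a production client should cover them all):

python
import json

import requests

resp = requests.post(
    "http://localhost:8090/api/v1/chat/generate",
    headers={"Authorization": "Bearer <access_token>"},
    json={
        "prompt": "What are the key findings?",
        "collection_ids": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
        "similarity_threshold": 0.7,
        "top_k": 10,
    },
    stream=True,  # parse the NDJSON stream line by line
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line:
        continue
    packet = json.loads(line)
    if packet["type"] == "message_delta":
        print(packet["content"]["delta"], end="", flush=True)
    elif packet["type"] == "citation_info":
        info = packet["content"]
        print(f"\n[{info['citation_id']}] {info['title']}, p. {info['page']}")
    elif packet["type"] == "stream_end":
        break  # the response is complete; stop reading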

Deep Research Query

Enable comprehensive multi-source research.

Request:

bash
curl -X POST http://localhost:8090/api/v1/chat/generate \
  -H "Authorization: Bearer <access_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Comprehensive analysis of AI safety research",
    "deep_research_enabled": true,
    "research_depth": 4,
    "research_breadth": 8,
    "web_search_enabled": true,
    "model_name": "gpt-4o"
  }' \
  --no-buffer

Additional Response Packets (Deep Research):

json
{"type":"research_plan","content":{"sections":[{"title":"Background","questions":["What is AI safety?"]}]}}
{"type":"planning_progress","content":{"stage":"generating_plan","progress":0.5}}
{"type":"search_strategy","content":{"provider":"firecrawl","query":"AI safety 2025"}}
{"type":"source_evaluation","content":{"url":"https://...","credibility":0.9}}
{"type":"synthesis_start","content":{"total_sources":25,"quality_score":0.87}}
{"type":"validation_result","content":{"passed":true,"quality_metrics":{"coherence":0.92}}}

Models

Manage AI models and providers.

List Models

Endpoint: GET /llm-inference/list-models

Query Parameters:

  • providers[] - Filter by provider
  • model_type - Filter: NORMAL, EMBEDDING, VISION
  • page, limit - Pagination

Response:

json
{
  "data": [
    {
      "provider_id": "p1234567-...",
      "provider_name": "OpenAI Production",
      "provider_type": "openai",
      "models": [
        {
          "id": "gpt-4o",
          "model_name": "gpt-4o",
          "display_name": "GPT-4o",
          "model_type": "NORMAL",
          "context_window": 128000,
          "is_selected": true
        }
      ]
    }
  ],
  "total": 50,
  "page": 1
}

List Embedding Models

Endpoint: GET /llm-inference/embedding-models

Response:

json
[
  {
    "id": "text-embedding-3-large",
    "model_name": "text-embedding-3-large",
    "display_name": "Text Embedding 3 Large",
    "model_type": "EMBEDDING",
    "provider_type": "openai",
    "dimensions": 3072,
    "context_window": 8191
  }
]

Error Responses

All errors follow a consistent format:

400 Bad Request

json
{
  "detail": "Invalid file type. Only PDF files are supported."
}

401 Unauthorized

json
{
  "detail": "Could not validate credentials"
}

403 Forbidden

json
{
  "detail": "You need editor or owner role to upload files."
}

404 Not Found

json
{
  "detail": "Collection not found"
}

413 Quota Exceeded

json
{
  "detail": "Storage quota exceeded. You have used 9.8GB of your 10GB limit.",
  "quota_info": {
    "current_usage_gb": 9.8,
    "limit_gb": 10.0,
    "percentage_used": 98.0
  }
}

429 Rate Limited

json
{
  "detail": "You have reached the maximum concurrent jobs limit (5)."
}
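
Because every error body carries a detail field, a client can surface failures uniformly. A minimal Python sketch (the check helper is illustrative, not part of any SDK):

python
import requests

def check(resp: requests.Response) -> dict:
    """Return the JSON body, or raise an error built from the API's detail field."""
    if resp.ok:
        return resp.json()
    try:
        detail = resp.json().get("detail", resp.text)
    except ValueError:  # non-JSON error body
        detail = resp.text
    raise RuntimeError(f"HTTP {resp.status_code}: {detail}")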

Best Practices

Authentication

Do:

  • Use API keys for server-to-server integrations
  • Store tokens securely in environment variables
  • Refresh tokens before expiration
  • Use Bearer tokens for frontend applications

Don't:

  • Commit API keys to version control
  • Share API keys between environments
  • Use expired tokens

File Uploads

Do:

  • Use streaming endpoint for real-time progress
  • Handle all progress packet types
  • Validate file size before upload
  • Check storage quota

Don't:

  • Upload files larger than 100MB
  • Ignore error responses
  • Skip progress monitoring

RAG Queries

Do:

  • Start with similarity_threshold: 0.7 for accuracy
  • Use top_k: 10-15 for balanced context
  • Enable agentic_rag_enabled for automatic strategy selection
  • Use Deep Research for complex multi-source queries

Don't:

  • Set similarity_threshold too low (< 0.4)
  • Retrieve too many chunks (top_k > 30)
  • Use Deep Research for simple factual queries

Streaming Responses

Do:

  • Use --no-buffer with curl
  • Handle all packet types
  • Implement reconnection logic
  • Parse NDJSON line-by-line

Don't:

  • Buffer the entire response
  • Ignore stream_end packets
  • Skip error handling

Rate Limits

| Category | Limit |
|---|---|
| API Key | 1000 requests/hour |
| Bearer Token | 5000 requests/hour |
| Document Upload | 50 uploads/hour |
| Chat Requests | 200 requests/hour |

Headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1610000000
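
Assuming X-RateLimit-Reset is a Unix timestamp, as the example value suggests, a client can sleep until the window resets before retrying. A minimal Python sketch (the retry-once policy is illustrative):

python
import time

import requests

def get_with_rate_limit(url: str, headers: dict) -> requests.Response:
    """Retry once after sleeping until the rate-limit window resets on HTTP 429."""
    resp = requests.get(url, headers=headers)
    if resp.status_code == 429:
        reset = int(resp.headers.get("X-RateLimit-Reset", "0"))  # Unix epoch seconds
        time.sleep(max(0.0, reset - time.time()))
        resp = requests.get(url, headers=headers)
    return resp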

Pagination

List endpoints return paginated results:

Request:

bash
curl "http://localhost:8090/api/v1/documents?page=2&limit=50"

Response:

json
{
  "data": [...],
  "pagination": {
    "page": 2,
    "limit": 50,
    "total": 150,
    "has_more": true
  }
}
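
To walk every page, follow has_more until it turns false. A minimal Python sketch (it assumes the generic data/pagination shape shown above; some endpoints key the item list as collections or documents instead):

python
import requests

def iter_pages(url: str, headers: dict, limit: int = 50):
    """Yield every item from a paginated list endpoint."""
    page = 1
    while True:
        resp = requests.get(url, headers=headers,
                            params={"page": page, "limit": limit})
        resp.raise_for_status()
        body = resp.json()
        yield from body["data"]
        if not body["pagination"]["has_more"]:
            break
        page += 1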


Interactive Docs

For the most up-to-date API reference with "Try it out" functionality, visit /docs when running Scrapalot locally.

Version: 1.0.0 · Last Updated: December 18, 2025
