AI Model Selection

Last Updated: March 2026

Choose the AI models that best fit your needs, from powerful cloud models to privacy-focused local options. Scrapalot supports OpenAI, Anthropic, Google Gemini, DeepSeek, and any OpenAI-compatible provider.

Your Model, Your Choice

Scrapalot is model-agnostic. Use any combination of the following (a configuration sketch follows the list):

  • Cloud models for best quality and speed
  • Local models for privacy and cost savings
  • Self-hosted models for complete control
  • A mix of different models for different tasks
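
For example, you might pair a cloud chat model with local embeddings and override both for a sensitive collection. The snippet below is a hypothetical sketch of that idea; Scrapalot's actual configuration keys and settings UI may differ.

```python
# Hypothetical illustration only -- Scrapalot's real configuration
# keys and structure may differ. The point: chat and embeddings are
# independent choices, and each collection can override them.
model_config = {
    "chat": {
        "provider": "openai",          # cloud model for answer quality
        "model": "gpt-4-turbo",
    },
    "embeddings": {
        "provider": "local",           # local embeddings for privacy/cost
        "model": "sentence-transformers/all-MiniLM-L6-v2",
    },
    "collections": {
        "internal-docs": {             # per-collection override
            "chat": {"provider": "ollama", "model": "llama2:13b"},
        },
    },
}
```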

Cloud Model Providers

OpenAI

Models:

  • GPT-4, GPT-4 Turbo (best quality)
  • GPT-3.5 Turbo (fast and economical)
  • text-embedding-ada-002 (embeddings)

Best for:

  • Highest quality answers
  • Complex reasoning tasks
  • Production deployments
  • When cost is secondary to quality

Pricing: Pay per use, competitive rates
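
Because Scrapalot works with OpenAI and any OpenAI-compatible provider, one client pattern covers them all. Here is a minimal sketch with the official `openai` Python SDK; the `base_url` override shown in the comment is how compatible providers (DeepSeek, self-hosted servers) are reached, and the placeholder values are illustrative:

```python
from openai import OpenAI

# Standard OpenAI usage. For any OpenAI-compatible provider, pass its
# endpoint instead: OpenAI(base_url="https://your-provider/v1", api_key="...")
client = OpenAI(api_key="sk-...")  # placeholder key

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize this document."}],
)
print(response.choices[0].message.content)
```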

Anthropic Claude

Models:

  • Claude 3 Opus (highest quality)
  • Claude 3 Sonnet (balanced)
  • Claude 3 Haiku (fastest)

Best for:

  • Very long documents (200K token context)
  • Safe, aligned responses
  • Detailed explanations
  • Following complex instructions

Pricing: Pay per use, tiered by model

Google Gemini

Models:

  • Gemini Pro
  • Gemini Ultra

Best for:

  • Multimodal content (text and images)
  • Large context windows
  • Cost-effective alternative
  • Google ecosystem integration

Pricing: Competitive with generous free tier

Local & Self-Hosted Options

Local GGUF Models

What: Run AI models directly on your server

Privacy benefits:

  • Your data never leaves your infrastructure
  • No API costs
  • No rate limits
  • Complete control

Popular models:

  • Llama 2 (7B, 13B, 70B)
  • Mistral (7B, 8x7B)
  • Phi-3 (small and fast)
  • CodeLlama (code-focused)

Requirements:

  • GPU recommended (NVIDIA preferred)
  • CPU-only possible but slower
  • 16GB+ RAM recommended
  • Storage for model files (5-50GB per model)
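
As a concrete example of the GGUF route, here is a minimal sketch using the `llama-cpp-python` library, one common way to run GGUF files. The model path is a placeholder, and Scrapalot's internal loader may differ:

```python
from llama_cpp import Llama

# Load a quantized GGUF model from local disk; n_gpu_layers=-1
# offloads all layers to the GPU when one is available.
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # set to 0 for CPU-only inference
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is RAG?"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```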

Ollama

What: Local model server with easy management

Benefits:

  • Simple model download and switching
  • Automatic model management
  • Good performance out of the box
  • Active community

Best for:

  • Development and testing
  • Privacy-sensitive deployments
  • Cost control
  • Learning and experimentation

Setup: Install Ollama, run models locally
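
Ollama also exposes an OpenAI-compatible endpoint on port 11434, so the client pattern from the cloud section works unchanged. A minimal sketch, assuming you have already run `ollama pull llama2`:

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API at /v1; the API key is
# required by the client but ignored by Ollama itself.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama2",  # any model previously pulled with `ollama pull`
    messages=[{"role": "user", "content": "Hello from a local model!"}],
)
print(response.choices[0].message.content)
```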

LM Studio

What: Desktop app for running local models

Benefits:

  • User-friendly GUI
  • GPU acceleration built-in
  • Easy model browsing
  • Great for getting started

Best for:

  • Personal use
  • Small teams
  • Development
  • Model experimentation

vLLM

What: Production-grade local inference server

Benefits:

  • Optimized for throughput
  • Advanced GPU utilization
  • Production-ready
  • Scales well

Best for:

  • Large deployments
  • High concurrent users
  • Maximum performance
  • Enterprise use
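
Beyond its OpenAI-compatible server mode, vLLM's offline API illustrates the throughput focus: prompts are submitted as a batch and scheduled together. A minimal sketch (the model name is an example; any Hugging Face model vLLM supports works):

```python
from vllm import LLM, SamplingParams

# Batched offline inference -- vLLM schedules all prompts together
# for maximum GPU utilization.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What is RAG?", "Summarize vector search."], params)
for out in outputs:
    print(out.outputs[0].text)
```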

Choosing Models

For Chat (Answer Generation)

High Quality Needed:

  • Cloud: GPT-4, Claude 3 Opus
  • Local: Llama 2 70B, Mixtral 8x7B
  • When: Complex questions, critical accuracy

Balanced Performance:

  • Cloud: GPT-3.5 Turbo, Claude 3 Sonnet
  • Local: Llama 2 13B, Mistral 7B
  • When: Most use cases, good quality/speed trade-off

Speed Priority:

  • Cloud: GPT-3.5 Turbo, Claude 3 Haiku
  • Local: Llama 2 7B, Phi-3
  • When: High volume, fast responses needed

For Embeddings

Cloud Options:

  • OpenAI text-embedding-ada-002 (1536 dimensions)
  • OpenAI text-embedding-3-small (1536 dimensions)
  • OpenAI text-embedding-3-large (3072 dimensions)

Local Options:

  • sentence-transformers/all-MiniLM-L6-v2 (fast, 384 dim)
  • sentence-transformers/all-mpnet-base-v2 (quality, 768 dim)
  • BAAI/bge-large-en-v1.5 (high quality, 1024 dim)

Recommendation: Start with all-MiniLM-L6-v2 (local) or ada-002 (cloud)
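
A minimal sketch of the local recommendation using the `sentence-transformers` library; the model downloads automatically on first use, then runs entirely offline:

```python
from sentence_transformers import SentenceTransformer

# Small, fast local embedding model -- the recommended starting point.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

embeddings = model.encode(["What is Scrapalot?", "Model selection guide"])
print(embeddings.shape)  # (2, 384) -- one 384-dim vector per input
```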

Hardware Requirements

For Local Models

Minimum (CPU only):

  • 16GB RAM
  • Any modern CPU
  • 50GB disk space
  • Models: 7B parameters or smaller

Recommended (GPU):

  • 16GB+ RAM
  • NVIDIA GPU with 8GB+ VRAM
  • 100GB disk space
  • Models: up to 13B parameters

High Performance (Multi-GPU):

  • 32GB+ RAM
  • Multiple NVIDIA GPUs (16GB+ each)
  • 200GB+ disk space
  • Models: 70B+ parameters

Performance Guide

Model Size | CPU RAM | GPU VRAM | Speed  | Quality
3B         | 8GB     | 4GB      | Fast   | Good
7B         | 16GB    | 8GB      | Medium | Better
13B        | 32GB    | 16GB     | Slower | Great
70B        | 128GB   | 40GB+    | Slow   | Best
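
The VRAM column follows from a rule of thumb: memory ≈ parameter count × bytes per weight, plus overhead for the KV cache and activations. A back-of-envelope sketch (the 1.2 overhead factor is an assumption for illustration, not a measured constant):

```python
def approx_model_memory_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Back-of-envelope estimate: weights plus ~20% overhead for the
    KV cache and activations (the 1.2 factor is a rough assumption)."""
    return params_billion * bytes_per_weight * 1.2

# FP16 = 2 bytes/weight; 4-bit quantized (Q4) = ~0.5 bytes/weight
print(approx_model_memory_gb(7, 2.0))   # ~16.8 GB -- 7B at FP16
print(approx_model_memory_gb(7, 0.5))   # ~4.2 GB  -- 7B quantized to Q4
print(approx_model_memory_gb(70, 0.5))  # ~42 GB   -- why 70B needs 40GB+ VRAM
```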

Model Caching

What it does: Keeps frequently used models loaded in memory so repeated queries respond in seconds

What to expect:

  • First query: ~194 seconds while the model loads
  • Subsequent queries: 3-5 second response times
  • Management is automatic; no configuration needed

How it helps:

  • Dramatically faster repeated queries
  • Better user experience
  • Efficient resource use
  • Supports multiple models
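
Conceptually, this is a size-bounded, least-recently-used map from model name to loaded model. The sketch below illustrates the idea only; it is not Scrapalot's actual implementation, and `load_model` is a hypothetical stand-in for the expensive load step:

```python
from collections import OrderedDict

def load_model(name: str) -> object:
    """Hypothetical stand-in for the expensive load step (GGUF read,
    GPU transfer); in reality this is what takes minutes."""
    return f"<model {name}>"

class ModelCache:
    """Illustrative LRU cache for loaded models -- not Scrapalot's code."""

    def __init__(self, max_models: int = 2):
        self.max_models = max_models
        self._cache: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str):
        if name in self._cache:
            self._cache.move_to_end(name)    # mark as recently used
            return self._cache[name]         # hit: seconds, not minutes
        model = load_model(name)             # miss: pays the full load cost
        self._cache[name] = model
        if len(self._cache) > self.max_models:
            self._cache.popitem(last=False)  # evict least recently used
        return model
```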

Cost Considerations

Cloud Models

Pricing structure:

  • Pay per token (input + output)
  • GPT-4: Most expensive, best quality
  • GPT-3.5: Economical, good quality
  • Embeddings: Very low cost

Typical costs:

  • 1000 documents embedded: $1-5
  • 1000 chat queries: $10-100 (varies by model)
  • Monthly for active use: $50-500 (depends on volume)

Local Models

One-time costs:

  • GPU hardware: $500-5000
  • Setup time: A few hours
  • Model downloads: Free

Ongoing costs:

  • Electricity only
  • No per-query charges
  • No rate limits
  • Scales with hardware

Break-even: Typically 3-6 months for moderate use
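
The break-even estimate is simple division: one-time hardware cost over the monthly cloud spend it replaces. Using mid-range values from the figures above (both inputs are illustrative):

```python
gpu_cost = 1500        # one-time hardware, mid-range of the $500-5000 estimate
cloud_monthly = 300    # avoided API spend, mid-range of the $50-500 estimate

months_to_break_even = gpu_cost / cloud_monthly
print(f"{months_to_break_even:.0f} months")  # 5 months -- within the 3-6 month range
```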

Switching Models

Easy Migration

Change chat model:

  • Select new model in settings
  • Existing conversations unaffected
  • Takes effect immediately

Change embedding model:

  • Update collection settings
  • Reprocess documents with new embeddings
  • Search quality may improve or change

Mix and match:

  • Different models per collection
  • Chat and embeddings can use different providers
  • Optimize for each use case

Best Practices

Model Selection

Start simple:

  1. Begin with a cloud model (GPT-3.5 Turbo)
  2. Evaluate quality and cost
  3. Try local models if privacy/cost matters
  4. Optimize based on actual usage

Optimize for use case:

  • Customer support: Fast models (GPT-3.5, Llama 2 7B)
  • Research: High quality (GPT-4, Claude Opus)
  • Internal docs: Local models for privacy
  • Public content: Cloud models for scale

Performance Tuning

For local models:

  • Start with smaller models (7B)
  • Increase size if quality insufficient
  • Use GPU acceleration when available
  • Monitor memory usage

For cloud models:

  • Use cheaper models for embeddings
  • Reserve expensive models for complex queries (see the routing sketch below)
  • Monitor API costs
  • Set reasonable rate limits
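
One way to reserve expensive models for complex queries is a small router in front of the chat call. The heuristic below is deliberately naive and entirely hypothetical; real routing might use a classifier or per-collection settings:

```python
def pick_chat_model(question: str) -> str:
    """Hypothetical cost-aware routing: cheap model by default,
    expensive model only for long or complex questions."""
    complex_markers = ("compare", "analyze", "explain why", "step by step")
    is_complex = len(question) > 300 or any(
        m in question.lower() for m in complex_markers
    )
    return "gpt-4-turbo" if is_complex else "gpt-3.5-turbo"

print(pick_chat_model("What's our refund policy?"))              # gpt-3.5-turbo
print(pick_chat_model("Compare these two contracts step by step"))  # gpt-4-turbo
```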

Privacy & Compliance

When privacy matters:

  • Use local models exclusively
  • Self-host embeddings and chat
  • Data never leaves your infrastructure
  • Full audit trail

When cloud is acceptable:

  • Use encrypted connections
  • Review provider privacy policies
  • Consider data residency requirements
  • Understand data retention

Troubleshooting

Model Loading Slow

For local models:

  • Check available RAM/VRAM
  • Reduce model size
  • Enable model caching
  • Use faster storage (SSD)

Poor Answer Quality

Try:

  • Switch to larger/better model
  • Adjust temperature settings
  • Improve document chunking
  • Use better embedding model

Out of Memory

Solutions:

  • Use smaller model
  • Reduce context window
  • Add more RAM/VRAM
  • Use quantized models (Q4, Q5)

API Errors

Check:

  • API key is valid
  • Rate limits not exceeded
  • Sufficient API credits
  • Service status

Cross-Service Model Sync

Model providers and their models are owned by the Python AI service and synced to the Kotlin backend via Redis Streams SAGA (P→K direction). This ensures the Kotlin backend always has up-to-date model resolution data without directly querying Python's database.
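
For intuition, the producer side of such a hand-off looks roughly like the sketch below, using the `redis-py` client. The stream name and payload fields here are hypothetical; the actual SAGA event schema is defined by the services themselves:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Python side (producer): publish a model-catalog change to a stream.
# Stream name and fields are hypothetical, for illustration only.
r.xadd("models.sync", {
    "event": "model_upserted",
    "payload": json.dumps({"provider": "openai", "model": "gpt-4-turbo"}),
})

# The Kotlin backend would consume the same stream (XREADGROUP) and
# update its local model-resolution tables.
```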


Scrapalot works with any model. Start with what's convenient, then optimize as you learn your usage patterns. You're never locked in.
