AI Model Selection

Choose the AI models that best fit your needs, from powerful cloud models to privacy-focused local options. Scrapalot supports all major providers and self-hosted alternatives.

Your Model, Your Choice

Scrapalot is model-agnostic. Use any combination of:

  • Cloud models for best quality and speed
  • Local models for privacy and cost savings
  • Self-hosted models for complete control
  • Mix and match different models for different tasks

Cloud Model Providers

OpenAI

Models:

  • GPT-4, GPT-4 Turbo (best quality)
  • GPT-3.5 Turbo (fast and economical)
  • text-embedding-ada-002 (embeddings)

Best for:

  • Highest quality answers
  • Complex reasoning tasks
  • Production deployments
  • When cost is secondary to quality

Pricing: Pay per use, competitive rates
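
If you call OpenAI directly, the official Python SDK is the usual route. A minimal sketch of both chat and embeddings, using model names from the lists above (the prompts are placeholders):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Chat: economical default; swap in "gpt-4" when quality matters most
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize this document chunk: ..."}],
)
print(response.choices[0].message.content)

# Embeddings: text-embedding-ada-002 returns 1536-dimensional vectors
embedding = client.embeddings.create(
    model="text-embedding-ada-002",
    input="a chunk of document text",
)
print(len(embedding.data[0].embedding))  # 1536
```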

Anthropic Claude

Models:

  • Claude 3 Opus (highest quality)
  • Claude 3 Sonnet (balanced)
  • Claude 3 Haiku (fastest)

Best for:

  • Very long documents (200K token context)
  • Safe, aligned responses
  • Detailed explanations
  • Following complex instructions

Pricing: Pay per use, tiered by model
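
Anthropic's Python SDK follows the same pattern. A minimal sketch (the dated model ID is one published Claude 3 Sonnet release; check Anthropic's documentation for current IDs):

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-sonnet-20240229",  # or an Opus/Haiku ID, per the list above
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this long report: ..."}],
)
print(message.content[0].text)
```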

Google Gemini

Models:

  • Gemini Pro
  • Gemini Ultra

Best for:

  • Multimodal content (text and images)
  • Large context windows
  • Cost-effective alternative
  • Google ecosystem integration

Pricing: Competitive with generous free tier
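
Google's google-generativeai package is similarly compact. A minimal sketch (the API key is a placeholder):

```python
# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Summarize the key points of this page: ...")
print(response.text)
```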

Local & Self-Hosted Options

Local GGUF Models

What: Run AI models directly on your server

Privacy benefits:

  • Your data never leaves your infrastructure
  • No API costs
  • No rate limits
  • Complete control

Popular models:

  • Llama 2 (7B, 13B, 70B)
  • Mistral 7B and Mixtral 8x7B (mixture of experts)
  • Phi-3 (small and fast)
  • CodeLlama (code-focused)

Requirements:

  • GPU recommended (NVIDIA preferred)
  • CPU-only possible but slower
  • 16GB+ RAM recommended
  • Storage for model files (5-50GB per model)
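
A common way to load a GGUF file in Python is llama-cpp-python. A minimal sketch, assuming you have already downloaded a quantized Mistral 7B file (the path is a placeholder):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window in tokens
    n_gpu_layers=-1,  # offload all layers to the GPU; set 0 for CPU-only
)

output = llm("Q: What is retrieval-augmented generation? A:", max_tokens=256)
print(output["choices"][0]["text"])
```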

Ollama

What: Local model server with easy management

Benefits:

  • Simple model download and switching
  • Automatic model management
  • Good performance out of the box
  • Active community

Best for:

  • Development and testing
  • Privacy-sensitive deployments
  • Cost control
  • Learning and experimentation

Setup: Install Ollama, pull a model, and run it locally.
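
After installing Ollama and pulling a model (for example, `ollama pull llama2` on the command line), the ollama Python package talks to the local server. A minimal sketch:

```python
# pip install ollama  -- requires the Ollama server running locally
import ollama

response = ollama.chat(
    model="llama2",  # any model previously pulled with `ollama pull`
    messages=[{"role": "user", "content": "What is a vector database?"}],
)
print(response["message"]["content"])
```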

LM Studio

What: Desktop app for running local models

Benefits:

  • User-friendly GUI
  • GPU acceleration built-in
  • Easy model browsing
  • Great for getting started

Best for:

  • Personal use
  • Small teams
  • Development
  • Model experimentation
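
LM Studio can also expose a local OpenAI-compatible server (by default on port 1234), so the same OpenAI client code works unchanged. A sketch, assuming the server is started from the app:

```python
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server; key is unused
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model you have loaded
    messages=[{"role": "user", "content": "Hello from a local model"}],
)
print(response.choices[0].message.content)
```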

vLLM

What: Production-grade local inference server

Benefits:

  • Optimized for throughput
  • Advanced GPU utilization
  • Production-ready
  • Scales well

Best for:

  • Large deployments
  • High concurrent users
  • Maximum performance
  • Enterprise use
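
For offline or batch work, vLLM's Python API is compact (it also ships an OpenAI-compatible HTTP server for production serving). A minimal sketch, assuming an NVIDIA GPU and a Hugging Face model ID:

```python
# pip install vllm  -- NVIDIA GPU required
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # any supported HF model ID
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize: retrieval-augmented generation ..."], params)
print(outputs[0].outputs[0].text)
```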

Choosing Models

For Chat (Answer Generation)

High Quality Needed:

  • Cloud: GPT-4, Claude 3 Opus
  • Local: Llama 2 70B, Mixtral 8x7B
  • When: Complex questions, critical accuracy

Balanced Performance:

  • Cloud: GPT-3.5 Turbo, Claude 3 Sonnet
  • Local: Llama 2 13B, Mistral 7B
  • When: Most use cases, good quality/speed trade-off

Speed Priority:

  • Cloud: GPT-3.5 Turbo, Claude 3 Haiku
  • Local: Llama 2 7B, Phi-3
  • When: High volume, fast responses needed

For Embeddings (Document Search)

Cloud Options:

  • OpenAI text-embedding-ada-002 (1536 dimensions)
  • OpenAI text-embedding-3-small (1536 dimensions)
  • OpenAI text-embedding-3-large (3072 dimensions)

Local Options:

  • sentence-transformers/all-MiniLM-L6-v2 (fast, 384 dim)
  • sentence-transformers/all-mpnet-base-v2 (quality, 768 dim)
  • BAAI/bge-large-en-v1.5 (high quality, 1024 dim)

Recommendation: Start with all-MiniLM-L6-v2 (local) or ada-002 (cloud)
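
A minimal sketch of the local route with sentence-transformers, using the recommended starter model:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = ["First document chunk.", "Second document chunk."]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384) -- 384 dimensions per chunk
```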

Hardware Requirements

For Local Models

Minimum (CPU only):

  • 16GB RAM
  • Any modern CPU
  • 50GB disk space
  • Models: 7B parameters or smaller

Recommended (GPU):

  • 16GB+ RAM
  • NVIDIA GPU with 8GB+ VRAM
  • 100GB disk space
  • Models: up to 13B parameters

High Performance (Multi-GPU):

  • 32GB+ RAM
  • Multiple NVIDIA GPUs (16GB+ each)
  • 200GB+ disk space
  • Models: 70B+ parameters

Performance Guide

| Model Size | CPU RAM | GPU VRAM | Speed  | Quality |
| ---------- | ------- | -------- | ------ | ------- |
| 3B         | 8GB     | 4GB      | Fast   | Good    |
| 7B         | 16GB    | 8GB      | Medium | Better  |
| 13B        | 32GB    | 16GB     | Slower | Great   |
| 70B        | 128GB   | 40GB+    | Slow   | Best    |
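
The RAM/VRAM columns follow a rough rule of thumb: a quantized model needs on the order of half a byte to one byte per parameter, plus working overhead. A back-of-the-envelope sketch (the 0.6 bytes/param and 1.2x overhead figures are assumptions for Q4-style quantization, not measured constants):

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 0.6,   # assumption: ~Q4 quantization
                     overhead: float = 1.2) -> float:  # assumption: KV cache, buffers
    """Very rough VRAM estimate for a quantized local model."""
    return params_billion * bytes_per_param * overhead

for size in (3, 7, 13, 70):
    print(f"{size}B ~ {estimate_vram_gb(size):.0f} GB VRAM (Q4)")
```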

Model Caching

What it does: Keeps frequently used models loaded in memory for instant responses

Benefits:

  • First query: slow (~194 seconds to load the model into memory)
  • Subsequent queries: 3-5 second responses from the already-loaded model
  • Automatic management
  • No configuration needed

How it helps:

  • Dramatically faster repeated queries
  • Better user experience
  • Efficient resource use
  • Supports multiple models
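
Scrapalot manages this for you, but the underlying idea is easy to illustrate: load the model once and keep the handle in a process-level cache. A simplified sketch of the concept (not Scrapalot's actual implementation; llama-cpp-python is used here just as an example backend):

```python
from functools import lru_cache

from llama_cpp import Llama


@lru_cache(maxsize=2)  # keep the two most recently used models in memory
def get_model(model_path: str) -> Llama:
    # Expensive load from disk: happens once per path (the slow first query)
    return Llama(model_path=model_path, n_gpu_layers=-1)

# First call loads the model; later calls return the cached instance instantly
llm = get_model("./models/mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
```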

Cost Considerations

Cloud Models

Pricing structure:

  • Pay per token (input + output)
  • GPT-4: Most expensive, best quality
  • GPT-3.5: Economical, good quality
  • Embeddings: Very low cost

Typical costs:

  • 1000 documents embedded: $1-5
  • 1000 chat queries: $10-100 (varies by model)
  • Monthly for active use: $50-500 (depends on volume)

Local Models

One-time costs:

  • GPU hardware: $500-5000
  • Setup time: a few hours
  • Model downloads: Free

Ongoing costs:

  • Electricity only
  • No per-query charges
  • No rate limits
  • Scales with hardware

Break-even: Typically 3-6 months for moderate use
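
The break-even figure is simple arithmetic: divide the one-time hardware cost by what you would otherwise pay the cloud provider each month. A sketch using the ranges quoted above:

```python
def break_even_months(hardware_cost: float, monthly_cloud_cost: float) -> float:
    """Months until local hardware pays for itself (electricity ignored)."""
    return hardware_cost / monthly_cloud_cost

# Mid-range GPU ($1500) vs. moderate cloud spend ($300/month)
print(break_even_months(1500, 300))  # 5.0 months
```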

Switching Models

Easy Migration

Change chat model:

  • Select new model in settings
  • Existing conversations unaffected
  • Takes effect immediately

Change embedding model:

  • Update collection settings
  • Reprocess documents with new embeddings
  • Search quality may improve or change

Mix and match:

  • Different models per collection
  • Chat and embeddings can use different providers
  • Optimize for each use case

Best Practices

Model Selection

Start simple:

  1. Begin with a cloud model (GPT-3.5 Turbo)
  2. Evaluate quality and cost
  3. Try local models if privacy/cost matters
  4. Optimize based on actual usage

Optimize for use case:

  • Customer support: Fast models (GPT-3.5, Llama 2 7B)
  • Research: High quality (GPT-4, Claude Opus)
  • Internal docs: Local models for privacy
  • Public content: Cloud models for scale

Performance Tuning

For local models:

  • Start with smaller models (7B)
  • Increase size if quality insufficient
  • Use GPU acceleration when available
  • Monitor memory usage

For cloud models:

  • Use cheaper models for embeddings
  • Reserve expensive models for complex queries (see the routing sketch below)
  • Monitor API costs
  • Set reasonable rate limits
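
One way to act on that routing advice is a small function that picks a model per request. A hypothetical sketch (the length threshold, keyword heuristic, and model names are illustrative, not a Scrapalot feature):

```python
def pick_model(question: str) -> str:
    """Route short factual questions to a cheap model; escalate complex ones."""
    complex_markers = ("compare", "analyze", "why", "step by step")
    if len(question) > 500 or any(m in question.lower() for m in complex_markers):
        return "gpt-4"          # expensive, highest quality
    return "gpt-3.5-turbo"      # economical, fast

print(pick_model("What is the capital of France?"))                   # gpt-3.5-turbo
print(pick_model("Compare Llama 2 70B and GPT-4 for legal review."))  # gpt-4
```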

Privacy & Compliance

When privacy matters:

  • Use local models exclusively
  • Self-host embeddings and chat
  • Data never leaves your infrastructure
  • Full audit trail

When cloud is acceptable:

  • Use encrypted connections
  • Review provider privacy policies
  • Consider data residency requirements
  • Understand data retention

Troubleshooting

Model Loading Slow

For local models:

  • Check available RAM/VRAM
  • Reduce model size
  • Enable model caching
  • Use faster storage (SSD)

Poor Answer Quality

Try:

  • Switch to larger/better model
  • Adjust temperature settings
  • Improve document chunking
  • Use better embedding model

Out of Memory

Solutions:

  • Use smaller model
  • Reduce context window
  • Add more RAM/VRAM
  • Use quantized models (Q4, Q5)

API Errors

Check:

  • API key is valid
  • Rate limits not exceeded
  • Sufficient API credits
  • Service status

Scrapalot works with any model. Start with what's convenient, optimize as you learn your usage patterns. You're never locked in.
