AI Model Selection
Choose the AI models that best fit your needs - from powerful cloud models to privacy-focused local options. Scrapalot supports all major providers and self-hosted alternatives.
Your Model, Your Choice
Scrapalot is model-agnostic. Use any combination of:
- Cloud models for best quality and speed
- Local models for privacy and cost savings
- Self-hosted models for complete control
- Mix and match different models for different tasks (see the configuration sketch below)
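For instance, a per-task routing table might look like the following sketch. The keys, provider names, and `pick_model` helper here are illustrative, not Scrapalot's actual configuration schema:

```python
# Hypothetical per-task model routing; Scrapalot's real settings keys may differ.
MODEL_CONFIG = {
    "chat":       {"provider": "openai",    "model": "gpt-3.5-turbo"},
    "reasoning":  {"provider": "anthropic", "model": "claude-3-opus-20240229"},
    "embeddings": {"provider": "local",     "model": "sentence-transformers/all-MiniLM-L6-v2"},
}

def pick_model(task: str) -> dict:
    """Fall back to the default chat model for unknown tasks."""
    return MODEL_CONFIG.get(task, MODEL_CONFIG["chat"])
```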
Cloud Model Providers
OpenAI
Models:
- GPT-4, GPT-4 Turbo (best quality)
- GPT-3.5 Turbo (fast and economical)
- text-embedding-ada-002 (embeddings)
Best for:
- Highest quality answers
- Complex reasoning tasks
- Production deployments
- When cost is secondary to quality
Pricing: Pay per use, competitive rates
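As a reference point, a minimal chat call with the official `openai` Python SDK (v1 client, API key read from the `OPENAI_API_KEY` environment variable) looks like this; how Scrapalot wires this up internally may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation in one sentence."}],
)
print(response.choices[0].message.content)
```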
Anthropic Claude
Models:
- Claude 3 Opus (highest quality)
- Claude 3 Sonnet (balanced)
- Claude 3 Haiku (fastest)
Best for:
- Very long documents (200K token context)
- Safe, aligned responses
- Detailed explanations
- Following complex instructions
Pricing: Pay per use, tiered by model
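The equivalent call with the `anthropic` SDK is similar; note that `max_tokens` is required, and the dated model identifier below is the one current at the time of writing:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain context windows briefly."}],
)
print(message.content[0].text)
```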
Google Gemini
Models:
- Gemini Pro
- Gemini Ultra
Best for:
- Multimodal content (text and images)
- Large context windows
- Cost-effective alternative
- Google ecosystem integration
Pricing: Competitive, with a generous free tier
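A minimal sketch with the `google-generativeai` package (this SDK has evolved quickly, so treat the exact calls as a snapshot and check Google's current docs):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("What is an embedding model?")
print(response.text)
```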
Local & Self-Hosted Options
Local GGUF Models
What: Run AI models directly on your server
Benefits:
- Your data never leaves your infrastructure
- No API costs
- No rate limits
- Complete control
Popular models:
- Llama 2 (7B, 13B, 70B)
- Mistral 7B and Mixtral 8x7B
- Phi-3 (small and fast)
- CodeLlama (code-focused)
Requirements:
- GPU recommended (NVIDIA preferred)
- CPU-only possible but slower
- 16GB+ RAM recommended
- Storage for model files (5-50GB per model)
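One common way to run GGUF files is `llama-cpp-python`; a minimal sketch, where the model path is a placeholder for whatever GGUF file you downloaded:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
    n_ctx=4096,       # context window size
)
output = llm("Q: What does RAG stand for? A:", max_tokens=64, stop=["\n"])
print(output["choices"][0]["text"])
```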
Ollama
What: Local model server with easy management
Benefits:
- Simple model download and switching
- Automatic model management
- Good performance out of the box
- Active community
Best for:
- Development and testing
- Privacy-sensitive deployments
- Cost control
- Learning and experimentation
Setup: Install Ollama, pull a model (e.g. `ollama pull mistral`), and query it over the local HTTP API, as in the sketch below
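Once a model is pulled, the Ollama server listens on localhost:11434 and can be queried with plain HTTP:

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why use local models?", "stream": False},
    timeout=300,
)
print(response.json()["response"])
```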
LM Studio
What: Desktop app for running local models
Benefits:
- User-friendly GUI
- GPU acceleration built-in
- Easy model browsing
- Great for getting started
Best for:
- Personal use
- Small teams
- Development
- Model experimentation
vLLM
What: Production-grade local inference server
Benefits:
- Optimized for throughput
- Advanced GPU utilization
- Production-ready
- Scales well
Best for:
- Large deployments
- High concurrent users
- Maximum performance
- Enterprise use
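vLLM exposes an OpenAI-compatible endpoint, so the same `openai` client shown earlier can point at your own server. A sketch, assuming the server was started with vLLM's OpenAI API entrypoint:

```python
from openai import OpenAI

# Assumes a vLLM server started with its OpenAI-compatible entrypoint, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello from vLLM"}],
)
print(response.choices[0].message.content)
```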
Choosing Models
For Chat (Answer Generation)
High Quality Needed:
- Cloud: GPT-4, Claude 3 Opus
- Local: Llama 2 70B, Mixtral 8x7B
- When: Complex questions, critical accuracy
Balanced Performance:
- Cloud: GPT-3.5 Turbo, Claude 3 Sonnet
- Local: Llama 2 13B, Mistral 7B
- When: Most use cases, good quality/speed trade-off
Speed Priority:
- Cloud: GPT-3.5 Turbo, Claude 3 Haiku
- Local: Llama 2 7B, Phi-3
- When: High volume, fast responses needed
For Embeddings (Search)
Cloud Options:
- OpenAI text-embedding-ada-002 (1536 dimensions)
- OpenAI text-embedding-3-small (1536 dimensions)
- OpenAI text-embedding-3-large (3072 dimensions)
Local Options:
- sentence-transformers/all-MiniLM-L6-v2 (fast, 384 dim)
- sentence-transformers/all-mpnet-base-v2 (quality, 768 dim)
- BAAI/bge-large-en-v1.5 (high quality, 1024 dim)
Recommendation: Start with all-MiniLM-L6-v2 (local) or ada-002 (cloud)
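To see what local embeddings look like in practice, here is a minimal `sentence-transformers` sketch using the recommended starter model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
docs = ["Scrapalot indexes your documents.", "Embeddings power semantic search."]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # shape (2, 384)
query_vec = model.encode("How does search work?", normalize_embeddings=True)
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
print(scores)  # higher score = closer match
```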
Hardware Requirements
For Local Models
Minimum (CPU only):
- 16GB RAM
- Any modern CPU
- 50GB disk space
- Models: 7B parameters or smaller
Recommended (GPU):
- 16GB+ RAM
- NVIDIA GPU with 8GB+ VRAM
- 100GB disk space
- Models: up to 13B parameters
High Performance (Multi-GPU):
- 32GB+ RAM
- Multiple NVIDIA GPUs (16GB+ each)
- 200GB+ disk space
- Models: 70B+ parameters
Performance Guide
| Model Size | CPU RAM | GPU VRAM | Speed | Quality |
|---|---|---|---|---|
| 3B | 8GB | 4GB | Fast | Good |
| 7B | 16GB | 8GB | Medium | Better |
| 13B | 32GB | 16GB | Slower | Great |
| 70B | 128GB | 40GB+ | Slow | Best |
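These figures follow from a simple rule of thumb: memory roughly equals parameter count times bytes per weight, plus runtime overhead. A sketch of that arithmetic, where the 4.5 bits per weight and 20% overhead are loose assumptions for Q4-style quantization:

```python
def estimated_memory_gb(params_billions: float,
                        bits_per_weight: float = 4.5,
                        overhead: float = 1.2) -> float:
    """Rough estimate: weights * quantization width * runtime overhead.

    bits_per_weight: ~16 for fp16, ~4.5 for Q4_K_M-style quantization (assumption).
    overhead: loose 20% allowance for KV cache and buffers (assumption).
    """
    weight_gb = params_billions * bits_per_weight / 8  # 1e9 params * bytes each
    return weight_gb * overhead

for size in (3, 7, 13, 70):
    print(f"{size}B @ 4-bit: ~{estimated_memory_gb(size):.1f} GB")
```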
Model Caching
What it does: Keeps frequently used models loaded in memory so responses stay fast
Benefits:
- Cached models answer in 3-5 seconds, versus ~194 seconds when a model is loaded from scratch
- Automatic management
- No configuration needed
How it helps:
- Dramatically faster repeated queries
- Better user experience
- Efficient resource use
- Supports multiple models
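Conceptually, the cache behaves like a small least-recently-used map from model name to loaded model. A sketch of the idea (illustrative only; Scrapalot's actual eviction policy is not specified here):

```python
from collections import OrderedDict

class ModelCache:
    """Tiny LRU cache for loaded models (illustrative sketch, not
    Scrapalot's actual implementation)."""

    def __init__(self, max_models=2):
        self.max_models = max_models
        self._models = OrderedDict()  # model name -> loaded model object

    def get(self, name, loader):
        if name in self._models:
            self._models.move_to_end(name)    # mark as recently used
            return self._models[name]
        model = loader(name)                  # expensive: disk -> RAM/VRAM
        self._models[name] = model
        if len(self._models) > self.max_models:
            self._models.popitem(last=False)  # evict least recently used
        return model
```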
Cost Considerations
Cloud Models
Pricing structure:
- Pay per token (input + output)
- GPT-4: Most expensive, best quality
- GPT-3.5: Economical, good quality
- Embeddings: Very low cost
Typical costs:
- 1000 documents embedded: $1-5
- 1000 chat queries: $10-100 (varies by model)
- Monthly for active use: $50-500 (depends on volume)
Local Models
One-time costs:
- GPU hardware: $500-5000
- Setup time: a few hours
- Model downloads: Free
Ongoing costs:
- Electricity only
- No per-query charges
- No rate limits
- Scales with hardware
Break-even: Typically 3-6 months for moderate use
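That estimate is simple arithmetic: divide the hardware cost by the monthly cloud spend it replaces, net of electricity. A sketch with assumed numbers:

```python
gpu_cost = 1500.0      # one-time hardware purchase, USD (assumption)
cloud_monthly = 300.0  # cloud API spend being replaced, USD/month (assumption)
power_monthly = 25.0   # electricity to run the GPU, USD/month (assumption)

months_to_break_even = gpu_cost / (cloud_monthly - power_monthly)
print(f"Break-even in ~{months_to_break_even:.1f} months")  # ~5.5 here
```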
Switching Models
Easy Migration
Change chat model:
- Select new model in settings
- Existing conversations unaffected
- Takes effect immediately
Change embedding model:
- Update collection settings
- Reprocess documents with new embeddings
- Search results will change; quality often improves with a better model
Mix and match:
- Different models per collection
- Chat and embeddings can use different providers
- Optimize for each use case
Best Practices
Model Selection
Start simple:
- Begin with a cloud model (e.g. GPT-3.5 Turbo)
- Evaluate quality and cost
- Try local models if privacy/cost matters
- Optimize based on actual usage
Optimize for use case:
- Customer support: Fast models (GPT-3.5, Llama 2 7B)
- Research: High quality (GPT-4, Claude Opus)
- Internal docs: Local models for privacy
- Public content: Cloud models for scale
Performance Tuning
For local models:
- Start with smaller models (7B)
- Increase size if quality insufficient
- Use GPU acceleration when available
- Monitor memory usage
For cloud models:
- Use cheaper models for embeddings
- Reserve expensive models for complex queries
- Monitor API costs
- Set reasonable rate limits
Privacy & Compliance
When privacy matters:
- Use local models exclusively
- Self-host embeddings and chat
- Data never leaves your infrastructure
- Full audit trail
When cloud is acceptable:
- Use encrypted connections
- Review provider privacy policies
- Consider data residency requirements
- Understand data retention
Troubleshooting
Model Loading Slow
For local models:
- Check available RAM/VRAM
- Reduce model size
- Enable model caching
- Use faster storage (SSD)
Poor Answer Quality
Try:
- Switch to larger/better model
- Adjust temperature settings
- Improve document chunking
- Use better embedding model
Out of Memory
Solutions:
- Use smaller model
- Reduce context window
- Add more RAM/VRAM
- Use quantized models (Q4, Q5)
API Errors
Check:
- API key is valid
- Rate limits not exceeded
- Sufficient API credits
- Service status
Related Documentation
- RAG Strategy - How models power retrieval
- Document Processing - Embedding usage
- Database Design - Model configuration storage
- Deployment Guide - Production model setup
Scrapalot works with any model. Start with what's convenient, optimize as you learn your usage patterns. You're never locked in.