AI Model Selection
Choose the AI models that best fit your needs - from powerful cloud models to privacy-focused local options. Scrapalot supports all major providers and self-hosted alternatives.
Your Model, Your Choice
Scrapalot is model-agnostic. Use any combination of:
- Cloud models for best quality and speed
- Local models for privacy and cost savings
- Self-hosted models for complete control
- Mix and match different models for different tasks (see the configuration sketch below)
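For instance, a per-task routing table might look like the following sketch. The keys, provider names, and `pick_model` helper here are illustrative, not Scrapalot's actual configuration schema:

```python
# Hypothetical per-task model routing; Scrapalot's real settings keys may differ.
MODEL_CONFIG = {
    "chat":       {"provider": "openai",    "model": "gpt-3.5-turbo"},
    "reasoning":  {"provider": "anthropic", "model": "claude-3-opus-20240229"},
    "embeddings": {"provider": "local",     "model": "sentence-transformers/all-MiniLM-L6-v2"},
}

def pick_model(task: str) -> dict:
    """Fall back to the default chat model for unknown tasks."""
    return MODEL_CONFIG.get(task, MODEL_CONFIG["chat"])
```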
Cloud Model Providers
OpenAI
Models:
- GPT-4, GPT-4 Turbo (best quality)
- GPT-3.5 Turbo (fast and economical)
- text-embedding-ada-002 (embeddings)
Best for:
- Highest quality answers
- Complex reasoning tasks
- Production deployments
- When cost is secondary to quality
Pricing: Pay per use, competitive rates
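As a reference point, a minimal chat call with the official `openai` Python SDK (v1 client, API key read from the `OPENAI_API_KEY` environment variable) looks like this; how Scrapalot wires this up internally may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation in one sentence."}],
)
print(response.choices[0].message.content)
```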
Anthropic Claude
Models:
- Claude 3 Opus (highest quality)
- Claude 3 Sonnet (balanced)
- Claude 3 Haiku (fastest)
Best for:
- Very long documents (200K token context)
- Safe, aligned responses
- Detailed explanations
- Following complex instructions
Pricing: Pay per use, tiered by model
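The equivalent call with the `anthropic` SDK is similar; note that `max_tokens` is required, and the dated model identifier below is the one current at the time of writing:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain context windows briefly."}],
)
print(message.content[0].text)
```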
Google Gemini
Models:
- Gemini Pro
- Gemini Ultra
Best for:
- Multimodal content (text and images)
- Large context windows
- Cost-effective alternative
- Google ecosystem integration
Pricing: Competitive, with a generous free tier
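A minimal sketch with the `google-generativeai` package (this SDK has evolved quickly, so treat the exact calls as a snapshot and check Google's current docs):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("What is an embedding model?")
print(response.text)
```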
Local & Self-Hosted Options
Local GGUF Models
What: Run AI models directly on your server
Benefits:
- Your data never leaves your infrastructure
- No API costs
- No rate limits
- Complete control
Popular models:
- Llama 2 (7B, 13B, 70B)
- Mistral 7B and Mixtral 8x7B
- Phi-3 (small and fast)
- CodeLlama (code-focused)
Requirements:
- GPU recommended (NVIDIA preferred)
- CPU-only possible but slower
- 16GB+ RAM recommended
- Storage for model files (5-50GB per model)
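One common way to run GGUF files is `llama-cpp-python`; a minimal sketch, where the model path is a placeholder for whatever GGUF file you downloaded:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
    n_ctx=4096,       # context window size
)
output = llm("Q: What does RAG stand for? A:", max_tokens=64, stop=["\n"])
print(output["choices"][0]["text"])
```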
Ollama
What: Local model server with easy management
Benefits:
- Simple model download and switching
- Automatic model management
- Good performance out of the box
- Active community
Best for:
- Development and testing
- Privacy-sensitive deployments
- Cost control
- Learning and experimentation
Setup: Install Ollama, pull a model (e.g. `ollama pull mistral`), and query it over the local HTTP API, as in the sketch below
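Once a model is pulled, the Ollama server listens on localhost:11434 and can be queried with plain HTTP:

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why use local models?", "stream": False},
    timeout=300,
)
print(response.json()["response"])
```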
LM Studio
What: Desktop app for running local models
Benefits:
- User-friendly GUI
- GPU acceleration built-in
- Easy model browsing
- Great for getting started
Best for:
- Personal use
- Small teams
- Development
- Model experimentation
vLLM
What: Production-grade local inference server
Benefits:
- Optimized for throughput
- Advanced GPU utilization
- Production-ready
- Scales well
Best for:
- Large deployments
- High concurrent users
- Maximum performance
- Enterprise use
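vLLM exposes an OpenAI-compatible endpoint, so the same `openai` client shown earlier can point at your own server. A sketch, assuming the server was started with vLLM's OpenAI API entrypoint:

```python
from openai import OpenAI

# Assumes a vLLM server started with its OpenAI-compatible entrypoint, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello from vLLM"}],
)
print(response.choices[0].message.content)
```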
Choosing Models
For Chat (Answer Generation)
High Quality Needed:
- Cloud: GPT-4, Claude 3 Opus
- Local: Llama 2 70B, Mixtral 8x7B
- When: Complex questions, critical accuracy
Balanced Performance:
- Cloud: GPT-3.5 Turbo, Claude 3 Sonnet
- Local: Llama 2 13B, Mistral 7B
- When: Most use cases, good quality/speed trade-off
Speed Priority:
- Cloud: GPT-3.5 Turbo, Claude 3 Haiku
- Local: Llama 2 7B, Phi-3
- When: High volume, fast responses needed
For Embeddings (Search)
Cloud Options:
- OpenAI text-embedding-ada-002 (1536 dimensions)
- OpenAI text-embedding-3-small (1536 dimensions)
- OpenAI text-embedding-3-large (3072 dimensions)
Local Options:
- sentence-transformers/all-MiniLM-L6-v2 (fast, 384 dim)
- sentence-transformers/all-mpnet-base-v2 (quality, 768 dim)
- BAAI/bge-large-en-v1.5 (high quality, 1024 dim)
Recommendation: Start with all-MiniLM-L6-v2 (local) or ada-002 (cloud)
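To see what local embeddings look like in practice, here is a minimal `sentence-transformers` sketch using the recommended starter model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
docs = ["Scrapalot indexes your documents.", "Embeddings power semantic search."]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # shape (2, 384)
query_vec = model.encode("How does search work?", normalize_embeddings=True)
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
print(scores)  # higher score = closer match
```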
Hardware Requirements
For Local Models
Minimum (CPU only):
- 16GB RAM
- Any modern CPU
- 50GB disk space
- Models: 7B parameters or smaller
Recommended (GPU):
- 16GB+ RAM
- NVIDIA GPU with 8GB+ VRAM
- 100GB disk space
- Models: up to 13B parameters
High Performance (Multi-GPU):
- 32GB+ RAM
- Multiple NVIDIA GPUs (16GB+ each)
- 200GB+ disk space
- Models: 70B+ parameters
Performance Guide
| Model Size | CPU RAM | GPU VRAM | Speed | Quality |
|---|---|---|---|---|
| 3B | 8GB | 4GB | Fast | Good |
| 7B | 16GB | 8GB | Medium | Better |
| 13B | 32GB | 16GB | Slower | Great |
| 70B | 128GB | 40GB+ | Slow | Best |
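These figures follow from a simple rule of thumb: memory roughly equals parameter count times bytes per weight, plus runtime overhead. A sketch of that arithmetic, where the 4.5 bits per weight and 20% overhead are loose assumptions for Q4-style quantization:

```python
def estimated_memory_gb(params_billions: float,
                        bits_per_weight: float = 4.5,
                        overhead: float = 1.2) -> float:
    """Rough estimate: weights * quantization width * runtime overhead.

    bits_per_weight: ~16 for fp16, ~4.5 for Q4_K_M-style quantization (assumption).
    overhead: loose 20% allowance for KV cache and buffers (assumption).
    """
    weight_gb = params_billions * bits_per_weight / 8  # 1e9 params * bytes each
    return weight_gb * overhead

for size in (3, 7, 13, 70):
    print(f"{size}B @ 4-bit: ~{estimated_memory_gb(size):.1f} GB")
```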
Model Caching
What it does: Keeps frequently used models loaded in memory so responses stay fast
Benefits:
- Cached models answer in 3-5 seconds, versus ~194 seconds when a model is loaded from scratch
- Automatic management
- No configuration needed
How it helps:
- Dramatically faster repeated queries
- Better user experience
- Efficient resource use
- Supports multiple models
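Conceptually, the cache behaves like a small least-recently-used map from model name to loaded model. A sketch of the idea (illustrative only; Scrapalot's actual eviction policy is not specified here):

```python
from collections import OrderedDict

class ModelCache:
    """Tiny LRU cache for loaded models (illustrative sketch, not
    Scrapalot's actual implementation)."""

    def __init__(self, max_models=2):
        self.max_models = max_models
        self._models = OrderedDict()  # model name -> loaded model object

    def get(self, name, loader):
        if name in self._models:
            self._models.move_to_end(name)    # mark as recently used
            return self._models[name]
        model = loader(name)                  # expensive: disk -> RAM/VRAM
        self._models[name] = model
        if len(self._models) > self.max_models:
            self._models.popitem(last=False)  # evict least recently used
        return model
```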
Cost Considerations
Cloud Models
Pricing structure:
- Pay per token (input + output)
- GPT-4: Most expensive, best quality
- GPT-3.5: Economical, good quality
- Embeddings: Very low cost
Typical costs:
- 1000 documents embedded: $1-5
- 1000 chat queries: $10-100 (varies by model)
- Monthly for active use: $50-500 (depends on volume)
Local Models
One-time costs:
- GPU hardware: $500-5000
- Setup time: a few hours
- Model downloads: Free
Ongoing costs:
- Electricity only
- No per-query charges
- No rate limits
- Scales with hardware
Break-even: Typically 3-6 months for moderate use
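That estimate is simple arithmetic: divide the hardware cost by the monthly cloud spend it replaces, net of electricity. A sketch with assumed numbers:

```python
gpu_cost = 1500.0      # one-time hardware purchase, USD (assumption)
cloud_monthly = 300.0  # cloud API spend being replaced, USD/month (assumption)
power_monthly = 25.0   # electricity to run the GPU, USD/month (assumption)

months_to_break_even = gpu_cost / (cloud_monthly - power_monthly)
print(f"Break-even in ~{months_to_break_even:.1f} months")  # ~5.5 here
```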
Switching Models
Easy Migration
Change chat model:
- Select new model in settings
- Existing conversations unaffected
- Takes effect immediately
Change embedding model:
- Update collection settings
- Reprocess documents with new embeddings
- Search results will change; quality often improves with a better model
Mix and match:
- Different models per collection
- Chat and embeddings can use different providers
- Optimize for each use case
Best Practices
Model Selection
Start simple:
- Begin with a cloud model (e.g. GPT-3.5 Turbo)
- Evaluate quality and cost
- Try local models if privacy/cost matters
- Optimize based on actual usage
Optimize for use case:
- Customer support: Fast models (GPT-3.5, Llama 2 7B)
- Research: High quality (GPT-4, Claude Opus)
- Internal docs: Local models for privacy
- Public content: Cloud models for scale
Performance Tuning
For local models:
- Start with smaller models (7B)
- Increase size if quality insufficient
- Use GPU acceleration when available
- Monitor memory usage
For cloud models:
- Use cheaper models for embeddings
- Reserve expensive models for complex queries
- Monitor API costs
- Set reasonable rate limits
Privacy & Compliance
When privacy matters:
- Use local models exclusively
- Self-host embeddings and chat
- Data never leaves your infrastructure
- Full audit trail
When cloud is acceptable:
- Use encrypted connections
- Review provider privacy policies
- Consider data residency requirements
- Understand data retention
Troubleshooting
Model Loading Slow
For local models:
- Check available RAM/VRAM
- Reduce model size
- Enable model caching
- Use faster storage (SSD)
Poor Answer Quality
Try:
- Switch to larger/better model
- Adjust temperature settings
- Improve document chunking
- Use better embedding model
Out of Memory
Solutions:
- Use smaller model
- Reduce context window
- Add more RAM/VRAM
- Use quantized models (Q4, Q5)
API Errors
Check:
- API key is valid
- Rate limits not exceeded
- Sufficient API credits
- Service status
Related Documentation
- RAG Strategy - How models power retrieval
- Document Processing - Embedding usage
- Database Design - Model configuration storage
- Deployment Guide - Production model setup
Scrapalot works with any model. Start with what's convenient, optimize as you learn your usage patterns. You're never locked in.