External Connectors
Automatically fetch and sync documents from external sources, keeping your knowledge base up to date without manual uploads.
What Are Connectors?
Connectors integrate Scrapalot with external services to automatically:
- Fetch documents from cloud storage and web sources
- Sync on schedule to keep content current
- Handle authentication securely
- Monitor for updates and fetch new content
- Respect rate limits to avoid service issues
Supported Sources
Google Drive
Automatically sync folders from Google Drive
Use cases:
- Team documentation stored in shared folders
- Project files that update regularly
- Policies and procedures that change
Features:
- Sync entire folders with subfolders
- Filter by file type (PDF, Word, etc.)
- Automatic updates when files change
- OAuth 2.0 secure authentication
Setup:
- Add Google Drive connector to collection
- Authorize with your Google account
- Select folder to sync
- Choose sync schedule
- Documents appear automatically
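The file type filter mentioned under Features can be pictured as a simple allow-list check on file extensions. The sketch below is illustrative only: `ALLOWED_EXTENSIONS` and `filter_by_type` are hypothetical names, not part of Scrapalot's configuration, and a real connector may filter on MIME type instead.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list of extensions the connector should ingest.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".md"}

def filter_by_type(file_names, allowed=ALLOWED_EXTENSIONS):
    """Return only the names whose extension is in the allow-list (case-insensitive)."""
    return [n for n in file_names if PurePosixPath(n).suffix.lower() in allowed]

files = ["handbook.PDF", "notes.txt", "specs/design.docx", "logo.png"]
kept = filter_by_type(files)
print(kept)
```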
Firecrawl (Web Scraping)
Extract content from websites
Use cases:
- Documentation sites
- Knowledge bases
- Help centers
- Blog content
Features:
- Handles JavaScript-heavy sites
- Waits for dynamic content to load
- Extracts clean text
- Follows links to specified depth
Setup:
- Get Firecrawl API key (free tier available)
- Add Firecrawl connector
- Enter website URL
- Configure crawl depth
- Start fetching
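To make "crawl depth" concrete: a depth of 1 means the start page plus the pages it links to, a depth of 2 adds the pages those link to, and so on. The following is a minimal sketch of a depth-limited breadth-first crawl over a toy in-memory site. It does not call Firecrawl, and `crawl`, `get_links`, and `SITE` are illustrative names.

```python
from collections import deque

def crawl(start_url, get_links, max_depth):
    """Breadth-first traversal that stops following links beyond max_depth."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # pages at the depth limit are fetched but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph standing in for real HTTP fetches.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup", "/"],
    "/blog": [],
    "/docs/setup": ["/docs/setup/advanced"],
}

depth_one = crawl("/", lambda u: SITE.get(u, []), max_depth=1)
depth_two = crawl("/", lambda u: SITE.get(u, []), max_depth=2)
print(depth_one)
print(depth_two)
```

The `seen` set keeps pages that link back to each other from being fetched twice, which matters on documentation sites with shared navigation.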
Web Scraper (Simple Pages)
Fetch content from static web pages
Use cases:
- Simple documentation pages
- Static content sites
- Public knowledge bases
Features:
- Fast, lightweight
- No external API needed
- Custom CSS selectors
- Rate limiting built-in
Setup:
- Add Web Scraper connector
- Enter page URLs
- Optionally specify CSS selectors
- Configure delays between requests
- Fetch content
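A selector narrows extraction to the part of the page that holds the content, skipping navigation and footers. The sketch below approximates a single `.class` selector using only Python's standard library. It is an assumption-laden simplification: it expects well-formed HTML, and a real connector would more likely rely on a full CSS selector engine. `ClassTextExtractor` and `extract` are hypothetical names.

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect text inside any element carrying the target class."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # greater than 0 while inside a matching element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())

def extract(html, target_class):
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return " ".join(parser.parts)

html = (
    '<nav>Menu</nav>'
    '<div class="content main"><h1>Guide</h1><p>Step one</p></div>'
    '<footer>Contact</footer>'
)
print(extract(html, "content"))
```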
Custom API
Connect to any REST API
Use cases:
- Internal company systems
- Custom document repositories
- Third-party services
- Legacy systems
Features:
- Flexible endpoint configuration
- Custom headers and authentication
- Response parsing options
- Error handling
Setup:
- Add API connector
- Configure endpoint URL
- Set authentication headers
- Define response format
- Test and activate
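As a rough illustration of these setup steps, a Custom API connector needs three pieces of information: where to send the request, which headers to attach, and where the document list sits in the response. The sketch below assumes a JSON response. The field names (`endpoint_url`, `headers`, `items_path`) and the example URL are hypothetical, not Scrapalot's actual schema.

```python
import json
import urllib.request
from dataclasses import dataclass, field

@dataclass
class ApiConnectorConfig:
    # Illustrative fields only; not the product's real configuration schema.
    endpoint_url: str
    headers: dict = field(default_factory=dict)
    items_path: list = field(default_factory=list)  # keys to walk to reach the document list

def build_request(cfg):
    """Prepare (but do not send) a GET request carrying the configured headers."""
    return urllib.request.Request(cfg.endpoint_url, headers=cfg.headers, method="GET")

def parse_documents(cfg, body):
    """Walk items_path through the decoded JSON body and return the document list."""
    node = json.loads(body)
    for key in cfg.items_path:
        node = node[key]
    return node

cfg = ApiConnectorConfig(
    endpoint_url="https://api.example.com/v1/documents",
    headers={"Authorization": "Bearer <token>"},
    items_path=["data", "items"],
)
req = build_request(cfg)

# A canned response stands in for the network call.
sample = '{"data": {"items": [{"id": 1, "title": "Policy A"}]}}'
docs = parse_documents(cfg, sample)
print(docs)
```

Running `parse_documents` against a saved sample response is one way to carry out the "test" step before activating the connector.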
Sync Scheduling
Schedule Options
Manual:
- Fetch only when you trigger it
- Good for one-time imports
- Full control over timing
Hourly:
- Keep content very current
- Good for rapidly changing content
- Higher API usage
Daily:
- Balance between freshness and efficiency
- Recommended for most use cases
- Runs during low-activity hours
Weekly:
- Light API usage
- Good for stable content
- Minimal resource impact
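The four schedule options reduce to a fixed interval, or no interval at all for manual connectors. Here is a minimal sketch of how the next run time could be derived from the last sync. The `INTERVALS` table and `next_sync` function are illustrative, and a real scheduler may also shift runs into low-activity hours as noted for daily syncs.

```python
from datetime import datetime, timedelta

# Interval per schedule; "manual" has no automatic next run.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def next_sync(schedule, last_sync):
    """Return the next scheduled sync time, or None for manual connectors."""
    if schedule == "manual":
        return None
    return last_sync + INTERVALS[schedule]

last = datetime(2024, 1, 1, 2, 0)
for name in ("manual", "hourly", "daily", "weekly"):
    print(name, next_sync(name, last))
```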
Automatic Updates
What happens during sync:
- Connector checks source for new/updated documents
- Downloads only changed files
- Queues documents for processing
- Updates existing documents if modified
- Sends notification when complete
Smart syncing:
- Only fetches what changed
- Deduplicates identical content
- Preserves existing document metadata
- Maintains citation links
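One common way to get "only fetch what changed" and deduplication is to keep a content hash per document and compare it on every sync. The sketch below assumes that approach. Whether Scrapalot uses hashes, modification timestamps, or source-provided change tokens is not stated here, and `plan_sync` is a hypothetical helper.

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def plan_sync(source_docs, index):
    """Compare source documents against the index from the previous sync.

    source_docs: {doc_id: bytes} as fetched from the source
    index: {doc_id: sha256 hex digest} recorded last time
    Returns (to_process, unchanged, duplicates) as lists of doc IDs.
    """
    to_process, unchanged, duplicates = [], [], []
    known_hashes = set(index.values())
    for doc_id, data in source_docs.items():
        digest = content_hash(data)
        if index.get(doc_id) == digest:
            unchanged.append(doc_id)      # same ID, same content: skip
        elif digest in known_hashes:
            duplicates.append(doc_id)     # identical content already stored elsewhere
        else:
            to_process.append(doc_id)     # new or modified: queue for processing
            known_hashes.add(digest)
    return to_process, unchanged, duplicates

index = {"a": content_hash(b"alpha"), "b": content_hash(b"old")}
source = {"a": b"alpha", "b": b"new", "c": b"alpha", "d": b"new"}
plan = plan_sync(source, index)
print(plan)
```

Because modified documents keep their ID and only their content is reprocessed, metadata and citation links attached to that ID can stay in place.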
Authentication & Security
OAuth 2.0 (Google Drive)
Secure, standard authentication:
- Authorize once; access continues until you revoke it or the authorization expires
- Revoke access anytime
- No password storage
- Automatic token refresh
Permission scope:
- Read-only access to selected folders
- Cannot modify your files
- Limited to folders you choose
API Keys (Firecrawl, Custom APIs)
Simple key-based authentication:
- Keys are stored encrypted
- Keys never appear in logs
- Easy to rotate
- Revoke anytime
Security:
- Keys encrypted at rest
- Transmitted over TLS
- Access controlled per user
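Keeping keys out of logs usually means masking them before a log record is written. The sketch below shows one way to do that with a standard `logging.Filter`. The pattern and the `RedactSecrets` class are assumptions for illustration, not a description of how Scrapalot implements this.

```python
import logging
import re

class RedactSecrets(logging.Filter):
    """Mask bearer tokens and api_key values before a record is emitted."""
    PATTERN = re.compile(r"(?i)(bearer\s+|api[_-]?key=)([^\s&]+)")

    def filter(self, record):
        # Format the message first so secrets passed as arguments are masked too.
        record.msg = self.PATTERN.sub(r"\1***", record.getMessage())
        record.args = ()
        return True

handler = logging.StreamHandler()
handler.addFilter(RedactSecrets())
logger = logging.getLogger("connector-demo")
logger.addHandler(handler)
logger.warning("GET /docs?api_key=%s failed", "s3cr3t")
```

Attaching the filter to the handler rather than to a single logger means it also covers records that reach the handler from child loggers.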
Error Handling
Automatic Retry
If fetching fails:
- Automatic retry with exponential backoff
- Skip problematic documents, continue with others
- Detailed error logging
- User notification of issues
Common failures handled:
- Temporary network issues
- Rate limit exceeded (waits and retries)
- Document temporarily unavailable
- Authentication token expired (auto-refresh)
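Exponential backoff doubles the wait after each failed attempt, up to a cap, so a struggling service is not hit repeatedly at full speed. The sketch below combines that with the "skip and continue" behaviour described above. The delays, attempt count, and names (`TransientError`, `fetch_with_retry`, `fetch_all`) are illustrative assumptions rather than Scrapalot's actual settings.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for recoverable failures such as timeouts or rate-limit responses."""

def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter

def fetch_all(fetchers, **retry_opts):
    """Fetch every document; record failures instead of aborting the whole sync."""
    results, failed = {}, {}
    for doc_id, fetch in fetchers.items():
        try:
            results[doc_id] = fetch_with_retry(fetch, **retry_opts)
        except TransientError as exc:
            failed[doc_id] = str(exc)
    return results, failed

# Demo: a fetch that fails twice, then succeeds. Waits are recorded, not slept.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporarily unavailable")
    return "ok"

waits = []
result = fetch_with_retry(flaky, sleep=waits.append)
print(result, len(waits))
```

The jitter spreads out retries from connectors that failed at the same moment, so they do not all retry in lockstep.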
Notifications
You're informed when:
- Sync completes successfully
- Documents fail to fetch
- Authentication expires
- Usage approaches a rate limit
- A service is unavailable
Monitoring & Management
Connector Status
Track connector health:
- Last successful sync time
- Next scheduled sync
- Documents fetched
- Success/failure counts
- Current status (active, paused, error)
Available actions:
- Trigger manual sync
- Pause/resume syncing
- Edit configuration
- View sync history
- Delete connector
Sync History
View past activity:
- Sync timestamps
- Success/failure status
- Documents processed
- Error messages
- Processing time
Use for:
- Troubleshooting issues
- Verifying sync schedule
- Monitoring API usage
- Audit trail
Rate Limiting & Quotas
Automatic Rate Management
Respects API limits:
- Configurable delays between requests
- Automatic backoff on limit warnings
- Queue management to spread load
- Pause and resume on quota exhaustion
Google Drive:
- 1,000 requests per 100 seconds (Google's limit; quotas vary by project, so confirm the current value in the Google Cloud console)
- Automatic throttling built-in
- Batch operations when possible
Firecrawl:
- Free tier: 500 pages/month
- Paid tier: Higher limits
- Tracks usage automatically
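A limit expressed as "N requests per window" can be enforced on the client side with a sliding-window limiter that blocks before each request when the window is full. The following is a minimal sketch of that idea. `SlidingWindowLimiter` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only to make the demo deterministic.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_requests within any span of `window` seconds."""

    def __init__(self, max_requests, window, clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # timestamps of recent requests, oldest first

    def acquire(self):
        while True:
            now = self.clock()
            # Drop timestamps that have left the window.
            while self.stamps and now - self.stamps[0] >= self.window:
                self.stamps.popleft()
            if len(self.stamps) < self.max_requests:
                self.stamps.append(now)
                return
            # Window is full: wait until the oldest request ages out, then re-check.
            self.sleep(self.window - (now - self.stamps[0]))

# Demo with a fake clock so nothing actually sleeps.
class FakeTime:
    def __init__(self):
        self.t = 0.0
        self.waits = []
    def now(self):
        return self.t
    def sleep(self, seconds):
        self.waits.append(seconds)
        self.t += seconds

fake = FakeTime()
limiter = SlidingWindowLimiter(max_requests=2, window=10, clock=fake.now, sleep=fake.sleep)
for _ in range(3):
    limiter.acquire()
print(fake.waits)
```

For the Google Drive figure quoted above, the equivalent would be `SlidingWindowLimiter(max_requests=1000, window=100)`, called once before each request.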
Quota Monitoring
Track API usage:
- Current usage vs. limits
- Usage by connector
- Alerts when approaching limits
- Recommendations to optimize
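An "approaching limits" alert can be as simple as comparing usage to a percentage threshold per connector. The sketch below assumes an 80% warning level. Both that threshold and the `quota_status` helper are illustrative rather than documented Scrapalot behaviour.

```python
def quota_status(used, limit, warn_percent=80):
    """Classify usage as 'ok', 'warning' (at or above warn_percent), or 'exhausted'."""
    if used >= limit:
        return "exhausted"
    if used * 100 >= warn_percent * limit:
        return "warning"
    return "ok"

# Hypothetical per-connector usage against a 500-page monthly allowance.
usage = {"help-center": 120, "docs-site": 430, "blog": 500}
report = {name: quota_status(used, 500) for name, used in usage.items()}
print(report)
```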
Best Practices
Connector Setup
Optimize your connectors:
- Use specific folders/URLs, not entire drives
- Filter by relevant file types
- Set appropriate sync frequency
- Group related content in same connector
Performance
Efficient syncing:
- Schedule during low-usage hours
- Avoid hourly sync unless necessary
- Use manual sync for one-time imports
- Monitor document count growth
Organization
Keep it maintainable:
- Name connectors descriptively
- Document what each connector fetches
- Review and clean unused connectors
- Archive completed syncs
Security
Protect your data:
- Use minimum necessary permissions
- Review connector access regularly
- Rotate API keys periodically
- Remove unused connectors
Troubleshooting
Connector Won't Authenticate
Check:
- Credentials are correct
- OAuth consent has not expired
- API key is valid
- Service is accessible
Solutions:
- Re-authorize OAuth
- Generate new API key
- Check firewall/network
- Verify service status
No Documents Fetched
Common causes:
- Empty folder/source
- File type filters too restrictive
- Permission issues
- Rate limit reached
Solutions:
- Verify source has content
- Adjust file type filters
- Check permissions
- Review quota usage
Sync Failing Repeatedly
Investigate:
- Error messages in history
- Service health status
- Authentication validity
- Network connectivity
Fix:
- Address specific error
- Re-authenticate if needed
- Check source availability
- Contact support if persistent
Use Case Examples
Team Documentation
Scenario: Engineering team stores docs in Google Drive
Setup:
- Connect to Drive folder
- Daily sync schedule
- PDF and Markdown files only
- Notify on updates
Benefits:
- Always current documentation
- No manual uploads
- Automatic processing
- Team stays informed
Product Knowledge Base
Scenario: Public help center needs to be searchable
Setup:
- Firecrawl connector to help site
- Weekly sync
- 2-level deep crawl
- Main content section only
Benefits:
- Searchable help content
- Updated automatically
- Full-text search
- Citations link back to the original page
Compliance Documents
Scenario: Regulatory documents from internal API
Setup:
- Custom API connector
- Monthly sync
- Authenticated endpoint
- Document metadata preserved
Benefits:
- Centralized compliance search
- Automatic updates
- Audit trail maintained
- Secure access
Related Documentation
- Background Workers - How fetched documents are processed
- Document Processing - Content chunking
- Database Design - Storage of synced documents
- Deployment Guide - Production connector setup
Connectors automate document management so you don't have to upload updates manually. Set them up once, then check their sync status periodically to confirm they are still running.