External Connectors
Last Updated: March 2026
Automatically fetch and sync documents from external sources, so your knowledge base stays up to date without manual uploads.
What Are Connectors?
Connectors integrate Scrapalot with external services to automatically:
- Fetch documents from cloud storage and web sources
- Sync on a schedule to keep content current
- Handle authentication securely
- Monitor for updates and fetch new content
- Respect rate limits to avoid service issues
Connector Architecture
Connector configuration is owned by the Kotlin backend and propagated to the Python AI service over Redis Streams (the K→P, or Kotlin-to-Python, sync). The Python service consumes these events, so both services stay in sync when connectors are created, updated, or deleted.
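As a rough illustration of the consumer side, the sketch below applies an ordered list of connector change events to a local registry. The event shape (`op`, `connector_id`, `config`) and the name `apply_event` are assumptions made for this example, not Scrapalot's actual schema. In a real deployment the events would be read from a Redis stream (for instance with `XREADGROUP`) rather than from a Python list.

```python
from typing import Any

# Hypothetical event shape: {"op": "created" | "updated" | "deleted",
#                            "connector_id": str, "config": dict}
Registry = dict[str, dict[str, Any]]


def apply_event(registry: Registry, event: dict[str, Any]) -> None:
    """Apply one connector change event to the consumer's local copy."""
    op = event["op"]
    connector_id = event["connector_id"]
    if op in ("created", "updated"):
        # Upsert, so replaying an event after a restart is harmless.
        registry[connector_id] = dict(event["config"])
    elif op == "deleted":
        # pop() with a default tolerates a duplicate delete.
        registry.pop(connector_id, None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Events are applied in stream order, as the owner emitted them.
events = [
    {"op": "created", "connector_id": "c1",
     "config": {"type": "google_drive", "schedule": "daily"}},
    {"op": "updated", "connector_id": "c1",
     "config": {"type": "google_drive", "schedule": "hourly"}},
    {"op": "created", "connector_id": "c2",
     "config": {"type": "notion", "schedule": "weekly"}},
    {"op": "deleted", "connector_id": "c2", "config": {}},
]

registry: Registry = {}
for event in events:
    apply_event(registry, event)
```

Treating create and update as the same upsert keeps the handler idempotent, which matters if the consumer re-reads events it has already seen.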
Supported Sources (11 Connectors)
Workspace Connectors (8)
Google Drive
Automatically sync folders from Google Drive
Use cases:
- Team documentation stored in shared folders
- Project files that update regularly
- Policies and procedures that change
Features:
- Sync entire folders with subfolders
- Filter by file type (PDF, Word, etc.)
- Automatic updates when files change
- OAuth 2.0 secure authentication
Setup:
- Add Google Drive connector to collection
- Authorize with your Google account
- Select folder to sync
- Choose sync schedule
- Documents appear automatically
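The choices made during setup (folder, schedule, file type filter) can be pictured as a small configuration object. This is a minimal sketch with hypothetical names (`DriveConnectorConfig`, `folder_id`, `file_types`); it does not reflect Scrapalot's real settings model and makes no calls to Google Drive.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DriveConnectorConfig:
    """Hypothetical settings gathered during setup (names are illustrative)."""
    folder_id: str
    schedule: str = "daily"
    include_subfolders: bool = True
    # Lower-case extensions without the dot; empty means "accept everything".
    file_types: frozenset[str] = frozenset()

    def accepts(self, path: str) -> bool:
        """Return True if the file at `path` passes the file type filter."""
        if not self.file_types:
            return True
        suffix = PurePosixPath(path).suffix.lower().lstrip(".")
        return suffix in self.file_types


config = DriveConnectorConfig(
    folder_id="example-folder-id",  # placeholder, not a real Drive ID
    file_types=frozenset({"pdf", "docx"}),
)
files = ["policies/Leave.PDF", "notes/todo.txt", "specs/api.docx", "README"]
selected = [f for f in files if config.accepts(f)]
```

Here `selected` keeps only the PDF and Word files; the comparison is case-insensitive, so `Leave.PDF` still matches.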
Dropbox
Sync files from Dropbox cloud storage
Use cases:
- Team documents stored in Dropbox
- Shared folders and project files
- Automatic sync when files change
Features:
- OAuth 2.0 authentication
- Folder sync with subfolders
- File type filtering
- Automatic updates
Notion
Import pages and databases from Notion
Use cases:
- Team wikis and documentation
- Project management databases
- Knowledge bases stored in Notion
Features:
- OAuth 2.0 integration
- Page and database import
- Automatic sync on schedule
- Rich content extraction
Confluence
Sync documentation from Atlassian Confluence
Use cases:
- Enterprise documentation
- Team knowledge bases
- Technical specifications
Features:
- Space and page import
- Hierarchical content sync
- Authentication integration
Slack
Import messages and files from Slack channels
Use cases:
- Team conversations and decisions
- Shared files and documents
- Knowledge scattered across channels
Features:
- Channel and thread import
- File attachment sync
- OAuth 2.0 authentication
- Message filtering
SharePoint
Connect to Microsoft SharePoint document libraries
Use cases:
- Enterprise document management
- Team collaboration files
- Compliance documentation
Features:
- Document library sync
- Folder filtering
- Microsoft authentication
Wikipedia
Import Wikipedia articles
Use cases:
- Reference material and encyclopedic content
- Background knowledge for research
- Multilingual content
Features:
- Article search and import
- Clean text extraction
- Multilingual support
Zotero
Import references and PDFs from Zotero libraries
Use cases:
- Academic reference management
- Research paper collections
- Bibliography management
Features:
- Library and collection sync
- PDF attachment import
- Metadata extraction
- Citation data preservation
Academic Search Connectors (3)
Google Scholar
Search and import academic papers from Google Scholar
Use cases:
- Academic research papers
- Citation analysis
- Literature reviews
Features:
- Search by keywords, authors, or publication
- Import paper metadata and abstracts
- Follow citation chains
- Filter by publication date
arXiv
Access preprints from arXiv repository
Use cases:
- Latest research in physics, math, CS, and more
- Pre-publication papers
- Technical research
Features:
- Search by topic, author, or arXiv ID
- Download full paper PDFs
- Filter by category and date
- Automatic metadata extraction
Semantic Scholar
AI-powered academic paper search
Use cases:
- Comprehensive literature search
- Finding influential papers
- Research trend analysis
Features:
- Semantic search across academic papers
- Citation and reference tracking
- Author and topic analysis
- Relevance-based ranking
Sync Scheduling
Schedule Options
Manual:
- Fetch only when you trigger it
- Good for one-time imports
- Full control over timing
Hourly:
- Keep content very current
- Good for rapidly changing content
- Higher API usage
Daily:
- Balance between freshness and efficiency
- Recommended for most use cases
- Runs during low-activity hours
Weekly:
- Light API usage
- Good for stable content
- Minimal resource impact
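The schedule options above amount to a fixed interval added to the last sync time, with manual connectors having no next run. The following sketch assumes the schedule is stored as one of the strings `manual`, `hourly`, `daily`, or `weekly`; the function name `next_sync` is illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed intervals; "manual" has no interval because it never runs on its own.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_sync(schedule: str, last_sync: datetime) -> Optional[datetime]:
    """Return when the next sync is due, or None for manual connectors."""
    if schedule == "manual":
        return None
    if schedule not in INTERVALS:
        raise ValueError(f"unknown schedule: {schedule!r}")
    return last_sync + INTERVALS[schedule]


last = datetime(2026, 3, 2, 3, 0)
due_hourly = next_sync("hourly", last)
due_daily = next_sync("daily", last)
due_weekly = next_sync("weekly", last)
due_manual = next_sync("manual", last)
```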
Automatic Updates
What happens during sync:
- Connector checks source for new/updated documents
- Downloads only changed files
- Queues documents for processing
- Updates existing documents if modified
- Sends notification when complete
Smart syncing:
- Only fetches what changed
- Deduplicates identical content
- Preserves existing document metadata
- Maintains citation links
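One way to picture "only fetches what changed" and deduplication is to compare content hashes against those recorded at the previous sync. This is a sketch under that assumption; a real connector would more likely compare revision markers or modification times supplied by the source so that unchanged files are never downloaded. The function name `plan_sync` and the plan keys are hypothetical.

```python
import hashlib


def plan_sync(remote: dict[str, bytes], known: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to queue by comparing content hashes with the last sync.

    remote: document ID -> current content at the source
    known:  document ID -> SHA-256 hex digest recorded at the last sync
    """
    plan: dict[str, list[str]] = {"new": [], "updated": [], "duplicate": []}
    seen = set(known.values())
    for doc_id in sorted(remote):
        digest = hashlib.sha256(remote[doc_id]).hexdigest()
        if doc_id in known:
            if known[doc_id] != digest:
                plan["updated"].append(doc_id)
            # Same digest: unchanged, so nothing is queued.
        elif digest in seen:
            # Unknown ID but identical content to something already held.
            plan["duplicate"].append(doc_id)
        else:
            plan["new"].append(doc_id)
        seen.add(digest)
    return plan


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


known = {"a": sha(b"v1"), "b": sha(b"same")}
remote = {"a": b"v2", "b": b"same", "c": b"same", "d": b"fresh"}
plan = plan_sync(remote, known)
```

In this example `a` is queued as updated, `b` is left alone, `c` is flagged as a duplicate of `b`, and `d` is queued as new.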
Authentication & Security
OAuth 2.0 (Google Drive, Dropbox, Notion, Slack)
Secure, standard authentication:
- Authorize once; access continues until you revoke it or the authorization expires
- Revoke access anytime
- No password storage
- Automatic token refresh
Permission scope:
- Read-only access to selected folders
- Cannot modify your files
- Limited to folders you choose
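Automatic token refresh usually comes down to checking the access token's expiry shortly before each request. The sketch below shows only that check; the record layout (`StoredToken`) and the five-minute safety margin are assumptions, and the actual refresh call to the provider's token endpoint is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredToken:
    """Hypothetical record kept per authorized connector."""
    access_token: str
    refresh_token: str
    expires_at: datetime  # timezone-aware, UTC


def needs_refresh(token: StoredToken, now: datetime,
                  skew: timedelta = timedelta(minutes=5)) -> bool:
    """True when the access token is expired or will expire within `skew`."""
    return now >= token.expires_at - skew


now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
fresh = StoredToken("at-1", "rt-1", now + timedelta(minutes=30))
stale = StoredToken("at-2", "rt-2", now + timedelta(minutes=2))
```

Refreshing slightly early (the `skew` margin) avoids starting a long sync with a token that expires partway through.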
API Keys
Simple key-based authentication:
- Store keys securely encrypted
- Never exposed in logs
- Easy to rotate
- Revoke anytime
Security:
- Keys encrypted at rest
- Transmitted over TLS
- Access controlled per user
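Keeping keys out of logs is often done by masking them before they reach any log call. This is a minimal sketch of such a helper; the name `mask_secret` and the choice to show the last four characters are assumptions, and encryption at rest is a separate concern not covered here.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last `visible` characters."""
    if visible <= 0 or len(secret) <= visible:
        # Too short to reveal any part without exposing most of the key.
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


masked = mask_secret("sk-example-1234567890")  # made-up key for illustration
short = mask_secret("abc")
```

Showing a short suffix lets an operator tell two keys apart in the logs without exposing either one.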
Error Handling
Automatic Retry
If fetching fails:
- Automatic retry with exponential backoff
- Skip problematic documents, continue with others
- Detailed error logging
- User notification of issues
Common failures handled:
- Temporary network issues
- Rate limit exceeded (waits and retries)
- Document temporarily unavailable
- Authentication token expired (auto-refresh)
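The retry behaviour described above can be sketched as a fetch wrapper that doubles its wait after each failure, plus a loop that records a failed document and moves on. All names here (`TransientError`, `fetch_with_retry`, `sync_all`) and the default of four attempts are illustrative assumptions, not Scrapalot's implementation.

```python
import time
from typing import Callable, Iterable


class TransientError(Exception):
    """Stand-in for failures worth retrying (network blip, rate limit)."""


def fetch_with_retry(fetch: Callable[[str], bytes], doc_id: str,
                     attempts: int = 4, base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch(doc_id), waiting base_delay * 2**n after the n-th failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(doc_id)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries; the caller records the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def sync_all(fetch: Callable[[str], bytes], doc_ids: Iterable[str],
             sleep: Callable[[float], None] = time.sleep):
    """Fetch every document, skipping those that keep failing."""
    fetched: dict[str, bytes] = {}
    failed: dict[str, str] = {}
    for doc_id in doc_ids:
        try:
            fetched[doc_id] = fetch_with_retry(fetch, doc_id, sleep=sleep)
        except TransientError as exc:
            failed[doc_id] = str(exc)  # skip it and continue with the rest
    return fetched, failed


# A fake source: "flaky" succeeds on its third call, "broken" never does.
calls: dict[str, int] = {}


def fake_fetch(doc_id: str) -> bytes:
    calls[doc_id] = calls.get(doc_id, 0) + 1
    if doc_id == "broken":
        raise TransientError("still unavailable")
    if doc_id == "flaky" and calls[doc_id] < 3:
        raise TransientError("temporary network issue")
    return doc_id.encode()


delays: list[float] = []  # capture waits instead of actually sleeping
fetched, failed = sync_all(fake_fetch, ["ok", "flaky", "broken"], sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the backoff sequence easy to inspect.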
Notifications
You're informed when:
- Sync completes successfully
- Documents fail to fetch
- Authentication expires
- Rate limits approached
- Service unavailable
Monitoring & Management
Connector Status
Track connector health:
- Last successful sync time
- Next scheduled sync
- Documents fetched
- Success/failure counts
- Current status (active, paused, error)
Available actions:
- Trigger manual sync
- Pause/resume syncing
- Edit configuration
- View sync history
- Delete connector
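The three statuses and the pause/resume actions suggest a small state machine. The transition table below is an assumption made for illustration (for example, that a failed sync moves an active connector to `error`); the document does not specify the exact rules.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    ERROR = "error"


# Assumed transition table: (current status, action or event) -> new status.
TRANSITIONS = {
    (Status.ACTIVE, "pause"): Status.PAUSED,
    (Status.PAUSED, "resume"): Status.ACTIVE,
    (Status.ACTIVE, "sync_failed"): Status.ERROR,
    (Status.ERROR, "sync_succeeded"): Status.ACTIVE,
    (Status.ERROR, "pause"): Status.PAUSED,
}


def apply_action(status: Status, action: str) -> Status:
    """Return the new status, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(
            f"cannot apply {action!r} to a connector that is {status.value}"
        ) from None


status = Status.ACTIVE
for action in ("pause", "resume", "sync_failed", "sync_succeeded"):
    status = apply_action(status, action)
```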
Sync History
View past activity:
- Sync timestamps
- Success/failure status
- Documents processed
- Error messages
- Processing time
Use for:
- Troubleshooting issues
- Verifying sync schedule
- Monitoring API usage
- Audit trail
Rate Limiting & Quotas
Automatic Rate Management
Respects API limits:
- Configurable delays between requests
- Automatic backoff on limit warnings
- Queue management to spread load
- Pause and resume on quota exhaustion
Google Drive:
- 1,000 requests per 100 seconds (Google limit; quotas vary by project, so confirm yours in the Google Cloud console)
- Automatic throttling built-in
- Batch operations when possible
Firecrawl:
- Free tier: 500 pages/month
- Paid tier: Higher limits
- Tracks usage automatically
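A limit expressed as "N requests per W seconds" can be enforced with a sliding window over recent request timestamps. This is a sketch of that general technique, not the throttling code Scrapalot uses; the class name and the small limit in the example are illustrative, and a real caller would pass `time.monotonic()` as `now`.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._sent: deque[float] = deque()  # timestamps of recent requests

    def wait_time(self, now: float) -> float:
        """Seconds to wait before the next request may be sent (0 if none)."""
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        # The oldest request must age out before another is allowed.
        return self._sent[0] + self.window - now

    def record(self, now: float) -> None:
        """Note that a request was sent at time `now`."""
        self._sent.append(now)


limiter = SlidingWindowLimiter(limit=3, window=10.0)
for t in (0.0, 1.0, 2.0):
    if limiter.wait_time(t) == 0.0:
        limiter.record(t)
blocked_for = limiter.wait_time(4.0)   # window is full until t = 10
free_again = limiter.wait_time(10.0)   # the request from t = 0 has aged out
```

For the Google Drive figure quoted above, the equivalent would be `SlidingWindowLimiter(limit=1000, window=100.0)`.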
Quota Monitoring
Track API usage:
- Current usage vs. limits
- Usage by connector
- Alerts when approaching limits
- Recommendations to optimize
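Alerts for approaching limits can be reduced to a threshold check on usage against the quota. In this sketch the 80% warning threshold, the status labels, and the connector names are all assumptions chosen for illustration.

```python
def quota_status(used: int, limit: int, warn_at: float = 0.8) -> str:
    """Classify usage as 'ok', 'warning', or 'exhausted' (thresholds assumed)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if used >= limit:
        return "exhausted"
    if used / limit >= warn_at:
        return "warning"
    return "ok"


# Hypothetical per-connector usage: name -> (used, limit)
usage = {"help-site": (120, 500), "docs-crawl": (450, 500), "archive": (500, 500)}
report = {name: quota_status(used, limit) for name, (used, limit) in usage.items()}
```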
Best Practices
Connector Setup
Optimize your connectors:
- Use specific folders/URLs, not entire drives
- Filter by relevant file types
- Set appropriate sync frequency
- Group related content in same connector
Performance
Efficient syncing:
- Schedule during low-usage hours
- Avoid hourly sync unless necessary
- Use manual sync for one-time imports
- Monitor document count growth
Organization
Keep it maintainable:
- Name connectors descriptively
- Document what each connector fetches
- Review and clean unused connectors
- Archive completed syncs
Security
Protect your data:
- Use minimum necessary permissions
- Review connector access regularly
- Rotate API keys periodically
- Remove unused connectors
Troubleshooting
Connector Won't Authenticate
Check:
- Credentials are correct
- OAuth consent has not expired
- API key is valid
- Service is accessible
Solutions:
- Re-authorize OAuth
- Generate new API key
- Check firewall/network
- Verify service status
No Documents Fetched
Common causes:
- Empty folder/source
- File type filters too restrictive
- Permission issues
- Rate limit reached
Solutions:
- Verify source has content
- Adjust file type filters
- Check permissions
- Review quota usage
Sync Failing Repeatedly
Investigate:
- Error messages in history
- Service health status
- Authentication validity
- Network connectivity
Fix:
- Address specific error
- Re-authenticate if needed
- Check source availability
- Contact support if persistent
Use Case Examples
Team Documentation
Scenario: Engineering team stores docs in Google Drive
Setup:
- Connect to Drive folder
- Daily sync schedule
- PDF and Markdown files only
- Notify on updates
Benefits:
- Always current documentation
- No manual uploads
- Automatic processing
- Team stays informed
Product Knowledge Base
Scenario: Public help center needs to be searchable
Setup:
- Firecrawl connector to help site
- Weekly sync
- 2-level deep crawl
- Main content section only
Benefits:
- Searchable help content
- Updated automatically
- Full-text search
- Citations link back to the original page
Compliance Documents
Scenario: Regulatory documents from internal API
Setup:
- Custom API connector
- Monthly sync
- Authenticated endpoint
- Document metadata preserved
Benefits:
- Centralized compliance search
- Automatic updates
- Audit trail maintained
- Secure access
Related Documentation
- Background Workers - How fetched documents are processed
- Document Processing - Content chunking
- Database Design - Storage of synced documents
- Deployment Guide - Production connector setup
Connectors automate document management so you never have to upload updates manually. Set a connector up once, and scheduled syncs keep your collection current.