External Connectors

Automatically fetch and sync documents from external sources. Keep your knowledge base up to date without manual uploads.

What Are Connectors?

Connectors integrate Scrapalot with external services to automatically:

  • Fetch documents from cloud storage and web sources
  • Sync on a schedule to keep content current
  • Handle authentication securely
  • Monitor for updates and fetch new content
  • Respect rate limits to avoid service issues

Supported Sources

Google Drive

Automatically sync folders from Google Drive

Use cases:

  • Team documentation stored in shared folders
  • Project files that update regularly
  • Policies and procedures that change

Features:

  • Sync entire folders with subfolders
  • Filter by file type (PDF, Word, etc.)
  • Automatic updates when files change
  • OAuth 2.0 secure authentication

Setup:

  1. Add Google Drive connector to collection
  2. Authorize with your Google account
  3. Select folder to sync
  4. Choose sync schedule
  5. Documents appear automatically
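
The file type filter mentioned above can be pictured as a simple extension check applied to each file in the synced folder. The following is a minimal sketch, not Scrapalot's implementation; the function name `filter_by_type` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def filter_by_type(paths, allowed_extensions):
    """Keep only paths whose extension is in `allowed_extensions`.

    Matching is case-insensitive and accepts extensions with or
    without a leading dot ("pdf" and ".pdf" are equivalent).
    """
    allowed = {ext.lower().lstrip(".") for ext in allowed_extensions}
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lower().lstrip(".") in allowed
    ]


# Hypothetical folder listing, for illustration only.
files = [
    "handbook/policy.pdf",
    "handbook/notes.DOCX",
    "handbook/logo.png",
    "handbook/README",
]
print(filter_by_type(files, ["pdf", ".docx"]))
# → ['handbook/policy.pdf', 'handbook/notes.DOCX']
```

Files without an extension are skipped in this sketch, so a narrow filter can leave a sync with nothing to fetch.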

Firecrawl (Web Scraping)

Extract content from websites

Use cases:

  • Documentation sites
  • Knowledge bases
  • Help centers
  • Blog content

Features:

  • Handles JavaScript-heavy sites
  • Waits for dynamic content to load
  • Extracts clean text
  • Follows links to specified depth

Setup:

  1. Get Firecrawl API key (free tier available)
  2. Add Firecrawl connector
  3. Enter website URL
  4. Configure crawl depth
  5. Start fetching
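
Crawl depth bounds how many link hops away from the starting URL the crawler will go. The sketch below illustrates the idea with a breadth-first traversal over an in-memory link graph. It does not use or represent the Firecrawl API; `crawl`, `links`, and the sample site are hypothetical stand-ins for fetching and parsing real pages.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Return pages reachable from `start` within `max_depth` link hops.

    `links` maps a URL to the URLs it links to.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:  # visit each page at most once
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site structure, for illustration only.
site = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
    "/docs/b": ["/docs"],
}
print(crawl("/docs", site, max_depth=1))
# → ['/docs', '/docs/a', '/docs/b']
```

With a depth of 1 the page `/docs/a/deep` is never reached; raising the depth to 2 includes it, at the cost of more pages fetched.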

Web Scraper (Simple Pages)

Fetch content from static web pages

Use cases:

  • Simple documentation pages
  • Static content sites
  • Public knowledge bases

Features:

  • Fast, lightweight
  • No external API needed
  • Custom CSS selectors
  • Rate limiting built-in

Setup:

  1. Add Web Scraper connector
  2. Enter page URLs
  3. Optionally specify CSS selectors
  4. Configure delays between requests
  5. Fetch content
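
The delay between requests in step 4 can be understood as a pause inserted before every request except the first. This is a minimal sketch under that assumption, not the connector's actual code; `fetch_all` and its parameters are hypothetical, and the `fetch` and `sleep` callables are injected so the example runs without network access.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay_seconds` between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        pages[url] = fetch(url)
    return pages


# Stand-ins for a real HTTP fetch and a real sleep, for illustration only.
pauses = []
result = fetch_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(result), pauses)
# → 3 [2.0, 2.0]
```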

Custom API

Connect to any REST API

Use cases:

  • Internal company systems
  • Custom document repositories
  • Third-party services
  • Legacy systems

Features:

  • Flexible endpoint configuration
  • Custom headers and authentication
  • Response parsing options
  • Error handling

Setup:

  1. Add API connector
  2. Configure endpoint URL
  3. Set authentication headers
  4. Define response format
  5. Test and activate
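
To make steps 2 to 4 concrete, here is a sketch of what an endpoint, authentication header, and response format might look like in code. Everything specific is an assumption for illustration: the endpoint URL, the bearer-token scheme, and the `documents`, `id`, and `title` field names are hypothetical and not taken from Scrapalot. The request is only constructed, never sent.

```python
import json
import urllib.request


def build_request(endpoint, api_key, extra_headers=None):
    """Build (but do not send) a GET request with authentication headers."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Accept": "application/json",
    }
    headers.update(extra_headers or {})
    return urllib.request.Request(endpoint, headers=headers, method="GET")


def parse_documents(body, items_key="documents"):
    """Extract (id, title) pairs from a JSON response body."""
    payload = json.loads(body)
    return [(item["id"], item["title"]) for item in payload[items_key]]


# Hypothetical endpoint and response, for illustration only.
request = build_request("https://api.example.com/v1/docs", "secret-key")
sample_body = '{"documents": [{"id": "1", "title": "Policy"}]}'
print(request.get_method(), request.full_url)
print(parse_documents(sample_body))
# → [('1', 'Policy')]
```

Parsing a saved sample response like this before activating a connector is one way to confirm the response format matches what you expect.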

Sync Scheduling

Schedule Options

Manual:

  • Fetch only when you trigger it
  • Good for one-time imports
  • Full control over timing

Hourly:

  • Keep content very current
  • Good for rapidly changing content
  • Higher API usage

Daily:

  • Balance between freshness and efficiency
  • Recommended for most use cases
  • Runs during low-activity hours

Weekly:

  • Light API usage
  • Good for stable content
  • Minimal resource impact
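
The four schedule options amount to a fixed interval added to the last sync time, with manual connectors having no next run. The sketch below assumes exactly that; the `next_sync` function and the schedule strings are hypothetical, and it does not model the "low-activity hours" behaviour of daily syncs.

```python
from datetime import datetime, timedelta

# Assumed mapping from schedule name to interval.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_sync(schedule, last_sync):
    """Return the next scheduled sync time, or None for manual connectors."""
    if schedule == "manual":
        return None  # manual connectors run only when triggered
    return last_sync + INTERVALS[schedule]


last = datetime(2024, 1, 1, 2, 0)
print(next_sync("daily", last))
# → 2024-01-02 02:00:00
print(next_sync("manual", last))
# → None
```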

Automatic Updates

What happens during sync:

  1. Connector checks source for new/updated documents
  2. Downloads only changed files
  3. Queues documents for processing
  4. Updates existing documents if modified
  5. Sends notification when complete

Smart syncing:

  • Only fetches what changed
  • Deduplicates identical content
  • Preserves existing document metadata
  • Maintains citation links
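
One common way to fetch only what changed and to deduplicate identical content is to compare content hashes against those recorded at the previous sync. The sketch below shows that technique; it is an assumption about how such a check could work, not a description of Scrapalot's internals (a real connector may rely on modification times or ETags instead). The `plan_sync` function and file names are hypothetical.

```python
import hashlib


def plan_sync(remote_files, known_hashes):
    """Classify remote files as new, changed, unchanged, or duplicate.

    `remote_files` maps path -> content bytes.
    `known_hashes` maps path -> SHA-256 hex digest from the last sync.
    """
    plan = {"new": [], "changed": [], "unchanged": [], "duplicate": []}
    seen = set()  # digests already encountered in this sync
    for path in sorted(remote_files):
        digest = hashlib.sha256(remote_files[path]).hexdigest()
        if known_hashes.get(path) == digest:
            plan["unchanged"].append(path)  # nothing to download
        elif digest in seen:
            plan["duplicate"].append(path)  # identical content seen earlier
        elif path in known_hashes:
            plan["changed"].append(path)  # same path, different content
        else:
            plan["new"].append(path)
        seen.add(digest)
    return plan


known = {
    "a.pdf": hashlib.sha256(b"v1").hexdigest(),
    "b.pdf": hashlib.sha256(b"old").hexdigest(),
}
remote = {"a.pdf": b"v1", "b.pdf": b"new", "c.pdf": b"fresh", "d.pdf": b"fresh"}
print(plan_sync(remote, known))
```

In this example only `b.pdf` and `c.pdf` would be queued for processing: `a.pdf` is unchanged and `d.pdf` duplicates `c.pdf`.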

Authentication & Security

OAuth 2.0 (Google Drive)

Secure, standard authentication:

  • Authorize once; access continues until you revoke it
  • Revoke access anytime
  • No password storage
  • Automatic token refresh

Permission scope:

  • Read-only access to selected folders
  • Cannot modify your files
  • Limited to folders you choose

API Keys (Firecrawl, Custom APIs)

Simple key-based authentication:

  • Stored encrypted
  • Never exposed in logs
  • Easy to rotate
  • Revoke anytime

Security:

  • Keys encrypted at rest
  • Transmitted over TLS
  • Access controlled per user

Error Handling

Automatic Retry

If fetching fails:

  • Automatic retry with exponential backoff
  • Skip problematic documents, continue with others
  • Detailed error logging
  • User notification of issues

Common failures handled:

  • Temporary network issues
  • Rate limit exceeded (waits and retries)
  • Document temporarily unavailable
  • Authentication token expired (auto-refresh)
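
Exponential backoff means the wait doubles after each failed attempt, up to a cap. The following is a minimal sketch of that retry pattern, not the connector's actual retry logic; the function name, default values, the choice of `ConnectionError` as the retryable error, and the added jitter are all assumptions.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, jitter=0.1, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))


# Simulated fetch that fails twice, then succeeds (illustration only).
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network issue")
    return "document body"


waits = []
print(retry_with_backoff(flaky_fetch, jitter=0.0, sleep=waits.append))
print(waits)
# → [1.0, 2.0]
```

Injecting `sleep` keeps the example instant to run; in real use the default `time.sleep` would apply the delays.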

Notifications

You're informed when:

  • Sync completes successfully
  • Documents fail to fetch
  • Authentication expires
  • Rate limits approached
  • Service unavailable

Monitoring & Management

Connector Status

Track connector health:

  • Last successful sync time
  • Next scheduled sync
  • Documents fetched
  • Success/failure counts
  • Current status (active, paused, error)

Available actions:

  • Trigger manual sync
  • Pause/resume syncing
  • Edit configuration
  • View sync history
  • Delete connector

Sync History

View past activity:

  • Sync timestamps
  • Success/failure status
  • Documents processed
  • Error messages
  • Processing time

Use for:

  • Troubleshooting issues
  • Verifying sync schedule
  • Monitoring API usage
  • Audit trail

Rate Limiting & Quotas

Automatic Rate Management

Respects API limits:

  • Configurable delays between requests
  • Automatic backoff on limit warnings
  • Queue management to spread load
  • Pause and resume on quota exhaustion

Google Drive:

  • 1000 requests per 100 seconds (Google's limit; confirm the current quota for your project in the Google Cloud console)
  • Automatic throttling built-in
  • Batch operations when possible
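
A limit of the form "N requests per T seconds" can be enforced client-side with a sliding window of recent request timestamps. The sketch below shows that technique using the figures above as defaults. It is an illustration of throttling in general, not Scrapalot's built-in throttle; the class name and the injectable `clock` are hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` within any `window_seconds` span."""

    def __init__(self, max_requests=1000, window_seconds=100.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps = deque()  # times of requests still inside the window

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if now)."""
        now = self.clock()
        # Drop requests that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.timestamps[0])

    def record(self):
        """Note that a request was just sent."""
        self.timestamps.append(self.clock())


# Small limits and a fake clock so the example runs instantly.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0,
                               clock=lambda: fake_now[0])
limiter.record()
limiter.record()
print(limiter.wait_time())
# → 10.0
fake_now[0] = 4.0
print(limiter.wait_time())
# → 6.0
```

A caller would sleep for `wait_time()` seconds before each request and call `record()` after sending it.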

Firecrawl:

  • Free tier: 500 pages/month
  • Paid tier: Higher limits
  • Tracks usage automatically

Quota Monitoring

Track API usage:

  • Current usage vs. limits
  • Usage by connector
  • Alerts when approaching limits
  • Recommendations to optimize
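
An alert for approaching a limit reduces to comparing current usage against a threshold fraction of the quota. This is a minimal sketch of such a check; the 80% warning threshold, the status names, and the `quota_status` function are assumptions for illustration, not documented Scrapalot behaviour.

```python
def quota_status(used, limit, warn_ratio=0.8):
    """Classify usage as 'ok', 'warning', or 'exhausted'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= 1.0:
        return "exhausted"  # syncing would pause until the quota resets
    if ratio >= warn_ratio:
        return "warning"  # approaching the limit: send an alert
    return "ok"


# Using the Firecrawl free-tier figure of 500 pages/month from above.
print(quota_status(120, 500))
# → ok
print(quota_status(450, 500))
# → warning
print(quota_status(500, 500))
# → exhausted
```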

Best Practices

Connector Setup

Optimize your connectors:

  • Use specific folders/URLs, not entire drives
  • Filter by relevant file types
  • Set appropriate sync frequency
  • Group related content in same connector

Performance

Efficient syncing:

  • Schedule during low-usage hours
  • Avoid hourly sync unless necessary
  • Use manual sync for one-time imports
  • Monitor document count growth

Organization

Keep it maintainable:

  • Name connectors descriptively
  • Document what each connector fetches
  • Review and clean unused connectors
  • Archive completed syncs

Security

Protect your data:

  • Use minimum necessary permissions
  • Review connector access regularly
  • Rotate API keys periodically
  • Remove unused connectors

Troubleshooting

Connector Won't Authenticate

Check:

  • Credentials are correct
  • OAuth consent has not expired
  • API key is valid
  • Service is accessible

Solutions:

  • Re-authorize OAuth
  • Generate new API key
  • Check firewall/network
  • Verify service status

No Documents Fetched

Common causes:

  • Empty folder/source
  • File type filters too restrictive
  • Permission issues
  • Rate limit reached

Solutions:

  • Verify source has content
  • Adjust file type filters
  • Check permissions
  • Review quota usage

Sync Failing Repeatedly

Investigate:

  • Error messages in history
  • Service health status
  • Authentication validity
  • Network connectivity

Fix:

  • Address specific error
  • Re-authenticate if needed
  • Check source availability
  • Contact support if persistent

Use Case Examples

Team Documentation

Scenario: Engineering team stores docs in Google Drive

Setup:

  • Connect to Drive folder
  • Daily sync schedule
  • PDF and Markdown files only
  • Notify on updates

Benefits:

  • Always current documentation
  • No manual uploads
  • Automatic processing
  • Team stays informed

Product Knowledge Base

Scenario: Public help center needs to be searchable

Setup:

  • Firecrawl connector to help site
  • Weekly sync
  • 2-level deep crawl
  • Main content section only

Benefits:

  • Searchable help content
  • Updated automatically
  • Full-text search
  • Citations link back to the original page

Compliance Documents

Scenario: Regulatory documents from internal API

Setup:

  • Custom API connector
  • Monthly sync
  • Authenticated endpoint
  • Document metadata preserved

Benefits:

  • Centralized compliance search
  • Automatic updates
  • Audit trail maintained
  • Secure access

Connectors automate document management so you don't have to upload updates manually. Set them up once, and syncing runs on schedule from then on.

Released under the MIT License.