Case Study: Designing a Multi-Tenant, Dynamically Scheduled Sync Engine

Feb 27, 2026 min read

1. Problem Context

E-commerce teams often rely on spreadsheets as their operational source of truth. Meanwhile, e-commerce platforms expose order data through APIs and webhooks.

Manual exports introduce:

  • Data drift
  • Operational overhead
  • No retry semantics
  • No consistency guarantees

The objective was not simply to export orders to Google Sheets.

It was to design a fault-tolerant, shared-database multi-tenant synchronization engine with strong tenant isolation that:

  • Keeps Google Sheets in sync with order data
  • Supports user-defined export configurations
  • Executes reliably via manual triggers, webhooks, or dynamic schedules
  • Scales safely across tenants
  • Remains idempotent under retries and webhook replay

2. High-Level Architecture

The system consists of three runtime roles:

API Server (FastAPI)

  • Authenticates tenant installations
  • Manages export automation configuration
  • Validates constraints
  • Creates sync runs
  • Enqueues background jobs
  • Optionally runs a scheduler loop

Worker (Celery + Redis)

  • Executes long-running sync runs
  • Streams data from upstream APIs
  • Writes to Google Sheets
  • Updates run state

MongoDB

  • Shared-database multi-tenant persistence
  • Tenant-scoped partitioning and query isolation
  • Automation definitions
  • Run history
  • Scheduler locking
  • Idempotency keys

This architecture supports:

  • Clear separation of concerns
  • Safe background execution
  • Durable state
  • Horizontal scalability

3. Multi-Tenant Isolation Model

All data is scoped by a tenant_id.

Each installation:

  • Exchanges an auth code
  • Receives a locally minted bearer token
  • Has all data partitioned by tenant

Tokens are:

  • Stored hashed (SHA-256)
  • Revocable
  • TTL-controlled

Design principle:

Shared database, strict tenant-level isolation. No intentional cross-tenant queries.

This enables safe horizontal scaling and reduces the risk of accidental data leakage from application-level mistakes.


4. Google OAuth Linking

Each tenant can link one or more Google accounts.

Security measures:

  • Server-stored OAuth state nonce
  • One-time state consumption
  • Expiring state records
  • Linked accounts scoped per tenant
  • Exactly one default account per tenant (enforced via partial unique index)

This helps ensure:

  • Safe OAuth flow
  • No replay attacks
  • Multi-account flexibility

5. Automation as a First-Class Entity

The core abstraction is the Automation.

An automation defines:

  • What to export
  • Optional scope filters
  • Destination spreadsheet and worksheet
  • Dynamic column schema
  • Optional scheduling configuration (cron)

Constraints applied:

  • No two automations can target the same spreadsheet tab
  • Cron must respect minimum execution interval
  • Only active automations can execute

Exports become durable, declarative configuration — not ad-hoc jobs.


6. Dynamic Scheduling Design (Why Not Celery Beat?)

Scheduling is per-automation and user-defined.

Each automation has:

  • Its own cron expression
  • Enable/disable state
  • Next execution timestamp
  • Locking lifecycle

Schedules are dynamic and mutable at runtime.

Celery Beat is designed for centrally defined periodic tasks registered at startup. It is not a natural fit for:

  • Thousands of user-configurable schedules
  • Runtime schedule mutations
  • Tenant-scoped periodic jobs

Instead, the system uses a data-driven scheduler:

  • Eligible automations are selected by next_run_at <= now()
  • An atomic update sets scheduler_locked_until to claim execution
  • The scheduler enqueues a run and computes the next execution time
  • The next run timestamp is persisted

This enables:

  • Safe horizontal scaling
  • Reduced risk of duplicate scheduling
  • No centralized scheduler coordination
  • Fully dynamic per-tenant scheduling

The scheduler is stateful and data-driven, not static and process-bound.


7. Sync Runs as Durable Units of Work

Each execution is represented by a sync_run document.

Every run:

  • Has a lifecycle (pending → running → completed/failed)
  • Stores metadata
  • Stores error summaries
  • Is linked to an automation
  • Can be retried with idempotency safeguards

Triggers:

  • Manual
  • Webhook-driven
  • Periodic scheduler

Persisting runs provides:

  • Observability
  • Debuggability
  • Auditability
  • Retry safety

8. Webhook Ingestion & Idempotency

Webhooks are untrusted and at-least-once by nature.

The system enforces:

  • HMAC-SHA256 signature verification
  • Tenant resolution via headers
  • Event idempotency key stored in Mongo
  • TTL window for deduplication

This is intended to provide:

  • Replay protection
  • Reduced risk of duplicate sync runs
  • Robustness under retry storms

The system is explicitly designed for at-least-once delivery.


9. Streaming Export Pipeline

The main technical challenge was preventing memory blowup during large exports.

Naive approach:

  • Materialize full dataset
  • Transform
  • Write

This scales linearly with export size.

Final design:

  • Stream rows incrementally
  • Transform per-row
  • Group consecutive rows by unique order identifier
  • Flush bounded batches to Sheets

Explicit batch caps:

  • 300 order groups
  • 1200 rows
  • 1.2MB payload

Effects:

  • Memory remains bounded
  • External API payload size controlled
  • Large exports handled safely

The system scales predictably.


10. Idempotent Sheets Writes

Google Sheets writes use upsert semantics keyed by a unique order identifier.

Write strategy:

  • Ensure worksheet exists
  • Rewrite header if schema changes
  • Read existing row identities
  • Update existing rows
  • Append new rows

This helps ensure:

  • Safe retries
  • No logical duplication
  • Consistency under worker restarts (assuming idempotency keys remain stable)

Design goal:

  • At-least-once execution must not corrupt downstream state.

11. Async ↔ Worker Boundary Stability

FastAPI services are async.
Celery tasks are synchronous entrypoints.

To prevent event loop instability in prefork workers:

  • Each task creates a fresh async database client
  • asyncio.run() is called inside the task
  • No event loop reuse across executions

This avoids:

  • RuntimeError: Event loop is closed
  • Loop contamination across tasks

A subtle but important production stability detail.


12. Tradeoffs & Engineering Decisions

Why streaming instead of full materialization?
Memory safety and scalability.

Why DB-backed scheduler instead of Celery Beat?
Dynamic per-tenant scheduling and atomic claim safety.

Why upsert instead of append-only?
Retry safety and webhook idempotency.

Why tenant-scoped auth?
Hard isolation boundaries and simpler access control.

Why persist sync runs?
Operational debugging and audit trail.


13. Reliability & Observability

Each run records:

  • Start and end timestamps
  • Status
  • Errors
  • Metadata

This enables:

  • Operational inspection
  • Debugging
  • Manual re-execution
  • Failure transparency