1. Problem Context
E-commerce teams often rely on spreadsheets as their operational source of truth. Meanwhile, e-commerce platforms expose order data through APIs and webhooks.
Manual exports introduce:
- Data drift
- Operational overhead
- No retry semantics
- No consistency guarantees
The objective was not simply to export orders to Google Sheets.
It was to design a fault-tolerant, shared-database multi-tenant synchronization engine with strong tenant isolation that:
- Keeps Google Sheets in sync with order data
- Supports user-defined export configurations
- Executes reliably via manual triggers, webhooks, or dynamic schedules
- Scales safely across tenants
- Remains idempotent under retries and webhook replay
2. High-Level Architecture
The system consists of three runtime roles:
API Server (FastAPI)
- Authenticates tenant installations
- Manages export automation configuration
- Validates constraints
- Creates sync runs
- Enqueues background jobs
- Optionally runs a scheduler loop
Worker (Celery + Redis)
- Executes long-running sync runs
- Streams data from upstream APIs
- Writes to Google Sheets
- Updates run state
MongoDB
- Shared-database multi-tenant persistence
- Tenant-scoped partitioning and query isolation
- Automation definitions
- Run history
- Scheduler locking
- Idempotency keys
This architecture supports:
- Clear separation of concerns
- Safe background execution
- Durable state
- Horizontal scalability
3. Multi-Tenant Isolation Model
All data is scoped by a tenant_id.
Each installation:
- Exchanges an auth code
- Receives a locally minted bearer token
- Has all data partitioned by tenant
Tokens are:
- Stored hashed (SHA-256)
- Revocable
- TTL-controlled
Design principle:
Shared database, strict tenant-level isolation. No intentional cross-tenant queries.
This enables safe horizontal scaling and reduces the risk of accidental data leakage from application-level mistakes.
4. Google OAuth Linking
Each tenant can link one or more Google accounts.
Security measures:
- Server-stored OAuth state nonce
- One-time state consumption
- Expiring state records
- Linked accounts scoped per tenant
- Exactly one default account per tenant (enforced via partial unique index)
This helps ensure:
- Safe OAuth flow
- No replay attacks
- Multi-account flexibility
5. Automation as a First-Class Entity
The core abstraction is the Automation.
An automation defines:
- What to export
- Optional scope filters
- Destination spreadsheet and worksheet
- Dynamic column schema
- Optional scheduling configuration (cron)
Constraints applied:
- No two automations can target the same spreadsheet tab
- Cron must respect minimum execution interval
- Only active automations can execute
Exports become durable, declarative configuration — not ad-hoc jobs.
6. Dynamic Scheduling Design (Why Not Celery Beat?)
Scheduling is per-automation and user-defined.
Each automation has:
- Its own cron expression
- Enable/disable state
- Next execution timestamp
- Locking lifecycle
Schedules are dynamic and mutable at runtime.
Celery Beat is designed for centrally defined periodic tasks registered at startup. It is not a natural fit for:
- Thousands of user-configurable schedules
- Runtime schedule mutations
- Tenant-scoped periodic jobs
Instead, the system uses a data-driven scheduler:
- Eligible automations are selected by
next_run_at <= now() - An atomic update sets
scheduler_locked_untilto claim execution - The scheduler enqueues a run and computes the next execution time
- The next run timestamp is persisted
This enables:
- Safe horizontal scaling
- Reduced risk of duplicate scheduling
- No centralized scheduler coordination
- Fully dynamic per-tenant scheduling
The scheduler is stateful and data-driven, not static and process-bound.
7. Sync Runs as Durable Units of Work
Each execution is represented by a sync_run document.
Every run:
- Has a lifecycle (pending → running → completed/failed)
- Stores metadata
- Stores error summaries
- Is linked to an automation
- Can be retried with idempotency safeguards
Triggers:
- Manual
- Webhook-driven
- Periodic scheduler
Persisting runs provides:
- Observability
- Debuggability
- Auditability
- Retry safety
8. Webhook Ingestion & Idempotency
Webhooks are untrusted and at-least-once by nature.
The system enforces:
- HMAC-SHA256 signature verification
- Tenant resolution via headers
- Event idempotency key stored in Mongo
- TTL window for deduplication
This is intended to provide:
- Replay protection
- Reduced risk of duplicate sync runs
- Robustness under retry storms
The system is explicitly designed for at-least-once delivery.
9. Streaming Export Pipeline
The main technical challenge was preventing memory blowup during large exports.
Naive approach:
- Materialize full dataset
- Transform
- Write
This scales linearly with export size.
Final design:
- Stream rows incrementally
- Transform per-row
- Group consecutive rows by unique order identifier
- Flush bounded batches to Sheets
Explicit batch caps:
- 300 order groups
- 1200 rows
- 1.2MB payload
Effects:
- Memory remains bounded
- External API payload size controlled
- Large exports handled safely
The system scales predictably.
10. Idempotent Sheets Writes
Google Sheets writes use upsert semantics keyed by a unique order identifier.
Write strategy:
- Ensure worksheet exists
- Rewrite header if schema changes
- Read existing row identities
- Update existing rows
- Append new rows
This helps ensure:
- Safe retries
- No logical duplication
- Consistency under worker restarts (assuming idempotency keys remain stable)
Design goal:
- At-least-once execution must not corrupt downstream state.
11. Async ↔ Worker Boundary Stability
FastAPI services are async.
Celery tasks are synchronous entrypoints.
To prevent event loop instability in prefork workers:
- Each task creates a fresh async database client
asyncio.run()is called inside the task- No event loop reuse across executions
This avoids:
RuntimeError: Event loop is closed- Loop contamination across tasks
A subtle but important production stability detail.
12. Tradeoffs & Engineering Decisions
Why streaming instead of full materialization?
Memory safety and scalability.
Why DB-backed scheduler instead of Celery Beat?
Dynamic per-tenant scheduling and atomic claim safety.
Why upsert instead of append-only?
Retry safety and webhook idempotency.
Why tenant-scoped auth?
Hard isolation boundaries and simpler access control.
Why persist sync runs?
Operational debugging and audit trail.
13. Reliability & Observability
Each run records:
- Start and end timestamps
- Status
- Errors
- Metadata
This enables:
- Operational inspection
- Debugging
- Manual re-execution
- Failure transparency