

1.5+ million PDFs in 25 mins

Check the blog here

Overview

This analysis discusses Zerodha’s overhaul of its PDF contract note generation and delivery system, which transformed a legacy 8-hour batch process into a horizontally scalable, event-driven architecture that produces and sends 1.5+ million digital contract notes within 25 minutes. The new process leverages cloud-native design principles, distributed job orchestration, and high-throughput storage to meet regulatory deadlines under volatile, scale-driven demand[1].

Legacy System Challenges

  • Monolithic, Vertically Scaled Architecture: The original design was a large, single-server setup that sequentially processed each step, resulting in 7-8 hour runtimes.
  • Process Pipeline:
    • Python-generated HTML contract notes from CSV EOD dumps.
    • HTML-to-PDF via Chrome (Puppeteer).
    • PDF digital signature with a Java CLI.
    • Email delivery via a self-hosted SMTP server.
  • Bottlenecks:
    • Long execution times (risking regulatory non-compliance).
    • SMTP server throughput limitations.
    • Scalability ceiling due to single-machine design.
    • Resource waste on spikes due to vertically scaling for peak demand.

New System Architecture

Design Principles

  • Horizontal Scalability: Processes are distributed across ephemeral EC2 instances, saturating all available cores for short-lived, high-throughput processing.
  • Job Chaining/Orchestration: The system decomposes the workflow into independent jobs handled by specialized workers, managed by a custom Go-based messaging library (Tasqueue).
  • Cloud-Native Choices: Leverages stateless compute, object storage (S3), and a job state backend (Redis).

Workflow Overview

  1. Data Processing Worker: Parses exchange data and prepares template-ready input.
  2. PDF Generation Worker: Converts templates to PDFs.
  3. PDF Signing Worker: Digitally signs and encrypts PDFs.
  4. Email Worker: Delivers signed PDFs to users[1].

Each stage writes outputs to S3, which serves as the shared scratch space for the next.

Distributed Job Chaining Example

  • Each contract note passes through four jobs: data processing → PDF generation → PDF signing → email, with outputs (e.g., PDFs) exchanged via S3.
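Zerodha’s Tasqueue library and workers are written in Go; purely as an illustrative sketch of the chaining pattern (not their implementation), the following Python snippet models the four-stage chain, with an in-memory dict standing in for S3 and a deque standing in for the job queue:

```python
from collections import deque

# In-memory stand-ins for S3 (shared scratch space) and the job queue.
s3 = {}
queue = deque()
sent = []

# Each stage reads its input from "S3", writes its output back,
# and enqueues the next stage for the same contract note.
def process_data(note_id):
    s3[f"templates/{note_id}"] = f"template-for-{note_id}"
    queue.append(("generate_pdf", note_id))

def generate_pdf(note_id):
    template = s3[f"templates/{note_id}"]
    s3[f"pdfs/{note_id}"] = f"pdf({template})"
    queue.append(("sign_pdf", note_id))

def sign_pdf(note_id):
    pdf = s3[f"pdfs/{note_id}"]
    s3[f"signed/{note_id}"] = f"signed({pdf})"
    queue.append(("email", note_id))

def email(note_id):
    sent.append(note_id)

handlers = {"process_data": process_data, "generate_pdf": generate_pdf,
            "sign_pdf": sign_pdf, "email": email}

# Seed the chain for a few contract notes and drain the queue.
for note_id in ["user-1", "user-2"]:
    queue.append(("process_data", note_id))
while queue:
    job, note_id = queue.popleft()
    handlers[job](note_id)
```

In the real system each handler runs on a separate fleet of workers, which is what makes each stage independently scalable.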

Core Components and Technical Decisions

1. Worker Implementation

  • Language Shift: Workers were rewritten from Python to Go, yielding significant speed and efficiency gains.
  • Task Distribution: Custom “Tasqueue” job library coordinates which worker does what. Redis tracks job states and triggers targeted retries for failed jobs.
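The real state store is Redis and the workers are Go; the following hypothetical Python sketch only illustrates the tracking-plus-targeted-retry idea, with a plain dict standing in for Redis and invented status fields:

```python
# Hypothetical sketch: a dict stands in for Redis job-state records.
job_state = {}

def run_with_tracking(job_id, fn, max_retries=3):
    """Run fn, recording state transitions; retry only this failed job."""
    for attempt in range(1, max_retries + 1):
        job_state[job_id] = {"status": "running", "attempt": attempt}
        try:
            result = fn()
        except Exception as exc:
            job_state[job_id] = {"status": "failed", "attempt": attempt,
                                 "error": str(exc)}
            continue  # targeted retry of just this job, not the whole batch
        job_state[job_id] = {"status": "done", "attempt": attempt}
        return result
    raise RuntimeError(f"{job_id} exhausted retries")

# A flaky job that succeeds on its second attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("transient failure")
    return "ok"

print(run_with_tracking("sign:user-42", flaky))  # → ok
```

The key property is that a failure is scoped to one job id, so a batch of 1.5 million notes never restarts from scratch.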

2. PDF Generation Pipeline

| Generation Mode | Performance Impact | Problems |
| --- | --- | --- |
| HTML (Puppeteer) | Baseline, but slow and resource-intensive | High resource usage, slow at scale |
| LaTeX (pdflatex) | 10x faster than Puppeteer | Static memory limits (fails on large files) |
| LaTeX (lualatex) | Dynamic memory allocation, handles large PDFs | Occasional failures, difficult stack traces, large Docker images |
| Typst | 2–3x faster for small PDFs; 18x for massive ones | None significant; much leaner Docker images |
  • Modernization: The switch to Typst, a Rust-based typesetting system, resulted in substantial performance boosts (e.g., a 2,000-page PDF renders in ~1 minute with Typst vs. 18 minutes with lualatex), with smaller Docker images and simpler debugging.
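The blog does not show Zerodha’s actual templates; as a rough illustration of generating Typst markup programmatically from trade data (field names and layout invented here), a worker might build the source like this before handing it to the Typst compiler:

```python
# Illustrative sketch: render one contract note as Typst markup from row
# data, ready to compile with the `typst` CLI. All field names are invented.
TEMPLATE = """#set page(paper: "a4")
= Contract Note
*Client:* {client} \\
*Date:* {date}

#table(
  columns: 3,
  [Instrument], [Qty], [Price],
{rows}
)
"""

def render_note(client, date, trades):
    rows = "\n".join(
        f"  [{t['symbol']}], [{t['qty']}], [{t['price']}]," for t in trades
    )
    return TEMPLATE.format(client=client, date=date, rows=rows)

src = render_note("A. Trader", "2024-01-15",
                  [{"symbol": "INFY", "qty": 10, "price": "1,500.00"}])
print(src)
```

Because Typst compiles from a single binary with no browser or TeX distribution, the worker container only needs this text-generation step plus one CLI invocation.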

3. Digital PDF Signing

  • Java OpenPDF: No open-source tool existed to batch sign PDFs efficiently. Zerodha built a lightweight HTTP wrapper over Java’s OpenPDF library, allowing concurrent signing in a JVM running as a sidecar for each worker container.
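The actual wrapper is a Java HTTP service over OpenPDF; this hypothetical Python sketch shows only the worker-side pattern of fanning PDFs out concurrently to such a sidecar, with a stub standing in for the real HTTP call so the sketch runs standalone:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: submit PDFs concurrently to a signing sidecar.
# In production this would POST each file to the sidecar's endpoint;
# here `sign_fn` is pluggable so the example runs without the sidecar.
def sign_batch(pdf_paths, sign_fn, concurrency=8):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order, so results line up with inputs.
        return list(pool.map(sign_fn, pdf_paths))

def fake_sidecar_sign(path):  # stand-in for the HTTP call to the JVM
    return path.replace(".pdf", ".signed.pdf")

signed = sign_batch([f"note-{i}.pdf" for i in range(4)], fake_sidecar_sign)
```

Running the JVM as a long-lived sidecar amortizes its startup cost across the whole batch instead of paying it per PDF, as a CLI invocation would.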

4. Storage and File Exchange

| Storage Option | Read/Write Perf. (10,000 files) | Bottlenecks / Suitability |
| --- | --- | --- |
| EFS (GP mode) | 4–5 seconds | Hit operations-per-second limits; unsuitable |
| EFS (Max I/O) | 17–18 seconds | High latency, worse than GP; unsuitable |
| EBS | 400 ms | Fast, but usable only for local ephemeral storage |
| S3 | 4–5 seconds | Chosen for low cost, acceptable burst capacity, easy ops |
  • S3 Chosen: After tests, S3 was selected for its cost and manageability, despite needing to engineer around its per-prefix rate limits (3,500 write/5,500 read requests per second per prefix).
  • Key Management: Initially used KSUIDs with natural prefix collisions (“hot prefixes”); refactored to maximize key space diversity and avoid rate limiting.
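One common way to avoid hot prefixes (the exact scheme Zerodha settled on is not specified in the blog) is to prepend a short hash-derived shard to each key, so writes spread across many prefixes and each prefix gets its own S3 rate limit. A minimal sketch, with shard count and key layout assumed:

```python
import hashlib

# Sketch: derive a shard from the note id so keys fan out across many
# prefixes instead of piling onto one "hot" date prefix. The shard count
# (64) and key layout are assumptions for illustration.
def diversified_key(note_id, date, shards=64):
    shard = int(hashlib.md5(note_id.encode()).hexdigest(), 16) % shards
    return f"{shard:02x}/{date}/{note_id}.pdf"

keys = {diversified_key(f"user-{i}", "2024-01-15") for i in range(1000)}
prefixes = {k.split("/")[0] for k in keys}
# With 1,000 notes spread over 64 shards, writes hit many prefixes in
# parallel rather than throttling on a single one.
```

This is the inverse of the KSUID problem described above: KSUIDs are time-sortable, so keys minted in the same batch share a leading prefix by construction.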

5. Scaling and Throughput

  • Ephemeral Computing: Instantiates large EC2 fleets dynamically, saturates CPUs, spins down instantly post-batch.
  • File and Worker Pooling: Maximizes concurrency, efficiently balances variable job sizes (e.g., users with 2- vs. 2,000-page reports).
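As a toy illustration of pooling jobs of very different sizes (worker count and page counts invented), a shared pool lets the many small jobs interleave with a large one rather than queue behind it:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
import threading

# Sketch: one 2,000-page report mixed with twenty 2-page ones. A shared
# pool keeps all workers busy; the big job occupies one worker while the
# small jobs flow through the remaining three.
jobs = [("big-report", 2000)] + [(f"small-{i}", 2) for i in range(20)]
pages_done = Counter()
lock = threading.Lock()

def render(job):
    name, pages = job
    with lock:
        pages_done[name] = pages  # record work attributed to this job
    return name

with ThreadPoolExecutor(max_workers=4) as pool:
    finished = list(pool.map(render, jobs))

print(len(finished), sum(pages_done.values()))  # → 21 2040
```

The same principle applies at fleet scale: pulling work from a shared queue self-balances without any central scheduler sizing jobs in advance.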

6. Monitoring and Reliability

  • Redis State Store: Tracks job progress, identifies and triggers retries for failed jobs.
  • Graceful Failure Handling: The fail-fast, distributed design keeps failures isolated to individual jobs, which targeted retry policies then recover.

Key System Design Insights

  • Distributed Batching works best under bursty, high-throughput workloads—design so each worker can independently process units of work with minimal coordination.
  • Toolchain Choices matter: Moving from an interpreted language (Python) to a compiled one (Go), and from heavyweight containers (Chrome, LaTeX) to single binaries (Typst), enables massive reductions in resource usage.
  • Shared Object Storage (S3) proves practical for high-concurrency workloads if request prefix sharding is handled carefully.
  • Custom Lightweight Libraries can outperform generic job brokers when tuned for specific dataflows, especially at extreme scale.

Quantitative Outcomes

  • End-to-End Performance: 1.5+ million PDFs generated, signed, and emailed in ~25 minutes.
  • Resource Efficiency: Negligible incremental costs due to ephemeral compute and optimal containerization.
  • Reliability: Sub-1% error rates for bursty, high-concurrency jobs, with good retry logic.

Lessons and Trade-offs

  • Pros:
    • Order-of-magnitude speedup.
    • Significant cloud cost reduction.
    • System adaptivity to daily volume fluctuations.
    • Simple, maintainable worker design.
  • Cons/Risks:
    • Reliance on S3 limits—requires ongoing bucket prefix diversity management.
    • Digital signing still bottlenecked by FOSS library limitations.
    • Continuous need for observability and error monitoring.

Conclusion

The re-architecture demonstrates a highly effective adoption of distributed systems patterns for time-sensitive, large-scale workload automation in the financial sector. Key innovations include aggressively parallelized worker pools, cloud-native storage strategies, and constant toolchain optimization for both speed and maintainability. Zerodha’s experience offers a model for similar large-scale, compliance-driven batch processing systems[1].


© 2024 • Developer Learning Wiki

Found an error? Please leave a comment here or submit a PR in this repository, thanks!
