The Swyx Mixtape

Technology

[Weekend Drop] Temporal — the iPhone of System Design

2021-07-25

This is the audio version of the essay I published on Monday.

I'm excited to finally share why I've joined Temporal.io as Head of Developer Experience. It's taken me months to precisely pin down why I have been obsessed with Workflows in general and Temporal in particular.

It boils down to 3 core opinions: Orchestration, Event Sourcing, and Workflows-as-Code.

Target audience: product-focused developers who have some understanding of system design, but limited distributed systems experience and no familiarity with workflow engines

30 Second Pitch

The most valuable, mission-critical workloads in any software company are long-running and tie together multiple services.

Because this work relies on unreliable networks and systems:
- You want to standardize timeouts and retries.
- You want to offer "reliability on rails" to every team.
Because this work is so important:
- You must never drop any work.
- You must log all progress.
Because this work is complex:
- You want to easily model dynamic asynchronous logic...
- ...and reuse, test, version and migrate it.

Finally, you want all this to scale. The same programming model should carry you from small use cases to millions of users without re-platforming. Temporal is the best way to do all this, by writing idiomatic code known as "workflows".

Requirement 1: Orchestration

Suppose you are executing some business logic that calls System A, then System B, and then System C. Easy enough, right?

But:

- System B has rate limiting, so sometimes it fails right away and you're just expected to try again some time later.
- System C goes down a lot, and when it does, it doesn't actively report a failure. Your program is perfectly happy to wait an infinite amount of time and never retry C.

You could deal with B by just looping until you get a successful response, but that ties up compute resources. Probably the better way is to persist the incomplete task in a database and set a cron job to periodically retry the call.

Dealing with C is similar, but with a twist. You still need the same retry code you wrote for B, but you also need another (shorter-lived, independent) scheduler to place a reasonable timeout on C's execution time, since it doesn't report failures when it goes down.
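To make the two failure modes concrete, here is a minimal in-process sketch of a timeout plus retry wrapper. The names `withTimeout`, `callWithRetry`, and the option fields are hypothetical, chosen only for illustration. Note that this version still holds a process open while it waits and loses its place if that process dies, which is exactly why the text suggests persisting incomplete work instead.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// This covers the "System C" case: a call that hangs without reporting failure.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `fn` up to `maxAttempts` times, waiting `backoffMs` between attempts.
// This covers the "System B" case: a call that fails fast and should be retried later.
async function callWithRetry(fn, { maxAttempts = 3, timeoutMs = 1000, backoffMs = 100 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) await sleep(backoffMs);
    }
  }
  throw lastError; // every attempt failed or timed out
}
```

Every caller of B or C would have to wrap its calls this way, which is the duplication the next paragraphs argue against.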

Do this often enough and you soon realize that timeouts and retries are really standard production-grade requirements when crossing any system boundary, whether you are calling an external API or just a different service owned by your own team.

Instead of writing custom code for timeouts and retries for every single service every time, is there a better way? Sure, we could centralize it!

We have just rediscovered the need for orchestration over choreography. There are various names for the combined A-B-C system orchestration we are doing. Depending on who you ask, this is called a Job Runner, a Pipeline, or a Workflow.
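As a rough sketch of what centralizing looks like, the toy orchestrator below applies one shared timeout and retry policy to every step, so the step functions themselves contain none of that logic. `runStep`, `runWorkflow`, and the `policy` fields are hypothetical names for illustration; this is not Temporal's API, and unlike a real orchestrator it keeps no durable state.

```javascript
// Run one step under the shared policy: each attempt gets `timeoutMs`,
// and the step is retried until `maxAttempts` is reached.
async function runStep(step, policy) {
  for (let attempt = 1; ; attempt++) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error(`${step.name} timed out`)), policy.timeoutMs);
    });
    try {
      return await Promise.race([step.run(), timeout]);
    } catch (err) {
      if (attempt >= policy.maxAttempts) throw err; // out of attempts: give up
    } finally {
      clearTimeout(timer);
    }
  }
}

// Run the steps in order (A, then B, then C) and collect results by step name.
async function runWorkflow(steps, policy = { maxAttempts: 3, timeoutMs: 1000 }) {
  const results = {};
  for (const step of steps) {
    results[step.name] = await runStep(step, policy);
  }
  return results;
}
```

A caller would pass something like `[{ name: 'A', run: callA }, { name: 'B', run: callB }, { name: 'C', run: callC }]`, where `callA`, `callB`, and `callC` are plain functions that know nothing about retries.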

Honestly, what interests me (more than the deduplication of code) is the deduplication of infrastructure. The maintainer of each system no longer has to provision the additional infrastructure needed for this stateful, potentially long-running work. This drastically simplifies maintenance — you can shrink your systems down to as small as a single serverless function — and makes it easier to spin up new ones, with the retry and timeout standards you now expect from every production-grade service. Workflow orchestrators are "reliability on rails".

But there's a risk of course — you've just added a centralized dependency to every part of your distributed system. What if it ALSO goes down?

Requirement 2: Event Sourcing

The work that your code does is mission critical. What does that really mean?

- We cannot drop anything. All requests to start work must either result in error or success, with no "it was supposed to be running but got lost somewhere" mismatch in expectations.
- During execution, we must be able to resume from any downtime. If any part of the system goes down, we must be able to pick up where we left off.
- We need the entire history of what happened when, for legal compliance, in case something went wrong, or if we want to analyze metadata across runs.

There are two ways to track all this state. The usual way starts with a simple task queue, and then adds logging:

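    // Assumes taskQueue, logEvent, and doWork are defined elsewhere.
    (async function workLoop() {
      const nextTask = taskQueue.pop()
      await logEvent('starting task:', nextTask.ID)
      try {
        await doWork(nextTask) // this could fail!
        await logEvent('completed task:', nextTask.ID) // only log completion on success
      } catch (err) {
        await logEvent('reverting task:', nextTask.ID, err)
        taskQueue.push(nextTask) // put the task back so it is retried
      }
      setTimeout(workLoop, 0) // schedule the next iteration
    })()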

But logs-as-afterthought has a bunch of problems.

- The logging is not tightly paired with the queue updates. If it is possible for one to succeed but the other to fail, you either have unreliable logs or dropped work, which is unacceptable for mission critical work. This could also happen if the central work loop itself goes down while tasks are executing.
- At the local level, you can fix this with batch transactions. Between systems, you can create two-phase commits. But this is a messy business and further bloats your business code with a ton of boilerplate, IF (a big if) you have the discipline to instrument every single state change in your code.
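For a sense of what "batch transactions" means at the local level, here is a small sketch in which the queue update and the log entry are committed together or not at all. The in-memory `Store` class and `claimNextTask` are hypothetical stand-ins for a real database transaction, used only to show the all-or-nothing shape.

```javascript
// Hypothetical in-memory store: the queue and the log live side by side,
// and `transaction` applies a batch of changes to both or to neither.
class Store {
  constructor() {
    this.queue = [];
    this.log = [];
  }

  transaction(mutate) {
    // Work on copies so a failure halfway through leaves the real state untouched.
    const draft = { queue: [...this.queue], log: [...this.log] };
    mutate(draft); // if this throws, nothing below runs and nothing is committed
    this.queue = draft.queue;
    this.log = draft.log;
  }
}

// Pop a task and record that it started, as a single atomic step.
function claimNextTask(store) {
  let task;
  store.transaction((draft) => {
    task = draft.queue.pop();
    if (task === undefined) throw new Error('queue is empty');
    draft.log.push({ event: 'starting task', id: task.ID });
  });
  return task;
}
```

Even in this toy form, every state change has to be routed through `transaction` by hand, which is the boilerplate and discipline cost described above.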

The alternative to logs-as-afterthought is logs-as-tr...