Why You Should Use Temporal
This document introduces Temporal, a workflow engine, to engineers unfamiliar with its capabilities. Its aim is to demonstrate Temporal’s value for potential use in upcoming projects.
Definitions
Workflow - “At the highest level, workflows tend to be modeled as a set of activities invoked in some sequence where the completion of one activity flows directly into the start of the next activity.” Source: Workflows vs Sagas Presentation
Activity - “An Activity is a normal function or method that executes a single, well-defined action (either short or long running), such as calling another service, transcoding a media file, or sending an email message. Activity code can be non-deterministic.”
Global Algorithm - an algorithm that spans multiple components and can have a lifetime longer than a single process.
Local Algorithm - an algorithm that lives inside a single component that has access to local state.
Distributed System - “A distributed system is a set of concurrent, communicating components that communicate by sending and receiving messages over a network. Each component has exclusive access to its own local state, which is not accessible by any other components. Additionally, the network has exclusive access to its own local state, which is not accessible by any other components, capturing messages that are in flight.” - Dominik Tornow
Core Abstraction - a fundamental abstraction for a given system
Background
Operational engineering teams are tasked with developing technical solutions to address business challenges. These solutions are required to meet several non-functional requirements, including reliability, durability, fault tolerance, and scalability. Building on top of an event-driven microservice architecture necessitates the creation of a global algorithm tailored to specific business needs. A global algorithm spans multiple processes or microservices, orchestrating a cohesive operation across different components. An illustrative example is the sequence of steps needed to complete an order, where multiple services interact to perform a globally coordinated set of actions. Within this framework, each microservice executes its own local algorithms, which are sets of steps confined to a single process.
To construct these global algorithms, engineers utilize a combination of workers, web servers, event queues, and cron jobs. In doing so, they address inherent distributed systems challenges such as error handling, timeouts, failures, and workload distribution. These challenges must be addressed at the application level since there is not a solution at the platform level that sufficiently handles these problems.
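As a rough illustration of what handling these challenges at the application level tends to look like, here is a minimal sketch of a hand-rolled retry loop with a deadline. It is plain Python, not code from any particular system, and every name in it (`TransientError`, `call_with_retry`, `make_flaky`) is hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for a failure that is worth retrying."""


def call_with_retry(operation, max_attempts=3, deadline_seconds=1.0, backoff_seconds=0.0):
    """Run operation until it succeeds, attempts run out, or the deadline passes."""
    started = time.monotonic()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Timeliness: stop if the overall time budget is already spent.
        if time.monotonic() - started > deadline_seconds:
            raise TimeoutError(f"gave up after {attempt - 1} attempts")
        try:
            return operation()
        except TransientError as err:
            # Failure: remember the error and wait a little longer each time.
            last_error = err
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error


def make_flaky(failures_before_success):
    """Build a hypothetical operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("temporary outage")
        return f"ok after {state['calls']} calls"

    return operation


print(call_with_retry(make_flaky(2)))
```

Note that the attempt counter and the deadline live only in the memory of one process. If that process dies mid-loop, the bookkeeping is gone, which is part of why this kind of code has to be rewritten and re-hardened for each solution.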
These global algorithms are often not straightforward, and their true behavior can depart from what engineers believe is happening. They are also challenging to test automatically. This likely increases the number of mistakes engineers make when adding features, which in turn decreases our reliability.
Engineers repeatedly solve the same set of problems in distributed systems for each solution. This is an indication that a generalized solution is needed.
In database systems, there exists a universal solution that addresses a multitude of challenges. Databases that adhere to ACID (Atomicity, Consistency, Isolation, Durability) standards offer core abstractions, such as transactions and rollbacks, while ensuring a robust set of guarantees. These databases guarantee that a transaction is either executed without error or completely rolled back, undoing all changes. This system so effectively abstracts away many challenges, such as concurrent access, that most engineers often overlook them when utilizing the system.
Problem Statement
A generalized solution is needed that abstracts away most distributed systems problems, in the same way ACID-compliant databases abstract away most database system problems. This solution should also make the global algorithm easier to find and understand. Such a system would allow operational engineers to focus more on business logic and less on distributed systems problems, leading to increased developer productivity and improved consistency across our infrastructure.
Solution
Distributed systems face two related challenges: failure and timeliness. Failure is hardest when it is partial, meaning some operations have already taken place and others have not. Timeliness concerns operations that do not finish within the expected time and time out. Both challenges are top of mind, and there is no established core abstraction that mitigates them at the platform level.
The Workflow is a newer abstraction that addresses both. A Workflow is a sequence of commands. It guarantees that the state of a distributed system after executing a Workflow is equivalent to the state after executing that Workflow exactly once to completion. Workflows mitigate failure and timeliness at the platform level, which removes the need to handle these challenges at the application level.
Temporal is an open source workflow system. Temporal allows you to author your workflow in code. Languages include, but are not limited to, Go, JavaScript, and Java. Temporal lifts a regular function into a workflow. Here is an example of a workflow:
This function is very clear about what it does, but it could not run on top of a distributed system as is. If the process executing this function were to die, the state of the function would be lost and could not be recovered. Temporal allows you to define your functions in this manner (using SDK functions) and ensures that they execute to completion even in the presence of faults and failures such as server crashes, timeouts, and restarts.
Utilizing Temporal would increase durability and reliability while also increasing developer productivity. Nuon, an infra startup, claims a 36x increase in developer efficiency compared to their previous solution. That number sounds implausibly large, but consider how much easier it is to store and access data in a database than to manage it in application code.
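To illustrate the kind of function being discussed, here is a hypothetical stand-in written in plain Python. It is not Temporal SDK code and not the author's example. The step names and the `crash_after` parameter are assumptions made up for this sketch, with `ProcessCrash` standing in for the process dying partway through.

```python
class ProcessCrash(Exception):
    """Stands in for the process dying partway through the function."""


def complete_order(order_id, side_effects, crash_after=None):
    """Three sequential steps, written as an ordinary function."""
    steps = ["charge_card", "reserve_inventory", "send_confirmation"]
    for index, step in enumerate(steps):
        if crash_after is not None and index == crash_after:
            # All progress tracking lived in local variables, so it is lost here.
            raise ProcessCrash(step)
        side_effects.append((step, order_id))
    return "completed"


effects = []
try:
    complete_order("order-1", effects, crash_after=1)  # dies after charging the card
except ProcessCrash:
    pass

# A naive recovery reruns the function from the top and charges the card again.
complete_order("order-1", effects)
print(effects.count(("charge_card", "order-1")))
```

The function reads clearly, but nothing outside the process remembers which steps already ran, so a rerun repeats the first step. Closing that gap without cluttering the function is the problem a workflow engine takes on.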
Temporal Core Abstractions
Temporal has two core abstractions: Workflows and Activities. A Workflow Definition defines a Workflow Execution. To construct a Workflow Definition, you simply write a function using a Temporal SDK. Just like a normal function, it can take arguments when executed. Workflows do have one constraint: they must be deterministic. That means, given the same input, a Workflow must execute the same series of commands in the same order if re-executed. An Activity is a single, well-defined action (either short or long running) such as calling another service, transcoding media, or sending an email. Activity code may be non-deterministic, but it is recommended that it be idempotent. Combined, these deliver durability, reliability, and scalability.
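The determinism constraint is easier to see with a toy model. The sketch below is plain Python and is not how Temporal is implemented. It assumes a simplified scheme in which every activity result is recorded in a history, and a re-executed workflow replays recorded results instead of running the activity again. All names (`ReplayContext`, `run_until_complete`, `charge`, `ship`) are hypothetical.

```python
class Crash(Exception):
    """Stands in for a worker process dying mid-workflow."""


class ReplayContext:
    """Records each activity result so a re-execution can replay it."""

    def __init__(self, history):
        self.history = history  # list of (activity name, result); survives crashes
        self.position = 0       # index of the next command in this execution

    def execute_activity(self, name, fn, *args):
        if self.position < len(self.history):
            # Already ran in an earlier execution: reuse the recorded result.
            recorded_name, result = self.history[self.position]
            if recorded_name != name:
                raise RuntimeError(f"non-deterministic: expected {recorded_name}, got {name}")
        else:
            result = fn(*args)
            self.history.append((name, result))
        self.position += 1
        return result


calls = []                    # real side effects performed by activities
crash_once = {"armed": True}  # makes the first shipping attempt crash


def charge(order_id):
    calls.append("charge")
    return f"receipt-{order_id}"


def ship(order_id):
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise Crash("worker died before shipping")
    calls.append("ship")
    return f"shipped-{order_id}"


def order_workflow(ctx, order_id):
    receipt = ctx.execute_activity("charge", charge, order_id)
    tracking = ctx.execute_activity("ship", ship, order_id)
    return receipt, tracking


def run_until_complete(workflow, *args):
    history = []
    while True:
        try:
            return workflow(ReplayContext(history), *args)
        except Crash:
            continue  # restart from the top; the history makes this safe


print(run_until_complete(order_workflow, "o1"), calls)
```

In this model the workflow function runs twice, but the card is charged once, because the second execution replays the recorded result. That only works if the workflow issues the same commands in the same order each time, which is why non-deterministic logic belongs in activities.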
Temporal Architecture
Temporal consists of a set of client SDKs and a cluster (server). The server orchestrates all actions within the system and communicates with clients over gRPC.
The cluster has a set of services and a database. There are many possible configurations, ranging from a single container that contains all of the services with a single database to a multi-regional cluster maintained by a set of Helm charts. Temporal also provides a UI and CLI tools that allow you to observe and interact with the cluster. All of these tools are open source under the MIT License (at the time of writing). If you’d like more detail, please reference their documentation.
When developing on top of Temporal, you write the Workflow and Activity code in your own microservice. You then register a worker with the Temporal cluster under a namespace and task queue. A service (usually the service that defines the workflow) triggers the execution of a workflow via the client SDK or CLI. The cluster fields this call and starts the execution of the workflow on a worker registered with the cluster. Note that any number of workers can be registered with the cluster, and the commands within a workflow could be executed on any of those workers.
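The relationship between workers and the queue they listen on can be sketched with an in-process stand-in. The example below uses only Python's standard `queue` and `threading` modules. The real cluster dispatches work over the network, so treat this as an analogy under that assumption, and `run_workers` and its parameters as hypothetical names.

```python
import queue
import threading


def run_workers(queue_name, tasks, worker_count):
    """Have worker_count workers poll one shared queue until it is drained."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    completed = []  # (worker label, task) pairs, in completion order
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                completed.append((f"{queue_name}/worker-{worker_id}", task))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return completed


done = run_workers("orders", ["charge", "reserve", "confirm"], worker_count=2)
print(sorted(task for _, task in done))
```

Which worker picks up which task is not fixed and can differ from run to run. The caller only relies on every task being picked up by some worker, which mirrors the note above that a command could be executed on any registered worker.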
Temporal the Company
Temporal Technologies was co-founded in 2019 by Maxim Fateev and Samar Abbas. Fateev and Abbas, known for their expertise in this field, previously led the design and development of Amazon’s Simple Workflow Service and contributed to the creation of business-critical systems like the Durable Task Framework and Uber’s Cadence. Their extensive experience at major tech companies like Amazon, Microsoft, and Google laid the foundation for Temporal’s technology.
The company has seen significant growth and interest, evidenced by its $103 million Series B funding, which brought Temporal’s total funding to $128 million at a valuation of $1.5 billion. Companies using Temporal include Netflix, DoorDash, Snap, Box, Stripe, HashiCorp, and Coinbase. Usage at each company varies, but some of these companies run on the order of 1 billion workflow executions a day.
The company has a cloud offering and likely charges for consulting.
Notable Use Cases
Nuon
Nuon is an infra startup. They have an external API that allows clients to spin up infrastructure in their customers’ clouds. An example given was a Nuon customer spinning up a MongoDB cluster in one of their clients’ accounts. Infrastructure provisioning is error prone and long running. They remodeled their whole provisioning and execution system on top of Temporal. They make use of long-running workflows, workflow queries, and signals to build an event loop.
This is a slide from their presentation which describes the event loop. Once the workflow starts, the core operations are triggered via signals. Here’s how simple the code is to represent what we have above:
This part of the workflow looks like a Go select statement and works in a very similar manner. A notable difference is that the workflow can exist indefinitely, while a normal select statement only lives for as long as the process it is running in.
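The shape of a signal-driven event loop can be sketched in a few lines of plain Python. This is not Nuon's code and not Temporal SDK code. The signal names (`provision`, `deprovision`, `shutdown`) and the `provisioning_loop` function are hypothetical, and a list of pre-collected signals stands in for signals arriving over time.

```python
from collections import deque


def provisioning_loop(signals):
    """Consume signals one at a time and update state, like a select loop."""
    state = {"status": "idle", "handled": []}
    inbox = deque(signals)
    while inbox:
        name, payload = inbox.popleft()
        state["handled"].append(name)
        if name == "provision":
            state["status"] = f"provisioned:{payload}"
        elif name == "deprovision":
            state["status"] = "idle"
        elif name == "shutdown":
            state["status"] = "stopped"
            break  # later signals are never processed
    # Returning the state plays the role of answering a query.
    return state


final_state = provisioning_loop([
    ("provision", "mongodb"),
    ("deprovision", None),
    ("shutdown", None),
])
print(final_state["status"])
```

In this sketch the loop ends when the list runs out. A long-running workflow would instead wait for the next signal, possibly for months, and the engine would keep the loop's state alive across process restarts.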
Outcomes:
- 36x increase in dev velocity
- Combined 3 services into 1
- Grew from 26 endpoints to 125+ in ~2 months
Their Video Presentation is well worth a watch.
Instacart
Instacart, a gig economy company, has started using Temporal for its core operations.
Takeaways:
- Heavy use in infrastructure and payment teams for 2.5+ years
- Stats
  - 11 million workflow executions a day
  - 200 workflow types
  - < 5 incidents, none were SV0 or SV1
- Previously used a combination of Airflow and Sidekiq jobs
- Chose Temporal over AWS SWF, Cadence, and rolling their own
Closing Thoughts
In conclusion, Temporal represents a significant advancement in the management and execution of workflows in distributed systems. Its ability to abstract away the complexities of distributed systems programming, providing robust solutions to issues of failure and timeliness, positions it as an essential tool for building modern applications on top of distributed systems.
The adoption of Temporal can change the way operational engineering teams approach the creation and management of global algorithms within an event-driven microservice architecture. By offering a systematic approach to error handling, timeouts, failures, and workload distribution, Temporal addresses many of the challenges that are inherent in distributed systems. This not only increases system reliability and durability but also substantially boosts developer productivity.
The case studies from Nuon and Instacart, along with adoption by companies such as Netflix and DoorDash, illustrate the scalability and efficiency of Temporal in handling complex, large-scale distributed tasks.
Adopting Temporal could be a strategic move for this company. It aligns well with our focus on developing robust, scalable, and reliable solutions. By leveraging Temporal’s capabilities, we can expect not only an increase in efficiency and reduction in the time spent on developing and debugging distributed systems but also a more straightforward, maintainable, and scalable approach to building our microservices architecture.