A system architecture for streaming data pipelines that scales automatically while minimizing compute costs.
This article describes an architectural pattern for handling variable-rate workloads with Kubernetes. Part 2 provides a detailed implementation of this pattern including demo code.
This architecture is suitable for streaming data pipelines and parallelized batch processing, and particularly well-suited for volatile workloads with longer task durations.
The problem with Kubernetes Deployments
Applications deployed as Kubernetes Deployments (or StatefulSets) can be scaled manually through the API or automatically with the HorizontalPodAutoscaler (HPA). This can be a great solution for scaling relatively steady workloads with short-lived tasks, such as a REST API backing a web application.
With volatile workloads, back pressure can become a problem. If volume increases faster than resources can scale up, requests will time out.
For tasks with durations longer than a few seconds, scaling down can be problematic. Kubernetes does not distinguish between active and idle pods when scaling down. If it chooses to scale down a pod that is processing a long-running task, either the task will be killed or idle pods will be kept running while waiting for the long-running task to complete. This can be costly in cloud computing environments.
When searching for a solution to this problem, most of what I found involved attempts to coerce the Kubernetes API to keep or eliminate specific pods when scaling down. My own implementations of this type of solution were overly complex and prone to race conditions and inefficiencies.
The solution: decouple scaling up and scaling down
The core concept of this solution is to decouple scaling up and scaling down. Scaling up is performed by a lightweight custom application. Scaling down is performed by the application instances themselves when they run out of tasks.
How does it work?
Use Message Queues to deliver tasks to application instances
Instead of submitting tasks directly to an application pool deployed as a Kubernetes service, each task is published to a message queue. Application instances then receive tasks by subscribing to the message queue. I’d recommend RabbitMQ, but other message queue services offer similar functionality.
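As a minimal sketch of the publishing side, the helper below serializes a work item into a message body (the envelope fields are illustrative, not part of any standard), with the RabbitMQ publish itself shown as a commented `pika` snippet since it assumes a running broker:

```python
import json
import uuid

def make_task_message(task: dict) -> bytes:
    """Serialize a work item into a queue message body.

    A generated id is attached so failed deliveries can be traced;
    the envelope shape here is illustrative.
    """
    envelope = {"id": str(uuid.uuid4()), "task": task}
    return json.dumps(envelope).encode("utf-8")

# Publishing with pika (sketch; assumes a RabbitMQ broker on localhost
# and a queue named "tasks"):
#
#   import pika
#   conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   channel = conn.channel()
#   channel.queue_declare(queue="tasks", durable=True)
#   channel.basic_publish(
#       exchange="",
#       routing_key="tasks",
#       body=make_task_message({"item_id": 42}),
#       properties=pika.BasicProperties(delivery_mode=2),  # persistent
#   )
#   conn.close()
```

Marking both the queue and the messages durable means a broker restart does not lose the backlog, which matters here because the queue is the system's buffer for back pressure.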
Using message queues offers several benefits in this architecture:
- Back pressure — excess workload volume will accumulate in a queue while resources scale up.
- Fault tolerance — failed messages can be automatically requeued. Dead letter queues can be configured to handle exception scenarios.
- Load balancing — work can be distributed across multiple application instances.
- Monitoring — monitor system health and tune your scaling logic.
- API — most message brokers, including RabbitMQ, offer an API that provides queue counts. This information can be used to trigger a scale-up event.
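To make the last point concrete, here is a sketch of reading a queue's depth from the RabbitMQ management API. The endpoint path is the documented `GET /api/queues/{vhost}/{name}`; the base URL and queue name are placeholders, and real use needs an Authorization header:

```python
import json
from urllib.request import Request, urlopen

def parse_queue_depth(payload: dict) -> int:
    # "messages" in the management API response counts ready +
    # unacknowledged messages for the queue
    return int(payload.get("messages", 0))

def fetch_queue_depth(base_url: str, queue: str, vhost: str = "%2F") -> int:
    """Query the RabbitMQ management API for a queue's message count.

    base_url and queue are illustrative; the default vhost "/" must be
    URL-encoded as %2F in the path.
    """
    url = f"{base_url}/api/queues/{vhost}/{queue}"
    req = Request(url)  # add basic-auth credentials in real use
    with urlopen(req) as resp:
        return parse_queue_depth(json.load(resp))
```

The parsed count feeds directly into the scale-up trigger described below; polling it every few seconds is typically enough, since scaling decisions here are coarse-grained.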
Use Kubernetes Jobs to host your application instances
Kubernetes Jobs are intended to host applications that perform a specific task and then shut down. In this case, application instances process messages in a queue until the queue is empty and then exit after a period of inactivity. When the application process exits, Kubernetes automatically terminates the host pod.
Applications scale horizontally simply by creating more Jobs. Each application instance subscribes to the message queue which load balances the work across all available application instances.
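The worker's exit-on-inactivity behavior can be sketched as a small loop. `fetch` and `handle` are injected callables (e.g. a wrapper around pika's `channel.basic_get` and your task logic); the timeout values are illustrative defaults:

```python
import time
from typing import Callable, Optional

def run_worker(
    fetch: Callable[[], Optional[dict]],
    handle: Callable[[dict], None],
    idle_timeout: float = 60.0,
    poll_interval: float = 1.0,
) -> int:
    """Process messages until the queue stays empty for idle_timeout seconds.

    fetch() returns the next message, or None when the queue is empty.
    Returns the number of messages processed; when this function returns
    the process exits, and Kubernetes marks the hosting Job complete.
    """
    processed = 0
    last_seen = time.monotonic()
    while time.monotonic() - last_seen < idle_timeout:
        msg = fetch()
        if msg is None:
            time.sleep(poll_interval)
            continue
        handle(msg)
        processed += 1
        last_seen = time.monotonic()  # reset the inactivity clock
    return processed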
Where possible, Jobs hosting data pipeline applications should be configured to run on auto-scaling node groups to minimize compute costs. Both AWS and GCP offer Kubernetes environments with auto-scaling node groups that scale down to, and up from, 0 nodes.
How scaling up works
Additional application instances are created via the Kubernetes API in response to a trigger event. This is similar to how the Kubernetes HPA works, but the HPA doesn’t offer one-way scaling or scaling with Jobs.
Another problem with using the HPA in this architecture is that by default, the HPA makes scaling decisions based on resource usage. As noted earlier, the use of message queues shifts back pressure from application instances to the message queue. Consequently, resource usage may stay in a normal band even with a large task backlog.
A better way to trigger scaling up in this architecture is by monitoring the size of the message queue. Not only is this a good indicator that more resources are needed, it can help determine how aggressively to scale up. Part 2 of this article will describe an implementation of this logic in detail.
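One simple form this scaling logic can take (the tuning knobs `tasks_per_job` and `max_jobs` are illustrative assumptions, not part of the pattern itself) is to size the fleet from the backlog and launch only the shortfall:

```python
import math

def jobs_to_create(
    queue_depth: int,
    running_jobs: int,
    tasks_per_job: int = 10,
    max_jobs: int = 50,
) -> int:
    """Decide how many new worker Jobs to launch for the current backlog.

    Assumes each instance can reasonably drain ~tasks_per_job messages,
    and caps the fleet at max_jobs. Never returns a negative number:
    scale-down is handled by the workers themselves, not by this logic.
    """
    if queue_depth == 0:
        return 0
    desired = min(max_jobs, math.ceil(queue_depth / tasks_per_job))
    return max(0, desired - running_jobs)
```

Because the queue depth also indicates how far behind the system is, deeper backlogs naturally launch more instances at once, which is the "how aggressively to scale up" signal mentioned above.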
I should note that it is possible to configure the HPA to scale based on external metrics, including message queue size, but for this use case the HPA scaling algorithm is inefficient and unreliable unless application instances process tasks at a fixed rate. The other HPA limitations cited previously also apply.
Putting it all together
Let’s walk through how a work item gets processed step by step.
- A work item is created
- The item is published to the message queue
- The auto scaler application notices one item in the queue — we’ll assume the auto scaler is configured to scale up with a queue size > 0
- The auto scaler queries the Kubernetes API to see how many application instance Jobs are running and sees there are none currently running
- The auto scaler creates an application instance Job via the Kubernetes API
- The application instance starts and subscribes to the message queue
- The application instance receives the message and performs the task
- After a timeout period the application instance terminates itself
- Kubernetes notices the application instance has stopped, marks the Job complete, and terminates the pod on which it was running
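The Job the auto scaler creates in step 5 can be sketched as a plain batch/v1 manifest. The names, labels, and TTL below are illustrative; the dict can be submitted with the official `kubernetes` Python client, e.g. `client.BatchV1Api().create_namespaced_job("default", manifest)`:

```python
def worker_job_manifest(name: str, image: str) -> dict:
    """Build a batch/v1 Job manifest for one worker instance.

    restartPolicy "Never" lets the pod exit cleanly when the worker
    times out; ttlSecondsAfterFinished garbage-collects completed Jobs
    so they don't accumulate as the auto scaler creates more.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"app": "queue-worker"}},
        "spec": {
            "ttlSecondsAfterFinished": 300,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }
```

The `app` label also gives the auto scaler a cheap way to implement step 4: listing Jobs with a label selector yields the count of running instances.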
Part 2 of this article provides a detailed implementation of this pattern. It includes Python code for scaling up a queue worker application and the queue worker logic that terminates the application instance after a period of inactivity.