Minimize the cost of your AWS batch data pipelines with serverless and ephemeral compute


The most expensive cloud compute is the compute you’re paying for that isn’t doing anything. If you currently manage or plan to build batch-processing data pipelines in AWS, you will want to minimize your cloud costs by using serverless and ephemeral compute.

This article demonstrates how to build a data pipeline that watches for files in an S3 bucket, provisions compute to process the files, processes them, and automatically terminates the compute after an idle timeout period. All code, configuration, and detailed instructions required to reproduce this data pipeline are available on GitHub here. We use SaaSGlue to schedule and automate the data pipeline.

Watch the 2.5-minute demo video

Data pipeline system architecture

SaaSGlue data pipeline system architecture
  1. The SaaSGlue cloud service. We will use the SaaSGlue web console to configure, execute and monitor the data pipeline.
  2. AWS Lambda. We will execute several steps in the data pipeline in AWS Lambda via the SaaSGlue Agent.
  3. AWS S3. The input files to be analyzed will be stored in S3.
  4. AWS EC2. The file processing will be performed via the SaaSGlue Agent using a Clojure application on an EC2 instance that is provisioned and terminated as part of the data pipeline.

How it works

Our data pipeline consists of three jobs:

  1. File Watcher
  2. File Analyzer
  3. Stop Agent and Terminate EC2

File Watcher

This job consists of four tasks:

  1. Check for new files. This task runs a Node.js script in AWS Lambda that checks for new files under a specific S3 bucket/prefix. If it finds any files, it outputs a route code of “ok” and creates a runtime variable named “files” containing a comma-delimited list of the files found. Otherwise, it outputs a route code of “nothing_to_do”, in which case the remaining tasks in the job are skipped.
  2. Launch analyzer host. This task runs a Python script in AWS Lambda that launches a new EC2 instance with a bootstrap script that downloads and starts the SaaSGlue Agent, which will run the actual file processing (a sketch of this step follows the list).
  3. API Login. This task runs a Python script in AWS Lambda that obtains a JWT from the SaaSGlue API and passes it to the next task in the job via a runtime variable.
  4. Create analyze jobs. This task runs a Node.js script in AWS Lambda that creates an instance of the “File Analyzer” job via the SaaSGlue API for each file found in task 1.
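For reference, here is a minimal Python/boto3 sketch of what the “Launch analyzer host” step could look like. The AMI ID, instance type, and agent download URL are placeholders rather than values from the actual pipeline, and the real bootstrap script would also pass the SaaSGlue Agent its access credentials.

```python
# Hypothetical sketch of the "Launch analyzer host" Lambda (Python + boto3).
# The AMI ID, instance type, and agent download URL are placeholders, not
# values from the original pipeline.
import boto3

# Bootstrap (user data) script: download and start the SaaSGlue Agent.
USER_DATA = """#!/bin/bash
curl -sL -o /tmp/sg-agent.tar.gz https://example.com/saasglue-agent.tar.gz
mkdir -p /opt/sg-agent
tar -xzf /tmp/sg-agent.tar.gz -C /opt/sg-agent
/opt/sg-agent/start.sh
"""

def lambda_handler(event, context):
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.medium",         # placeholder instance type
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "sg-analyzer-host"}],
        }],
    )
    instance_id = response["Instances"][0]["InstanceId"]
    print(f"Launched analyzer host {instance_id}")
    return {"instance_id": instance_id}
```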

The SaaSGlue service orchestrates the job’s tasks, ensuring that each task executes in order. We use conditional routing to stop the job after the first task if no files are found. Notice the route code “ok” on the inbound route from “Check for new files” to “Launch analyzer host”.

Task routes

And here is a snippet of the “Check for new files” script:

Code snippet

Notice that if at least one file is found (list.length > 0), a runtime variable named “route” is created with the value “ok”. This is done with a simple SaaSGlue construct that can be used in scripts written in any language: @sgo{"variable_name": "variable_value"}. By printing this string to stdout, the script tells the SaaSGlue service which route code to use to determine the path of the workflow after this task.
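The script shown in the article is Node.js; the following is a rough Python equivalent of the same pattern, included only to illustrate the @sgo convention. The bucket and prefix names are placeholders.

```python
# Rough Python equivalent of the "Check for new files" logic (the article's
# actual script is Node.js). Bucket and prefix names are placeholders.
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-input-bucket", Prefix="incoming/")
    keys = [obj["Key"] for obj in resp.get("Contents", [])]

    if len(keys) > 0:
        # Set the route code and pass the file list to downstream tasks
        print("@sgo" + json.dumps({"route": "ok"}))
        print("@sgo" + json.dumps({"files": ",".join(keys)}))
    else:
        print("@sgo" + json.dumps({"route": "nothing_to_do"}))
```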

File Analyzer

This job consists of a single task named “Analyze file” with two steps. A SaaSGlue task consists of one or more steps that all run in the same runtime environment. In this case, the steps run on the new EC2 instance created by the “Launch analyzer host” task of the File Watcher job.

  1. download file. This step runs a shell script that downloads the file to be processed to the runtime environment. For demo purposes the file is not deleted or moved; in production you would want to move the file after downloading it.
  2. analyze file. This step runs a function in a Java JAR file, compiled from Clojure, that counts the words in the file and writes the results to stdout. The JAR file is uploaded to SaaSGlue as an artifact and attached to the “Analyze file” task; at runtime it is automatically downloaded to the runtime environment (a rough Python equivalent of the word count is sketched after this list).
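The analyzer itself is a Clojure function packaged as a JAR; the snippet below is only a rough Python stand-in for the word-count step, shown to illustrate the shape of the work. It assumes the “download file” step has already written the input to a local path.

```python
# Rough Python stand-in for the Clojure word-count function. Assumes the
# "download file" step has already written the input to a local path.
import sys
from collections import Counter

def count_words(path):
    with open(path, "r", encoding="utf-8") as f:
        words = f.read().lower().split()
    return Counter(words)

if __name__ == "__main__":
    # e.g. python analyze.py /tmp/input-file.txt
    for word, n in count_words(sys.argv[1]).most_common():
        print(f"{word}\t{n}")
```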

Stop Agent and Terminate EC2

This job is launched by the SaaSGlue Agent running on the new EC2 instance after a configurable period of time without running any tasks. It consists of three tasks:

  1. API Login. This task reuses the script used by the task of the same name in the first job; it also runs in AWS Lambda.
  2. Get ec2 inst id and region. This task dynamically targets the new EC2 instance and retrieves its instance ID and region, which are passed to the next task via runtime variables.
  3. Terminate agent. This task runs a Node.js script that stops the SaaSGlue Agent and terminates the EC2 host instance (a minimal teardown sketch follows the list).
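A minimal sketch of the EC2 teardown performed by the “Terminate agent” task might look like the following (the article’s actual script is Node.js). It assumes the instance ID and region arrive as runtime variables from the previous task.

```python
# Sketch of the EC2 teardown in the "Terminate agent" task (the actual script
# is Node.js). Assumes the instance ID and region were passed in as runtime
# variables by the previous task.
import boto3

def terminate_host(instance_id, region):
    ec2 = boto3.client("ec2", region_name=region)
    ec2.terminate_instances(InstanceIds=[instance_id])
    print(f"Requested termination of {instance_id} in {region}")
```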

Conclusion

By using serverless compute via AWS Lambda and ephemeral EC2 instances, we pay only for the compute required to process the data. This implementation could be extended to handle heavier workloads by launching additional EC2 instances as load increases, while still allowing each worker host to shut down when idle.

In this example we are simply counting the occurrences of each word in an input file, but we could just as easily have spun up an EMR cluster and executed a Spark job (see How to build a Twitter analyzer with a hybrid cloud data pipeline and a single page web app for an example showing how to execute a Spark job as a SaaSGlue task). In fact, we could orchestrate any number of dependent and parallel tasks involving resource allocation and processing using the SaaSGlue platform.

