Build a Twitter analyzer with a hybrid cloud data pipeline and a single-page web app

This article demonstrates how to build a “what’s trending on Twitter” single-page web app with a hybrid cloud data pipeline powered by SaaSGlue.
Introduction
Learn how to build a data pipeline that creates batches of streaming tweets, analyzes a moving window of tweet batches to extract the most significant topics, and then displays them in a modern single-page web application. All of the code, configuration files, and instructions required to reproduce this data pipeline and web app are available on GitHub here.
The web app looks like this:

It shows the top 5 trending topics on Twitter related to one or more filter rules. Each “topic” is a list of keywords ranked by significance, generated using LDA (Latent Dirichlet Allocation) natural language processing and other text analysis techniques. Users can change the filter rules and then click the “Reset” button to clear the existing topics and launch the data pipeline to regenerate the trending topics for the new rules. Clicking “Stream” automatically runs the data pipeline at one-minute intervals for the next ten minutes. Other schedules can be created using the SaaSGlue web console.
Components
- API: implemented with Node.js, Express, and TypeScript. Exposes functionality for retrieving and modifying Twitter filter rules and for running and scheduling the data pipeline.
- Web app client: implemented with Vue.js (class-style components) and TypeScript. Implements browser push using StompJS to connect to RabbitMQ and receive trending-topic updates, so the web app updates with no page reloads (see the sketch after this list).
- Spark cluster: the GitHub repo includes the Dockerfiles and docker-compose file required to build the Docker images and deploy the cluster.
- LDA/Lemmatization Scala code: extracts the topics from the aggregated tweet files. Compiles to a Java jar file and runs in the Spark cluster.
- SaaSGlue automation platform: a SaaS-based service that schedules and orchestrates the data pipeline. You can sign up for a free account in a minute or two, and the first 1,000 script executions are free. You can easily import this data pipeline into your SaaSGlue team environment; for details see the GitHub repo README.
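As a rough sketch of the push mechanism mentioned in the web app client bullet, the snippet below uses @stomp/stompjs to subscribe to a RabbitMQ exchange over the Web STOMP plugin. The broker URL, credentials, exchange name, and message shape are assumptions for illustration, not details taken from the repo.

```typescript
import { Client, IMessage } from '@stomp/stompjs';

// Connect to RabbitMQ's Web STOMP plugin over WebSockets.
// The URL, credentials, and destination below are illustrative placeholders.
const client = new Client({
  brokerURL: 'ws://localhost:15674/ws',
  connectHeaders: { login: 'guest', passcode: 'guest' },
  reconnectDelay: 5000, // auto-reconnect if the connection drops
});

client.onConnect = () => {
  // Subscribe to the exchange that carries trending-topic updates.
  client.subscribe('/exchange/trending-topics', (message: IMessage) => {
    const topics: string[][] = JSON.parse(message.body);
    renderTopics(topics); // in the real app this would update the Vue component's reactive state
  });
};

client.activate();

function renderTopics(topics: string[][]): void {
  console.log('Top topics:', topics);
}
```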
How the data pipeline works
The Twitter stream data pipeline is implemented as a SaaSGlue job. It can run on a schedule or it can be started manually using the SaaSGlue web console or the SaaSGlue API. A job consists of a DAG that defines the tasks in the pipeline and the routes that determine the pipeline flow. Each task consists of one or more steps, each with an associated script.
When SaaSGlue executes a task, it sends the code and metadata required to run the task to the SaaSGlue agent, which then executes the task steps locally. Tasks can also be executed without an agent as an AWS Lambda function in a SaaSGlue managed AWS environment.
There are two tasks in this data pipeline. The first runs in AWS Lambda. It captures a stream of tweets and uploads them to AWS S3. The second runs on the master node of a Spark cluster. It extracts the five main topics from the captured tweets and pushes them to RabbitMQ.

1. SaaSGlue delivers code and metadata to a special SaaSGlue agent, which creates an AWS Lambda function from the script code, registers it with AWS Lambda, executes it, tails the associated CloudWatch log, and streams the output back to the SaaSGlue API for viewing in the SaaSGlue web console monitor.
2. A JavaScript function captures tweets using the Twitter stream API until it reaches a maximum number of tweets or a timeout period (see the first sketch after this list).
3. If any tweets were captured, they are serialized to a file and uploaded to an AWS S3 bucket.
4. SaaSGlue delivers code and metadata to an agent running on the master node of a Spark cluster.
5. The agent runs a JavaScript script that deletes any tweets outside of the moving window timeframe from the local tweet storage location.
6. The agent runs a shell script that downloads the tweet file uploaded to S3 in step 3 to a local path. The S3 path to the tweet file is passed to this task from the first task as metadata.
7. The agent runs a shell script that starts the Spark job, which analyzes the aggregated tweets and saves the results to a local file.
8. The agent runs a Python script that extracts the topics from the local file and pushes them to RabbitMQ (see the second sketch after this list).
9. RabbitMQ delivers the updated topics to the web client, which updates the view.
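To make steps 2 and 3 concrete, here is a minimal TypeScript sketch of a Lambda-style handler that reads the Twitter v2 filtered stream and uploads a batch of tweets to S3. The stream endpoint is the standard Twitter v2 filtered stream URL; the bucket name, environment variables, and limits are assumptions, and the actual job uses its own JavaScript implementation in the repo.

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});
const STREAM_URL = 'https://api.twitter.com/2/tweets/search/stream';

// Read the filtered stream until we hit a tweet count limit or a timeout.
async function captureTweets(maxTweets: number, timeoutMs: number): Promise<string[]> {
  const tweets: string[] = [];
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const decoder = new TextDecoder();

  try {
    const res = await fetch(STREAM_URL, {
      headers: { Authorization: `Bearer ${process.env.TWITTER_BEARER_TOKEN}` },
      signal: controller.signal,
    });

    outer:
    for await (const chunk of res.body as any) {
      // The stream delivers one JSON-encoded tweet per line; keep-alive lines are blank.
      for (const line of decoder.decode(chunk, { stream: true }).split('\r\n')) {
        if (line.trim()) tweets.push(line);
        if (tweets.length >= maxTweets) break outer;
      }
    }
  } catch {
    // An abort triggered by the timeout simply ends the capture window.
  } finally {
    clearTimeout(timer);
  }
  return tweets;
}

// Lambda-style handler: capture a batch and, if it is non-empty, upload it to S3.
export async function handler(): Promise<string | undefined> {
  const tweets = await captureTweets(500, 45_000);
  if (tweets.length === 0) return undefined;

  const bucket = process.env.TWEETS_BUCKET ?? 'twitter-analyzer-tweets'; // assumed bucket name
  const key = `tweets/${Date.now()}.jsonl`;
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: tweets.join('\n') }));
  return key; // the S3 path, passed to the downstream task as metadata
}
```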
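Step 8 is implemented as a Python script in the repo; the sketch below shows the equivalent publish in TypeScript with amqplib, to stay consistent with the other examples. The connection URL and exchange name are assumptions, chosen to match the STOMP subscription sketch shown earlier.

```typescript
import amqp from 'amqplib';

// Publish the extracted topics so the web client (subscribed via STOMP) gets a push update.
async function publishTopics(topics: string[][]): Promise<void> {
  // Connection URL and exchange name are illustrative placeholders.
  const conn = await amqp.connect('amqp://guest:guest@localhost:5672');
  const channel = await conn.createChannel();

  await channel.assertExchange('trending-topics', 'fanout', { durable: false });
  channel.publish('trending-topics', '', Buffer.from(JSON.stringify(topics)));

  await channel.close();
  await conn.close();
}

// Example: each "topic" is a ranked list of keywords produced by the LDA job.
publishTopics([
  ['launch', 'rocket', 'orbit'],
  ['election', 'vote', 'poll'],
]).catch(console.error);
```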
What’s different about it?
Most data pipeline orchestration solutions are server-based; popular examples include Airflow, Ansible, and Jenkins. This design limits the reach of the data pipeline to the network where the server is located. It is possible, but not ideal, to open network firewalls to broaden the reach of a server-based solution. In addition to the security concerns, executing code remotely across networks introduces the risk of orphaned or failed processes due to network hiccups. And since pipeline workers are often tightly coupled to the server application, provisioning resources to run pipeline tasks can be challenging.
SaaSGlue’s basic design is different in one critical way — SaaSGlue decouples job orchestration from task execution. Jobs are orchestrated by the SaaSGlue API and job tasks are executed locally by the SaaSGlue agent.
This may appear to be a minor architectural difference but the implications are wide reaching.
- Since it is a SaaS service, there is no server to provision and set up and no plugins to configure. You can literally execute your first SaaSGlue job in minutes after signing up for an account.
- No firewall exceptions. Rather than reaching into the host machine and executing code remotely, the API sends task instructions to agents via a message queue. The agent communicates running task status and metadata to the API over https. Consequently, the only connectivity required by the agent is outbound https.
- Adding a new compute resource to your data pipeline is as easy as running the SaaSGlue agent on the new machine.
- Suitable tasks can be executed in AWS Lambda without installing an agent and without provisioning an AWS account or creating and registering a Lambda function.
- Tasks can run code in any language. Since the agent simply hands code execution off to a local interpreter, you can execute scripts or compiled code in any language supported by the host machine.
- SaaSGlue data pipelines can reach across hybrid and multi-cloud environments.
Summary
In this data pipeline, we execute a JavaScript function in AWS Lambda which accesses the Twitter stream API, collects tweets, and saves them to AWS S3. Then we run a series of JavaScript, shell, and Python scripts on the master node of a local Spark cluster to delete stale tweets, download the recently captured tweets, run a Spark job to extract the most significant topics, and push them to RabbitMQ. The only difficult part about creating this data pipeline is writing the code. SaaSGlue makes the orchestration and execution aspects simple.
For more detailed information and documentation, please visit saasglue.com.