In my recent blog post, “How I Built this Data Platform in One Week,” I used Snowflake as the primary source of truth.
But what if you’re looking for a more cost-effective option, one that avoids Snowflake’s fees while keeping some of the advantages of a database?
One solution: “Lakehouse.”
In this proof of concept (POC), I’ll walk you through how I built a Lakehouse architecture using Delta Lake and AWS Glue (which I believe could be replicated with Databricks).
If you’ve been following my articles, you’ll know that I strive to keep things simple, so that everyone can get a high-level understanding of a typical data engineering workflow.
Infrastructure Overview
To build the entire infrastructure on AWS, I used Terraform.
By “entire infrastructure,” I mean:
ECR repository for storing Docker images
Necessary IAM roles for AWS Lambda and AWS Glue
EventBridge rules that trigger the AWS Lambda function every two hours for data generation
Three S3 buckets: one for storing raw data from Lambda, one for storing our Glue scripts, and one for the Lakehouse
Three databases in the AWS Glue Data Catalog: bronze, silver, and gold, each corresponding to a specific folder in our Lakehouse S3 bucket
An EC2 instance to host our Airflow DAG (alternatively, you can use MWAA, AWS’s managed Airflow service).
Data Sources
For this POC, we needed data. Instead of spending hours searching for the right dataset, I created a custom Python script that generates two distinct JSON files with the following schema:

Where will this Python script reside? AWS Lambda.
Alright, how do we deploy it to AWS Lambda? Using CI/CD with GitHub Actions to push a new Docker image to ECR and create (or update) the Lambda function.
How do we trigger this code? By attaching a trigger to our Lambda function, specifically an EventBridge rule.

Now, we have data uploaded to S3 every 2 hours that we can work with :)
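To give you an idea, here is a minimal sketch of what such a Lambda handler could look like. The bucket name and field names below are placeholders for illustration only; the real schema and code live in the GitHub repo.

```python
import json
import os
import random
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = os.environ["RAW_BUCKET"]  # placeholder env var, assumed to be set by Terraform


def lambda_handler(event, context):
    """Generate fake players and game sessions, then upload them as JSON to the raw S3 bucket."""
    now = datetime.now(timezone.utc)

    # Hypothetical schema for illustration; the real one is in the GitHub repo
    players = [
        {"player_id": str(uuid.uuid4()), "username": f"player_{i}", "country": random.choice(["FR", "US", "DE"])}
        for i in range(10)
    ]
    sessions = [
        {
            "session_id": str(uuid.uuid4()),
            "player_id": random.choice(players)["player_id"],
            "game_genre": random.choice(["RPG", "FPS", "Strategy"]),
            "duration_minutes": random.randint(5, 180),
            "started_at": now.isoformat(),
        }
        for _ in range(50)
    ]

    # One file per entity, keyed by generation timestamp
    prefix = now.strftime("%Y/%m/%d/%H")
    for name, payload in [("players", players), ("game_sessions", sessions)]:
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=f"{name}/{prefix}/{name}.json",
            Body=json.dumps(payload),
        )

    return {"statusCode": 200, "body": f"Uploaded {len(players)} players and {len(sessions)} sessions"}
```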


Players folder in S3

Sample data in the players folder
Pipeline
As mentioned earlier, in this architecture, we have three layers, each with different levels of transformations applied.

First Layer: Bronze
In this layer, we simply store the raw data from the raw S3 bucket in Delta format. A straightforward AWS Glue script reads the JSON data from the raw bucket and writes it to the bronze layer of the Lakehouse as Delta tables.
Here’s a sample function included in the code (the complete code is available on my GitHub):
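In spirit, the bronze job boils down to reading the raw JSON and rewriting it as Delta. Here is a simplified sketch, not the exact code from the repo: bucket names are placeholders, and the Glue job is assumed to be configured with the Delta Lake connector and the Glue Data Catalog as its metastore.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Assumes the Glue job is configured with Delta Lake support
# (e.g. --datalake-formats delta and the Glue Data Catalog enabled as the metastore)
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session


def load_to_bronze(entity: str, raw_bucket: str, lakehouse_bucket: str) -> None:
    """Read the raw JSON for one entity and store it as a Delta table in the bronze layer."""
    df = spark.read.json(f"s3://{raw_bucket}/{entity}/")

    (
        df.write.format("delta")
        .mode("overwrite")
        .option("path", f"s3://{lakehouse_bucket}/bronze/{entity}/")
        .saveAsTable(f"bronze.{entity}")  # registers the table in the catalog (Glue Data Catalog here)
    )


# Placeholder bucket names
load_to_bronze("raw-data-bucket", "raw-data-bucket", "lakehouse-bucket")
```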

Second Layer: Silver
Here, we retrieve data from the bronze layer and perform some data cleaning tasks such as type casting and normalization.
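A simplified sketch of this kind of cleaning for the game sessions table (column names are placeholders, not the exact code from the repo):

```python
from pyspark.sql import functions as F


def build_silver_game_sessions(spark):
    """Clean the bronze game sessions: cast types, normalize strings, drop duplicates."""
    df = spark.table("bronze.game_sessions")

    cleaned = (
        df.withColumn("duration_minutes", F.col("duration_minutes").cast("int"))
        .withColumn("started_at", F.to_timestamp("started_at"))
        .withColumn("game_genre", F.lower(F.trim(F.col("game_genre"))))
        .dropDuplicates(["session_id"])
    )

    (
        cleaned.write.format("delta")
        .mode("overwrite")
        .option("path", "s3://lakehouse-bucket/silver/game_sessions/")  # placeholder bucket
        .saveAsTable("silver.game_sessions")
    )
```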

Third Layer: Gold
This layer is where actual joins and aggregations are performed. It is typically used to construct custom reports for displaying key metrics.
In our case, we create three reports: player-level metrics, game genre metrics, and game session metrics.
Here’s the function responsible for generating the game session metrics, for example:
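As a rough illustration only (the metric and column names below are placeholders, not the exact code from the repo), such an aggregation could look like this:

```python
from pyspark.sql import functions as F


def build_game_session_metrics(spark):
    """Aggregate silver game sessions into daily session-level metrics (illustrative only)."""
    sessions = spark.table("silver.game_sessions")

    metrics = sessions.groupBy(F.to_date("started_at").alias("session_date")).agg(
        F.countDistinct("session_id").alias("total_sessions"),
        F.countDistinct("player_id").alias("active_players"),
        F.avg("duration_minutes").alias("avg_session_duration"),
    )

    (
        metrics.write.format("delta")
        .mode("overwrite")
        .option("path", "s3://lakehouse-bucket/gold/game_session_metrics/")  # placeholder bucket
        .saveAsTable("gold.game_session_metrics")
    )
```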

What the Delta tables look like on S3

Athena and Glue Catalog: the Magic
As mentioned, every time a new Delta table is generated or an existing one is updated, the table is automatically created (or updated) in the Glue Catalog, within the corresponding database.

That makes it very easy to explore the data using SQL in AWS Athena, without the need to use the Glue crawler.
And we can easily connect our BI tool to those databases to build dashboards.
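For example, you can run a query against the gold database programmatically with boto3 (the Athena results bucket below is a placeholder):

```python
import time

import boto3

athena = boto3.client("athena")

# Start the query against the gold database
query = athena.start_query_execution(
    QueryString="SELECT * FROM game_session_metrics ORDER BY session_date DESC LIMIT 10",
    QueryExecutionContext={"Database": "gold"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder bucket
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the rows (header row included)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```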
Tables in Athena — gold layer


Sample Athena query

Query response
Schedule and Orchestrate
For data orchestration and scheduling, Airflow was simply the right choice, thanks to the GlueJobOperator from its Amazon provider package.
By scheduling our data pipeline (in our case, every day at 8 P.M.), we ensure that the tables in our Lakehouse are updated automatically every day and can be queried at any time.
Our workflow is very simple: upload all the scripts to S3 and run those scripts when needed.
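Here is a simplified sketch of what such a DAG could look like with the GlueJobOperator. Job names, the scripts bucket, and the IAM role are placeholders, not the exact DAG from the repo.

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

SCRIPTS_BUCKET = "glue-scripts-bucket"  # placeholder name


def upload_scripts_to_s3():
    """Push the local Glue scripts to the scripts bucket before running the jobs."""
    s3 = boto3.client("s3")
    for script in ["bronze.py", "silver.py", "gold.py"]:
        s3.upload_file(f"scripts/{script}", SCRIPTS_BUCKET, f"scripts/{script}")


with DAG(
    dag_id="lakehouse_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 20 * * *",  # every day at 8 P.M.
    catchup=False,
) as dag:
    upload_scripts = PythonOperator(
        task_id="upload_scripts_to_s3",
        python_callable=upload_scripts_to_s3,
    )

    bronze = GlueJobOperator(
        task_id="bronze_job",
        job_name="bronze_job",  # placeholder Glue job name
        script_location=f"s3://{SCRIPTS_BUCKET}/scripts/bronze.py",
        iam_role_name="glue-role",  # placeholder IAM role
        wait_for_completion=True,
    )

    silver = GlueJobOperator(
        task_id="silver_job",
        job_name="silver_job",
        script_location=f"s3://{SCRIPTS_BUCKET}/scripts/silver.py",
        iam_role_name="glue-role",
        wait_for_completion=True,
    )

    gold = GlueJobOperator(
        task_id="gold_job",
        job_name="gold_job",
        script_location=f"s3://{SCRIPTS_BUCKET}/scripts/gold.py",
        iam_role_name="glue-role",
        wait_for_completion=True,
    )

    # Upload the scripts first, then run the layers in order
    upload_scripts >> bronze >> silver >> gold
```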

Piece of the DAG code
Airflow UI

Note: we could also have added steps to this DAG to send Slack notifications on failure or success.
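For example, a minimal sketch using an Airflow failure callback and a plain Slack incoming webhook (the webhook URL environment variable is a placeholder, not something from the original project):

```python
import os

import requests


def notify_slack_on_failure(context):
    """Airflow on_failure_callback: post a short message to a Slack incoming webhook."""
    task_instance = context["task_instance"]
    message = f":red_circle: Task {task_instance.task_id} in DAG {task_instance.dag_id} failed."
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message})  # placeholder env var


# Then wire it into the DAG, e.g.:
# DAG(..., default_args={"on_failure_callback": notify_slack_on_failure})
```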
Monitoring, CI/CD
I didn’t delve too deeply into it, but for monitoring, since our setup runs on AWS, we used CloudWatch.
For CI/CD, we used GitHub Actions :) (don’t forget to check the repository to see the entire code).
Wrap Up
In this article, I guided you through the process of building a Lakehouse architecture on AWS using services such as S3, IAM, CloudWatch, Glue, and Athena.
I believe you now have an overview of how you can implement this architecture for your business or for learning purposes.
As always, I didn’t delve too deeply into technical details, but you’ll find in the GitHub repository that there was quite a bit of work involved in this project.
GITHUB REPO : https://github.com/Dorianteffo/vg-metrics-lakehouse
— Dorian Teffo