Orchestrate an ETL process using AWS Step Functions for Amazon Redshift

Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing a loosely coupled ETL orchestration process using AWS Step Functions, AWS Lambda, and AWS Batch that targets an Amazon Redshift cluster.

Because Amazon Redshift uses columnar storage, it is well suited for fast analytical insights using familiar ANSI SQL queries. You can scale your Amazon Redshift clusters up and down in minutes to meet demanding workloads, both for end-user reports and for timely data refreshes into the data warehouse.

AWS Step Functions makes it easy to develop and use repeatable workflows that scale well. Step Functions lets you build automation workflows from individual Lambda functions. Each function performs a discrete task and lets you develop, test, and modify the components of your workflow quickly and seamlessly.

An ETL process refreshes your data warehouse from source systems, organizing the raw data into a format you can more readily use. Most organizations run ETL as a batch or as part of a real-time ingest process to keep the data warehouse current and provide timely analytics.

A fully automated and highly scalable ETL process helps minimize the operational effort that you must invest in managing regular ETL pipelines. It also ensures the timely and accurate refresh of your data warehouse. You can tailor this process to refresh data into any data warehouse or data lake.

This post also provides an AWS CloudFormation template that launches the entire sample ETL process in one click to refresh the TPC-DS dataset. Find the template link in the Set up the entire workflow using AWS CloudFormation section.

Architectural overview

The following diagram illustrates the architectural overview of the different components involved in the orchestration of the ETL workflow. This workflow uses Step Functions to fetch source data from Amazon S3 to refresh the Amazon Redshift data warehouse.


Here are the core components of the workflow:

  • Amazon CloudWatch triggers the ETL process based on a schedule, through the AWS CLI, or using the various AWS SDKs in a Lambda function.
  • The ETL workflow uses Step Functions for the multi-step ETL process, coordinating AWS services into serverless workflows that you can build and iterate on quickly using JSON-based templates. For example, a typical ETL process may refresh the dimension tables first and the fact tables afterward. You can declare that order of operations with a Step Functions state machine.
  • A Lambda function lets you build microservices to coordinate job submission and monitoring, while the workflow logic, parallel processing, error handling, timeouts, and retries are handled for you (a minimal sketch of such a function follows this list).
  • AWS Batch runs several ETL jobs such as transforms and loads into Amazon Redshift. AWS Batch manages all the infrastructure for you, avoiding the complexities of provisioning, managing, monitoring, and scaling your batch computing jobs. It also lets you wait for the jobs to complete.
  • The source data in Amazon S3 refreshes an Amazon Redshift data warehouse through a psql container. To specify the ETL logic, I use .sql files that contain the SQL code for a particular step. For example, a .sql file for a typical dimension table refresh contains steps to load the data from Amazon S3 into a temporary staging table and INSERT/UPDATE the target table. Before beginning, review a sample dimension table .sql file.
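
To make that coordination concrete, here is a minimal sketch (in Python, using boto3) of a Lambda handler that submits one ETL step to AWS Batch. The job queue, job definition, and the BATCH_FILE_S3_URL/BATCH_FILE_TYPE environment variables are illustrative assumptions, not values taken from this post.

```python
# Hypothetical Lambda handler: submit one ETL step (a .sql file) as an AWS Batch job.
# The queue/definition names and the environment-variable convention are assumptions.
import boto3

batch = boto3.client("batch")

def lambda_handler(event, context):
    # The state machine passes the .sql file for this step in the execution input.
    sql_file_url = event["sql_file_s3_url"]        # e.g. s3://my-etl-bucket/store_sales.sql (placeholder)
    response = batch.submit_job(
        jobName=event.get("job_name", "etl-step"),
        jobQueue="etl-job-queue",                  # assumed AWS Batch job queue
        jobDefinition="fetch-and-run-psql",        # assumed job definition built from the Docker image
        containerOverrides={
            "environment": [
                {"name": "BATCH_FILE_S3_URL", "value": sql_file_url},
                {"name": "BATCH_FILE_TYPE", "value": "script_psql"},
            ]
        },
    )
    # Return the job ID so a later state (or a poller Lambda) can check completion with describe_jobs.
    return {"jobId": response["jobId"]}
```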

You can execute the workflow and monitor it using the state machine. You can trigger the ETL according to a schedule or an event (for example, as soon as all the data files arrive in S3).
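
For instance, a small Lambda function invoked by a CloudWatch Events/EventBridge schedule or an S3 event notification could start the state machine. The state machine ARN and the input fields below are placeholders for illustration.

```python
# Hypothetical trigger: start the ETL state machine from a schedule or an S3 event.
import json
import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:etl-workflow"  # placeholder

def lambda_handler(event, context):
    # Pass the dataset date (or any other run parameters) as the execution input.
    execution_input = {"dataset_date": event.get("dataset_date", "2002-01-02")}  # placeholder default
    response = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps(execution_input),
    )
    return {"executionArn": response["executionArn"]}
```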

Prerequisites

Before you get started, create a Docker image that can execute .sql files. AWS Batch creates the resources for executing the ETL steps using this Docker image.

If this is your first time using AWS Batch, create an environment to build and register the Docker image. For this post, register the image in an Amazon ECR repository. ECR repositories are private by default, which makes them well suited for AWS Batch jobs.
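
If you prefer to script that setup, a short boto3 call can create the repository; the repository name below is an assumption.

```python
# Hypothetical setup: create a private Amazon ECR repository for the psql Docker image.
import boto3

ecr = boto3.client("ecr")
repo = ecr.create_repository(repositoryName="fetch-and-run-psql")  # assumed repository name
print(repo["repository"]["repositoryUri"])  # push the Docker image to this URI
```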

Building the fetch and run psql Docker image

To build the Docker image, follow the steps outlined in the post Creating a Simple “Fetch & Run” AWS Batch Job.

Use the following Docker configuration and fetch-and-run psql script to build the image:

  1. DockerFetchRunPsqlUbundu
  2. fetch_and_run_psql.sh

Follow the steps in that post to push the Docker image to your Amazon ECR repository. After you complete these steps, your Docker image is ready to run .sql files against an Amazon Redshift cluster.
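
One way to make the image available to the workflow is to register an AWS Batch job definition that points at it. The names, ARNs, and resource sizes below are illustrative assumptions rather than values from this post.

```python
# Hypothetical AWS Batch job definition referencing the psql image in Amazon ECR.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="fetch-and-run-psql",  # assumed name, referenced when submitting jobs
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run-psql:latest",  # placeholder URI
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "1024"},
        ],
        "jobRoleArn": "arn:aws:iam::123456789012:role/etl-batch-job-role",  # placeholder IAM role
    },
)
```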

Example: ETL process using TPC-DS dataset

This example uses a subset of the TPC-DS dataset to demonstrate a typical dimensional model refresh. Here is the Entity Relationship diagram of the TPC-DS data model that I use for this ETL application:

The ETL process refreshes table data for the Store_Sales fact table along with the Customer_Address and Item dimensions for a particular dataset date.

Setting up the ETL workflow using Step Functions

Step Functions makes complicated workflows more straightforward. You can set up dependency management and failure handling using a JSON-based template. Workflows are simply a series of steps, with the output of one step acting as input to the next.

This example completes various dimensional table transforms and loads before triggering the Fact table load. Also, a workflow can branch out into multiple parallel steps whenever needed. You can monitor each step of execution as it happens, which means you can identify and fix problems quickly.

This illustration outlines the example ETL process set up through Step Functions:

In the above workflow, the ETL process checks the DB connection in Step 1 and then triggers the Customer_Address (Step 2.1) and Item_dimension (Step 2.2) steps, which execute in parallel. The Store_Sales fact table load (Step 3) waits for both dimension steps to complete. Each ETL step is autonomous, allowing you to monitor and respond to failures at any stage.
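
The shape of that workflow can be written in the Amazon States Language. The sketch below builds the definition as a Python dict and registers it with boto3; it assumes one Lambda function per step and uses placeholder ARNs, so treat it as an outline of the structure rather than the exact template used by the post's CloudFormation stack.

```python
# Hypothetical outline of the state machine: DB check, parallel dimension loads, then the fact load.
import json
import boto3

LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:{}"  # placeholder account/region

definition = {
    "StartAt": "ValidateDBConnection",
    "States": {
        "ValidateDBConnection": {                  # Step 1: check the Amazon Redshift connection
            "Type": "Task",
            "Resource": LAMBDA_ARN.format("etl-validate-db-connection"),
            "Next": "RefreshDimensions",
        },
        "RefreshDimensions": {                     # Steps 2.1 and 2.2 run in parallel
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "CustomerAddress",
                 "States": {"CustomerAddress": {"Type": "Task",
                                                "Resource": LAMBDA_ARN.format("etl-customer-address"),
                                                "End": True}}},
                {"StartAt": "Item",
                 "States": {"Item": {"Type": "Task",
                                     "Resource": LAMBDA_ARN.format("etl-item"),
                                     "End": True}}},
            ],
            "Next": "StoreSales",
        },
        "StoreSales": {                            # Step 3: the fact table load waits for both branches
            "Type": "Task",
            "Resource": LAMBDA_ARN.format("etl-store-sales"),
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-tpcds-workflow",                     # assumed name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/etl-step-functions-role",  # placeholder role
)
```

Because the Parallel state moves on to StoreSales only after every branch ends, the fact table load cannot start before the dimension refreshes finish, which is exactly the dependency the diagram expresses.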

I now examine the Store_Sales step (Step 3) in detail. Other steps follow a similar pattern of implementation.
