Step-by-Step Process for Building a Data Lake on AWS with Terraform
Step 1: Launch an EC2 Instance
- Go to the AWS Console and launch an EC2 instance (this guide uses Amazon Linux, which the install commands in Step 2 assume).
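If you prefer the command line, an instance can also be launched with the AWS CLI; a minimal sketch (the AMI ID, key pair name, and instance type below are placeholders to replace with your own values):

# Placeholder values: replace the AMI ID and key pair name with your own
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t3.micro \
  --key-name my-key-pair \
  --count 1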
Step 2: Install Terraform
Now install Terraform on the EC2 instance using the following commands (these are for Amazon Linux):
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
sudo yum -y install terraform
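You can confirm the installation succeeded by checking the version:

terraform -version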
Step 3: Create a file for Terraform Configuration
Now create a file with a .tf extension (for example, main.tf). Inside this file we define the Terraform configuration.
Provider Configuration: In this block we provide the provider details, such as the AWS region.
provider "aws" {
region = "eu-north-1" # modify according to your region
}
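Optionally, you can pin the Terraform and AWS provider versions so the configuration behaves the same wherever it runs; a minimal sketch (the version constraints shown are illustrative, not requirements from this guide):

terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.0" # illustrative constraint; adjust to the provider version you use
    }
  }
}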
#S3 Bucket Configuration
Creates an S3 bucket named my-unique-data-lake-bucket-name. S3 bucket names are globally unique, so replace this with a name of your own.
resource "aws_s3_bucket" "data_lake_bucket" {
bucket = "my-unique-data-lake-bucket-name"
force_destroy = true
versioning {
enabled = true
}
server_side_encryption_configuration {
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
tags = {
Name = "DataLakeBucket"
}
}
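Note that newer versions of the AWS provider (v4 and later) deprecate the inline versioning and server_side_encryption_configuration blocks shown above in favor of standalone resources. If you are on a newer provider, a roughly equivalent sketch looks like this (the resource labels are illustrative, and the inline blocks would then be dropped from the bucket resource):

# Versioning managed as a standalone resource (AWS provider v4+ style)
resource "aws_s3_bucket_versioning" "data_lake_bucket_versioning" {
  bucket = aws_s3_bucket.data_lake_bucket.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Default encryption managed as a standalone resource (AWS provider v4+ style)
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake_bucket_sse" {
  bucket = aws_s3_bucket.data_lake_bucket.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}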
resource "aws_s3_bucket_public_access_block" "public_access_block" {
bucket = aws_s3_bucket.data_lake_bucket.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
#AWS Glue Catalog Database
Creates an AWS Glue Catalog Database named data_lake_db.
resource "aws_glue_catalog_database" "data_lake_db" {
name = "data_lake_db"
}
#AWS Glue Crawler
Creates an AWS Glue Crawler to crawl data in the S3 bucket and update the Glue Catalog Database.
resource "aws_glue_crawler" "data_lake_crawler" {
name = "data_lake_crawler"
database_name = aws_glue_catalog_database.data_lake_db.name
role = aws_iam_role.glue_service_role.arn
s3_target {
path = "s3://${aws_s3_bucket.data_lake_bucket.bucket}/"
}
}
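The crawler above has no schedule argument, so it will not run automatically. After the resources are created, start it from the Glue console or with the AWS CLI, for example:

aws glue start-crawler --name data_lake_crawler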
#IAM Role for AWS Glue
Creates an IAM role that AWS Glue can assume and attaches the AWS-managed AWSGlueServiceRole policy to it.
resource "aws_iam_role" "glue_service_role" {
name = "AWSGlueServiceRole"
assume_role_policy = jsonencode({
"Version": "2012-10-17",
"Statement": [{
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
}
}]
})
}
resource "aws_iam_policy_attachment" "glue_service_role_policy" {
name = "glue-service-role-policy"
roles = [aws_iam_role.glue_service_role.name]
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
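As a side note, aws_iam_policy_attachment manages all attachments of a policy exclusively across the account, so many configurations prefer the narrower aws_iam_role_policy_attachment resource instead; a minimal alternative sketch:

# Attach the managed Glue policy to just this role (alternative to aws_iam_policy_attachment)
resource "aws_iam_role_policy_attachment" "glue_service_role_policy" {
  role       = aws_iam_role.glue_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}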
Step 4: Applying the Configuration
- To create these resources, run the following Terraform commands:
terraform init
terraform validate
terraform apply --auto-approve
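(Optional) If you want to review the planned changes before applying them, you can also run:

terraform plan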
- Here we see that terraform apply completed and a total of 6 resources were created.
Step 5: Resources Created
- Amazon S3 bucket
- AWS Glue Crawler
- AWS Glue Catalog Database
Step 6: Execute Terraform Destroy Command
To delete the created resources, execute the terraform destroy command. Because force_destroy = true is set on the bucket, Terraform can delete it even if it still contains objects:
terraform destroy --auto-approve
Here we see that a total of 6 resources were destroyed successfully.
Building a Data Lake on AWS with Terraform
In today's digital age, data has become a strategic asset that drives business decision-making, innovation, and competitive advantage. Organizations now collect vast amounts of data from many different sources, so managing, storing, and analyzing that information efficiently has become a significant challenge. That is where a data lake comes into play.
A data lake is a centralized repository for storing structured and unstructured data at any scale. Unlike traditional data warehouses, data lakes generally do not require you to define the data's structure up front. This flexibility to store raw data of any type and format opens up immense opportunities for diverse data analytics, machine learning, and real-time data processing.
In this guide, we walk through the process of building a data lake on AWS using Terraform. We cover the critical concepts, define the major terminology, and take you step by step through designing and building a scalable, maintainable data lake solution. Whether you are a data engineer, cloud architect, or IT professional, this guide will give you the knowledge and tools to harness data lakes on AWS.