AWS Lake Formation – Data Lake (Setup)

Overview: Build a Data Lake with AWS (Beginner)

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository that stores all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics to better guide decision-making, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation lets you control access to the data lake and audit who accesses the data. The AWS Glue Data Catalog integrates with these access policies, enforcing compliance regardless of where the data originates.

Set up the S3 bucket and upload the dataset.

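If you prefer to script this step, here is a minimal boto3 sketch. The bucket name matches the one used later in this post (and implies the ap-northeast-1 region); the local file name datasetsample00.csv is a placeholder.

import boto3

s3 = boto3.client("s3", region_name="ap-northeast-1")

# Create the bucket (LocationConstraint is required outside us-east-1).
s3.create_bucket(
    Bucket="datalake-hiennu-ap-northeast-1",
    CreateBucketConfiguration={"LocationConstraint": "ap-northeast-1"},
)

# Upload the dataset under the datasetsample00/ prefix.
s3.upload_file(
    "datasetsample00.csv",  # placeholder local file name
    "datalake-hiennu-ap-northeast-1",
    "datasetsample00/datasetsample00.csv",
)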

Set up the data lake with AWS Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself as a data lake administrator so that you can access any Lake Formation resource.

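If you script the setup instead, this step is one API call. A minimal boto3 sketch with a placeholder account ID and user name; note that put_data_lake_settings replaces the entire settings object, so in real use read the current settings first and merge.

import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Make this IAM user a data lake administrator.
# Caution: this call overwrites any existing data lake settings.
lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/your-user"}  # placeholder ARN
        ]
    }
)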

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.

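Scripted, the registration is a single call. A sketch using the bucket from this post; UseServiceLinkedRole=True registers the path with AWSServiceRoleForLakeFormationDataAccess.

import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Register the bucket so Lake Formation can vend credentials for it.
lf.register_resource(
    ResourceArn="arn:aws:s3:::datalake-hiennu-ap-northeast-1",
    UseServiceLinkedRole=True,
)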

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the datasetsample00 table definitions; a scripted equivalent follows the list.

  • For Database, enter datasetsample00-db.
  • For Location, enter your S3 bucket path, e.g. s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • For New tables in this database, do not select Grant All to Everyone.
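
The scripted equivalent is a single Glue API call. A sketch with the same database name and location as above:

import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")

# Create the Data Catalog database pointing at the dataset prefix.
glue.create_database(
    DatabaseInput={
        "Name": "datasetsample00-db",
        "LocationUri": "s3://datalake-hiennu-ap-northeast-1/datasetsample00",
    }
)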

Step 4: Grant permissions

Next, grant AWS Glue permission to use the datasetsample00-db database. For IAM users and roles, select your user and AWSGlueServiceRoleDefault.

Then grant your user and AWSServiceRoleForLakeFormationDataAccess permission to use your data lake through a data location; a scripted sketch follows the list:

  • For IAM users and roles, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-hiennu-ap-northeast-1.
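
Both grants map to grant_permissions calls. A sketch with a placeholder account ID; the database permissions shown (CREATE_TABLE, ALTER, DROP) are a reasonable set for a crawler and are my assumption, not values from the console walkthrough.

import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")
glue_role = {  # placeholder account ID in the ARN
    "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"
}

# Database permissions so the crawler can create and update tables.
lf.grant_permissions(
    Principal=glue_role,
    Resource={"Database": {"Name": "datasetsample00-db"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP"],
)

# Data location permission on the registered S3 path.
lf.grant_permissions(
    Principal=glue_role,
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::datalake-hiennu-ap-northeast-1"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)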

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings; a scripted equivalent follows the list:

  • Crawler name: samplecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select datasetsample00-db.
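
The same crawler can be created through the Glue API. A sketch; leaving out the Schedule parameter makes the crawler run on demand:

import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")

# Create the crawler with the settings listed above.
glue.create_crawler(
    Name="samplecrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="datasetsample00-db",
    Targets={"S3Targets": [{"Path": "s3://datalake-hiennu-ap-northeast-1/datasetsample00"}]},
)

# No Schedule was given, so start the crawl manually (run on demand).
glue.start_crawler(Name="samplecrawler")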

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database; an API sketch follows the steps.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM users and roles, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.
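
In the API, the console's Select all corresponds to granting the full set of table permissions. A sketch with a placeholder user ARN:

import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Grant all table permissions on datasetsample00 to your user.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/your-user"},  # placeholder ARN
    Resource={"Table": {"DatabaseName": "datasetsample00-db", "Name": "datasetsample00"}},
    Permissions=["SELECT", "INSERT", "DELETE", "DESCRIBE", "ALTER", "DROP"],
)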

Step 7: Query the data with Athena

Query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the datasetsample00-db database.
  • Choose Tables and select the datasetsample00 table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query: SELECT * FROM datasetsample00 limit 10;
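
The same preview can be run programmatically. A minimal boto3 sketch; the athena-results/ output prefix is an assumption, not part of the walkthrough.

import time
import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")

# Submit the preview query against the data lake database.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM datasetsample00 LIMIT 10;",
    QueryExecutionContext={"Database": "datasetsample00-db"},
    ResultConfiguration={"OutputLocation": "s3://datalake-hiennu-ap-northeast-1/athena-results/"},  # assumed prefix
)
qid = resp["QueryExecutionId"]

# Poll until the query finishes, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])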

AWS – Batch Process and Data Analytics

Overview

Today, I will describe our current reference architecture for batch processing and data analytics in our sales report system. Our mission is to build a big data analytics system that incorporates machine learning features for insights and prediction. Many AI companies today are researching and applying machine learning to their systems to improve their services. We are also researching and applying machine learning features to our system, such as NLP, Forecast, and OCR, which give us the opportunity to provide better service to our customers.

Reference Architecture for Batch Data Processing
  1. Data Source
    We ingest data from multiple systems, both on-premises and in the cloud, with large datasets and unpredictable update frequencies.
  2. Data lake storage
    We use S3 as our data lake storage because it supports unlimited data types and volumes, which makes the system easy to scale.
  3. Machine Learning
    We focus on machine learning to build AI solutions on top of our dataset. Machine learning models generate predictions for insights and integrate directly into our system as microservices.
  4. Compute
    Compute services are the most important part of our system. We choose the best-fit services, infrastructure or serverless, to optimize cost and performance.
    – We use AWS Lambda functions for small jobs such as calling AI services, processing small datasets, and integration.
    – We use AWS Glue ETL to build ETL pipelines and AWS Step Functions to build custom pipelines.
    – We also provide web and API services for end users; that system is built on a microservices architecture and hosted on AWS Fargate.
  5. Report datastore
    After processing the data and exporting insights from predictions, we store the results in DynamoDB and RDS for visualization and built-in AWS insight features.
  6. Data analytics with visualization and insights
    We use Amazon QuickSight and its Insights feature for data analytics.

Each dashboard has its own data processing pipeline to process and prepare data for visualization and insights. Below is our standard pipeline, orchestrated with AWS Step Functions; a boto3 sketch follows the list. We want to keep everything as basic as possible, so we use Amazon EventBridge (CloudWatch Events) rules for scheduling and Lambda functions for the processing steps.
– Advantages: Serverless, easy to develop and scale.

– Disadvantages: Heavy dependence on AWS services, and challenges when a job exceeds a service limit, such as the maximum Lambda function execution time.
– Most common use case: batch data processing pipelines.
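
To make the pipeline concrete, here is a minimal boto3 sketch that creates a three-step Lambda state machine and schedules it with an EventBridge rule. The function names, role names, and account ID are placeholders, not values from our actual system.

import json
import boto3

REGION = "ap-northeast-1"
ACCOUNT = "123456789012"  # placeholder account ID

sfn = boto3.client("stepfunctions", region_name=REGION)

# Minimal state machine: three Lambda tasks chained extract -> transform -> load.
definition = {
    "Comment": "Batch data processing pipeline (illustrative)",
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:extract",  # placeholder function
            "Next": "Transform",
        },
        "Transform": {
            "Type": "Task",
            "Resource": f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:transform",  # placeholder function
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:load",  # placeholder function
            "End": True,
        },
    },
}

machine = sfn.create_state_machine(
    name="batch-report-pipeline",
    definition=json.dumps(definition),
    roleArn=f"arn:aws:iam::{ACCOUNT}:role/StepFunctionsExecutionRole",  # placeholder role
)

# Schedule a nightly run with an EventBridge rule (17:00 UTC daily).
events = boto3.client("events", region_name=REGION)
events.put_rule(Name="nightly-batch", ScheduleExpression="cron(0 17 * * ? *)")
events.put_targets(
    Rule="nightly-batch",
    Targets=[{
        "Id": "pipeline",
        "Arn": machine["stateMachineArn"],
        "RoleArn": f"arn:aws:iam::{ACCOUNT}:role/EventBridgeInvokeStepFunctions",  # placeholder role
    }],
)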

Data processing pipeline
  • Uses serverless compute for data processing and a microservices architecture
  • Easy to develop, deploy, and scale without modification
  • Flexible: use built-in services or build custom machine learning models with AWS SageMaker
  • Decoupled from the data source systems at run time
  • Efficient data lake processing

This architecture is basic, and I hope you can take something away from it. We have focused on describing our current system; it will grow beyond this design with more features in the future, so we chose a flexible solution that is easy to maintain and replace. When you choose an architecture or service to solve your problems, consider what best fits your current situation; no single architecture or service can solve every problem, because everything depends on the specific problem. And avoid hunting for problems to fit a solution 😀

— — — — — — — — — — — — — — — —