AWS Lake Formation – Data Lake (Setup)

Overview: Build a Data Lake with AWS (Beginner)

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository that stores all your structured and unstructured data, at any scale. You can store your data as-is, without having to structure it first, and you can run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to better guide decision-making.

The challenges of data lakes

The main challenge of data lake administration stems from storing raw data without oversight of its contents. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation lets you control access to the data lake and audit who accesses the data. The AWS Glue Data Catalog integrates data access policies, ensuring compliance regardless of where the data originates.

Set up the S3 bucket and upload the dataset.

Then set up the data lake with AWS Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself as a data lake administrator so that you can grant access to any Lake Formation resource.

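If you prefer to script this step, here is a minimal boto3 sketch of adding a data lake administrator; the account ID and user name in the ARN are placeholders:

```python
import boto3

# Hypothetical principal ARN; replace with your own IAM user or role.
ADMIN_ARN = "arn:aws:iam::123456789012:user/your-user"

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Read the current settings, add the new administrator, and write them back.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings.setdefault("DataLakeAdmins", []).append(
    {"DataLakePrincipalIdentifier": ADMIN_ARN}
)
lf.put_data_lake_settings(DataLakeSettings=settings)
```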

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.

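The same registration can be done from code. A minimal boto3 sketch, assuming the bucket used later in this guide and the Lake Formation service-linked role:

```python
import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Register the bucket (or a prefix inside it) as a data lake location.
# UseServiceLinkedRole lets AWSServiceRoleForLakeFormationDataAccess read the path.
lf.register_resource(
    ResourceArn="arn:aws:s3:::datalake-hiennu-ap-northeast-1",
    UseServiceLinkedRole=True,
)
```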

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to hold the datasetsample00 table definitions (a scripted sketch follows the list).

  • For Database, enter datasetsample00-db.
  • For Location, enter your S3 path, for example s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • For New tables in this database, do not select Grant All to Everyone.
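Here is a minimal boto3 sketch of the same database creation; it assumes the bucket name used elsewhere in this guide:

```python
import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")

# Create the Data Catalog database pointed at the dataset's S3 location.
glue.create_database(
    DatabaseInput={
        "Name": "datasetsample00-db",
        "LocationUri": "s3://datalake-hiennu-ap-northeast-1/datasetsample00",
    }
)
```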

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the datasetsample00-db database. For IAM users and roles, select your user and the AWSGlueServiceRoleDefault role.

Then grant your user and AWSServiceRoleForLakeFormationDataAccess permission to use your data lake through a data location (a scripted sketch follows the list):

  • For IAM users and roles, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-hiennu-ap-northeast-1.
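A minimal boto3 sketch of the same data location grant; the account ID and user name are placeholders, and you would repeat the call for the other principals named above:

```python
import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Grant data location access on the registered S3 path to a principal.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/your-user"
    },
    Resource={
        "DataLocation": {
            "ResourceArn": "arn:aws:s3:::datalake-hiennu-ap-northeast-1"
        }
    },
    Permissions=["DATA_LOCATION_ACCESS"],
)
```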

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings (a scripted sketch follows the list):

  • Crawler name: samplecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select datasetsample00-db.
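Here is a minimal boto3 sketch that creates and runs the same crawler with the settings listed above:

```python
import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")

# Create a crawler that scans the dataset path and writes table definitions
# into the datasetsample00-db database.
glue.create_crawler(
    Name="samplecrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="datasetsample00-db",
    Targets={
        "S3Targets": [
            {"Path": "s3://datalake-hiennu-ap-northeast-1/datasetsample00"}
        ]
    },
)

# Run the crawler once, on demand.
glue.start_crawler(Name="samplecrawler")
```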

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM users and roles, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.
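The same grant can be scripted with boto3; the account ID in the role ARN is a placeholder:

```python
import boto3

lf = boto3.client("lakeformation", region_name="ap-northeast-1")

# Grant full table permissions on the crawled table to the Glue role
# (repeat the call for your IAM user as needed).
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"
    },
    Resource={
        "Table": {
            "DatabaseName": "datasetsample00-db",
            "Name": "datasetsample00",
        }
    },
    Permissions=["ALL"],
)
```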

Step 7: Query the data with Athena

Query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the datasetsample00-db database.
  • Choose Tables and select the datasetsample00 table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query: SELECT * FROM datasetsample00 limit 10;
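If you want to run the same preview query programmatically, here is a minimal boto3 sketch; the query results location is a hypothetical prefix in the data lake bucket:

```python
import time

import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")

# Run the same preview query the console issues.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM datasetsample00 limit 10;",
    QueryExecutionContext={"Database": "datasetsample00-db"},
    ResultConfiguration={
        "OutputLocation": "s3://datalake-hiennu-ap-northeast-1/athena-results/"
    },
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the returned rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```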

AWS – Batch Process and Data Analytics

Overview
(AWS – Batch Process and Data Analytics)

Today, I will describe our current reference architecture for batch processing and data analytics in our sales report system. Our mission is to build a big data analytics system that integrates machine learning features for insights and prediction. Many AI companies are researching and applying machine learning to improve their services. We are also researching and applying machine learning features to our system, such as NLP, forecasting, and OCR, which gives us the opportunity to provide better service to our customers.

Reference architecture for batch data processing
  1. Data Source
    We have various sources from multiple systems, both on-premises and on the cloud, with large datasets and unpredictable update frequencies.
  2. Data lake storage
    We use S3 as our data lake storage; it accepts unlimited data types and volumes, which makes the system easy to scale.
  3. Machine Learning
    We focus on machine learning to build AI solutions from our dataset. Machine learning models predict insights and integrate directly into our system as microservices.
  4. Compute
    Compute is the most important part of our system. We choose the services that fit best, infrastructure or serverless, to optimize cost and performance.
    – We use AWS Lambda functions for small jobs such as calling AI services, processing small datasets, and integration.
    – We use AWS Glue ETL to build ETL pipelines, and AWS Step Functions to build custom pipelines.
    – We also provide web and API services for end users, built with a microservices architecture and hosted on AWS Fargate.
  5. Report datastore
    After processing the data and exporting insight data from predictions, we store the results in DynamoDB and RDS for visualization and built-in AWS insight features.
  6. Data analytics with visualization and insights
    We use Amazon QuickSight and its insight features for data analytics.

Each dashboard has its own data processing pipeline to process and prepare data for visualization and insights. Here is our standard pipeline, orchestrated with AWS Step Functions. We want to keep everything as basic as possible: we use scheduled events (CloudWatch Events) to trigger the pipeline and Lambda functions for the processing steps (a minimal sketch of one Lambda step follows the pipeline notes below).
– Advantages: serverless, easy to develop and scale.

– Disadvantages: heavy dependence on AWS services, and challenges when you exceed AWS service limits, such as the Lambda function execution time limit.
– Most common use case: batch data processing pipelines.

Data processing pipeline
  • Uses serverless compute for data processing and a microservices architecture
  • Easy to develop, deploy, and scale the system without modification
  • Flexibility to use built-in services or build custom machine learning models with AWS SageMaker
  • Runs separately from the data source systems
  • Efficient data lake processing
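As mentioned above, here is a minimal sketch of what one Lambda step in such a Step Functions pipeline might look like; the bucket/key event fields and the sales-report DynamoDB table are hypothetical:

```python
import json

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")


def handler(event, context):
    """One step of the batch pipeline: read a processed report object from S3
    and write its records to the report datastore (a hypothetical DynamoDB table)."""
    bucket = event["bucket"]   # passed in by the previous Step Functions state
    key = event["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    records = json.loads(body)

    table = dynamodb.Table("sales-report")
    with table.batch_writer() as writer:
        for record in records:
            writer.put_item(Item=record)

    # The return value becomes the input of the next state in the state machine.
    return {"bucket": bucket, "key": key, "count": len(records)}
```

Keeping each step this small also makes it easier to stay under the Lambda execution time limit mentioned above.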

This architecture is basic, and I hope you can take something away from it. We have focused on describing our current system; it will grow beyond this design with more features in the future, so we should choose a flexible solution that is easy to maintain and replace. When you choose any architecture or service to solve your problems, consider which service best fits your current situation; no architecture or service can solve every problem, because everything depends on the specific problem. And avoid looking for problems to fit a solution 😀

— — — — — — — — — — — — — — — —

Machine Learning with Amazon Rekognition

What is Amazon Rekognition?

Amazon Rekognition is a service that makes it easy to add powerful image- and video-based visual analysis to your applications.

Rekognition Image lets you easily build powerful applications to search, verify, and organize millions of images.

Rekognition Video lets you extract motion-based context from stored or live-streamed videos and helps you analyze them.

You just provide an image or video to the Rekognition API, and the service can identify objects, people, text, scenes, and activities. It can also detect inappropriate content.

Amazon Rekognition also provides highly accurate facial analysis and facial recognition. You can detect, analyze, and compare faces for a wide variety of use cases, including user verification, cataloging, people counting, and public safety.

Amazon Rekognition is a HIPAA eligible service.

You need to ensure that the Amazon S3 bucket you want to use is in the same Region as your Amazon Rekognition API endpoint.

How does it work?

Amazon Rekognition provides two API sets: Amazon Rekognition Image, for analyzing images, and Amazon Rekognition Video, for analyzing videos.

Both API sets perform detection and recognition analysis of images and videos.
Amazon Rekognition Video can be used to track the path of people in a stored video.
Amazon Rekognition Video can also search a streaming video for persons whose facial descriptions match facial descriptions already stored by Amazon Rekognition.

The RecognizeCelebrities API returns information for up to 100 celebrities detected in an image.
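To illustrate the Image API, here is a minimal boto3 sketch of label detection on an image stored in S3; the bucket and object key are placeholders:

```python
import boto3

rekognition = boto3.client("rekognition", region_name="ap-northeast-1")

# Detect up to 10 labels (objects, scenes, activities) in an S3-hosted image.
response = rekognition.detect_labels(
    Image={
        "S3Object": {
            "Bucket": "my-image-bucket",
            "Name": "photos/sample.jpg",
        }
    },
    MaxLabels=10,
    MinConfidence=80,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```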

Use cases for Amazon Rekognition

Searchable image and video libraries

Amazon Rekognition makes images and stored videos searchable.

Face-based user verification

It can be used in building access control or similar applications; it compares a live image to a reference image.
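A minimal boto3 sketch of such a comparison using the CompareFaces API; the bucket and image names are placeholders:

```python
import boto3

rekognition = boto3.client("rekognition")

# Compare a live capture against a stored reference image; only faces with a
# similarity above the threshold are returned as matches.
response = rekognition.compare_faces(
    SourceImage={"S3Object": {"Bucket": "my-image-bucket", "Name": "reference.jpg"}},
    TargetImage={"S3Object": {"Bucket": "my-image-bucket", "Name": "live-capture.jpg"}},
    SimilarityThreshold=90,
)

for match in response["FaceMatches"]:
    print(f"Similarity: {match['Similarity']:.1f}%")
```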

Sentiment and demographic analysis

Amazon Rekognition detects emotions such as happiness, sadness, or surprise, and demographic information such as gender, from facial images.

Rekognition can analyze images and send the emotion and demographic attributes to Amazon Redshift for periodic reporting on trends, for example across in-store locations and similar scenarios.

Facial recognition

Images, stored videos, and streaming videos can be searched for faces that match those in a face collection. A face collection is an index of faces that you own and manage.

Unsafe Content Detection

Amazon Rekognition can detect explicit and suggestive adult content in images and videos.

Examples include social and dating sites, photo-sharing platforms, blogs and forums, apps for children, e-commerce sites, entertainment, and online advertising services.

Celebrity recognition

Amazon Rekognition can recognize thousands of celebrities (in politics, sports, business, entertainment, and media) in supplied images and videos.

Text detection

Detecting text in an image allows you to extract textual content from images.

Benefits

Integrate powerful image and video recognition into your apps

Amazon Rekognition removes the complexity of building image recognition capabilities into applications by making powerful and accurate analysis available through a simple API.

Deep learning-based image and video analysis

Rekognition uses deep learning technology to accurately analyze images, find and compare faces in images, and detect objects and scenes within images and videos.

Scalable image analysis

Amazon Rekognition enables the analysis of millions of images.
This allows you to curate and organize massive amounts of visual data.

Low cost

Clients pay only for the images and videos they analyze and the face metadata they store. There are no minimum fees or upfront commitments.

Machine Learning development with AWS SageMaker

Make your machine learning team's work easier, focus more on the business, and deploy quickly with the AWS managed service SageMaker.

Today, machine learning (ML) is solving complex problems that create business value for customers, and many companies apply ML to solve hard business problems. ML brings many benefits, but also many challenges in building models with high accuracy. I currently work on an AI team, helping the company deliver AI/ML projects quickly and helping the Data Scientist (DS) team develop data pipelines and machine learning pipelines, so that projects grow and are delivered quickly with high quality.

Overview of Machine Learning development

Figure 1. Machine learning process

Here is the basic machine learning process, which reflects the practices of big companies. It includes multiple phases (business analysis, data processing, model training, and deployment), multiple steps in each phase, and a fleet of tools that we use for the dedicated steps.

Business problems: the problems that challenge the business and that we can solve better with ML.
ML problem framing: the phase that helps DS and engineering define the ML problem, propose ML solutions, design the data pipeline, and plan.
Data processing (collection, integration, preparation and cleaning, visualization and analysis): this phase includes the steps that prepare data for visualization and ML training.
Model training (feature engineering, model training and parameter tuning, model evaluation): DS and developers work in this phase to engineer features, prepare data for the specific model, and train the model using frameworks such as TensorFlow or PyTorch.

When we don't use a platform such as AWS SageMaker or Azure ML Studio, development takes more time and requires a complex stack of skills: compute, networking, storage, ML frameworks, programming languages, feature engineering, and more.

Figure 2. Machine learning stack

When we develop an ML model with a lack of skills and many complex components, we spend more time on programming and compute tasks, and the engineering team that uses and deploys the model faces extra challenges. In Figure 2, we have multiple layers of cloud computing (Infrastructure as a Service, Platform as a Service, Software as a Service) that provide resources according to business needs and the level of control required. We can choose a specific layer, or combine layers, to meet the business objective. For machine learning projects and research environments, I highly recommend that DS and developers use PaaS and SaaS first to meet business requirements, deliver quickly, and reduce cost and effort. That is the main reason I want to describe AWS SageMaker, a service an ML team can use as a standard platform to quickly develop and deploy ML models, focus on solving ML business problems, and improve model quality.

AWS SageMaker offers the basis of an end-to-end machine learning development environment

Figure 3. AWS SageMaker benefits

When we develop an ML model, we need to take care of multiple parts, and sometimes we want to try a new model and get accuracy feedback as quickly as possible. That also depends on questions such as "Is there enough data for processing and training?" and "How much time will training a new model take?". AWS SageMaker is used by thousands of AI companies and was built by experts following ML best practices, which helps improve the ML process and working environment. In my opinion, when I want to focus on building a model and solving the hard problem, I want everything else to stay simple so I can spend my time on the main problem first. SageMaker provides a notebook solution, and the notebook is a great space for coding and analyzing data for training. Together with the SageMaker SDK, I can easily connect to and use other AWS resources such as S3 buckets and training jobs. All of this helps me quickly develop and deliver a new model. Below I highlight the main benefits we got from this service, along with its disadvantages.

Advantages
*💰 Cost-effective:

– SageMaker provides training jobs that are distributed, elastic, and high-performance, and that can use spot instances to save up to 90% of the cost; you pay only for the training time, in seconds (see the sketch after this list). (document)
– Elastic Inference: this feature helps save cost for compute that needs a GPU to process deep learning models, for example at prediction time. (document)
* 🎯 Reduces the skills gap and lets you focus on solving business problems. We can easily set up a training environment from a notebook with a few clicks, with elastic CPUs/GPUs.
* 🌐 Connectivity and easy deployment
– AWS SageMaker is an AWS managed service and is easy to integrate with other AWS services inside a private network. This also benefits big data solutions: ETL processing can run inside the private network, which reduces transfer costs.
– The endpoint feature helps DS/developers deploy a trained model with a few clicks or from the SDK. (document)
* 🏢 Easy to manage: when multiple teams work on AWS, more resources appear every day, and the IT team faces challenges managing resources and roles, which affects cost and security. An AWS managed service reduces the number of resources we need to create.
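As referenced in the cost bullet above, here is a minimal SageMaker Python SDK sketch of a managed spot training job followed by a one-line deployment; the execution role, S3 paths, and train.py script are assumptions, not part of the original setup:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Placeholder execution role; train.py is your own training script.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

estimator = PyTorch(
    entry_point="train.py",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # Managed spot training: pay only for the seconds actually used.
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,                       # must be >= max_run for spot jobs
    checkpoint_s3_uri="s3://my-ml-bucket/checkpoints/",
    sagemaker_session=sagemaker.Session(),
)

# Train against data staged in S3, then deploy the model behind an endpoint.
estimator.fit({"training": "s3://my-ml-bucket/train/"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

Calling estimator.fit() launches a separate training job on the chosen instance type, so the notebook itself can stay small and cheap.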

Disadvantages
* AWS SageMaker is a managed service; it implements best practices and focuses on popular frameworks. Sometimes it will not match your requirements, so consider this before choosing it.
* 🎿 Learning new skills and basic knowledge of the AWS Cloud: when working on the AWS cloud, basic knowledge of cloud infrastructure is necessary, plus knowledge of whichever AWS managed service you want to use.
* 👮 It is also more expensive than a plain EC2 instance because it is dedicated to ML, so we need to choose the right resources for development to save cost.

AWS SageMaker is well suited to the production environment. It helps us build a quality model and a standard environment, which reduces risk in product development. We accept the trade-offs to get most of the benefits and quickly achieve the team's goals. Thank you so much for reading; please let me know if you have any concerns.

References
https://developers.google.com/machine-learning/problem-framing

https://aws.amazon.com/sagemaker/?nc1=h_ls
https://azure.microsoft.com/en-in/overview/what-is-iaas/
https://azure.microsoft.com/en-in/overview/what-is-paas/
https://azure.microsoft.com/en-in/overview/what-is-saas/