AWS Lake Formation – Data Lake (Setup)

Overview: Build a Data Lake with AWS (Beginner)

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository that stores all your structured and unstructured data at any scale. You can store your data as-is, without first having to structure it, and you can run different types of analytics on it to guide decision-making, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation allows you to control access to the data lake and to audit who accesses the data. The AWS Glue Data Catalog integrates data access policies, which helps ensure compliance regardless of the data's origin.

Set up the S3 bucket and upload the dataset.
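
The bucket setup can also be scripted. The following is a minimal sketch, assuming boto3 is installed and that the bucket lives in ap-northeast-1 (inferred from the bucket name used later in this walkthrough). The helper names and the local file name are hypothetical.

```python
def bucket_request(bucket: str, region: str) -> dict:
    """Build the keyword arguments for the S3 create_bucket call."""
    params = {"Bucket": bucket}
    # us-east-1 is the default region and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def upload_dataset(s3_client, bucket: str, region: str, local_path: str, key: str) -> None:
    """Create the bucket, then upload one local file under the given key."""
    s3_client.create_bucket(**bucket_request(bucket, region))
    s3_client.upload_file(local_path, bucket, key)
```

A call might look like upload_dataset(boto3.client("s3", region_name="ap-northeast-1"), "datalake-hiennu-ap-northeast-1", "ap-northeast-1", "sample.csv", "datasetsample00/sample.csv"), where sample.csv stands in for your own dataset file.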

Set up the data lake with AWS Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself a data lake administrator to allow access to any Lake Formation resource.
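
For readers who prefer the API to the console, here is a minimal sketch of the same step. It assumes a boto3 Lake Formation client is passed in; the function name and the example ARN are hypothetical. It reads the current settings first so that existing administrators are kept.

```python
def add_data_lake_admin(lf_client, principal_arn: str) -> dict:
    """Add one IAM principal to the data lake administrators, keeping existing ones."""
    settings = lf_client.get_data_lake_settings()["DataLakeSettings"]
    admins = settings.get("DataLakeAdmins", [])
    already_admin = any(
        admin["DataLakePrincipalIdentifier"] == principal_arn for admin in admins
    )
    if not already_admin:
        admins.append({"DataLakePrincipalIdentifier": principal_arn})
    settings["DataLakeAdmins"] = admins
    # put_data_lake_settings replaces the whole settings object, so send it back in full.
    lf_client.put_data_lake_settings(DataLakeSettings=settings)
    return settings
```

It could be called as add_data_lake_admin(boto3.client("lakeformation"), "arn:aws:iam::123456789012:user/your-user"), with the placeholder account ID and user name replaced by your own.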

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.
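
Registration can be sketched in code as follows, assuming a boto3 Lake Formation client and registration through the Lake Formation service-linked role. The helper names are hypothetical.

```python
def s3_arn(bucket: str, prefix: str = "") -> str:
    """Build the ARN for a bucket, or for a prefix inside it."""
    prefix = prefix.strip("/")
    if prefix:
        return f"arn:aws:s3:::{bucket}/{prefix}"
    return f"arn:aws:s3:::{bucket}"


def register_location(lf_client, bucket: str, prefix: str = "") -> str:
    """Register the S3 location with Lake Formation and return its ARN."""
    arn = s3_arn(bucket, prefix)
    # UseServiceLinkedRole=True asks Lake Formation to access the path
    # through its service-linked role.
    lf_client.register_resource(ResourceArn=arn, UseServiceLinkedRole=True)
    return arn
```

For the bucket used in this walkthrough, register_location(boto3.client("lakeformation"), "datalake-hiennu-ap-northeast-1") would register the whole bucket.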

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the datasetsample00 table definitions.

  • For Database, enter datasetsample00-db.
  • For Location, enter your S3 bucket path followed by /datasetsample00 (for example, s3://datalake-hiennu-ap-northeast-1/datasetsample00).
  • For New tables in this database, do not select Grant All to Everyone.
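
The same database can be created through the AWS Glue API. This is a sketch under the assumption that leaving CreateTableDefaultPermissions empty corresponds to not selecting Grant All to Everyone in the console; the function names are hypothetical.

```python
def database_input(name: str, location_uri: str) -> dict:
    """Build the DatabaseInput structure for the Glue create_database call."""
    return {
        "Name": name,
        "LocationUri": location_uri,
        # Assumption: an empty list means new tables get no default grant,
        # so access is governed by Lake Formation permissions instead.
        "CreateTableDefaultPermissions": [],
    }


def create_database(glue_client, name: str, location_uri: str) -> None:
    """Create the Data Catalog database that will hold the table definitions."""
    glue_client.create_database(DatabaseInput=database_input(name, location_uri))
```

For example: create_database(boto3.client("glue"), "datasetsample00-db", "s3://datalake-hiennu-ap-northeast-1/datasetsample00").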

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the datasetsample00-db database. For IAM role, select your user and AWSGlueServiceRoleDefault.

Grant your user and AWSServiceRoleForLakeFormationDataAccess permissions to use your data lake using a data location:

  • For IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-hiennu-ap-northeast-1.
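
Both grants in this step map to the Lake Formation grant_permissions call. The sketch below builds the two request shapes; the helper names are hypothetical, and the database permissions you pass in are your own choice, because the walkthrough does not list them.

```python
def database_grant(principal_arn: str, database: str, permissions: list) -> dict:
    """Request that grants database-level permissions to one principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": list(permissions),
    }


def data_location_grant(principal_arn: str, location_arn: str) -> dict:
    """Request that lets a principal create catalog resources pointing at a location."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"DataLocation": {"ResourceArn": location_arn}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


def apply_grants(lf_client, grants: list) -> None:
    """Send each prepared request to Lake Formation."""
    for grant in grants:
        lf_client.grant_permissions(**grant)
```

A possible call, with a placeholder account ID and an assumed permission set: apply_grants(boto3.client("lakeformation"), [database_grant("arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault", "datasetsample00-db", ["CREATE_TABLE", "ALTER", "DROP"]), data_location_grant("arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault", "arn:aws:s3:::datalake-hiennu-ap-northeast-1")]).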

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings:

  • Crawler name: samplecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select datasetsample00-db.
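
The console settings above translate roughly into one create_crawler request followed by start_crawler. This is a sketch with hypothetical helper names; leaving out a schedule is what makes the crawler run on demand.

```python
def crawler_request(name: str, role: str, database: str, s3_path: str) -> dict:
    """Build the arguments for the Glue create_crawler call (no schedule, so on demand)."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(glue_client, request: dict) -> None:
    """Create the crawler, then start a single run."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
```

With the values from this step: create_and_run_crawler(boto3.client("glue"), crawler_request("samplecrawler", "AWSGlueServiceRoleDefault", "datasetsample00-db", "s3://datalake-hiennu-ap-northeast-1/datasetsample00")).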

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM role, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.
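
Granting and revoking table access use the same request shape, so one builder can serve both. This sketch assumes that choosing Select all in the console is equivalent to the ALL permission; the helper names are hypothetical.

```python
def table_request(principal_arn: str, database: str, table: str, permissions=("ALL",)) -> dict:
    """Request shape shared by grant_permissions and revoke_permissions for one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant_table_access(lf_client, request: dict) -> None:
    """Give the principal the listed permissions on the table."""
    lf_client.grant_permissions(**request)


def revoke_table_access(lf_client, request: dict) -> None:
    """Take the same permissions away again."""
    lf_client.revoke_permissions(**request)
```

To limit a principal to read-only access, pass permissions=("SELECT",) instead of the default.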

Step 7: Query the data with Athena

Query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the datasetsample00-db database.
  • Choose Tables and select the datasetsample00 table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query: SELECT * FROM datasetsample00 limit 10;
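
The same preview query can be run from code. The sketch below assumes a boto3 Athena client and an S3 location for query results that you supply yourself; the function name is hypothetical. Athena runs queries asynchronously, so the code polls until the query finishes.

```python
import time


def run_query(athena_client, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> list:
    """Start a query, wait for it to finish, and return the first page of rows as lists of strings."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")
    result = athena_client.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; empty cells have no VarCharValue key.
    return [[cell.get("VarCharValue") for cell in row["Data"]]
            for row in result["ResultSet"]["Rows"]]
```

For example, run_query(boto3.client("athena"), "SELECT * FROM datasetsample00 limit 10;", "datasetsample00-db", "s3://your-results-bucket/athena/"), where the results location is a placeholder.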

Machine Learning with Amazon Rekognition

What is Amazon Rekognition?

Amazon Rekognition is a service that makes it easy to add powerful image- and video-based visual analysis to your applications.

Rekognition Image lets you easily build powerful applications to search, verify, and organize millions of images.

Rekognition Video lets you extract motion-based context from stored or live stream videos and helps you analyze them.

You just provide an image or video to the Rekognition API, and the service can identify objects, people, text, scenes, and activities. It can also detect inappropriate content.
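
As an illustration of how little input the API needs, here is a minimal sketch of label detection on an image stored in S3. It assumes a boto3 client created with boto3.client("rekognition"); the helper names, bucket, and object key are hypothetical.

```python
def s3_image(bucket: str, key: str) -> dict:
    """Image parameter that points the service at an object in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def detect_label_names(rek_client, bucket: str, key: str, min_confidence: float = 80.0) -> list:
    """Return (label name, confidence) pairs for objects and scenes found in the image."""
    response = rek_client.detect_labels(
        Image=s3_image(bucket, key),
        MaxLabels=10,
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]
```

A call such as detect_label_names(client, "your-bucket", "images/sample.jpg") would return the labels the service reports with at least 80 percent confidence.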

Amazon Rekognition also provides highly accurate facial analysis and facial recognition. You can detect, analyze, and compare faces for a wide variety of use cases, including user verification, cataloging, people counting, and public safety.

Amazon Rekognition is a HIPAA eligible service.

You need to ensure that the Amazon S3 bucket you want to use is in the same region as your Amazon Rekognition API endpoint.

How does it work?

Amazon Rekognition provides two API sets: Amazon Rekognition Image for analyzing images and Amazon Rekognition Video for analyzing videos.

Both API sets perform detection and recognition analysis of images and videos.
Amazon Rekognition Video can be used to track the path of people in a stored video.
Amazon Rekognition Video can also search a streaming video for persons whose facial descriptions match facial descriptions already stored by Amazon Rekognition.

The RecognizeCelebrities operation analyzes up to the 64 largest faces in an image and returns information for each celebrity it recognizes among them.

Use cases for Amazon Rekognition

Searchable image and video libraries

Amazon Rekognition makes images and stored videos searchable.

Face-based user verification

Amazon Rekognition can be used in building access or similar applications, where it compares a live image to a reference image.
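
The comparison of a live image against a reference image can be sketched with the CompareFaces operation. This assumes a boto3 client created with boto3.client("rekognition") and images passed as raw bytes; the function name and the 90 percent threshold are illustrative choices, not recommendations.

```python
def faces_match(rek_client, reference_bytes: bytes, live_bytes: bytes,
                threshold: float = 90.0) -> bool:
    """Return True if a face in the live image matches the reference face."""
    response = rek_client.compare_faces(
        SourceImage={"Bytes": reference_bytes},
        TargetImage={"Bytes": live_bytes},
        SimilarityThreshold=threshold,
    )
    # FaceMatches only lists faces whose similarity meets the threshold.
    return len(response["FaceMatches"]) > 0
```

The image bytes could come from files opened in binary mode, for example open("reference.jpg", "rb").read(), where the file name is a placeholder.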

Sentiment and demographic analysis

Amazon Rekognition detects emotions such as happiness, sadness, or surprise, and demographic information such as gender, from facial images.

Rekognition can analyze images and send the emotion and demographic attributes to Amazon Redshift for periodic reporting on trends, for example across in-store locations and similar scenarios.
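
Emotion attributes come back from the DetectFaces operation when all facial attributes are requested. The sketch below, with a hypothetical function name, picks the highest-confidence emotion for each detected face; loading the results into Amazon Redshift is left out.

```python
def dominant_emotions(rek_client, bucket: str, key: str) -> list:
    """Return the highest-confidence emotion type for each face in the image."""
    response = rek_client.detect_faces(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        Attributes=["ALL"],  # the default attribute set does not include emotions
    )
    results = []
    for face in response["FaceDetails"]:
        emotions = face.get("Emotions", [])
        if emotions:
            best = max(emotions, key=lambda emotion: emotion["Confidence"])
            results.append(best["Type"])
    return results
```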

Facial recognition

Images, stored videos, and streaming videos can be searched for faces that match those in a face collection. A face collection is an index of faces that you own and manage.
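
Searching a collection with an image can be sketched as follows. It assumes the collection already exists and contains indexed faces; the function name, the collection ID you pass in, and the threshold are illustrative.

```python
def search_collection(rek_client, collection_id: str, image_bytes: bytes,
                      threshold: float = 90.0, max_faces: int = 5) -> list:
    """Return (face ID, external image ID, similarity) for faces in the collection that match."""
    response = rek_client.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=threshold,
        MaxFaces=max_faces,
    )
    return [
        (match["Face"]["FaceId"], match["Face"].get("ExternalImageId"), match["Similarity"])
        for match in response["FaceMatches"]
    ]
```

The external image ID is only present if one was supplied when the face was indexed, which is why the sketch reads it with get.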

Unsafe Content Detection

Amazon Rekognition can detect explicit and suggestive adult content in images and in videos.

This is useful for social and dating sites, photo-sharing platforms, blogs and forums, apps for children, e-commerce sites, entertainment, and online advertising services.
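
A moderation check on an uploaded image might look like the sketch below. The function name and the 60 percent confidence floor are assumptions chosen for illustration; a real site would tune the threshold and decide which label categories to block.

```python
def moderation_labels(rek_client, image_bytes: bytes, min_confidence: float = 60.0) -> list:
    """Return (label, parent category, confidence) for unsafe content found in the image."""
    response = rek_client.detect_moderation_labels(
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )
    return [
        (label["Name"], label.get("ParentName", ""), label["Confidence"])
        for label in response["ModerationLabels"]
    ]
```

An empty list means the service reported nothing at or above the chosen confidence.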

Celebrity recognition

Amazon Rekognition can recognize thousands of celebrities (politicians, sports, business, entertainment, and media) within supplied images and in videos.
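
For still images, celebrity recognition is a single call. A minimal sketch with a hypothetical function name:

```python
def celebrity_names(rek_client, image_bytes: bytes) -> list:
    """Return (name, match confidence) for each celebrity recognized in the image."""
    response = rek_client.recognize_celebrities(Image={"Bytes": image_bytes})
    return [
        (celebrity["Name"], celebrity["MatchConfidence"])
        for celebrity in response["CelebrityFaces"]
    ]
```

Faces that were detected but not recognized are reported separately in the response and are ignored here.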

Text detection

Amazon Rekognition detects text in an image, which lets you extract the textual content from images.
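
Text extraction can be sketched with the DetectText operation, which reports both whole lines and individual words. The function below (a hypothetical name) keeps only the lines.

```python
def detected_lines(rek_client, bucket: str, key: str) -> list:
    """Return the lines of text found in an image stored in S3."""
    response = rek_client.detect_text(
        Image={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Each detection is typed LINE or WORD; words repeat what the lines contain.
    return [
        detection["DetectedText"]
        for detection in response["TextDetections"]
        if detection["Type"] == "LINE"
    ]
```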

Benefits

Integrate powerful image and video recognition into your apps

Amazon Rekognition removes the complexity of building image recognition capabilities into applications by making powerful and accurate analysis available with a simple API.

Deep learning-based image and video analysis

Rekognition uses deep learning technology to accurately analyze images, find and compare faces in images, and detect objects and scenes within images and videos.

Scalable image analysis

Amazon Rekognition enables the analysis of millions of images.
This allows for curating and organizing massive amounts of visual data.

Low cost

Clients pay for the images and videos they analyze and the face metadata that they store. There are no minimum fees or upfront commitments.