AWS Data Infrastructure Guide

The document outlines the process of building and managing data infrastructure using AWS services, including setting up data lakes, ingesting data, and preparing it for analytics. It emphasizes the importance of data cataloging, security, governance, and automation in data workflows. Additionally, it discusses orchestration and automation tools like AWS Step Functions and AWS Lambda to streamline data processing and analytics tasks.


Build and manage data infrastructure and platforms

This includes setting up databases, data lakes, and data warehouses on AWS
services like Amazon Simple Storage Service (Amazon S3), AWS Glue, Amazon
Redshift, among others.

Ingest data from various sources


You can use tools like AWS Glue jobs or AWS Lambda functions to ingest data from
databases, applications, files, and streaming devices into the centralized data
platforms.
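As a rough sketch of event-driven ingestion, the following hypothetical Lambda handler reacts to S3 object-created event notifications and collects the bucket and key of each new file; the actual copy into the data platform (for example, a boto3 call) would go where the comment indicates. The function and field names follow the standard S3 event shape, but the bucket and key values are illustrative.

```python
# Sketch of a Lambda ingestion handler, assuming the standard
# S3 event notification shape; names here are illustrative.
def lambda_handler(event, context):
    ingested = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Here you would read the object (e.g. with boto3) and
        # write it into the centralized data platform.
        ingested.append(f"s3://{bucket}/{key}")
    return {"ingested": ingested}
```

Invoked with a sample event containing one new object, the handler returns the list of ingested object URIs.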
Prepare the ingested data for analytics
Use technologies like AWS Glue, Apache Spark, or Amazon EMR to prepare data by
cleaning, transforming, and enriching it.

Catalog and document curated datasets


Use AWS Glue crawlers to determine the format and schema, group data into tables,
and write metadata to the AWS Glue Data Catalog. Use metadata tagging in Data
Catalog for data governance and discoverability.
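To make the crawler step concrete, here is a sketch of the configuration you might pass to the AWS Glue create_crawler API (via boto3, for example). The role ARN, database name, S3 path, and tags are placeholders, not real resources.

```python
# Illustrative AWS Glue crawler configuration; the role ARN,
# database, S3 path, and tag values are placeholders.
crawler_config = {
    "Name": "curated-datasets-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "curated_db",
    "Targets": {"S3Targets": [{"Path": "s3://example-data-lake/curated/"}]},
    # Tags can support governance and discoverability in the catalog.
    "Tags": {"owner": "data-platform", "classification": "internal"},
}

# With boto3 this would be passed as keyword arguments:
# boto3.client("glue").create_crawler(**crawler_config)
```

The crawler then infers the format and schema of the objects under the target path and writes the resulting table definitions to the named Data Catalog database.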

Automate regular data workflows and pipelines


Simplify and accelerate data processing using services like AWS Glue workflows,
AWS Lambda, or AWS Step Functions.

Ensure data quality, security, and compliance


Create access controls, establish authorization policies, and build monitoring
processes. Use Amazon DataZone or AWS Lake Formation to manage and govern
access to data using fine-grained controls. These controls help ensure access with
the right level of privileges and context.
Stage 2: Store

The first step is deciding where to store the data. Before you can ingest data, you need a place to put it, so a modern data architecture starts with the data lake. A data lake is a centralized repository that you can use to store structured, semi-structured, and unstructured data at scale. Organizations can use it to ingest, store, and analyze diverse datasets without the need for extensive preprocessing.

Amazon S3 provides an optimal foundation for a data lake because of its virtually
unlimited scalability and high durability. You can seamlessly and non-disruptively
increase storage from gigabytes to petabytes of content and only pay for what you
use.
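One common (though by no means mandatory) convention for organizing a data lake on Amazon S3 is hive-style partitioned key prefixes, which many query engines can prune on. The helper below is a hypothetical sketch of that idea; the zone and dataset names are illustrative, not AWS requirements.

```python
from datetime import date

# Sketch of hive-style partitioned S3 key naming for a data lake.
# Zone and dataset names are illustrative conventions only.
def build_object_key(zone: str, dataset: str, day: date, filename: str) -> str:
    return (
        f"{zone}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )
```

For example, build_object_key("raw", "sales", date(2024, 5, 7), "orders.parquet") yields "raw/sales/year=2024/month=05/day=07/orders.parquet".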

Stage 1: Ingest

After you have established the data lake, you can use specialized AWS services to ingest different types of data into it.
Stage 3: Catalog
An essential component of a data lake built on Amazon S3 is the data catalog.
Organizations can use cataloging to keep track of data assets and understand what
data exists, where it is located, its quality, and how it is used. A data catalog is
designed to provide a single source of truth about the contents of the data lake.
AWS Glue Data Catalog creates a catalog of metadata about your stored assets.
Use this catalog to help search and find relevant data sources based on various
attributes like name, owner, business terms, and others.

Stage 4: Process
After the data is cataloged, it can now be processed or transformed into formats
that are more useful for analysis and insights. Transformation can include data type
conversion, filtering, aggregation, standardization, and normalization.
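As a toy illustration of these transformations (type conversion, filtering, and standardization) on ingested records, independent of any particular engine such as Spark or AWS Glue, the field names below are invented for the example:

```python
# Toy transformation step: convert types, drop invalid rows,
# and standardize field values. Field names are illustrative.
def transform(records):
    cleaned = []
    for rec in records:
        try:
            amount = float(rec["amount"])  # type conversion
        except (KeyError, ValueError):
            continue                       # filter out bad rows
        cleaned.append({
            "customer_id": str(rec.get("cust", "")).strip().lower(),  # standardize
            "amount": amount,
        })
    return cleaned
```

Given a row with amount "10.5" and one with a non-numeric amount, only the valid row survives, with its customer identifier trimmed and lowercased.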

Stage 5: Deliver (Analytics Services)


Transformed data is delivered to consumers and stakeholders, such as data
scientists, data analysts, and business analysts. The primary purpose of data
analytics is to extract insights from data that can lead to good business or
organizational outcomes. Many AWS services can be used at this stage.
Stage 6: Security and Governance
Security in data analytics systems refers to measures taken to protect data from
unauthorized access, breaches, or attacks. It involves safeguarding data
confidentiality, integrity, and availability. The entire data analytics system depends
on data being secured and accessible only by authorized users.
Governance encompasses the policies, procedures, and processes that ensure the
proper management, quality, and use of data. It involves defining roles,
responsibilities, and decision-making processes related to data.
Following are some of the AWS services used for security and governance. These
are covered in more detail in this course in the Security and Monitoring in Data
Analytics Systems lesson.

With Lake Formation, you can centrally manage and scale fine-grained data
access permissions and share data with confidence within and outside your
organization.

IAM manages fine-grained access and permissions for human users, software users,
other services, and microservices.
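For illustration, an identity-based IAM policy granting read-only access to a data lake bucket might look like the following. The structure follows the standard IAM policy grammar, but the bucket name is a placeholder.

```python
# Illustrative IAM policy document granting read-only access to
# a hypothetical data lake bucket; the bucket name is a placeholder.
readonly_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-data-lake",
                "arn:aws:s3:::example-data-lake/*",
            ],
        }
    ],
}
```

Note that ListBucket applies to the bucket ARN itself while GetObject applies to the objects under it, which is why both resource forms appear.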
https://aws.amazon.com/big-data/datalakes-and-analytics/
https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/scenarios.html

Orchestration and Automation Options


As businesses increasingly rely on data analytics to make informed decisions,
managing the complexity of data workflows, processing, and analysis becomes a
significant challenge. Without efficient coordination and automation, data pipelines
can become fragmented, error-prone, and time-consuming to manage. Additionally,
scaling these processes to handle growing volumes of data can be daunting.
Orchestration and automation can help solve these problems.
Orchestration is the process of coordinating multiple services to define and
manage the flow of data through a series of steps. It involves defining workflows
and dependencies between steps.

Automation refers to using tools and services to automate repetitive tasks related
to data ingestion, processing, and analytics.

Automation is suitable for simple repetitive tasks. Orchestration is needed for complex workflows involving the coordination of multiple services, teams, and dependencies across stages.
Typically, they are used together in analytics workflows. For example, orchestration
could involve coordinating multiple automated tasks in a defined sequence.
Together, orchestration and automation can streamline operations, improve
reliability, and empower non-programmers to manage complex workflows.

Many AWS services can be used to orchestrate pipelines and workflows. They can
be combined in nearly unlimited ways to meet very demanding requirements.
The following is a partial list of AWS services that can be used in data analytics
systems for orchestration and automation.
AWS Step Functions
Step Functions is a visual workflow service to orchestrate and automate workflows,
pipelines, and processes. Step Functions ensures tasks run in the correct order. It
does the following:
 Orchestrates ETL workflows by connecting Lambda functions that extract the
data from sources, transform it, and load it into databases and data lakes.
 Runs batch jobs on data in AWS Glue, Amazon EMR, or other services.
 Processes streaming data by connecting Lambda functions that process data from Kinesis Data Streams or Amazon Data Firehose for real-time analytics.
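To make the ETL orchestration concrete, a minimal workflow can be expressed in Amazon States Language, which Step Functions executes. The sketch below builds such a definition as a Python dict; the Lambda function ARNs are placeholders for illustration.

```python
import json

# Minimal Amazon States Language definition for an ETL workflow.
# The Lambda function ARNs are placeholders, not real functions.
etl_definition = {
    "Comment": "Extract, transform, and load in sequence",
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",
            "Next": "Transform",
        },
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",
            "End": True,
        },
    },
}

# The JSON string is what you would pass to Step Functions, e.g.:
# boto3.client("stepfunctions").create_state_machine(
#     name=..., roleArn=..., definition=json.dumps(etl_definition))
definition_json = json.dumps(etl_definition)
```

The Retry block on the Transform state illustrates how Step Functions handles transient failures declaratively, so the Lambda code itself stays free of retry logic.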
AWS Lambda
Lambda runs code (called Lambda functions) without provisioning or managing
servers. Combined with Step Functions, Lambda functions can invoke AWS services
and microservices and perform tasks that are part of orchestrated workflows.
 Lambda functions can be invoked by events from data sources like Amazon
S3, DynamoDB, or Kinesis Data Streams to process incoming data in real
time.
 Step Functions can be used to orchestrate multiple Lambda functions with error handling, retries, and workflow visualization.
 Lambda functions can be used in event-driven architectures with services like
Amazon SNS and Amazon SQS to decouple and coordinate different analytics
tasks.
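As an illustration of the first point, Lambda delivers Kinesis Data Streams records base64-encoded in the invocation event. The minimal handler sketch below decodes each payload; the processing step itself is left as a comment, and the payload format (JSON) is an assumption for the example.

```python
import base64
import json

# Sketch of a Lambda handler for Kinesis Data Streams events.
# Kinesis delivers record payloads base64-encoded under
# Records[*].kinesis.data in the event.
def lambda_handler(event, context):
    payloads = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
        # Real-time processing (aggregation, enrichment,
        # forwarding) of each payload would happen here.
    return {"processed": len(payloads)}
```

Invoked with a batch of one encoded JSON record, the handler decodes it and reports one processed payload.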
