Data Engineering with AWS Cookbook, published by Packt.
The chapter-by-chapter list of recipes follows, with links to the companion files where needed.
- Controlling access to S3 buckets
- Storage types in S3 for optimized storage costs
- Enforcing encryption of S3 buckets (see the example after this list)
- Setting up retention policies for your objects
- Versioning your data
- Replicating your data
- Monitoring your S3 buckets
- Creating read-only replicas for RDS
- Sharing live Redshift data among your clusters
- Synchronizing Glue Data Catalog to a different account
- Enforcing fine-grained permissions on S3 data sharing using Lake Formation
- Sharing your S3 data temporarily using a presigned URL (see the example after this list)
- Real-time sharing of S3 data
- Sharing read-only access to your CloudWatch data with another AWS account
- Creating ETL jobs visually using AWS Glue Studio
- Parameterizing jobs to make them more flexible and reusable (see the example after this list)
- Handling job failures and reruns for partial results
- Processing data incrementally using bookmarks and bounded execution (see the example after this list)
- Handling a high quantity of small files in your job
- Reusing libraries in your Glue job
- Using data lake formats to store your data
- Optimizing your catalog data retrieval using pushdown filters and indexes
- Running pandas code using AWS Glue for Ray
- Defining a simple workflow using AWS Glue Workflows
- Setting up event-driven orchestration with Amazon EventBridge
- Creating a data workflow using AWS Step Functions
- Managing Data Pipelines with MWAA
- Monitoring your pipeline health
- Setting up a pipeline using AWS Glue to ingest data from a JDBC database into a catalog table
- Running jobs using Amazon EMR Serverless
- Running your Amazon EMR cluster on EKS
- Using the AWS Glue catalog from another account
- Making your cluster highly available
- Scaling your cluster based on workload
- Customizing the cluster nodes easily using bootstrap actions
- Tuning Apache Spark resource usage
- Code development on EMR using Workspaces
- Monitoring your cluster
- Protecting your cluster from security vulnerabilities
- Applying data quality checks on Glue tables
- Automating the discovery and reporting of sensitive data on your S3 buckets
- Establishing a tagging strategy for AWS resources
- Building your distributed data community with AWS DataZone following data mesh principles
- Handling security-sensitive data (PII and PHI)
- Ensuring S3 compliance with AWS Config
- Creating data quality checks for ETL jobs in AWS Glue Studio notebooks
- Unit testing your data quality using Deequ
- Schema management for ETL pipelines
- Building unit test functions for ETL pipelines
- Building data cleaning and profiling jobs with DataBrew
- Setting up a code deployment pipeline using CDK and AWS CodePipeline
- Setting up a CDK pipeline to deploy on multiple accounts and regions
- Running code in a CloudFormation deployment
- Protecting resources from accidental deletion
- Deploying a data pipeline using Terraform
- Reverse-engineering IaC
- Integrating AWS Glue and Git version control
- Automatically setting CloudWatch log group retention to reduce cost
- Creating custom dashboards to monitor Data Lake services
- Setting up Systems Manager to remediate non-compliance with AWS Config rules
- Using AWS Config to automate remediation of a non-compliant S3 server access logging policy
- Tracking AWS Data Lake cost per analytics workload
- Accessing the Redshift cluster using JDBC to query data
- Creating a VPC endpoint to establish private connectivity between a private Redshift cluster and client applications
- Querying large historical data with Redshift Spectrum
- Using Redshift workload management to manage workload priority
- Using AWS SDK for pandas, the Redshift Data API, and Lambda to execute SQL statements (see the example after this list)
- Using AWS SDK for Python to manage Amazon QuickSight
- Reviewing the steps and processes for migrating an on-premises platform to AWS
- Choosing your AWS analytics stack – the re-platforming approach
- Picking the correct migration approach for your workload
- Planning for prototyping and testing
- Converting ETL processes with big data frameworks
- Defining and executing your migration process with Hadoop
- Migrating the existing Hadoop security authentication and authorization processes
- Creating a migration assessment report with AWS SCT
- Extracting Data with AWS DMS
- Live example – migrating an Oracle database from a local laptop to AWS RDS using AWS SCT
- Leveraging AWS Snow Family for large-scale data migration
- Calculating total cost of ownership (TCO) using AWS TCO calculators
- Conducting a Hadoop migration assessment using the TCO simulator
- Selecting how to store your data
- Migrating on-premises HDFS data using AWS DataSync
- Migrating the Hive Metastore to AWS
- Migrating and running Apache Oozie on Amazon EMR
- Migrating an Oozie database to Amazon RDS for MySQL
- Setting up networking – establishing a secure connection to your EMR cluster
- Performing a seamless HBase migration to AWS
- Migrating HBase to DynamoDB on AWS
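A few of the recipes above are illustrated with short sketches below. These are minimal examples written for this README, not the book's companion code; bucket names, table names, and other identifiers are placeholders.

Sharing your S3 data temporarily using a presigned URL: a minimal boto3 sketch that generates a time-limited download URL for a single object (bucket and key names are illustrative only).

```python
import boto3

s3 = boto3.client("s3")

# Generate a URL that grants temporary read access to one object.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-example-bucket", "Key": "exports/report.csv"},
    ExpiresIn=3600,  # the URL stops working after one hour
)
print(url)
```

Anyone holding the URL can download the object until it expires, so keep the expiry as short as the use case allows.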
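Enforcing encryption of S3 buckets: a minimal boto3 sketch that turns on default SSE-KMS encryption for a bucket; the bucket name and KMS key alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Apply default encryption so new objects are encrypted with the given KMS key.
s3.put_bucket_encryption(
    Bucket="my-example-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/my-example-key",  # placeholder key alias
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```

To enforce encryption end to end, you might pair this with a bucket policy that denies uploads that do not request the expected encryption.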
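Parameterizing jobs to make them more flexible and reusable: a minimal AWS Glue ETL sketch that reads its source and target paths from job arguments; the parameter names are illustrative and would be passed as --source_path / --target_path job arguments.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve job arguments instead of hard-coding paths in the script.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read from the parameterized source and write to the parameterized target.
df = spark.read.parquet(args["source_path"])
df.write.mode("overwrite").parquet(args["target_path"])
```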
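Processing data incrementally using bookmarks and bounded execution: a minimal Glue sketch showing the pieces a job bookmark needs inside the script (transformation_ctx on the source and sink, plus job.init and job.commit); bookmarks must also be enabled on the job itself, and the catalog database and table names are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the handle the bookmark uses to remember what was already read.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",          # placeholder catalog database
    table_name="raw_orders",      # placeholder catalog table
    transformation_ctx="source",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/curated/orders/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists the bookmark state for the next run
```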
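Using AWS SDK for pandas, the Redshift Data API, and Lambda to execute SQL statements: a minimal Lambda handler sketch built on awswrangler's Data API helpers; the cluster identifier, database, and user are placeholders, and the function's role needs Redshift Data API permissions.

```python
import awswrangler as wr


def lambda_handler(event, context):
    # Connect through the Redshift Data API, so no JDBC driver is needed in the function.
    con = wr.data_api.redshift.connect(
        cluster_id="my-example-cluster",  # placeholder cluster identifier
        database="dev",
        db_user="awsuser",
    )
    df = wr.data_api.redshift.read_sql_query(
        "SELECT COUNT(*) AS order_count FROM public.orders",
        con=con,
    )
    return {"order_count": int(df["order_count"][0])}
```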