[go: up one dir, main page]

0% found this document useful (0 votes)
47 views22 pages

Azure Data Fundamentals - Study Notes

The document provides comprehensive study notes on Azure Data Fundamentals, covering core data concepts, data storage types, and data processing methodologies. It details relational and non-relational data in Azure, including various Azure services for data management and analytics. Additionally, it outlines data roles, services, and architectures relevant to data analytics within the Azure ecosystem.

Uploaded by

Kadimi Kiranraj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views22 pages

Azure Data Fundamentals - Study Notes

The document provides comprehensive study notes on Azure Data Fundamentals, covering core data concepts, data storage types, and data processing methodologies. It details relational and non-relational data in Azure, including various Azure services for data management and analytics. Additionally, it outlines data roles, services, and architectures relevant to data analytics within the Azure ecosystem.

Uploaded by

Kadimi Kiranraj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Azure Data Fundamentals

Study Notes

Htay Aung
https://www.linkedin.com/in/htayaung

January 24, 2025


Table of Contents
Core Data Concepts ....................................................................................................................................................... 4
Structured Data ........................................................................................................................................................... 4
Semi-Structured Data................................................................................................................................................ 4
Unstructured Data ...................................................................................................................................................... 4
Data Stores .................................................................................................................................................................... 4
File Stores ...................................................................................................................................................................... 4
Databases ...................................................................................................................................................................... 6
Transactional Data Processing ............................................................................................................................... 6
Analytical Data Processing ...................................................................................................................................... 7
Common Architecture .......................................................................................................................................... 7
Data Roles ..................................................................................................................................................................... 8
Data Services ................................................................................................................................................................ 8
Relational Data in Azure ............................................................................................................................................ 11
SQL ................................................................................................................................................................................ 11
Data Objects .............................................................................................................................................................. 11
Non-Relational Data in Azure.................................................................................................................................. 12
Azure Blob Storage ................................................................................................................................................. 12
Types of blobs ...................................................................................................................................................... 12
Access tiers ............................................................................................................................................................ 13
Azure Data Lake Storage (ADLS) Gen1 ............................................................................................................ 13
Azure Data Lake Storage (ADLS) Gen2 ............................................................................................................ 13
Azure Files .................................................................................................................................................................. 13
Azure Tables............................................................................................................................................................... 14
Azure Cosmos DB .................................................................................................................................................... 14
Supported APIs .................................................................................................................................................... 15
Data Analytics in Azure .............................................................................................................................................. 16
General data analytics architecture .................................................................................................................. 16
Batch Processing ...................................................................................................................................................... 16
Stream Processing ................................................................................................................................................... 17
Real-time analytics services ................................................................................................................................. 17
Sources for stream processing ........................................................................................................................... 17
Sinks for streaming processing .......................................................................................................................... 17
Delta Lake ................................................................................................................................................................... 18
Microsoft Power BI .................................................................................................................................................. 18
Choosing for data visualization ..................................................................................................................... 18
Azure Synapse Analytics ............................................................................................................................................ 22
Resources ........................................................................................................................................................................ 22
Core Data Concepts
▪ Entities
o E.g. customers, products, sales orders, and so on
▪ Attributes
o E.g. a customer’s name, address, phone number, and so on

Structured Data
▪ Fixed schema
▪ The schema type is tubular
▪ Often stored in a database
▪ Reference multiple tables using key values – relational model

Semi-Structured Data
▪ Has some structures
▪ Allow some variation between entities
▪ Common format - JavaScript Object Notation (JSON) document
▪ E.g. Customer entity
o Some customers include one contact
o Some customers include two or more contacts
o Some customers don’t include any contacts

Unstructured Data
▪ Don’t have a specific structure
▪ Don’t have a specific semi-structure
▪ E.g. documents, images, audio, video, and binary files

Data Stores
Common data store categories:

▪ File stores
▪ Databases

File Stores
▪ Local file systems
▪ USB drives
▪ Shared file storage system

Central storage locations hosted in the cloud enable cost-effective, secure, and reliable
storage for large volumes of data.
Common file formats:

▪ Delimited text files


o Comma-separated values (CSV)
o Tab-separated values (TSV)
▪ JSON
o Attributes represented by name : value pairs and separated by comma (,)
o Attribute might contain object collection
o Objects are enclosed in braces ({…})
▪ XML
o Use tags
o Enclosed in angle-brackets (<…/>)
o Contains elements and attributes
▪ Binary Large Object (BLOB)
o Stored as binary data (1’s and 0’s)
o Common file types
▪ Images, video, audio, and application-specific documents
▪ Optimized file formats
o Common file formats
▪ Avro
• Row-based format
• Created by Apache
• Each record contains a header (data structure in the record).
Stored as JSON.
• Stored as binary information
• Good format for compressing data, minimizing storage, and
network bandwidth requirements
▪ ORC (Optimized Row Columnar format)
• Organize data into columns
• Developed by HortonWorks for optimizing read and write
operations in Apache Hive
▪ Parquet
• Columnar data format
• Created by Cloudera and X
• Contains row groups
• Data for each column is stored together in the same row group.
Each row group contains one or more chunks of data
• Includes metadata that describes the set of rows found in each
chunk
• Support efficient compression and encoding schemes
Databases
▪ Can store data and queries
▪ Relational databases
o Apply relational schema to data
o Commonly used to store and query structured data
o Tables represent entities
o Primary key – unique identifier
o Use the primary key to reference other tables. It enables a relational database to
be normalized
o Normalized – elimination of duplicate data values
o Use Structured Query Language (SQL)
o Normalization
▪ Use the least possible amount of storage
▪ Optimizes for updates, inserts, and deletes
▪ Non-relational databases
o No relational schema to data
o Often referred to as NoSQL database
o Common types
▪ Key-value databases
• Each record contains a unique key and associated value in any
format
▪ Document databases
• JSON
▪ Column family databases
• Tabular data – rows and columns
• Columns can be divided into groups (column-families)
▪ Graph databases
• Store as nodes with links to define relationships between them

Transactional Data Processing


▪ Primary function of business computing
▪ Records transactions
▪ A transaction is a small, discrete, unit of work
▪ Often referred to as Online Transactional Processing (OTP)
o Optimized for both read and write operations (CRUD)
o Enforced transactions that support ACID semantics
▪ Atomicity
Each transaction is treated as a single unit
▪ Consistency
Transactions can only take the data in the database from one valid state
to another
▪ Isolation
Concurrent transactions cannot interfere with one another
▪ Durability
When a transaction has been committed, it will remain committed.
o Typically used to support live applications – Line of Business (LOB) applications

Analytical Data Processing


▪ Typically using read-only (or read-mostly) systems that store vast volumes of historical
data or business metrics

Common Architecture

1. Data is Extracted, Transformed, and Loaded (ETL) into a data lake for analysis
2. Data is loaded into a schema of tables – typically in a Spark-based data lakehouse or a
data warehouse
3. Aggregated or loaded into an Online Analytical Processing (OLAP) model or cube
4. Data in the data lake, data warehouse, and analytical model can be queried to produce
reports, visualizations, and dashboards

Data Lake
A centralized repository that ingests and stores large volumes of data in its original form

Data scientists might work directly with data files in a data lake to explore and model data.

Data Warehouse
A centralized repository that stores structured data (database tables, Excel sheets) and semi-
structured data (XML files, webpages) for reporting and analysis

Data Analysts might query tables directly in the data warehouse to produce complex reports and
visualizations.

Data Lakehouse
A data management system that combines the benefits of data lakes and data warehouses

Online Analytical Processing (OLAP) model


- An aggregated type of data storage optimized for analytical workload.
- Enable to drill up/down to view aggregations at multiple hierarchical levels

Business users might consume pre-aggregated data in an analytical model through reports or
dashboards.

Data Roles
Role Responsibilities

Database Administrator - Manage databases


- Assign permissions to users
- Backup and restore data

Data Engineer - Manage infrastructure and processes


for data integration
- Apply data cleaning routines
- Identify data governance rules
- Implement pipelines to transfer and
transform data

Data Analyst - Explore and analyze data


- Create visualizations and charts

Data Services
▪ Azure SQL
o Azure SQL Database
▪ Fully managed PaaS database hosted in Azure
▪ PaaS
▪ For new cloud solutions or to migrate applications that have minimal
instance-level dependencies
▪ Available a Single Database or an Elastic Pool
• Single Database
o Charged per hour
o Able to specify a serverless config
• Elastic Pool
o Multiple databases can share the same resources (memory,
storage space, processing power through multiple tenancy)
o Useful if resource requirement varies over time
o Not fully compatible with on-premises SQL server
installations
o Azure SQL Managed Instance
▪ Hosted SQL Server instance with automated maintenance
▪ More administrative responsibility than Azure SQL DB
▪ PaaS
▪ For most cloud migration scenarios. Minimal changes to existing
applications
o Azure SQL VM
▪ A VM with an SQL Server installation
▪ Maximum configurability
▪ Full management responsibility
▪ IaaS
▪ To migrate or extend on-premises SQL Server solution and retain full
control of the server and database configuration
▪ Open-source databases in Azure
o Azure Database for MySQL
▪ Commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps
▪ Automatic backups and point-in-time restore for the last 35 days
o Azure Database for MariaDB
▪ Created by original developers of MySQL
▪ Compatible with Oracle DB
▪ Automatic backups and point-in-time restore for the last 35 days
o Azure Database for PostgreSQL
▪ Hybrid relational-object DB
▪ Able to store custom data types
▪ Azure Cosmos DB
o Global-scale non-relational DB system (NoSQL)
o Support multiple APIs
o Able to store and manage data like JSON documents, key-value pairs, column
families, and graphs
▪ Azure Storage
o Store data in Blob containers – scalable, cost-effective storage for binary files
o Store data in File shares – network file shares
o Store data in Tables – key-value storage
o Data Engineers used to host data lakes – blob storage with a hierarchical
namespace that enables files to be organized in folders in a distributed system
▪ Azure Data Factory
o Able to define and schedule data pipelines to transfer and transform data
o Data Engineers used to build ETL solutions
▪ Microsoft Fabric
o Unified SaaS analytics platform based on the open and governed lakehouse
o Data ingestion and ETL
o Data lakehouse analytics
o Data warehouse analytics
o Data Science and Machine Learning
o Realtime analytics
o Data visualization
o Data governance and management
o AI-powered insights
o Apache Spark-based service
▪ Microsoft OneLake
o A single, unified, logical data lake for the whole organization
o Comes automatically with every Microsoft Fabric tenant
o Built on top of Azure Data Lake Storage Gen2
o Store data in Delta Parquet format
o Able to navigate data from Windows using OneLake file explorer
▪ Azure Databricks
o Azure-integrated version of popular Databricks platform
o Combines Apache Spark data processing platform with SQL database semantics
o Integrated Management Interface
o For large-scale data analytics
▪ Azure Stream Analytics
o Real-time stream processing engine
o Capture data stream from input, apply query to extract and manipulate data, and
write the results to output for analysis or further processing
▪ Azure Data Explorer
o Fully managed, standalone, big data analytics platform
o Offer high-performance querying of log and telemetry data
▪ Microsoft Purview
o Provide enterprise-wide data governance and discoverability
o Able to create a map of data and track data lines across multiple data sources
and systems
▪ Azure Synapse Analytics
o Enterprise analytics service
o Built on Spark
o Databricks and Spark pool run a large amount of data processing using Scala
o SQL + Spark + Data Explorer + Pipelines + ETL/ELT + Integration with Power BI,
CosmosDB, and Azure ML

▪ Data warehouse
o Use fact and dimension tables in a star/snowflake schema

Relational Data in Azure


SQL
SQL statement types

▪ Data Definition Language (DDL)


o CREATE – create a new object (e.g. table, view)
o ALTER
o DROP
o RENAME
▪ Data Control Language (DCL)
o GRANT – grant permission to perform specific actions
o DENY
o REVOKE
▪ Data Manipulation Language (DML)
o SELECT
o INSERT
o UPDATE
o DELETE

Data Objects
▪ View
o A virtual table based on the results of a SELECT query
▪ Stored Procedure
o Encapsulate programmatic logic in a database for actions
▪ Index
o Help search for data in a table
o Create a tree-based structure that the database system’s query optimizer can use
to find rows in the table quickly

Non-Relational Data in Azure


Azure Blob Storage
▪ Able to store massive amounts of unstructured data as binary large objects (blob)
▪ Application can read and write using Azure blob storage API
▪ Within a container, organize blobs in a hierarchy of virtual folders
▪ Can’t perform folder-level operations to control access or bulk operations

Types of blobs
▪ Block blobs
o Handle as a set of blocks
o Each block’s up to 4000 MiB
o Block blob can contain up to 190.7 TiB
o Maximum 5000 MiB
o Best for discrete, large, binary objects that change infrequently
▪ Page blobs
o A collection of fixed-size 512-byte pages
o Can hold up to 8 TB of data
o Azure uses page blobs to implement virtual disk storage for virtual machines.
▪ Append blobs
o Optimized Block blob to support append operations
o Can only add blocks to the end
o Doesn’t support updating or deleting existing blocks
o Each block’s size up to 4 MB
o Maximum 195 GB
Access tiers
▪ Hot
o Frequent access
o High performance
o Can be migrated to the Cool tier
o A few milliseconds for reading latency
▪ Cool
o Lower performance
o Lower storage cost than Hot tier
o Can be migrated back to the Hot tier
o A few milliseconds for reading latency
▪ Archive
o Lowest storage cost
o Increased latency to retrieve
o Stored in an offline state
o Can take hours for the data to become available
o To retrieve a blob, change the access tier to Hot or Cool. The blob will be
rehydrated.

Azure Data Lake Storage (ADLS) Gen1


▪ Historical data storage for analytical data lakes
▪ Big data analytical solution
▪ Work with structured, semi-structured, unstructured data stored in files

Azure Data Lake Storage (ADLS) Gen2


▪ Newer version of Gen1
▪ Integrated into Azure Storage
▪ Often used in the batch processing scenarios
▪ Azure Databricks can mount a distributed file system hosted in Azure Data Lake Store
Gen2
▪ To create a Gen2 file system, you must enable the Hierarchical Namespace option of the
Azure Storage account
▪ Enabling Hierarchical Namespace can’t revert to the flat namespace

Azure Files
▪ A way to create cloud-based network shares
▪ Enable to share up to 100 TB of data in a single storage account
▪ Max size of a single file is 1 TB
▪ Support up to 2000 concurrent connections per shared file
▪ 2 performance tiers
o Standard
▪ Hard disk-based
o Premium
▪ Solid-state disk-based (great throughput, higher rate)
▪ 2 common network file-sharing protocols
o Server Message Block (SMB)
o Network File System (NFS)
▪ Must use the premium tier
▪ Able to upload files to the file share, and grant access to remote users

Azure Tables
▪ NoSQL storage
▪ Key/value data items
▪ Item is represented by a row that contains columns for the data field
▪ Enable to store semi-structured data
▪ All rows must have a unique key (composed of a partition key and a row key)
▪ No concept of FK, relationships, stored procedures, views, or other objects that might be
found in a RDBMS
▪ Usually, denormalized
▪ Splits a table into partitions (grouping related rows, based on common property or
partition key)

Azure Cosmos DB
▪ NoSQL
▪ Fully managed, serverless distributed DB
▪ Support both relational and non-relational workloads
▪ Use indexes and partitioning for fast read and write performance
▪ Index – created and maintained automatically
▪ Each partition can grow up to 10 GB
▪ Can enable multi-region writes (globally distributed users can work with data in their
local replica)
▪ Highly suitable for the following scenarios
o IoT and telematics
o Retail and marketing
o Gaming
o Web and mobile applications

Supported APIs
▪ Azure Cosmos DB for NoSQL
o JSON document format
▪ Azure Cosmos DB for MongoDB
o Able to use MongoDB client libraries
▪ Azure Cosmos DB for PostgreSQL
▪ Azure Cosmos DB for Table
o Work with data in key-value tables (like Azure Table Storage)
▪ Azure Cosmos DB for Apache Cassandra
o Column-family storage structure
o Not mandatory for every row to have the same columns
▪ Azure Cosmos DB for Apache Gremlin
o Graph structure
o Entities – vertices that form nodes in a connected graph
o Nodes are connected by edges – relationships
Data Analytics in Azure
General data analytics architecture

1. Data ingestion and processing


a. Data from different sources is loaded into a data lake or a relational data
warehouse
b. ETL – data is cleaned, filtered, and restructured for analysis
c. 2 types of processing
i. Batch processing – static data
ii. Real-time processing – streaming data
2. Analytical data store
a. Relational data warehouses
b. File-system-based data lakes
c. A hybrid architecture that combines features of the data warehouse and data lake
(data lakehouses/lake databases)
3. Analytical data model
a. Aggregate data (aka cubes)
b. Encapsulates the relationships between data values and dimensional entities to
support drill-up/drill-down analysis
4. Data visualization
a. Consume data from the analytical model and from analytical stores to create the
reports, dashboards, and other visualizations

Batch Processing
▪ Multiple data records are collected and stored before being processed together in a
single operation
▪ Advantages
o Large volumes of data can be processed at a convenient time
o Can be scheduled to run
▪ Disadvantages
o Time delay between ingesting data and getting results
o All input data must be ready before a batch can be processed
o Data problems, errors, and program crashes bring the whole process to a halt

Stream Processing
▪ The source of data is constantly monitored and processed in real-time as new data
events occur
▪ Can aggregate data over time intervals

Real-time analytics services


▪ Azure Stream Analytics
o PaaS
o Can define streaming jobs that ingest data from a streaming source, apply a
perpetual query, and write results to an output
▪ Spark Structured Streaming
o Open-source library
o Enable to develop streaming solutions on Apache Spark-based service (Microsoft
Fabric and Azure Databricks)
▪ Microsoft Fabric

Sources for stream processing


▪ Azure Event Hubs
o Data ingestion service to manage queues of event data
▪ Azure IoT Hub
o Data ingestion service optimized to manage event data from IoT devices
▪ Azure Data Lake Store Gen 2
o Highly scalable storage service, often used in the batch processing scenarios
o Can be used as a source of streaming data
▪ Apache Kafka
o Open-source data ingestion solution
o Commonly used together with Apache Spark

Sinks for streaming processing


▪ Azure Event Hubs
▪ Azure Data Lake Store Gen 2, Microsoft OneLake, Azure blob storage
o Used to persist processed results as a file
▪ Azure SQL Database, Azure Databricks, Microsoft Fabric
o Used to persist processed results in a table for querying and analysis
▪ Microsoft Power BI
o Used to generate real-time data visualizations in reports and dashboards
Delta Lake
▪ Open-source storage layer that adds support for transactional consistency, schema
enforcement, and other common data warehousing features to data lake storage.
▪ Combined with Spark Structured Streaming is a good solution when needing to abstract
batch and stream processed data in a data lake behind a relational schema for SQL-
based querying and analysis

Microsoft Power BI
▪ Power BI Desktop
o Can import data from a wide range of data sources
o Combine and organize data in an analytics data model
o Create reports
▪ Power BI Service
o A cloud service in which reports can be published and interacted with by business
users
▪ Power BI Phone App
o Users can consume reports, dashboards, and apps in the Power BI service
through a web browser or mobile device

Dimensions – used to aggregate data

Fact tables – contain measures that are aggregated

Relationship – used to tie facts to dimensions

Keys – used to establish relationships

Choosing for data visualization


▪ Tables and text
o Tables – when numerous related values must be displayed
o Individual text in cards – when to show important figures or metrics

▪ Bar and column charts


o Good way to visually compare numeric values for discrete categories

▪ Line charts
o Used to compare categorized values
o When need to examine trends, often over time

▪ Pie charts
o To visually compare categorized values as proportions of a total

▪ Scatter plots
o To compare two numeric measures and identify a relationship or correlation
between them

▪ Maps
o To visually compare values for different geographic areas or locations
Azure Synapse Analytics

Linked Services
- to integrate commonly used data stores, processing platforms, and visualization tools

Synapse Analytics workspace


- an instance of the Synapse Analytics service

Two types of SQL pools


- Serverless pool (build-in) – to query file-based data in a data lake
- Dedicated SQL pool (custom) – host relational data warehouses

Data Explorer
- uses Kusto Query Language (KQL)

Azure Synapse Link


- near-real-time synchronization between operational data in Azure Cosmos DB, Azure
SQL Database, SQL Server, and Microsoft Power Platform Dataverse, and analytical
data storage

Resources
▪ Microsoft Certified: Azure Data Fundamentals
▪ Study guide for Exam DP-900: Microsoft Azure Data Fundamentals

You might also like