Azure Data Factory v2
Passcamp 2017
Stefan Kirner
Agenda
1. Target Scenarios
2. Current State
3. Intro Data Factory v2
4. Triggers
5. Control Flow
6. SSIS in ADFv2
7. Pricing
8. Roadmap & Q+A
Stefan Kirner
Team Lead Business Intelligence Solutions
at inovex GmbH
› 15+ years of experience with the Microsoft Business Intelligence toolset
› Microsoft Certified Solutions Expert (MCSE) Data Management & Analytics
› Microsoft Certified Solutions Associate (MCSA) Cloud Platform
› Microsoft P-TSP Data Platform
› Head of the SQL PASS e.V. community group Karlsruhe
› Speaker at conferences and user groups on BI and cloud topics
› Mail: stefan.kirner@inovex.de
Twitter: @KirnerKa
Extract & Load Prepare Transform/Analyze Extract & Load
“Data Lake”
New capabilities for data integration in the cloud, Mike Flasko at Ignite 2017, https://myignite.microsoft.com
Part of Cortana Analytics Suite
Azure Data Factory (ADF)
Provides orchestration, data movement and monitoring services
Orchestration model: time series processing
Hybrid Data movement as a Service w/ many connectors
Programmatic authoring, visual monitoring (.NET, PowerShell)
Data Factory v1 in Azure Portal
New Pipeline Model
Rich pipeline orchestration
Triggers – on-demand, schedule, event
Data Factory Essentials
Artefacts in Data Factory
V1 vs. V2 datasets:
• The external property is not supported in v2; it is replaced by a trigger.
• The policy and availability properties are not supported in v2; the start time for a pipeline depends on triggers.
• Scoped datasets (datasets defined inside a pipeline) are not supported in v2.
(Diagram: an ADFv2 pipeline containing an SSIS package.)
ADFv2 Pipelines
(Diagram: wall-clock, event and on-demand triggers start "My Pipeline 1", which runs Activity 1 and Activity 2 and passes params to "My Pipeline 2", which runs Activity 3 and Activity 4 inside a For Each loop.)
ADFv2 Pipelines
(Same diagram as the previous slide, with Activity 2 replaced by a Data Flow.)
Expressions & Parameters
Getting dynamic using inline expressions
1. Rich new set of custom inline syntax in JSON
2. Parameters as first-class citizens in the service to support expressions factory-wide
3. The @ symbol starts an expression, e.g. "name": "@parameters('password')"
4. Different types of functions available:
String (substring, …), Collection (union, …), Logical (less than, …), Conversion (array, …), Math (add, …), Date (addminutes, …)
e.g. replace('the old string', 'old', 'new')
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions
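A few of the expression functions named above can be sketched in Python to show their semantics; the function bodies are illustrative re-implementations, not the ADF service code.

```python
# Illustrative Python re-implementations of three ADFv2 expression functions,
# to show their semantics; the real evaluation happens inside the service.

def adf_replace(text, old, new):
    # ADF: replace('the old string', 'old', 'new') -> 'the new string'
    return text.replace(old, new)

def adf_substring(text, start_index, length):
    # ADF: substring('hello', 1, 3) -> 'ell' (zero-based start index)
    return text[start_index:start_index + length]

def adf_union(collection1, collection2):
    # ADF: union(...) returns a collection with all distinct items
    result = list(collection1)
    for item in collection2:
        if item not in result:
            result.append(item)
    return result

print(adf_replace('the old string', 'old', 'new'))  # the new string
```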
System variables
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
Copy Activity
Simple: visually author or via code (Python, .NET, etc.)
Serverless: no infrastructure to manage
Scalable: per-job elasticity (cloud data movement units)
Staged copy (compress/decompress in hybrid scenarios, SQL DW load using PolyBase, bypass firewall restrictions)
Development
How to develop in ADFv2?
1. .NET: create ADF objects and deploy to ADFv2 using .NET
2. PowerShell: create ADF objects and deploy to ADFv2
3. Edit & PowerShell: create ADF objects by copy and paste, and deploy the JSON artefacts using PowerShell
4. Visual authoring in the Azure Portal (private preview)
5. Visual Studio project (unknown)
6. BIML (private preview)
Hands-on development essentials
PowerShell workflow:
1. Create ADFv2 artefacts locally
2. Deploy the artefacts via PowerShell
3. Run & monitor the ADFv2 pipelines on-demand
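The PowerShell flow above (create artefacts locally, deploy them, run on-demand) can be sketched with the AzureRM.DataFactoryV2 cmdlets; the resource group, factory and file names are placeholders:

```powershell
# Sketch of the PowerShell deployment flow; all names are placeholders.
Login-AzureRmAccount
Set-AzureRmDataFactoryV2 -ResourceGroupName "myRG" -Name "myADFv2" -Location "East US"
Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName "myRG" -DataFactoryName "myADFv2" `
    -Name "AzureStorageLinkedService" -DefinitionFile ".\AzureStorageLinkedService.json"
Set-AzureRmDataFactoryV2Dataset -ResourceGroupName "myRG" -DataFactoryName "myADFv2" `
    -Name "InputDataset" -DefinitionFile ".\InputDataset.json"
Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName "myRG" -DataFactoryName "myADFv2" `
    -Name "CopyPipeline" -DefinitionFile ".\CopyPipeline.json"
```

This cannot run without an Azure subscription and login, so treat it as a template rather than a finished script.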
Demo ADFv2
Data Movement: Copy Activity
Demo Data Movement
Copy Activity
An on-demand trigger runs a pipeline containing a single Copy activity that copies data from an input flat file to a SQL table.
Datasets: container + flat file (source), SQL table (sink)
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity
Hands-on HOL1
Data Movement: Copy Activity
Note: update the PowerShell module on the Passcamp machines:
Run ISE in admin mode:
Install-Module AzureRM.DataFactoryV2 -AllowClobber
Confirm that you trust the repository
Triggers
How do pipelines get started?
1. On-demand
2. Wall-clock schedule
3. Tumbling window (aka time slices in v1)
4. Event (not yet available)
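The tumbling-window trigger slices time into contiguous, non-overlapping intervals; a minimal Python sketch of the slicing idea (not ADF code):

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, interval):
    """Enumerate contiguous, non-overlapping windows between start and end,
    like v1 time slices / the v2 tumbling-window trigger."""
    windows = []
    window_start = start
    while window_start + interval <= end:
        windows.append((window_start, window_start + interval))
        window_start += interval
    return windows

# One hour sliced into 15-minute windows -> 4 windows
wins = tumbling_windows(datetime(2017, 11, 9, 0, 0),
                        datetime(2017, 11, 9, 1, 0),
                        timedelta(minutes=15))
print(len(wins))  # 4
```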
Run pipeline on-demand
1. PowerShell:
Invoke-AzureRmDataFactoryV2Pipeline + parameters
2. REST API:
https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/{yourDataFactory}/pipelines/{yourPipeline}/createRun?api-version=2017-03-01-preview
3. .NET:
client.Pipelines.CreateRunWithHttpMessagesAsync(+ parameters)
4. Azure Portal:
Not possible (as of 9.11.2017)
https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
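The PowerShell option can be sketched as follows; the resource group, factory and pipeline names are placeholders:

```powershell
# Start a pipeline run on-demand and fetch its status; names are placeholders.
$runId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName "myRG" `
    -DataFactoryName "myADFv2" -PipelineName "CopyPipeline"
Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName "myRG" `
    -DataFactoryName "myADFv2" -PipelineRunId $runId
```

Invoke-AzureRmDataFactoryV2Pipeline returns the run ID, which the Get cmdlet then uses for monitoring.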
Run pipeline by schedule
New Pipeline Jobs View
New
• Superset of original jobs view
• Adds grouping of jobs by pipelines & recurrences
• Jobs and consumption trends per pipeline
• Quickly identify pipelines and jobs to troubleshoot
• Quickly compare failed jobs with “last known good” instance
• Manage pipeline cost, improve efficiency and predict future cost
How to use
• Create ADF v2 pipelines containing ADLA U-SQL activities
• Pipelines and Recurrences automatically appear in ADLA portal
• Submit and monitor pipeline/recurring jobs using Azure
PowerShell, ADLA SDK and REST APIs
Demo ADFv2
Scheduler Trigger
Demo Scheduler Trigger
Scheduled trigger for the Copy activity
1. Start the pipeline every 2 minutes:
"frequency": "Minute",
"interval": 2,
2. Deploy the trigger via PowerShell
3. Start the trigger via PowerShell to activate it; the default state is stopped
https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
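A schedule trigger for the 2-minute recurrence might look like this; a sketch based on the preview JSON schema, with the pipeline name and start time as placeholders:

```json
{
    "name": "EveryTwoMinutesTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 2,
                "startTime": "2017-11-09T00:00:00Z"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "CopyPipeline"
                },
                "parameters": {}
            }
        ]
    }
}
```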
HOL ADFv2
Scheduler Trigger
HOL Scheduler Trigger
Scheduler trigger for the Copy activity
1. Start the pipeline every day at 8 am, 12 noon and 5 pm until the end of the year
2. Deploy the trigger via PowerShell
3. Start the trigger via PowerShell to activate it; the default state is stopped
4. Check the trigger information via PowerShell
https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
Activities
Known from v1: data transformation activities (each runs on a compute environment). New activities in v2:
Web Activity: call a custom REST endpoint and pass datasets and linked services
Lookup Activity: look up a record / table name / value from any external source, to be referenced by succeeding activities; can be used for incremental loads
Get Metadata Activity: retrieve metadata of any data in Azure Data Factory, e.g. whether another pipeline finished
Until Activity: similar to a do-until looping structure in programming languages
If Condition Activity: do something based on a condition that evaluates to true or false
https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/
Visual authoring experience
(only private preview)
1. Control Flow
2. Data Flow Column Mapping
3. Manage IR Runtimes (later more)
Demos Control Flow
For Each Activity
An on-demand trigger runs a pipeline whose For Each activity iterates over the sink paths and, for each element, executes a Copy activity that copies data from the input folder to that sink path.
Parameters: mySourcePath, mySinkPath [el1, el2, …]
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
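The For Each step might be expressed like this; a sketch in which the inner Execute Pipeline activity and all names are placeholders, and @item() refers to the current element of mySinkPath:

```json
{
    "name": "ForEachSinkPath",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "items": {
            "value": "@pipeline().parameters.mySinkPath",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "RunCopyForOnePath",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "CopyPipeline",
                        "type": "PipelineReference"
                    },
                    "parameters": {
                        "mySinkPath": "@item()"
                    }
                }
            }
        ]
    }
}
```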
Hands-on Control Flow
If Condition Activity
HOL3: Control Flow
If Condition Activity
An on-demand trigger runs a pipeline whose If Condition activity evaluates the parameter routeSelection: if true, a Copy activity copies data from the input folder to the folder outputTrue; if false, a Copy activity copies it to the folder outputFalse.
Parameters: routeSelection, inputPath, outputPathTrue, outputPathFalse
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity
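The If Condition activity for this HOL might be sketched like this; the Execute Pipeline activities and pipeline names are placeholders standing in for the two Copy activities:

```json
{
    "name": "RouteSelectionCondition",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@bool(pipeline().parameters.routeSelection)",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "name": "CopyToOutputTrue",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "CopyTruePipeline", "type": "PipelineReference" }
                }
            }
        ],
        "ifFalseActivities": [
            {
                "name": "CopyToOutputFalse",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "CopyFalsePipeline", "type": "PipelineReference" }
                }
            }
        ]
    }
}
```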
Integration Runtime for SSIS
Managed cloud environment
› Pick the number of nodes & node size
› Resizable
› SQL Standard Edition; Enterprise coming soon
Compatible
› Same SSIS runtime across Windows, Linux and the Azure cloud
Get started
› Hourly pricing (no SQL Server license required)
› Use existing license (coming soon)
Integration runtime
Different capabilities:
1. Data movement
Move data between data stores, with built-in connectors, format conversion, column mapping, and performant, scalable data transfer
2. Activity dispatch
Dispatch and monitor transformation activities (e.g. a stored procedure on SQL Server, Hive on HDInsight, …)
3. SSIS package execution
Execute SSIS packages
Combinations of IR types, networks and capabilities
(Table not reproduced here: each IR type supports different capabilities on public vs. private networks; see the link.)
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
Integration runtimes
1. Azure Integration Runtime
• moves data between cloud data stores
• fully managed
• serverless compute service (PaaS) on Azure
• costs accrue only for the duration of use
• the user can define data movement units
• compute size is auto-scaled for copy jobs
Integration runtimes
2. Self-hosted Integration Runtime
• performs data integration securely in a private network environment without direct line-of-sight from the public cloud environment
• installed on-premises in your environment
• supports the Copy activity between cloud data stores and a data store in a private network
• supports dispatching the transform activities
• works in your corporate network or virtual private network
• only outbound HTTP-based connections to the open internet
• scale out supported
Integration runtimes
3. Azure-SSIS Integration Runtime
• fully managed cluster of Azure VMs for native execution of SSIS packages
• access to on-premises data using a VNet (classic in preview)
• SSIS catalog on Azure SQL DB or SQL Managed Instance
• scale up: set the node size
• scale out: set the number of nodes
• reduce costs by starting/stopping the service
https://docs.microsoft.com/en-us/azure/data-factory/media/concepts-integration-runtime/different-integration-runtimes.png
Determining which IR to use
Scalable Integration Services
How to scale up/out using three settings on the Azure-SSIS IR:
1. Configurable size of the nodes on which SSIS is executed
$AzureSSISNodeSize = "Standard_A4_v2" # minimum size, others available
2. Configurable number of nodes
$AzureSSISNodeNumber = 2 # between 1 and 10 nodes*
3. Configurable maximum parallel executions per node
$AzureSSISMaxParallelExecutionsPerNode = 2 # between 1 and 8*
Execution Methods
1. SSIS packages can be executed via custom code/PowerShell using the SSIS MOM .NET SDK/API
› Microsoft.SqlServer.Management.IntegrationServices.dll is installed into the .NET GAC with a SQL Server/SSMS installation
2. SSIS packages can be executed via T-SQL scripts executing SSISDB sprocs
› Execute the SSISDB sprocs [catalog].[create_execution] + [catalog].[set_execution_parameter_value] + [catalog].[start_execution]
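The T-SQL route can be sketched with the three sprocs named above; the folder, project and package names are placeholders:

```sql
-- Run a deployed package via the SSISDB catalog; names are placeholders.
DECLARE @execution_id BIGINT;
EXEC [SSISDB].[catalog].[create_execution]
    @folder_name = N'MyFolder',
    @project_name = N'MyProject',
    @package_name = N'Package.dtsx',
    @use32bitruntime = 0,
    @execution_id = @execution_id OUTPUT;

-- object_type 50 = execution-level parameter; SYNCHRONIZED = 1 makes
-- start_execution wait until the package finishes.
EXEC [SSISDB].[catalog].[set_execution_parameter_value]
    @execution_id,
    @object_type = 50,
    @parameter_name = N'SYNCHRONIZED',
    @parameter_value = 1;

EXEC [SSISDB].[catalog].[start_execution] @execution_id;
```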
Scheduling Methods
1. SSIS package executions can be directly/explicitly scheduled via the ADFv2 app (work in progress)
› For now, SSIS package executions can be indirectly/implicitly scheduled via the ADFv1/v2 Stored Procedure Activity
SSIS Demo
Setup and manage environment, deploy packages
1. Create SSIS artefacts locally
2. Set up the environment via PowerShell
3. Deploy the packages via SSMS
4. Run & monitor on demand (later via ADFv2 trigger)
Pricing
Different factors for billing
1. Number of activities run
2. Volume of data moved
3. SQL Server Integration Services (SSIS) compute hours
4. Whether a pipeline is active or not
https://azure.microsoft.com/en-gb/pricing/details/data-factory/v2/
Pricing
1. Orchestration:
› Activity runs in the Azure IR: €0.464 per 1,000 runs / €0.422 per 1,000 runs beyond 50,000 runs
› Activity runs in the self-hosted IR: €0.633 per 1,000 runs
2. Data movement volume:
› Azure IR: €0.106 per hour
› Self-hosted IR: €0.043 per hour
› + outbound data transfer charges
https://azure.microsoft.com/en-gb/pricing/details/data-factory/v2/ (15.11.2017, preview prices)
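The orchestration prices above translate into a simple tiered calculation; a Python sketch using the preview prices quoted on this slide (15.11.2017):

```python
def orchestration_cost_eur(runs, azure_ir=True):
    """Orchestration cost in EUR, using the 15.11.2017 preview prices:
    Azure IR 0.464 EUR per 1,000 runs up to 50,000 runs, then 0.422 EUR
    per 1,000 runs; self-hosted IR a flat 0.633 EUR per 1,000 runs."""
    if not azure_ir:
        return runs / 1000 * 0.633
    first_tier = min(runs, 50000)
    second_tier = max(runs - 50000, 0)
    return first_tier / 1000 * 0.464 + second_tier / 1000 * 0.422

print(round(orchestration_cost_eur(60000), 2))  # 27.42
```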
Roadmap
2017
• SDKs (Python, .Net, Powershell)
• New control-flow/data-flow based app model
• Familiar to existing SSIS etc. users
• Serverless, pay per use
• SSIS runtime (in ADFv2)
• For lifting existing on-prem SSIS solutions to the cloud
• “Use existing license” (coming soon)
• Data movement
• More connectors, scale out, highly available self-hosted integration
runtime
• Visual experience (coming soon)
• For control flow, data movement & monitoring
Roadmap
2018
MS Ignite 2017, Deep dive into SSIS 2017 and beyond, BRK3358, https://myignite.microsoft.com
Click on “Connections” and ”Integration Runtimes”
Click on “+ New” and ”Lift-and-Shift…”
Links and further information
1. Microsoft documentation:
https://docs.microsoft.com/en-us/azure/data-factory/
2. PowerShell cmdlets for ADF and ADFv2, v5.0.0:
https://docs.microsoft.com/en-us/powershell/module/azurerm.datafactories/?view=azurermps-5.0.0&viewFallbackFrom=azurermps-5.0.0#data_factories
3. Very good blog article about Azure Data Factory v2:
https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/
4. MS Ignite sessions:
https://myignite.microsoft.com/videos/55421
https://myignite.microsoft.com/sessions/55271?source=sessions
inovex is an IT project house focused on “digital transformation”. inovex has offices in Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg.