4 - Data Processing Pipelines in Science and Business
In contrast to ETL, "data pipeline" is a broader term that encompasses ETL as a subset. It refers to a system for moving
data from one system to another. The data may or may not be transformed, and it may be processed in real time
(streaming) instead of in batches. When the data is streamed, it is processed in a continuous flow, which is useful
for data that needs constant updating, such as data from a sensor monitoring traffic. In addition, the data may
not be loaded to a database or data warehouse. It might be loaded to any number of targets, such as an AWS
bucket or a data lake, or it might even trigger a webhook on another system to kick off a specific business
process.
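To make this concrete, here is a minimal Python sketch of such a pipeline: it extracts records from a source file, applies an optional transform, loads the result to a target file (standing in for a bucket or data lake), and then calls a webhook to kick off a downstream process. The file paths and webhook URL are hypothetical placeholders, and the third-party requests library is assumed to be available.

import json
import requests  # third-party HTTP client, assumed installed (`pip install requests`)

SOURCE_PATH = "events.jsonl"         # hypothetical source: one JSON record per line
TARGET_PATH = "events_clean.jsonl"   # hypothetical target, standing in for a bucket or data lake
WEBHOOK_URL = "https://example.com/hooks/pipeline-done"   # hypothetical webhook endpoint

def extract(path):
    """Yield one parsed record per line of a JSON-lines source file."""
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)

def transform(record):
    """Optional transform step: normalize one field; some pipelines skip this entirely."""
    record["sensor_id"] = str(record.get("sensor_id", "")).strip().lower()
    return record

def load(records, path):
    """Write records to the target, one JSON object per line, and return the count."""
    count = 0
    with open(path, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    n = load((transform(r) for r in extract(SOURCE_PATH)), TARGET_PATH)
    # Notify another system that this batch is done, kicking off its own process.
    requests.post(WEBHOOK_URL, json={"records_loaded": n}, timeout=10)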
There are many moving parts in a Machine Learning (ML) workflow that have to be tied together for
the model to execute and produce results successfully. This process of tying together the different pieces
of the ML process is known as a pipeline. A pipeline is a generalized but very important concept for a
Data Scientist.
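As a small illustration, here is a sketch of an ML pipeline using scikit-learn (assumed to be available); it ties a preprocessing step and a model together so they can be fit and applied as a single unit. The data is a synthetic toy set.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data, just to have something to fit.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline ties preprocessing and the model together into one object.
pipeline = Pipeline([
    ("scale", StandardScaler()),      # step 1: standardize features
    ("model", LogisticRegression()),  # step 2: fit a classifier
])

pipeline.fit(X, y)              # each step is fit/applied in order
print(pipeline.predict(X[:3]))  # the same chain runs at prediction time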
Who Needs a Data Pipeline?
While a data pipeline is not a necessity for every business, this technology is especially helpful for those that:
• Generate, rely on, or store large amounts of data, or data from multiple sources.
• Maintain siloed data sources.
• Require real-time or highly sophisticated data analysis.
• Store data in the cloud.
A good data pipeline functions like the plumbing it is named after: quietly, reliably, and in the background.
But, like plumbing, you’ll want on-site or on-call professionals who can perform repairs in the event of a leak.
In terms of technology, the stages of your data pipeline might use one or more of the following:
Event frameworks help you capture events from your applications more easily, creating an event log that can then be
processed for use.
Message bus is hardware or software that ensures that data sent between clusters of machines is properly queued
and received. A message bus allows systems to immediately send (or receive) data to (or from) other systems without
needing to wait for acknowledgment, and without needing to worry about errors or system inaccessibility. Properly
implemented, a message bus also makes it easier for different systems to communicate using their own protocols.
Data persistence stores your data in files or other non-volatile storage so that it can be processed in batches later,
rather than having to be handled all at once as it arrives.
Workflow management structures the tasks (or processes) in your data pipeline, and makes it easier to supervise and
manage them.
Serialization frameworks convert data into more compact formats for storage and transmission.
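The short Python sketch below ties a few of these pieces together using only the standard library: events are serialized to JSON, passed through an in-process queue standing in for a message bus, and appended to a log file for later batch processing. A production pipeline would use a real broker (for example Kafka or RabbitMQ) and a dedicated serialization framework rather than these stand-ins; the file name here is hypothetical.

import json
import queue
import threading

bus = queue.Queue()              # in-process stand-in for a message bus
LOG_PATH = "event_log.jsonl"     # hypothetical event-log file (data persistence)

def producer():
    """Capture application events and publish them without waiting on the consumer."""
    for i in range(3):
        event = {"event_id": i, "type": "page_view"}
        bus.put(json.dumps(event))   # serialization: dict -> JSON string
    bus.put(None)                    # sentinel: no more events

def consumer():
    """Drain the bus and append each event to a log so it can be batch-processed later."""
    with open(LOG_PATH, "a") as log:
        while True:
            msg = bus.get()
            if msg is None:
                break
            log.write(msg + "\n")

worker = threading.Thread(target=consumer)
worker.start()
producer()
worker.join()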
Data Pipeline Technologies
The best tool depends on the step of the pipeline, the data, and the associated technologies. For example, streaming
data might require a different tool than a relational database. Working in a data center might involve different tools
than working in the cloud.
Some examples of products used in building data pipelines:
• Data warehouses
• ETL tools
• Data prep tools
• Luigi: a workflow scheduler that can be used to manage jobs and processes in Hadoop and similar systems (see the sketch after this list).
• Python / Java / Ruby: programming languages used to write processes in many of these systems.
• AWS Data Pipeline: another workflow management service that schedules and executes data movement and processing.
• Kafka: a real-time streaming platform that moves data between systems and applications and can also transform or react to these data streams.
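As noted in the Luigi item above, here is a minimal Luigi sketch (Luigi assumed to be installed) showing how a workflow scheduler chains tasks: the summary task requires the extraction task, and Luigi runs each task only if its output target does not already exist. The file names are made up for illustration.

import luigi

class ExtractRawData(luigi.Task):
    """First stage: produce a raw data file."""
    def output(self):
        return luigi.LocalTarget("raw.txt")

    def run(self):
        with self.output().open("w") as out:
            out.write("42\n7\n13\n")

class SummarizeData(luigi.Task):
    """Second stage: depends on the first and writes a summary."""
    def requires(self):
        return ExtractRawData()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as inp:
            total = sum(int(line) for line in inp)
        with self.output().open("w") as out:
            out.write(f"total={total}\n")

if __name__ == "__main__":
    # Run the whole dependency graph with Luigi's local (in-process) scheduler.
    luigi.build([SummarizeData()], local_scheduler=True)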
Common Data API
Data Pipeline provides you with a single API for working with data. The API treats all data the same
regardless of source, target, format, or structure.
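The sketch below illustrates the general idea of a common data API in Python rather than any particular vendor's library: every source is read into the same record shape, so downstream code does not care whether the data originally came from CSV, JSON lines, or somewhere else. The reader and writer functions and file names are hypothetical illustrations.

import csv
import json
from typing import Any, Dict, Iterator

Record = Dict[str, Any]   # one uniform record shape for every source

def read_csv(path: str) -> Iterator[Record]:
    """Read a CSV file as a stream of uniform records."""
    with open(path, newline="") as fh:
        yield from csv.DictReader(fh)

def read_jsonl(path: str) -> Iterator[Record]:
    """Read a JSON-lines file as a stream of uniform records."""
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)

def write_jsonl(records: Iterator[Record], path: str) -> None:
    """Write any stream of records to a JSON-lines target."""
    with open(path, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

# Downstream code treats the data the same regardless of its original format:
# write_jsonl(read_csv("customers.csv"), "customers.jsonl")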
Data Processing Pipeline
Pipeline Data Model (PODS – Pipeline Open Data Standard)
The PODS Pipeline Data Model provides the database architecture pipeline operators use to store critical information, analyze data about their pipeline systems, and manage this data geospatially in a linear-referenced database which can then be visualized in any GIS platform. The PODS Pipeline Data Model houses the attribute, asset information, construction, inspection, integrity management, regulatory compliance, risk analysis, history, and operational data that pipeline companies have deemed mission-critical to the successful management of natural gas and hazardous liquids pipelines.
The Pipeline Open Data Standard (PODS) data model is an industry standard used by pipeline operators to
provide a “single master source of information,” and to eliminate “localized silos of information that are
often unconnected.” As the US and other parts of the world increase their focus on pipeline integrity
management (PIM), the importance of data integration should not be overlooked. Referencing and
integrating data within a spatial context can help to provide pipeline operators with a ‘definitive view’ of
their pipeline assets. The PODS Data Model provides guidelines for the tedious process of reconciling
“as-built” data with Operational and Inspection data in one single source.
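To give a feel for the linear-referencing idea at the heart of the model, here is a deliberately simplified Python sketch in which events such as inspections are located by a measure along a route's centerline rather than only by coordinates. The classes and fields are hypothetical and far smaller than the actual PODS schema.

from dataclasses import dataclass
from typing import List

@dataclass
class Centerline:
    route_id: str
    length_m: float          # measured length of the route

@dataclass
class LinearEvent:
    route_id: str
    begin_measure_m: float   # start station along the centerline
    end_measure_m: float     # end station along the centerline
    event_type: str          # e.g. "inspection", "coating_repair"

def events_on_segment(events: List[LinearEvent], route_id: str,
                      begin_m: float, end_m: float) -> List[LinearEvent]:
    """Return events whose measure range overlaps the requested segment of the route."""
    return [e for e in events
            if e.route_id == route_id
            and e.begin_measure_m < end_m
            and e.end_measure_m > begin_m]

line = Centerline(route_id="RT-001", length_m=12500.0)
events = [LinearEvent("RT-001", 100.0, 250.0, "inspection"),
          LinearEvent("RT-001", 9000.0, 9050.0, "coating_repair")]
print(events_on_segment(events, "RT-001", 0.0, 500.0))   # only the inspection overlaps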
ETL Process vs. Data Pipeline
ETL process, in my opinion, carries the baggage of the old-school relational ETL tools. It was and is simply a
process that picks up data from one system, transforms it, and loads it elsewhere. When I hear the term ETL
process, two things ring a bell in my mind – "batch" and "usually periodic".
When I hear the term data pipeline, I think of something much broader – something that takes data from one
system to another, potentially including transformation along the way. This covers both newer streaming-style
processing and older ETL processes. So, to me, data pipeline is a more generic, encompassing term that
includes real-time transformation. One point I would note is that a data pipeline doesn't have to include a transform:
a replication system (like LinkedIn's Gobblin) still sets up data pipelines. So, while an ETL process almost always
has a transformation focus, data pipelines don't need to have transformations.
What’s your definition of a data pipeline?
This term is overloaded. For example, the Spark project uses it very specifically for ML pipelines, although some of the characteristics are similar.
I'd expect a pipeline to have these characteristics:
1. 1 or more data inputs
2. 1 or more data outputs
3. optional filtering
4. optional transformation, including schema changes (adding or removing fields) and transforming the format
5. optional aggregation, including group-bys, joins, and statistics
6. robustness features:
   1. resiliency against failure
   2. when any part of the pipeline fails, automated recovery attempts to repair the issue
   3. when an interrupted pipeline resumes normal operation, it tries to pick up where it left off, subject to these requirements:
      1. If at-least-once delivery is required, the pipeline ensures that each record is processed at least once, which involves some sort of acknowledgement.
      2. If at-most-once delivery is required, the pipeline can restart after the last record it read at the beginning of the pipeline.
      3. If exactly-once (effectively-once?) delivery is required, the pipeline combines at-least-once delivery with deduplication so that each result is output once and only once (subject to the fact that it's impossible to make this guarantee in all possible scenarios); see the sketch after this list.
   4. management and monitoring hooks that expose issues, as well as normal operational characteristics such as performance metrics.
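The toy Python sketch below illustrates the "effectively once" point: the upstream stage may redeliver a record after a failure (at-least-once delivery), so the consumer keeps a set of already-processed record IDs and skips duplicates before emitting output. In a real pipeline this deduplication state would be persisted durably; the records here are made up.

incoming = [
    {"id": "r1", "value": 10},
    {"id": "r2", "value": 20},
    {"id": "r1", "value": 10},   # redelivered after a simulated failure and retry
]

processed_ids = set()   # deduplication state; a real pipeline would persist this durably
output = []

for record in incoming:
    if record["id"] in processed_ids:
        continue                            # duplicate: result was already emitted once
    output.append(record["value"] * 2)      # the actual processing step
    processed_ids.add(record["id"])

print(output)   # [20, 40] even though r1 was delivered twice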
I wouldn't necessarily add latency criteria to the basic definition. Sometimes a pipeline is just "watch this directory and process each file that shows up."
Real ETL jobs are pipelines, because they must satisfy these criteria. Depending on how broadly you define ETL, all pipelines could be considered ETL jobs.