4-Data Processing Pipelines in Science and Business

Data processing pipelines involve connecting data processing elements in series to move data from one system to another. Common types of pipelines include instruction pipelines in CPUs, graphics pipelines in GPUs, software pipelines connecting computing processes, and HTTP pipelining for web requests. While ETL systems specifically focus on extracting, transforming and loading data in batches, data pipelines more broadly encompass real-time data movement and transformation between any systems, and can power analytics, data integration and ingestion tools. Effective data pipelines require technologies to capture, queue, store and manage data movement between systems.

Data Management

Data Processing Pipelines in science and business


Concepts
Data Processing Pipeline
In computing, a pipeline, also known as a data pipeline,[1] is a set of data processing elements connected in
series, where the output of one element is the input of the next one. The elements of a pipeline are often
executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between
elements.
Computer-related pipelines include:
• Instruction pipelines, such as the classic RISC pipeline, which are used in central processing units (CPUs) and other microprocessors to allow
overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages and each stage processes a
specific part of one instruction at a time, passing the partial results to the next stage. Examples of stages are instruction decode, arithmetic/logic
and register fetch. They are related to the technologies of superscalar execution, operand forwarding, speculative execution and out-of-order
execution.
• Graphics pipelines, found in most graphics processing units (GPUs), which consist of multiple arithmetic units, or complete CPUs, that
implement the various stages of common rendering operations (perspective projection, window clipping, color and light calculation, rendering,
etc.).
• Software pipelines, which consist of a sequence of computing processes (commands, program runs, tasks, threads, procedures, etc.), conceptually executed in parallel, with the output stream of one process being automatically fed as the input stream of the next one. The Unix system call pipe is a classic example of this concept (a minimal sketch follows this list).
• HTTP pipelining, the technique of issuing multiple HTTP requests through the same TCP connection, without waiting for the previous one to
finish before issuing a new one.
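The software-pipeline idea above can be made concrete with a minimal sketch in Python (the log file name and stage functions are invented for illustration): each stage consumes the previous stage's output stream, much like cat access.log | grep ERROR | wc -l in a Unix shell.

# A minimal software pipeline: each stage consumes the previous stage's
# output stream and yields its own, in the spirit of a Unix pipe.

def read_lines(path):
    """Source stage: stream lines from a file, one at a time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def keep_errors(lines):
    """Filter stage: pass through only lines containing 'ERROR'."""
    return (line for line in lines if "ERROR" in line)

def count(items):
    """Sink stage: consume the stream and return how many items arrived."""
    return sum(1 for _ in items)

if __name__ == "__main__":
    # Wire the stages in series; data flows lazily from one to the next.
    print(count(keep_errors(read_lines("access.log"))))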
Data Pipeline and ETL?
We may commonly hear the terms ETL and data pipeline used interchangeably. ETL stands for Extract, Transform,
and Load. ETL systems extract data from one system, transform the data and load the data into a database or data
warehouse. Legacy ETL pipelines typically run in batches, meaning that the data is moved in one large chunk at a
specific time to the target system. Typically, this occurs at regularly scheduled intervals.

By contrast, "data pipeline" is a broader term that encompasses ETL as a subset. It refers to a system for moving
data from one system to another. The data may or may not be transformed, and it may be processed in real-time
(or streaming) instead of batches. When the data is streamed, it is processed in a continuous flow which is useful
for data that needs constant updating, such as data from a sensor monitoring traffic. In addition, the data may
not be loaded into a database or data warehouse. It might be loaded to any number of targets, such as an AWS
S3 bucket or a data lake, or it might even trigger a webhook on another system to kick off a specific business
process.
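As a rough illustration of the batch-ETL flavor described above (a sketch, not taken from the slides; the file name, table name, and column names are invented), a nightly job might look like this. A streaming pipeline would instead process each record as it arrives, for example from a message bus, rather than waiting for the nightly file.

import csv
import sqlite3

def run_nightly_etl(source_csv="orders.csv", target_db="warehouse.db"):
    """Extract rows from a CSV export, transform them, and load them in one batch."""
    # Extract: read the whole export produced by the source system.
    with open(source_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalise one field and derive another.
    for row in rows:
        row["country"] = row["country"].strip().upper()
        row["total"] = float(row["quantity"]) * float(row["unit_price"])

    # Load: write the whole batch into the warehouse table.
    con = sqlite3.connect(target_db)
    con.execute("CREATE TABLE IF NOT EXISTS orders (country TEXT, total REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [(r["country"], r["total"]) for r in rows])
    con.commit()
    con.close()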

There are many moving parts in a Machine Learning (ML) model that have to be tied together for
an ML model to execute and produce results successfully. This process of tying together different pieces
of the ML process is known as a pipeline. A pipeline is a generalized but very important concept for a
Data Scientist.
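For the ML case, here is a minimal sketch using scikit-learn's Pipeline (assuming scikit-learn is available; the dataset and the two steps are chosen only for illustration): preprocessing and the model are tied together so they run, and are reused, as one unit.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain the preprocessing step and the model into a single object.
pipe = Pipeline([
    ("scale", StandardScaler()),      # step 1: normalise features
    ("model", LogisticRegression()),  # step 2: fit / predict
])

X, y = load_iris(return_X_y=True)
pipe.fit(X, y)              # each step is fitted in order
print(pipe.predict(X[:5]))  # the same chain is reused at prediction time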
Who Needs a Data Pipeline?
While a data pipeline is not a necessity for every business, this technology is especially helpful for those that:
• Generate, rely on, or store large amounts or multiple sources of data.
• Maintain siloed data sources.
• Require real-time or highly sophisticated data analysis.
• Store data in the cloud.

What you can do with Data Pipeline


Here are a few things you can do with Data Pipeline.
1. Convert incoming data to a common format (see the sketch after this list).
2. Prepare data for analysis and visualization.
3. Migrate between databases.
4. Share data processing logic across web apps, batch jobs, and APIs.
5. Power your data ingestion and integration tools.
6. Consume large XML, CSV, and fixed-width files.
7. Replace batch jobs with real-time data.
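As a sketch of items 1 and 6 above (file names and field layout are invented for illustration), a large CSV file can be converted into a common JSON-lines format record by record, so the whole file never has to fit in memory.

import csv
import json

def csv_to_jsonl(src="readings.csv", dst="readings.jsonl"):
    """Stream a large CSV file into newline-delimited JSON, one record at a time."""
    with open(src, newline="") as fin, open(dst, "w") as fout:
        for record in csv.DictReader(fin):
            # Every record ends up in the same common format, whatever the source was.
            fout.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    csv_to_jsonl()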
Why Does One Need a Data Pipeline?
• Analytics is computationally taxing. If you use the same systems for analysis that you use for capturing your data, you risk impairing both the performance of your service (at the capture end) and slowing down your analysis.
• Data from multiple systems or services sometimes needs to be combined in ways that make sense for analysis. For example, you might have one system that captures events, and another that stores user data or files. Having a separate system to govern your analytics means you can combine these data types without impacting or degrading performance.
• You may not want analysts to have access to production systems, or conversely, you may not want production engineers to have access to all analytics data.
• If you need to change the way you store your data, or what you store, it's a lot less risky to make those changes on a separate system while letting the systems that back your services continue on as before.
How Does a Data Pipeline Work?
Moving data between systems can require many steps: from copying data, to moving it from an on-premise
location into the cloud, to reformatting it or joining it with other data sources. Each of these steps needs to be
done, and usually requires separate software.

A good data pipeline functions like the plumbing it is named after: quietly, reliably, and in the background.
But, like plumbing, you’ll want on-site or on-call professionals who can perform repairs in the event of a leak.

In terms of technology, the stages of your data pipeline might use one or more of the following:
Event frameworks help you capture events from your applications more easily, creating an event log that can then be
processed for use.
A message bus is hardware or software that ensures that data sent between clusters of machines is properly queued and received. A message bus allows systems to immediately send (or receive) data to (or from) other systems without needing to wait for acknowledgment, and without needing to worry about errors or system inaccessibility. Properly implemented, a message bus also makes it easier for different systems to communicate using their own protocols (see the sketch after these descriptions).
Data persistence stores your data in files or other non-volatile storage so that it can be processed in batches rather than all at once.
Workflow management structures the tasks (or processes) in your data pipeline, and makes it easier to supervise and
manage them.
Serialization frameworks convert data into more compact formats for storage and transmission.
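To illustrate the decoupling a message bus provides, here is a minimal in-process stand-in using Python's queue module (a real deployment would use a broker such as Kafka or RabbitMQ; the topic name and payloads are invented). The producer publishes events without waiting for the consumer, and the consumer drains them at its own pace.

import queue
import threading

bus = queue.Queue()  # stands in for a durable message bus / topic

def producer():
    # The producer sends events without waiting for the consumer to be ready.
    for i in range(5):
        bus.put({"event_id": i, "type": "page_view"})
    bus.put(None)  # sentinel: no more events

def consumer():
    # The consumer processes messages whenever it gets to them.
    while True:
        msg = bus.get()
        if msg is None:
            break
        print("processed", msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()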
Data Pipeline Technologies
The best tool depends on the step of the pipeline, the data, and the associated technologies. For example, streaming
data might require a different tool than a relational database. Working in a data center might involve different tools
than working in the cloud.
Some examples of products used in building data pipelines:
• Data warehouses
• ETL tools
• Data Prep tools
• Luigi: a workflow scheduler that can be used to manage jobs and processes in Hadoop and similar systems (see the sketch after this list).
• Python / Java / Ruby: programming languages used to write
processes in many of these systems.
• AWS Data Pipeline: another workflow management service that schedules and executes data movement and processing.
• Kafka: a real-time streaming platform that lets you move data between systems and applications and can also transform or react to these data streams.
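Luigi is listed above; here is a rough sketch of how it chains two tasks (task class names, file names, and data are invented; this is not a complete production job). Luigi resolves the dependency graph and runs only those tasks whose outputs do not exist yet.

import luigi

class ExtractOrders(luigi.Task):
    """First step: dump raw data to a local file."""
    def output(self):
        return luigi.LocalTarget("raw_orders.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,amount\n1,10.5\n2,3.0\n")

class SummarizeOrders(luigi.Task):
    """Second step: depends on the extract and writes a summary."""
    def requires(self):
        return ExtractOrders()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            total = sum(float(line.split(",")[1]) for line in fin.readlines()[1:])
            fout.write(f"total={total}\n")

if __name__ == "__main__":
    luigi.build([SummarizeOrders()], local_scheduler=True)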
Types of Data Pipeline Solutions
Data pipeline solutions are commonly grouped by how they move and process data: batch pipelines that run on a schedule, real-time (streaming) pipelines that process records as they arrive, cloud-native managed services, and open-source tools that you operate yourself.
Common Data API
Data Pipeline provides you with a single API for working with data. The API treats all data the same
regardless of its source, target, format, or structure.
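The idea of one API over many sources can be sketched roughly as follows (a hypothetical interface written only for illustration; it is not the actual Data Pipeline product API): whatever the underlying format, downstream code always sees plain records.

import csv
import json
from typing import Dict, Iterator

def read_records(path: str) -> Iterator[Dict]:
    """Yield every record as a plain dict, whatever the underlying format is."""
    if path.endswith(".csv"):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)
    elif path.endswith(".jsonl"):
        with open(path) as f:
            for line in f:
                yield json.loads(line)
    else:
        raise ValueError(f"unsupported format: {path}")

# Downstream code treats all data the same, regardless of source or format:
# for record in read_records("events.jsonl"):
#     process(record)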
Data Processing Pipeline
Pipeline Data Model (PODS – Pipeline Open Data Standard)
The PODS Pipeline Data Model provides the database
architecture pipeline operators use to store critical
information, analyze data about their pipeline systems,
and manage this data geospatially in a linear-
referenced database which can then be visualized in
any GIS platform. The PODS Pipeline Data Model
houses the attribute, asset information, construction,
inspection, integrity management, regulatory
compliance, risk analysis, history, and operational data
that pipeline companies have deemed mission-critical
to the successful management of natural gas and
hazardous liquids pipelines.

The Pipeline Open Data Standard (PODS) data model is an industry standard, used by pipeline operators to
provide a “single master source of information,” and to eliminate “localized silos of information that are
often unconnected.” As the US and other parts of the world increase their focus on pipeline integrity
management (PIM), the importance of data integration should not be overlooked. Referencing and
integrating data within a spatial context can help to provide pipeline operators with a ‘definitive view’ of
their pipeline assets. The PODS Data Model provides guidelines for the tedious process of reconciling
“as-built” data with Operational and Inspection data in one single source.
A Science Pipeline to New Planet Discoveries

NASA's ongoing search for life in the universe produces a lot of data. The agency's new planet-hunting mission, the Transiting
Exoplanet Survey Satellite, or TESS, will collect 27 gigabytes per
day in its all-sky search for undiscovered planets orbiting 200,000 of
the brightest and closest stars in our solar neighborhood. That’s the
equivalent of about 6,500 song files beaming down to Earth every
two weeks. The music of the stars, however, is not as polished for
human ears as the latest Taylor Swift album. To get ready for
scientific discovery, the data needs a bit of fine tuning.
Weather Forecast Pipelines
What’s your definition of a data pipeline?
Data Pipeline – An arbitrarily complex chain of processes that manipulate data, where the output data of one
process becomes the input to the next.
ETL is just one of many types of data pipelines — but that also depends on how you define ETL

The ETL process, in my opinion, carries the baggage of the old-school relational ETL tools. It was and is simply a
process that picks up data from one system, transforms it and loads it elsewhere. When I hear the term ETL
process, two things ring a bell in my mind – “batch” and “usually periodic”.
When I hear the term data pipeline, I think of something much broader – something that takes data from one
system to another, potentially including transformation along the way. However, this includes both newer streaming-style processing and older ETL processes. So, to me, a data pipeline is a more generic, encompassing term that includes real-time transformation. One point I would note is that a data pipeline doesn't have to have a transform. A
replication system (like LinkedIn’s Gobblin) still sets up data pipelines. So, while an ETL process almost always
has a transformation focus, data pipelines don’t need to have transformations.
What’s your definition of a data pipeline?
This term is overloaded. For example, the Spark project uses it very specifically for ML pipelines, although some of the characteristics are similar.
I consider a pipeline to have these characteristics:
1. One or more data inputs
2. One or more data outputs
3. Optional filtering
4. Optional transformation, including schema changes (adding or removing fields) and format conversion
5. Optional aggregation, including group-bys, joins, and statistics
6. Robustness features:
   1. Resiliency against failure.
   2. When any part of the pipeline fails, automated recovery attempts to repair the issue.
   3. When an interrupted pipeline resumes normal operation, it tries to pick up where it left off, subject to these requirements:
      1. If at-least-once delivery is required, the pipeline ensures that each record is processed at least once, involving some sort of acknowledgement.
      2. If at-most-once delivery is required, the pipeline can start after the last record that it read at the beginning of the pipeline.
      3. If exactly-once (effectively-once?) delivery is required, the pipeline combines at-least-once delivery with deduplication so that each result is output once and only once, subject to the fact that it is impossible to make this guarantee for all possible scenarios (see the sketch below).
   4. Management and monitoring hooks make issues, as well as normal operational characteristics such as performance criteria, visible.
I wouldn't necessarily add latency criteria to the basic definition. Sometimes a pipeline is simply "watch this directory and process each file that shows up."
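As a sketch of the exactly-once point above: with at-least-once delivery the same record can be redelivered after a failure, so the consumer deduplicates on a key it has already seen (the record shape and key are invented for illustration; in production the seen-keys set would live in durable storage).

processed_ids = set()  # in production: a durable store, not in-memory state

def handle(record):
    """Emit each logical record exactly once even if the bus redelivers it."""
    key = record["record_id"]
    if key in processed_ids:
        return  # duplicate redelivery: skip, the output was already emitted
    processed_ids.add(key)
    print("emit", record["value"])

# At-least-once delivery may replay record 1 after a failure.
for record in [{"record_id": 1, "value": "a"},
               {"record_id": 2, "value": "b"},
               {"record_id": 1, "value": "a"}]:
    handle(record)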

Real ETL jobs are pipelines, because they must satisfy these criteria. Depending on how broadly you define ETL, all pipelines could be considered ETL jobs.
