4 - Data Processing Pipelines in Science and Business
In contrast to ETL, "data pipeline" is a broader term that encompasses ETL as a subset. It refers to a system for moving
data from one system to another. The data may or may not be transformed, and it may be processed in real time
(streaming) instead of in batches. When the data is streamed, it is processed in a continuous flow, which is useful
for data that needs constant updating, such as data from a sensor monitoring traffic. In addition, the data may
not be loaded to a database or data warehouse. It might be loaded to any number of targets, such as an AWS
bucket or a data lake, or it might even trigger a webhook on another system to kick off a specific business
process.
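To make this concrete, here is a minimal Python sketch of such a pipeline: it extracts records from a source file, applies an optional transform, loads the result to a target file (standing in for a bucket or data lake), and then calls a webhook to kick off a downstream process. The file paths and webhook URL are hypothetical placeholders, and the third-party requests library is assumed to be available.

import json
import requests  # third-party HTTP client, assumed installed (`pip install requests`)

SOURCE_PATH = "events.jsonl"         # hypothetical source: one JSON record per line
TARGET_PATH = "events_clean.jsonl"   # hypothetical target, standing in for a bucket or data lake
WEBHOOK_URL = "https://example.com/hooks/pipeline-done"   # hypothetical webhook endpoint

def extract(path):
    """Yield one parsed record per line of a JSON-lines source file."""
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)

def transform(record):
    """Optional transform step: normalize one field; some pipelines skip this entirely."""
    record["sensor_id"] = str(record.get("sensor_id", "")).strip().lower()
    return record

def load(records, path):
    """Write records to the target, one JSON object per line, and return the count."""
    count = 0
    with open(path, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    n = load((transform(r) for r in extract(SOURCE_PATH)), TARGET_PATH)
    # Notify another system that this batch is done, kicking off its own process.
    requests.post(WEBHOOK_URL, json={"records_loaded": n}, timeout=10)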
There are many moving parts in a Machine Learning (ML) workflow that have to be tied together for
the model to execute and produce results successfully. This process of tying together the different pieces
of the ML process is known as a pipeline. A pipeline is a generalized but very important concept for a
Data Scientist.
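As a small illustration, here is a sketch of an ML pipeline using scikit-learn (assumed to be available); it ties a preprocessing step and a model together so they can be fit and applied as a single unit. The data is a synthetic toy set.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data, just to have something to fit.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline ties preprocessing and the model together into one object.
pipeline = Pipeline([
    ("scale", StandardScaler()),      # step 1: standardize features
    ("model", LogisticRegression()),  # step 2: fit a classifier
])

pipeline.fit(X, y)              # each step is fit/applied in order
print(pipeline.predict(X[:3]))  # the same chain runs at prediction time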
Who Needs a Data Pipeline?
While a data pipeline is not a necessity for every business, this technology is especially helpful for those that:
• Generate, rely on, or store large amounts of data, or data from multiple sources.
• Maintain siloed data sources.
• Require real-time or highly sophisticated data analysis.
• Store data in the cloud.
A good data pipeline functions like the plumbing it is named after: quietly, reliably, and in the background.
But, like plumbing, you’ll want on-site or on-call professionals who can perform repairs in the event of a leak.
In terms of technology, the stages of your data pipeline might use one or more of the following:
Event frameworks help you capture events from your applications more easily, creating an event log that can then be
processed for use.
Message bus is hardware or software that ensures that data sent between clusters of machines is properly queued
and received. A message bus allows systems to immediately send (or receive) data to (or from) other systems without
needing to wait for acknowledgment, and without needing to worry about errors or system inaccessibility. Properly
implemented, a message bus also makes it easier for different systems to communicate using their own protocols.
Data persistence stores your data in files or other non-volatile storage so that it can be processed in batches later,
rather than having to be handled all at once as it arrives.
Workflow management structures the tasks (or processes) in your data pipeline, and makes it easier to supervise and
manage them.
Serialization frameworks convert data into more compact formats for storage and transmission.
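The short Python sketch below ties a few of these pieces together using only the standard library: events are serialized to JSON, passed through an in-process queue standing in for a message bus, and appended to a log file for later batch processing. A production pipeline would use a real broker (for example Kafka or RabbitMQ) and a dedicated serialization framework rather than these stand-ins; the file name here is hypothetical.

import json
import queue
import threading

bus = queue.Queue()              # in-process stand-in for a message bus
LOG_PATH = "event_log.jsonl"     # hypothetical event-log file (data persistence)

def producer():
    """Capture application events and publish them without waiting on the consumer."""
    for i in range(3):
        event = {"event_id": i, "type": "page_view"}
        bus.put(json.dumps(event))   # serialization: dict -> JSON string
    bus.put(None)                    # sentinel: no more events

def consumer():
    """Drain the bus and append each event to a log so it can be batch-processed later."""
    with open(LOG_PATH, "a") as log:
        while True:
            msg = bus.get()
            if msg is None:
                break
            log.write(msg + "\n")

worker = threading.Thread(target=consumer)
worker.start()
producer()
worker.join()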
Data Pipeline Technologies
The best tool depends on the step of the pipeline, the data, and the associated technologies. For example, streaming
data might require a different tool than a relational database. Working in a data center might involve different tools
than working in the cloud.
Some examples of products used in building data pipelines:
• Data warehouses
• ETL tools
• Data prep tools
• Luigi: a workflow scheduler that can be used to manage jobs and processes in Hadoop and similar systems (see the sketch after this list).
• Python / Java / Ruby: programming languages used to write processes in many of these systems.
• AWS Data Pipeline: another workflow management service that schedules and executes data movement and processing.
• Kafka: a real-time streaming platform that moves data between systems and applications and can also transform or react to these data streams.
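As noted in the Luigi item above, here is a minimal Luigi sketch (Luigi assumed to be installed) showing how a workflow scheduler chains tasks: the summary task requires the extraction task, and Luigi runs each task only if its output target does not already exist. The file names are made up for illustration.

import luigi

class ExtractRawData(luigi.Task):
    """First stage: produce a raw data file."""
    def output(self):
        return luigi.LocalTarget("raw.txt")

    def run(self):
        with self.output().open("w") as out:
            out.write("42\n7\n13\n")

class SummarizeData(luigi.Task):
    """Second stage: depends on the first and writes a summary."""
    def requires(self):
        return ExtractRawData()

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as inp:
            total = sum(int(line) for line in inp)
        with self.output().open("w") as out:
            out.write(f"total={total}\n")

if __name__ == "__main__":
    # Run the whole dependency graph with Luigi's local (in-process) scheduler.
    luigi.build([SummarizeData()], local_scheduler=True)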
Common Data API
Data Pipeline provides you with a single API for working with data. The API treats all data the same
regardless of source, target, format, or structure.
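The sketch below illustrates the general idea of a common data API in Python rather than any particular vendor's library: every source is read into the same record shape, so downstream code does not care whether the data originally came from CSV, JSON lines, or somewhere else. The reader and writer functions and file names are hypothetical illustrations.

import csv
import json
from typing import Any, Dict, Iterator

Record = Dict[str, Any]   # one uniform record shape for every source

def read_csv(path: str) -> Iterator[Record]:
    """Read a CSV file as a stream of uniform records."""
    with open(path, newline="") as fh:
        yield from csv.DictReader(fh)

def read_jsonl(path: str) -> Iterator[Record]:
    """Read a JSON-lines file as a stream of uniform records."""
    with open(path) as fh:
        for line in fh:
            yield json.loads(line)

def write_jsonl(records: Iterator[Record], path: str) -> None:
    """Write any stream of records to a JSON-lines target."""
    with open(path, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

# Downstream code treats the data the same regardless of its original format:
# write_jsonl(read_csv("customers.csv"), "customers.jsonl")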
Data Processing Pipeline
Pipeline Data Model (PODS – Pipeline Open Data Standard)
The PODS Pipeline Data Model provides the database architecture pipeline operators use to store critical information, analyze data about their pipeline systems, and manage this data geospatially in a linear-referenced database which can then be visualized in any GIS platform. The PODS Pipeline Data Model houses the attribute, asset information, construction, inspection, integrity management, regulatory compliance, risk analysis, history, and operational data that pipeline companies have deemed mission-critical to the successful management of natural gas and hazardous liquids pipelines.
The Pipeline Open Data Standard (PODS) data model is an industry standard used by pipeline operators to
provide a “single master source of information,” and to eliminate “localized silos of information that are
often unconnected.” As the US and other parts of the world increase their focus on pipeline integrity
management (PIM), the importance of data integration should not be overlooked. Referencing and
integrating data within a spatial context can help to provide pipeline operators with a ‘definitive view’ of
their pipeline assets. The PODS Data Model provides guidelines for the tedious process of reconciling
“as-built” data with Operational and Inspection data in one single source.
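To give a feel for the linear-referencing idea at the heart of the model, here is a deliberately simplified Python sketch in which events such as inspections are located by a measure along a route's centerline rather than only by coordinates. The classes and fields are hypothetical and far smaller than the actual PODS schema.

from dataclasses import dataclass
from typing import List

@dataclass
class Centerline:
    route_id: str
    length_m: float          # measured length of the route

@dataclass
class LinearEvent:
    route_id: str
    begin_measure_m: float   # start station along the centerline
    end_measure_m: float     # end station along the centerline
    event_type: str          # e.g. "inspection", "coating_repair"

def events_on_segment(events: List[LinearEvent], route_id: str,
                      begin_m: float, end_m: float) -> List[LinearEvent]:
    """Return events whose measure range overlaps the requested segment of the route."""
    return [e for e in events
            if e.route_id == route_id
            and e.begin_measure_m < end_m
            and e.end_measure_m > begin_m]

line = Centerline(route_id="RT-001", length_m=12500.0)
events = [LinearEvent("RT-001", 100.0, 250.0, "inspection"),
          LinearEvent("RT-001", 9000.0, 9050.0, "coating_repair")]
print(events_on_segment(events, "RT-001", 0.0, 500.0))   # only the inspection overlaps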
ETL Process vs. Data Pipeline
ETL process, in my opinion, carries the baggage of the old-school relational ETL tools. It was and is simply a
process that picks up data from one system, transforms it, and loads it elsewhere. When I hear the term ETL
process, two things ring a bell in my mind – "batch" and "usually periodic".
When I hear the term data pipeline, I think of something much broader – something that takes data from one
system to another, potentially including transformation along the way. This covers both newer streaming-style
processing and older ETL processes. So, to me, data pipeline is a more generic, encompassing term that
includes real-time transformation. One point I would note is that a data pipeline doesn't have to include a transform:
a replication system (like LinkedIn's Gobblin) still sets up data pipelines. So, while an ETL process almost always
has a transformation focus, data pipelines don't need to have transformations.
What’s your definition of a data pipeline?
This term is overloaded. For example, the Spark project uses it very specifically for ML pipelines, although some of the characteristics are similar.
I'd expect a pipeline to have these characteristics:
1. 1 or more data inputs
2. 1 or more data outputs
3. optional filtering
4. optional transformation, including schema changes (adding or removing fields) and transforming the format
5. optional aggregation, including group-bys, joins, and statistics
6. robustness features:
   1. resiliency against failure
   2. when any part of the pipeline fails, automated recovery attempts to repair the issue
   3. when an interrupted pipeline resumes normal operation, it tries to pick up where it left off, subject to these requirements:
      1. If at-least-once delivery is required, the pipeline ensures that each record is processed at least once, which involves some sort of acknowledgement.
      2. If at-most-once delivery is required, the pipeline can restart after the last record it read at the beginning of the pipeline.
      3. If exactly-once (effectively-once?) delivery is required, the pipeline combines at-least-once delivery with deduplication so that each result is output once and only once (subject to the fact that it's impossible to make this guarantee in all possible scenarios); see the sketch after this list.
   4. management and monitoring hooks that expose issues, as well as normal operational characteristics such as performance metrics.
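The toy Python sketch below illustrates the "effectively once" point: the upstream stage may redeliver a record after a failure (at-least-once delivery), so the consumer keeps a set of already-processed record IDs and skips duplicates before emitting output. In a real pipeline this deduplication state would be persisted durably; the records here are made up.

incoming = [
    {"id": "r1", "value": 10},
    {"id": "r2", "value": 20},
    {"id": "r1", "value": 10},   # redelivered after a simulated failure and retry
]

processed_ids = set()   # deduplication state; a real pipeline would persist this durably
output = []

for record in incoming:
    if record["id"] in processed_ids:
        continue                            # duplicate: result was already emitted once
    output.append(record["value"] * 2)      # the actual processing step
    processed_ids.add(record["id"])

print(output)   # [20, 40] even though r1 was delivered twice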
I wouldn't necessarily add latency criteria to the basic definition. Sometimes a pipeline is just "watch this directory and process each file that shows up."
Real ETL jobs are pipelines, because they must satisfy these criteria. Depending on how broadly you define ETL, all pipelines could be considered ETL jobs.