See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/379995013

Data Lineage in Modern Data Engineering

Article · February 2024


1 author:

Kshitiz Jain
EPAM Systems

All content following this page was uploaded by Kshitiz Jain on 22 April 2024.



2/6/24, 10:46 PM Data Lineage in Modern Data Engineering - DZone

https://dzone.com/articles/data-lineage-in-modern-data-engineering 1/6
Data Lineage in Modern Data Engineering



Data lineage is a critical aspect of data engineering that often plays a pivotal role in ensuring data
quality, traceability, and compliance.

by Kshitiz Jain (/users/5065819/kshitizj.html) · Feb. 05, 24 · Opinion



Data lineage (https://dzone.com/articles/what-is-data-lineage-and-how-can-it-ensure-data-qu) is the tracking and
visualization of the flow and transformation of data as it moves through various stages of a data pipeline or system.
In simpler terms, it provides a detailed record of the origins, movements, transformations, and destinations of data
within an organization's data infrastructure. This information helps to create a clear and transparent map of how
data is sourced, processed, and utilized across different components of a data ecosystem.
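
To make the idea of a "detailed record" concrete, the following is a minimal sketch of how lineage could be stored as a list of hops between datasets. The `LineageEdge` class, the dataset names, and the transformation descriptions are all hypothetical and chosen only for illustration; real lineage tools keep far richer metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One hop in a pipeline: `source` feeds `target` via `transformation`."""
    source: str
    target: str
    transformation: str


# Hypothetical pipeline: a raw file is loaded, aggregated, then shown in a report.
LINEAGE = [
    LineageEdge("raw_orders.csv", "staging.orders", "load and cast types"),
    LineageEdge("staging.orders", "analytics.daily_revenue", "group by order date"),
    LineageEdge("analytics.daily_revenue", "revenue_dashboard", "read for display"),
]


def describe(edges):
    """Render each hop as a readable line."""
    return [f"{e.source} -> {e.target} ({e.transformation})" for e in edges]


for line in describe(LINEAGE):
    print(line)
```

Each edge records an origin, a destination, and the transformation between them, which is the minimum needed to draw the map described above.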

Data lineage allows developers to comprehend the journey of data from its source to its final destination. This
understanding is crucial for designing, optimizing, and troubleshooting data pipelines. When issues arise in a data
pipeline, having a detailed data lineage enables developers to quickly identify the root cause of problems. It facilitates
efficient debugging and troubleshooting by providing insights into the sequence of transformations and actions
performed on the data. Data lineage helps maintain data quality by enabling developers to trace any anomalies or
discrepancies back to their source. It ensures that data transformations are executed correctly and that any
inconsistencies can be easily traced and rectified.

In industries with regulatory requirements and compliance standards, data lineage is essential for demonstrating
data governance and ensuring compliance. It provides a transparent view of how data is handled, processed, and
reported, supporting regulatory audits and requirements.

By visualizing the complete data flow, developers can identify bottlenecks, inefficiencies, or areas for optimization
within the data pipeline. This insight is crucial for improving the overall performance and efficiency of the data
processing workflow.

Types of Data Lineage


There are generally two types of data lineage, namely forward lineage and backward lineage.

Forward Lineage
Also known as downstream lineage, forward lineage tracks the flow of data from its source to its destination. It
outlines the path that data takes through various stages of processing, transformation, and storage until it reaches
its destination.

It helps developers understand how data is manipulated and transformed, aiding in the design and improvement of
the overall data processing workflow and quickly identifying the point of failure. By tracing the data flow forward,
developers can pinpoint where transformations or errors occurred and address them efficiently. It is essential for
predicting the impact of changes on downstream processes. Before making modifications to the data pipeline
(https://dzone.com/articles/what-is-a-data-pipeline) or underlying data sources, developers can analyze the
forward lineage to assess how these changes will affect downstream applications.
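
The impact analysis described above can be sketched as a breadth-first walk over a lineage graph. This is only an illustration under assumed names: the `DOWNSTREAM` mapping and the dataset names are hypothetical, and a real tool would read this graph from its metadata store.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets built directly from it.
DOWNSTREAM = {
    "raw_orders": ["staging_orders"],
    "staging_orders": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["revenue_dashboard"],
    "customer_ltv": [],
    "revenue_dashboard": [],
}


def forward_lineage(graph, start):
    """Return every dataset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that could be affected by a change to staging_orders.
print(forward_lineage(DOWNSTREAM, "staging_orders"))
# ['daily_revenue', 'customer_ltv', 'revenue_dashboard']
```

The returned list is the set of assets to review before changing `staging_orders`; the `seen` set keeps the walk from revisiting a dataset that is reachable by more than one path.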

Backward Lineage
Also known as upstream lineage, backward lineage traces the path of data from its destination back to its source. It
provides insights into the origins of the data and the various transformations it undergoes before reaching its
current state.

It is crucial for ensuring data quality by allowing developers to trace any issues or discrepancies back to their source.
By understanding the data's journey backward, developers can identify and rectify anomalies at their origin. It also
helps demonstrate data governance by providing a transparent view of how data is sourced, processed, and
reported, supporting regulatory audits and requirements.

Backward lineage is also valuable when planning changes to upstream data sources. Starting from a downstream
process, application, or report, developers can see which sources it depends on and judge whether a planned
modification to the source data will affect it, enabling them to make informed decisions.
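
Tracing a dataset back to its origins can be sketched by inverting the lineage edges and walking them upstream. The `EDGES` list and dataset names below are hypothetical and serve only to illustrate the traversal.

```python
# Hypothetical lineage edges, written as (source, target) pairs.
EDGES = [
    ("crm_export", "staging_customers"),
    ("raw_orders", "staging_orders"),
    ("staging_customers", "customer_ltv"),
    ("staging_orders", "customer_ltv"),
    ("staging_orders", "daily_revenue"),
]


def backward_lineage(edges, target):
    """Return the set of all datasets that `target` depends on, directly or indirectly."""
    # Invert the edges so each dataset points at its direct parents.
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, []).append(source)

    upstream = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in upstream:
                upstream.add(parent)
                stack.append(parent)
    return upstream


# Where could a wrong number in daily_revenue have come from?
print(sorted(backward_lineage(EDGES, "daily_revenue")))
# ['raw_orders', 'staging_orders']
```

When an anomaly shows up in `daily_revenue`, this set is the list of candidates to inspect, and any dataset outside it can be ruled out.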

Implementing Data Lineage


There are several open source and commercial tools available in the market for implementing data lineage. Some of
the common tools are:

Imperva Data Lineage


It provides intuitive visualizations of data flow from source to consumption, records the transformations applied to
data during its journey, and combines data discovery with comprehensive metadata views to help ensure data accuracy
and trustworthiness.

Atlan Data Lineage


It supports automated SQL parsing for various SQL statements
(https://dzone.com/articles/sql-commands-a-brief-guide) (CREATE, MERGE, INSERT, UPDATE) and captures lineage at
the column and field levels. It facilitates collaboration and integrates with other tools.
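
To show what deriving lineage from SQL means in principle, here is a deliberately naive sketch that extracts table-level lineage from one simple statement using regular expressions. It is not how Atlan or any other product works: the patterns and the `table_lineage` helper are hypothetical, the result is table-level rather than column-level, and subqueries, CTEs, and quoted identifiers are not handled. Production tools generally rely on a full SQL parser.

```python
import re

# Deliberately naive patterns: they handle only simple, single-statement SQL.
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+INTO|MERGE\s+INTO|CREATE\s+TABLE|UPDATE)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN|USING)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (target table, sorted source tables) for one simple SQL statement."""
    target_match = TARGET_RE.search(sql)
    target = target_match.group(1) if target_match else None
    sources = sorted({name for name in SOURCE_RE.findall(sql) if name != target})
    return target, sources


sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM staging.orders o
JOIN staging.customers c ON c.id = o.customer_id
GROUP BY o.order_date
"""

print(table_lineage(sql))
# ('analytics.daily_revenue', ['staging.customers', 'staging.orders'])
```

Each parsed statement yields edges from every source table to the target table, which is the raw material for the lineage graph.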

Apache Atlas
It provides a centralized metadata repository for managing metadata and classifying data entities. Users can classify
and tag data entities for better organization and governance. It offers data lineage tracking capabilities to visualize the
flow of data within a Hadoop ecosystem.

Collibra
It provides a comprehensive data catalog that includes a business glossary, data lineage, and metadata management.
Users can visualize data lineage to understand how data moves through the organization.

Challenges and Best Practices


Implementing and managing data lineage is complex, and developers face several challenges in the process. Common
issues include reconciling different data formats and naming conventions across systems, handling large and
complicated data environments, and the lack of lineage tracking or visualization support for some sources or
technologies. The constantly changing nature of data environments and incomplete or incorrect lineage information
add to the difficulty.
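
The naming problem mentioned above can be illustrated with a small sketch that maps differently styled table references onto one key before they are added to a lineage graph. The `normalize_dataset_name` helper, its `default_schema` parameter, and the example names are hypothetical, and the rule assumes that identifiers contain no dots inside quotes.

```python
def normalize_dataset_name(name, default_schema="public"):
    """Map differently styled table references onto one `schema.table` key."""
    # Split on dots, then drop quoting characters and case differences.
    parts = [p.strip().strip('"`[]').lower() for p in name.split(".")]
    if len(parts) == 1:
        # Bare table name: assume the default schema.
        parts = [default_schema] + parts
    # Keep only schema and table, dropping any database or catalog prefix.
    return ".".join(parts[-2:])


names = ["STAGING.Orders", '"staging"."orders"', "warehouse.staging.orders", "orders"]
print([normalize_dataset_name(n) for n in names])
# ['staging.orders', 'staging.orders', 'staging.orders', 'public.orders']
```

Without a step like this, the same table reported by two systems would appear as two unrelated nodes and the lineage would break at that point.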


To overcome these challenges, it's crucial to choose the right tools for data lineage and governance. Setting up and
sticking to clear data governance rules is important to keep things consistent. Moreover, working together with
different groups involved is key to overcoming difficulties caused by ever-changing data setups and ensuring
accurate and thorough data lineage.

Conclusion
In conclusion, data lineage is vital for data engineering, ensuring quality
(https://dzone.com/articles/future-proofing-data-architecture-for-better-data), traceability, and compliance. It
tracks the flow and transformations of
data, aiding developers in pipeline design and troubleshooting. Forward lineage optimizes workflows, while
backward lineage ensures data quality and supports governance. Various tools can assist in data lineage
implementation. Challenges include inconsistent data formats and dynamic environments, addressed by selecting
the right tools and adhering to governance practices through collaboration. In navigating these challenges,
organizations unlock the potential of data lineage, fortifying the reliability of data workflows.


Opinions expressed by DZone contributors are their own.
