Data Pipeline Design for Carbon Tracking System
# Data Pipeline (Design Department)
**Goals:** reliable ingestion → clean, unified storage → fast analytics → ML insights →
dashboards & alerts.
## 1) Sources
- **IoT sensors (CO₂/CO₂e, temp, flow, runtime):** MQTT/HTTP → gateway.
- **Electricity (JEPCO):** API/CSV drops (hourly/daily).
- **Raw materials & transport:** operator entry (web app) + supplier CSV.
- **Maintenance logs:** CMMS or in-house form.
## 2) Transport & Ingestion
- **Protocol:** MQTT (sensors) → Kafka (topic fan-out) or MQTT → Telegraf for simplicity.
- **Batch loads:** JEPCO and supplier files loaded via scheduled Airflow jobs.
- **Back-pressure & retries:** Kafka acks / DLQ topic; Airflow retries + alerting.
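
As a concrete illustration of the sensor path and DLQ idea above, here is a minimal bridge sketch. It assumes the paho-mqtt (1.x-style callbacks) and confluent-kafka client libraries; broker addresses and topic names are placeholders, not fixed parts of the design.

```python
# Minimal MQTT -> Kafka bridge sketch (paho-mqtt 1.x-style callbacks, confluent-kafka).
# Broker addresses and topic names are illustrative placeholders.
import json

import paho.mqtt.client as mqtt
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

def on_delivery(err, msg):
    # Messages Kafka could not accept go to a dead-letter topic for inspection.
    if err is not None:
        producer.produce("telemetry.dlq", msg.value())

def on_message(client, userdata, message):
    try:
        payload = json.loads(message.payload)  # reject non-JSON frames early
        producer.produce("telemetry.raw", json.dumps(payload).encode(), callback=on_delivery)
    except json.JSONDecodeError:
        producer.produce("telemetry.dlq", message.payload)
    producer.poll(0)  # serve delivery callbacks without blocking

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-gateway", 1883)
client.subscribe("sensors/#")
client.loop_forever()
```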
## 3) Processing (ELT)
- **Stream (near-real-time):** Kafka → Flink/Spark Structured Streaming (validate, enrich with asset IDs, normalize units).
- **Batch (daily/hourly):** dbt models or Spark jobs (joins, dimension lookups, CO₂e factor
application).
- **Data quality:** Great Expectations (range checks, nulls, schema drift).
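
The row-level checks intended for the quality step, as a plain pandas sketch so it stays tool-agnostic; in the pipeline itself these rules would be expressed as a Great Expectations suite. Column names follow the telemetry schema in section 10; the sensor names and bounds are illustrative.

```python
# Illustrative data-quality checks on a telemetry batch (plain pandas here; the
# pipeline would encode the same rules as a Great Expectations suite).
import pandas as pd

REQUIRED = ["site_id", "asset_id", "sensor_type", "ts_utc", "value", "unit"]
VALUE_RANGES = {"co2_ppm": (0, 10_000), "temp_c": (-40, 200)}  # illustrative bounds

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows failing any check; an empty frame means the batch passes."""
    # Schema drift: every required column must be present.
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"schema drift, missing columns: {missing}")

    bad_rows = [df[df[REQUIRED].isna().any(axis=1)]]  # null checks on keys and values

    # Range checks per sensor type.
    for sensor, (lo, hi) in VALUE_RANGES.items():
        sub = df[df["sensor_type"] == sensor]
        bad_rows.append(sub[(sub["value"] < lo) | (sub["value"] > hi)])

    return pd.concat(bad_rows).drop_duplicates()
```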
## 4) Storage (Tiered)
- **Raw (immutable):** Object store (S3/MinIO) partitioned by source/date; key layout sketched after this list.
- **Operational store:**
- **Time-series:** InfluxDB/TimescaleDB for sensor telemetry.
- **Relational (OLTP):** Postgres for assets, sites, materials, users.
- **Analytics (OLAP):** DuckDB/BigQuery/Snowflake for dashboards & reports.
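
A small sketch of the raw-zone key convention implied by "partitioned by source/date"; the bucket name and layout are illustrative assumptions.

```python
# Sketch of the raw-zone object key convention (partitioned by source/date).
# Bucket name and layout are illustrative, not a fixed contract.
from datetime import datetime, timezone

RAW_BUCKET = "carbon-raw"  # placeholder MinIO/S3 bucket

def raw_key(source: str, received_at: datetime, filename: str) -> str:
    """e.g. raw/jepco/2025/01/31/usage_0001.csv"""
    d = received_at.astimezone(timezone.utc)
    return f"raw/{source}/{d:%Y/%m/%d}/{filename}"

# Example: where an hourly JEPCO CSV drop would land today.
print(raw_key("jepco", datetime.now(timezone.utc), "usage_0001.csv"))
```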
## 5) Models & Business Logic
- **Emission factors:** versioned tables keyed by standard, region, and validity window (see the resolution sketch after this list).
- **Core marts:** fact_emissions_hourly, fact_energy_use, dim_asset, dim_site, fact_materials.
- **KPIs:** tCO₂e by scope/site/line, intensity per unit output, uptime, MTBF, energy per
unit.
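
A sketch of how a versioned factor could be resolved and applied when building fact_emissions_hourly; the factor values, units, and in-memory lookup are illustrative, not real grid factors.

```python
# Sketch: resolve a versioned emission factor by activity and validity window,
# then apply it to an hourly energy reading. Factor values are placeholders.
from dataclasses import dataclass
from datetime import date

@dataclass
class EmissionFactor:
    activity: str
    factor: float        # kgCO2e per kWh in this illustration
    valid_from: date
    valid_to: date
    version: int

FACTORS = [
    EmissionFactor("grid_electricity", 0.45, date(2024, 1, 1), date(2024, 12, 31), 1),
    EmissionFactor("grid_electricity", 0.43, date(2025, 1, 1), date(2025, 12, 31), 2),
]

def resolve_factor(activity: str, on: date) -> EmissionFactor:
    """Pick the factor whose validity window covers the activity date."""
    matches = [f for f in FACTORS
               if f.activity == activity and f.valid_from <= on <= f.valid_to]
    return max(matches, key=lambda f: f.version)  # prefer the latest version

def tco2e(energy_kwh: float, activity: str, on: date) -> float:
    """kWh x (kgCO2e/kWh) / 1000 -> tonnes CO2e."""
    return energy_kwh * resolve_factor(activity, on).factor / 1000.0

print(tco2e(1200.0, "grid_electricity", date(2025, 3, 1)))  # 0.516 tCO2e
```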
## 6) ML/AI
- **Predictive maintenance:** anomaly detection (Z-score/Prophet; sketched after this list) + failure classification (XGBoost).
- **Forecasting:** energy/emissions forecasts (Prophet/ARIMA).
- **Generative design hooks:** write constraints & results back to design_variants.
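
A minimal rolling z-score sketch of the anomaly-detection bullet above; window size and threshold are illustrative, and the production version would run against the Timescale telemetry rather than an in-memory series.

```python
# Rolling z-score anomaly flagging on a sensor/energy series (illustrative parameters).
import pandas as pd

def flag_anomalies(values: pd.Series, window: int = 96, threshold: float = 3.0) -> pd.Series:
    """True where a reading sits more than `threshold` standard deviations away
    from the rolling statistics of the preceding `window` readings."""
    mean = values.shift(1).rolling(window, min_periods=window // 2).mean()
    std = values.shift(1).rolling(window, min_periods=window // 2).std()
    return ((values - mean) / std).abs() > threshold

# Example on an hourly energy series: only the 350 reading exceeds 3 sigma.
energy = pd.Series([100, 102, 98, 101, 99, 350, 100, 97])
print(flag_anomalies(energy, window=4))
```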
## 7) Serving & Visualization
- **Dashboards:** Grafana/Metabase/Power BI over OLAP + Timescale.
- **APIs:** FastAPI layer for apps (read KPIs, write operator inputs).
- **Alerts:** Grafana/Alertmanager → Email/Slack when thresholds or anomalies trigger.
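
A minimal sketch of the read side of that API layer; the route path, response shape, and stubbed query are assumptions rather than a settled contract.

```python
# Minimal FastAPI sketch for serving KPIs. Route, response model, and the backing
# query are illustrative assumptions, not the final API contract.
from datetime import date
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Carbon KPIs")

class EmissionsKpi(BaseModel):
    site_id: str
    day: date
    scope: int
    tco2e: float

@app.get("/kpi/emissions/{site_id}", response_model=list[EmissionsKpi])
def emissions_by_site(site_id: str, start: date, end: date) -> list[EmissionsKpi]:
    # The real service would query the OLAP mart (fact_emissions_hourly rolled up
    # to daily); a stub row keeps the sketch self-contained.
    return [EmissionsKpi(site_id=site_id, day=start, scope=2, tco2e=1.23)]
```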
## 8) Orchestration & Ops
- **Scheduler:** Airflow for batch; Git-versioned DAGs.
- **CI/CD:** GitHub Actions → staging → prod with data smoke tests.
- **Monitoring:** Prometheus + Grafana for pipeline health; logs in Loki/ELK.
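
A sketch of what the Git-versioned daily batch DAG could look like; task bodies, schedule, and retry settings are illustrative placeholders (the `schedule` argument assumes Airflow 2.4+).

```python
# Sketch of the daily batch DAG: load JEPCO/supplier files, run quality checks,
# then rebuild the marts. Task bodies and schedule are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_jepco(**_): ...          # pull the JEPCO CSV/API drop into the raw zone
def run_quality_checks(**_): ...  # Great Expectations suite over the new raw data
def build_marts(**_): ...         # dbt run / Spark job for the emission marts

with DAG(
    dag_id="daily_emissions_batch",
    start_date=datetime(2025, 1, 1),
    schedule="0 2 * * *",  # daily close at 02:00 UTC
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    ingest = PythonOperator(task_id="load_jepco", python_callable=load_jepco)
    checks = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)
    marts = PythonOperator(task_id="build_marts", python_callable=build_marts)

    ingest >> checks >> marts
```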
## 9) Security & Governance
- **AuthN/Z:** Keycloak/OAuth; RBAC (Design vs Manufacturing vs Exec).
- **PII:** likely minimal, but classify and mask where needed.
- **Lineage & catalog:** OpenMetadata/Amundsen; dataset SLAs documented.
- **Backups & retention:** Raw: 2–3 yrs; processed marts: 12–24 mo.
## 10) Schemas (examples)
**Telemetry (Timescale):**

    telemetry(site_id, asset_id, sensor_type, ts_utc, value, unit, qflag)

**Emissions mart (OLAP):**

    fact_emissions_hourly(site_id, asset_id, ts_utc, scope, activity_type,
                          energy_kwh, materials_kg, emission_factor_id, tco2e)

**Emission factor:**

    emission_factor(id, name, source_std, region, activity, unit_in, unit_out,
                    factor, valid_from, valid_to, version)
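
A sketch of how the telemetry schema above maps onto a TimescaleDB hypertable via psycopg2; the connection string and column types are assumptions.

```python
# Sketch: create the telemetry table from this section as a TimescaleDB hypertable.
# Connection string and column types are assumptions for illustration.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS telemetry (
    site_id     TEXT             NOT NULL,
    asset_id    TEXT             NOT NULL,
    sensor_type TEXT             NOT NULL,
    ts_utc      TIMESTAMPTZ      NOT NULL,
    value       DOUBLE PRECISION,
    unit        TEXT,
    qflag       SMALLINT
);
SELECT create_hypertable('telemetry', 'ts_utc', if_not_exists => TRUE);
"""

with psycopg2.connect("postgresql://tsdb:5432/carbon") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```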
## 11) Latency Targets
- **Live ops (alarms):** 5–30s end-to-end.
- **Dashboards:** near-real-time (≤1 min) for sensors; hourly/daily for JEPCO & materials.
- **Reports:** daily close + monthly regulatory exports.
## 12) Quick Diagram (text)

    Sensors → MQTT → Kafka/Telegraf ─┐
                                     ├─ Stream proc (Flink) → Timescale (RT) → Grafana/Alerts
    JEPCO API / CSV → Airflow ───────┤
    Supplier CSV / App → API ────────┤
                                     └─ Batch (dbt/Spark) → OLAP (DuckDB/BigQuery) → Dashboards/Reports
                                              ↘ Models (PM/Forecast) → Alerts & API
## 13) How this helps the Software Specialist
- **Clean contracts:** well-defined schemas + API layer mean fewer ad-hoc queries.
- **Faster features:** standardized marts (dbt) speed up new KPIs.
- **Fewer fires:** monitoring, DLQs, and data tests catch issues early.
- **Scalable by design:** streaming + batch tiers handle growth without rewrites.
## 14) Suggested Stack (balanced complexity)
- **Ingestion:** MQTT + Telegraf (simple) or MQTT + Kafka (scalable)
- **Processing:** dbt (+ DuckDB locally; Spark if big)
- **Storage:** TimescaleDB + Postgres + MinIO
- **Orchestration:** Airflow
- **Quality:** Great Expectations
- **Serving:** FastAPI + Grafana/Metabase