5/8/24, 2:30 PM (7) Open source tools for Data Engineering | LinkedIn
Open source tools for
Data Engineering
Midhun Pottammal
Data Engineer and Full Stack Expert | Hadoop, Spark, Kafka, Python, and NoSQL (Hive, Hbas…
February 14, 2024
Data Integration
1. Apache NiFi: A powerful and easy-to-use tool for
moving data between systems.
2. Airbyte: An open-source data integration platform that
replicates data from source systems into warehouses,
lakes, and databases.
3. Meltano: An open-source data integration tool that
simplifies the process of extracting, loading, and
transforming data.
4. Apache InLong: An integration framework for massive
data, covering ingestion, synchronization, and subscription.
5. Apache SeaTunnel: A data transfer tool for efficiently
moving large volumes of data.
Storage
1. HDFS: The Hadoop Distributed File System, designed
for storing large files across multiple machines.
2. Apache Ozone: A scalable, redundant, and distributed
object store for Hadoop.
3. Ceph: A distributed object, block, and file storage
platform.
https://www.linkedin.com/pulse/open-source-tools-data-engineering-midhun-pottammal-cxvtf/ 1/5
4. MinIO: A high-performance, distributed object storage
server.
Data Lake Platform
1. Apache Hudi: A data lake solution for managing large
analytical datasets.
2. Apache Iceberg: A table format for storing huge, slow-
moving tabular data.
3. Delta Lake: An open-source storage layer that brings ACID
transactions to Apache Spark.
4. Apache Paimon: A lake format, originally developed as Flink
Table Store, for streaming and batch reads and writes at scale.
Event Processing
1. Kafka: A distributed event streaming platform capable
of handling trillions of events a day.
2. Redpanda: A Kafka-compatible event streaming
platform with a focus on performance and scalability.
3. Pulsar: A cloud-native, distributed messaging and
streaming platform.
Data Processing & Computation
1. Apache Spark: An open-source, distributed computing
system that provides an interface for programming
entire clusters with implicit data parallelism and fault
tolerance.
2. Apache Flink: A framework and distributed processing
engine for stateful computations over unbounded and
bounded data streams.
3. Vaex: A Python library for lazy, out-of-core
DataFrames.
4. Ray: A fast and simple framework for building and
running distributed applications.
5. Dask: A flexible parallel computing library for analytic
computing.
6. Polars: A blazingly fast DataFrame library implemented
in Rust and using Apache Arrow.
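The "lazy, out-of-core" idea mentioned for some of these libraries can be illustrated without any of them. The sketch below is not the API of Vaex, Dask, or Polars; it uses only the Python standard library, and the helper names `read_chunks` and `lazy_mean` are hypothetical. It shows the general pattern: process data one chunk at a time and combine partial results, so the full dataset never has to fit in memory.

```python
from typing import Iterable, Iterator, List


def read_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def lazy_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute a mean by combining per-chunk partial sums and counts."""
    total = 0.0
    count = 0
    for chunk in read_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of an empty input")
    return total / count


# The range is never materialized as a list; it is consumed chunk by chunk.
result = lazy_mean(range(1, 1_000_001), chunk_size=10_000)
print(result)
```

Real libraries add query planning, parallel execution, and columnar storage on top of this basic pattern, so this is only a conceptual starting point.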
Database
OLTP
  SQL: RDBMS (MySQL, PostgreSQL), In-Memory (Apache Ignite)
  NoSQL: Key-Value (Aerospike), Document (MongoDB), Graph (Neo4j), Multi-model (ArangoDB)
HTAP
  NewSQL: StoneDB, TiDB
OLAP
  Offline: Columnar (Databend), Time Series (TimescaleDB)
  Real-time: Real-time OLAP (Druid, Pinot, ClickHouse, StarRocks), Search Engine,
  Streaming Database (Materialize, RisingWave)
Visualization
1. Superset: A modern, enterprise-ready business
intelligence web application.
2. Rath: An automated exploratory data analysis and
visualization tool.
3. Redash: A visualization and dashboarding tool.
4. Metabase: A tool for generating charts and
dashboards, asking simple ad hoc questions without writing
SQL, and inspecting individual rows in your
database.
Data Infrastructure
Kubernetes: An open-source container orchestration
platform.
Ambari: A software project designed to enable system
administrators to provision, manage, and monitor a
Hadoop cluster.
Workflow Management & DataOps
1. Airflow: A platform to programmatically author,
schedule, and monitor workflows.
2. Dagster: A data orchestrator for machine learning,
analytics, and ETL.
3. Kestra: A workflow orchestrator for data pipeline
management.
4. Temporal: An open-source, stateful microservices
orchestration platform.
5. Mage: A workflow engine for orchestrating data
pipelines.
6. Windmill: A platform for building and running data
pipelines.
7. DolphinScheduler: A distributed, easily extensible
visual DAG workflow scheduling system that handles
complex dependencies in data processing and works
out of the box.
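Most of the orchestrators above model a pipeline as a DAG (directed acyclic graph) of tasks and run each task only after its dependencies finish. The sketch below illustrates that ordering step with Python's standard-library `graphlib` module (Python 3.9+). It is not the API of Airflow, Dagster, or DolphinScheduler, and the task names and the `run_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline: Dict[str, Set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
    "report": {"load"},
}


def run_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order in which dependencies come first."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)

# A circular dependency cannot be scheduled; graphlib reports it as an error.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print("cycle detected:", cycle_detected)
```

A real scheduler also handles retries, time-based triggers, and parallel execution of independent tasks (here, `clean` and `validate` could run concurrently), which this sketch leaves out.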
Monitoring
Prometheus + Mimir and Grafana + Loki
EFK (Elasticsearch, Fluentd, Kibana)
Metadata Management
1. DataHub: An open-source metadata platform for the
modern data stack.
2. Amundsen: A data discovery and metadata platform.
3. Marquez: An open-source metadata service for the
collection, aggregation, and visualization of a data
ecosystem's metadata.
🌟 Exciting Tools in the World of Data Engineering! 🌟
#DataEngineering #OpenSource #TechTrends #DataIntegration #Storage #DataLake
#DataProcessing #Database #EventProcessing #Visualization #DataInfrastructure
#WorkflowManagement #DataOps #Monitoring #MetadataManagement