Data Engineer Things
Insights and ideas on data and engineering.
Can Multiple Activities Be Executed in Parallel in Spark Rather Than Sequentially?
Scenarios Highlighting the Benefits of Executing Activities in Parallel in Spark, Along with Implementation Details
Rahul Madhani
Sep 30
Trending Now
No, Data Engineers Don’t NEED dbt.
But It Sure Does Solve a Lot of Problems
Leo Godin
Jul 19
Uber’s Big Data Revolution: From MySQL to Hadoop and Beyond
Volume: 100+ PB Data, Latency: Minutes
Vu Trinh
Sep 14
I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.
How Google manages metadata at a large scale.
Vu Trinh
Sep 17
Data Contracts in Action: Tools
Some people have asked me “Are data contracts really a thing?”, soon followed up by “What tools are available when using data contracts?”
Peter Flook
Sep 29
Why Does the "Executor Out of Memory" Error Happen in Apache Spark?
Access this blog for free: https://medium.com/@vishalbarvaliya/464ec2400b52?sk=8abbd0e0a6dd6bafaf2518515af9b5e8
Vishal Barvaliya
Sep 18
How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?
Was adding more machines enough?
Vu Trinh
Aug 14
Latest stories
Understanding Data Spilling in Databricks and How to Prevent It
Struggling with data spilling in Databricks? Discover why it happens, how it slows down your Spark jobs, and practical strategies to…
Rui Carvalho
Oct 15
Data Pipeline Development with MinIO, Iceberg, Nessie, Polars, StarRocks, Mage, and Docker
Hello again, fellow technology enthusiasts!
George Zefkilis
Oct 15
Understanding dbt from the code to reveal hidden functionality
Essential knowledge for becoming an advanced dbt user
Fumiaki Kobayashi
Oct 14
Why the Quack will you use DuckDB?
Five reasons why you should check out DuckDB
Dudhraj Sandeep
Oct 14
How to Store Python Apps Logs in DBFS and Volumes in Databricks
Detailed Steps to Store Python Application Logs in DBFS and Volumes in Databricks Using FileHandler
Rahul Madhani
Oct 11
The ultimate test of your Docker Image: Running in GitHub Actions
I thought it would be simple…
Peter Flook
Oct 8
Trigger and Monitor Databricks Job Runs from Azure Data Factory
Complete guide to triggering and monitoring a Databricks job run from Azure Data Factory (ADF) using the REST API, along with code
Rahul Madhani
Oct 7
Data Death Cycle: The Silo Trap
How to avoid common self-service pitfalls
Hugo Lu
Oct 3
Optimizing Costs and Building Efficient Data Pipelines on AWS!
As a data engineering leader overseeing complex data pipelines on AWS, it’s essential to not only develop robust pipelines but also ensure…
Shashwath Shenoy
Sep 30
APIs for Data Engineers (Part 2) - Interacting with APIs: The Requests Library and the…
In part one, we explored the fundamentals of APIs, including HTTP methods, endpoints, requests, and responses. Now, let’s delve deeper…
Aminat Lawal
Sep 29
I spent 8 hours diving deep into Snowflake (again)
Virtual Warehouse, Intermediate Storage, Cache, and Remote Storage
Vu Trinh
Sep 28
Never Make This Mistake When Overwriting a Spark Partition Table
Learn about partition overwrite modes to prevent costly mistakes that could lead to data loss
Rahul Madhani
Sep 25
12 Unique Ways to Create Spark DataFrames
Discover 12 Unique Ways to Create Spark DataFrames with Practical Examples and Insights.
Rahul Madhani
Sep 25
Using Marquez as a lineage tool for Celery — adding the parent-run facet
This story is the second one about integrating Celery with Marquez using the OpenLineage Python package. In this story, we take the first…
Marin Aglić
Sep 22
A Brief History of REST Catalogs for Apache Iceberg
Iceberg REST Catalogs are popping up everywhere, why is that?
Lisa N. Cao
Sep 21
I spent 5 hours learning how ClickHouse built their internal data warehouse.
19 data sources and a total of 470 TB of compressed data.
Vu Trinh
Sep 21
51 Questions to Help You Create Business Value with Data
“To ask the right question is already half the solution of a problem.” — Carl Jung
Eduard Popa
Sep 18
A fun experiment: using Marquez as a lineage tool for Celery
Data lineage refers to the journey data takes — where it comes from, how it gets transformed, and where it ends up. This information…
Marin Aglić
Sep 14
Troubleshooting Spark Jobs: Overcoming Errors and Performance Challenges
A comprehensive guide for data engineers to identify, troubleshoot, and resolve common Spark job errors, optimize performance, and boost efficiency
Pritam Deb
Sep 11
Looking to Enhance Your Data Quality? This is for You
Practical techniques to implement data verification and validation processes for your data
Rahul Madhani
Sep 11
Why Would Someone Execute Databricks API From Azure Data Factory?
Scenarios where leveraging the Databricks REST API from ADF is essential for specific tasks, with implementation details
Rahul Madhani
Sep 11
I spent 6 hours learning how Apache Spark plans the execution for us.
Catalyst, Adaptive Query Execution, and how Airbnb leverages Spark 3.
Vu Trinh
Sep 11
How to Decide if Databricks Is the Right Tool for You
Essential Questions You Need to Answer Before Adopting Databricks
Eduard Popa
Sep 6
Circumventing the problem of using data intervals when backfilling dataset scheduled DAGs
Airflow’s data-aware scheduling feature allows event-based triggering between DAGs. It enables us to split large pipelines into smaller…
Marin Aglić
Sep 4
Lambda vs. Kappa Architecture: A Quick Guide for Data Engineers
This article provides a clear explanation of Lambda and Kappa data processing architectures.
Santosh Joshi
Sep 3