Data Engineer Things

Insights and ideas on data and engineering.

Can Multiple Activities Be Executed in Parallel in Spark Rather Than Sequentially?

Scenarios Highlighting the Benefits of Executing Activities in Parallel in Spark, Along with Implementation Details

Sep 30

Trending Now

No, Data Engineers Don’t NEED dbt.

But It Sure Does Solve a Lot of Problems

Jul 19

Uber’s Big Data Revolution: From MySQL to Hadoop and Beyond

Volume: 100+ PB Data, Latency: Minutes

Sep 14

I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.

How Google manages metadata at a large scale.

Sep 17

Data Contracts in Action: Tools

Some people have asked me “Are data contracts really a thing?”, soon followed up by “What tools are available when using data contracts?”

Sep 29

Why Does the "Executor Out of Memory" Error Happen in Apache Spark?

Access this blog for free: https://medium.com/@vishalbarvaliya/464ec2400b52?sk=8abbd0e0a6dd6bafaf2518515af9b5e8

Vishal Barvaliya

Sep 18

How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?

Was adding more machines enough?

Aug 14

Latest stories

Understanding Data Spilling in Databricks and How to Prevent It

Struggling with data spilling in Databricks? Discover why it happens, how it slows down your Spark jobs, and practical strategies to…

Oct 15

Data Pipeline Development with MinIO, Iceberg, Nessie, Polars, StarRocks, Mage, and Docker

Hello again, fellow technology enthusiasts!

George Zefkilis

Oct 15

Understanding dbt from the code to reveal hidden functionality

Essential knowledge for becoming an advanced dbt user

Fumiaki Kobayashi

Oct 14

Why the Quack will you use DuckDB?

Five reasons why you should check out DuckDB

Dudhraj Sandeep

Oct 14

How to Store Python Apps Logs in DBFS and Volumes in Databricks

Detailed Steps to Store Python Application Logs in DBFS and Volumes in Databricks Using FileHandler

Oct 11

The ultimate test of your Docker Image: Running in GitHub Actions

I thought it would be simple…

Oct 8

Trigger and Monitor Databricks Job Runs from Azure Data Factory

Complete guide to triggering and monitoring a Databricks job run from Azure Data Factory (ADF) using the REST API, along with code

Oct 7

Data Death Cycle: The Silo Trap

How to avoid common self-service pitfalls

Oct 3

Optimizing Costs and Building Efficient Data Pipelines on AWS!

As a data engineering leader overseeing complex data pipelines on AWS, it’s essential to not only develop robust pipelines but also ensure…

Shashwath Shenoy

Sep 30

APIs for Data Engineers (Part 2) - Interacting with APIs: The Requests Library and the…

In part one, we explored the fundamentals of APIs, including HTTP methods, endpoints, requests, and responses. Now, let’s delve deeper…

Sep 29

I spent 8 hours diving deep into Snowflake (again)

Virtual Warehouse, Intermediate Storage, Cache, and Remote Storage

Sep 28

Never Make This Mistake When Overwriting a Spark Partition Table

Learn about partition overwrite modes to prevent costly mistakes that could lead to data loss

Sep 25

12 Unique Ways to Create Spark DataFrames

Discover 12 Unique Ways to Create Spark DataFrames with Practical Examples and Insights.

Sep 25

Using Marquez as a lineage tool for Celery — adding the parent-run facet

This story is the second one about integrating Celery with Marquez using the OpenLineage Python package. In this story, we take the first…

Sep 22

A Brief History of REST Catalogs for Apache Iceberg

Iceberg REST Catalogs are popping up everywhere, why is that?

Sep 21

I spent 5 hours learning how ClickHouse built their internal data warehouse.

19 data sources and a total of 470 TB of compressed data.

Sep 21

51 Questions to Help You Create Business Value with Data

“To ask the right question is already half the solution of a problem.” — Carl Jung

Sep 18

A fun experiment: using Marquez as a lineage tool for Celery

Data lineage refers to the journey data takes — where it comes from, how it gets transformed, and where it ends up. This information…

Sep 14

Troubleshooting Spark Jobs: Overcoming Errors and Performance Challenges

A comprehensive guide for data engineers to identify, troubleshoot, and resolve common Spark job errors, optimize performance, and boost efficiency

Sep 11

Looking to Enhance Your Data Quality? This is for You

Practical techniques to implement data verification and validation processes for your data

Sep 11

Why Would Someone Execute Databricks API From Azure Data Factory?

Explained scenarios where leveraging the Databricks REST API from ADF is essential to perform specific tasks with implementation

Sep 11

I spent 6 hours learning how Apache Spark plans the execution for us.

Catalyst, Adaptive Query Execution, and how Airbnb leverages Spark 3.

Sep 11

How to Decide if Databricks Is the Right Tool for You

Essential Questions You Need to Answer Before Adopting Databricks

Sep 6

Circumventing the problem of using data intervals when backfilling dataset scheduled DAGs

Airflow’s data-aware scheduling feature allows event-based triggering between DAGs. It enables us to split large pipelines into smaller…

Sep 4

Lambda vs. Kappa Architecture: A Quick Guide for Data Engineers

This article provides a clear explanation of Lambda and Kappa data processing architectures.

Sep 3
