Chapter 3: Working With Dask DataFrames

This document introduces Dask DataFrames for parallel processing of large datasets in Python. It shows how to read CSV files lazily using Dask and how to build delayed pipelines that operate on the data in parallel. The key advantage of Dask over Pandas is its ability to handle datasets too large to fit into memory by distributing data and computation across multiple cores or machines. The document also compares the performance of Dask and Pandas on tasks such as reading taxi-trip data files and computing aggregations, demonstrating that Dask can improve performance for large datasets that require out-of-core computation.

Using Dask DataFrames

Dhavide Aruliah
Director of Training, Anaconda
Reading CSV
import dask.dataframe as dd

The dd.read_csv() function:

Accepts a single filename or a glob pattern (with wildcard *)

Does not read the file immediately (lazy evaluation)

File(s) need not fit in memory (see the sketch below)
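
Because evaluation is lazy, dd.read_csv() returns almost immediately; work happens only when a concrete result is requested. A minimal sketch (the filename pattern and column name here are illustrative):

import dask.dataframe as dd

# Returns instantly: only metadata and a small sample are inspected.
df = dd.read_csv('transactions-*.csv')

# Rows are actually read only when a result is materialized:
df['amount'].sum().compute()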

Reading multiple CSV files
%ls

quarter1.csv quarter2.csv quarter3.csv quarter4.csv

transactions = dd.read_csv('*.csv')
transactions.head()
transactions.tail()

transactions.head():

    id    names  amount        date
0  131  Norbert   -1159  2016-01-01
1  342    Jerry    1149  2016-01-01
2  485      Dan    1380  2016-01-01
3  513   Xavier    1555  2016-01-02
4  849  Michael     363  2016-01-02

transactions.tail():

      id     names  amount        date
195  838     Wendy      87  2016-12-28
196  915       Bob     852  2016-12-30
197  749  Patricia    1741  2016-12-31
198  743   Michael    1191  2016-12-31
199  889     Wendy     336  2016-12-31
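
Note that .head() and .tail() are exceptions to lazy evaluation: they compute immediately, reading only the first (or last) partition, so they are cheap previews even when the full dataset is large.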

Building delayed pipelines
is_wendy = (transactions['names'] == 'Wendy')
wendy_amounts = transactions.loc[is_wendy, 'amount']
wendy_amounts

Dask Series Structure:


npartitions=4
None int64
None ...
None ...
None ...
None ...
Name: amount, dtype: int64
Dask Name: loc-series, 24 tasks

Building delayed pipelines
wendy_diff = wendy_amounts.sum()
wendy_diff

dd.Scalar<series-..., dtype=int64>
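
wendy_diff is still lazy at this point; no CSV has been read. A sketch of materializing the result (assuming the quarterly files above):

# Triggers the whole pipeline: read all four CSVs, filter Wendy's
# rows, and sum the amounts, with partitions processed in parallel.
result = wendy_diff.compute()
print(result)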

wendy_diff.visualize(rankdir='LR')

Visualizing pipelines

[Figure: task graph for wendy_diff, rendered left-to-right by .visualize(rankdir='LR')]

Compatibility with Pandas API
Unavailable in dask.dataframe:

some file formats (e.g., .xls, .zip, .gz)

sorting

Available in dask.dataframe:

indexing, selection, & reindexing

aggregations: .sum(), .mean(), .std(), .min(), .max(), etc.

grouping with .groupby()

datetime conversion with dd.to_datetime() (see the sketch below)
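
The last two combine naturally on the transactions data from earlier; a minimal sketch (column names follow that example):

# Parse the date column lazily, then total amounts per calendar month.
transactions['date'] = dd.to_datetime(transactions['date'])
monthly = transactions.groupby(transactions['date'].dt.month)['amount'].sum()
monthly.compute()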

Let's practice!
Timing DataFrame Operations

How big is big data?
Data size M           Required hardware
M < 8 GB              RAM (single machine)
8 GB < M < 10 TB      hard disk (single machine)
M > 10 TB             specialized hardware

Two key questions:

Does the data fit in RAM (random access memory)?

Does the data fit on hard disk?

Taxi CSV files
%ll -h yellow_tripdata_2015-*.csv

-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-01.csv
-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-02.csv
-rw-r--r-- 1 user staff 1.9G 31 Jul 16:43 yellow_tripdata_2015-03.csv
-rw-r--r-- 1 user staff 1.9G 31 Jul 16:43 yellow_tripdata_2015-04.csv
-rw-r--r-- 1 user staff 1.9G 31 Jul 16:43 yellow_tripdata_2015-05.csv
-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-06.csv
-rw-r--r-- 1 user staff 1.7G 31 Jul 16:43 yellow_tripdata_2015-07.csv
-rw-r--r-- 1 user staff 1.6G 31 Jul 16:43 yellow_tripdata_2015-08.csv
-rw-r--r-- 1 user staff 1.6G 31 Jul 16:43 yellow_tripdata_2015-09.csv
-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-10.csv
-rw-r--r-- 1 user staff 1.7G 31 Jul 16:43 yellow_tripdata_2015-11.csv
-rw-r--r-- 1 user staff 1.7G 31 Jul 16:43 yellow_tripdata_2015-12.csv

Timing I/O & computation: Pandas
import time
import pandas as pd

t_start = time.time()
df = pd.read_csv('yellow_tripdata_2015-01.csv')
t_end = time.time()
print('pd.read_csv(): {} s'.format(t_end - t_start))  # time [s]

pd.read_csv(): 43.820565938949585 s

t_start = time.time()
m = df['trip_distance'].mean()
t_end = time.time()
print('.mean(): {} ms'.format((t_end - t_start)*1000))  # time [ms]

.mean(): 17.752885818481445 ms

Timing I/O & computation: Dask
import time
import dask.dataframe as dd

t_start = time.time()
df = dd.read_csv('yellow_tripdata_2015-*.csv')
t_end = time.time()
print('dd.read_csv: {} ms'.format((t_end - t_start)*1000))  # time [ms]

dd.read_csv: 404.7999382019043 ms

t_start = time.time()
m = df['trip_distance'].mean()
t_end = time.time()
print('.mean(): {} ms'.format((t_end - t_start)*1000))  # time [ms]

.mean(): 2.289295196533203 ms

Timing I/O & computation: Dask
t_start = time.time()
result = m.compute()
t_end = time.time()
print('.compute(): {} min'.format((t_end - t_start)/60))  # time [min]

.compute(): 3.4004417498906454 min

Timing in the IPython shell
m = df['trip_distance'].mean()
%time result = m.compute()

CPU times: user 9min 50s, sys: 1min 16s, total: 11min 7s
Wall time: 3min 1s
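
%time reports a single run; IPython's %timeit magic averages over many runs. A sketch (the flags limit it to one run, since this computation takes minutes):

# -n1 -r1: execute the statement once rather than many times.
%timeit -n1 -r1 m.compute()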

Is Dask or Pandas appropriate?
How big is the dataset?

How much RAM is available?

How many threads/cores/CPUs are available?

Are the Pandas computations/formats supported in the Dask API?

Is the computation I/O-bound (disk-intensive) or CPU-bound (processor-intensive)?

Best use case for Dask:

Computations from the Pandas API are available in Dask

Problem size is close to the limits of RAM but fits on disk (see the sketch below)
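
This decision can be scripted as a rough rule of thumb. A minimal sketch, assuming psutil is installed (the helper name and the 2x headroom factor are illustrative assumptions, not part of the course):

import os
import psutil  # third-party library, used only to query available RAM

def fits_in_ram(paths, headroom=2.0):
    # Parsed DataFrames need more RAM than the on-disk CSV size;
    # the 2x headroom here is a conservative guess, not a rule.
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return total_bytes * headroom < psutil.virtual_memory().available

files = ['yellow_tripdata_2015-01.csv']
if fits_in_ram(files):
    import pandas as pd
    df = pd.read_csv(files[0])                      # eager, in-memory
else:
    import dask.dataframe as dd
    df = dd.read_csv('yellow_tripdata_2015-*.csv')  # lazy, out-of-core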

Let's practice!
Analyzing NYC Taxi Rides

The New York taxi dataset

Taxi CSV files
%ll -h yellow_tripdata_2015-*.csv

-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-01.csv
-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-02.csv
-rw-r--r-- 1 user staff 1.9G 31 Jul 16:43 yellow_tripdata_2015-03.csv
-rw-r--r-- 1 user staff 1.9G 31 Jul 16:43 yellow_tripdata_2015-04.csv
-rw-r--r-- 1 user staff 1.9G 31 Jul 16:43 yellow_tripdata_2015-05.csv
-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-06.csv
-rw-r--r-- 1 user staff 1.7G 31 Jul 16:43 yellow_tripdata_2015-07.csv
-rw-r--r-- 1 user staff 1.6G 31 Jul 16:43 yellow_tripdata_2015-08.csv
-rw-r--r-- 1 user staff 1.6G 31 Jul 16:43 yellow_tripdata_2015-09.csv
-rw-r--r-- 1 user staff 1.8G 31 Jul 16:43 yellow_tripdata_2015-10.csv
-rw-r--r-- 1 user staff 1.7G 31 Jul 16:43 yellow_tripdata_2015-11.csv
-rw-r--r-- 1 user staff 1.7G 31 Jul 16:43 yellow_tripdata_2015-12.csv

Exercises use smaller files...

Taxi data features
import pandas as pd
df = pd.read_csv('yellow_tripdata_2015-01.csv')
df.shape
df.columns

(12748986, 19)
Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
'passenger_count', 'trip_distance', 'pickup_longitude',
'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
'dropoff_longitude', 'dropoff_latitude', 'payment_type',
'fare_amount','extra', 'mta_tax', 'tip_amount',
'tolls_amount','improvement_surcharge', 'total_amount'],
dtype='object')
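
Reading the whole year with Dask is the same call with a glob pattern; a sketch (the dtype hint is an illustrative precaution against Dask inferring different types in different partitions, not something the course specifies):

import dask.dataframe as dd

# Lazy read of all twelve monthly files; an explicit dtype keeps
# type inference consistent across partitions.
df = dd.read_csv('yellow_tripdata_2015-*.csv',
                 dtype={'store_and_fwd_flag': 'object'})
df.columns  # available immediately, without reading every row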

Amount paid
How much was each ride?
fare_amount : cost of ride

tolls_amount : charges for toll roads

extra : additional charges

tip_amount : amount tipped (credit cards only)

total_amount : total amount paid by passenger (see the sketch below)
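
These components (together with mta_tax and improvement_surcharge from the column list above) should roughly add up to total_amount. A sketch of checking that, assuming df is the Pandas DataFrame loaded earlier:

# Recompute the total from its parts; small float discrepancies
# are expected, so compare with a tolerance.
parts = ['fare_amount', 'extra', 'mta_tax', 'tip_amount',
         'tolls_amount', 'improvement_surcharge']
mismatch = (df[parts].sum(axis=1) - df['total_amount']).abs() > 0.01
mismatch.mean()  # fraction of rides where the sum disagrees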

Payment type
df['payment_type'].value_counts()

1 7881388
2 4816992
3 38632
4 11972
5 2
Name: payment_type, dtype: int64
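
In the TLC data dictionary, payment_type 1 is credit card and 2 is cash. Since tips are recorded for credit-card rides only (see above), a tip analysis should filter on it first; a minimal sketch using df from earlier:

# Restrict to credit-card rides before computing a tip fraction,
# because cash tips are not recorded in this dataset.
credit = df.loc[df['payment_type'] == 1]
tip_fraction = credit['tip_amount'] / credit['fare_amount']
tip_fraction.mean()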

Let's practice!