01 Topol Arrow and Go

Apache Arrow and Go are a good match for building high-performance data pipelines. Arrow provides a columnar format that uses less memory and I/O than row-oriented formats. Go is fast, has good memory-usage characteristics, and makes it easy to build concurrent applications. The Arrow Go module provides readers, writers, and processing functions for common formats such as CSV, JSON, Parquet, and Arrow IPC. A simple example processes a movie-dataset CSV by reading records, transforming a JSON column into a list, and writing the results to Parquet. Together, Arrow and Go enable efficient streaming data pipelines.


Apache Arrow and Go: A Match Made in Data

October 3rd, 2022
Presented by: Matthew Topol
Who am I?

Email: matt@voltrondata.com
Author of “In-Memory Analytics With Apache Arrow”
Staff Software Engineer at Voltron Data
Apache Arrow Contributor
@zeroshade
The Rundown
● Quick Primer on Apache Arrow
● Why Go?
● Simple Code Examples
● Example Code Walkthrough: A Streaming Data Pipeline
● What else can we do?
● More Resources
● Q/A
A quick primer on Apache Arrow
https://arrow.apache.org

● High-performance, in-memory columnar format
● No data serialization / deserialization required!
● Polyglot! Implementations in many languages: Go, C++, Rust, Python, R, Java, Julia, MATLAB, and more…
What is Columnar?

[Diagram: the same table of data laid out as a row-oriented memory buffer vs. an Arrow columnar memory buffer]
Why Columnar?
Memory locality, I/O, and vectorization.

A. Less I/O, lower memory usage, fewer page faults
Get all Archers in Europe: only two columns needed (Archer, Location)!
1. Spin through Locations for indexes
2. Get Archers at those indexes

B. Significantly faster computation!
Calculate the mean of the Year column: only one column needed (Year)!
1. Vectorized operations require contiguous memory
2. Our column is already contiguous memory!
But why Golang??
● Simple and easy to learn
● Go is fast!
● Easier deployment with static binaries
● Built for easy concurrency
● Excellent memory usage characteristics
● Great multi-core usage and scaling
Golang Arrow Module
github.com/apache/arrow/go/v9 (v10 should be released in the next couple of weeks!)

…/arrow
● CSV, JSON, and Arrow IPC readers/writers
● Arrow Flight and Flight SQL client and server
● Supports multiple architectures (AMD64, ARM64, s390x, etc.) and leverages SIMD/NEON
● Low memory usage, high performance

…/parquet
● Parquet reader/writer
● Contains the pqarrow package for easy interoperability between Parquet and Arrow
● Supports multiple architectures (AMD64, ARM64, s390x, etc.) and leverages SIMD/NEON
Let’s start exploring!
The Go Arrow and Parquet libraries
But first… Some Terminology and Types

● Array (arrow.Array): a logical data type, length, null count, and 1 or more Buffers of data
● Record Batch (arrow.Record): a collection of Arrays with the same length, plus a Schema (a collection of Fields)
● Chunked Array (arrow.Chunked): a sequence of arrays with the same data type, a total length, and a total null count
● Table (arrow.Table): a collection of Columns (Chunked Array + Field) with the same total length, plus a schema
Simple Example
Build an Int64 Array

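The original slide shows this code as an image; here is a minimal sketch of what it looks like with the v9 module:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
	// Builders allocate through a memory.Allocator; the default uses Go slices.
	bldr := array.NewInt64Builder(memory.DefaultAllocator)
	defer bldr.Release()

	// Append a batch of values; the nil second argument means "all valid".
	bldr.AppendValues([]int64{1, 2, 3, 4}, nil)
	bldr.AppendNull()

	// NewInt64Array transfers ownership and resets the builder for reuse.
	arr := bldr.NewInt64Array()
	defer arr.Release()

	fmt.Println(arr) // prints: [1 2 3 4 (null)]
}
```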
Memory Handling

Retain / Release
● Reference counting is used to track usage of buffers
● Manage ownership and eagerly try to free memory
● Ties into the Allocator interface for custom handling

memory.Allocator
● Interface for custom memory allocation; the default just uses make([]byte, …)
● Only three methods: Allocate, Reallocate, Free
● CheckedAllocator for tracking memory usage
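A sketch of using the CheckedAllocator in a test to catch leaked buffers (the test name is illustrative):

```go
package pipeline_test

import (
	"testing"

	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func TestNoLeaks(t *testing.T) {
	// Wrap any allocator to track every Allocate/Free pair.
	mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
	defer mem.AssertSize(t, 0) // fails the test if any bytes were never freed

	bldr := array.NewInt64Builder(mem)
	defer bldr.Release()
	bldr.AppendValues([]int64{1, 2, 3}, nil)

	arr := bldr.NewInt64Array()
	arr.Release() // forgetting this Release would trip AssertSize above
}
```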
Struct Builder
A StructBuilder manages multiple field builders, one per child field. There is a builder for every Array type, and even a RecordBuilder, which works much like the StructBuilder. A sketch follows below.
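A minimal sketch of the StructBuilder pattern (field names are illustrative):

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// buildStructs builds a struct<id: int64, name: utf8> array one value at a time.
func buildStructs(mem memory.Allocator) arrow.Array {
	st := arrow.StructOf(
		arrow.Field{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
	)

	sb := array.NewStructBuilder(mem, st)
	defer sb.Release()

	// Each child field has its own builder, addressed by index.
	idB := sb.FieldBuilder(0).(*array.Int64Builder)
	nameB := sb.FieldBuilder(1).(*array.StringBuilder)

	// Append(true) starts a new (valid) struct value; fill in every field.
	sb.Append(true)
	idB.Append(123)
	nameB.Append("Comedy")

	return sb.NewArray()
}
```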
Reading and Writing Data
Multiple formats supported!

CSV
● Can provide an explicit schema or infer types
● Specify null values, delimiter, line endings

Arrow Record Reader/Writer
● Can control Record Batch chunk size

Parquet
● Highly efficient columnar storage
● Often zero-copy when converting to Arrow
● Can easily read columns and row groups in parallel
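A sketch of constructing a CSV reader with an explicit schema and options (the two-column schema is a stand-in for the real dataset's columns):

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/csv"
)

func readMovies(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// A hypothetical two-column schema; the real dataset has many more columns.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "title", Type: arrow.BinaryTypes.String},
		{Name: "budget", Type: arrow.PrimitiveTypes.Int64, Nullable: true},
	}, nil)

	rdr := csv.NewReader(f, schema,
		csv.WithHeader(true), // first row holds the column names
		csv.WithChunk(1024),  // rows per emitted record batch
	)
	defer rdr.Release()

	for rdr.Next() {
		rec := rdr.Record() // only valid until the next call to Next
		_ = rec             // ...process the record here...
	}
	return rdr.Err()
}
```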
Sample Usage
“The Movies Dataset”

Example: A Streaming Data Pipeline
Yes, it’s contrived. But it’s informative!

1. Read CSV Data
2. Transform / Add / Replace Columns
3. Write out a Parquet File
Example: The Sample Data
Kaggle: “The Movies Dataset”
https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

Most columns are easy:
● bool, int, float, string
● CSV reader can handle nulls for us
● Infer the column types
● Zero-copy transfer to a new arrow.Record

Some columns we want to manipulate:
● String column values that are JSON strings, converted into Lists for easier processing
● Any other streaming transformations you’d like…
Reading CSV Data
● Stream Records via Channels
● Low memory usage, easy parallelism with Golang (a sketch follows below)
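A sketch of streaming record batches over a channel; the function name and channel size are illustrative. The key detail is Retain: the reader releases each record on its next call to Next, so the record must be retained before it crosses the channel.

```go
package main

import (
	"context"
	"io"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/csv"
)

// streamCSV reads record batches in a goroutine and streams them over a channel.
func streamCSV(ctx context.Context, r io.Reader, schema *arrow.Schema) <-chan arrow.Record {
	out := make(chan arrow.Record, 4)
	go func() {
		defer close(out)
		rdr := csv.NewReader(r, schema, csv.WithHeader(true), csv.WithChunk(1024))
		defer rdr.Release()
		for rdr.Next() {
			rec := rdr.Record()
			rec.Retain() // the reader releases rec on its next call to Next
			select {
			case out <- rec: // the consumer is now responsible for Release
			case <-ctx.Done():
				rec.Release()
				return
			}
		}
	}()
	return out
}
```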
Manipulating the Column
Let’s dig into this a bit
Trust me, it’s easier than it looks!
Follow along for the next few slides…
First: A ListBuilder
`[{'id': 123, 'name': 'Comedy'}, {'id': 456, 'name': 'Drama'}]`

● Builders are reusable
● Create a List Column of Structs (see the sketch below)
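A minimal sketch of building one list<struct<id, name>> value by hand, using the genre values from the example above:

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// buildGenreList builds a single list<struct<id: int64, name: utf8>> value.
func buildGenreList(mem memory.Allocator) arrow.Array {
	elem := arrow.StructOf(
		arrow.Field{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
	)

	lb := array.NewListBuilder(mem, elem)
	defer lb.Release()
	sb := lb.ValueBuilder().(*array.StructBuilder)

	ids, names := []int64{123, 456}, []string{"Comedy", "Drama"}
	lb.Append(true) // start a new list value (one row of the column)
	for i := range ids {
		sb.Append(true)
		sb.FieldBuilder(0).(*array.Int64Builder).Append(ids[i])
		sb.FieldBuilder(1).(*array.StringBuilder).Append(names[i])
	}

	// NewArray resets the builder, so it can be reused for the next chunk.
	return lb.NewArray()
}
```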
Next: Build Replacement Column
The example is just one column, but it could be any number of columns in parallel.

● Grab the column we want: could find its index via the Schema’s FieldIndices method
● Parse the JSON directly: UnmarshalJSON on a builder parses the JSON and adds the values to the builder (sketched below)
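A sketch of the replacement-column step. It assumes the column is named "genres" (hypothetical) and that each cell has already been normalized to valid JSON like `[{"id": 123, "name": "Comedy"}]`; my understanding is that a builder's UnmarshalJSON consumes a JSON array and appends one value per element.

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
)

// buildGenresColumn re-parses a string column of JSON arrays into a list column.
func buildGenresColumn(rec arrow.Record, lb *array.ListBuilder) (arrow.Array, error) {
	idx := rec.Schema().FieldIndices("genres")[0]
	col := rec.Column(idx).(*array.String)

	sb := lb.ValueBuilder().(*array.StructBuilder)
	for i := 0; i < col.Len(); i++ {
		if col.IsNull(i) {
			lb.AppendNull()
			continue
		}
		lb.Append(true)
		// The struct builder's UnmarshalJSON consumes the cell's JSON array
		// and appends one struct per object in it.
		if err := sb.UnmarshalJSON([]byte(col.Value(i))); err != nil {
			return nil, err
		}
	}
	return lb.NewArray(), nil
}
```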
Next: Send the New Record
It’s a pointer! There’s no copying!

● Create the Output Schema: check if we have it already so we only create it once
● Send the New Record: pass the new record to a different channel, continuing the pipeline
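A sketch of swapping in the new column and forwarding the record (function and parameter names are illustrative); only the slice of column pointers is copied, never the data buffers:

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
)

// sendReplaced swaps one column and forwards the record down the pipeline.
func sendReplaced(rec arrow.Record, idx int, newCol arrow.Array,
	newSchema *arrow.Schema, out chan<- arrow.Record) {

	cols := make([]arrow.Array, rec.NumCols())
	copy(cols, rec.Columns())
	cols[idx] = newCol

	// NewRecord retains the columns; the underlying buffers are shared.
	out <- array.NewRecord(newSchema, cols, rec.NumRows())
	rec.Release() // done with the input record
}
```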
Improvement: Parallelize

Goroutines and Channels enable extremely easy parallel patterns such as fan-out/fan-in.
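A generic fan-out/fan-in sketch: n workers pull from a shared input channel and merge their results into one output channel (names are illustrative).

```go
package main

import (
	"sync"

	"github.com/apache/arrow/go/v9/arrow"
)

// fanOutIn runs n workers over the input channel and merges their output.
func fanOutIn(in <-chan arrow.Record, n int,
	xform func(arrow.Record) arrow.Record) <-chan arrow.Record {

	out := make(chan arrow.Record, n)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() { // fan-out: each worker pulls from the shared input
			defer wg.Done()
			for rec := range in {
				out <- xform(rec) // fan-in: all workers share one output
			}
		}()
	}
	go func() { // close the output once every worker has finished
		wg.Wait()
		close(out)
	}()
	return out
}
```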
Recap: Pipeline So Far…

CSV Data (or any Record stream) → Channel → Process / Manipulate Records → Channel → Write Parquet (next)
Write a Parquet File
● Columnar file storage
● Optimized Arrow → Parquet conversion
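A sketch of draining the record channel into a Parquet file via the pqarrow package (the Snappy compression choice is illustrative):

```go
package main

import (
	"io"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/parquet"
	"github.com/apache/arrow/go/v9/parquet/compress"
	"github.com/apache/arrow/go/v9/parquet/pqarrow"
)

// writeParquet consumes the pipeline's output channel and writes a Parquet file.
func writeParquet(w io.Writer, schema *arrow.Schema, in <-chan arrow.Record) error {
	props := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy))
	fw, err := pqarrow.NewFileWriter(schema, w, props, pqarrow.DefaultWriterProps())
	if err != nil {
		return err
	}
	for rec := range in {
		if err := fw.Write(rec); err != nil {
			return err
		}
		rec.Release() // we own each record that crossed the channel
	}
	return fw.Close()
}
```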
Reader and Writer use io Interfaces
Easy reading and writing of data regardless of location (S3, ADLS, HDFS, etc.)

Parquet
● Reader requires io.ReaderAt and io.Seeker
● Writer only needs io.Writer, great for streams
● Can read Parquet data and metadata directly, or convert directly to/from Arrow

CSV
● Only needs io.Reader and io.Writer
● Control memory usage via Chunk options
What about between processes?
Efficient Data Transportation
https://arrow.apache.org/docs/format/Flight.html

Arrow IPC
● Communicate record batches locally or remotely
● File and streaming formats
● Can mmap for efficiency

Arrow Flight RPC / Arrow Flight SQL
● Protobuf + Arrow IPC streams
● Standardized protocol for many clients
● >20x faster than ODBC
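A sketch of the IPC streaming format: because the writer only needs an io.Writer, the same code works for a file, a pipe, or a network socket.

```go
package main

import (
	"io"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/ipc"
)

// writeIPCStream sends record batches over any io.Writer using the IPC
// streaming format.
func writeIPCStream(w io.Writer, schema *arrow.Schema, recs ...arrow.Record) error {
	wr := ipc.NewWriter(w, ipc.WithSchema(schema))
	defer wr.Close()
	for _, rec := range recs {
		if err := wr.Write(rec); err != nil {
			return err
		}
	}
	return nil
}
```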
What else can it do?
Some uses for Apache Arrow in Go:

● Distributed Arrow Flight services to be called by clients in any language
● Deploying composable components to link against using the C Data API
● Efficient CLI utilities for data manipulation from remote data sources
● Building highly concurrent, deployable data pipelines
● Building an Arrow-native Computation Engine or custom Database
● Composable services to offload data computation and analysis
Want more examples?
More on Apache Arrow: https://arrow.apache.org/docs/

Or get my book, “In-Memory Analytics with Apache Arrow”!
● Examples in multiple languages: Python / C++ / Go
● Practical examples for Arrow Flight and other Data Science workflows
● Amazon link for the book: buff.ly/3OcoxyB

Go Arrow/Parquet docs: https://pkg.go.dev/github.com/apache/arrow/go/v9
Q&A

Thanks Everyone!

The Go Gopher image is released under the Creative Commons Attribution 3.0 License, originally created by artist Renee French.
XKCD comics are released under the Creative Commons Attribution-NonCommercial 2.5 License, created by Randall Munroe (https://xkcd.com).
