Arrow and Go
Presented by: Matthew Topol
Who am I?
matt@voltrondata.com | @zeroshade
Author of "In-Memory Analytics with Apache Arrow"
The Rundown
● Quick Primer on Apache Arrow
● Why Go?
● Simple Code Examples
● Example Code Walkthrough: A Streaming Data Pipeline
● What else can we do?
● More Resources
● Q/A
A quick primer on Apache Arrow
https://arrow.apache.org
What is Columnar?
Instead of storing a table of data row by row, a columnar format stores all of the values for each column contiguously.
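To make the layout concrete, here is an illustrative sketch (plain Go, not library code) of the same two-column table stored row-wise versus column-wise:

    // Row-oriented: each record keeps its fields together in memory.
    type MovieRow struct {
        ID    int64
        Title string
    }
    rows := []MovieRow{{1, "Heat"}, {2, "Up"}}

    // Column-oriented: all values of a column are contiguous.
    // This is the layout Arrow arrays give you.
    ids := []int64{1, 2}
    titles := []string{"Heat", "Up"}

Contiguous columns make scans, filters, and vectorized operations over a single column much cheaper than chasing through rows.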
But why Golang??
● Simple and easy to learn
● Go is fast!
Let’s start exploring!
The Go Arrow and Parquet libraries
But first… Some Terminology and Types
Simple Example
Build an Int64 Array
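A minimal sketch of that example, assuming a recent release of the library (the versioned import path is whatever matches your module):

    import (
        "fmt"

        "github.com/apache/arrow/go/v13/arrow/array"
        "github.com/apache/arrow/go/v13/arrow/memory"
    )

    func buildInt64() {
        bldr := array.NewInt64Builder(memory.DefaultAllocator)
        defer bldr.Release()

        bldr.AppendValues([]int64{1, 2, 3, 4}, nil) // nil = no nulls
        bldr.AppendNull()

        arr := bldr.NewInt64Array() // resets the builder for reuse
        defer arr.Release()
        fmt.Println(arr) // [1 2 3 4 (null)]
    }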
Memory Handling
Retain / Release
● Reference counting is used to track usage of buffers
● Manage ownership and eagerly try to free memory
● Ties into the Allocator interface for custom handling
memory.Allocator
● Interface for custom memory allocation; the default just uses make([]byte, …)
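The interface itself is only three methods. As a sketch, here's a hypothetical CountingAllocator (my name, not the library's) that wraps another allocator to track live bytes:

    import (
        "sync/atomic"

        "github.com/apache/arrow/go/v13/arrow/memory"
    )

    // CountingAllocator wraps another memory.Allocator and tracks
    // how many bytes are currently allocated through it.
    type CountingAllocator struct {
        mem   memory.Allocator
        bytes atomic.Int64
    }

    func (a *CountingAllocator) Allocate(size int) []byte {
        a.bytes.Add(int64(size))
        return a.mem.Allocate(size)
    }

    func (a *CountingAllocator) Reallocate(size int, b []byte) []byte {
        a.bytes.Add(int64(size - len(b)))
        return a.mem.Reallocate(size, b)
    }

    func (a *CountingAllocator) Free(b []byte) {
        a.bytes.Add(int64(-len(b)))
        a.mem.Free(b)
    }

Pass an instance anywhere the library takes a memory.Allocator and Retain/Release will route buffer allocations through it.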
Struct Builder
● A StructBuilder wraps multiple field builders, one per child field
● There is a builder for each Array type, and even a RecordBuilder, which is similar to the StructBuilder
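A sketch of a RecordBuilder in action (same import-path caveat as before): each field gets its own typed builder, exactly like the children of a StructBuilder.

    schema := arrow.NewSchema([]arrow.Field{
        {Name: "id", Type: arrow.PrimitiveTypes.Int64},
        {Name: "title", Type: arrow.BinaryTypes.String},
    }, nil)

    bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
    defer bldr.Release()

    // Append to each field's builder, then snapshot a record.
    bldr.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2}, nil)
    bldr.Field(1).(*array.StringBuilder).AppendValues([]string{"Heat", "Up"}, nil)

    rec := bldr.NewRecord() // a two-row arrow.Record
    defer rec.Release()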
Reading and Writing Data
Multiple formats supported!
Sample Usage
“The Movies Dataset”
Example: A Streaming Data Pipeline
Yes, it’s contrived. But it’s informative!
Example: The Sample Data
Source: Kaggle, “The Movies Dataset”
https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
● Most columns are easy: bool, int, float, string
● Infer the column types
● CSV reader can handle nulls for us
● Zero-copy transfer to new arrow.Record
Reading CSV Data
Stream Records via Channels
● Low memory usage, easy parallelism with Golang
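A sketch of the pattern, assuming the csv package's inferring reader (available in recent releases):

    import (
        "os"

        "github.com/apache/arrow/go/v13/arrow"
        "github.com/apache/arrow/go/v13/arrow/csv"
    )

    func streamCSV(f *os.File) <-chan arrow.Record {
        // Infer column types; the chunk size bounds memory usage.
        rdr := csv.NewInferringReader(f,
            csv.WithHeader(true),
            csv.WithChunk(1024),
            csv.WithNullReader(true)) // recognize null sentinels

        out := make(chan arrow.Record)
        go func() {
            defer close(out)
            defer rdr.Release()
            for rdr.Next() {
                rec := rdr.Record()
                rec.Retain() // the reader releases it on the next Next()
                out <- rec
            }
        }()
        return out
    }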
Manipulating the Column
Let’s dig into this a bit
Trust me, it’s easier than it looks!
Follow along for the next few slides…
First: A ListBuilder
`[{'id': 123, 'name': 'Comedy'}, {'id': 456, 'name': 'Drama'}]`
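The column of JSON strings above becomes list<struct<id: int64, name: string>>, so we need a ListBuilder whose value builder is a StructBuilder. A sketch:

    // Element type for one genre entry.
    elemType := arrow.StructOf(
        arrow.Field{Name: "id", Type: arrow.PrimitiveTypes.Int64},
        arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
    )

    lb := array.NewListBuilder(memory.DefaultAllocator, elemType)
    defer lb.Release()
    // The list's value builder is a StructBuilder for the elements.
    sb := lb.ValueBuilder().(*array.StructBuilder)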
Next: Build Replacement Column
Example is just one column, but could be any number of columns in parallel
UnmarshalJSON on a builder parses the JSON and adds the values to the builder
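Continuing the sketch above, and assuming the builders' UnmarshalJSON behavior (it expects a JSON array and appends each element), each row's cell can be fed straight to the struct builder. The raw dataset uses single quotes, so a real pipeline would first normalize them to valid JSON.

    strCol := rec.Column(idx).(*array.String) // the original genres column
    for i := 0; i < strCol.Len(); i++ {
        if strCol.IsNull(i) {
            lb.AppendNull()
            continue
        }
        lb.Append(true) // start this row's list value
        // Appends each struct in the JSON array into the current list.
        if err := sb.UnmarshalJSON([]byte(strCol.Value(i))); err != nil {
            // handle the error...
        }
    }
    newCol := lb.NewArray()
    defer newCol.Release()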
Next: Send the New Record
It’s a pointer! There’s no copying!
Create the Output Schema
Check if we have it already so we only create it once.
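A sketch of that step; outSchema, idx, newCol, and outCh are names from this hypothetical pipeline, not the library:

    // Build the output schema once, on the first record.
    if outSchema == nil {
        fields := make([]arrow.Field, len(rec.Schema().Fields()))
        copy(fields, rec.Schema().Fields())
        fields[idx].Type = newCol.DataType()
        outSchema = arrow.NewSchema(fields, nil)
    }

    // Reuse every column except the one we rebuilt.
    cols := make([]arrow.Array, rec.NumCols())
    copy(cols, rec.Columns())
    cols[idx] = newCol

    // The record just retains pointers to the arrays: no copying.
    out := array.NewRecord(outSchema, cols, rec.NumRows())
    outCh <- out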
Improvement: Parallelize
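One way to do it, as a sketch: fan the processing stage out across goroutines reading from the same input channel (note this does not preserve record order; process stands in for the transform from the previous slides):

    import (
        "runtime"
        "sync"

        "github.com/apache/arrow/go/v13/arrow"
    )

    func parallelize(in <-chan arrow.Record, process func(arrow.Record) arrow.Record) <-chan arrow.Record {
        out := make(chan arrow.Record)
        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for rec := range in {
                    out <- process(rec)
                }
            }()
        }
        go func() {
            wg.Wait()
            close(out)
        }()
        return out
    }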
Recap: Pipeline So far…
CSV Data (or any Record stream) → channel → Process / Manipulate Records → channel → Write Parquet (next)
Write a Parquet File
Columnar file storage
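A sketch of the final stage using the pqarrow bridge (the compression choice and version path are assumptions):

    import (
        "io"

        "github.com/apache/arrow/go/v13/arrow"
        "github.com/apache/arrow/go/v13/parquet"
        "github.com/apache/arrow/go/v13/parquet/compress"
        "github.com/apache/arrow/go/v13/parquet/pqarrow"
    )

    func writeParquet(w io.Writer, schema *arrow.Schema, recs <-chan arrow.Record) error {
        fw, err := pqarrow.NewFileWriter(schema, w,
            parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)),
            pqarrow.DefaultWriterProps())
        if err != nil {
            return err
        }
        for rec := range recs {
            if err := fw.Write(rec); err != nil {
                return err
            }
            rec.Release()
        }
        return fw.Close() // writes the footer / metadata
    }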
Reader and Writer use io Interfaces
Easy reading and writing of data regardless of location (S3, ADLS, HDFS, etc.)
Parquet
● Reader requires io.ReaderAt and io.Seeker
● Writer only needs io.Writer, great for streams
● Can read Parquet data and metadata directly, or convert directly to/from Arrow
CSV
● Only needs io.Reader and io.Writer
● Control memory usage via chunk options
What about between processes?
https://arrow.apache.org/docs/format/Flight.html
Arrow Flight RPC
● Transport: Protobuf + Arrow IPC streams
● Standardized protocol with clients for many languages
Arrow Flight SQL
● >20x faster than ODBC
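As a sketch, a client reading a stream of records over Flight (the address and ticket contents are placeholders):

    import (
        "context"

        "github.com/apache/arrow/go/v13/arrow/flight"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func readFlight(ctx context.Context, addr string, ticket []byte) error {
        client, err := flight.NewClientWithMiddleware(addr, nil, nil,
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            return err
        }
        defer client.Close()

        stream, err := client.DoGet(ctx, &flight.Ticket{Ticket: ticket})
        if err != nil {
            return err
        }

        // Each message on the stream is an Arrow IPC payload.
        rdr, err := flight.NewRecordReader(stream)
        if err != nil {
            return err
        }
        defer rdr.Release()

        for rdr.Next() {
            rec := rdr.Record()
            _ = rec // process the record...
        }
        return rdr.Err()
    }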
What else can it do?
● Distributed Arrow Flight services to be called by clients in any language
● Deploy composable components to link against using the C Data API
Or get my book!
Q&A
Thanks Everyone!