01 Topol Arrow and Go

Apache Arrow and Go are a good match for building high-performance data pipelines. Arrow provides a columnar format that uses less memory and I/O than row-oriented formats. Go is fast, has good memory-usage characteristics, and makes it easy to build concurrent applications. The Arrow Go module provides readers, writers, and processing functions for common formats such as CSV, JSON, Parquet, and Arrow IPC. A simple example processes a movie-dataset CSV by reading records, transforming a JSON column into a list, and writing the results to Parquet. Together, Arrow and Go enable efficient streaming data pipelines.


Apache Arrow and Go: A Match Made in Data

October 3rd, 2022
Presented by: Matthew Topol
Who am I?

Email: matt@voltrondata.com
Author of “In-Memory Analytics With Apache Arrow”
Staff Software Engineer at Voltron Data
Apache Arrow Contributor
@zeroshade
The Rundown
● Quick Primer on Apache Arrow
● Why Go?
● Simple Code Examples
● Example Code Walkthrough: A Streaming Data Pipeline
● What else can we do?
● More Resources
● Q/A
A quick primer on Apache Arrow
https://arrow.apache.org

● High-performance, in-memory columnar format
● No data serialization / deserialization required!
● Polyglot! Implementations in many languages: Go, C++, Rust, Python, R, Java, Julia, MATLAB, and more…
What is Columnar?

[Diagram: the same table of data laid out as a row-oriented memory buffer vs. an Arrow columnar memory buffer]
Why Columnar?
Memory locality, I/O, and vectorization.

A. Less I/O, lower memory usage, fewer page faults
Get all Archers in Europe: only two columns needed (Archer, Location)!
1. Spin through Locations for indexes
2. Get Archers at those indexes

B. Significantly faster computation!
Calculate the mean of the Year column: only one column needed (Year)!
1. Vectorized operations require contiguous memory
2. Our column is already contiguous memory!
But why Golang??
● Simple and easy to learn
● Go is fast!
● Easier deployment with static binaries
● Built for easy concurrency
● Excellent memory usage characteristics
● Great multi-core usage and scaling
Golang Arrow Module
github.com/apache/arrow/go/v9 (v10 should be released in the next couple of weeks!)

…/arrow
● CSV, JSON, and Arrow IPC readers/writers
● Arrow Flight and Flight SQL client and server
● Supports multiple architectures (AMD64, ARM64, s390x, etc.) and leverages SIMD/NEON
● Low memory usage, high performance

…/parquet
● Parquet reader/writer
● Contains the pqarrow package for easy interoperability between Parquet and Arrow
● Supports multiple architectures (AMD64, ARM64, s390x, etc.) and leverages SIMD/NEON
Let’s start exploring!
The Go Arrow and Parquet libraries
But first… Some Terminology and Types

● Array (arrow.Array): a logical data type, length, null count, and 1 or more Buffers of data
● Record Batch (arrow.Record): a collection of Arrays with the same length, plus a Schema (a collection of Fields)
● Chunked Array (arrow.Chunked): a sequence of arrays with the same data type, a total length, and a total null count
● Table (arrow.Table): a collection of Columns (Chunked Array + Field) with the same total length, plus a schema
Simple Example
Build an Int64 Array

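The original slide shows this code as an image; here is a minimal sketch of what it looks like with the v9 module:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
	// Builders allocate through a memory.Allocator; the default uses Go slices.
	bldr := array.NewInt64Builder(memory.DefaultAllocator)
	defer bldr.Release()

	// Append a batch of values; the nil second argument means "all valid".
	bldr.AppendValues([]int64{1, 2, 3, 4}, nil)
	bldr.AppendNull()

	// NewInt64Array transfers ownership and resets the builder for reuse.
	arr := bldr.NewInt64Array()
	defer arr.Release()

	fmt.Println(arr) // prints: [1 2 3 4 (null)]
}
```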
Memory Handling

Retain / Release
● Reference counting is used to track usage of buffers
● Manage ownership and eagerly try to free memory
● Ties into the Allocator interface for custom handling

memory.Allocator
● Interface for custom memory allocation; the default just uses make([]byte, …)
● Only three methods: Allocate, Reallocate, Free
● CheckedAllocator for tracking memory usage
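A sketch of using the CheckedAllocator in a test to catch leaked buffers (the test name is illustrative):

```go
package pipeline_test

import (
	"testing"

	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

func TestNoLeaks(t *testing.T) {
	// Wrap any allocator to track every Allocate/Free pair.
	mem := memory.NewCheckedAllocator(memory.NewGoAllocator())
	defer mem.AssertSize(t, 0) // fails the test if any bytes were never freed

	bldr := array.NewInt64Builder(mem)
	defer bldr.Release()
	bldr.AppendValues([]int64{1, 2, 3}, nil)

	arr := bldr.NewInt64Array()
	arr.Release() // forgetting this Release would trip AssertSize above
}
```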
Struct Builder
A StructBuilder manages multiple field builders, one per child field. There is a builder for every Array type, and even a RecordBuilder, which works much like the StructBuilder. A sketch follows below.
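A minimal sketch of the StructBuilder pattern (field names are illustrative):

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// buildStructs builds a struct<id: int64, name: utf8> array one value at a time.
func buildStructs(mem memory.Allocator) arrow.Array {
	st := arrow.StructOf(
		arrow.Field{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
	)

	sb := array.NewStructBuilder(mem, st)
	defer sb.Release()

	// Each child field has its own builder, addressed by index.
	idB := sb.FieldBuilder(0).(*array.Int64Builder)
	nameB := sb.FieldBuilder(1).(*array.StringBuilder)

	// Append(true) starts a new (valid) struct value; fill in every field.
	sb.Append(true)
	idB.Append(123)
	nameB.Append("Comedy")

	return sb.NewArray()
}
```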
Reading and Writing Data
Multiple formats supported!

CSV
● Can provide an explicit schema or infer types
● Specify null values, delimiter, line endings

Arrow Record Reader/Writer
● Can control Record Batch chunk size

Parquet
● Highly efficient columnar storage
● Often zero-copy when converting to Arrow
● Can easily read columns and row groups in parallel
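A sketch of constructing a CSV reader with an explicit schema and options (the two-column schema is a stand-in for the real dataset's columns):

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/csv"
)

func readMovies(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// A hypothetical two-column schema; the real dataset has many more columns.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "title", Type: arrow.BinaryTypes.String},
		{Name: "budget", Type: arrow.PrimitiveTypes.Int64, Nullable: true},
	}, nil)

	rdr := csv.NewReader(f, schema,
		csv.WithHeader(true), // first row holds the column names
		csv.WithChunk(1024),  // rows per emitted record batch
	)
	defer rdr.Release()

	for rdr.Next() {
		rec := rdr.Record() // only valid until the next call to Next
		_ = rec             // ...process the record here...
	}
	return rdr.Err()
}
```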
Sample Usage
“The Movies Dataset”

Example: A Streaming Data Pipeline
Yes, it’s contrived. But it’s informative!

1. Read CSV Data
2. Transform / Add / Replace Columns
3. Write out a Parquet File
Example: The Sample Data
Kaggle: “The Movies Dataset”
https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

Most columns are easy:
● bool, int, float, string
● CSV reader can handle nulls for us
● Infer the column types
● Zero-copy transfer to a new arrow.Record

Some columns we want to manipulate:
● String column values that are JSON strings, converted into Lists for easier processing
● Any other streaming transformations you’d like…
Reading CSV Data
● Stream Records via Channels
● Low memory usage, easy parallelism with Golang (a sketch follows below)
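A sketch of streaming record batches over a channel; the function name and channel size are illustrative. The key detail is Retain: the reader releases each record on its next call to Next, so the record must be retained before it crosses the channel.

```go
package main

import (
	"context"
	"io"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/csv"
)

// streamCSV reads record batches in a goroutine and streams them over a channel.
func streamCSV(ctx context.Context, r io.Reader, schema *arrow.Schema) <-chan arrow.Record {
	out := make(chan arrow.Record, 4)
	go func() {
		defer close(out)
		rdr := csv.NewReader(r, schema, csv.WithHeader(true), csv.WithChunk(1024))
		defer rdr.Release()
		for rdr.Next() {
			rec := rdr.Record()
			rec.Retain() // the reader releases rec on its next call to Next
			select {
			case out <- rec: // the consumer is now responsible for Release
			case <-ctx.Done():
				rec.Release()
				return
			}
		}
	}()
	return out
}
```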
Manipulating the Column
Let’s dig into this a bit
Trust me, it’s easier than it looks!
Follow along for the next few slides…
First: A ListBuilder
`[{'id': 123, 'name': 'Comedy'}, {'id': 456, 'name': 'Drama'}]`

● Builders are reusable
● Create a List Column of Structs (see the sketch below)
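A minimal sketch of building one list<struct<id, name>> value by hand, using the genre values from the example above:

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
	"github.com/apache/arrow/go/v9/arrow/memory"
)

// buildGenreList builds a single list<struct<id: int64, name: utf8>> value.
func buildGenreList(mem memory.Allocator) arrow.Array {
	elem := arrow.StructOf(
		arrow.Field{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
	)

	lb := array.NewListBuilder(mem, elem)
	defer lb.Release()
	sb := lb.ValueBuilder().(*array.StructBuilder)

	ids, names := []int64{123, 456}, []string{"Comedy", "Drama"}
	lb.Append(true) // start a new list value (one row of the column)
	for i := range ids {
		sb.Append(true)
		sb.FieldBuilder(0).(*array.Int64Builder).Append(ids[i])
		sb.FieldBuilder(1).(*array.StringBuilder).Append(names[i])
	}

	// NewArray resets the builder, so it can be reused for the next chunk.
	return lb.NewArray()
}
```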
Next: Build Replacement Column
The example is just one column, but it could be any number of columns in parallel.

● Grab the column we want: could find its index via the Schema’s FieldIndices method
● Parse the JSON directly: UnmarshalJSON on a builder parses the JSON and adds the values to the builder (sketched below)
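A sketch of the replacement-column step. It assumes the column is named "genres" (hypothetical) and that each cell has already been normalized to valid JSON like `[{"id": 123, "name": "Comedy"}]`; my understanding is that a builder's UnmarshalJSON consumes a JSON array and appends one value per element.

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
)

// buildGenresColumn re-parses a string column of JSON arrays into a list column.
func buildGenresColumn(rec arrow.Record, lb *array.ListBuilder) (arrow.Array, error) {
	idx := rec.Schema().FieldIndices("genres")[0]
	col := rec.Column(idx).(*array.String)

	sb := lb.ValueBuilder().(*array.StructBuilder)
	for i := 0; i < col.Len(); i++ {
		if col.IsNull(i) {
			lb.AppendNull()
			continue
		}
		lb.Append(true)
		// The struct builder's UnmarshalJSON consumes the cell's JSON array
		// and appends one struct per object in it.
		if err := sb.UnmarshalJSON([]byte(col.Value(i))); err != nil {
			return nil, err
		}
	}
	return lb.NewArray(), nil
}
```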
Next: Send the New Record
It’s a pointer! There’s no copying!

● Create the Output Schema: check if we have it already so we only create it once
● Send the New Record: pass the new record to a different channel, continuing the pipeline
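A sketch of swapping in the new column and forwarding the record (function and parameter names are illustrative); only the slice of column pointers is copied, never the data buffers:

```go
package main

import (
	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/array"
)

// sendReplaced swaps one column and forwards the record down the pipeline.
func sendReplaced(rec arrow.Record, idx int, newCol arrow.Array,
	newSchema *arrow.Schema, out chan<- arrow.Record) {

	cols := make([]arrow.Array, rec.NumCols())
	copy(cols, rec.Columns())
	cols[idx] = newCol

	// NewRecord retains the columns; the underlying buffers are shared.
	out <- array.NewRecord(newSchema, cols, rec.NumRows())
	rec.Release() // done with the input record
}
```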
Improvement: Parallelize

Goroutines and Channels enable extremely easy parallel patterns such as fan-out/fan-in.
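A generic fan-out/fan-in sketch: n workers pull from a shared input channel and merge their results into one output channel (names are illustrative).

```go
package main

import (
	"sync"

	"github.com/apache/arrow/go/v9/arrow"
)

// fanOutIn runs n workers over the input channel and merges their output.
func fanOutIn(in <-chan arrow.Record, n int,
	xform func(arrow.Record) arrow.Record) <-chan arrow.Record {

	out := make(chan arrow.Record, n)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() { // fan-out: each worker pulls from the shared input
			defer wg.Done()
			for rec := range in {
				out <- xform(rec) // fan-in: all workers share one output
			}
		}()
	}
	go func() { // close the output once every worker has finished
		wg.Wait()
		close(out)
	}()
	return out
}
```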
Recap: Pipeline So Far…

CSV Data (or any Record stream) → Channel → Process / Manipulate Records → Channel → Write Parquet (next)
Write a Parquet File
● Columnar file storage
● Optimized Arrow → Parquet conversion
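A sketch of draining the record channel into a Parquet file via the pqarrow package (the Snappy compression choice is illustrative):

```go
package main

import (
	"io"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/parquet"
	"github.com/apache/arrow/go/v9/parquet/compress"
	"github.com/apache/arrow/go/v9/parquet/pqarrow"
)

// writeParquet consumes the pipeline's output channel and writes a Parquet file.
func writeParquet(w io.Writer, schema *arrow.Schema, in <-chan arrow.Record) error {
	props := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy))
	fw, err := pqarrow.NewFileWriter(schema, w, props, pqarrow.DefaultWriterProps())
	if err != nil {
		return err
	}
	for rec := range in {
		if err := fw.Write(rec); err != nil {
			return err
		}
		rec.Release() // we own each record that crossed the channel
	}
	return fw.Close()
}
```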
Reader and Writer use io Interfaces
Easy reading and writing of data regardless of location (S3, ADLS, HDFS, etc.)

Parquet
● Reader requires io.ReaderAt and io.Seeker
● Writer only needs io.Writer, great for streams
● Can read Parquet data and metadata directly, or convert directly to/from Arrow

CSV
● Only needs io.Reader and io.Writer
● Control memory usage via Chunk options
What about between processes?
Efficient Data Transportation
https://arrow.apache.org/docs/format/Flight.html

Arrow IPC
● Communicate record batches locally or remotely
● File and streaming formats
● Can mmap for efficiency

Arrow Flight RPC / Arrow Flight SQL
● Protobuf + Arrow IPC streams
● Standardized protocol for many clients
● >20x faster than ODBC
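A sketch of the IPC streaming format: because the writer only needs an io.Writer, the same code works for a file, a pipe, or a network socket.

```go
package main

import (
	"io"

	"github.com/apache/arrow/go/v9/arrow"
	"github.com/apache/arrow/go/v9/arrow/ipc"
)

// writeIPCStream sends record batches over any io.Writer using the IPC
// streaming format.
func writeIPCStream(w io.Writer, schema *arrow.Schema, recs ...arrow.Record) error {
	wr := ipc.NewWriter(w, ipc.WithSchema(schema))
	defer wr.Close()
	for _, rec := range recs {
		if err := wr.Write(rec); err != nil {
			return err
		}
	}
	return nil
}
```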
What else can it do?
Some uses for Apache Arrow in Go:

● Distributed Arrow Flight services to be called by clients in any language
● Deploying composable components to link against using the C Data API
● Efficient CLI utilities for data manipulation from remote data sources
● Building highly concurrent, deployable data pipelines
● Building an Arrow-native Computation Engine or custom Database
● Composable services to offload data computation and analysis
Want more examples?
More on Apache Arrow: https://arrow.apache.org/docs/

Or get my book, “In-Memory Analytics with Apache Arrow”!
● Examples in multiple languages: Python / C++ / Go
● Practical examples for Arrow Flight and other Data Science workflows
● Amazon link for the book: buff.ly/3OcoxyB

Go Arrow/Parquet docs: https://pkg.go.dev/github.com/apache/arrow/go/v9
Q&A

Thanks Everyone!

The Go Gopher image is released under the Creative Commons Attribution 3.0 License, originally created by artist Renee French.
XKCD comics are released under the Creative Commons Attribution-NonCommercial 2.5 License, created by Randall Munroe (https://xkcd.com).
