PyTables: An On-Disk Binary Data Container, Query Engine and Computational Kernel
Francesc Alted
Tutorial for the PyData Conference, October 2012, New York City
10th anniversary of PyTables
"Hi!,
PyTables is a Python package which allows dealing with HDF5 tables. Such
a table is defined as a collection of records whose values are stored in
fixed-length fields. PyTables is intended to be easy-to-use, and tries to
be a high-performance interface to HDF5. To achieve this, the newest
improvements introduced in Python 2.2 (like generators or slots and
metaclasses in new-brand classes) has been used. Pyrex creation extension
tool has been chosen to access the HDF5 library.
This package should be platform independent, but until now I've tested it
only with Linux. It's the first public release (v 0.1), and it is in
alpha state.
"
-- Francesc Alted announcing PyTables 0.1, October 2002
Overview
What is PyTables?
Data structures in PyTables
Compressing data
Advanced capabilities in PyTables
Notebooks for the tutorial:
http://pytables.org/download/PyData2012-NYC.tar.gz
What it is
A binary data container for on-disk, structured data
Can perform operations with data *on-disk*
Based on the de facto standard HDF5 format
Free software (BSD license)
About HDF5
(Hierarchical Data Format version 5)
A versatile data model that can represent complex data objects as well as associated metadata
A portable file format with no limit on the number or size of data objects in the collection
Implements a high-level API with C, C++, Fortran 90, and Java interfaces
Free software (BSD/MIT-style license)
PyTables distinctive features
Supports a good range of compressors: Zlib, bzip2, LZO and Blosc
Powerful query capabilities for Table objects, including indexing
Can perform out-of-core operations very efficiently
What it is not
Not a relational database replacement
Not a distributed database
Not extremely secure or safe (it's more about speed!)
Not a mere HDF5 wrapper
DATA STRUCTURES
Data structures
High level of flexibility for structuring your data:
Data types: scalars (numerical & strings), records, enumerated, time
Tables support multidimensional cells and nested records
Multidimensional arrays
Variable length arrays
The Array object
Easy to create:
file.createArray(mygroup, 'array', numpy_arr)
Shape cannot change
Cannot be compressed
The CArray object
Data is stored in chunks
Each chunk can be compressed independently
Shape cannot change
The EArray object
Data is stored in chunks
Can be compressed
Shape can change (either enlarged or shrunk)
Shape must be kept regular
The VLArray object
Data is stored in variable length rows
Can be enlarged or shrunk
Data cannot be compressed
The Table object
Data is stored in chunks
Can be compressed
Can be enlarged or shrunk
Fields cannot be of variable length
Example columns: Col1 (int32), Col2 (string10), Col3 (bool), Col4 (complex64), Col5 (float32)
Dataset hierarchy
root
group1
table1 table2
group2
array
Attributes:
Metadata about data
table1
Date: Jul 24 2006
Observations: 555
CF: [0.1, 0.3, 0.6]
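A hierarchy like this one, with attributes attached to a table, might be built as follows. The sketch rests on several assumptions: group1 and group2 are both placed directly under the root (the original diagram's layout is not recoverable from the text), the column names `x` and `y` and the lower-case attribute names are hypothetical, and the PyTables 3.x snake_case API is used.

```python
import os
import tempfile

import numpy as np
import tables


def build_hierarchy(path):
    """Create /group1/{table1,table2} and /group2/array, then tag table1."""
    with tables.open_file(path, mode="w") as h5:
        group1 = h5.create_group("/", "group1")
        group2 = h5.create_group("/", "group2")

        # A structured NumPy dtype is enough to define a table's columns.
        records = np.zeros(3, dtype=[("x", "i4"), ("y", "f8")])
        table1 = h5.create_table(group1, "table1", description=records.dtype)
        table1.append(records)
        table2 = h5.create_table(group1, "table2", description=records.dtype)
        table2.append(records)
        h5.create_array(group2, "array", np.arange(5))

        # Attributes hold metadata about the node they are attached to.
        table1.attrs.date = "Jul 24 2006"
        table1.attrs.observations = 555
        table1.attrs.cf = [0.1, 0.3, 0.6]

    # Reopen read-only and walk the tree to confirm what was stored.
    with tables.open_file(path, mode="r") as h5:
        paths = sorted(node._v_pathname for node in h5.walk_nodes("/"))
        attrs = h5.root.group1.table1.attrs
        return paths, str(attrs.date), int(attrs.observations), list(attrs.cf)


paths, date, observations, cf = build_hierarchy(
    os.path.join(tempfile.mkdtemp(), "hierarchy.h5"))
```

Nodes are addressed by path (`/group1/table1`) or by attribute access (`h5.root.group1.table1`), and the metadata travels in the same file as the data it describes.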
COMPRESSION CAPABILITIES
Why compression?
Lets you store more data using the same space
Uses more CPU, but CPU time is cheap compared with disk access
Different compressors for different uses: bzip2, zlib, lzo, blosc
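The space saving is easy to see with the two compressors that ship in the Python standard library. This is only a rough sketch on an artificial, highly regular dataset (the value count and compression levels are arbitrary choices); LZO and Blosc are not in the standard library and are left out.

```python
import bz2
import struct
import zlib


def compressed_sizes(n_values=100_000):
    """Pack a regular integer sequence and compress it with zlib and bz2."""
    # 32-bit little-endian integers 0, 1, 2, ... are highly redundant data.
    raw = struct.pack("<%di" % n_values, *range(n_values))
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 5)),
        "bz2": len(bz2.compress(raw, 5)),
    }


def roundtrip_ok(payload):
    """Both compressors are lossless: decompression restores the input."""
    return (zlib.decompress(zlib.compress(payload)) == payload
            and bz2.decompress(bz2.compress(payload)) == payload)


sizes = compressed_sizes()
lossless = roundtrip_ok(b"PyTables" * 1000)
```

Real data compresses less predictably than this sequence, so the ratio and the CPU cost should be measured on a sample of the actual dataset before choosing a compressor.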
Why Blosc?
Accelerating I/O
[Figure: the memory hierarchy ordered by capacity and speed (mechanical disk, solid state disk, main memory, level 3, level 2 and level 1 caches, CPU), with braces marking the levels addressed by Blosc and by other compressors]
Memory access vs CPU cycle time
Laptop computer back in 2005
State of the art computer in 2012 (single node)
OUT-OF-CORE COMPUTATIONS
Operating with disk-based arrays
tables.Expr is an optimized evaluator for expressions of disk-based arrays.
It is a combination of the Numexpr advanced computing capabilities with the high I/O performance of PyTables.
Similarly to Numexpr, disk-temporaries are avoided, and multi-threaded operation is preserved.
Avoiding temporaries with Numexpr
tables.Expr follows the same approach, but with disk instead of memory
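A small `tables.Expr` evaluation might look like the sketch below. The expression, array names and sizes are made up for illustration, zlib is used only because it is always available (Blosc would be selected with `complib="blosc"`), and the snake_case API of PyTables 3.x is assumed.

```python
import os
import tempfile

import numpy as np
import tables


def evaluate_on_disk(path, n=10_000):
    """Compute 3*a + b**2 with operands and result all stored on disk."""
    filters = tables.Filters(complevel=5, complib="zlib")
    atom = tables.Float64Atom()
    with tables.open_file(path, mode="w") as h5:
        a = h5.create_carray("/", "a", atom=atom, shape=(n,), filters=filters)
        b = h5.create_carray("/", "b", atom=atom, shape=(n,), filters=filters)
        out = h5.create_carray("/", "out", atom=atom, shape=(n,),
                               filters=filters)
        a[:] = np.arange(n, dtype="float64")
        b[:] = np.linspace(0.0, 1.0, n)

        # Operands are passed explicitly instead of being looked up by name
        # in the calling scope.
        expr = tables.Expr("3*a + b**2", uservars={"a": a, "b": b})
        expr.set_output(out)  # write the result into the on-disk array
        expr.eval()

        # Read everything back into memory only to check the result.
        return a[:], b[:], out[:]


a_mem, b_mem, result = evaluate_on_disk(
    os.path.join(tempfile.mkdtemp(), "expr.h5"))
```

Because the output is itself a disk-based array, neither the operands nor the result need to fit in memory at once; the final reads here exist only for verification.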
Performing out-of-core computations with PyTables
[Figure: data flow for an out-of-core computation. Labels: Disk; Filesystem cache (in-memory); Dataset 1, Dataset 2 and Result, each split into Chunk 1 ... Chunk N; Blosc decompression; CPU cache holding Block 1 ... Block N of each operand, combined with "+"; Blosc compression; (compressed data), (compressed data), (uncompressed data); Virtual machine: numexpr]
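The chunk-by-chunk flow in this figure can be imitated in pure Python. This is a conceptual sketch only, not how PyTables is implemented: zlib stands in for Blosc, a list of compressed byte strings stands in for the HDF5 chunks on disk, and `CHUNK_LEN` and all function names are hypothetical.

```python
import struct
import zlib

CHUNK_LEN = 1024  # values per chunk; an arbitrary illustrative size


def pack_chunk(values):
    """Serialize a list of floats and compress it as one chunk."""
    return zlib.compress(struct.pack("<%dd" % len(values), *values))


def unpack_chunk(chunk):
    """Decompress one chunk back into a list of floats."""
    raw = zlib.decompress(chunk)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


def to_chunks(values):
    """Split a sequence of floats into independently compressed chunks."""
    return [pack_chunk(values[start:start + CHUNK_LEN])
            for start in range(0, len(values), CHUNK_LEN)]


def add_chunked(chunks_a, chunks_b):
    """Add two chunked datasets, yielding one compressed result chunk at a time.

    Only a single pair of decompressed chunks is held in memory at once.
    """
    for chunk_a, chunk_b in zip(chunks_a, chunks_b):
        block_a = unpack_chunk(chunk_a)
        block_b = unpack_chunk(chunk_b)
        yield pack_chunk([x + y for x, y in zip(block_a, block_b)])


dataset1 = to_chunks([float(i) for i in range(5000)])
dataset2 = to_chunks([2.0 * i for i in range(5000)])
result_chunks = list(add_chunked(dataset1, dataset2))
total = [v for chunk in result_chunks for v in unpack_chunk(chunk)]
```

The result is recompressed here for symmetry; whether the output is stored compressed is a separate storage decision.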
ADVANCED QUERY CAPABILITIES
Different query modes
Regular query:
[ r['c1'] for r in table
  if r['c2'] > 2.1 and r['c3'] == True ]
In-kernel query:
[ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
Indexed query:
table.cols.c2.createIndex()
table.cols.c3.createIndex()
[ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
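The three query modes can be compared on a small synthetic table. This is a sketch under assumptions: the column types and data are invented, only `c2` is indexed to keep it short, and the PyTables 3.x name `create_index` replaces the `createIndex` spelling used in 2012.

```python
import os
import tempfile

import numpy as np
import tables


def run_queries(path, n=1000):
    """Run the same selection as a regular, in-kernel and indexed query."""
    dtype = np.dtype([("c1", "i4"), ("c2", "f8"), ("c3", "?")])
    data = np.zeros(n, dtype=dtype)
    data["c1"] = np.arange(n)
    data["c2"] = np.linspace(0.0, 4.0, n)
    data["c3"] = np.arange(n) % 2 == 0

    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "table", description=dtype)
        table.append(data)
        table.flush()

        # Regular query: every row is tested in the Python interpreter.
        regular = [r["c1"] for r in table if r["c2"] > 2.1 and r["c3"]]

        # In-kernel query: the condition string is evaluated by Numexpr.
        condition = "(c2 > 2.1) & (c3 == True)"
        in_kernel = [r["c1"] for r in table.where(condition)]

        # Indexed query: the same call, made after an index exists on c2.
        table.cols.c2.create_index()
        indexed = [r["c1"] for r in table.where(condition)]

    return regular, in_kernel, indexed


regular, in_kernel, indexed = run_queries(
    os.path.join(tempfile.mkdtemp(), "queries.h5"))
```

All three modes select the same rows; they differ in where the condition is evaluated and in how much of the table has to be read.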
Regular and in-kernel queries
Customizable indexes
Indexed query performance
[Figure: PyTables Pro query performance, from Francesc Alted, "Large Data Analysis with Python"]
Concepts to take home
PyTables is optimized to deal with data on disk
Most of the operations use the iterator/generator machinery in Python: the goal is not to bloat memory with data
Queries, indexes and out-of-core operations are good examples of the above
Thank you!
Questions?