Tutorial: HBase
Pietro Michiardi
Eurecom
Introduction
Batch processing
Hadoop and MapReduce:
I Excels at storing (semi- and/or un-) structured data
I Data interpretation takes place at analysis-time
I Flexibility in data classification
Random Access:
I Users need to “interact” with data, especially data “crunched” by a
MapReduce job
I This is historically where RDBMS excel: random access for
structured data
Column-Oriented Databases
Data layout:
I Save their data grouped by columns
I Subsequent column values are stored contiguously on disk
I This is substantially different from traditional RDBMS, which save
and store data by row
Important NOTE:
I HBase is not a column-oriented DB in the typical sense
I HBase uses an on-disk column storage format
I Provides key-based access to a specific cell of data, or a sequential
range of cells
Example: Hush
I Used throughout this course
I URL shortener service
The Problem with RDBMS
Stored Procedures
I Consistently update data from multiple clients
I Underlying DB system guarantees coherency
Transactions
I Make sure you can update tables in an atomic fashion
I RDBMS → Strong Consistency (ACID properties)
I Referential Integrity
Master-Slave architecture
I Add DB servers so that READS can be served in parallel
I Master DB takes all the writes (which are fewer in the Hush
application)
I Slave DBs replicate the Master DB and serve all reads (but you need a
load balancer)
Caching
I Add a caching layer, e.g. Memcached or Redis
I Offload READS to a fast in-memory system
→ You lose consistency guarantees
→ Cache invalidation is critical for keeping the DB and the caching layer
consistent
Scaling up more
I WRITES are the bottleneck
I The master DB is hit too hard by WRITE load
I Vertical scalability : beef up your master server
→ This becomes costly, as you may also have to replace your RDBMS
Sharding
I Partition your data across multiple databases
F Essentially you break your tables horizontally and ship them to
different servers
F This is done using fixed boundaries
Non-Relational Databases
Data model
I How the data is stored: key/value, semi-structured,
column-oriented, ...
I How to access data?
I Can the schema evolve over time?
Storage model
I In-memory or persistent?
I How does this affect your access pattern?
Consistency model
I Strict or eventual?
I This translates into how fast the system handles READS and WRITES [2]
Physical Model
I Distributed or single machine?
I How does the system scale?
Read/Write performance
I Top-down approach: understand your workload well!
I Some systems are better for READS, other for WRITES
Secondary indexes
I Does your workload require them?
I Can your system emulate them?
Failure Handling
I How does each data store handle server failures?
I Is it able to continue operating in case of failures?
F This is related to Consistency models and the CAP theorem
I Does the system support “hot-swap”?
Compression
I Is the compression method pluggable?
I What type of compression?
Load Balancing
I Can the storage system seamlessly balance load?
Atomic read-modify-write
I Easy in a centralized system, difficult in a distributed one
I Prevent race conditions in multi-threaded or shared-nothing designs
I Can reduce client-side complexity
Impedance Match
“One-size-fits-all” has long been dismissed: you need to find the best
match for your problem.
Database (De-)Normalization
Denormalization
I Duplicate data in more than one table such that at READ time no
further aggregation is required
Next: an example based on Hush
I How to convert a classic relational data model to one that fits
HBase
Figure: The Hush schema expressed as an ERD
What is BigTable?
I BigTable is a distributed storage system for managing structured
data designed to scale to a very large size
I BigTable is a sparse, distributed, persistent multi-dimensional
sorted map
What is HBase?
I Essentially it’s an open-source version of BigTable
I Differences listed in [5]
Columns
I Rows are composed of columns
I Can have millions of columns
I Can be compressed or tagged to stay in memory
Column Families
I Columns are grouped into column families
→ Semantical boundaries between data
I Column families and their columns are stored together in the same low-level
storage files
References to columns
I The column “name” is called the qualifier, and can be an arbitrary
number of bytes
I Reference: family:qualifier
A cell
I Every column value, or cell, is timestamped (implicitly or explicitly)
F This can be used to save multiple versions of a value that changes
over time
F Versions are stored in decreasing timestamp order, most recent first
I Cell versions can be constrained by predicate deletions
F Keep only values from the last week
Which means:
I The first SortedMap is the table, containing a List of column
families
I The families contain another SortedMap, representing columns
and a List of value, timestamp tuples
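As an illustration, here is a minimal Java sketch of this nested “map of maps” view, using plain JDK collections (not HBase classes; the row, family and qualifier names are made up):

import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative model of the logical HBase data model:
// row key -> column family -> qualifier -> timestamp -> value
public class LogicalTableModel {
    public static void main(String[] args) {
        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>>
                table = new TreeMap<>();

        // Versions of a cell are sorted by timestamp in descending order (most recent first)
        NavigableMap<Long, byte[]> cell = new TreeMap<>(Collections.reverseOrder());
        cell.put(System.currentTimeMillis(), "http://hush.li/a23eg".getBytes());

        table.computeIfAbsent("user1-url1", r -> new TreeMap<>())   // row key
             .computeIfAbsent("data", f -> new TreeMap<>())         // column family
             .put("shortUrl", cell);                                // qualifier -> versions

        // Rows, families and qualifiers all come back sorted by key
        System.out.println(table.firstKey());
    }
}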
Automatic Sharding
Region
I This is the basic unit of scalability and load balancing
I Regions are contiguous ranges of rows stored together → they are
the equivalent of range partitions in sharded RDBMS
I Regions are dynamically split by the system when they become too
large
I Regions can also be merged to reduce the number of storage files
Regions in practice
I Initially, there is one region
I System monitors region size: if a threshold is attained, SPLIT
F Regions are split in two at the middle key
F This creates roughly two equivalent (in size) regions
Automatic Sharding
Region Servers
I Each region is served by exactly one Region Server
I Region servers can serve multiple regions
I The number of regions per server and their sizes depend on the
capability of a single region server
Server failures
I Regions allow for fast recovery upon failure
I Fine-grained Load Balancing is also achieved using regions as they
can be easily moved across servers
Storage API
Scan API
I Allows fast iteration over ranges of rows
I Allows limiting the number of columns returned, and which ones
I Allows controlling the number of versions returned for each cell
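A sketch of the Scan API with the classic HBase client (the “url” table and the data:shortUrl column are hypothetical Hush-style names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "url");           // hypothetical Hush table

        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("user1"));         // iterate over a range of rows
        scan.setStopRow(Bytes.toBytes("user2"));
        scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("shortUrl")); // limit which columns
        scan.setMaxVersions(1);                           // control how many versions per cell

        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()));
        }
        scanner.close();
        table.close();
    }
}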
Read-modify-write API
I HBase supports single-row transactions
I Atomic read-modify-write on data stored in a single row key
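One way to use this from the classic client API is checkAndPut(), which performs the compare-and-update atomically on the server for a single row; a sketch with made-up table and column names:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "url");

        byte[] row = Bytes.toBytes("user1-url1");
        byte[] family = Bytes.toBytes("data");
        byte[] qualifier = Bytes.toBytes("owner");

        Put put = new Put(row);
        put.add(family, qualifier, Bytes.toBytes("user1"));

        // Atomically write only if the current value matches the expected one;
        // the whole check-and-update happens server-side on a single row
        boolean applied = table.checkAndPut(row, family, qualifier,
                Bytes.toBytes("nobody"), put);
        System.out.println("applied: " + applied);
        table.close();
    }
}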
Storage API
Counters
I Values can be interpreted as counters and updated atomically
I Can be read and modified in one operation
→ Implement global, strictly consistent, sequential counters
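A sketch of the counter API with the classic client (table, row and column names are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "counters");

        // Atomically add 1 to the cell and get the new value back in one round trip
        long clicks = table.incrementColumnValue(
                Bytes.toBytes("url-a23eg"),   // row key
                Bytes.toBytes("daily"),       // column family
                Bytes.toBytes("hits"),        // qualifier
                1L);                          // amount to add

        System.out.println("clicks so far: " + clicks);
        table.close();
    }
}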
Coprocessors
I These are equivalent to stored-procedures in RDBMS
I Allow pushing user code into the address space of the server
I Access to server local data
I Implement lightweight batch jobs, data pre-processing, data
summarization
Data lookups
I Since HFiles have a block index, lookup can be done with a single
disk seek
I First, the block possibly containing a given lookup key is determined
with a binary search in the in-memory index
I Then a block read is performed to find the actual key
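Conceptually, the in-memory index lookup is a binary search over the first key of each block; a simplified, illustrative sketch (plain Java, not the actual HFile code):

import java.util.Arrays;

// Illustrative only: HFile keeps the first key of each block in an in-memory index;
// a lookup binary-searches that index, then reads the single block that may hold the key
public class BlockIndexLookup {
    static String[] blockIndex = { "apple", "grape", "melon", "pear" };

    static int findBlock(String key) {
        int pos = Arrays.binarySearch(blockIndex, key);
        if (pos >= 0) return pos;          // key is exactly the first key of a block
        int insertion = -pos - 1;          // otherwise the preceding block covers it
        return Math.max(0, insertion - 1);
    }

    public static void main(String[] args) {
        // "kiwi" falls inside the block starting at "grape" -> index 1 -> one block read
        System.out.println(findBlock("kiwi"));
    }
}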
Data Locality
I Achieved by the system by looking up server hostnames
I Achieved through intelligent key design
HBase implementation
Deleting data
I Since HFiles are immutable, how can we delete data?
I A delete marker (also known as tombstone marker ) is written to
indicate that a given key is deleted
I During the read process, data marked as deleted is skipped
I Compactions (see next slides) finalize the deletion process
READ operation
I Merge of what is stored in the memstores (data that is not on disk)
and in the HFiles
I The WAL is never used in the READ operation
I Several API calls to read, scan data
Minor Compaction
I Rewrites small HFiles into fewer, larger HFiles
I This is done using an n-way merge (essentially a merge sort)
Major Compaction
I Rewrites all files within a column family or a region in a new one
I Drop deleted data
I Perform predicated deletion (e.g. delete old data)
Region Servers
I Handle READs and WRITEs
I Handle region splitting
Architecture
B+ Trees
LSM-Trees
Data flow
I Incoming data is first stored in a logfile, sequentially
I Once the log has the modification saved, data is pushed in memory
F In-memory store holds most recent updates for fast lookup
I When memory is “full”, data is flushed to a store file on disk, as a
sorted list of key → record pairs
I At this point, the log file can be thrown away
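A toy sketch of that flow (append to a log, buffer in a sorted in-memory store, flush as an immutable sorted file); purely illustrative and much simplified compared to the real implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy LSM-style writer: log append -> in-memory sorted store -> sorted flush
public class TinyLsm {
    private final List<String> log = new ArrayList<>();               // stands in for the logfile
    private final TreeMap<String, String> memStore = new TreeMap<>(); // most recent updates
    private final List<TreeMap<String, String>> storeFiles = new ArrayList<>(); // "on-disk" files
    private static final int FLUSH_THRESHOLD = 3;

    void put(String key, String value) {
        log.add(key + "=" + value);          // 1. write sequentially to the log
        memStore.put(key, value);            // 2. then update the in-memory store
        if (memStore.size() >= FLUSH_THRESHOLD) {
            storeFiles.add(new TreeMap<>(memStore)); // 3. flush as a sorted key -> record list
            memStore.clear();
            log.clear();                     // 4. the log can now be thrown away
        }
    }

    public static void main(String[] args) {
        TinyLsm lsm = new TinyLsm();
        lsm.put("b", "2"); lsm.put("a", "1"); lsm.put("c", "3"); // third put triggers a flush
        System.out.println(lsm.storeFiles);
    }
}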
LSM-Trees
Clean-up process
I As flushes take place over time, a lot of store files are created
I Background process aggregates files into larger ones to limit disk
seeks
I All store files are always sorted by key → no re-ordering required to
fit new keys in
Data Lookup
I Lookups are done in a merging fashion
F First lookup in the in-memory store
F On a miss, look up in the on-disk store
Deleting data
I Use a delete marker
I When pages are re-written, deleted markers and keys are
eventually dropped
I Predicate deletion happens here
Seek vs. Transfer
B+ Tree [1]
I Work well when there are not so many updates
I The more (and the faster) you insert data at random locations, the
faster pages get fragmented
I Updates and deletes are done at disk seek rates, rather than
transfer rates
LSM-Tree [7]
I Work at disk transfer rate and scale better to huge amounts of data
I Guarantee a consistent insert rate
F They transform random into sequential writes
I Reads are independent from writes
I Optimized data layout which offers predictable boundaries on disk
seeks
Storage
Overview
HBase handles two kinds of files
I One is used for the WAL
I One is used for the actual data storage
General communication flow
I A client contacts ZooKeeper when trying to access a particular row
I It retrieves from ZooKeeper the name of the server that hosts the -ROOT-
region
I Using the -ROOT- information, the client retrieves the name of the server
that hosts the .META. table region
F The .META. table region contains the row key in question
I The client contacts the reported .META. server and retrieves the name of the
server that has the region containing the row key in question
Caching
I Generally, lookup procedures involve caching row key locations for
faster subsequent lookups
Important Java Classes
I HRegionServer handles one or more regions and creates the
corresponding HRegion objects
I When an HRegion object is opened, it creates a Store instance for
each HColumnFamily
I Each Store instance can have:
F One or more StoreFile instances
F A MemStore instance
I HRegionServer has a shared HLog instance
Storage
Write Path
An external client inserts data into HBase
I It issues an HTable.put(Put) request to the HRegionServer
I The HRegionServer hands the request to the HRegion instance that
matches the request [Q: What is the matching criterion?]
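Seen from the client, this path starts with a plain put(); a sketch using the classic API (table and column names are hypothetical). The matching is done on the row key: the request goes to the region whose start/stop key range contains it.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "url");

        // The row key decides which HRegion (and hence which HRegionServer)
        // receives this mutation: the region whose [start, stop) key range contains it
        Put put = new Put(Bytes.toBytes("user1-url1"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("shortUrl"),
                Bytes.toBytes("http://hush.li/a23eg"));

        table.put(put);   // goes to the WAL and the MemStore of the matching region
        table.close();
    }
}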
Storage
HBase Files
What and where are HBase files (including WAL, HFile,...)
stored?
I HBase has a root directory set to “/hbase” in HDFS
I Files can be divided into:
F Those that reside under the HBase root directory
F Those that are in the per-table directories
/hbase
I .logs
I .oldlogs
I .hbase.id
I .hbase.version
I /example-table
/example-table
I .tableinfo
I .tmp
I “...Key1...”
F .oldlogs
F .regioninfo
F .tmp
F colfam1/
colfam1/
I “....column-key1...”
Storage
HBase: Region-level files
Inside each table dir, there is a separate dir for every region
in the table
The name of each of these dirs is the MD5 hash of the region name
I Inside each region there is a directory for each column family
I Each column family directory holds the actual data files, namely
HFiles
I Their name is just an arbitrary random number
Each region directory also has a .regioninfo file
I Contains the serialized information of the HRegionInfo instance
Split Files
I Once the region needs to be split, a splits directory is created
F This is used to stage two daughter regions
F If split is successful, daughter regions are moved up to the table
directory
Storage
HBase: Compaction
Process that takes care of re-organizing store files
I Essentially to conform to underlying filesystem requirements
I The compaction check is triggered when the memstore is flushed
Relevant configuration parameters (defaults):
I hbase.hstore.compaction.max: 10
I hbase.hstore.compaction.min.size
I hbase.hstore.compaction.max.size: Long.MAX_VALUE
I hbase.hstore.compaction.ratio: 1.2
Storage
HFile format
Store files are implemented by the HFile class
I Efficient data storage is the goal
Storage
The KeyValue Format
Each KeyValue in the HFile is a low-level byte array
I It allows for zero-copy access to the data
Format
I A fixed-length preamble indicates the length of the key and of the value
F This is useful to offset into the array to get direct access to the value,
ignoring the key
I Key format
F Contains row key, column family name, column qualifier...
F [TIP]: consider small keys to avoid overhead when storing small
data
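A sketch that builds and inspects a KeyValue with the classic API, to make the overhead point concrete (row, column and value are made up):

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueExample {
    public static void main(String[] args) {
        // Row key, family, qualifier, timestamp and value are all packed
        // into one low-level byte array
        KeyValue kv = new KeyValue(
                Bytes.toBytes("user1-url1"),            // row key
                Bytes.toBytes("data"),                  // column family
                Bytes.toBytes("shortUrl"),              // qualifier
                System.currentTimeMillis(),             // timestamp
                Bytes.toBytes("http://hush.li/a23eg")); // value

        // The key part (row + family + qualifier + timestamp) dominates the
        // overhead when the value itself is small
        System.out.println("key length:   " + kv.getKeyLength());
        System.out.println("value length: " + kv.getValueLength());
        System.out.println("row:          " + Bytes.toString(kv.getRow()));
    }
}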
Write Path
I The client modifies data (put(), delete(), increment())
I Modifications are wrapped into KeyValue objects
I KeyValue objects are batched to the corresponding HRegionServer
I KeyValue objects are routed to the corresponding HRegion
I KeyValue objects are written to the WAL and to the MemStore
Read Path
Region Lookups
Key Design
Concepts
Logical vs. on-disk layout of a table
I Main unit of separation within a table is the column family
I The actual columns (as opposed to other column-oriented DB) are
not used to separate data
I Although cells are stored logically in a table format, rows are stored
as linear sets of the cells
I Cells contain all the vital information inside them
Versioning
I Multiple versions of the same cell are stored consecutively, together
with their timestamps
I Cells are sorted in descending order of timestamp
KeyValue object
I The entire cell, with all of its structural information, is a KeyValue
object
I Contains: row key, column key (column family:qualifier), timestamp
and value
I Sorted by row key first, then by column key
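A sketch of reading back several versions of a cell with the classic client API (table and column names are made up):

import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "url");

        Get get = new Get(Bytes.toBytes("user1-url1"));
        get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("shortUrl"));
        get.setMaxVersions(3);   // ask for up to 3 versions of the cell

        Result result = table.get(get);
        // Versions come back in descending timestamp order, most recent first
        List<KeyValue> versions = result.getColumn(
                Bytes.toBytes("data"), Bytes.toBytes("shortUrl"));
        for (KeyValue kv : versions) {
            System.out.println(kv.getTimestamp() + " -> "
                    + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}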
Tall-Narrow Tables
I Few columns
I Many rows
Flat-Wide Tables
I Many columns
I Few rows
I If all data of an entity is kept in a single row, some rows may grow much bigger
than others
→ A single row could outgrow the maximum file/region size and work
against the split facility
I If row keys are sequential, e.g. built from the current
time
→ HBase will store all rows sorted in a distinct range, namely regions
with specific start and stop keys
→ All new WRITES go to the region holding the newest keys, i.e. a single
machine
Salting example
// Prefix the time-based row key with a salt byte, so that consecutive
// timestamps spread over the region servers
byte prefix = (byte) (Long.hashCode(timestamp) % <number of region servers>);
byte[] rowkey = Bytes.add(Bytes.toBytes(prefix),
Bytes.toBytes(timestamp));
- You can only access data (especially time ranges) for a given
swapped or promoted field (but this could be a feature)
+ You achieve load balancing
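The read-side cost of salting is that a time-range scan must now be issued once per salt prefix and merged client-side; a sketch under the same assumptions as the salting code above (imports as in the earlier Scan example; numRegionServers stands for the placeholder used there):

// Assumes an open HTable "table", a salt range "numRegionServers",
// and a time range [start, end) as in the salting example
static void scanSalted(HTable table, int numRegionServers, long start, long end)
        throws IOException {
    for (byte prefix = 0; prefix < numRegionServers; prefix++) {
        Scan scan = new Scan(
                Bytes.add(new byte[] { prefix }, Bytes.toBytes(start)),  // salted start row
                Bytes.add(new byte[] { prefix }, Bytes.toBytes(end)));   // salted stop row
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            // merge results for this prefix with those of the other prefixes
        }
        scanner.close();
    }
}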
Summary
MapReduce Integration
Introduction
Main classes involved in MapReduce
I InputFormat → TableInputFormat (builds on TableInputFormatBase),
configured with a Scan
I Mapper → TableMapper
I OutputFormat → TableOutputFormat
OutputFormat
Used to persist data
I Output written to files
I Output written to HBase tables
F This is done using a TableRecordWriter
OutputFormat → TableOutputFormat
This is the class that handles the key/value pairs and writes
them to their final destination
I Single instance that takes the output record from each reducer
subsequently
Details
I Must specify the table name when the MR job is created
I Handles buffer flushing implicitly (autoflush option is set to false)
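A sketch tying these pieces together with TableMapReduceUtil; the table names and the mapper logic are made up, and the job simply copies rows from one table to another:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTableJob {
    // TableMapper: the framework feeds it one (row key, Result) pair per row of the input table
    static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(row.get());
            for (KeyValue kv : value.raw()) {   // copy every cell of the input row
                put.add(kv);
            }
            context.write(row, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "copy-table-sketch");
        job.setJarByClass(CopyTableJob.class);

        Scan scan = new Scan();        // defines which rows/columns the mappers will see
        scan.setCaching(500);          // larger scanner caching helps MR throughput
        scan.setCacheBlocks(false);    // do not pollute the block cache from a batch job

        // Input side: TableInputFormat + TableMapper over the "url" table
        TableMapReduceUtil.initTableMapperJob("url", scan, CopyMapper.class,
                ImmutableBytesWritable.class, Put.class, job);

        // Output side: TableOutputFormat writing Puts into "url-copy";
        // passing null uses the identity reducer
        TableMapReduceUtil.initTableReducerJob("url-copy", null, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}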
MapReduce Locality
I HBase rewrites its store files regularly when performing compactions
→ HDFS is smart enough to ensure data locality
F There is a block placement policy that enforces local writes
F The data node compares the server name of the writer with its own
F If they match, the block is written to the local filesystem
I Just be careful about region movements during load balancing or
server failures
Table Splits
When a job starts, the framework calls
createRecordReader() for each input split
I It iterates over the splits and creates a new TableRecordReader
with the current split
I Each TableRecordReader handles exactly one region, reading
and mapping every row between the region’s start and end keys
Data locality
I Each split contains the server name hosting the region
I The framework checks the server name and if the TaskTracker is
running on the same machine, it will run it on that server
I The RegionServer is colocated with the HDFS DataNode, hence
data is read from the local filesystem
References I
[1] B+ tree.
http://en.wikipedia.org/wiki/B%2B_tree.
[2] Eric Brewer.
Lessons from giant-scale services.
IEEE Internet Computing, 2001.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew
Fikes, and Robert E. Gruber.
Bigtable: A distributed storage system for structured data.
In Proc. of USENIX OSDI, 2006.
[4] Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplified data processing on large clusters.
In Proc. of USENIX OSDI, 2004.
References II
References III
[8] D. Salmen.
Cloud data structure diagramming techniques and design patterns.
https://www.data-tactics-corp.com/index.php/
component/jdownloads/finish/22-white-papers/
68-cloud-data-structure-diagramming, 2009.