Lecture #08
ADVANCED
DATABASE
SYSTEMS
Storage Models &
Data Layout
@Andy_Pavlo // 15-721 // Spring 2020
2
D ATA O R G A N I Z AT I O N
Fixed-Length Variable-Length
Index Data Blocks Data Blocks
Block Id + C++11 alignas
Offset
Block Pointer Offset
44-bits 20-bits
15-721 (Spring 2020)
3
D ATA O R G A N I Z AT I O N
One can think of an in-memory database as just a
large array of bytes.
→ The schema tells the DBMS how to convert the bytes
into the appropriate type.
→ Each tuple is prefixed with a header that contains its
meta-data.
Storing tuples with as fixed-length data makes it
easy to compute the starting point of any tuple.
15-721 (Spring 2020)
4
Type Representation
Data Layout / Alignment
Storage Models
System Catalogs
15-721 (Spring 2020)
5
D ATA R E P R E S E N TAT I O N
INTEGER/BIGINT/SMALLINT/TINYINT
→ C/C++ Representation
FLOAT/REAL vs. NUMERIC/DECIMAL
→ IEEE-754 Standard / Fixed-point Decimals
TIME/DATE/TIMESTAMP
→ 32/64-bit int of (micro/milli)seconds since Unix epoch
VARCHAR/VARBINARY/TEXT/BLOB
→ Pointer to other location if type is ≥64-bits
→ Header with length and address to next location (if
segmented), followed by data bytes.
15-721 (Spring 2020)
6
VA R I A B L E P R E C I S I O N N U M B E R S
Inexact, variable-precision numeric type that uses
the “native” C/C++ types.
Store directly as specified by IEEE-754.
Typically faster than arbitrary precision numbers.
→ Example: FLOAT, REAL/DOUBLE
15-721 (Spring 2020)
7
VA R I A B L E P R E C I S I O N N U M B E R S
Rounding Example
#include <stdio.h>
Output int main(int argc, char* argv[]) {
x+y = 0.30000001192092895508 float x = 0.1;
0.3 = 0.29999999999999998890 float y = 0.2;
printf("x+y = %.20f\n", x+y);
printf("0.3 = %.20f\n", 0.3);
}
15-721 (Spring 2020)
8
FIXED PRECISION NUMBERS
Numeric data types with arbitrary precision and
scale. Used when round errors are unacceptable.
→ Example: NUMERIC, DECIMAL
Typically stored in an exact, variable-length binary
representation with additional meta-data.
→ Like a VARCHAR but not stored as a string
15-721 (Spring 2020)
9
D ATA L AYO U T
char[]
CREATE TABLE AndySux (
id INT PRIMARY KEY, header id value
value BIGINT
);
reinterpret_cast<int32_t*>(address)
15-721 (Spring 2020)
10
VA R I A B L E -L E N G T H F I E L D S
char[]
CREATE TABLE AndySux (
value VARCHAR(1024) header id Andy|64-BIT
64-BIT POINTER POINTER
);
Variable-Length Data Blocks
INSERT INTO AndySux
VALUES ("Andy has the worst LENGTH NEXT
Andy has the worst
hygiene that I have ever seen. I hygiene that I have ever seen. I hate
hate him so much.");
LENGTH NEXT him so much.
15-721 (Spring 2020)
11
N U L L D ATA T Y P E S
Choice #1: Special Values
→ Designate a value to represent NULL for a data type (e.g.,
INT32_MIN).
Choice #2: Null Column Bitmap Header
→ Store a bitmap in the tuple header that specifies what
attributes are null.
Choice #3: Per Attribute Null Flag
→ Store a flag that marks that a value is null.
→ Have to use more space than just a single bit because this
messes up with word alignment.
15-721 (Spring 2020)
12
DISCLAIMER
The truth is that you only need to worry about
word-alignment for cache lines (e.g., 64 bytes).
I’m going to show you the basic idea using 64-bit
words since it’s easier to see…
15-721 (Spring 2020)
13
W O R D -A L I G N E D T U P L E S
All attributes in a tuple must be word aligned to
enable the CPU to access it without any
unexpected behavior or additional work.
CREATE TABLE AndySux (
32-bits id INT PRIMARY KEY, char[]
64-bits cdate TIMESTAMP, id cdate c zipc
16-bits color CHAR(2),
32-bits zipcode INT 64-bit Word 64-bit Word 64-bit Word 64-bit Word
);
15-721 (Spring 2020)
14
W O R D -A L I G N E D T U P L E S
Approach #1: Perform Extra Reads
→ Execute two reads to load the appropriate parts
of the data word and reassemble them.
Approach #2: Random Reads
→ Read some unexpected combination of bytes
assembled into a 64-bit word.
Approach #3: Reject
Source: Levente Kurusa → Throw an exception and hope app handles it.
15-721 (Spring 2020)
15
W O R D -A L I G N M E N T: PA D D I N G
Add empty bits after attributes to ensure that tuple
is word aligned.
CREATE TABLE AndySux (
32-bits id INT PRIMARY KEY, char[]
00000000 00000
64-bits cdate TIMESTAMP, id 00000000
00000000 cdate c zipc 000
00000
00000000 000
16-bits color CHAR(2),
32-bits zipcode INT 64-bit Word 64-bit Word 64-bit Word 64-bit Word
);
15-721 (Spring 2020)
16
W O R D -A L I G N M E N T: R E O R D E R I N G
Switch the order of attributes in the tuples'
physical layout to make sure they are aligned.
→ May still have to use padding.
CREATE TABLE AndySux (
32-bits id INT PRIMARY KEY, char[]
000000000000
64-bits cdate TIMESTAMP, id zipc cdate c 000000000000
000000000000
000000000000
16-bits color CHAR(2),
32-bits zipcode INT 64-bit Word 64-bit Word 64-bit Word 64-bit Word
);
15-721 (Spring 2020)
17
C M U -D B A L I G N M E N T E X P E R I M E N T
Processor: 1 socket, 4 cores w/ 2×HT
Workload: Insert Microbenchmark
Avg. Throughput
No Alignment 0.523 MB/sec
Padding 11.7 MB/sec
Padding + Sorting 814.8 MB/sec
Source: Tianyu Li
15-721 (Spring 2020)
18
STORAGE MODELS
N-ary Storage Model (NSM)
Decomposition Storage Model (DSM)
Hybrid Storage Model
COLUMN-STORES VS. ROW-STORES: HOW
DIFFERENT ARE THEY REALLY?
SIGMOD 2008
15-721 (Spring 2020)
19
N -A R Y S T O R A G E M O D E L ( N S M )
The DBMS stores all of the attributes for a single
tuple contiguously.
Ideal for OLTP workloads where txns tend to
operate only on an individual entity and insert-
heavy workloads.
Use the tuple-at-a-time iterator model.
15-721 (Spring 2020)
20
N -A R Y S T O R A G E M O D E L ( N S M )
Advantages
→ Fast inserts, updates, and deletes.
→ Good for queries that need the entire tuple.
→ Can use index-oriented physical storage.
Disadvantages
→ Not good for scanning large portions of the table and/or
a subset of the attributes.
15-721 (Spring 2020)
21
D E C O M P O S I T I O N S TO R A G E M O D E L ( D S M )
The DBMS stores a single attribute for all tuples
contiguously in a block of data.
Ideal for OLAP workloads where read-only
queries perform large scans over a subset of the
table’s attributes.
15-721 (Spring 2020)
22
D E C O M P O S I T I O N S TO R A G E M O D E L ( D S M )
Advantages
→ Reduces the amount wasted work because the DBMS
only reads the data that it needs.
→ Better compression.
Disadvantages
→ Slow for point queries, inserts, updates, and deletes
because of tuple splitting/stitching.
15-721 (Spring 2020)
23
D S M S Y S T E M H I S TO R Y
1970s: Cantor DBMS
1980s: DSM Proposal
1990s: SybaseIQ (in-memory only)
2000s: Vertica, VectorWise, MonetDB
2010s: Everyone
15-721 (Spring 2020)
24
DSM: DESIGN DECISIONS
Tuple Identification
Data Organization
Update Policy
Buffering Location
OPTIMAL COLUMN LAYOUT FOR
HYBRID WORKLOADS
VLDB 2019
15-721 (Spring 2020)
25
D S M : T U P L E I D E N T I F I C AT I O N
Choice #1: Fixed-length Offsets
→ Each value is the same length for an attribute.
Choice #2: Embedded Tuple Ids
→ Each value is stored with its tuple id in a column.
Offsets Embedded Ids
A B C D A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
15-721 (Spring 2020)
26
D S M : D ATA O R G A N I Z AT I O N
Choice #1: Insertion Order
→ Tuples are inserted into any free slot that is available in
existing blocks.
Choice #2: Sorted Order
→ Tuples are inserted based into a slot according to some
ordering scheme.
Choice #3: Partitioned
→ Assign tuples to blocks according to their attribute values
and some partitioning scheme (e.g., hashing, range).
15-721 (Spring 2020)
27
D S M : D ATA O R G A N I Z AT I O N
Data Table Sorted Table
INSERT INTO xxx A B C A B C
VALUES (a2, b1, c5); 0 a1 b1 c1 2 a1 b2 c8
1 a3 b2 c9 5 a1 b2 c9
2 a1 b2 c8 0 a1 b1 c1
3 a2 b2 c7 3 a2 b2 c7
4 a2
a3 b2
b1 c9
c5
c6 1 a3 b2 c9
5 a3
a1 b1
b2 c1
c9 6 a3 b1 c1
6 a3 b1 c6
c1 4 a3 b1 c6
7 a2 b1 c5 7
Sort Order: (A↑, B↓, C↑)
15-721 (Spring 2020)
28
C A S P E R D E LTA S TO R E
Range-partitioned column store with a "shallow"
order-preserving index above it.
→ Shallow index maps value ranges to partitions.
→ Index keys are sorted but the individual columns are not.
DBMS runs an offline optimization algorithm to
determine the optimal partitioning of data.
OPTIMAL COLUMN LAYOUT FOR
HYBRID WORKLOADS
VLDB 2019
15-721 (Spring 2020)
29
C A S P E R D E LTA S TO R E
Data Table
A B C
INSERT INTO xxx 0 a1 b2 c8
VALUES (a2, b1, c5); 1 a1 b2 c9
2 a1 b1 c1
3 a2 b2 c7
INSERT INTO xxx 4 a2 b1 c5
VALUES (a2, b2, c6); 5 a2
a3 b2
b1 c6
6 a3 b1 c1
Shallow Index 7 a3 b2 c9
key→partition 8 a3 b1 c6
9
15-721 (Spring 2020)
30
O B S E R VAT I O N
Data is “hot” when it enters the database
→ A newly inserted tuple is more likely to be updated again
the near future.
As a tuple ages, it is updated less frequently.
→ At some point, a tuple is only accessed in read-only
queries along with other tuples.
15-721 (Spring 2020)
31
H Y B R I D S TO R A G E M O D E L
Single logical database instance that uses different
storage models for hot and cold data.
Store new data in NSM for fast OLTP
Migrate data to DSM for more efficient OLAP
15-721 (Spring 2020)
32
H Y B R I D S TO R A G E M O D E L
Choice #1: Separate Execution Engines
→ Use separate execution engines that are optimized for
either NSM or DSM databases.
Choice #2: Single, Flexible Architecture
→ Use single execution engine that can efficiently operate
on both NSM and DSM databases.
15-721 (Spring 2020)
33
S E PA R AT E E X E C U T I O N E N G I N E S
Run separate “internal” DBMSs that each only
operate on DSM or NSM data.
→ Need to combine query results from both engines to
appear as a single logical database to the application.
→ Must use a synchronization method (e.g., 2PC) if a txn
spans execution engines.
Two approaches to do this:
→ Fractured Mirrors (Oracle, IBM)
→ Delta Store (SAP HANA)
15-721 (Spring 2020)
34
FRACTURED MIRRORS
Store a second copy of the database in a DSM
layout that is automatically updated.
→ All updates are first entered in NSM then eventually
copied into DSM mirror.
NSM DSM
(Primary) (Mirror) Analytical
Transactions Queries
A CASE FOR FRACTURED MIRRORS
VLDB 2002
15-721 (Spring 2020)
35
D E LTA S TO R E
Stage updates to the database in an NSM table.
A background thread migrates updates from delta
store and applies them to DSM data.
DSM
NSM Historical Data
Delta Store
Transactions
15-721 (Spring 2020)
36
P E LOT O N A D A P T I V E S TO R A G E
Employ a single execution engine architecture that
can operate on both NSM and DSM data.
→ Don’t need to store two copies of the database.
→ Don’t need to sync multiple database segments.
Note that a DBMS can still use the delta-store
approach with this single-engine architecture.
BRIDGING THE ARCHIPELAGO BETWEEN ROW-STORES AND
COLUMN-STORES FOR HYBRID WORKLOADS
SIGMOD 2016
15-721 (Spring 2020)
37
P E LOT O N A D A P T I V E S TO R A G E
Original Data Adapted Data
UPDATE AndySux A B C D A B C D
SET A = 123,
B = 456, Hot
C = 789
WHERE D = “xxx” A B C D
SELECT AVG(B)
FROM AndySux
WHERE C = “yyy”
Cold
15-721 (Spring 2020)
39
P E LOT O N A D A P T I V E S T O R A G E
Row Layout Column Layout Adaptive Layout
Execution Time (ms)
1600
1200
800
400
0 Scan Insert Scan Insert Scan Insert Scan Insert Scan Insert Scan Insert
Sep-15 Sep-16 Sep-17 Sep-18 Sep-19 Sep-20
15-721 (Spring 2020)
40
S Y S T E M C ATA LO G S
Almost every DBMS stores their database's
catalogs the same way that they store regular data.
→ Wrap object abstraction around tuples.
→ Specialized code for "bootstrapping" catalog tables.
The entire DBMS should be aware of transactions
in order to automatically provide ACID guarantees
for DDL commands and concurrent transactions.
15-721 (Spring 2020)
41
SCHEMA CHANGES
ADD COLUMN:
→ NSM: Copy tuples into new region in memory.
→ DSM: Just create the new column segment
DROP COLUMN:
→ NSM #1: Copy tuples into new region of memory.
→ NSM #2: Mark column as "deprecated", clean up later.
→ DSM: Just drop the column and free memory.
CHANGE COLUMN:
→ Check whether the conversion can happen. Depends on
default values.
15-721 (Spring 2020)
42
INDEXES
CREATE INDEX:
→ Scan the entire table and populate the index.
→ Must record changes made by txns that modified the table
while another txn was building the index.
→ When the scan completes, lock the table and resolve
changes that were missed after the scan started.
DROP INDEX:
→ Just drop the index logically from the catalog.
→ It only becomes "invisible" when the txn that dropped it
commits. All existing txns will still have to update it.
15-721 (Spring 2020)
43
SEQUENCES
Typically stored in the catalog. Used for
maintaining a global counter
→ Also called "auto-increment" or "serial" keys
Sequences are not maintained with the same
isolation protection as regular catalog entries.
→ Rolling back a txn that incremented a sequence does not
rollback the change to that sequence.
→ All INSERT queries would incur write-write conflicts.
15-721 (Spring 2020)
44
PA R T I N G T H O U G H T S
We abandoned the hybrid storage model
→ Significant engineering overhead.
→ Delta version storage + column store is almost
equivalent.
Catalogs are hard.
15-721 (Spring 2020)