Deduplication: Overview & Case Studies
CSCI 333 – Spring 2020
Williams College
Lecture Outline
Background
Content Addressable Storage (CAS)
Deduplication
Chunking
The Index
Other CAS applications
Content Addressable Storage (CAS)
Deduplication systems often rely on Content Addressable
Storage (CAS)
Data is indexed by some content identifier
The content identifier is determined by some
function over the data itself
- often a cryptographically strong hash function
CAS
Example:
I send a document to be stored remotely
on some content addressable storage
CAS
Example:
The server receives the document, and
calculates a unique identifier called the
data's fingerprint
CAS
The fingerprint should be:
unique to the data
- NO collisions
one-way
- hard to invert
CAS
The fingerprint should be:
unique to the data
- NO collisions
one-way
- hard to invert
SHA-1: 20 bytes (160 bits)
P(collision(a, b)) = (1/2)^160
coll(N, 2^160) = C(N, 2) · (1/2)^160
≈ 10^24 objects before it is more likely than not that a collision has occurred
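As a sanity check on those numbers, here is a small Python sketch (not from the lecture; the sample values of N are arbitrary) that evaluates the pairwise union bound C(N, 2) · (1/2)^160:

# Back-of-the-envelope collision probability for a 160-bit fingerprint,
# using the union (pairwise) bound: P(any collision) <= C(N, 2) * (1/2)^160.
from math import comb

BITS = 160  # SHA-1 output size in bits

def collision_bound(n_objects):
    """Upper bound on the probability that any two of n_objects collide."""
    return comb(n_objects, 2) * 2.0 ** -BITS

for n in (10**6, 10**12, 10**24):
    print(f"N = {n:.0e}: P(collision) <= {collision_bound(n):.3e}")

The bound only approaches 1/2 near N ≈ 10^24, which is where the "more likely than not" figure above comes from.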
CAS
Example:
SHA-1( document contents ) = de9f2c7fd25e1b3a...
[Diagram: name "homework.txt" → fingerprint de9f2c7fd25e1b3a... → stored data]
CAS
Example:
I submit my homework, and my “buddy”
Harold also submits my homework...
CAS
Example:
Same contents, same fingerprint.
[Diagram: both submissions map to fingerprint de9f2c7fd25e1b3a..., which points to the stored data]
CAS
Example:
Same contents, same fingerprint.
The data is only stored once!
[Diagram: two names, one fingerprint de9f2c7fd25e1b3a..., one stored copy of the data]
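A toy content-addressable store makes the "stored once" behavior concrete. This is a minimal sketch (class and variable names are illustrative, not any real system's API):

import hashlib

class ContentAddressableStore:
    """Toy CAS: objects are named by the SHA-1 of their contents."""

    def __init__(self):
        self.objects = {}   # fingerprint (hex) -> bytes
        self.names = {}     # human-readable name -> fingerprint

    def put(self, name, data):
        fp = hashlib.sha1(data).hexdigest()
        # Identical contents hash to the same fingerprint,
        # so the bytes are physically stored only once.
        self.objects.setdefault(fp, data)
        self.names[name] = fp
        return fp

    def get(self, name):
        return self.objects[self.names[name]]

cas = ContentAddressableStore()
hw = b"My homework answers..."
cas.put("hw-bill.txt", hw)
cas.put("hw-harold.txt", hw)     # same bytes, same fingerprint
assert len(cas.objects) == 1     # only one copy is stored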
CAS
Example:
Now suppose Harold writes his name at the top of my document.
CAS
Example:
The fingerprints are completely different,
despite the (mostly) identical contents.
[Diagram: original document → de9f2c7fd25e1b3a... → data; modified document → fad3e85a0bd17d9b... → data']
CAS
Problem Statement:
What is the appropriate granularity to
address our data?
What are the tradeoffs associated with
this choice?
Background
Background
Content Addressable Storage (CAS)
Deduplication
Chunking
The Index
Other applications
Deduplication
Chunking breaks a data stream into segments
SHA1( DATA ) becomes
SHA1( CK1 ) + SHA1( CK2 ) + SHA1( CK3 )
How do we divide a data stream?
How do we reassemble a data stream?
Deduplication
Division.
Option 1: fixed-size blocks
- Every (?)KB, start a new chunk
Option 2: variable-size chunks
- Chunk boundaries dependent on chunk contents
Deduplication
Division: fixed-size blocks
[Diagram: hw-bill.txt and hw-harold.txt split into fixed-size blocks; identical contents yield identical blocks]
Deduplication
Division: fixed-size blocks
hw-bill.txt hw-harold.txt
Suppose Harold adds his name to the top of my homework.
Every block boundary after the insertion shifts, so none of the later blocks match.
This is called the boundary shifting problem.
[Diagram: corresponding fixed-size blocks of the two files all compare unequal]
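The effect is easy to reproduce. A minimal sketch (tiny illustrative block size, made-up contents) shows that a one-byte insertion changes every fixed-size block that follows it:

import hashlib

BLOCK = 8   # deliberately tiny block size, just for illustration

def fixed_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

original = b"the quick brown fox jumps over the lazy dog"
modified = b"H" + original   # Harold prepends a single byte

orig_fps = {hashlib.sha1(b).hexdigest() for b in fixed_blocks(original)}
mod_fps = {hashlib.sha1(b).hexdigest() for b in fixed_blocks(modified)}

# Every block boundary shifted by one byte, so no blocks are shared.
print("shared blocks:", len(orig_fps & mod_fps))   # prints 0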
Deduplication
Division.
Option 1: fixed-size blocks
- Every 4KB, start a new chunk
Option 2: variable-size chunks
- Chunk boundaries dependent on chunk contents
Deduplication
Division: variable-size chunks
Parameters: window of width w, target pattern t
- Slide the window byte by byte across the data, and compute a window fingerprint at each position.
- If the fingerprint matches the target, t, then we have a fingerprint match at that position.
Deduplication
Division: variable-size chunks
hw-wkj.txt hw-harold.txt
Suppose Harold adds his name to the top of my homework.
[Diagram: only the first chunk of hw-harold.txt differs; every later content-defined boundary is unchanged]
Only introduce one new chunk to storage.
Deduplication
Division: variable-size chunks
Sliding window properties:
- collisions are OK, but
- average chunk size should be configurable
- reuse overlapping window calculations
Rabin fingerprints
Window w, target t
- expect a chunk every 2^t - 1 + w bytes
LBFS: w=48, t=13
- expect a chunk every 8KB
Deduplication
Division: variable-size chunks
Rabin fingerprint: preselect a divisor D and an irreducible polynomial p
R(b_1, b_2, ..., b_w) = (b_1·p^(w-1) + b_2·p^(w-2) + … + b_w) mod D
R(b_i, ..., b_(i+w-1)) = ( (R(b_(i-1), ..., b_(i+w-2)) - b_(i-1)·p^(w-1)) · p + b_(i+w-1) ) mod D
That is, the fingerprint of an arbitrary window of width w reuses the previous window's calculation: subtract the previous first term, multiply by p, and add the incoming byte.
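A content-defined chunker built on this kind of rolling fingerprint might look like the sketch below. It uses a simple polynomial rolling hash with made-up parameters (base P, modulus D, window W, and an 11-bit target), not LBFS's actual Rabin polynomial or its w=48, t=13 settings:

import hashlib
import os

# Illustrative parameters, not the LBFS values.
W = 16                 # window width in bytes
P = 263                # base of the rolling polynomial hash
D = (1 << 31) - 1      # modulus (standing in for the irreducible-polynomial divisor)
MASK = (1 << 11) - 1   # match the low 11 bits -> expected chunks of roughly 2 KB
TARGET = 0x7A          # arbitrary target pattern

def content_defined_chunks(data):
    """Split data wherever the rolling window fingerprint matches TARGET."""
    chunks, start, h = [], 0, 0
    p_top = pow(P, W - 1, D)   # P^(W-1) mod D, used to remove the departing byte
    for i, b in enumerate(data):
        if i < W:
            h = (h * P + b) % D                              # fill the first window
        else:
            h = ((h - data[i - W] * p_top) * P + b) % D      # slide the window by one byte
        if i >= W - 1 and (h & MASK) == TARGET:
            chunks.append(data[start:i + 1])                 # declare a chunk boundary
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                          # final partial chunk
    return chunks

data = os.urandom(1 << 16)                                   # any byte stream will do
chunks = content_defined_chunks(data)
fingerprints = [hashlib.sha1(c).hexdigest() for c in chunks]
print(len(chunks), "chunks, average size", len(data) // max(1, len(chunks)))

Because boundaries depend only on the bytes inside the window, inserting data near the front of a stream only disturbs the chunks that the window actually touches.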
Deduplication
Recap:
Chunking breaks a data stream into smaller segments
→ What do we gain from chunking?
→ What are the tradeoffs?
+ Finer granularity of sharing
+ Finer granularity of addressing
- Fingerprinting is an expensive operation
- Not suitable for all data patterns
- Index overhead
Deduplication
Reassembling chunks:
Recipes provide directions for reconstructing files from chunks
[Diagram: the file's metadata holds a recipe, an ordered list of <SHA1> fingerprints; each fingerprint points to a data block]
CAS
Example:
[Diagram: name "homework.txt" → fingerprint de9f2c7fd25e1b3a... → metadata holding a recipe of <SHA1> chunk fingerprints, rather than the raw data]
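Combining chunking with the CAS gives a simple picture of recipes: store each chunk under its fingerprint, and keep the ordered fingerprint list as the file's metadata. A minimal sketch (hypothetical helper names; the trivial fixed-size chunker is just to keep it short):

import hashlib

chunk_store = {}   # fingerprint -> chunk bytes (the CAS)

def fixed_size_chunks(data, size=4096):
    return [data[i:i + size] for i in range(0, len(data), size)]

def store_file(data, chunker=fixed_size_chunks):
    """Chunk the data, store each unique chunk once, return the file's recipe."""
    recipe = []
    for ck in chunker(data):
        fp = hashlib.sha1(ck).hexdigest()
        chunk_store.setdefault(fp, ck)   # duplicate chunks are stored only once
        recipe.append(fp)
    return recipe

def restore_file(recipe):
    """Reassemble a file by following its recipe in order."""
    return b"".join(chunk_store[fp] for fp in recipe)

original = b"x" * 10000 + b"y" * 10000
recipe = store_file(original)
assert restore_file(recipe) == original
assert len(chunk_store) < len(recipe)    # repeated chunks were deduplicated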
Deduplication
Background
Content Addressable Storage (CAS)
Deduplication
Chunking
The Index
Other applications
Deduplication
The Index:
SHA-1 fingerprint uniquely identifies data, but
the index translates fingerprints to chunks.
<sha-1_1> → <chunk_1>
<sha-1_2> → <chunk_2>
<sha-1_3> → <chunk_3>
…
<sha-1_n> → <chunk_n>
<chunk_i> = {location, size?, refcount?, compressed?, ...}
Deduplication
The Index:
For small chunk stores:
- database, hash table, tree
For a large index, these traditional data structures won't fit in main memory
- each index query requires a disk seek
- why? SHA-1 fingerprints are independent and randomly distributed
- no locality
Known as the index disk bottleneck
Deduplication
The Index:
Back of the envelope:
Average chunk size: 4KB
Fingerprint: 20B
20TB unique data = 100GB SHA-1 fingerprints
Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
[Diagram: in memory, the Summary Vector and the Locality Preserving Cache; on disk, the Stream-Informed Segment Layout (containers)]
Deduplication
Disk bottleneck:
Summary vector
- Bloom filter (any AMQ data structure works)
[Diagram: a bit array; each fingerprint is hashed by h1, h2, h3 and the corresponding bits are set]
Filter properties:
● No false negatives
- if an FP is in the index, it is in the summary vector
● Tuneable false positive rate
- we can trade memory for accuracy
Note: on a false positive, we are no worse off
- We just do the disk seek we would have done anyway
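A minimal Bloom filter sketch (illustrative bit-array size and a double-hashing scheme derived from SHA-1; not Data Domain's parameters):

import hashlib

class BloomFilter:
    """Toy summary vector: no false negatives, tuneable false positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from a single SHA-1 digest (double hashing).
        d = hashlib.sha1(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any unset bit proves the item was never added (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

sv = BloomFilter()
sv.add(b"de9f2c7fd25e1b3a...")
assert b"de9f2c7fd25e1b3a..." in sv   # items that were added are always found
# An item that was never added is usually, but not always, reported absent.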
Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
[Diagram: same architecture; the Summary Vector in memory is a Bloom filter]
Deduplication
Disk bottleneck:
Stream informed segment layout (SISL)
- variable-sized chunks are written to fixed-size containers
- chunk descriptors are stored in a list at the head
→“temporal locality” for hashes within a container
Principle:
- backup workloads exhibit chunk locality
Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
[Diagram: same architecture; the on-disk containers group fingerprints, giving temporal locality]
Deduplication
Disk bottleneck:
Locality Preserving Cache (LPC)
- LRU cache of candidate fingerprint groups
[Diagram: the LPC holds groups of chunk descriptors (CD1 CD2 CD3 CD4, CD43 CD44 CD45 CD46, CD9 CD10 CD11 CD12, ...), each group prefetched from the head of an on-disk container]
Principle:
- if you must go to disk, make it worth your while
Deduplication
Disk bottleneck:
START: read request for a chunk fingerprint
Fingerprint in Bloom filter?
- No → no lookup necessary → END
- Yes → Fingerprint in LPC?
  - Yes → read data from target container → END
  - No → on-disk fingerprint index lookup: get container location
         → prefetch fingerprints from head of target data container
         → read data from target container → END
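The same flow can be written as a short function. Everything here is an illustrative sketch of the pipeline above (the Container class, the plain set standing in for the Bloom filter, and the dict-based index are made up, not Data Domain's structures):

import hashlib

class Container:
    """Toy on-disk container: a header of fingerprints plus the chunk data."""
    def __init__(self, chunks):
        self.chunks = dict(chunks)              # fingerprint -> bytes
        self.fingerprints = list(self.chunks)   # the descriptor list at the head
    def read(self, fingerprint):
        return self.chunks[fingerprint]

def lookup_chunk(fingerprint, bloom, lpc, on_disk_index, containers):
    """Filtered lookup path for one chunk fingerprint."""
    # 1. Summary vector: a miss means the chunk is definitely not stored.
    if fingerprint not in bloom:
        return None                              # no index lookup necessary
    # 2. LPC: a hit skips the on-disk index lookup entirely.
    if fingerprint not in lpc:
        # 3. Pay the disk seek: consult the on-disk fingerprint index.
        cid = on_disk_index.get(fingerprint)
        if cid is None:
            return None                          # Bloom filter false positive
        # 4. Make the seek worth it: prefetch the container's whole
        #    fingerprint group into the LPC (backup streams have chunk locality).
        lpc.update({fp: cid for fp in containers[cid].fingerprints})
    # 5. Read the chunk from its container.
    return containers[lpc[fingerprint]].read(fingerprint)

c0 = Container({hashlib.sha1(b"A").hexdigest(): b"A",
                hashlib.sha1(b"B").hexdigest(): b"B"})
containers = {0: c0}
on_disk_index = {fp: 0 for fp in c0.fingerprints}
bloom = set(c0.fingerprints)   # a plain set stands in for the summary vector
lpc = {}

fp_a = hashlib.sha1(b"A").hexdigest()
assert lookup_chunk(fp_a, bloom, lpc, on_disk_index, containers) == b"A"
assert fp_a in lpc             # the whole container's group was prefetched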
Deduplication
Summary: Dedup and the 4 W's
Dedup Goal: eliminate repeat instances of identical data
What (granularity) to dedup?
Where to dedup?
When to dedup?
Why dedup?
Deduplication
Summary: Dedup and the 4 W's
What (granularity) to dedup?
Whole-file:
- Chunking overheads: N/A
- Dedup ratio: all-or-nothing
- Other notes: low index overhead; compressed/encrypted/media data
Fixed-size:
- Chunking overheads: offsets
- Dedup ratio: boundary shifting problem
- Other notes: (whole-file)+ ease of implementation, selective caching, synchronization
Content-defined:
- Chunking overheads: sliding window fingerprinting
- Dedup ratio: best
- Other notes: latency, CPU intensive
Hybrid? Context-aware.
Deduplication
Summary: Dedup and the 4 W's
Where to dedup?
Source: dedup before sending data over the network
+ save bandwidth
- client complexity
- trust clients?
Destination: dedup at the storage server
+ server more powerful
- centralized data structures
Hybrid:
- client index checks membership, server index stores location
Deduplication
Summary: Dedup and the 4 W's
When to dedup?
Inline: Data → Dedup → Disk
+ never store duplicate data
+ faster → save I/O for duplicate data
- slower → index lookup per chunk
Post-process: Data → Disk → Dedup
+ faster → stream long writes, reclaim in the background
- temporarily wasted storage
- may create (even more) fragmentation
Hybrid:
→ post-processing faster for initial commits
→ switch to inline to take advantage of I/O savings
Deduplication
Why dedup?
Perhaps you have a loooooot of data...
- enterprise backups
Or data that is particularly amenable to deduplication...
- small or incremental changes
- data that is not encrypted or compressed
Or that changes infrequently.
- blocks are immutable → no such thing as a “block modify”
- rate of change determines container chunk locality
Ideal use case: “Cold Storage”
Deduplication
Why dedup?
Perhaps your bottleneck isn't the CPU
- Use dedup if you can favorably trade other resources
[Diagram: two endpoints on either side of a bandwidth-constrained link, each with a shared cache, a packet store (FIFO), and a fingerprint index]
Example: A Protocol-Independent Technique for Eliminating Redundant Network Traffic
Background
Background
Content Addressable Storage (CAS)
Deduplication
Chunking
The Index
Other applications
Other CAS Applications
Data verification
CAS can be used to build tamper-evident storage. Suppose that:
- you can't fix a compromised server,
- but you never want to be fooled by one
Insight: Fingerprints uniquely identify data
- hash before storing data, and save the fp locally
- rehash data and compare fps upon receipt
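Client-side, the verification is just a re-hash and compare. A minimal sketch (the UntrustedStore class and helper names are hypothetical):

import hashlib

class UntrustedStore:
    """Stand-in for a remote server we cannot fix, but never want to be fooled by."""
    def __init__(self):
        self._data = {}
    def put(self, name, data):
        self._data[name] = data
    def get(self, name):
        return self._data[name]

local_fingerprints = {}   # name -> fingerprint, kept on the trusted client

def store_with_fingerprint(store, name, data):
    """Hash before storing the data, and save the fingerprint locally."""
    local_fingerprints[name] = hashlib.sha1(data).hexdigest()
    store.put(name, data)

def fetch_and_verify(store, name):
    """Rehash the data and compare fingerprints upon receipt."""
    data = store.get(name)
    if hashlib.sha1(data).hexdigest() != local_fingerprints[name]:
        raise ValueError(f"tamper detected for {name}")
    return data

server = UntrustedStore()
store_with_fingerprint(server, "homework.txt", b"my answers")
server._data["homework.txt"] = b"tampered answers"   # a compromised server rewrites the data
try:
    fetch_and_verify(server, "homework.txt")
except ValueError as err:
    print(err)   # the tampering is detected, even though it could not be prevented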