Principles of Distributed Database
Systems
TS. Phan Thị Hà
Khoa CNTT. PTIT
© 2020, M.T. Özsu & P. Valduriez TS. Pan
1
Thị Hà_PTIT
Outline
◼ Introduction
◼ Distributed and Parallel Database Design
◼ Distributed Data Control
◼ Distributed Query Processing
◼ Distributed Transaction Processing
◼ Data Replication
◼ Database Integration – Multidatabase Systems
◼ Parallel Database Systems
◼ Peer-to-Peer Data Management
◼ Big Data Processing
◼ NoSQL, NewSQL and Polystores
◼ Web Data Management
© 2020, M.T. Özsu & P. Valduriez TS. Pan
2
Thị Hà_PTIT
Outline
◼ Introduction
❑ What is a distributed DBMS
❑ History
❑ Distributed DBMS promises
❑ Design issues
❑ Distributed DBMS architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
3
Thị Hà_PTIT
Distributed Computing
◼ A number of autonomous processing elements (not
necessarily homogeneous) that are interconnected by a
computer network and that cooperate in performing their
assigned tasks.
◼ What is being distributed?
❑ Processing logic
❑ Function
❑ Data
❑ Control
© 2020, M.T. Özsu & P. Valduriez TS. Pan
4
Thị Hà_PTIT
Current Distribution – Geographically
Distributed Data Centers
© 2020, M.T. Özsu & P. Valduriez TS. Pan
5
Thị Hà_PTIT
What is a Distributed Database System?
A distributed database is a collection of multiple, logically
interrelated databases distributed over a computer network
A distributed database management system (Distributed
DBMS) is the software that manages the DDB and provides
an access mechanism that makes this distribution
transparent to the users
© 2020, M.T. Özsu & P. Valduriez TS. Pan
6
Thị Hà_PTIT
What is not a DDBS?
◼ A timesharing computer system
◼ A loosely or tightly coupled multiprocessor system
◼ A database system which resides at one of the nodes of
a network of computers - this is a centralized database
on a network node
© 2020, M.T. Özsu & P. Valduriez TS. Pan
7
Thị Hà_PTIT
Distributed DBMS Environment
© 2020, M.T. Özsu & P. Valduriez TS. Pan
8
Thị Hà_PTIT
Implicit Assumptions
◼ Data stored at a number of sites → each site logically
consists of a single processor
◼ Processors at different sites are interconnected by a
computer network → not a multiprocessor system
❑ Parallel database systems
◼ Distributed database is a database, not a collection of
files → data logically related as exhibited in the users’
access patterns
❑ Relational data model
◼ Distributed DBMS is a full-fledged DBMS
❑ Not remote file system, not a TP system
© 2020, M.T. Özsu & P. Valduriez TS. Pan
9
Thị Hà_PTIT
Important Point
Logically integrated
but
Physically distributed
© 2020, M.T. Özsu & P. Valduriez TS. Pan
10
Thị Hà_PTIT
Outline
◼ Introduction
❑
❑ History
❑
© 2020, M.T. Özsu & P. Valduriez TS. Pan
11
Thị Hà_PTIT
History – File Systems
© 2020, M.T. Özsu & P. Valduriez TS. Pan
12
Thị Hà_PTIT
History – Database Management
© 2020, M.T. Özsu & P. Valduriez TS. Pan
13
Thị Hà_PTIT
History – Early Distribution
Peer-to-Peer (P2P)
© 2020, M.T. Özsu & P. Valduriez TS. Pan
14
Thị Hà_PTIT
History – Client/Server
© 2020, M.T. Özsu & P. Valduriez TS. Pan
15
Thị Hà_PTIT
History – Data Integration
© 2020, M.T. Özsu & P. Valduriez TS. Pan
16
Thị Hà_PTIT
History – Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
◼ Cost savings: no need to maintain dedicated compute
power
◼ Elasticity: better adaptivity to changing workload
© 2020, M.T. Özsu & P. Valduriez TS. Pan
17
Thị Hà_PTIT
Data Delivery Alternatives
◼ Delivery modes
❑ Pull-only
❑ Push-only
❑ Hybrid
◼ Frequency
❑ Periodic
❑ Conditional
❑ Ad-hoc or irregular
◼ Communication Methods
❑ Unicast
❑ One-to-many
◼ Note: not all combinations make sense
© 2020, M.T. Özsu & P. Valduriez TS. Pan
18
Thị Hà_PTIT
Outline
◼ Introduction
❑
❑ Distributed DBMS promises
❑
© 2020, M.T. Özsu & P. Valduriez TS. Pan
19
Thị Hà_PTIT
Distributed DBMS Promises
Transparent management of distributed, fragmented,
and replicated data
Improved reliability/availability through distributed
transactions
Improved performance
Easier and more economical system expansion
© 2020, M.T. Özsu & P. Valduriez TS. Pan
Thị Hà_PTIT
Transparency
◼ Transparency is the separation of the higher-level
semantics of a system from the lower level
implementation issues.
◼ Fundamental issue is to provide
data independence
in the distributed environment
❑ Network (distribution) transparency
❑ Replication transparency
❑ Fragmentation transparency
◼ horizontal fragmentation: selection
◼ vertical fragmentation: projection
◼ hybrid
© 2020, M.T. Özsu & P. Valduriez TS. Pan
Thị Hà_PTIT
Example
© 2020, M.T. Özsu & P. Valduriez TS. Pan
22
Thị Hà_PTIT
Transparent Access
Tokyo
SELECT ENAME,SAL
FROM EMP,ASG,PAY Boston Paris
WHERE DUR > 12 Paris projects
Paris employees
AND EMP.ENO = ASG.ENO Communication Paris assignments
Network Boston employees
AND PAY.TITLE = EMP.TITLE
Boston projects
Boston employees
Boston assignments
Montreal
New
Montreal projects
York Paris projects
Boston projects New York projects
New York employees with budget > 200000
New York projects Montreal employees
New York assignments Montreal assignments
© 2020, M.T. Özsu & P. Valduriez TS. Pan
23
Thị Hà_PTIT
Distributed Database - User View
Distributed Database
© 2020, M.T. Özsu & P. Valduriez TS. Pan
24
Thị Hà_PTIT
Distributed DBMS - Reality
User
Query
User
DBMS
Application
Software
DBMS
Software
DBMS Communication
Software Subsystem
User
DBMS User Application
Software Query
DBMS
Software
User
Query
© 2020, M.T. Özsu & P. Valduriez TS. Pan
25
Thị Hà_PTIT
Types of Transparency
◼ Data independence
◼ Network transparency (or distribution transparency)
❑ Location transparency
❑ Fragmentation transparency
◼ Fragmentation transparency
◼ Replication transparency
© 2020, M.T. Özsu & P. Valduriez TS. Pan
26
Thị Hà_PTIT
Reliability Through Transactions
◼ Replicated components and data should make distributed
DBMS more reliable.
◼ Distributed transactions provide
Concurrency transparency
❑
❑ Failure atomicity
• Distributed transaction support requires implementation of
❑ Distributed concurrency control protocols
❑ Commit protocols
◼ Data replication
❑ Great for read-intensive workloads, problematic for updates
❑ Replication protocols
© 2020, M.T. Özsu & P. Valduriez TS. Pan
27
Thị Hà_PTIT
Potentially Improved Performance
◼ Proximity of data to its points of use
❑ Requires some support for fragmentation and replication
◼ Parallelism in execution
❑ Inter-query parallelism
❑ Intra-query parallelism
© 2020, M.T. Özsu & P. Valduriez TS. Pan
28
Thị Hà_PTIT
Scalability
◼ Issue is database scaling and workload scaling
◼ Adding processing and storage power
◼ Scale-out: add more servers
❑ Scale-up: increase the capacity of one server → has limits
© 2020, M.T. Özsu & P. Valduriez TS. Pan
29
Thị Hà_PTIT
Outline
◼ Introduction
❑
❑ Design issues
❑
© 2020, M.T. Özsu & P. Valduriez TS. Pan
30
Thị Hà_PTIT
Distributed DBMS Issues
◼ Distributed database design
❑ How to distribute the database
❑ Replicated & non-replicated database distribution
❑ A related problem in directory management
◼ Distributed query processing
❑ Convert user transactions to data manipulation instructions
❑ Optimization problem
◼ min{cost = data transmission + local processing}
❑ General formulation is NP-hard
© 2020, M.T. Özsu & P. Valduriez TS. Pan
31
Thị Hà_PTIT
Distributed DBMS Issues
◼ Distributed concurrency control
❑ Synchronization of concurrent accesses
❑ Consistency and isolation of transactions' effects
❑ Deadlock management
◼ Reliability
❑ How to make the system resilient to failures
❑ Atomicity and durability
© 2020, M.T. Özsu & P. Valduriez TS. Pan
32
Thị Hà_PTIT
Distributed DBMS Issues
◼ Replication
❑ Mutual consistency
❑ Freshness of copies
❑ Eager vs lazy
❑ Centralized vs distributed
◼ Parallel DBMS
❑ Objectives: high scalability and performance
❑ Not geo-distributed
❑ Cluster computing
© 2020, M.T. Özsu & P. Valduriez TS. Pan
33
Thị Hà_PTIT
Related Issues
◼ Alternative distribution approaches
❑ Modern P2P
❑ World Wide Web (WWW or Web)
◼ Big data processing
❑ 4V: volume, variety, velocity, veracity
❑ MapReduce & Spark
❑ Stream data
❑ Graph analytics
❑ NoSQL
❑ NewSQL
❑ Polystores
© 2020, M.T. Özsu & P. Valduriez TS. Pan
34
Thị Hà_PTIT
Outline
◼ Introduction
❑
❑ Distributed DBMS architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
35
Thị Hà_PTIT
DBMS Implementation Alternatives
© 2020, M.T. Özsu & P. Valduriez TS. Pan
36
Thị Hà_PTIT
Dimensions of the Problem
◼ Distribution
❑ Whether the components of the system are located on the same machine or
not
◼ Heterogeneity
❑ Various levels (hardware, communications, operating system)
❑ DBMS important one
◼ data model, query language,transaction management algorithms
◼ Autonomy
❑ Not well understood and most troublesome
❑ Various versions
◼ Design autonomy: Ability of a component DBMS to decide on issues related to its
own design.
◼ Communication autonomy: Ability of a component DBMS to decide whether and
how to communicate with other DBMSs.
◼ Execution autonomy: Ability of a component DBMS to execute local operations in
any manner it wants to.
© 2020, M.T. Özsu & P. Valduriez TS. Pan
37
Thị Hà_PTIT
Client/Server Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
38
Thị Hà_PTIT
Advantages of Client-Server
Architectures
◼ More efficient division of labor
◼ Horizontal and vertical scaling of resources
◼ Better price/performance on client machines
◼ Ability to use familiar tools on client machines
◼ Client access to remote data (via standards)
◼ Full DBMS functionality provided to client workstations
◼ Overall better system price/performance
© 2020, M.T. Özsu & P. Valduriez TS. Pan
39
Thị Hà_PTIT
Database Server
© 2020, M.T. Özsu & P. Valduriez TS. Pan
40
Thị Hà_PTIT
Distributed Database Servers
© 2020, M.T. Özsu & P. Valduriez TS. Pan
41
Thị Hà_PTIT
Peer-to-Peer Component Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
42
Thị Hà_PTIT
MDBS Components & Execution
© 2020, M.T. Özsu & P. Valduriez TS. Pan
43
Thị Hà_PTIT
Mediator/Wrapper Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
44
Thị Hà_PTIT
Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
◼ IaaS – Infrastructure-as-a-Service
◼ PaaS – Platform-as-a-Service
◼ SaaS – Software-as-a-Service
◼ DaaS – Database-as-a-Service
© 2020, M.T. Özsu & P. Valduriez TS. Pan
45
Thị Hà_PTIT
Simplified Cloud Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
46
Thị Hà_PTIT