Ceph Essentials
CEPH-101
May 30, 2014
Revision 02-0514
MSST 2014
COURSE OUTLINE
1 Module 1 - Course Introduction
  1.1 Course Overview
  1.2 Course Agenda
2 Module 2 - Ceph History/Components
  2.1 Module Objectives
  2.2 Storage landscape
  2.3 CEPH vs INKTANK vs ICE
  2.4 Cluster Components
  2.5 Ceph Journal
  2.6 Access methods
3 Module 3 - Data Placement
  3.1 Module Objectives
  3.2 CRUSH, PGs and Pools
  3.3 From object to OSD
  3.4 CRUSH Hierarchy, Weight, Placement and Peering
  3.5 Ceph OSD failures and expansion
4 Module 4 - RADOS
  4.1 Module Objectives
  4.2 Ceph Unified Storage Platform
  4.3 Replication
  4.4 Ceph Cluster Access
5 Module 5 - Ceph Block Device
  5.1 Module Objectives
  5.2 Ceph Block Device (aka RADOS Block Device)
  5.3 Ceph Block Devices - krbd vs librbd
  5.4 RBD Snapshots
  5.5 RBD Clones
6 Module 6 - Ceph Filesystem
  6.1 Module Objectives
  6.2 Ceph Metadata Server
  6.3 Dynamic Tree Partitioning
7 Module 7 - Creating A Ceph Cluster
  7.1 Module Objectives
  7.2 Ceph Configuration File
  7.3 Global section
  7.4 MON section
  7.5 MDS section
  7.6 OSD section
  7.7 Working with RADOS
  7.8 Working with RBDs
  7.9 Working with CephFS
8 Module 8 - Thanks For Attending
  8.1 Module Objectives
  8.2 Feedback
Module 1 - Course Introduction
Course Introduction
Course Overview
This instructor-led training course provides students with the
knowledge and skills needed to help them become proficient with
Ceph Storage cluster features
This training is based on Ceph v0.67.7 (Dumpling)
Course Objectives
After completing this course, delegates should be able to:
- Understand storage trends and Ceph project history
- Discuss Ceph concepts
- Identify the Ceph versions
- Identify Ceph components and operate some of them:
  - MONs, OSDs & MDSs
  - RADOS Gateway a.k.a. radosgw
  - RBD
  - CRUSH
- Create and deploy a Ceph Storage Cluster
Course Agenda
- Module 1: Course Introduction
- Module 2: Ceph History and Components
- Module 3: Ceph Data Placement
- Module 4: RADOS Object Store
- Module 5: Ceph Block Storage (Ceph RBD)
- Module 6: Ceph File systems (CephFS)
- Module 7: Creating a Ceph Storage Cluster
  - LAB: Deploying your cluster with ceph-deploy
Course Prerequisites
Attendees should have:
- Previous work experience with storage concepts
- Previous work experience with Linux-based servers
Course Catalog
The following training courses are available:
- CEPH-100: Ceph Fundamentals (ILT)
- CEPH-101: Ceph Essentials (WBT)
- CEPH-110: Ceph Operations & Tuning (ILT & VCT)
- CEPH-120: Ceph and OpenStack (ILT & VCT)
- CEPH-130: Ceph Unified Storage for OpenStack (VCT)
- CEPH-200: Ceph Open Source Development (ILT)
Course Material
The following material is made available to delegates:
- PDF files
  - Course slides in PDF format
  - Lab instructions in PDF format
  - Files are copy protected for copyright reasons
  - Files can be updated with sticky notes
- Lab Virtual Machine images in OVA format
  - VM Images are for training purposes only
  - VM Images should not be distributed
  - VM Images should not be used for production environments
Course Material
How to add a note to your PDF files
- Right click in the PDF file
- From the pop-up menu select Add Sticky Note
  - Shortcut is CTRL+6 on Microsoft Windows
  - Shortcut is CMD+6 on Apple OS X
- Do not forget to save your file
End Module 1
CEPH-101 : Course Introduction
Module 2 - Ceph History/Components
Ceph History and Components
Module Objectives
By the end of this module, you will be able to:
- Identify the requirements of storage infrastructures
- Identify the attributes of a Ceph Storage Cluster
- Understand Ceph design considerations
- Identify Ceph components and architecture:
  - The Ceph Storage Cluster (RADOS)
  - librados
  - The Ceph Gateway (radosgw)
  - The Ceph Block Device (RBD)
  - The Ceph File System (CephFS)
Storage Challenges
- Must support current infrastructures
  - Block storage with snapshots, cloning
  - File storage with POSIX support
- Must plan for integration of emerging technologies
  - Object storage to support massive influx of unstructured data
  - Cloud infrastructures and SDNs
- Must support new challenges
  - Coherent cache structured data
- All this while supporting:
  - Massive data growth
  - Mixed hardware that must work heterogeneously
  - Maintain reliability and fault tolerance
Storage Costs
- Money
  - More data, same budgets
  - Need a low cost per gigabyte
  - No vendor lock-in
- Time
  - Ease of administration
  - No manual data migration or load balancing
  - Seamless scaling - both expansion and contraction
Ceph Delivers
Ceph: The Future Of Storage
- A new philosophy
  - Open Source
  - Community-focused equals strong, sustainable ecosystem
- A new design
  - Scalable
  - No single point of failure
  - Software-based
  - Self-managing
  - Flexible
  - Unified
Ceph: Unified Storage
All the analysts will tell you that we're facing a data explosion. If you are responsible for managing data for your company, you don't need the analysts to tell you that. As disks become less expensive, it is easier for users to generate content. And that content must be managed, protected, and backed up so that it is available to your users whenever they request it.
Ceph: Technological Foundations
Built to address the following challenges:
- Every component must scale
- No single point of failure
- Software-based, not an appliance
- Open source
- Run on readily-available, commodity hardware
- Everything must self-manage wherever possible
Inktank Ceph Enterprise
Cluster Components
After completing this section you will be able to:
- Describe the architectural requirements of a Ceph Storage Cluster
- Explain the role of core RADOS components, including:
  - OSDs
  - Monitors
  - The Ceph journal
- Explain the role of librados
- Define the role of the Ceph Gateway (radosgw)
Ceph Cluster
Monitors
- What do they do?
  - They maintain the cluster map (1)
  - They provide consensus for distributed decision-making (2)
- What do they NOT do?
  - They don't serve stored objects to clients
- How many do we need?
  - There must be an odd number of MONs
  - 3 is the minimum number of MONs (3)
Note 1: Ceph Monitors are daemons. The primary role of a monitor is to maintain the state of the cluster by managing critical Ceph Cluster state and configuration information. The Ceph Monitors maintain a master copy of the CRUSH Map, and Ceph Daemons and Clients can check in periodically with the monitors to be sure they have the most recent copy of the map.
Note 2: The monitors must establish a consensus regarding the state of the cluster, which is why there must be an odd number of monitors.
Note 3: In critical environments, and to provide even more reliability and fault tolerance, it can be advisable to run up to 5 Monitors.
In order for the Ceph Storage Cluster to be operational and accessible, more than half of the Monitors must be running and operational. If the count drops below that, the complete Ceph Storage Cluster becomes inaccessible to any client, because Ceph always guarantees the integrity of the data over its accessibility.
For your information, the Ceph Storage Cluster maintains different maps for its operations:
- MON Map
- OSD Map
- CRUSH Map
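Quorum and monitor state can be checked from the command line at any time; a quick sketch using standard monitor commands (output omitted):

# Overall cluster status, including the monitor quorum
ceph -s
# Monitor-specific summary and detailed quorum information
ceph mon stat
ceph quorum_status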
Object Storage Node
- Object Storage Device (OSD)
  - Building block of the Ceph Storage Cluster
  - One hard disk
  - One Linux file system
  - One Ceph OSD Daemon
- File System:
  - File system must be btrfs, xfs or ext4 (1)
  - Must have XATTRs enabled (2)
The OSD connects a disk to the Ceph Storage Cluster.
Note 1: The only file systems supported in the Dumpling and Emperor versions are XFS and EXT4. BTRFS, although very promising, is not yet ready for production environments. There is also ongoing work to someday provide support for ZFS.
Note 2: The Extended Attributes of the underlying file system are used for storing and retrieving information about the internal
object state, snapshot metadata, and Ceph Gateway access control lists (ACLs)
Ceph OSD Daemon
- Ceph Object Storage Device Daemon
  - Intelligent Storage Servers (1)
  - Serve stored objects to clients
  - Atomic transactions
  - Synchronization and notifications
  - Send computation to the data
  - Supports extended object classes
- OSD is primary for some objects
  - Responsible for replication
  - Responsible for coherency
  - Responsible for re-balancing
  - Responsible for recovery
- OSD is secondary for some objects
  - Under control of the primary
  - Capable of becoming primary
Note 1: The overall design and goal of the OSD is to bring the computing power as close as possible to the data and to let it perform the maximum it can. For now, it processes the functions listed in the bullet lists, depending on its role (primary or secondary), but in the future, Ceph will probably leverage the close link between the OSD and the data to extend the computational power of the OSD.
For example: the OSD could drive creation of the thumbnail of an object rather than having the client be responsible for such an operation.
xfs, btrfs, or ext4?
Ceph requires a modern Linux file system. We have tested XFS, btrfs and ext4, and these are the supported file systems. Full-scale and extensive tests have been performed on btrfs, but it is not recommended for production environments.
Right now, for stability, the recommendation is to use XFS.
Ceph Journal
Ceph OSDs
- Write small, random I/O to the journal sequentially (1)
- Each OSD has its own journal (2)
Journals use raw volumes on the OSD nodes.
Note 1: This is done for speed and consistency. It speeds up workloads by allowing the host file system more time to coalesce writes, because of the small I/O requests being performed.
Note 2: By default, the Ceph Journal is written to the same disk as the OSD data. For better performance, the Ceph Journal should be configured to write to its own drive, such as an SSD.
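A minimal sketch of what that looks like in ceph.conf, assuming a dedicated SSD partition (the host name and device path are placeholders):

[osd.0]
    host = daisy
    osd journal = /dev/sdb1    ; example: journal on a dedicated SSD partition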
Ceph Journal
Ceph OSDs
- Write a description of each operation to the journal (1)
- Then commit the operation to the file system (2)
- This mechanism enables atomic updates to an object
Ceph OSD or Ceph OSD Node failures
- Upon restart, OSD(s) replay the journal (3)
Note 1: The write to the Ceph cluster will be acknowledged when the minimum number of replica journals have been written to.
Note 2: The OSD stops writing every few seconds and synchronizes the journal with the file system commits performed, so it can trim operations from the journal to reclaim space on the journal disk volume.
Note 3: The replay sequence will start after the last sync operation, as previous journal records were trimmed out.
Communication methods
Communication with the Ceph Cluster:
- librados: native interface to the Ceph Cluster (1)
- Ceph Gateway: RESTful APIs for S3 and Swift compatibility (2)
- libcephfs: access to a Ceph Cluster via a POSIX-like interface
- librbd: file-like access to Ceph Block Device images (3)
Note 1: Service interfaces built on top of this native interface include the Ceph Block Device, the Ceph Gateway, and the Ceph File System.
Note 2: Amazon S3 and OpenStack Swift. The Ceph Gateway is referred to as radosgw.
Note 3: Also available as a Python module.
librados
librados is a native C library that allows applications to work
with the Ceph Cluster (RADOS). There are similar libraries
available for C++, Java, Python, Ruby, and PHP.
When applications link with librados, they are enabled to interact
with the objects in the Ceph Cluster (RADOS) through a native
protocol.
Ceph Gateway
The Ceph Gateway (also known as the RADOS Gateway)
- HTTP REST gateway used to access objects in the Ceph Cluster
- Built on top of librados, and implemented as a FastCGI module using libfcgi
- Compatible with Amazon S3 and OpenStack Swift APIs
The gateway application sits on top of a web server; it uses the librados library to communicate with the Ceph cluster and will write to OSD processes directly.
The Ceph Gateway (also known as the RADOS Gateway) is an HTTP REST gateway used to access objects in the Ceph Cluster. It is built on top of librados, implemented as a FastCGI module using libfcgi, and can be used with any FastCGI-capable web server. Because it uses a unified namespace, it is compatible with Amazon S3 RESTful APIs and OpenStack Swift APIs.
Ceph Block Device - RBD
- Block devices are the most common way to store data
- Allows for storage of virtual disks in the Ceph Object Store
- Allows decoupling of VMs and containers
- High performance through data striping across the cluster
- Boot support in QEMU, KVM, and OpenStack (Cinder)
- Mount support in the Linux kernel
RBD stands for RADOS Block Device.
Data is striped across the Ceph cluster.
CephFS
The Ceph File System is a parallel file system that provides a massively scalable, single-hierarchy, shared disk (1 & 2).
The MDS stores its data in the Ceph cluster.
It stores only metadata information for the files stored in the file system (access, modify and create dates, for example).
Note 1: CephFS is not currently supported in production environments. Even outside a production environment, you should only use an active/passive MDS configuration and not use snapshots.
Note 2: CephFS should not be mounted on a host that is a node in the Ceph Object Store.
End Module 2
CEPH-101 : Ceph History & Components
Module 3 - Data Placement
Ceph Data Placement
Module Objectives
After completing this module you will be able to:
- Define CRUSH
- Discuss the CRUSH hierarchy
- Explain where to find CRUSH rules
- Explain how the CRUSH data placement algorithm is used to determine data placement
- Understand Placement Groups in Ceph
- Understand Pools in Ceph
What is CRUSH?
A pseudo random placement algorithm
CRUSH
CRUSH (Controlled Replication Under Scalable Hashing)
- Ceph's data distribution mechanism
- Pseudo-random placement algorithm
  - Deterministic function of inputs
  - Clients can compute data location
- Rule-based configuration
  - Desired/required replica count
  - Affinity/distribution rules
  - Infrastructure topology
  - Weighting
- Excellent data distribution
  - De-clustered placement
- Excellent data re-distribution
  - Migration proportional to change
CRUSH is in essence the crown jewel of the Ceph Storage Cluster.
The OSD will make its decisions based on the CRUSH map it holds and will adapt when it receives a new one.
Placement Groups (PGs)
- What is a PG?
  - A PG is a logical collection of objects
  - Objects within the PG are replicated by the same set of devices
- How is the PG chosen?
  - With CRUSH, data is first split into a certain number of sections
  - Each section is called a placement group (PG)
  - An object's PG is determined by CRUSH
  - The choice is based on:
    - the hash of the object name,
    - the desired level of replication,
    - the total number of placement groups in the system
A Placement Group (PG) aggregates a series of objects into a group, and maps the group to a series of OSDs.
Placement Groups (PGs)
- Without them
  - Ceph would have to track placement and metadata on a per-object basis
  - Not realistic nor scalable with millions of objects
- Extra benefits
  - Reduce the number of processes
  - Reduce the amount of per-object metadata Ceph must track
- Handling the cluster life cycle
  - The total number of PGs must be adjusted when growing the cluster
  - As devices leave or join the Ceph cluster, most PGs remain where they are
  - CRUSH will adjust just enough of the data to ensure uniform distribution
Note 1: Tracking object placement and object metadata on a per-object basis is computationally expensive - i.e., a system with millions of objects cannot realistically track placement on a per-object basis. Placement groups address this barrier to performance and scalability. Additionally, placement groups reduce the number of processes and the amount of per-object metadata Ceph must track when storing and retrieving data.
Note 2: Increasing the number of placement groups reduces the variance in per-OSD load across your cluster. We recommend approximately 50-200 placement groups per OSD to balance out memory and CPU requirements and per-OSD load. For a single pool of objects, you can use the following formula: Total Placement Groups = (OSDs * (50-200)) / Number of replicas.
When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD, so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.
1. ceph osd pool set <pool-name> pg_num <pg_num>
2. ceph osd pool set <pool-name> pgp_num <pgp_num>
The pgp_num parameter should be equal to pg_num.
The second command will trigger the rebalancing of your data.
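As a worked example of the formula above, assume a hypothetical cluster of 9 OSDs and 3 replicas: (9 * 100) / 3 = 300, usually rounded up to the next power of two. The pool name below is illustrative:

# Hypothetical: 9 OSDs, 3 replicas -> (9 * 100) / 3 = 300 -> round up to 512
ceph osd pool create mypool 512 512
# Or raise the PG count of an existing pool (pgp_num should follow pg_num)
ceph osd pool set mypool pg_num 512
ceph osd pool set mypool pgp_num 512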
Pools
- What are pools?
  - Pools are logical partitions for storing object data
- Pools have the following attributes:
  - Set ownership/access
  - Set number of object replicas
  - Set number of placement groups
  - Set the CRUSH rule set to use
- The PGs within a pool are dynamically mapped to OSDs
When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data.
A pool has a default number of replicas. It is currently 2, but the Firefly version will bump the default up to 3.
A pool differs from CRUSH's location-based buckets in that a pool doesn't have a single physical location, and a pool provides you with some additional functionality, including:
Replicas: You can set the desired number of copies/replicas of an object.
- A typical configuration stores an object and one additional copy (i.e., size = 2), but you can determine the number of copies/replicas.
Placement Groups: You can set the number of placement groups for the pool.
- A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole.
CRUSH Rules: When you store data in a pool, a CRUSH rule set mapped to the pool enables CRUSH
- to identify a rule for the placement of the primary object and object replicas in your cluster. You can create a custom CRUSH rule for your pool.
Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.
Set Ownership: You can set a user ID as the owner of a pool.
Pools
- To create a pool, you must:
  - Supply a name
  - Supply how many PGs can belong to the pool
- When you install Ceph, the default pools are:
  - data
  - metadata
  - rbd
To organize data into pools, you can list, create, and remove pools. You can also view the utilization statistics for each pool.
Listing the pools:
ceph osd lspools
Creating the pools:
ceph osd pool create {pool-name} {pg-num} [{pgp-num}]
Deleting the pools:
ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
Renaming the pools:
ceph osd pool rename {current-pool-name} {new-pool-name}
Statistics for the pools:
rados df
Snapshotting pools:
ceph osd pool mksnap {pool-name} {snap-name}
Removing a snapshot:
ceph osd pool rmsnap {pool-name} {snap-name}
Pools
Pool attributes
- Getting pool values: ceph osd pool get {pool-name} {key}
- Setting pool values: ceph osd pool set {pool-name} {key} {value}
Attributes:
- size: number of replica objects
- min_size: minimum number of replicas available for I/O
- crash_replay_interval: number of seconds to allow clients to replay acknowledged, but uncommitted requests
- pgp_num: effective number of placement groups to use when calculating data placement
- crush_ruleset: ruleset to use for mapping object placement in the cluster (CRUSH Map module)
- hashpspool: get the HASHPSPOOL flag on a given pool
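For instance, using the default rbd pool as an illustrative target, the attributes above can be inspected and changed like this:

# Read the replica count and PG count of the 'rbd' pool
ceph osd pool get rbd size
ceph osd pool get rbd pg_num
# Raise the replica count to 3 and require at least 2 replicas for I/O
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2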
From Object to OSD
To generate the PG ID, we use:
- The pool ID
- A hash of the object name, modulo the number of PGs
The first OSD in the list returned is the primary OSD; the next ones are secondary.
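You can ask the cluster to show this calculation for any object name with ceph osd map; the pool name, object name and resulting IDs below are purely illustrative:

# Show which PG and which OSDs (up/acting set) an object maps to
ceph osd map data my-object
# Illustrative output (epoch, PG id and OSD list will differ):
# osdmap e42 pool 'data' (0) object 'my-object' -> pg 0.7b -> up [2,5,9] acting [2,5,9]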
CRUSH Map Hierarchy
The command used to view the CRUSH Map is: ceph osd tree
CRUSH Map Hierarchy
- Device list: list of OSDs
- Buckets: hierarchical aggregation of storage locations
  - Buckets have an assigned weight
  - Buckets have a type:
    - root (1)
    - datacenter
    - rack
    - host
    - osd
- Rules: define data placement for pools
Note 1: Ceph pool
CRUSH Map Hierarchy
The CRUSH Map contains:
- A list of OSDs
- A list of the rules to tell CRUSH how data is to be replicated
A default CRUSH map is created when you create the cluster.
The default CRUSH Map is not suited for production clusters (1).
Note 1: This default CRUSH Map is fine for a sandbox-type installation only! For production clusters, it should be customized for better management, performance, and data security.
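A sketch of the standard customization workflow (file names are placeholders) is to export, decompile, edit and re-inject the map:

# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt (buckets, weights, rules) ...
# Recompile and load it back into the cluster
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new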
Weights and Placement
- What are weights used for?
  - They define how much data should go to an OSD
  - If the weight is 0, no data will go to the OSD
  - If the weight is > 0, data will go to the OSD
  - Weights have no cap
- What can they be used for?
  - Weights can be used to reflect the storage capacity of an OSD
  - Weights can be used to offload a specific OSD
- OSD daemon health status
  - up: running
  - down: not running or can't be contacted
  - in: holds data
  - out: does NOT hold data
As a quick way to remember it, the weight value indicates the proportion of data an OSD will hold if it is up and running.
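CRUSH weights are changed per OSD; for example (osd.7 and the 1.0-per-TB convention are illustrative assumptions):

# Weight an OSD proportionally to its capacity (common convention: 1.0 per TB)
ceph osd crush reweight osd.7 2.0
# Drain an OSD gradually by sending its data elsewhere
ceph osd crush reweight osd.7 0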
Peering and Recovery
- The Ceph Object Store is dynamic
  - Failure is the norm rather than the exception
  - The cluster map records cluster state at a point in time
  - The cluster map contains the OSDs' status (up/down, weight, IP)
- OSDs cooperatively migrate data
  - They do so to achieve recovery based on CRUSH rules
  - Any cluster map update potentially triggers data migration
By default, if an OSD has been down for 5 minutes or more, we will start copying data to other OSDs in order to satisfy the number of replicas the pool must hold.
Remember that if the number of replicas available goes below the min_size pool parameter, no I/O will be served.
CRUSH
When it comes time to store an object in the cluster (or retrieve one), the client
calculates where it belongs.
CRUSH
- What happens, though, when a node goes down?
  - The OSDs are always talking to each other (and the monitors)
  - They know when something is wrong
  - The 3rd & 5th nodes notice that the 2nd node on the bottom row is gone
  - They are also aware that they have replicas of the missing data
CRUSH
- The OSDs collectively:
  - Use the CRUSH algorithm to determine how the cluster should look based on its new state
  - Move the data to where clients running CRUSH expect it to be
CRUSH
- Placement is calculated rather than centrally controlled
- Node failures are transparent to clients
Example: OSD failure
- ceph-osd daemon dies
  - Peers' heartbeats fail; peers inform the monitor
  - New osdmap published with osd.123 as 'down'
- PG maps to fewer replicas
  - If osd.123 was primary in a PG, a replica takes over
  - PG is 'degraded' (N-1 replicas)
  - Data redistribution is not triggered
- Monitor marks OSD 'out' after 5 minutes (configurable)
  - PG now maps to N OSDs again
  - PG re-peers, activates
  - Primary backfills to the 'new' OSD
Re-peering should be quick:
- A few seconds for 1 PG
- Less than 30 seconds for 100 PGs
- Less than 5 minutes for 1000 PGs
When this process takes place, remember to check the private network where copying/replicating takes place.
You need a primary PG to satisfy a read or a write I/O request.
The client will experience increased latency during the re-peering and before the new primary PG gets elected.
A PG being re-peered will not accept read and write operations.
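The "5 minutes (configurable)" above corresponds to a monitor option, and the recovery can be watched live; a sketch (the interval value shown is the usual default, quoted as an assumption):

[mon]
    mon osd down out interval = 300   ; seconds before a "down" OSD is marked "out"

# Watch the cluster react to a failed OSD
ceph -w
ceph health detail
ceph osd tree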
Example - OSD Expansion
- New rack of OSDs added to the CRUSH map
- Many PGs map to new OSDs
  - Temporarily remap the PG to include previous replicas + new OSDs
  - Keeps the replica count at (or above) target
  - Peer and activate
  - Background backfill to new OSDs
  - Drop the re-mapping when completed, activate
  - Delete PG data from the old 'stray' replica
Remember the redistribution is proportional to the change introduced.
For a while, you will use extra space because of the copies that sit on the new OSDs plus the old copies that remain until they are disposed of when the backfill completes.
End Module 3
CEPH-101 : Data Placement
Module 4 - RADOS
RADOS
Module Objectives
After completing this module you will be able to:
- Understand requirements for a Ceph Storage Cluster deployment
- Deploy a Ceph Storage Cluster
What Is Ceph?
- A storage infrastructure:
  - Uses Object Storage at its core
  - Delivers at the object level:
    - Scalability
    - Redundancy
    - Flexibility
- An Object Store:
  - Massively distributed, scalable and highly available
- An Object Store with many access methods:
  - Block level access
  - File level access
  - RESTful access
  - Delivered through client layers and APIs
The Ceph Storage Cluster
- In a Ceph Storage Cluster, the individual unit of data is an object
- Objects have:
  - A name
  - A payload (contents)
  - Any number of key-value pairs (attributes)
- The object namespace is:
  - Flat, thus not hierarchical
- Objects are:
  - Physically organized in Placement Groups (PGs)
  - Stored by Object Storage Daemons (OSDs)
Replication
The Ceph Storage Cluster
- Object Replication
  - Placement Groups (and the objects they contain) are synchronously replicated over multiple OSDs.
  - Ceph uses Primary-Copy replication between OSDs.
Object Replication Principle
Replication Principle
OSD Storage for RADOS objects
- On the OSD local storage
  - Objects are stored as the contents of files in the underlying local file system
- File system choice
  - Any user_xattr capable file system
  - btrfs will be recommended "when it's ready"
  - xfs is the best option for production use today
The rados command-line utility is one of the many standard Ceph tools.
Monitors (MONs)
- Monitor Servers (MONs)
  - They arbitrate between OSDs
- Quorum management
  - Maintain the Ceph Storage Cluster quorum
  - Based on the Paxos distributed consensus algorithm
- CAP theorem of distributed systems
  - Ceph MONs pick Consistency and Partition tolerance over Availability
  - A non-quorate partition is unavailable
RADOS/Ceph Client APIs
Ceph offers access to the RADOS object store through:
- A C API (librados.h)
- A C++ API (librados.hpp)
- Documented and simple API
- Bindings for various languages
- Doesn't implement striping (1)
Ceph Gateway
- RADOS Gateway
  - Provides a RESTful API to RADOS
  - Supports both OpenStack Swift APIs and Amazon S3 APIs
  - Additional APIs can be supported through plugins
- Based on the FastCGI interface
  - Runs with Apache with mod_fastcgi
  - Also supported in other web servers
Ceph Gateway Overview
Ceph Gateway And RADOS
Ceph Hadoop Integration
- Ceph Hadoop Shim
  - A Ceph replacement for HDFS
  - Provides a C++ JNI library for the Hadoop Java code
  - Requires a patch to Hadoop
  - Has not been upstreamed
- Drop-in replacement for HDFS
  - Does away with the NameNode SPOF in Hadoop
Summary
- Ceph is a distributed storage solution
- Ceph is resilient against outages
- Ceph uses a decentralized structure
- Ceph's backend is RADOS, which converts data into objects and stores them in a distributed manner
- Ceph stores are accessible via clients:
  - a standard Linux filesystem (CephFS)
  - a Linux block device driver (RBD)
  - via any code using librados
    - QEMU
    - RADOSGW
    - ...
End Module 4
CEPH-101 : RADOS
Module 5 - Ceph Block Device
Ceph Block Device
Module Objectives
After completing this module you will be able to:
- Describe how to access the Ceph Storage Cluster (RADOS) using block devices
- List the types of caching that are used with Ceph Block Devices
- Explain the properties of Ceph Snapshots
- Describe how the cloning operations on Ceph Block Devices work
Ceph Block Device - RBD
- Block devices are the most common way to store data
- Allows for storage of virtual disks in the Ceph Object Store
- Allows decoupling of VMs and containers
- High performance through data striping across the cluster
- Boot support in QEMU, KVM, and OpenStack (Cinder)
- Mount support in the Linux kernel
RBD stands for RADOS Block Device.
Data is striped across the Ceph cluster.
RBD: Native
The Ceph Block Device interacts with Ceph OSDs using the librados and librbd libraries.
The Ceph Block Devices are striped over multiple OSDs in a Ceph Object Store.
Ceph Block Device: Native
Machines (even those running on bare metal) can mount an RBD image using native Linux kernel drivers.
RBD: Virtual Machines
- The librbd library
  - Maps data blocks into objects for storage in the Ceph Object Store
- Virtualization containers
  - KVM or QEMU can use VM images that are stored in RADOS
  - Virtualization containers can also use RBD block storage in OpenStack and CloudStack platforms
- Ceph based VM images
  - Inherit librados capabilities such as snapshots and clones
  - Are striped across the entire cluster
  - Allow simultaneous read access from different cluster nodes
Virtualization containers can boot a VM without transferring the boot image to the VM itself.
The config file rbdmap will tell which RBD devices need to be mounted.
RBD: Virtual Machines
As far as the VM is concerned, it sees a block device and is not even aware of the Ceph cluster.
Software requirements
RBD usage has the following software requirements:
- krbd: the kernel RADOS block device (rbd) module, which gives the Linux kernel native access to RBD images
- librbd: a shared library that allows applications to access Ceph Block Devices
- QEMU/KVM: a widely used open source hypervisor. More info on the project can be found at http://wiki.qemu.org/Main_Page
- libvirt: the virtualization API that supports KVM/QEMU and other hypervisors. Since the Ceph Block Device supports QEMU/KVM, it can also interface with software that uses libvirt.
You will be dependent on the kernel version for the best performance and to avoid bugs.
A kernel version of at least 3.8 is highly recommended.
To use an RBD directly in the VM itself, you need to:
- Install librbd (will also install librados)
- Then map the RBD device
librbd Cache Disabled
Caching Techniques:
- The rbd kernel device itself is able to access and use the Linux page cache to improve performance if necessary.
- librbd caching is called RBD caching
  - RBD caching uses a least recently used (LRU) algorithm
  - By default, the cache is not enabled
  - Ceph supports write-back and write-through caching
  - The cache is local to the Ceph client
  - Each Ceph Block Device image has its own cache
  - The cache is configured in the [client] section of the ceph.conf file
librbd cannot leverage the Linux page cache for its own use. Therefore librbd implements its own caching mechanism.
By default, caching is disabled in librbd.
Note 1: In write-back mode, librbd caching can coalesce contiguous requests for better throughput.
We offer Write Back (aka Cache Enabled, which is the activation default) and Write Through support.
Be cautious with Write Back, as the host will be caching and will acknowledge the write I/O request as soon as data is placed in the server's local librbd cache.
Write Through is highly recommended for production servers to avoid losing data in case of a server failure.
librbd Cache Settings
Supported caching settings:
- Caching not enabled
  - Reads and writes go to the Ceph Object Store
  - Write I/O only returns when data is on the disks of all replicas
- Cache enabled (Write Back)
  - Consider 2 values: un-flushed cache bytes U and max dirty cache bytes M
  - Writes are returned immediately if U < M
  - Otherwise, only after writing data back to disk until U < M
- Write-through caching
  - Max dirty bytes is set to 0 to force write-through
Note 1: In write-back mode the cache can coalesce contiguous requests for better throughput.
The ceph.conf settings for RBD should be set in the [client] section of your configuration file.
The settings include (defaults in parentheses):
- rbd cache: enable RBD caching; value is true or false
- rbd cache size: the RBD cache size in bytes (32 MB)
- rbd cache max dirty: the dirty byte threshold that triggers write-back; must be less than rbd cache size (24 MB)
- rbd cache target dirty: the dirty target before the cache starts writing back data; does not hold write I/Os to the cache (16 MB)
- rbd cache max dirty age: number of seconds dirty bytes are kept in the cache before writing back (1)
- rbd cache writethrough until flush: start in write-through mode and switch to write-back after the first flush occurs; value is true or false
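Put together, a minimal [client] section enabling write-back caching might look like this (values are illustrative, not tuning advice):

[client]
    rbd cache = true
    rbd cache size = 33554432              ; 32 MB
    rbd cache max dirty = 25165824         ; 24 MB
    rbd cache target dirty = 16777216      ; 16 MB
    rbd cache writethrough until flush = true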
Snapshots
Snapshots:
- Are instantly created
- Are read-only copies
  - They do not change
  - Data is read from the original data
- Do not use space
  - Until the original data changes
- Support incremental snapshots (1)
Note 1: Since Cuttlefish, we support incremental snapshots.
Clones
- Clone creation
  - Create a snapshot
  - Protect the snapshot
  - Clone the snapshot
- Clone behavior
  - A clone is created from a snapshot
  - Like any other RBD image, you can:
    - Read from it
    - Write to it
    - Clone it
    - Resize it
Clones
- Ceph supports
  - Many COW clones (1)
- Clones are
  - Copies of snapshots
  - Writable (2)
- Original data protection
  - The original snapshot is never written to
Note 1: Reads are always served from the original snapshot used to create the clone. Ceph supports many copy-on-write clones of a snapshot.
Note 2: Snapshots are read-only!
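The create/protect/clone workflow above maps to the rbd command line; the pool and image names here are illustrative:

# Snapshot an image, protect the snapshot, then clone it
rbd snap create rbd/golden@base
rbd snap protect rbd/golden@base
rbd clone rbd/golden@base rbd/vm-disk-01
# List children of the snapshot, or detach a clone from its parent
rbd children rbd/golden@base
rbd flatten rbd/vm-disk-01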
Clones
- Read operation
  - Goes through to the original copy
  - Unless there is new data (1)
Note 1: If data has been updated in the clone, data is read from the clone mounted on the host.
End Module 5
CEPH-101 : Ceph Block Device
Module 6 - Ceph Filesystem
Ceph Filesystem
Module Objectives
At the end of this lesson, you will be able to:
- Describe the methods to store and access data using the Ceph File System
- Explain the purpose of a metadata server cluster (MDS)
Metadata Server (MDS)
- Manages metadata for a POSIX-compliant shared file system
  - Directory hierarchy
  - File metadata (owner, timestamps, mode, etc.)
  - Stores metadata in RADOS
  - Does not access file content
  - Only required for the shared file system (1)
- The Ceph Metadata Server daemon (MDS)
  - Provides the POSIX information needed by file systems that enables CephFS to interact with the Ceph Object Store
  - It remembers where data lives within a tree (2)
  - Clients accessing CephFS data first make a request to an MDS, which provides what they need to get files from the right OSDs (3)
  - If you aren't running CephFS, MDS daemons do not need to be deployed.
Note 1: The MDS requires a 64-bit OS because of the size of the inodes. This also means that ceph-fuse must also be run from a 64-bit capable client.
Note 2: CephFS also keeps the recursive size of each directory, which appears at each level (. & .. directory names).
There are 2 ways to mount the file system:
1. The kernel-based client
2. The ceph-fuse tool (the only alternative supported on kernels older than 2.6.32, which do not have the CephFS client)
ceph-fuse is most of the time slower than the CephFS kernel module.
Note 3: To mount with the kernel module, issue mount -t ceph <mon1,mon2,...>, listing all MON-running nodes for MON failure fault tolerance.
To create a snapshot of the file system: in the .snap directory of the file system, create a directory, and that's it. From the file system root directory tree, issue the mkdir ./.snap/snap_20131218_100000 command.
To delete a snapshot of the file system, remove the corresponding snapshot directory in the .snap directory, and that's it. From the file system root directory tree, issue the rmdir ./.snap/snap_20131218_100000 command.
MDS High Availability
- MDSs can be running in two modes
  - Active
  - Standby
- A standby MDS can become active
  - If the previously active daemon goes away
- Multiple active MDSs for load balancing
  - Are a possibility
  - This configuration is currently not supported/recommended
Metadata Server (MDS)
MDS functionality
- The client learns about MDSs and OSDs from the MON
  - via the MON Map, OSD Map and MDS Map
- Clients talk to the MDS for access to metadata
  - Permission bits, ACLs, file ownership, etc.
- Clients talk to the OSDs for access to data
- MDSs themselves store all their data in OSDs
  - In a separate pool called metadata
DTP
Stands for Dynamic Tree Partitioning
Summary
- Metadata servers are called MDSs
- MDSs provide CephFS clients with POSIX-compatible metadata
- The number of MDSs is unlimited
- In Ceph, a crashed MDS does not lead to downtime
- MDSs are not required if you do not use CephFS
End Module 6
CEPH-101 : Ceph FileSystem (CephFS)
Module 7 - Creating A Ceph Cluster
Creating A Cluster
Module Objectives
At the end of this module, you will be able to:
- Understand the deployment of a cluster with ceph-deploy
- Locate the Ceph configuration file
- Understand the format of the Ceph configuration file
- Update the Ceph configuration file
- Know the importance of the sections
- Know the differences in the section naming conventions
Deploy a Ceph Storage Cluster
How to set up a Ceph cluster:
- ceph-deploy
- Manual cluster creation
  - Through the CLI
- Automated cluster creation
  - Puppet
  - Chef
  - Juju
  - Crowbar
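As a rough sketch of what a ceph-deploy run looks like (hostnames and the /dev/sdb device are placeholders, and the exact sub-commands vary slightly between ceph-deploy releases):

# From an admin node with SSH access to the target hosts
ceph-deploy new mon1
ceph-deploy install mon1 osd1 osd2
ceph-deploy mon create-initial
ceph-deploy osd prepare osd1:/dev/sdb osd2:/dev/sdb
ceph-deploy osd activate osd1:/dev/sdb1 osd2:/dev/sdb1
ceph-deploy admin mon1 osd1 osd2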
Creating a Ceph cluster
Getting started:
- Ceph supports encrypted authentication (cephx)
- Starting with Ceph 0.48, the default location for the per-daemon keyrings is $datadir/keyring
- We recommend always using the default paths
  - datadir is defined with osd data, mds data, etc.
  - Udev hooks, ceph-disk and Upstart/Sysvinit scripts use those default paths
Ceph Configuration File
Understanding the /etc/ceph/ceph.conf file:
- INI-based file format with sections:
  - The section name/header information
  - The parameter information
    [name]
    parameter = value (1)
- Contains a [global] section
- Can define all MONs, OSDs and MDSs
- Also contains settings to enable authentication (cephx)
- Defines client-specific configuration
Note 1: Parameter names can use a space or _ as a separator, e.g. osd journal size or osd_journal_size.
The [global] section
Defines the cluster-wide parameters:
- Comments can be added with ";" at the beginning of a line
- Typically used to enable or disable cephx
- Typically used to define separate networks
  - One for OSDs (cluster network)
  - One for clients (public network)
- See page notes for an example
Usually you use the global section to enable or disable general options such as cephx authentication.
Cephx is the mechanism that will let you set permissions.
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
public network = {network}[, {network}]
cluster network = {network}[, {network}]
mon initial members = {hostname}[, {hostname}]
mon host = {ip-address}[, {ip-address}]
osd journal size = 1024
filestore xattr use omap = true ; required for EXT4
The [mon] sections
Monitor servers configuration:
[mon.a]
host = daisy
mon addr = 192.168.122.111:6789
[mon.b]
host = eric
mon addr = 192.168.122.112:6789
- The Monitor section names are suffixed by letters
  - The first letter is a, then it increments
- For Monitor-wide parameters (log, for example):
[mon]
parameter = value
ceph-deploy builds a default ceph.conf file by default.
The global section can contain the list of the monitor members, so that we can build the monitor map as soon as we have a quorum.
mon_initial_members is a Best Practice parameter to use in the global section.
Individual MON sections are often used for setting specific options, such as debugging on a specific monitor.
In production, do not use ceph-deploy. Go through the manual deployment of monitors:
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
The [mon] sections
The settings in detail:
- mon addr
  - Used to configure the mon listening IP address and listening port
- host
  - Used by mkcephfs only
  - Used by Chef-based & ceph-deploy deployments, as well as [global] mon initial members
- Mon section name
  - $id by default resolves to letters (e.g. a)
  - $name resolves to the full name (e.g. [mon.a])
The host parameter is used by mkcephfs, so you should not use it, as this command is deprecated.
Keep the ceph.conf file as slim as possible.
The [mon] sections
Since Cuttlefish:
- Declaring every Monitor is not required
- The only mandatory parameters are, in the [global] section:
  - mon host = x,y,z (IPs)
  - mon initial members = a,b,c (hostnames)
- Daemon start and stop
  - Upstart/Sysvinit scripts will parse the default monitor paths
    /var/lib/ceph/mon/$cluster-`hostname`
  - The directory must contain:
    - A done file
    - An upstart or sysvinit file for the Monitor to be started
Note 1: The mon_initial_members parameter avoids a split brain during the first start, making sure quorum is gained as soon as possible.
The default install path is: /var/lib/ceph/mon/$cluster-`hostname`
The best practice is to use the default path.
The [mds] sections
- Similar configuration to the MONs
  - Upstart/Sysvinit will also start by parsing the default paths if not in ceph.conf
  - The [mds] entry is typically left empty
  - MDSs don't necessarily require additional configuration
  - The default MDS path must contain the same files as the MONs
- MDS section name
  - $id by default resolves to numbers (e.g. 0)
  - $name by default resolves to the full name (e.g. [mds.0])
- See the example in the page notes:
[mds.0]
host = daisy
[mds.1]
host = eric
The [osd] sections
- Similar configuration to the MONs
  - The section name prefix is osd
[osd]
osd data = /var/lib/ceph/osd/$cluster-$id (1)
osd journal = /var/lib/ceph/osd/$cluster-$id/journal (1)
osd journal size = 256 ; Size, in megabytes
; filestore xattr use omap = true ; for ext3/4
[osd.0]
host = daisy
[osd.1]
host = eric
Note 1: If the journal is to be changed:
1. Stop the OSD
2. Modify the file
3. Run ceph-osd -i <id> --mkjournal to start a new journal
4. Restart the OSD
The [osd] sections
- Configuring OSDs
  - The data and journal settings here resemble the standard defaults and are thus redundant
  - They are shown for educational purposes in this example only
  - The default value for $cluster is ceph
  - When using EXT3 and EXT4 as a file store, filestore xattr use omap = true is a requirement
- OSD section name
  - $id by default resolves to numbers (e.g. 0)
  - $name by default resolves to the full name (e.g. [osd.0])
- Journal parameters
  - Can point to a fast block device rather than a file (SSDs)
  - ceph-deploy puts journal and data on the same device by default
Storing arbitrary data in objects
Storing data in a RADOS pool using the rados command-line utility:
# dd if=/dev/zero of=test bs=10M count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied
# rados -p data put my-object test
# rados -p data stat my-object
data/my-object mtime 1348960511, size 10485760
Note 1: RADOS does not implement striping.
Ceph Block Devices
Working with the Ceph Block Device:
- RBD provides a block interface on top of RADOS
- RBD images stripe their data into multiple RADOS objects
- Image data is thus distributed across the cluster
- To the client, the image looks like a single block device
- Image distribution enhances scalability
As you can see in this slide, data coming from a client to an RBD image will be split among the various OSD processes and underlying drives, as explained on the previous slide.
Ceph Block Devices
Creating an RBD image
# rbd create --size 1024 foo
# rbd ls
foo
# rbd info foo
rbd image foo:
size 1024 MB in 256 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.0
parent: (pool -1)
Ceph Block Devices
Image Order:
- The image "order" in RBDs is the binary shift value for the object size
- Order must be between 12 and 25:
  - 12 = 4K
  - 13 = 8K
  - ...
  - 22 = 4 MB (1 << 22 = 4M)
- The default is 22 = 4 MB
Note 1: The << C operator is a left bit-shifting operator; << shifts the left operand's bits by the right operand's value.
A binary value, for example: 1 = 0001. If we do 1 << 2, the resulting value is 4 = 0100.
The opposite operator is >>, the right bit-shifting operator.
The advantage of these operators is that they are executed in a single CPU cycle.
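For example (a sketch; the image name is illustrative), the order can be set at creation time and checked with rbd info:

# Create an image whose objects are 8 MB (order 23) instead of the default 4 MB (order 22)
rbd create --size 1024 --order 23 bar
rbd info bar        # reports "order 23 (8192 KB objects)"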
Ceph Block Devices
- Listing RBD images
  - rbd ls --pool <pool>
- Info on a RBD image
  - rbd --pool <pool> --image <rbd> info
- Resizing a RBD image
  - rbd --pool <pool> --image <rbd> --size <MB> resize
- Deleting a RBD image (3)
  - rbd --pool <pool> --image <rbd> rm
- Mapping a RBD image on a client
  - rbd --pool <pool> map <rbd>
- Unmapping a RBD image from a client
  - rbd --pool <pool> unmap /dev/rbd<x>
Note 1: The pool defaults to rbd if omitted.
Note 2: --pool can be replaced with -p.
Note 3: Deleting the RBD image will fail if snapshots still exist for this particular image. Hence, in this case, the snap purge command will have to be issued first.
Ceph Block Devices
- Create an RBD image
  - rbd create --size <MB> --pool <pool> <rbd>
- Create a new RBD snapshot
  - rbd snap create <pool>/<rbd>@<snap>
  - rbd --pool <pool> snap create --snap <snap> <rbd>
- Display all snapshots for an image
  - rbd snap ls <pool>/<rbd>
  - rbd --pool <pool> snap ls <rbd>
- Delete a specific snapshot
  - rbd snap rm <pool>/<rbd>@<snap>
  - rbd --pool <pool> snap rm --snap <snap> <rbd>
- Delete all snapshots for an RBD
  - rbd snap purge <pool>/<rbd>
  - rbd --pool <pool> snap purge <rbd>
- Rollback a snapshot
  - rbd snap rollback <pool>/<rbd>@<snap>
  - rbd --pool <pool> snap rollback --snap <snap> <rbd>
Note 1: The pool defaults to rbd if omitted.
Note 2: The --pool option can be replaced with the -p option.
Ceph Block Devices
RBD requirements:
- Linux kernel version 2.6.37 or later
- Compiled in as a kernel module
  - modinfo rbd for details
  - modprobe rbd for loading the module
Ceph Block Devices
Step by step:
- Mapping an RBD requires that the kernel module is loaded
  - modprobe rbd (output should be silent)
- Map the desired RBD to the client
  - rbd map <rbd name> (output should be silent)
  - If cephx authentication is enabled, add the --id <name> and --keyring <filename> options
- Checking the RBD device mapping
  - rbd showmapped
- Mount the RBD device
  # modprobe rbd
  # mkdir /mnt/mountpoint
  # mkfs.ext4 /dev/rbd<x>
  # mount /dev/rbd<x> /mnt/mountpoint
Note 1: Verifying the RBD image is mapped to the client:
# modprobe rbd
# rbd map foo
# rbd showmapped
Ceph Block Devices
- Mapping a snapshot of an image
  - rbd map [<pool>/]<rbd>@<snap>
  - rbd map --pool <pool> --snap <snap> <image>
- RBD snapshots are read-only
  - So is the block device!
- Step by step (1)
  # modprobe rbd
  # rbd map foo@s1
  # rbd showmapped
  # blockdev --getro /dev/rbd<x>
Note 1: How to use an RBD snapshot on a client:
# modprobe rbd
# rbd map foo@s1
# rbd showmapped
# blockdev --getro /dev/rbd1
Ceph Block Devices
- RBD module loaded
  - Maintains a directory named rbd
  - The directory is located in the /sys/bus directory
  - rbd map & unmap commands add and remove files in this directory
- Subdirectory devices/{n}
  - For every mapped device
  - Holding some information about the mapped image
- Try this out
  - Map an RBD image (/dev/rbd{n} will be mapped)
  - Do echo {n} > /sys/bus/rbd/remove
Ceph Block Devices
- Un-mapping an RBD image
  - rbd unmap /dev/rbd{n}
- There is no such thing as an exclusive mapping
  - We can actually map an image that is already mapped
  - But Best Practice says you should un-map before mapping
Remember the --pool option can be replaced with the -p option for quicker typing.
# rbd unmap /dev/rbd0
# rbd showmapped
Ceph Block Devices
- QEMU/KVM virtual block devices
  - They can be backed by RBD images
  - Recent QEMU releases are built with --enable-rbd (1)
  - Can also create RBD images directly from qemu-img
- Available options
  - qemu-img create -f rbd rbd:<pool>/<image_name> <size>
  - -f = file format
  - rbd: is the prefix, like file:
  - pool is the pool to use (rbd for example)
  - size as expressed with qemu-img (100M, 10G, ...)
Note 1: e.g. qemu-utils packaged with Ubuntu 12.04
# qemu-img create -f rbd rbd:rbd/qemu 1G
Ceph Block Devices
- rbd protocol
  - libvirt 0.8.7 introduced network drives
  - The <host> elements specifying the monitors can be omitted if:
    - A valid /etc/ceph/ceph.conf file exists on the libvirt host
<disk type='network' device='disk'>
  <source protocol='rbd' name='rbd/foo'>
    <host name='daisy' port='6789'/>
    <host name='eric' port='6789'/>
    <host name='frank' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
Ceph Block Devices
- Specific parameters
  - Appending rbd_cache=1 to the source name enables RBD caching (since Ceph 0.46)
  - Appending rbd_cache_max_dirty=0 enables RBD caching in write-through mode
Note 1: rbd_cache=1 example:
<disk type='network' device='disk'>
  <source protocol='rbd' name='rbd/foo:rbd_cache=1'>
    <host name='daisy' port='6789'/>
    <host name='eric' port='6789'/>
    <host name='frank' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
Ceph Block Devices
- Snapshots and virtualization containers
  - libvirt 0.9.12 is snapshot aware
  - The virsh snapshot-create command:
    - Will freeze the qemu-kvm processes
    - Take an RBD snapshot
    - Will then unfreeze the qemu-kvm processes
# virsh snapshot-create alpha
Domain snapshot 1283504027 created
Ceph Block Devices
- Deleting an RBD image
  - The image data will be destroyed first
  - This image is no longer usable
  - This image cannot be recovered with a snapshot (because you had to purge them, remember :))
  - Reads from it produce all zeroes
- If the image is still in use
  - The command will fail with an EBUSY
- If a client does not respond but did not properly close the image (such as in the case of a client crash)
  - There is a 30-second grace period after which the device can be removed
Ceph File System
Two CephFS clients are available:
- CephFS: in-kernel
- FUSE
Which of the two is available depends on the platform we're running on.
Ceph File System
CephFS Kernel Module:
- Currently considered experimental
- Preferred way to interact with CephFS on Linux
- Available since 2.6.32
- First component of the Ceph stack that was merged upstream
- Development continues as part of the normal kernel merge windows and release cycles
- Due to API compatibility, the Ceph client layer is being developed entirely independently from the server-side, userspace components
- Compiled in as a module (modinfo ceph for details)
- May be renamed or aliased to cephfs in the future
Ceph File System
- FUSE
  - ceph-fuse is an implementation of the CephFS client layer in FUSE (Filesystem in Userspace)
  - Preferred on pre-2.6.32 kernels
  - Future versions might also be useful for working with Ceph on non-Linux platforms with FUSE support
    - *BSD, OpenSolaris, Mac OS X are currently unsupported
- Note
  - Do not run ceph-fuse clients on 32-bit kernels
  - CephFS inode numbers are 64 bits wide
Ceph File System
- Deep mount
  - Currently, a Ceph Cluster can host a single file system
  - To work around this limitation, you can perform a DEEP mount
  - You will adjust your file system ACLs starting at the root
  # mount -t ceph daisy:/subdir /mnt/foo
- Note
  - You can specify the MON port number in the mount command
  - The default port is 6789
  # mount -t ceph daisy:9999:/subdir /mnt
Ceph File System
Mount options:
- name=<name>
  - With cephx enabled, we need to specify the cephx username
  - Maps to the client.<name> section in ceph.conf
  - Default is guest
- secretfile=/path/to/file
  - Path to a keyring file containing our shared secret
  - This avoids showing the secret in the mount command or in the configuration files
133 / 148
Ceph File System
Mount options:
- rsize=<bytes>
  - Read-ahead size in bytes
  - Must be a multiple of 1024
  - Default is 512K
- wsize=<bytes>
  - Maximum write size
  - Should be the value of the stripe layout
  - Default is none (the value used is the smaller of wsize and the stripe unit)
Ceph File System
Retrieving a file location:
- cephfs <path> show_location
Retrieving a file layout:
- cephfs <path> show_layout
Note 1: The show_location output in detail:
- file_offset: the offset (in bytes) from the start of the file
- object_offset: offset (bytes) from the start of the RADOS object
- object_no: the index among the RADOS objects mapped to the file. For offset 0, this will always be 0.
- object_size: the size of the RADOS object we're looking at, as per the defined layout.
- object_name: the RADOS object ID we're looking at (we can look up its more detailed OSD mapping with osdmaptool --test-map-object <object_name>)
- block_offset: the offset from the start of the stripe
- block_size: the stripe size, as per the defined layout.
Note 2: The show_layout output in detail:
- layout.data_pool: 0
- layout.object_size: 4194304
- layout.stripe_unit: 4194304
- layout.stripe_count: 1
Ceph File System
getfattr -n ceph.layout -d <file>
- If a file
  - Shows the existing, immutable layout for the file
  - It cannot be altered after we've written any data to the file
- If a directory
  - Shows the default layout that will be applied to new files
  - Changes to the layout will only affect files created after the layout change
Ceph File System
Changing the layout:
- cephfs <path> set_layout <options>
- Options available are:
  - --object_size=value in bytes
  - --stripe_count=value as a decimal integer
  - --stripe_unit=value in bytes
Note 1: Syntax and parameters:
# cephfs /mnt set_layout --object_size 4194304 --stripe_count 2 --stripe_unit $((4194304/2))
--object_size or -s sets the size of individual objects
--stripe_count or -c sets the number of stripes to distribute objects over
--stripe_unit or -u sets the size of a stripe
Ceph File System
- Mount with FUSE
  - The FUSE client reads /etc/ceph/ceph.conf
  - Only a mount point to specify
  # ceph-fuse /mnt
- Un-mount with FUSE
  # ceph-fuse -u /mnt
Ceph File System
- Snapshots with CephFS
  - In the directory you want to snapshot
  - Create a directory in the .snap directory
  # mkdir .snap/<name>
- Naming
  - .snap obeys the standard .file rule
    - Snapshots will not show up in ls or find
    - They don't get deleted accidentally with rm -rf
  - If a different name is required
    - Mount the file system with -o snapdir=<name>
Ceph File System
- Restoring from a snapshot
  - Just copy from the .snap directory tree to the normal tree:
    cp -a .snap/<name>/<file> .
  - A full restore is possible:
    rm ./* -rf
    cp -a .snap/<name>/<file> .
Ceph File System
- Discard a snapshot
  - Remove the corresponding directory in .snap:
    rmdir .snap/<name>
  - Never mind that it's not empty; the rmdir will just succeed.
Summary
- Deploying a cluster
- Configuration file format
- Working with Ceph clients
  - rados
  - rbd
- Mounting a CephFS File System
End Module 7
CEPH-101 : Creating a cluster
Module 8 - Thanks For Attending
Thanks For Attending
Module Objectives
At the end of this lesson, you will be able to:
- Tell us how you feel
  - About the slide deck format
  - About the slide deck content
  - About the instructor
  - About the lab format
  - About the lab content
Please Tell Us
We hope you enjoyed it!
- Tell us about your feedback
- Tell us about what we could do better
  - http://www.inktank.com/trainingfeedback
Q&A
Summary
See you again soon
End Module 8
CEPH-101 : Thanks