
US HEADQUARTERS

Sunnyvale, CA
525 Almanor Ave, 4th Floor
Sunnyvale, CA 94085
+1-650-963-9828 Phone
+1-650-968-2997 Fax

Ceph Best Practices Manual


Version 1.1
19-May-2016

Authors:
Christian Huebner (Storage Architect)
Pawel Stefanski (Senior Deployment Engineer)
Kostiantyn Danilov (Principal Software Engineer)
Igor Fedotov (Senior Software Engineer)

© 2005–2016 All Rights Reserved www.mirantis.com


Table of contents
1 Introduction
2 Deployment considerations
2.1 Ceph and OpenStack integration
2.2 Fuel Ceph deployment
2.3 Ceph default configuration
2.4 Configure the Imaging and Block Storage services for Ceph
2.5 Adding nodes to the cluster
2.6 Removing nodes from the cluster
2.7 Time-consuming operations on cluster
2.8 Cache configuration
2.8.1 Design
2.8.2 Implementation
2.9 Cache tiering How-To
2.9.1 Create buckets
2.9.2 CRUSH map modifications
2.9.3 Create new caching pools
2.9.4 Set up caching
2.9.5 Turn cache down
3 Operations
3.1 Procedures
3.1.1 Remove an OSD
3.1.2 Add an OSD
3.1.3 Remove Ceph monitor from healthy cluster
3.1.4 Decreasing recovery and backfilling performance impact
3.1.5 Remove Ceph monitor(s) from downed cluster
3.1.6 Add Ceph monitor to cluster
3.2 Failure Scenarios
3.2.1 Failed OSD device
3.2.2 Lost journal device
3.2.3 Failed storage node
3.2.4 Failed Ceph monitor
3.2.5 Ceph monitor quorum not met
3.2.6 Client loses connection
3.2.7 Network issue in Ceph cluster environment
3.2.8 Time synchronization issue
3.2.9 Object Service failure
3.2.10 Complete cluster restart/power failure
3.2.11 Out of disk space on MON
3.2.12 Out of disk space on OSD
3.3 Tuning
3.3.1 Using ceph-deploy to distribute configuration over cluster



3.3.2 Changes
3.3.2.1 Changes in a config file
3.3.2.2 Online changes with monitor
3.3.2.3 Online changes with admin socket
3.3.3 Common tuning parameters
3.3.4 Performance measurement best practice
3.4 Ongoing operations
3.4.1 Background activities
3.4.2 Monitoring
3.4.3 Dumping memory heap
3.4.4 Maintenance
4 Troubleshooting
4.1 Overall Ceph cluster health
4.2 Logs
4.3 Failed MON
4.4 Failed OSD
4.4.1 OSD is flapping during peering state, after restart or recovery
4.4.2 How to determine that a drive is failing
4.5 Failed node
4.6 Issues with Placement Groups (PGs)
4.6.1 PG Status
4.6.2 PG stuck in some state for a long time
4.6.3 Default ruleset constraints
4.6.4 Inconsistent PG after scrub or deep-scrub
4.6.5 Incomplete PG
4.6.6 Unfound objects
4.6.7 Stale PG
4.6.8 Peering and down PGs
4.7 Resolving issues with CRUSH maps
4.8 Object service RadosGW troubleshooting
4.8.1 RadosGW logs
4.8.2 RadosGW daemon pools
4.8.3 Authorization issues
4.8.4 Remapping index of RadosGW buckets
4.8.5 Quick functional check for RadosGW service

1 Introduction
The purpose of this manual is to provide the best practices for Ceph configuration, deployment,
operation, and troubleshooting. It is aimed to help deployment and operations engineers, as well
as storage administrators, to recognize and fix a majority of common Ceph operational issues.



2 Deployment considerations

2.1 Ceph and OpenStack integration


When you deploy Ceph software-defined storage with Fuel and MOS, Cinder uses it to provide volumes and Glance uses it as the image store. Ceph RadosGW can be used by any other service as an object store.
You can use Ceph as a back end for Glance and Cinder. However, in this case you need to upload images to Glance in .raw format.
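For example, a minimal sketch of converting a QCOW2 image and uploading it in raw format with the Glance CLI of that era (image and file names here are illustrative):

qemu-img convert -f qcow2 -O raw cirros.qcow2 cirros.raw
glance image-create --name cirros-raw --disk-format raw --container-format bare --file cirros.raw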

Ceph integration in OpenStack

The diagram above shows all data flows that Ceph is involved with. Ceph acts as a back end for Cinder, replacing any legacy storage array; it replaces Swift for the Object Service; it serves as a back end for Glance; and it provides ephemeral storage for Nova directly.

Ceph is integrated into OpenStack Nova and Cinder through the RADOS Block Device (RBD). This overlay interface to RADOS uses block addressing and is supported by QEMU and libvirt as a native storage back end.



Ceph communication inside OpenStack

The main advantage of Cinder with Ceph over Cinder with LVM is that Ceph is distributed and network-available. Ceph also provides redundancy through data replication and allows the use of commodity hardware. A properly defined CRUSH map is rack and host aware, and cluster-wide high availability is based on the quorum rule.
Another feature that can be used is copy-on-write. It allows using an existing volume as a source for the unmodified data of another volume. Copy-on-write significantly accelerates provisioning and consumes less space for new VMs based on templates or snapshots.
With network-available and distributed storage, the Live Migration feature is available even for ephemeral disks. This can be used to evacuate failing hosts or to implement non-disruptive upgrades of infrastructure.
The level of integration into QEMU also makes it possible to use the Cinder QoS feature to limit uncontrollable VMs and prevent them from consuming all IOPS and storage throughput.



2.2 Fuel Ceph deployment

Fuel uses the native Ceph tool ceph-deploy to help with successful and clean node deployments.
Fuel relies on adding a role to the host. For Ceph, Fuel provides the Ceph/OSD role. Network and interface configurations should be changed to meet the requirements of an environment.

Note: By default, Ceph monitors reside on OpenStack controllers, and there is no specific role for the monitors.

When a deployment change action is triggered by the Fuel UI or CLI, the node is deployed and the ceph::osd manifests are applied; the disks are prepared and tagged with their UUIDs by Cobbler just before ceph-deploy is used to populate the configuration and finish creating a new OSD daemon instance. The last step done by ceph-deploy is to place the new OSD into the CRUSH map and make it available to the extended cluster.

2.3 Ceph default configuration

Fuel deploys a cluster with a best-practice entry-level configuration, but it should be tuned and adjusted according to specific expectations, workload, and hardware configuration. The parameters can change depending on the cluster size. The standard values deployed by Fuel are suitable for installations up to mid-range size.

The Ceph configuration file can be found in:


/etc/ceph/ceph.conf

This file is managed by the ceph-deploy tool. When a new node is deployed, ceph-deploy pulls this file from the controller. Any manual changes should also be populated on all nodes to maintain consistency of configuration files.

The configuration file is divided into the following sections: global, client, osd, mon, client.radosgw, and mds, one for each Ceph daemon type.

The main section is global. It contains general configuration options and default values:

[global]
fsid = a7825190-4cb0-4168-9d5e-7353e56c8b01 # cluster id
mon_initial_members = node-17 # initial mons to connect
mon_host = 192.168.0.4 # mon host list
auth_cluster_required = cephx # when cephx is used
auth_service_required = cephx # when cephx is used
auth_client_required = cephx # when cephx is used
filestore_xattr_use_omap = true # for ext4 and other fs
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 3 # default replica number for new pools
osd_pool_default_min_size = 1 # default minimum replica count required for I/O
osd_pool_default_pg_num = 256 # default pg number for new pools
public_network = 192.168.0.4/24 # network for client communication
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048 # default journal size (MB)
auth_supported = cephx # when cephx is used
osd_pool_default_pgp_num = 256 # default pgp number for new pools
osd_mkfs_type = xfs # default fs for ceph-deploy
cluster_network = 192.168.1.2/24 # inter-cluster data network
osd_recovery_max_active = 1 # recovery throttling
osd_max_backfills = 1 # recovery and resize throttling

Example of the client section:

[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

The RBD cache is used to accelerate IO operations on instances. The default is write-back cache mode, though caching can be limited by setting the additional option "rbd cache max dirty" to 0. This option should comply with the Nova and libvirt settings on the OpenStack side.
The rbd_cache_writethrough_until_flush option is used to start operations in write-through mode and then switch to write-back mode to accommodate older clients.
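As an illustration only (not the Fuel defaults), a [client] snippet that keeps the cache enabled but disallows dirty data, which effectively forces write-through behaviour:

[client]
rbd_cache = True
rbd_cache_writethrough_until_flush = True
rbd_cache_max_dirty = 0 # no dirty data is held in the cache, i.e. write-through behaviour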



Example of the RadosGW section:

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 1000000
rgw_keystone_url = 192.168.0.2:35357
rgw_keystone_admin_token = _keystone_admin_token
host = node-17
rgw_dns_name = *.domain.tld
rgw_print_continue = True
rgw_keystone_token_cache_size = 10
rgw_data = /var/lib/ceph/radosgw
user = www-data

This whole section describes the RadosGW configuration. All parameters prefixed with rgw_keystone_ are set to support Keystone user authentication. The Keystone admin user is used, and the configuration points to the Keystone VIP.

The option rgw_print_continue = True should only be used when the HTTP gateway understands and supports the 100-continue HTTP response. Fuel deploys the Inktank version of Apache2 and a FastCGI module that supports it.

The option rgw_dns_name = *.domain.tld should be modified to the proper domain value when using container names as domain suffixes. It should resolve as a CNAME or A record to the RGW host.

For example in BIND zone style:

* IN CNAME rgw.domain.tld.
rgw IN A 192.168.0.2

The configuration file can include many other options. For the available options, refer to the official Ceph documentation.



2.4 Configure the Imaging and Block Storage services for Ceph
You must manually modify the default OpenStack Imaging and Block Storage services to work
properly with Ceph.

To configure the Imaging and Block Storage for Ceph:


1. Log in to a controller node.
2. Open /etc/cinder/cinder.conf for editing.
3. Change glance_api_version to 2.

Example:

glance_api_version = 2

4. Save and exit.


5. Restart the Cinder API:

# service cinder-api restart

6. Open /etc/glance/glance-api.conf for editing.

7. Set the show_image_direct_url parameter to False.

Example:

show_image_direct_url = False

8. Restart the Glance API:

# service glance-api restart

2.5 Adding nodes to the cluster


The new nodes are discovered and appear in the Fuel interface. When the node is assigned the "Ceph OSD" role, disk allocation can be reviewed in the UI or CLI. There are two types of disks for Ceph:
● OSD Data, which holds the data
● OSD Journal, which only stores all written data in a journal
The partitions are marked with different UUIDs, so they can be recognized later.



JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
OSD_UUID = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'

If there is one journal device, it is evenly allocated among the OSDs on the host.

Next, the Puppet manifests are started and use ceph-deploy as the main tool to create the new OSD. The Puppet script automatically adjusts the CRUSH map with the new OSD(s). After the disks are cataloged by UUID and prepared (ceph mkfs), the new OSD is activated and a daemon is launched. The cluster map, which contains the CRUSH map, is automatically disseminated to the new cluster nodes from the monitors when the nodes attach to the cluster.

2.6 Removing nodes from the cluster


When a node is removed from a deployed cluster, the Ceph configuration stays untouched, so Ceph treats the node as if it had gone down. To completely remove the node from Ceph, manual intervention is needed. The procedure is covered in the Remove an OSD subsection in the Procedures section.

2.7 Time­consuming operations on cluster


There are several cluster-wide operations that are IO consuming, and the administrator should be aware of the impact these operations can cause before starting them. Most of these operations cannot be cancelled and offer few options to lessen the impact on cluster performance. Therefore, great care must be taken to only execute the operations when it is safe.
Cluster remapping is the operation that affects performance the most. It occurs when making any changes to cluster size or placement. When the cluster is changed, the CRUSH algorithm recalculates placement group positions, which causes data migrations inside the cluster.
Prior to the Hammer release of Ceph, there is very limited ability to throttle these operations, and any cluster change causes disruptive operations and a performance impact.

In the Fuel-deployed default configuration, there are two options that help to address this issue:
osd_recovery_max_active = 1
osd_max_backfills = 1

Both options prevent the OSD from executing more than one recovery/backfill PG operation at a time. By reducing the parallelism of operations, the overall internal load of the cluster is reduced to a reasonable level. These options adversely affect the speed of recovery and backfill operations because the operations are severely throttled.
The Ceph documentation also recommends tuning IO thread priorities. Ceph Hammer is the first release to provide these options.
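A hedged sketch of the Hammer-era disk thread priority options referred to above (the values are illustrative, and they only take effect with the CFQ disk scheduler):

[osd]
osd_disk_thread_ioprio_class = idle # lower the IO priority of the OSD disk thread (used for scrubbing)
osd_disk_thread_ioprio_priority = 7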

Another type of time-consuming operation is the peering phase after an OSD process is (re)started or brought up. The OSD process scans the whole disk just after it starts. When the OSD has to scan a lot of files and directories, it takes a long time to gather the full tree (especially on slow 7.2k HDD drives).

2.8 Cache configuration


Starting with Firefly, Ceph provides a feature that allows fronting a large number of OSDs on spinning drives with a cache layer, most commonly implemented on SSDs. The SSD cache can be deployed on nodes that provide regular OSDs or on dedicated cache nodes.

2.8.1 Design
Architecture
In a Ceph cache tier design, the underlying Ceph infrastructure remains unchanged. An
“Objecter” instance is created to manage the tiering infrastructure and communicate with both
the cache and the OSD back end.

Ceph cache tier architecture overview. Source: Ceph manual

Ceph provides two modes of caching: writeback and read-only. In read-only mode, Ceph maintains a set of the most requested RADOS objects in the cache for quick reads. It can be paired with SSD journaling for a moderate write performance increase. In writeback mode, Ceph writes to the cache tier first and provides a mechanism for the data in the cache tier to be written to disk in an orderly fashion.



In writeback mode, the data in the cache tier must be replicated or erasure-coded to ensure data is not lost if a cache tier component fails before the cluster can write the data to disk.

2.8.2 Implementation
Deployment
As a cache tier is not deployed out of the box by Mirantis OpenStack, Mirantis recommends deploying the cache nodes as regular Ceph nodes. Upon completion of the deployment, specific rules are created in the CRUSH map to place the cache pools on the SSD-backed OSDs and the regular pools on HDD-backed OSDs.

Once the cache pools are established, they are added to the regular storage pools with the ceph osd tier command set.

No changes are necessary on the client side, as long as the current CRUSH map is available to
all clients, which it must be for the Ceph cluster to function.

Cached Pools
As cache tiers are a per-pool property, a separate cache pool must be created on the SSD infrastructure for each pool that requires caching.
The pools that benefit most from caching are pools that have high performance SLAs or experience heavy read and especially write traffic.
The backup pool should only be cached if performance requirements cannot be met with hard disk based storage.

Copy-on-Write
When copy-on-write is utilized, the direct image URL must be exposed. Glance cache management should be disabled to avoid double caching. An extensive explanation and a step-by-step guide is available in the Block Devices and OpenStack section of the Ceph documentation. As caching is necessary for both the copy-on-write source and destination, both the images and compute volumes must be cached.

See also:
● Ceph Cache Tiering documentation:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

● Placing different pools on different OSDs:
http://docs.ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

● Block devices and OpenStack:
http://docs.ceph.com/docs/master/rbd/rbd-openstack/



2.9 Cache tiering How-To

2.9.1 Create buckets


Before starting, verify that all OSD devices which are supposed to be caching OSDs are marked "out".
1. The list of OSDs may be retrieved with the command:
# ceph osd tree

2. Determine which OSDs need to be moved to the caching bucket and mark them "out":

# ceph osd out <osd_number>

Watch the ceph -w output to find out when the replication of the placement groups has finished.

3. Create two buckets for regular OSDs and cache OSDs and move them to the root:
ceph osd crush add-bucket regular datacenter
ceph osd crush add-bucket cache datacenter
ceph osd crush move cache root=default
ceph osd crush move regular root=default

4. Move the hosts with regular OSDs to the bucket "regular", and all hosts with fast OSDs to the bucket "cache" (execute for every host):
# ceph osd crush move <hostname> datacenter=<regular/cache>

5. Verify that the structure of the OSDs is correct with the ceph osd tree command.

2.9.2 CRUSH map modifications


1. Get the current map and decompile it:
ceph osd getcrushmap -o crushmap.compiled
crushtool -d crushmap.compiled -o crushmap.decompiled

2. Change the first CRUSH rule and add one more for caching pools:

...
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take regular
step chooseleaf firstn 0 type host
step emit
}
rule cache_ruleset {
ruleset 1
type replicated
min_size 1
max_size 10
step take cache
step chooseleaf firstn 0 type host
step emit
}

3. Save the map, compile, and upload it:

crushtool -c crushmap.decompiled -o crushmap_modified.compiled
ceph osd setcrushmap -i crushmap_modified.compiled

Watch the ceph -w output to find out when the replication of the placement groups has finished.
4. Bring all inactive OSDs back in, so that they become active:
# ceph osd in <osd_number>

2.9.3 Create new caching pools

1. Create caching pools using the cache_ruleset CRUSH rule, creating 512 placement groups per pool (the number is calculated for 16 OSDs):
ceph osd pool create cache-images 512 cache_ruleset
ceph osd pool create cache-volumes 512 cache_ruleset
ceph osd pool create cache-compute 512 cache_ruleset

Watch the ceph -w output to find out when the replication of the placement groups has finished.
2. Update ACLs for existing Ceph users, so that they can use the new pools:
ceph auth caps client.compute mon 'allow r' osd 'allow class-read
object_prefix rbd_children, allow rwx pool=volumes, allow rx
pool=images, allow rwx pool=compute, allow rwx pool=cache-volumes,
allow rx pool=cache-images, allow rwx pool=cache-compute'

ceph auth caps client.volumes mon 'allow r' osd 'allow class-read
object_prefix rbd_children, allow rwx pool=volumes, allow rx
pool=images, allow rwx pool=cache-volumes, allow rx pool=cache-images'

ceph auth caps client.images mon 'allow r' osd 'allow class-read
object_prefix rbd_children, allow rwx pool=images, allow rwx
pool=cache-images'

2.9.4 Set up caching


Now, turn on caching. The caching pools have to be set up as overlays for the regular pools.
ceph osd tier add compute cache-compute
ceph osd tier cache-mode cache-compute writeback
ceph osd tier set-overlay compute cache-compute

ceph osd tier add volumes cache-volumes
ceph osd tier cache-mode cache-volumes writeback
ceph osd tier set-overlay volumes cache-volumes

ceph osd tier add images cache-images
ceph osd tier cache-mode cache-images writeback
ceph osd tier set-overlay images cache-images

Setting up the pools requires some cache-specific parameters to be set:
ceph osd pool set cache-compute hit_set_type bloom
ceph osd pool set cache-volumes hit_set_type bloom
ceph osd pool set cache-images hit_set_type bloom

ceph osd pool set cache-compute cache_target_dirty_ratio 0.4
ceph osd pool set cache-compute cache_target_dirty_high_ratio 0.6
ceph osd pool set cache-compute cache_target_full_ratio 0.8

ceph osd pool set cache-volumes cache_target_dirty_ratio 0.4
ceph osd pool set cache-volumes cache_target_dirty_high_ratio 0.6
ceph osd pool set cache-volumes cache_target_full_ratio 0.8

ceph osd pool set cache-images cache_target_dirty_ratio 0.4
ceph osd pool set cache-images cache_target_dirty_high_ratio 0.6
ceph osd pool set cache-images cache_target_full_ratio 0.8



At this point, caching should be ready to work.

2.9.5 Turn cache down


To turn off caching of a particular pool, execute the following set of commands:
ceph osd tier cache-mode <cache pool> forward
rados -p <cache pool> cache-flush-evict-all
ceph osd tier remove-overlay <regular pool>
ceph osd tier remove <storage pool> <cache pool>

3 Operations

3.1 Procedures

3.1.1 Remove an OSD


Do not let your cluster reach its full ratio when removing an OSD. Removing OSDs could
cause the cluster to reach or exceed its full ratio.
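Before removing an OSD, it is worth checking current utilization; for example, with the standard commands below (interpretation depends on your pools and replication factor):

ceph df # overall and per-pool usage
ceph health detail | grep -i full # any nearfull/full warnings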

1. Remove the old OSD from the cluster:

ceph osd out {osd-num}

2. Wait until the data migration completes:

ceph -w

You should see the placement group states change from active+clean to active with some degraded objects, and finally back to active+clean when migration completes.
3. Stop the OSD:

service ceph stop osd.{osd-num}

4. Remove the OSD from the CRUSH map:

ceph osd crush remove osd.{osd-num}

5. Delete the authentication key:

ceph auth del osd.{osd-num}

6. Remove the OSD from cluster:



ceph osd rm {osd-num}

Note: If an OSD is removed from the CRUSH map, a new OSD subsequently created will be assigned the same number if ceph osd create is called without parameters.
7. Remove the entry for the OSD from /etc/ceph/ceph.conf if present.
8. Optional. If a device is to be replaced, add the new OSD using the procedure described in the Add an OSD subsection in the Procedures section.
Note: Replication of the data to the new OSD will be performed here. If multiple OSDs are to be replaced, add new OSDs gradually to prevent excessive replication load.

3.1.2 Add an OSD

1. List disks in a node:

ceph-deploy disk list {node}

2. The ceph-deploy tool can be used with a single 'create' command, or with two steps as a safer option, or while preparing disks manually.
a. Create a new OSD using one command:

ceph-deploy osd create {node}:{devicename}[:{journaldevice}]

b. Use the two-step method:

ceph-deploy osd prepare {node}:{devicename}
ceph-deploy osd activate {node}:{devicename}
c. If you are adding an OSD with the journal on a separate partition:

ceph-deploy osd prepare {node}:{devicename}:{journal_dev_name}
ceph-deploy osd activate {node}:{devicename}

Note: Avoid simultaneous activation of multiple OSDs with default Ceph settings as it can severely impact cluster performance. Backfilling (osd_max_backfills) and recovery settings (osd_recovery_max_active) can be tuned to lessen the impact of adding multiple OSDs at once. Alternatively, multiple OSDs can be added at a lower weight and gradually increased (see the example after this procedure), though this approach prolongs the addition process.
3. You may want to replace a physical device on the Ceph OSD node (in case it is broken). In this case, the same journaling partition may be used, and the steps for the drive replacement are the following:
a. Shut down the ceph-osd daemon if it is still running:
stop ceph-osd <ID>



b. Figure out which device was used as a journal (it is a soft link to /var/lib/ceph/osd/ceph-<ID>/journal).
c. Remove the OSD from the CRUSH map (see above).
d. Shut down the node and replace the physical drive.
e. Bring the node up and add the new ceph-osd instance with the new drive following the steps above.

4. Clean the previously used drive and prepare it for the new OSD:

ceph-deploy disk zap {node-name}:{devicename}
ceph-deploy --overwrite-conf osd prepare {node-name}:{devicename}

Important: This will DELETE all data on the {devicename} disk.

5. Verify the new device is placed inside the CRUSH tree and recovery started:

ceph osd tree


ceph -s
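If you choose the gradual-weight approach mentioned in the note above, a minimal sketch could look like this (the weight values are illustrative; the final weight typically matches the device size in TB):

ceph osd crush reweight osd.{osd-num} 0.2 # add the OSD with a low CRUSH weight
# wait for the rebalance triggered by this step to finish (ceph -w), then raise the weight
ceph osd crush reweight osd.{osd-num} 0.5
ceph osd crush reweight osd.{osd-num} 1.0 # final weight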

3.1.3 Remove Ceph monitor from healthy cluster

Remove the monitor from the cluster:


ceph mon remove {mon-id}

Important: This operation is extremely dangerous to a working cluster; use it with care.

3.1.4 Decreasing recovery and backfilling performance impact


The main settings which affect recovery are:
● osd max backfills - integer, default 10: the maximum number of backfills allowed to or from a single OSD.
● osd recovery max active - integer, default 15: the number of active recovery requests per OSD at one time.
● osd recovery threads - integer, default 1: the number of threads for recovering data.

Increasing these settings will increase recovery/backfill performance, but decrease client performance, and vice versa.
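These values can also be changed at runtime on all OSDs without a restart; a minimal sketch using argument injection (the values here mirror the Fuel defaults mentioned earlier):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'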



3.1.5 Remove Ceph monitor(s) from downed cluster
1. Find the most recent monmap:

ls $mon_data/monmap

2. Copy the monmap to a temporary location and remove all nodes that are damaged or failed:
cp $mon_data/monmap/{node_number} ~/newmonmap
monmaptool ~/newmonmap --rm {node-name}
...

3. Verify that ceph-mon is not running on the affected node(s):

service ceph stop mon

4. Inject the modified map on all surviving nodes:

ceph-mon -i {mon-id} --inject-monmap ~/newmonmap

5. Start the surviving monitors:

service ceph start mon

6. Remove the old monitors from the ceph.conf.

3.1.6 Add Ceph monitor to cluster


1. Procure the monmap:

ceph mon getmap -o ~/monmap

2. Export the monitor key:

ceph auth export mon. -o ~/monkey

3. Add a section for the new monitor to /etc/ceph/ceph.conf.

4. Add mon_addr for the new monitor to the new section with its IP and port.
5. Create the Ceph monitor:

ceph-mon -i {mon-name} --mkfs --monmap ~/monmap --keyring ~/monkey

6. Start the monitor; it will automatically join the cluster.

3.2 Failure Scenarios



3.2.1 Failed OSD device
1. Determine the failed OSD:

ceph osd tree | grep -i down

Example output:
# id weight type name up/down reweight
0 0.06 osd.0 down 1
2. Set the "noout" flag (ceph osd set noout).
3. Remove the failed OSD (0/osd.0 in the example) from the cluster. See the Remove an OSD subsection in the Procedures section.
The cluster will start to replicate data to recover the potentially lost copies.
4. Examine the node holding the disk and eventually replace the drive.
5. If the drive is lost, create a new OSD and add it to the cluster. See the Add an OSD subsection in the Procedures section.

What will happen to my data if one of the OSDs fails?

If an OSD fails, Ceph starts a countdown timer (mon_osd_down_out_interval), and when it expires (default is 5 minutes), recovery commences by replicating data to achieve the assumed replication ratio even with a failed OSD. As data is replicated across multiple OSDs, data loss only occurs if all OSDs containing a replica of the data are lost at the same time.

3.2.2 Lost journal device


As the journal device is also a physical drive, it can fail. All OSDs which use the failed journal device will also fail and must be recovered. After finding the root cause of the failure, the whole OSD should be recreated to preserve data safety. For this, follow the steps described in the Remove an OSD and Add an OSD subsections in the Procedures section.

3.2.3 Failed storage node


1. Determine which node has failed using the cluster health and ceph osd tree commands.
2. Remove all OSDs of that node from the cluster if they are still present. For details, see the Remove an OSD subsection in the Procedures section.
3. Deploy a replacement node.
4. Add Ceph OSDs from the node as appropriate. See the Add an OSD subsection in the Procedures section.

3.2.4 Failed Ceph monitor


1. Remove the monitor from the healthy cluster. For details, see the Remove Ceph monitor from healthy cluster subsection in the Procedures section.
2. Add a new monitor to the cluster. For details, see the Add Ceph monitor to cluster subsection in the Procedures section.

3.2.5 Ceph monitor quorum not met


1. Remove the monitor(s) from the downed cluster. For details, see the Remove Ceph monitor(s) from downed cluster subsection in the Procedures section.
2. Add a sufficient number of new monitors to the cluster to form quorum. For details, see the Add Ceph monitor to cluster subsection in the Procedures section.

3.2.6 Client loses connection


1. Repair client network connectivity.
2. The client must be able to communicate with all Ceph monitors and OSDs.
3. Verify that Ceph cluster access is restored.

3.2.7 Network issue in Ceph cluster environment


1. Repair inter-cluster connectivity.
2. Ceph monitors and OSD nodes should have working intra-cluster communication. Two networks can be used here.
3. Verify that Ceph cluster access is restored.

3.2.8 Time synchronization issue


There are strong requirements to keep all cluster nodes in time sync. Most important is to have every MON node synchronized with a time source. The Paxos algorithm relies on timestamped maps that are created and recorded based on the cluster state. If there is any time difference, Paxos can mark a healthy process as down, become unreliable, partition the cluster, or even die. An internal MON check verifies that the maximal time difference between running nodes is not higher than 0.05 sec (default value).

It is therefore strongly recommended to use NTP, ideally a common NTP source, to keep all cluster nodes and clients in time sync. However, if the cluster is not configured with NTP, monitoring should be configured to react to the following warning:

HEALTH_WARN clock skew detected on mon.XXXX

The solution is to check clock accuracy on all MON nodes and to correct any errors.

After fixing this issue, verify that the warning disappears. It can take up to 300 seconds for Ceph
to recheck clock skew.
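A quick, hedged check of time synchronization, assuming ntpd is used on the MON nodes:

ntpq -p # run on every MON node; offsets should be in the low-millisecond range
ceph health detail | grep -i skew # shows which monitors Ceph still considers skewed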

3.2.9 Object Service failure



If the Object Service fails, the troubleshooting procedure should be engaged. First, perform the following steps:

1. Check the Apache service availability on the controllers:

curl -i http://localhost:6780/
This should return a 200 result code with the standard AWS response about an empty bucket list.
2. If there is a 500 response, restart RadosGW as the simplest possible solution:
/etc/init.d/radosgw restart

3.2.10 Complete cluster restart/power failure


Ceph monitors must be started first, then OSDs:
1. Make sure that the network connectivity is restored (wait for the switch to boot).
2. Start MON nodes/daemons.
3. Wait for the quorum to be established.
4. Start the OSD daemons.

The peering state can take significant time on big clusters with OSD daemons full of data placed on HDD drives. This can even cause timeouts or an unresponsiveness timeout that ends with the daemon suicide procedure. During this stage, OSD flapping can be observed when a heavily busy OSD daemon is losing and recovering connectivity to the MONs and other OSDs. When an OSD dies during this phase, it should be restarted. The second or subsequent peering attempt should be significantly faster because of the file descriptor and directory tree caches.

If starting the cluster causes a massive amount of timeouts and daemon suicides, a couple of options can be changed to make the OSD daemons wait a little bit longer.

Increasing these options will help to establish stable OSDs and will decrease the number of daemon suicides caused by timeouts:

osd heartbeat grace = 240 # default is 20


mon osd report timeout = 1800 # default is 900
osd heartbeat interval = 12 # default is 6

During a complete cluster startup, client operations are enabled once MON quorum is established and a minimal number of replicas is available (active PGs).

3.2.11 Out of disk space on MON


It is crucial to have enough free space in the MON work directories. LevelDB, used as the core MON internal state database, is very fragile in out-of-disk-space situations. At the level of 5% of disk space available, the MON will exit and will not start.



The MON database grows during normal operations; this is the standard behaviour of the LevelDB store. To reclaim this space, perform one of the following:

● Add the following option to the ceph.conf file and restart the mon:
mon compact on start = true

● Or run the command:


ceph tell mon.XXX compact

Note: During the compacting procedure, the LevelDB database will grow even more at first. Then compacting will replace many files with one containing the actual data and thus reclaim the free space.

3.2.12 Out of disk space on OSD


Ceph prevents writing to an almost full OSD device. By default, it stops at 95% of used disk space. A warning message appears at the level of 85%.

1. Analyse the situation; check "ceph health detail" to find the almost full OSD.
2. Add more space to the logical volume, or add new OSD devices.
3. Wait for the rebalance to complete and the OSDs to be refilled.
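If one OSD is much fuller than the rest, a temporary reweight can push some placement groups off it while capacity is added; a hedged sketch (the value 0.8 is illustrative):

ceph osd reweight {osd-num} 0.8 # temporary override weight between 0 and 1; restore to 1.0 afterwards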

3.3 Tuning
Ceph clusters can be parametrized after deployment to better fit the requirements of the workload. Some configuration options can affect data redundancy and have significant implications for the stability and safety of data. Tuning should be performed on a test environment prior to issuing any commands or configuration changes in production.
All changes should be documented and reviewed by experienced staff. Before and after the change, the full set of tests should be executed.

3.3.1 Using ceph­deploy to distribute configuration over cluster

The ceph-deploy tool can also be used to distribute configuration changes over the cluster. It is recommended to implement any fixes on one node and then distribute the new configuration file to the rest of the nodes.
The following can be executed on the node with the changed configuration:

ceph-deploy config push nodename1 nodename2 nodename3



Another useful tip is to use this tool with the name expansion format:

ceph-deploy config push nodename{1,2,3}

which is equivalent to the example above.

3.3.2 Changes

3.3.2.1 Changes in a config file


All changes made in the configuration file are read and applied during daemon startup. Thus, after making any changes in the ceph.conf configuration file, the daemons need to be restarted for the changes to take effect.

3.3.2.2 Online changes with monitor


Changes can be injected online through the monitor-to-daemon communication channel:

ceph tell osd.0 injectargs --debug-osd 20

3.3.2.3 Online changes with admin socket


Changes can also be applied by using the admin socket to communicate with a daemon while the MON is unreachable, or when it is more convenient:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set debug_osd 20/20

Important: Any online changes will not survive a daemon restart. To make the changes permanent, a configuration file change is mandatory.

3.3.3 Common tuning parameters


For a production cluster, any changes in configuration should be tested in a test environment first. However, there are situations when a production cluster responds differently to changes and regresses. Extreme caution is required when performing any tuning.

The most commonly changed parameters:

public_network = 192.168.0.4/24 # points to the client network
cluster_network = 192.168.1.2/24 # points to inter-OSD communication
osd_recovery_max_active = 1
osd_max_backfills = 1



3.3.4 Performance measurement best practice
For best measurement results, follow these rules while testing:
1. Change one option at a time.
2. Understand what is changing.
3. Choose the right performance test for the changed option.
4. Retest at least ten times.
5. Run tests for hours, not seconds.
6. Trace for any errors.
7. Look at the results critically.
8. Always try to estimate the expected results and look at the standard deviation to eliminate spikes and false tests.
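For baseline measurements, the bundled rados bench tool can be used against a dedicated test pool (the pool name and parameters below are illustrative):

rados bench -p test-pool 600 write -t 16 --no-cleanup # 10-minute write test, 16 concurrent ops
rados bench -p test-pool 600 seq -t 16 # sequential read test over the objects written above
rados -p test-pool cleanup # remove the benchmark objects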

3.4 Ongoing operations

3.4.1 Background activities


The Ceph cluster constantly monitors itself with scrub and deep scrub. Scrub verifies attributes and object sizes. It is very fast and not very resource-hungry, which makes it ideal for daily checks. Deep scrub checks each RADOS object's checksum using the CRC32 algorithm, and every difference between replicas is reported as inconsistent.
Scrub and deep scrub operations are very IO-consuming and can affect cluster performance. However, these operations should be enabled to ensure data integrity and availability. Ceph tries to execute scrub and deep scrub when the cluster is not overloaded, but once started, a scrub runs until it finishes checking all the PGs.

To disable scrub and deep scrub, run the following commands:

ceph osd set noscrub
ceph osd set nodeep-scrub

To restore the standard options, run:

ceph osd unset noscrub
ceph osd unset nodeep-scrub

To fine-tune the scrub processes, use the following configuration options (default values provided):

osd_scrub_begin_hour = 0 # begin scrubbing at this hour
osd_scrub_end_hour = 24 # start the last scrub before this hour
osd_scrub_load_threshold = 0.05 # scrub only below this load
osd_scrub_min_interval = 86400 # not more often than 1 day
osd_scrub_max_interval = 604800 # not less often than 1 week
osd_deep_scrub_interval = 604800 # scrub deeply once a week

3.4.2 Monitoring
Monitoring of a Ceph cluster should include utilization, saturation, and errors. Ceph itself has a number of tools to check cluster health, ranging from simple CLI tools to API methods for gathering health status. Several methods for observing cluster performance and health can be deployed. It is most common to include health checks in dedicated monitoring software like Zabbix or Nagios. Standard Ceph messages have two severities, WARN and ERR; both of them should be noted by cluster operators and both should be treated as significant signals to check the cluster.
Basic metrics of the Ceph cluster and of the OS on the Ceph nodes should also be monitored to see actual cluster utilization and predict possible performance issues.
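A minimal sketch of a health check that can be wired into Zabbix or Nagios, assuming the ceph CLI and an admin keyring are available on the monitoring host:

#!/bin/bash
# Map ceph health output to standard monitoring exit codes (0=OK, 1=WARNING, 2=CRITICAL)
STATUS=$(ceph health 2>/dev/null)
case "$STATUS" in
  HEALTH_OK*)   echo "OK: $STATUS";       exit 0 ;;
  HEALTH_WARN*) echo "WARNING: $STATUS";  exit 1 ;;
  *)            echo "CRITICAL: $STATUS"; exit 2 ;;
esac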

In a Fuel-deployed Ceph cluster, Zabbix is configured as the main monitoring software. It gathers and stores all information in an internal database. All parameters are configured as items. When an item arrives, it is also compared with the configured triggers. When the value of an item is beyond the trigger rule, some action can be executed. Zabbix provides a dashboard with graphs and plotted trends for easy day-to-day monitoring.
The amount of gathered data depends on the Zabbix agent and options configuration. The administrator can add or modify items to better fit monitoring requirements.

Special attention should be given to disk space management, since stability and overall cluster health depend on many database and metadata operations.

The Ceph cluster consists of a number of separate, dedicated daemon processes. Monitoring software should check for the correct number of MON, OSD, and RadosGW processes. It is also useful to monitor the memory allocated and used by the processes to quickly find any memory leaks, especially from an OSD process.

Process table

Name      | Purpose              | Process name | Count on host        | Open Ports                      | System memory
Ceph MON  | cluster coordination | ceph-mon     | 1 (at least 3 total) | 6789                            | ~1 GB
Ceph OSD  | data daemon          | ceph-osd     | as many as devices   | 6800, 6801, 6802, 6803          | ~1-3 GB per OSD
RadosGW   | HTTP REST interface  | radosgw      | as necessary         | FastCGI socket on HTTP gateways | <1 GB

3.4.3 Dumping memory heap

Every Ceph process can be instructed to dump its memory heap. This can be very helpful to debug memory leaks and process-related problems.

Memory heap dumping procedure:

1. Start the profiler:
ceph tell osd.0 heap start_profiler
2. Dump the heap:



ceph tell osd.0 heap dump

Example output:
osd.0 dumping heap profile now.
------------------------------------------------
MALLOC: 41200288 ( 39.3 MiB) Bytes in use by application
MALLOC: + 581632 ( 0.6 MiB) Bytes in page heap freelist
MALLOC: + 5124544 ( 4.9 MiB) Bytes in central cache freelist
MALLOC: + 2752512 ( 2.6 MiB) Bytes in transfer cache freelist
MALLOC: + 6964128 ( 6.6 MiB) Bytes in thread cache freelists
MALLOC: + 1323160 ( 1.3 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 57946264 ( 55.3 MiB) Actual memory used (physical + swap)
MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 57946264 ( 55.3 MiB) Virtual address space used
MALLOC:
MALLOC: 2544 Spans in use
MALLOC: 86 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the

3. Stop the profiler:
ceph tell osd.0 heap stop_profiler

3.4.4 Maintenance
The Ceph cluster is mostly self-managed and self-healing on the software side, but the hardware underneath should be maintained, reviewed, and repaired as needed. Ceph reacts to any hardware failure and prevents data loss by making additional replicas when any OSD is lost.
The administrator should apply available upgrades, especially revision releases, after careful consideration and planning. Security and critical patches are backported to long-term support releases (such as Firefly) shortly after they are implemented in the current release.

For a Fuel-deployed environment, refer to the Mirantis OpenStack Operations Guide.

4 Troubleshooting

4.1 Overall Ceph cluster health


To observe overall Ceph cluster health, use simple bundled commands:

Simple status:



ceph -s

Detailed Ceph health with additional information and debug output:


ceph health detail

Report on config and runtime statuses and counters (very extensive):


ceph report

4.2 Logs
Logs for all Ceph core components are stored in the /var/log/ceph directory. Logs are rotated daily.
The default debug levels are often not sufficient. Ceph daemons allow changing the log level through live argument injection into the daemon process. While most of the log facilities run with a semi-high (5) logging level in memory, the log level for on-disk storage is set to a very low level, and only for some services.
The debug level can be changed online or in the /etc/ceph/ceph.conf file, which is read at daemon start time.
Debug levels: 0 - turned off, 1 - terse, 20 - verbose.

The simplest way to change the debug level online:

ceph tell osd.0 injectargs --debug-osd 5/5
Note: This method requires MON connectivity. If you have issues with that, use the next method (configuration change or socket connection).

Changing the debug level by connecting to the local socket (has to be run on the daemon's machine):
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set debug_osd 5/5

The most common subsystems to debug Ceph issues are: rados, crush, osd, filestore, ms, mon, and auth.

The Ceph logging subsystem is very extensive and resource consuming; it can generate a lot of data in a very short time. Be aware of free disk space when using verbose logging.
Entering the log and debug routines is also very time-consuming, so be aware that the best performance results can be achieved without any debug options and with log levels turned off. It is recommended that you keep a reasonably low level of debugging during normal operations and set it higher only for troubleshooting.

4.3 Failed MON



The MON instances are the most important to the cluster, so troubleshooting and recovery should begin with those instances.

Use the following command to display the current state of quorum, MONs, and PAXOS algorithm status:
ceph quorum_status --format json-pretty

If a client cannot connect to a MON, there can be problems with:

1. Connectivity and firewall rules. Verify that TCP port 6789 is allowed on the monitor hosts.
2. Disk space. There should be a safe free disk space margin for the LevelDB internal database operation on every MON node.
3. A MON process that is not working or is out of quorum. Check the quorum_status, mon_status, and ceph -s output to identify the failed MON and try to restart it or deploy a new one instead.
If the methods above fail, try to increase the debug level of debug_mon to 10/10 via injectargs or the admin socket, as described in the Logs section, to find the root cause of the failure.
If a daemon is failing on a LevelDB operation or another assertion, file a bug report for Ceph.

4.4 Failed OSD


It is important to continuously monitor cluster health as there can be many different root causes for OSD processes to die. Some of them may be caused by hardware failures, including hard to determine and unpredictable firmware and physical failures. It is also possible to hit a software bug that causes a failed assertion and an abnormal OSD exit.
The Ceph administrator has several ways to determine Ceph cluster health; the most common is to observe the admin command output first and then to debug failed devices and equipment in depth. The Ceph cluster can be monitored from any node involved in cluster operations, but good practice is to check it from the MON nodes, as they are the closest to the local MON daemons.

ceph -s
or
ceph health

In case of any concerns, warnings, or errors, the troubleshooting procedure should be engaged. The health of the cluster is crucial to data safety and operational reliability.
This is an example of the command output while one of the OSD daemons is down:
cluster f4ad6d65-6d37-4318-9e5c-a5f59d6e6ad7
 health HEALTH_WARN 767 pgs stale; 1 requests are blocked > 32 sec; 1/4 in osds are down
 monmap e1: 1 mons at {node-36=192.168.0.1:6789/0}, election epoch 1, quorum 0 node-36
 osdmap e74: 4 osds: 3 up, 4 in
 pgmap v1452: 3008 pgs, 14 pools, 12860 kB data, 51 objects
       8403 MB used, 245 GB / 253 GB avail
            767 stale+active+clean
           2241 active+clean
 client io 0 B/s rd, 0 B/s wr, 0 op/s

If a problem with an OSD is identified (by looking at HEALTH_WARN and at the number of OSDs that are down versus up), perform the procedure for replacing the failed OSD. See the Failed OSD device subsection in the Failure Scenarios section of this document.

Possible causes (most common) are:

● Hard disk failure. It can be determined from system messages or SMART activity. Some defective disks are very slow because of extensive TLER activity.
● Network connectivity issues. You can use ordinary network check tools like ping, tracepath, and iperf to debug this.
● Out of disk space on the filestore. When you are running out of space, Ceph triggers alarms with HEALTH_WARN at 85% full and HEALTH_ERR at 95% full. Then it stops writes to prevent filling the whole disk. Note that the filestore holds not just the data but also indexed metadata and per-file metadata. It is very important to keep enough free space for smooth operations.
● Running out of system resources or hitting system limits. There should be enough system memory to hold all OSD processes on the machine, and the system limits for open files and the maximal number of threads should be big enough.
● OSD process heartbeat limits causing processes to commit suicide. The default process and communication timeouts can be insufficient for IO-hungry operations, especially during recovery after a failure. This can also be observed as OSD flapping.

4.4.1 OSD is flapping during peering state, after restart or recovery


You can stabilize IO-hungry operations that cause timeouts by setting the "nodown", "noup", and "noout" flags for the cluster:
ceph osd set nodown
ceph osd set noup
ceph osd set noout
When the whole cluster is healthy and stable, restore the default behaviour by running:
ceph osd unset nodown
ceph osd unset noup
ceph osd unset noout



4.4.2 How to determine that a drive is failing
Logs should contain extensive information regarding the failing device. There should also be
some sign in SMART.

To check the logs, the administrator can execute:


dmesg | egrep sd[a-z]

Examine the suspicious device with smartctl to extract information and perform tests:

smartctl -a /dev/sdX

A drive that is going to fail can also be detected by watching response (seek) times and overall disk responsiveness. Any drive that shows sustained unusual values may be about to fail:
iostat -x /dev/sdX
It is also useful to watch the avgqu-sz values (they should be lower than the device queue depth) and the %util parameter. They should be more or less equal on all devices of the same type.

4.5 Failed node

The first examination should determine connectivity issues and network-related problems. If an SSH connection to the node is working and simple ping tests confirm that the network layers are OK, further examination should focus on:

1. Failed node hardware, which can cause:

○ Connectivity issues
○ An OSD to die on a long-lasting IO operation
○ An OSD to die on an IO error or EOT
○ Many unpredictable issues with the OS or OSD daemons
2. Failed node software (OS, Ceph) issues

4.6 Issues with Placement Groups (PGs)

4.6.1 PG Status
The optimum PG state is 100% active + clean. This means that all placement groups are accessible and the assumed number of replicas is available for all PGs. If Ceph also reports other states, it is a warning or an error status (besides scrub or deep-scrub operations).
PG status quick reference (for a complete one, refer to the official Ceph documentation):



State: Description

Active: Ceph will process requests to the placement group.

Clean: Ceph replicated all objects in the placement group the correct number of times.

Down: A replica with necessary data is down, so the placement group is offline.

Degraded: Ceph has not replicated some objects in the placement group the correct number of times yet.

Inconsistent: Ceph detects inconsistencies in one or more replicas of an object in the placement group (for example, objects are the wrong size, or objects are missing from one replica after recovery finished).

Peering: The placement group is undergoing the peering process.

Recovering: Ceph is migrating/synchronizing objects and their replicas.

Incomplete: Ceph detects that a placement group is missing information about writes that may have occurred, or does not have any healthy copies. If you see this state, try to start any failed OSDs that may contain the needed information or temporarily adjust min_size to allow recovery.

Stale: The placement group is in an unknown state; the monitors have not received an update for it since the placement group mapping changed.

4.6.2 PG stuck in some state for a long time


When a new pool is created and does not reach active+clean status after a reasonable time, there is most likely an issue with the configuration or the CRUSH map, or there are too few resources to achieve the configured replication level.
Debugging should start with an examination of the cluster state and PG statuses:

ceph pg dump # to find the PG number for any status
ceph pg {pg_id} query # to see verbose information about the PG in JSON

While analysing the query output, special attention should be paid to the "info", "peer_info", and "recovery_status" sections.



The monitor warns about PGs that are stuck in the same status for some time. They can be
listed with:
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

4.6.3 Default ruleset constraints


The Ceph data distribution algorithm works according to the rulesets stored in the compiled CRUSH map. The CRUSH maps are replicated and versioned to maintain cluster consistency.
If there is an issue with the syntax of a ruleset, it should be found during CRUSH map compilation, but there can be logical mistakes that pass analysis before compilation and cause the CRUSH algorithm to not distribute data as assumed, or prevent the cluster from reaching the active+clean state on all PGs.

The first thing to verify is whether the default replication ratio condition is met, by checking min_size and size in conjunction with the hardware configuration of the cluster. The default Ceph configuration is prepared for replication across hosts, not OSDs.
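A quick way to check those values against the cluster layout (the pool name is a placeholder):

ceph osd pool get {pool-name} size # configured replica count
ceph osd pool get {pool-name} min_size # minimum replicas required for IO
ceph osd tree # compare with the number of hosts available to the ruleset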

4.6.4 Inconsistent PG after scrub or deep­scrub


The scrub operation is used to check the availability and health of objects. PGs are scrubbed while the cluster is not running any IO-intensive operations such as recovery (a scrub that has already started will continue, though). If this task finds any object with broken or mismatched data (the checksum is checked), it marks this object as unusable, and manual intervention and recovery are needed.
Ceph prior to version 0.90 does not store object checksum information at write time. Checksums are calculated by the OSDs during scrubbing, and Ceph cannot arbitrarily decide which replica is the correct one. For a simple example with 3 replicas and one differing checksum, it is easy to guess which replica is wrong and should be corrected (recovered from another replica); but when there are 3 different checksums, or there is bit rot or a controller malfunction on two nodes, it is impossible to say arbitrarily which one is good. It is not an end-to-end data correction check.

Manual repair of a broken PG is necessary:

1. First find a broken PG with inconsistent objects:


ceph pg dump | grep inconsistent
or
ceph health detail



2. Then instruct Ceph to repair it (when the primary copy holds the good data), or repair manually by moving/deleting the wrong files on the OSD disk:
ceph pg repair {pgnum}

Important: The repair process is very tricky when the primary copy is broken. The current repair behavior with replicated PGs is to copy the primary's data to the other nodes. This makes the Ceph cluster self-consistent, but might cause problems for consumers if the primary had the wrong data.

4.6.5 Incomplete PG
This warning is issued when the actual replica number is less than min_size.
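As a temporary measure during recovery, min_size can be lowered and then restored (the pool name and values are placeholders):

ceph osd pool get {pool-name} min_size # note the current value
ceph osd pool set {pool-name} min_size 1 # temporarily allow IO/recovery with a single replica
# restore the original value once the PGs are active+clean again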

4.6.6 Unfound objects


When the Ceph cluster health command returns information about unfound objects, it means that some parts of the data are not accessible in even one copy.

ceph health detail

The command above displays the PG names with unfound objects. The PG should then be examined for any missing parts:

ceph pg {pgname} list_missing

4.6.7 Stale PG
Simply restart the affected OSD. This issue occurs when an OSD cannot map all objects that it holds. To find the OSD, run the following command:
ceph pg dump_stuck stale

Then map the PG:

ceph pg map {pgname}

Alternatively, the information can be acquired with:

ceph health detail

This command will display the defective OSDs as the "last acting" ones.

Those daemons should be restarted and deeply debugged.



4.6.8 Peering and down PGs
Peering and down PGs can last for a long time after a cluster change (recovery, adding a new OSD, map or ruleset changes).

The following command displays the affected PGs; then the issue that blocks peering should be identified:
ceph health detail

The following command shows, in the ["recovery_state"]["blocked"] section, why the peering is stopped:
ceph pg {pgname} query
In most cases, there will be information about some OSD being down.

When the OSD cannot be brought up again, it should be marked as "lost", and the recovery process will begin:
ceph osd lost {osd_number}

4.7 Resolving issues with CRUSH maps


After making any changes to CRUSH maps, the new version should be tested to confirm compliance with the OSD layout and to review any issues with the new data placement. It is also good to review the amount of data that will be remapped by the new placement order.

crushtool -i crush.map --test --show-bad-mappings \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))

Placement statistics can be checked with the following command:

crushtool -i crush.map --test --show-utilization --rule 1

4.8 Object service RadosGW troubleshooting


RadosGW is the object storage service component of Ceph. It provides S3- and Swift-compatible RESTful interfaces to the Ceph RADOS back-end store.
The RadosGW daemon is connected through the FastCGI interface to an Apache HTTPD server, which acts as an HTTP gateway to the outside.



4.8.1 RadosGW logs
RadosGW logs are separated and stored in:
/var/log/radosgw/ceph-client.radosgw.gateway.log

The logs are rotated daily like the rest of Ceph logs.

To debug the rgw service, the log level can be increased:

debug rgw = 10/10 # (representing log level and memory level)

This setting will produce extensive information output and a significant amount of data to analyse. The default setting is '1/5'.

For the Apache HTTPd daemon the logs are stored in:
/var/log/apache

The main error.log in this directory is useful for debugging most of the issues.

4.8.2 RadosGW daemon pools


Object storage uses several pools, and to troubleshoot any performance or availability issue, debugging the underlying RADOS pools is essential:

1. Check the overall cluster health and PG statuses.
2. Check Ceph cluster performance with simple checks.
3. Check the RadosGW process and logs.

Default pool names to check for health status:


.rgw.root
.rgw.control
.rgw
.rgw.gc
.users.uid
.users
.rgw.buckets.index
.rgw.buckets
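A few quick checks against these pools, using standard RADOS/Ceph CLI commands (adjust pool names to your deployment):

rados lspools # verify that the RGW pools listed above exist
rados df # per-pool usage and object counts
ceph pg dump_stuck unclean # any stuck PGs that could affect the RGW pools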

4.8.3 Authorization issues


RadosGW connects to the OpenStack Identity service (Keystone) for authorization. If Keystone is not available, this will result in constant authorization failures and 403 Access Forbidden responses to the clients.
The connection to and availability of the Identity service should be checked.
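A quick connectivity check against the Identity endpoint configured in rgw_keystone_url (the address below comes from the earlier configuration example; adjust it to your deployment):

curl -i http://192.168.0.2:35357/v2.0/ # should return the Keystone v2.0 version document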

To check whether a user is available:


radosgw-admin user info --uid {radosgwuser}

The user should not be suspended in order to use this service, and should have an S3 key and a Swift key to use both service endpoints.

4.8.4 Remapping index of RadosGW buckets


While the PGs holding a bucket index are being remapped (for example, during cluster expansion or OSD failure), significant delays and slow queries can occur.

4.8.5 Quick functional check for RadosGW service

The s3curl tool can be used to perform simple and quick tests of RadosGW. A modified version is available to test RGW, while the original version was written for the AWS S3 service.
apt-get install libdigest-hmac-perl
git clone https://github.com/rzarzynski/s3curl.git
cd s3curl
chmod 755 s3curl.pl

# to get user credentials (keys)
radosgw-admin user info --uid={rgwuid}

# bucket creation
./s3curl.pl --debug --id {accesskey} --key {secretkey} --endpoint
--createBucket -- http://localhost:6780/test

# put object into test bucket
./s3curl.pl --debug --id {accesskey} --key {secretkey} --endpoint --put
/etc/hostname -- http://localhost:6780/test/hostname

# list objects in test bucket
./s3curl.pl --debug --id {accesskey} --key {secretkey} --endpoint --
http://localhost:6780/test/

# delete object from test bucket
./s3curl.pl --debug --id {accesskey} --key {secretkey} --endpoint --delete --
http://localhost:6780/test/hostname

All tests should pass and return the correct HTTP response codes.
