OCS 4.X Troubleshooting
OVERVIEW RED HAT CONFIDENTIAL
[Diagram: applications surrounded by platform services — Registry, Metrics (Prometheus), Logging]
Focus Areas
PRESENT & FUTURE
TERMINOLOGY
Terminology
● CRD: Custom Resource Definition; Schema Extension to Kubernetes API
● CR: Custom Resource; One record/instance/object, conforming to a CRD
● OPERATOR: Daemon that watches for changes to resources
● STORAGE CLASS: “class” of storage service
● PVC: Persistent Volume Claim, attach persistent storage to a pod
● POD: a group of one or more containers managed by Kubernetes
Storage access modes
● RWO - ReadWriteOnce: volume can be mounted as read-write by a single node
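For instance, a minimal claim for RWO block storage might look like the sketch below (the storage class name ocs-storagecluster-ceph-rbd is the usual OCS default for RBD, but confirm it with “oc get storageclass”):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-pvc
spec:
  accessModes:
    - ReadWriteOnce        # RWO: mounted read-write by a single node
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-ceph-rbd
```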
Operator pattern
● Codifies domain expertise to deploy and manage an application
○ Automates actions a human would normally do
● Apply user’s desired state
○ Observe - discover current actual state of cluster
○ Analyze - determine differences from desired state
○ Act - perform operations to drive actual towards desired
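The Observe/Analyze/Act loop above can be sketched as a toy shell loop. No real cluster is involved; “desired” and “actual” replica counts are illustrative stand-ins for cluster state:

```shell
# Toy reconcile loop: drive actual state toward desired state.
desired=3
actual=1
while true; do
  # Observe: read the current actual state
  current=$actual
  # Analyze: compare with the desired state
  if [ "$current" -eq "$desired" ]; then
    break
  fi
  # Act: take one step toward the desired state
  actual=$((actual + 1))
done
echo "reconciled to $actual replicas"
```

A real operator runs this loop continuously, re-observing after every change so it also corrects drift, not just initial deployment.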
What’s an Operator?
An Operator is an entity that runs just like any other application in your
OpenShift cluster. Here, the Operator manages the life-cycle of a storage cluster.
→ OCS is a meta-operator that bootstraps the Rook and NooBaa operators, as well as their respective cluster
CRs (CephCluster, Object, File, Block).
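One way to see this chain on a live cluster is to list the top-level CRs. This is a command sketch: it assumes the default openshift-storage namespace and the standard OCS/Rook/NooBaa CRD names.

```shell
# Top-level OCS CR created at install time
oc -n openshift-storage get storagecluster

# CRs the OCS meta-operator creates for Rook and NooBaa
oc -n openshift-storage get cephcluster,cephblockpool,cephfilesystem,cephobjectstore
oc -n openshift-storage get noobaa
```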
Ceph
Architectural components
[Diagram: app, host/VM, and client access paths into the Ceph cluster]
RADOS components
Monitors:
▪ Maintain cluster membership and state
▪ Provide consensus for distributed decision-making
▪ Are not in the data path: clients read and write directly to OSDs
Managers:
▪ Track runtime metrics and the current state of the cluster (utilization, PG status)
▪ Host modules such as the dashboard and the Prometheus exporter
Metadata Server:
▪ Manages metadata for a POSIX-compliant shared
filesystem
▪ Directory hierarchy
▪ File metadata (owner, timestamps, mode, etc.)
▪ Stores metadata in RADOS
▪ Does not serve file data to clients
▪ Only required for the shared filesystem
▪ Multiple MDS are supported
RADOS components
Rados Gateway:
▪ REST-based object storage proxy
▪ Uses RADOS to store objects
▪ API supports buckets, accounts
▪ Usage accounting for billing
▪ Compatible with S3 and Swift applications
▪ Multi-site replication
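Because RGW is S3-compatible, any stock S3 client can exercise it. A hedged sketch with the AWS CLI, where the endpoint URL, bucket name, and credentials are placeholders you would take from your own RGW setup:

```shell
# Point the standard AWS CLI at the RGW endpoint instead of AWS
aws s3 mb s3://demo-bucket --endpoint-url https://<rgw-endpoint>
aws s3 cp ./report.csv s3://demo-bucket/ --endpoint-url https://<rgw-endpoint>
aws s3 ls s3://demo-bucket --endpoint-url https://<rgw-endpoint>
```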
RADOS Cluster
[Diagram: an application talking directly to a RADOS cluster of OSDs and monitors]
OBJECT PLACEMENT WITH CRUSH
Controlled Replication Under Scalable Hashing
CRUSH: Data is organized into pools
[Diagram: objects in pools A–D hash into placement groups (PGs); CRUSH maps each PG onto OSDs 0–7 across the cluster]
CRUSH: dynamic placement
Ceph-CSI
The Ceph-CSI plugin implements an interface between a CSI-enabled Container
Orchestrator (CO) and a Ceph cluster.
Since Ceph block and filesystem are distributed it’s a matter of:
Architectural Layers
● Rook:
○ The operator owns the management of Ceph
● Ceph-CSI:
○ CSI driver dynamically provisions and connects client pods to
the storage
● Ceph:
○ Data layer: Storage Provider
○ Block/File/Object storage
Rook Components: Pods
Application Storage: Provisioning
Application Storage: Data Path
ENVIRONMENT OVERVIEW
What’s running?
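A quick way to answer this is to list every pod in the storage namespace (assuming the default openshift-storage namespace):

```shell
oc -n openshift-storage get pods -o wide
# Expect the operator pods (ocs-operator, rook-ceph-operator, noobaa-operator)
# alongside the Ceph daemons (mon, mgr, osd) and the NooBaa core pods.
```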
How do daemons run? 1/2
Rook does **not** rely on a “ceph.conf” configuration file; instead, everything
is CLI based. So don’t be afraid if you see an empty ceph.conf, or none at
all.
This means every Ceph daemon receives its entire configuration via
CLI flags.
How do daemons run? 2/2
E.g for a monitor container:
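The slide shows the full container command; an illustrative (not verbatim) shape of a Rook-launched monitor process, with every setting passed as a flag and the angle-bracket values as placeholders:

```shell
ceph-mon \
  --fsid=<cluster-fsid> \
  --keyring=/etc/ceph/keyring-store/keyring \
  --mon-initial-members=a,b,c \
  --mon-host=<mon-endpoints> \
  --id=a \
  --public-addr=<pod-ip> \
  --log-to-stderr=true \
  --foreground
```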
How do I run Ceph commands?
As a direct consequence of not using a “ceph.conf” configuration file, if you exec
into any container, you won’t be able to easily run Ceph commands.
Instead, you should use the “toolbox” or the Operator container, which will
allow you to run any Ceph command.
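For example, once the toolbox is deployed, a command sketch (the deployment name rook-ceph-tools matches the upstream toolbox YAML; adapt the namespace to your cluster):

```shell
oc -n openshift-storage exec -it deploy/rook-ceph-tools -- ceph status
oc -n openshift-storage exec -it deploy/rook-ceph-tools -- ceph osd tree
```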
Deployment description: Monitor
initContainers:
Containers:
Deployment description: OSD
initContainers:
1. copy-bins → copies the “rook” and “tini” binaries from the Operator image into the Ceph image.
Later, during the provision container, the “rook” CLI is called to perform several actions
(such as preparing the disk).
Containers:
● provision → runs the “ceph-volume lvm prepare” command to prepare the disk
● osd
○ runs the “ceph-volume lvm activate” command
○ runs the “ceph-osd” process in the foreground
Get ready to troubleshoot
● “Toolbox” to the rescue!
○ Just adapt the namespace and the container image of this YAML
https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/toolbox.yaml
● If you don’t want to run the toolbox, you can exec into the Operator pod and run:
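An illustrative form of that command follows; the config path assumes Rook’s /var/lib/rook/&lt;namespace&gt;/&lt;namespace&gt;.config convention, so verify it in your own cluster:

```shell
oc -n openshift-storage exec -it deploy/rook-ceph-operator -- \
  ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config status
```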
A config change made with “ceph config set” persists across restarts of the daemon, so the
change is permanent.
Get daemon local configuration
I want to verify the value of osd_memory_target_cgroup_limit_ratio:
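From inside the daemon’s container, the admin-socket interface reports the live value; “osd.0” is a placeholder for whichever daemon you exec’ed into:

```shell
ceph daemon osd.0 config get osd_memory_target_cgroup_limit_ratio
```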
The command works locally by talking to the daemon’s admin socket, so it does not
connect to the monitors at all.
It therefore only works when exec’ed into that specific daemon’s container.
Debug failing daemon 1/2
Scenario:
Now you can exec into this pod and will get the right environment to start
debugging.
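One common way to obtain such a pod is the usual Rook trick: replace the daemon’s command with sleep so the container stays up without the crashing process. A sketch, where the deployment name rook-ceph-mon-a and container name mon are examples to adapt to the failing daemon:

```shell
oc -n openshift-storage patch deployment rook-ceph-mon-a --patch \
  '{"spec":{"template":{"spec":{"containers":[{"name":"mon","command":["sleep","infinity"],"args":[]}]}}}}'
# You may also need to remove liveness probes so the pod is not restarted.
```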
MULTI-CLOUD OBJECT GATEWAY
Start lean
Scale locally
Workload portability
Multi-site Buckets
[Diagram: an app’s buckets replicated across the London, New York, and Paris data centers]
EFFICIENCY AND SECURITY BY DEFAULT
CONFIDENTIAL Designator
BUCKET CLAIMS
OBJECT BUCKET CLAIM
[Diagram: an application pod reading and writing a bucket provisioned through an ObjectBucketClaim]
BUCKET CLAIM
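For reference, a claim like the obc-test shown later in the noobaa status output can be created from a manifest along these lines (field values are illustrative):

```yaml
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: obc-test
  namespace: openshift-storage
spec:
  generateBucketName: obc-test
  storageClassName: openshift-storage.noobaa.io
```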
MCG TROUBLESHOOTING
NOOBAA STATUS
INFO[0000] CLI version: 2.0.8
INFO[0000] noobaa-image: noobaa/noobaa-core:5.2.10
INFO[0000] operator-image: noobaa/noobaa-operator:2.0.8
INFO[0000] Namespace: openshift-storage
INFO[0000]
INFO[0000] CRD Status:
INFO[0001] ✅ Exists: CustomResourceDefinition "noobaas.noobaa.io"
INFO[0002] ✅ Exists: CustomResourceDefinition "backingstores.noobaa.io"
INFO[0002] ✅ Exists: CustomResourceDefinition "bucketclasses.noobaa.io"
INFO[0002] ✅ Exists: CustomResourceDefinition "objectbucketclaims.objectbucket.io"
INFO[0002] ✅ Exists: CustomResourceDefinition "objectbuckets.objectbucket.io"
INFO[0002]
INFO[0002] Operator Status:
INFO[0002] ✅ Exists: Namespace "openshift-storage"
INFO[0002] ✅ Exists: ServiceAccount "noobaa"
INFO[0003] ✅ Exists: Role "ocs-operator.v0.0.1-hnwlz"
INFO[0003] ✅ Exists: RoleBinding "ocs-operator.v0.0.1-hnwlz-noobaa-m2272"
INFO[0003] ✅ Exists: ClusterRole "ocs-operator.v0.0.1-vmkpp"
INFO[0003] ✅ Exists: ClusterRoleBinding "ocs-operator.v0.0.1-vmkpp-noobaa-j28q2"
INFO[0003] ✅ Exists: Deployment "noobaa-operator"
INFO[0003]
INFO[0003] System Status:
INFO[0004] ✅ Exists: NooBaa "noobaa"
INFO[0004] ✅ Exists: StatefulSet "noobaa-core"
INFO[0004] ✅ Exists: Service "noobaa-mgmt"
INFO[0004] ✅ Exists: Service "s3"
INFO[0004] ✅ Exists: Secret "noobaa-server"
INFO[0004] ✅ Exists: Secret "noobaa-operator"
INFO[0005] ✅ Exists: Secret "noobaa-admin"
INFO[0005] ✅ Exists: StorageClass "openshift-storage.noobaa.io"
INFO[0005] ✅ Exists: BucketClass "noobaa-default-bucket-class"
INFO[0005] ✅ (Optional) Exists: BackingStore "noobaa-default-backing-store"
INFO[0005] ✅ (Optional) Exists: CredentialsRequest "noobaa-cloud-creds"
INFO[0005] ✅ (Optional) Exists: PrometheusRule "noobaa-prometheus-rules"
INFO[0006] ✅ (Optional) Exists: ServiceMonitor "noobaa-service-monitor"
INFO[0006] ✅ (Optional) Exists: Route "noobaa-mgmt"
INFO[0006] ✅ (Optional) Exists: Route "s3"
INFO[0006] ✅ Exists: PersistentVolumeClaim "db-noobaa-core-0"
INFO[0006] ✅ System Phase is "Ready"
INFO[0006] ✅ Exists: "noobaa-admin"

#------------------#
#- Mgmt Addresses -#
#------------------#
ExternalDNS : [https://noobaa-mgmt-openshift-storage.apps.cluster-ocs-e12b.ocs-e12b.example.opentlc.com https://a1839c3200b7511eab3bf12891326d01-456461733.us-east-1.elb.amazonaws.com:443]
ExternalIP  : []
NodePorts   : [https://10.0.140.19:32561]
InternalDNS : [https://noobaa-mgmt.openshift-storage.svc:443]
InternalIP  : [https://172.30.46.100:443]
PodPorts    : [https://10.131.2.15:8443]

#--------------------#
#- Mgmt Credentials -#
#--------------------#
email    : admin@noobaa.io
password : a5F/I4qh56qlUpFb5NJVFw==

#----------------#
#- S3 Addresses -#
#----------------#
ExternalDNS : [https://s3-openshift-storage.apps.cluster-ocs-e12b.ocs-e12b.example.opentlc.com https://a183dc3760b7511eab3bf12891326d01-841639993.us-east-1.elb.amazonaws.com:443]
ExternalIP  : []
NodePorts   : [https://10.0.140.19:30052]
InternalDNS : [https://s3.openshift-storage.svc:443]
InternalIP  : [https://172.30.147.48:443]
PodPorts    : [https://10.131.2.15:6443]

#------------------#
#- S3 Credentials -#
#------------------#
AWS_ACCESS_KEY_ID     : kOevajmwLAMe2o7TVMCc
AWS_SECRET_ACCESS_KEY : eWevwwF+0TwiC2LdG/a9Lyh8bX+LyvRHrSL/fnwc

#------------------#
#- Backing Stores -#
#------------------#
NAME                          TYPE    TARGET-BUCKET                                              PHASE  AGE
noobaa-default-backing-store  aws-s3  noobaa-backing-store-a1ebfdd3-880a-40f6-b659-be4b588cd1c4  Ready  2h9m10s

#------------------#
#- Bucket Classes -#
#------------------#
NAME                         PLACEMENT                                                             PHASE  AGE
noobaa-default-bucket-class  {Tiers:[{Placement: BackingStores:[noobaa-default-backing-store]}]}   Ready  2h9m10s

#-----------------#
#- Bucket Claims -#
#-----------------#
NAMESPACE          NAME      BUCKET-NAME                                           STORAGE-CLASS                BUCKET-CLASS  PHASE
openshift-storage  obc-test  obc-test-noobaa-74f10ace-e5fc-4210-a106-84ac02f2caaa  openshift-storage.noobaa.io                Bound
TROUBLESHOOTING
● https://<management url>/metric - check the raw data provided by MCG
● NooBaa CLI - noobaa status
Nothing helped? Must gather it!
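The collection itself is a single command; the image reference below is a placeholder, so substitute the OCS must-gather image that matches your release:

```shell
oc adm must-gather --image=<ocs-must-gather-image> --dest-dir=./ocs-must-gather
```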
Platform
Supported Platform
● AWS
● VMware vSphere
Workloads
OCS WORKLOADS
● Block: primary for DB and transactional workloads
● File: POSIX-compliant shared file system
● Object: media, AI/ML training data
OCS Must-Gather
Lifecycle of must-gather
Where to look for a resource? (Core)
Resource: Ceph command outputs — the output of commonly used ceph commands for debugging
Location: all ceph command outputs of an OCS cluster can be found under ceph/namespaces/<namespace-name>/must_gather_commands

Resource: OSD prepare volume logs — the osd prepare volume logs, which get stored on the nodes
Location: all OSD prepare volume logs of an OCS cluster can be found under ceph/namespaces/<namespace-name>/osd_prepare_volume_logs
● OSD prepare volume logs reside on the node where the PVC was prepared, so in the must-gather dump they appear under the node-name directory.
Deployment
EASE OF USE
OCS Operator
Simple Install
Integrated Dashboard
Day 2 Operations
Thank you
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos