Oracle Real Application Clusters
Oracle Clusterware
A cluster comprises multiple interconnected computers or servers that appear as if they are
one server to end users and applications. The Oracle RAC option with Oracle Database
enables you to cluster Oracle databases. Oracle RAC uses Oracle Clusterware for the
infrastructure to bind multiple servers so they operate as a single system.
Oracle Clusterware is a portable cluster management solution that is integrated with Oracle
Database. Oracle Clusterware is a required component for using Oracle RAC that provides
the infrastructure necessary to run Oracle RAC. Oracle Clusterware also manages
resources, such as Virtual Internet Protocol (VIP) addresses, databases, listeners, services,
and so on. In addition, Oracle Clusterware enables both noncluster Oracle databases and
Oracle RAC databases to use the Oracle high-availability infrastructure. Oracle Clusterware
along with Oracle Automatic Storage Management (Oracle ASM) (the two together comprise
the Oracle Grid Infrastructure) enables you to create a clustered pool of storage to be used by
any combination of noncluster and Oracle RAC databases.
Oracle Clusterware is the only clusterware that you need for most platforms on which Oracle
RAC operates. If your database applications require vendor clusterware, then you can use
such clusterware in conjunction with Oracle Clusterware if that vendor clusterware is certified
for Oracle RAC.
Node fencing is used to remove non-responsive nodes from the cluster.
Oracle RAC
Oracle Real Application Clusters allows multiple instances, running on multiple nodes, to
access a single database. In a standard Oracle configuration, a database can be mounted by
only one instance, but in a RAC environment many instances can access the same database.
The main differences between a single-instance environment and a RAC environment are summarized below.
SGA – Single instance: the instance has its own SGA. RAC: each instance has its own SGA.
Background processes – Single instance: the instance has its own set of background processes. RAC: each instance has its own set of background processes.
Datafiles – Single instance: accessed by only one instance. RAC: shared by all instances (shared storage).
Control files – Single instance: accessed by only one instance. RAC: shared by all instances (shared storage).
Online redo logfiles – Single instance: dedicated for read/write to only one instance. RAC: only one instance can write to its thread, but other instances can read it during recovery and archiving; if an instance is shut down, log switches by other instances can force the idle instance's redo logs to be archived. At least one additional thread of redo is required for each instance.
Archived redo logfiles – Single instance: dedicated to the instance. RAC: private to the instance, but other instances need access to all required archive logs during media recovery.
Flash recovery area – Single instance: accessed by only one instance. RAC: shared by all instances (shared storage).
Alert log and trace files – Single instance: dedicated to the instance. RAC: private to each instance; other instances never read or write to those files.
ORACLE_HOME – Single instance: multiple instances on the same server accessing different databases can use the same executable files. RAC: same as single instance, plus it can be placed on a shared file system, allowing a common ORACLE_HOME for all instances in a RAC environment.
RAC Components
• Oracle Clusterware
• Cluster Interconnects
Disk architecture
RAID 0 (Striping)
A number of disks are concatenated together to give the appearance of one very large disk. Advantages: improved performance and large volumes. Disadvantages: not highly available (if one disk fails, the whole volume is lost).
RAID 1 (Mirroring)
A single disk is mirrored by another disk; if one disk fails the system is unaffected as it can use its mirror. Advantages: improved performance and high availability. Disadvantages: expensive (requires double the number of disks).
RAID 5 (Striping with parity)
RAID stands for Redundant Array of Inexpensive Disks. The disks are striped with parity across three or more disks; if one of the disks fails, the data on the failed disk is reconstructed using the parity. Advantages: improved read performance; not expensive. Disadvantages: slow write operations (caused by having to calculate and write the parity).
Once you have your storage attached to the servers, you have three choices for how to set up
the disks:
• Cluster file system – used to hold all the Oracle datafiles; it can be used by Windows and
Linux but is not widely used.
• Raw volumes – raw devices presented to the database without a file system.
• Oracle Automatic Storage Management (ASM) – Oracle's own volume manager and file
system for database files, described below.
Oracle ASM uses disk groups to store data files; an Oracle ASM disk group is a collection of
disks that Oracle ASM manages as a unit. Within a disk group, Oracle ASM exposes a file
system interface for Oracle database files. The content of files that are stored in a disk group is
evenly distributed to eliminate hot spots and to provide uniform performance across the disks.
The performance is comparable to the performance of raw devices.
You can add or remove disks from a disk group while a database continues to access files from
the disk group. When you add or remove disks from a disk group, Oracle ASM automatically
redistributes the file contents and eliminates the need for downtime when redistributing the
content.
The Oracle ASM normal and high redundancy disk groups enable two-way and three-way mirroring
respectively. You can use external redundancy to enable a Redundant Array of Independent Disks
(RAID) storage subsystem to perform the mirroring protection function.
Oracle ASM also uses the Oracle Managed Files feature to simplify database file management.
Oracle Managed Files automatically creates files in designated locations. Oracle Managed Files
also names files and removes them while relinquishing space when tablespaces or files are
deleted.
Oracle ASM files can coexist with other storage management options such as raw disks and third-
party file systems. This capability simplifies the integration of Oracle ASM into pre-existing
environments.
All data files (including an undo tablespace for each instance) and redo log files (at least two for
each instance) for an Oracle RAC database must reside on shared storage. Oracle recommends
that you use Oracle ASM to store these files in an Oracle ASM disk group.
If you add a data file to a disk that other instances cannot access, then verification fails. Verification
also fails if instances access different copies of the same data file. If verification fails for any
instance, then diagnose and fix the problem. Then run the ALTER SYSTEM CHECK
DATAFILES statement on each instance to verify data file access.
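For example, the statement can be run from SQL*Plus on each instance:
SQL> ALTER SYSTEM CHECK DATAFILES;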
To create an Oracle ASM disk group, run ASMCA from the Grid_home/bin directory.
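A disk group can also be created with SQL while connected to the Oracle ASM instance; a minimal sketch (the disk group name and disk paths below are hypothetical):
SQL> CREATE DISKGROUP data NORMAL REDUNDANCY
       FAILGROUP fg1 DISK '/dev/oracleasm/disk1'
       FAILGROUP fg2 DISK '/dev/oracleasm/disk2';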
ASM Processes
ASMB – Communicates with the ASM instance, managing storage and providing statistics. It
provides information to and from CSS, which ASM uses to manage disk resources.
RBAL – Runs the rebalance plan that moves extents between disks when a disk is added to or
removed from a disk group. It also opens all disks listed under each disk group and makes
them available to the various clients.
MARK – Marks ASM allocation units as stale following a missed write to an offline disk; it
essentially tracks which extents require resynchronization for offline disks. This process runs
in the database instance and is started when the database instance first begins using the ASM
instance. If required, MARK can also be started on demand when disks go offline in an ASM
redundancy disk group.
ASM provides mirroring (for data redundancy) and striping (for disk I/O performance).
Redundancy Types
• External – ASM does no mirroring; redundancy is provided by the storage subsystem (for
example, a RAID array).
• Normal – two-way mirroring.
• High – three-way mirroring.
Allocation Unit – the building block of ASM. 1 MB is the smallest allocation unit, and the AU
size cannot be changed after a disk group is created, so it should be planned in advance.
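For example, a non-default AU size is specified as an attribute at disk group creation time (the disk group name, disk path and the 4M value are hypothetical):
SQL> CREATE DISKGROUP data EXTERNAL REDUNDANCY
       DISK '/dev/oracleasm/disk1'
       ATTRIBUTE 'au_size' = '4M';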
Use the following syntax to show the configuration of an Oracle ASM instance:
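$ srvctl config asm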
Use the following syntax to display the state of an Oracle ASM instance:
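$ srvctl status asm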
You can create Oracle RAC databases, whether multinode or Oracle Real Application
Clusters One Node (Oracle RAC One Node), using the following deployment models:
• Administrator-managed deployment – you configure the database to run on a specific set of
nodes and define the preferred and available instances for each service.
• Policy-managed deployment – you specify the server pool in which the database runs and
the number of servers it needs, and Oracle Clusterware places the instances on servers in
that pool.
RAC Tools
All nodes in an Oracle RAC environment must connect to at least one Local Area Network
(LAN) (commonly referred to as the public network) to enable users and applications to
access the database. In addition to the public network, Oracle RAC requires private network
connectivity used exclusively for communication between the nodes and database instances
running on those nodes. This network is commonly referred to as the interconnect.
The interconnect network is a private network that connects all of the servers in the cluster.
You must configure User Datagram Protocol (UDP) for the cluster interconnect, except in a
Windows cluster. Windows clusters use the TCP protocol.
A typical connect attempt from a database client to an Oracle RAC database instance can
be summarized as follows:
1. The database client connects to SCAN (which includes a SCAN VIP on a public
network), providing the SCAN listener with a valid service name.
2. The SCAN listener then determines which database instance hosts this service and
routes the client to the local or node listener on the respective node.
3. The node listener, listening on a node VIP and a given port, retrieves the connection
request and connects the client to an instance on the local node.
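As an illustration, a client-side tnsnames.ora entry for this kind of connection might look like the following (the SCAN name sales-scan.example.com and service name sales.example.com are made-up values):
SALES =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = sales-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = sales.example.com)
    )
  )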
If a node fails, then the VIP address fails over to another node on which the VIP address can
accept TCP connections, but it does not accept connections to the Oracle database. Clients
that attempt to connect to a VIP address not residing on its home node receive a rapid
connection refused error instead of waiting for TCP connect timeout messages. When the
network on which the VIP is configured comes back online, Oracle Clusterware fails back the
VIP to its home node, where connections are accepted. Generally, VIP addresses fail over
when:
• the node on which a VIP address runs fails
• all interfaces for the VIP address fail or are disconnected from the network
The clusterware software allows the nodes to communicate with each other and forms the
cluster that makes the nodes work as a single logical server. The software is run by Cluster
Ready Services (CRS) using the Oracle Cluster Registry (OCR), which records and maintains
the cluster and node membership information, and the voting disk, which acts as a tiebreaker
during communication failures. Consistent heartbeat information travels across the
interconnect to the voting disk when the cluster is running.
In a three-node cluster, if one node's time is not synchronized with the other two nodes,
CTSSD kills the unsynchronized node to maintain data integrity.
OCSSd provides synchronization services among nodes. It provides access to node
membership and enables basic cluster services, including cluster group services and locking.
Failure of this daemon causes the node to be rebooted to avoid split-brain situations.
OHASD
Oracle High Availability Services Daemon (OHASD) anchors the lower part of the Oracle
Clusterware stack, which consists of processes that facilitate cluster operations in RAC
databases. This includes the GPNPD, GIPC, MDNS and GNS background processes.
To enable OHAS on each RAC node, you issue the crsctl enable crs command; this causes
OHAS to autostart when the node reboots.
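For example, on each node (as root):
# configure the Clusterware stack to start automatically at boot
$ crsctl enable crs
# check the current state of the Clusterware stack
$ crsctl check crs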
GPNPd – Grid Plug and Play daemon: provides access to the Grid Plug and Play profile and
coordinates updates to the profile among the nodes of the cluster so that all nodes have the
most recent profile.
You can display the content of the OLR on the local node to the text terminal that initiated the
program using the OCRDUMP utility: ocrdump -local -stdout
OS Runlevel
Must be 3 or 5 to start clusterware
0 -> Shutdown
3 -> Multiuser mode but no GUI
5 -> Multiuser with GUI
6 -> Reboot
RAC uses a membership scheme, so any node wanting to join the cluster has to become a
member. RAC can evict any member that it sees as a problem; its primary concern is
protecting the data. You can add and remove nodes from the cluster, and the membership
increases or decreases accordingly. When network problems occur, membership becomes the
deciding factor in which part stays as the cluster and which nodes get evicted; this is where
the voting disk, discussed later, comes in.
The resource management framework manages the resources of the cluster (disks, volumes),
and you can have only one resource management framework per resource. Multiple
frameworks are not supported as they can lead to undesirable effects.
The Oracle Cluster Ready Services (CRS) daemon uses a registry to keep the cluster
configuration; it should reside on shared storage and be accessible to all nodes within the
cluster. This shared storage is known as the Oracle Cluster Registry (OCR) and it is a major
part of the cluster. It is automatically backed up every four hours by the clusterware daemons,
and you can also back it up manually. The OCSSd uses the OCR extensively and writes the
changes to the registry.
The OCR is loaded as a cache on each node. Each node updates its cache, but only one node,
called the master, is allowed to write the cache back to the OCR file. Enterprise Manager also
uses the OCR cache. The OCR should be at least 100 MB in size. The CRS daemon updates
the OCR with the status of the nodes in the cluster during reconfigurations and failures.
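Some useful commands for checking and backing up the OCR (run as root):
# report OCR integrity, size and usage
$ ocrcheck
# list the automatic and manual OCR backups
$ ocrconfig -showbackup
# take a manual OCR backup
$ ocrconfig -manualbackup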
Voting Disk
The voting disk (or quorum disk) is shared by all nodes within the cluster. Information about
the cluster is constantly being written to the disk; this is known as the heartbeat. If for any
reason a node cannot access the voting disk, it is immediately evicted from the cluster. This
protects the cluster from split-brain situations (the Instance Membership Recovery (IMR)
algorithm is used to detect and resolve split-brain), as the voting disk decides which part is the
real cluster. The voting disk manages cluster membership and arbitrates cluster ownership
during communication failures between nodes. Voting is often confused with quorum; the two
are similar but distinct.
The voting disk has to reside on shared storage; it is a small file (20 MB) that can be accessed
by all nodes in the cluster.
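The configured voting disks can be listed with:
$ crsctl query css votedisk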
Each node records via the heartbeat which of the other nodes it can see, for example:
Nodes   N1   N2   N3
N1      ☑    ☑    ☑
N2      ☑    ☑    ☑
N3      ☒    ☑    ☑
Here N1 and N2 can see all three nodes, but N3 cannot see N1; the membership vote then
decides which nodes remain in the cluster.
The main RAC-specific background processes are:
LMSn – Lock Manager Server process (GCS): these processes manage lock manager service
requests for GCS resources and send them to a service queue to be handled by the LMSn
process. They also handle global deadlock detection and monitor for lock conversion timeouts.
As a performance gain you can increase this process's priority to make sure CPU starvation
does not occur. You can see the statistics of this daemon by looking at the view X$KJMSDP.
LMON – Lock Monitor Process (GES): this process manages the GES and maintains
consistency of the GCS memory structure in case of process death. It is also responsible for
cluster reconfiguration and lock reconfiguration (nodes joining or leaving); it checks for
instance deaths and listens for local messaging. A detailed log file is created that tracks any
reconfigurations that have happened.
LMD – Lock Manager Daemon (GES): this manages the enqueue manager service requests for
the GCS. It also handles deadlock detection and remote resource requests from other
instances. You can see the statistics of this daemon by looking at the view X$KJMDDP.
LCK0 – Lock Process (GES): manages instance resource requests and cross-instance call
operations for shared resources. It builds a list of invalid lock elements and validates lock
elements during recovery.
DIAG – Diagnostic Daemon: this is a lightweight process that uses the DIAG framework to
monitor the health of the cluster. It captures information for later diagnosis in the event of
failures and performs any necessary recovery if an operational hang is detected.
Cache coherency is the technique of keeping multiple copies of a buffer consistent between
different Oracle instances on different nodes. Global cache management ensures that access
to a master copy of a data block in one buffer cache is coordinated with the copy of the block
in another buffer cache. Cache coherency identifies the most up-to-date copy of a resource,
also called the master copy, and uses a mechanism by which multiple copies of an object are
kept consistent between Oracle instances. Parallel Cache Management (PCM) ensures that
the copies of a data block held in different buffer caches remain consistent, as the following
sequence illustrates:
1. When instance A needs a block of data to modify, it reads the block from disk, but before
reading it must inform the GCS (DLM). The GCS keeps track of the lock status of the data
block by keeping an exclusive lock on it on behalf of instance A.
2. Now instance B wants to modify that same data block. It too must inform the GCS, and the
GCS will request that instance A release the lock. Thus the GCS ensures that instance B
gets the latest version of the data block (including instance A's modifications) and then
locks it exclusively on instance B's behalf.
3. At any one point in time, only one instance has the current copy of the block, which
preserves the integrity of the block.
The GCS maintains data coherency and coordination by keeping track of the lock status of
every block that can be read or written by any node in the RAC. The GCS is an in-memory
database that contains information about the current locks on blocks and about the instances
waiting to acquire locks.
RAC uses two services, the GCS and the GES, which maintain records of the lock status of
each data file and each cached block using the Global Resource Directory (GRD).
What is Cache Fusion?
Prior to the Cache Fusion concept, Oracle used disk pinging: there was no data block transfer
between the buffer cache of one instance and the buffer cache of another instance. If instance
A read a block from disk and instance B then wanted to read the same block, which was not in
instance B's buffer cache, instance B had to read the block from disk, causing an additional
disk read.
Similarly, if instance A changed a data block and instance B wanted to read the committed
row, instance A had to write the changes to disk before instance B could read the record.
This became a performance bottleneck, so Oracle introduced Cache Fusion in Oracle 8i to
improve performance.
With Cache Fusion, Oracle RAC transfers data blocks from the buffer cache of one instance to
the buffer cache of another instance over the high-speed cluster interconnect.
For example, instance A reads a block and it is in its local buffer cache. When instance B
wants to read the same block, the block (shared current image, SCUR) can be transferred from
instance A's buffer cache to instance B's buffer cache, with no additional disk read required.
If instance A has changed a block and the change is not yet committed, and instance B wants
to read the same block, instance A builds a consistent read (CR) image of the block and ships
it to instance B across the interconnect, again without writing the block to disk first.
Dynamic database services enable you to manage workload distributions to provide optimal
performance for users and applications. Dynamic database services offer the following
features:
• Services: Services are entities that you can define in Oracle RAC databases that
enable you to group database workloads, route work to the optimal instances that are
assigned to offer the service, and achieve high availability for planned and unplanned
actions.
• High Availability Framework: An Oracle RAC component that enables Oracle
Database to always maintain components in a running state.
• Fast Application Notification (FAN): Provides information to Oracle RAC
applications and clients about cluster state changes and Load Balancing Advisory
events, such as UP and DOWN events for instances, services, or nodes.
• Transaction Guard: A tool that provides a protocol and an API for at-most-once
execution of transactions in case of unplanned outages and duplicate submissions.
• Connection Load Balancing: A feature of Oracle Net Services that balances
incoming connections across all of the instances that provide the requested database
service.
• Load Balancing Advisory: Provides information to applications about the current
service levels that the database and its instances are providing. The load balancing
advisory makes recommendations to applications about where to direct application
requests to obtain the best service based on the management policy that you have
defined for that service. Load balancing advisory events are published through
Oracle Notification Service.
• Automatic Workload Repository (AWR): Tracks service-level statistics as metrics.
Server generated alerts can be created for these metrics when they exceed or fail to
meet certain thresholds.
• Fast Connection Failover: This is the ability of Oracle Clients to provide rapid
failover of connections by subscribing to FAN events.
• Runtime Connection Load Balancing: This is the ability of Oracle Clients to
provide intelligent allocations of connections in the connection pool based on the
current service level provided by the database instances when applications request a
connection to complete some work.
• Single Client Access Name (SCAN): Provides a single name to the clients
connecting to Oracle RAC that does not change throughout the life of the cluster,
even if you add or remove nodes from the cluster.
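As a sketch, a service could be defined and started with SRVCTL as follows (the database name orcl, service name oltp, and instance names are hypothetical):
$ srvctl add service -db orcl -service oltp -preferred orcl1,orcl2 -available orcl3
$ srvctl start service -db orcl -service oltp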
The behavior of common SQL*Plus commands in an Oracle RAC environment is as follows:
CONNECT – affects the default instance if no instance is specified in the CONNECT command.
HOST – affects the node running the SQL*Plus session, regardless of the location of the current and default instances.
RECOVER – does not affect any particular instance, but rather the database.
SHOW INSTANCE – displays information about the current instance, which can be different from the default local instance if you have redirected your commands to a remote instance.
SHOW PARAMETER and SHOW SGA – display parameter and SGA information from the current instance.
STARTUP and SHUTDOWN – always affect the current instance. These are privileged SQL*Plus commands.
The following SRVCTL command, for example, mounts all of the non-running instances of an
Oracle RAC database:
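$ srvctl start database -db db_unique_name -startoption mount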
In Windows you must enclose a comma-delimited list in double quotation marks ("").
Note that this command also starts all enabled and non-running services that
have AUTOMATIC management policy, and for which the database role matches one of the
service's roles.
To stop one or more instances, enter the following SRVCTL syntax from the command line:
$ srvctl stop instance -db db_unique_name [-instance "instance_name_list" | -node node_name] [-stopoption stop_options]
You can enter either a comma-delimited list of instance names to stop several instances, or a
node name to stop one instance. In Windows you must enclose a comma-delimited list in
double quotation marks ("").
This command also stops the services related to the terminated instances on the nodes where
the instances were running. As an example, the following command shuts down the two
instances, orcl3 and orcl4, on the orcl database using the immediate stop option:
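$ srvctl stop instance -db orcl -instance "orcl3,orcl4" -stopoption immediate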
The following command provides an example of using SRVCTL to check the status of the
database instances for the Oracle RAC database named mail:
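$ srvctl status database -db mail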
Additionally, you can check whether PDBs are running in the cluster by checking the availability
of their assigned services, as follows:
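$ srvctl status service -db mail -service pdb_service_name
(pdb_service_name is a placeholder for the service assigned to the PDB.)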
Using either of these CRSCTL commands to stop all database instances on a server or in the
cluster can cause the database instances to be stopped in a manner similar to a SHUTDOWN
ABORT, which requires instance recovery on startup. If you use SRVCTL to stop the database
instances manually before stopping the cluster, then you can prevent a shutdown abort, but
this requires that you manually restart the database instances after restarting Oracle
Clusterware.
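A minimal sketch of that safer sequence, assuming a database named orcl (the name is hypothetical):
# stop the database cleanly before stopping the cluster
$ srvctl stop database -db orcl
# stop Oracle Clusterware on all nodes
$ crsctl stop cluster -all
# once Oracle Clusterware has been restarted, start the database manually
$ srvctl start database -db orcl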