RAC nodes reboot themselves randomly after upgrade to 10.2.0.4

I have performed the upgrade of an Oracle cluster database from 10.2.0.1 to 10.2.0.4.

I followed the procedure and, apart from a couple of failures during the catalog upgrade, everything went fine. The upgrade was performed overnight on a Friday, but on Sunday one of the nodes rebooted itself. The database and clusterware logs were clean, so I had no clue about what might be causing the problem.

In a note about troubleshooting clusterware I saw a new process called “Oracle Process Monitor Daemon (OPROCD)”,
and its log location was given as: /etc/oracle/hostname.oprocd.log

Notice that the log files are outside the CRS_HOME; to be more precise, on my Red Hat system they are in this directory: /etc/oracle/oprocd

Let's take a look:

Jun 15 03:01:15.983 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
Jun 15 03:01:15.983 | INF | switching to the legacy socket mode
Jun 15 03:01:15.992 | INF | fatal mode startup, setting process to fatal mode


The last line shows oprocd starting in fatal mode, in which it reboots the server when its check fails. But what is the oprocd process? The Process Monitor Daemon (OPROCD) checks for hardware freezes: it performs a check every 1000 milliseconds, and if that check is late by more than the 500 millisecond margin, it reboots the server to prevent any pending I/O operations from being reissued to the shared cluster disk.
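To make the timing rule concrete, here is a small shell sketch of the comparison as I understand it. It is only an illustration and not Oracle's implementation; the function name margin_exceeded is made up for this example, and it assumes the node is considered frozen when a wake-up takes longer than timeout plus margin.

```shell
# Illustrative only: NOT Oracle's code, just the arithmetic described above.
# margin_exceeded ELAPSED_MS TIMEOUT_MS MARGIN_MS
# Succeeds (exit status 0) when a check that took ELAPSED_MS to come around
# is later than TIMEOUT_MS + MARGIN_MS, i.e. the case that leads to a reboot.
margin_exceeded() {
    local elapsed_ms=$1 timeout_ms=$2 margin_ms=$3
    [ "$elapsed_ms" -gt $((timeout_ms + margin_ms)) ]
}

# Example: with -t 1000 -m 500, a wake-up after 1600 ms is too late.
if margin_exceeded 1600 1000 500; then
    echo "margin exceeded: node would be rebooted"
fi
```

Under this assumption a 1400 ms wake-up is still tolerated with the default values, which shows how little headroom a busy server has.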

[oracle@myserver]$ ps -ef |grep oprocd|grep -v grep
root 29305 28583 0 Jun22 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root 29918 29305 0 Jun22 ? 00:00:47 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f

Here we can see the parameters -t 1000 and -m 500

-t is timeout value or time between executions in milliseconds
-m is the acceptable margin before rebooting
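If you want to check these values on several nodes, a small helper can pull them out of the command line. This is just a sketch: oprocd_flag is a name I made up, and it assumes the flags appear as "-t value" and "-m value" exactly as in the ps output above.

```shell
# Hypothetical helper: print the value that follows a flag in a command line.
# oprocd_flag FLAG "COMMAND LINE"
oprocd_flag() {
    local flag=$1 cmdline=$2
    # Split the command line into words (no quoting expected in this output).
    set -- $cmdline
    while [ $# -gt 1 ]; do
        if [ "$1" = "$flag" ]; then
            echo "$2"
            return 0
        fi
        shift
    done
    return 1   # flag not found, or it has no value after it
}

# Example using the command line shown above.
line='/var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f'
echo "timeout=$(oprocd_flag -t "$line") margin=$(oprocd_flag -m "$line")"
```

On a live node the line could come from something like ps -eo args | grep '[o]procd.bin' instead of the hard-coded string.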

To fix this situation, Oracle recommends setting the DIAGWAIT parameter of the clusterware to 13.

# crsctl set css diagwait 13 -force

Oracle recommends setting this parameter during a cluster-wide shutdown.

This is how the oprocd process looks after you set the value of the DIAGWAIT parameter:

root 29918 29305 0 Jun22 ? 00:00:47 /opt/oracle/product/10.1.0/crs_1/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

We can see the new value for the -m parameter, and since then the cluster has remained stable.

Hope it helps

Carlos Acosta


One Response to RAC nodes reboot themselves randomly after upgrade to 10.2.0.4

  1. sravya says:

    very good note… It helped me understand the Oprocd process.. Thanks
