I have performed the upgrade of an Oracle clustered database (RAC) from 10.2.0.1 to 10.2.0.4.
After following the procedure, and aside from a couple of failures during the catalog upgrade, everything went fine. The upgrade was performed overnight on a Friday, but on Sunday one of the nodes rebooted itself. The database and clusterware logs were clean, so I had no clue about what might be causing the problem.
While reading a note about troubleshooting the clusterware, I came across a process called “Oracle Process Monitor Daemon (OPROCD)”
whose log file is hostname.oprocd.log.
Notice that the log files live outside CRS_HOME; to be more precise, on my RedHat system they are in this directory: /etc/oracle/oprocd
Let's take a look:
Jun 15 03:01:15.983 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
Jun 15 03:01:15.983 | INF | switching to the legacy socket mode
Jun 15 03:01:15.992 | INF | fatal mode startup, setting process to fatal mode
The last line means that the oprocd process decided to reboot the server. But what is the oprocd process? The Process Monitor Daemon (OPROCD) checks for hardware or OS freezes: it wakes up every 1000 milliseconds, and if a wakeup is delayed by more than the 500 millisecond margin it reboots the server, to prevent any pending I/O operations from being reissued to the shared cluster disks.
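The timing check oprocd performs can be pictured with a small bash sketch. The real daemon is a compiled binary running at high priority, not a shell script, so the variable names (INTERVAL_MS, MARGIN_MS) and the structure here are purely illustrative:

```shell
# Illustrative sketch of oprocd's hang check (NOT the real implementation):
# ask to sleep for the interval, then see how late the wakeup actually was.
INTERVAL_MS=1000   # -t: time between wakeups
MARGIN_MS=500      # -m: acceptable scheduling delay

now_ms() { date +%s%3N; }   # GNU date: current epoch time in milliseconds

before=$(now_ms)
sleep 1                     # request a sleep of exactly INTERVAL_MS (1 second)
after=$(now_ms)

# drift = how much later than requested the wakeup actually happened
drift=$(( after - before - INTERVAL_MS ))

if [ "$drift" -gt "$MARGIN_MS" ]; then
    echo "hang detected: woke up ${drift} ms late -> oprocd would reboot the node"
else
    echo "OK: wakeup drift was ${drift} ms"
fi
```

On a healthy machine the drift is a few milliseconds at most; only when the kernel is frozen or the scheduler is starved for longer than the margin does the real daemon conclude the node is hung.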
[oracle@myserver]$ ps -ef |grep oprocd|grep -v grep
root 29305 28583 0 Jun22 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root 29918 29305 0 Jun22 ? 00:00:47 /var/opt/oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f
Here we can see the parameters -t 1000 and -m 500:
-t is the timeout value, i.e. the time between wakeups, in milliseconds
-m is the acceptable margin (extra delay) before rebooting, also in milliseconds
To fix this situation, Oracle recommends setting the DIAGWAIT parameter on the clusterware to 13 (seconds):
# crsctl set css diagwait 13 -force
Oracle recommends setting this parameter during a cluster-wide shutdown, i.e. with the clusterware stopped on all nodes.
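Under that constraint, the overall sequence looks roughly like this. Run the commands as root (crsctl lives in CRS_HOME/bin), and treat this as an outline rather than a verified runbook; check the Oracle support note for your exact version before applying it:

```shell
# 1. Stop the clusterware on ALL nodes (repeat on every node, as root):
crsctl stop crs

# 2. From one node, set diagwait to 13 seconds:
crsctl set css diagwait 13 -force

# 3. Verify the new value:
crsctl get css diagwait

# 4. Restart the clusterware on all nodes:
crsctl start crs
```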
This is how the oprocd process looks after you set the value for the DIAGWAIT parameter:
root 29918 29305 0 Jun22 ? 00:00:47 /opt/oracle/product/10.1.0/crs_1/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
We can see the new value for the -m parameter (10000 ms), and since the change the cluster has remained stable.
Hope it helps