Archive for February, 2010

linux oracle rac oprocd reboot

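Sunday, February 7th, 2010
Recently a customer upgraded to 10.2.0.4 plus CRS PSU2 and DB PSU3. After the upgrade the system rebooted at irregular intervals, and every reboot coincided with large file operations. According to the system state recorded by OSW, free memory was low at the time of some reboots, yet at other reboots plenty of memory was still available. Each time, the system received the message SysRq:reseting and then rebooted. We tested many scenarios along the way, including rolling back PSU2 and running stress tests, and the result was always the same.
Compared with 10.2.0.3, 10.2.0.4 adds an oprocd process for monitoring state between nodes, particularly on the Linux platform. It checks node health and reboots the system when any of the following occurs: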
A problem detected by the OPROCD process. This can be caused by 4 things:
 
1) An OS scheduler problem.
2) The OS is getting locked up in a driver or hardware.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.
 
If the OPROCD process caused the reboot, the SysRq:reseting message will appear, and the oprocd logs can be found in:
/etc/oracle/oprocd or /var/opt/oracle/oprocd
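By default oprocd is started as: oprocd run -t 1000 -m 500, where both values are in milliseconds. The default allowed delay is 1 s: if oprocd gets no response within 1 s, it reboots the system 0.5 s after that. In other words, once OPROCD detects a problem, it tolerates a 1 s delay plus a further 0.5 s margin, and then calls a script to reboot the system. Every reboot we hit showed SysRq:reseting, so the system was being rebooted by the OPROCD process; we reproduced this many times.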
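
To make the timing concrete, here is a minimal Python sketch of the rule as described above. It is a simplified model and not oprocd's actual implementation; the function name oprocd_would_reboot and the sample delays are hypothetical, and only the 1000 ms and 500 ms defaults come from the command line shown.

```python
def oprocd_would_reboot(observed_delay_ms, interval_ms=1000, margin_ms=500):
    """Simplified model: a reboot is triggered only when the observed
    delay exceeds the allowed interval (-t) plus the margin (-m)."""
    return observed_delay_ms > interval_ms + margin_ms


if __name__ == "__main__":
    # Hypothetical delays, in milliseconds, to compare against the 1500 ms limit.
    for delay in (1000, 1400, 1600):
        print(delay, oprocd_would_reboot(delay))
```

With the defaults, a stall has to last longer than 1.5 s in total before this model reports a reboot, which is a very small window on a machine busy with heavy I/O.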
Before 10.2.0.4, Linux relied on the hangcheck timer module for this detection. The default delays are:
    * 9i: Assuming the default setting of "oracm misscount" is set to 220 seconds:
      hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
    * 10g/11g: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
      hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
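
The settings above can be sanity-checked with a small Python sketch. It assumes the usual guideline that hangcheck_tick + hangcheck_margin should stay below the misscount value; that rule is an assumption here and is not stated in this post. The function names are hypothetical.

```python
def hangcheck_window(tick, margin):
    """Worst-case hang, in seconds, tolerated before hangcheck-timer acts."""
    return tick + margin


def fits_misscount(tick, margin, misscount):
    """Assumed rule: the hangcheck window must be shorter than misscount."""
    return hangcheck_window(tick, margin) < misscount


if __name__ == "__main__":
    # (release, hangcheck_tick, hangcheck_margin, misscount) from the list above.
    settings = [
        ("9i", 30, 180, 220),
        ("10g/11g", 1, 10, 30),
        ("10g/11g", 1, 10, 60),
    ]
    for release, tick, margin, misscount in settings:
        print(release, hangcheck_window(tick, margin), misscount,
              fits_misscount(tick, margin, misscount))
```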
The following processes can also reboot the system:
 
1 The ocssd process
 
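When the ocssd process causes a node reboot, a message like "Rebooted for Cluster Integrity" appears in the system log (for example /var/log/messages on Linux or syslog on HP-UX), and the CRS log contains records like those shown further below. System log locations by platform: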
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out
 
 
- Network failure or latency between nodes. It would take at least 30 consecutive
missed checkins to cause a reboot, where heartbeats are issued once per second.
 
Example of missed checkins in the CSS log:
 
WARNING: clssnmPollingThread: node <node> (1) at 50% heartbeat fatal, eviction in 29.100 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 14.960 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 13.950 seconds
 
The first thing to do is find out if the missed checkins ARE the problem or are a
result of the node going down due to other reasons. Check the messages file to see
what exact time the node went down and compare it to the time of the missed checkins.
 
- If the messages file reboot time < missed checkin time then the node eviction was
likely not due to these missed checkins.
 
- If the messages file reboot time > missed checkin time then the node eviction was
likely a result of the missed checkins.
 
 
- Problems writing to or reading from the CSS voting disk.
 
Example of a voting disk problem in the CSS log:
 
ERROR: clssnmDiskPingMonitorThread: voting device access hanging (160008 miliseconds)
 
- Lack of CPU resources. There are some situations which will appear to be missed
heartbeat issues, however turn out to be caused by a user running a high
sustained load average. When a machine gets too heavily loaded, the scheduling
reliability can be bad. This could cause CSS to not get scheduled in time and
thus CSS cannot get its work done. If this happens, the node is declared
not-viable for cluster work and is evicted.
 
- A problem with the executables (for example, removing CRS Home files)
 
- Misconfiguration of CRS. Possible misconfigurations:
 
- Wrong network selected as the private network for CRS (confirm with CSS log,
/etc/hosts, and ifconfig output). Make sure it is not the public or VIP
address. Look in the CSS log for strings like...
clsc_listen: (*) Listening on
(ADDRESS=(PROTOCOL=tcp)(HOST=dlsun2046)(PORT=61196))
 
- Putting the CSS vote file on a Netapp that's shared over some kind of public
network or otherwise excessively loaded/unreliable network. If this is the
case, you are likely to see the following message in the CSS logfile:
 
ERROR: clssnmDiskPingThread(): Large disk IO timeout * seconds.
 
If you ever see this error, then it's important to investigate why the disk
subsystem is unresponsive.
 
See section 3.2 for information on how to correct common misconfiguration
problems.
 
- Killing the "init.cssd fatal" process or "ocssd" process.
 
- An unexpected failure of the OCSSD process, this can be caused by any of the
above issues.
 
- An Oracle bug. Known bugs that can cause CSS reboots:
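
As an aside on the first cause in this list (missed checkins), the comparison between the reboot time and the missed checkin time can be sketched in Python. This is only an illustration: the regular expression is written against the three sample WARNING lines quoted earlier, the function names are hypothetical, and the timestamps in the usage part are made up, since the sample lines carry none.

```python
import re
from datetime import datetime

# Pattern written against the sample clssnmPollingThread lines quoted above.
MISSED_CHECKIN = re.compile(
    r"clssnmPollingThread: node (?P<node>\S+) \((?P<num>\d+)\) at (?P<pct>\d+)% "
    r"heartbeat fatal, eviction in (?P<secs>[\d.]+) seconds"
)


def parse_missed_checkin(line):
    """Return the fields of a missed-checkin warning, or None if no match."""
    match = MISSED_CHECKIN.search(line)
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "node_number": int(match.group("num")),
        "percent": int(match.group("pct")),
        "eviction_in": float(match.group("secs")),
    }


def eviction_likely_from_checkins(reboot_time, missed_checkin_time):
    """Rule from the text: a reboot after the missed checkins points at them."""
    return reboot_time > missed_checkin_time


if __name__ == "__main__":
    sample = ("WARNING: clssnmPollingThread: node <node> (1) at 50% "
              "heartbeat fatal, eviction in 29.100 seconds")
    print(parse_missed_checkin(sample))
    # Hypothetical timestamps read from the messages file and the CSS log.
    reboot = datetime(2010, 2, 7, 10, 0, 30)
    checkin = datetime(2010, 2, 7, 10, 0, 0)
    print(eviction_likely_from_checkins(reboot, checkin))
```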
 
 
3 The third process is oclsomon
A problem detected by the OCLSOMON process. This can be caused by 4 things:
 
1) A thread(s) within the CSS daemon hung.
2) An OS scheduler problem.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.
 
For more detailed information, see the following documents:
265769.1 726833.1 395878.1