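A customer reported that they could no longer connect to the database. When I logged in to the hosts, I found that the listener services on both nodes were down, and that the VIPs had swapped: node 1's VIP had failed over to node 2, and node 2's VIP had failed over to node 1. crsd.log and vip.log showed the following: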
2009-05-09 16:40:18.973: [ CRSAPP][270962] CheckResource error for ora.xxzbdb1.vip error code = 1
2009-05-09 16:40:18.979: [ CRSRES][270962] In stateChanged, ora.xxzbdb1.vip target is ONLINE
2009-05-09 16:40:18.979: [ CRSRES][270962] ora.xxzbdb1.vip on xxzbdb1 went OFFLINE unexpectedly
2009-05-09 16:40:18.980: [ CRSRES][270962] StopResource: setting CLI values
2009-05-09 16:40:18.983: [ CRSRES][270962] Attempting to stop `ora.xxzbdb1.vip` on member `xxzbdb1`
2009-05-09 16:40:19.696: [ CRSRES][270962] Stop of `ora.xxzbdb1.vip` on member `xxzbdb1` succeeded.
2009-05-09 16:40:19.697: [ CRSRES][270962] ora.xxzbdb1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2009-05-09 16:40:19.701: [ CRSRES][270962] ora.xxzbdb1.vip failed on xxzbdb1 relocating.
2009-05-09 16:40:19.737: [ CRSRES][270962] StopResource: setting CLI values
2009-05-09 16:40:19.740: [ CRSRES][270962] Attempting to stop `ora.xxzbdb1.LISTENER_xxZBDB1.lsnr` on member `xxzbdb1`
2009-05-09 16:41:34.664: [ CRSRES][270982] startRunnable: setting CLI values
2009-05-09 16:41:37.710: [ CRSRES][270962] Stop of `ora.xxzbdb1.LISTENER_xxZBDB1.lsnr` on member `xxzbdb1` succeeded.
2009-05-09 16:41:37.751: [ CRSRES][270962] Attempting to start `ora.xxzbdb1.vip` on member `xxzbdb2`
2009-05-09 16:41:44.752: [ CRSRES][270962] Start of `ora.xxzbdb1.vip` on member `xxzbdb2` succeeded.
2009-05-12 16:02:53.134: [ CRSRES][281319] xxzbdb1 : CRS-1018: Resource ora.xxzbdb1.vip (application) is already running on xxzbdb2
xxzbdb2 : CRS-1019: Resource ora.xxzbdb1.LISTENER_xxZBDB1.lsnr (application) cannot run on xxzbdb2
2009-05-09 16:40:18.952: [ RACG][1] [6140][1][ora.xazbdb1.vip]: Interface lan900 checked failed (host=xazbdb1)
Invalid parameters, or failed to bring up VIP (host=xazbdb1)
2009-05-09 16:40:18.961: [ RACG][1] [6140][1][ora.xazbdb1.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/app/oracle/product/10.2.0/crs_1
2009-05-09 16:40:18.961: [ RACG][1] [6140][1][ora.xazbdb1.vip]: clsrcexecut: cmd = /u01/app/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 54 /u01/app/oracle/product/10.2.0/crs_1/bin/racgvip check xazbdb1
2009-05-09 16:40:18.963: [ RACG][1] [6140][1][ora.xazbdb1.vip]: clsrcexecut: rc = 1, time = 8.694s
2009-05-09 16:40:18.963: [ RACG][1] [6140][1][ora.xazbdb1.vip]: end for resource = ora.xazbdb1.vip, action = check, status = 1, time = 8.775s
2009-05-12 16:54:33.072: [ RACG][1] [26145][1][ora.xazbdb1.vip]: clsrcstartorp: Error with malloc
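In a long VIP log, the "Interface ... checked failed" entries like the one above are easy to miss. Here is a minimal Python sketch for picking them out. It assumes the log lines follow the format of the excerpt above; the function name and the regular expression are my own illustration, inferred from this excerpt, and are not part of any Oracle tooling.

```python
import re

# Matches racgvip check failures such as:
#   ... [ora.xazbdb1.vip]: Interface lan900 checked failed (host=xazbdb1)
# The pattern is inferred from the log excerpt, not from an Oracle specification.
FAILED_CHECK = re.compile(
    r"\[(?P<resource>ora\.[^\]]+\.vip)\]: "
    r"Interface (?P<iface>\S+) checked failed \(host=(?P<host>[^)]+)\)"
)


def find_interface_failures(lines):
    """Return (timestamp, resource, interface, host) for each failed interface check."""
    failures = []
    for line in lines:
        match = FAILED_CHECK.search(line)
        if match:
            # The timestamp is the text before the first ": [" separator.
            timestamp = line.split(": [", 1)[0].strip()
            failures.append(
                (timestamp, match["resource"], match["iface"], match["host"])
            )
    return failures
```

Feeding it the lines of the VIP log (for example via `open(path).readlines()`) gives a short list of which interface failed on which host and when, which is usually enough to decide whether to go and ask the network team.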
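The logs show that the VIP failed over because of a problem with network interface lan900, so the root cause was the network. I asked the customer, and they confirmed that the network had indeed been touched on the afternoon of the 9th. The fix was simple: a restart brought everything back. In the database alert.log I also found the following: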
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='xazb2';
Sat May 9 16:41:01 2009
Immediate Kill Session#: 739, Serial#: 954
Immediate Kill Session: sess: c00000049f54ef78 OS pid: 29108
Immediate Kill Session#: 740, Serial#: 2717
Immediate Kill Session: sess: c00000049f5504e0 OS pid: 29192
Immediate Kill Session#: 742, Serial#: 89
Immediate Kill Session: sess: c00000049f552fb0 OS pid: 29055
Immediate Kill Session#: 744, Serial#: 23
Immediate Kill Session: sess: c00000049f555a80 OS pid: 29127
Immediate Kill Session#: 745, Serial#: 25
Immediate Kill Session: sess: c00000049f556fe8 OS pid: 29173
Immediate Kill Session#: 746, Serial#: 22
Immediate Kill Session: sess: c00000049f558550 OS pid: 29093
Immediate Kill Session#: 748, Serial#: 28
Immediate Kill Session: sess: c00000049f55b020 OS pid: 29118
Immediate Kill Session#: 749, Serial#: 21
Immediate Kill Session: sess: c00000049f55c588 OS pid: 29249
Immediate Kill Session#: 750, Serial#: 50186
Immediate Kill Session: sess: c00000049f55daf0 OS pid: 29281
Immediate Kill Session#: 752, Serial#: 16380
Immediate Kill Session: sess: c00000049f5605c0 OS pid: 29264
Immediate Kill Session#: 753, Serial#: 58573
Immediate Kill Session: sess: c00000049f561b28 OS pid: 29104
...
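To get a rough idea of how many sessions were disconnected, the "Immediate Kill Session#" entries can be counted. This is a minimal Python sketch, assuming the alert log lines look like the excerpt above; the function name is hypothetical and the pattern is inferred from this excerpt only.

```python
import re

# Matches entries such as: "Immediate Kill Session#: 739, Serial#: 954".
# The companion "Immediate Kill Session: sess: ..." lines are deliberately skipped,
# so each killed session is counted once.
KILL_LINE = re.compile(
    r"Immediate Kill Session#: (?P<sid>\d+), Serial#: (?P<serial>\d+)"
)


def killed_sessions(lines):
    """Return a list of (sid, serial) pairs, one per killed session."""
    sessions = []
    for line in lines:
        match = KILL_LINE.search(line)
        if match:
            sessions.append((int(match["sid"]), int(match["serial"])))
    return sessions
```

`len(killed_sessions(lines))` then gives the number of sessions killed in the part of the alert log that was passed in.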
A search on MetaLink showed that this is a bug, described in note 730315.1:
Cause
This is caused by unpublished Bug 6955040 ALL THE SESSIONS LOST CONNECTION AFTER KILLING CRSD.BIN.
The problem is when CRSD is killed or crashed and restarted, CRSD will run resource check action but CRS resource status will not be available at that time. Then in instance check action, it fails to get the preferred node VIP resource status and considered the preferred node VIP resource is not running. Therefore, instance check action will remove the default database service name and disconnect sessions connected using default database service name.
This causes messages "ALTER SYSTEM" and "Immediate Kill Session" printed in alert log.
Solution
1) The fix is included in 10.2.0.5 patchset and 11.1.0.7 patchset.
Apply the patchset once they are available.
OR
2) Configure a service name other than the default one (same as db_name), and get user to use the non-default service name for connection
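For workaround 2, it helps to know which clients still connect through the default service name. The following Python sketch checks a connect descriptor for that. It is only an illustration under stated assumptions: the function name, host name, and service names are hypothetical, and the check compares `SERVICE_NAME` against `db_name` only, as the note describes the default service.

```python
import re

# Extracts the value of SERVICE_NAME from a TNS-style connect descriptor.
SERVICE_NAME = re.compile(r"SERVICE_NAME\s*=\s*([^)\s]+)", re.IGNORECASE)


def uses_default_service(connect_descriptor, db_name):
    """Return True if the descriptor's SERVICE_NAME equals db_name (case-insensitive)."""
    match = SERVICE_NAME.search(connect_descriptor)
    if match is None:
        # No SERVICE_NAME present (for example, a SID-based descriptor).
        return False
    return match.group(1).lower() == db_name.lower()


# Hypothetical descriptors: "mydb" stands in for db_name, "mydb_app" for a
# separately configured, non-default service.
default_desc = (
    "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=node1-vip)(PORT=1521))"
    "(CONNECT_DATA=(SERVICE_NAME=mydb)))"
)
app_desc = default_desc.replace("SERVICE_NAME=mydb", "SERVICE_NAME=mydb_app")
```

With these examples, `uses_default_service(default_desc, "mydb")` flags the first descriptor as affected, while the one using `mydb_app` is not.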
So it looks like the bug was triggered by the network problem, heh.