Archive for May 13th, 2009

The Dreaded BUG 5987262

Wednesday, May 13th, 2009

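One customer ran into this a while ago, and now another customer has hit it, so the odds are apparently not that small. A tablespace suddenly grows explosively: within just a few minutes it uses up all of its free space. Dumping the affected data blocks shows that they are all free blocks. In both cases the customer's OS was HP-UX IA64. A search on MetaLink confirmed this as BUG 5987262, base BUG 5890312.
It can occur on any platform from 9.2.0.8 through 10.2.0.4. The current fix is to apply patch p5890312. As a contingency, monitor tablespace usage more frequently and keep data files ready to add in an emergency.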
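
As a rough illustration of the contingency monitoring mentioned above, here is a minimal Python sketch that flags a tablespace whose used space grows abnormally fast between two samples. The thresholds, the function name `growth_alerts` and the sample layout are all assumptions for illustration; the samples themselves would have to be collected separately, for example from views such as DBA_DATA_FILES and DBA_FREE_SPACE.

```python
from datetime import datetime, timedelta


def growth_alerts(samples, max_mb_per_min=100.0, min_free_mb=1024.0):
    """Return (timestamp, growth_mb_per_min, free_mb) for suspicious samples.

    samples: list of (timestamp, used_mb, free_mb) tuples ordered by time.
    Both thresholds are hypothetical defaults, not recommended values.
    """
    alerts = []
    # Compare each sample with the one taken just before it.
    for (t0, used0, _), (t1, used1, free1) in zip(samples, samples[1:]):
        minutes = (t1 - t0).total_seconds() / 60.0
        if minutes <= 0:
            continue  # skip duplicate or out-of-order samples
        rate = (used1 - used0) / minutes
        # Alert on explosive growth or on dangerously little free space.
        if rate > max_mb_per_min or free1 < min_free_mb:
            alerts.append((t1, round(rate, 1), free1))
    return alerts
```

Sampling every minute rather than every hour matters here, because in the incidents described the free space was gone within a few minutes.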

unpublished Bug 6955040

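Wednesday, May 13th, 2009
A customer reported that they could not connect to the database. Logging on to the hosts, I found that the listeners on both nodes were down, that node 1's VIP had failed over to node 2, and that node 2's VIP had failed over to node 1. The following messages appeared in crsd.log and vip.log: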
2009-05-09 16:40:18.973: [  CRSAPP][270962] CheckResource error for ora.xxzbdb1.vip error code = 1
2009-05-09 16:40:18.979: [  CRSRES][270962] In stateChanged, ora.xxzbdb1.vip target is ONLINE
2009-05-09 16:40:18.979: [  CRSRES][270962] ora.xxzbdb1.vip on xxzbdb1 went OFFLINE unexpectedly
2009-05-09 16:40:18.980: [  CRSRES][270962] StopResource: setting CLI values
2009-05-09 16:40:18.983: [  CRSRES][270962] Attempting to stop `ora.xxzbdb1.vip` on member `xxzbdb1`
2009-05-09 16:40:19.696: [  CRSRES][270962] Stop of `ora.xxzbdb1.vip` on member `xxzbdb1` succeeded.
2009-05-09 16:40:19.697: [  CRSRES][270962] ora.xxzbdb1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2009-05-09 16:40:19.701: [  CRSRES][270962] ora.xxzbdb1.vip failed on xxzbdb1 relocating.
2009-05-09 16:40:19.737: [  CRSRES][270962] StopResource: setting CLI values
2009-05-09 16:40:19.740: [  CRSRES][270962] Attempting to stop `ora.xxzbdb1.LISTENER_xxZBDB1.lsnr` on member `xxzbdb1`
2009-05-09 16:41:34.664: [  CRSRES][270982] startRunnable: setting CLI values
2009-05-09 16:41:37.710: [  CRSRES][270962] Stop of `ora.xxzbdb1.LISTENER_xxZBDB1.lsnr` on member `xxzbdb1` succeeded.
2009-05-09 16:41:37.751: [  CRSRES][270962] Attempting to start `ora.xxzbdb1.vip` on member `xxzbdb2`
2009-05-09 16:41:44.752: [  CRSRES][270962] Start of `ora.xxzbdb1.vip` on member `xxzbdb2` succeeded.
2009-05-12 16:02:53.134: [  CRSRES][281319] xxzbdb1 : CRS-1018: Resource ora.xxzbdb1.vip (application) is already running on xxzbdb2
xxzbdb2 : CRS-1019: Resource ora.xxzbdb1.LISTENER_xxZBDB1.lsnr (application) cannot run on xxzbdb2
 
2009-05-09 16:40:18.952: [    RACG][1] [6140][1][ora.xazbdb1.vip]: Interface lan900 checked failed (host=xazbdb1)
Invalid parameters, or failed to bring up VIP (host=xazbdb1)
 
2009-05-09 16:40:18.961: [    RACG][1] [6140][1][ora.xazbdb1.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/app/oracle/product/10.2.0/crs_1
 
2009-05-09 16:40:18.961: [    RACG][1] [6140][1][ora.xazbdb1.vip]: clsrcexecut: cmd = /u01/app/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 54 /u01/app/oracle/product/10.2.0/crs_1/bin/racgvip check xazbdb1
 
2009-05-09 16:40:18.963: [    RACG][1] [6140][1][ora.xazbdb1.vip]: clsrcexecut: rc = 1, time = 8.694s
 
2009-05-09 16:40:18.963: [    RACG][1] [6140][1][ora.xazbdb1.vip]: end for resource = ora.xazbdb1.vip, action = check, status = 1, time = 8.775s
 
2009-05-12 16:54:33.072: [    RACG][1] [26145][1][ora.xazbdb1.vip]: clsrcstartorp: Error with malloc
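
Log excerpts like the ones above can be scanned mechanically for the two telling messages: a resource that "went OFFLINE unexpectedly" and an interface whose check failed. The following is a minimal Python sketch; the regular expressions are derived only from the lines shown in this post and are an assumption about the general log format, and `scan_crs_log` is a hypothetical helper name.

```python
import re

# Matches e.g. "... [  CRSRES][270962] ora.xxzbdb1.vip on xxzbdb1 went OFFLINE unexpectedly"
OFFLINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): \[\s*CRSRES\]\[\d+\]\s*"
    r"(?P<res>\S+) on (?P<node>\S+) went OFFLINE unexpectedly"
)
# Matches e.g. "Interface lan900 checked failed (host=xazbdb1)"
IFACE_RE = re.compile(
    r"Interface (?P<iface>\S+) checked failed \(host=(?P<host>[^)]+)\)"
)


def scan_crs_log(lines):
    """Collect unexpected-offline and interface-failure events from log lines."""
    events = []
    for line in lines:
        m = OFFLINE_RE.search(line)
        if m:
            events.append(("offline", m.group("ts"), m.group("res"), m.group("node")))
            continue
        m = IFACE_RE.search(line)
        if m:
            events.append(("iface_failed", m.group("iface"), m.group("host")))
    return events
```

Called as `scan_crs_log(open(path))` on each log file, it returns the events in the order they appear, which makes it easier to line up the interface failure with the VIP going offline.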
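As the logs show, the VIP failed over because of a problem with network interface lan900, which means the network was the cause. When asked, the customer confirmed that the network had indeed been worked on during the afternoon of the 9th. The fix was simple: a restart was enough. The database alert.log contained the following messages: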
 
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='xazb2';
Sat May  9 16:41:01 2009
Immediate Kill Session#: 739, Serial#: 954
Immediate Kill Session: sess: c00000049f54ef78  OS pid: 29108
Immediate Kill Session#: 740, Serial#: 2717
Immediate Kill Session: sess: c00000049f5504e0  OS pid: 29192
Immediate Kill Session#: 742, Serial#: 89
Immediate Kill Session: sess: c00000049f552fb0  OS pid: 29055
Immediate Kill Session#: 744, Serial#: 23
Immediate Kill Session: sess: c00000049f555a80  OS pid: 29127
Immediate Kill Session#: 745, Serial#: 25
Immediate Kill Session: sess: c00000049f556fe8  OS pid: 29173
Immediate Kill Session#: 746, Serial#: 22
Immediate Kill Session: sess: c00000049f558550  OS pid: 29093
Immediate Kill Session#: 748, Serial#: 28
Immediate Kill Session: sess: c00000049f55b020  OS pid: 29118
Immediate Kill Session#: 749, Serial#: 21
Immediate Kill Session: sess: c00000049f55c588  OS pid: 29249
Immediate Kill Session#: 750, Serial#: 50186
Immediate Kill Session: sess: c00000049f55daf0  OS pid: 29281
Immediate Kill Session#: 752, Serial#: 16380
Immediate Kill Session: sess: c00000049f5605c0  OS pid: 29264
Immediate Kill Session#: 753, Serial#: 58573
Immediate Kill Session: sess: c00000049f561b28  OS pid: 29104
......
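
To gauge how many sessions were affected, an alert log like the excerpt above can be summarized with a short script. This is a minimal Python sketch based only on the message formats visible in this post; `summarize_alert_log` is a hypothetical helper, not an Oracle tool.

```python
import re

# "Immediate Kill Session#: 739, Serial#: 954"
KILL_RE = re.compile(r"Immediate Kill Session#: (\d+), Serial#: (\d+)")
# "ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='xazb2';"
SERVICE_RE = re.compile(r"ALTER SYSTEM SET service_names='([^']*)'", re.IGNORECASE)


def summarize_alert_log(lines):
    """Report whether service_names was cleared and which sessions were killed."""
    killed = []
    cleared = False
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            killed.append((int(m.group(1)), int(m.group(2))))
        m = SERVICE_RE.search(line)
        if m and m.group(1) == "":
            cleared = True  # service_names was set to an empty string
    return {"service_names_cleared": cleared, "killed_sessions": killed}
```

An empty `service_names` followed by a burst of killed sessions is the pattern seen in this incident.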
 
A search on MetaLink shows this is a bug, document ID 730315.1:
Cause
This is caused by unpublished Bug 6955040 ALL THE SESSIONS LOST CONNECTION AFTER KILLING CRSD.BIN.
 
The problem is when CRSD is killed or crashed and restarted, CRSD will run resource check action but CRS resource status will not be available at that time. Then in instance check action, it fails to get the preferred node VIP resource status and considered the preferred node VIP resource is not running. Therefore, instance check action will remove the default database service name and disconnect sessions connected using default database service name.
 
This causes messages "ALTER SYSTEM" and "Immediate Kill Session" printed in alert log.
 
 
Solution
1) The fix is included in 10.2.0.5 patchset and 11.1.0.7 patchset.
    
Apply the patchset once they are available.
 
OR
 
2) Configure a service name other than the default one (same as db_name), and get user to use the non-default service name for connection
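
For workaround 2, clients need a connect descriptor that names the non-default service. The following is a minimal Python sketch that builds such a descriptor and refuses a service name equal to db_name. The host names, the service name, the port 1521 and the helper `build_connect_descriptor` are all assumptions for illustration; the service itself still has to be created on the cluster first.

```python
def build_connect_descriptor(vip_hosts, service_name, db_name, port=1521):
    """Build an Oracle Net connect descriptor for a non-default service.

    Raises ValueError if service_name equals db_name, since the default
    service is the one removed when the bug described above is hit.
    """
    if service_name.lower() == db_name.lower():
        raise ValueError("service name must differ from db_name")
    # One ADDRESS entry per node VIP (hypothetical host names).
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in vip_hosts
    )
    return (
        f"(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=yes){addresses})"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )
```

For example, `build_connect_descriptor(["node1-vip", "node2-vip"], "app_svc", "xazb")` yields a descriptor whose CONNECT_DATA names `app_svc` rather than the default service.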
 
So it looks like the network problem is what triggered it, heh.