impdp and optimizer_mode rule

2010.03.12 11:28 PM » Author: bosonmaster »
For application reasons, our company's tooling has always used imp/exp. Recently, at a customer site, an impdp import failed with the following errors:
UDI-00008: operation generated ORACLE error 31626
ORA-31626: job does not exist
ORA-06512: at "SYS.KUPC$QUE_INT", line 536
ORA-25254: time-out in LISTEN while waiting for a message
 
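At first I suspected the master table, but dropping it and re-running the import still failed. In the end I confirmed that the cause was the database optimizer mode: because of our application, the database runs with optimizer_mode set to RULE. After switching to any mode other than RULE, the import ran normally. Document 577562.1 describes the following: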
In a RAC database a full Data Pump export fails with:
 
ORA-39097: Data Pump job encountered unexpected error -1422
ORA-39065: unexpected master process exception in DISPATCH
ORA-01422: exact fetch returns more than requested number of rows
This similar error has the same cause.
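As a rough triage aid, the two error signatures quoted in this post can be checked for in a Data Pump log. The following is only a sketch: the function name and the grouping of codes are my own, and a match merely suggests that optimizer_mode is worth checking; it does not prove the cause.

```python
import re

# Error-code combinations taken from the two failures quoted in this post.
# Grouping them this way is an assumption made for illustration only.
SIGNATURES = [
    {"ORA-31626", "ORA-25254"},  # impdp: job does not exist / LISTEN time-out
    {"ORA-39097", "ORA-01422"},  # full export: unexpected error -1422
]


def matches_reported_symptom(log_text):
    """Return True if the log contains every code of at least one signature."""
    codes = set(re.findall(r"ORA-\d{5}", log_text))
    return any(signature <= codes for signature in SIGNATURES)


IMPDP_LOG = """\
UDI-00008: operation generated ORACLE error 31626
ORA-31626: job does not exist
ORA-06512: at "SYS.KUPC$QUE_INT", line 536
ORA-25254: time-out in LISTEN while waiting for a message
"""

print(matches_reported_symptom(IMPDP_LOG))
```

If this returns True, the next step is to look at the optimizer_mode parameter of the instance before digging into the master table.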

linux ipcs max sharememory

2010.03.09 2:29 PM » Author: bosonmaster »
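A colleague recently ran into a shared memory segment problem on Linux. We normally set kernel.shmmax, the maximum shared memory segment size the system supports, according to the amount of physical memory. Those settings checked out, and the OS version was fine too. In the end, after a hint from Lao Xiong (老熊), we confirmed it was a NUMA issue.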
ipcs -m
 
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status     
0x00000000 98304      gdm       600        393216     2          dest         
0x00000000 1310721    oracle    640        1543503872 31                     
0x00000000 1343490    oracle    640        2835349504 31                     
0x00000000 1376259    oracle    640        2835349504 31                     
0x00000000 1409028    oracle    640        2852126720 31                     
0x00000000 1441797    oracle    640        2835349504 31                     
0x1714b88c 1474566    oracle    640        2097152    31   
Setting the following parameter solved the problem:
alter system set "_enable_numa_optimization"=false scope=spfile;
For more on the NUMA issue, see document 759565.1.
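To see at a glance how many segments the oracle user holds and how large they are in total, the ipcs -m listing can be summarized with a few lines of Python. This is a minimal sketch that works on the text pasted above; the function name is made up for illustration, and it assumes the column layout shown in this post.

```python
IPCS_OUTPUT = """\
key        shmid      owner      perms      bytes      nattch     status
0x00000000 98304      gdm       600        393216     2          dest
0x00000000 1310721    oracle    640        1543503872 31
0x00000000 1343490    oracle    640        2835349504 31
0x00000000 1376259    oracle    640        2835349504 31
0x00000000 1409028    oracle    640        2852126720 31
0x00000000 1441797    oracle    640        2835349504 31
0x1714b88c 1474566    oracle    640        2097152    31
"""


def segments_for_owner(ipcs_text, owner="oracle"):
    """Return (shmid, bytes) pairs for every segment owned by `owner`."""
    segments = []
    for line in ipcs_text.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and separator rows do not.
        if len(fields) >= 6 and fields[0].startswith("0x") and fields[2] == owner:
            segments.append((int(fields[1]), int(fields[4])))
    return segments


segments = segments_for_owner(IPCS_OUTPUT)
total_bytes = sum(size for _, size in segments)
print(len(segments), "segments,", total_bytes, "bytes in total")
```

Several large segments for one instance, while kernel.shmmax is big enough for a single one, is the pattern that pointed us toward NUMA here.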

linux oracle rac oprocd reboot

2010.02.07 7:14 PM » Author: bosonmaster »
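A customer recently upgraded to 10.2.0.4 plus CRS PSU2 and DB PSU3. After the upgrade the system rebooted at irregular intervals. Large file operations were running every time a reboot happened. According to the system state recorded by OSW, free memory was fairly low at those moments, yet at some of the reboots there was still a reasonable amount of free memory. On every reboot the message the system received was SysRq : Resetting, and then the machine restarted. We tested many scenarios along the way, including uninstalling PSU2 and running stress tests, and the result was always the same.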
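Compared with 10.2.0.3, 10.2.0.4 adds an oprocd process for monitoring node state, on the Linux platform in particular. It checks the health of the node and reboots the system in the following situations: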
A problem detected by the OPROCD process. This can be caused by 4 things:
 
1) An OS scheduler problem.
2) The OS is getting locked up in a driver or hardware.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.
 
Also, if the OPROCD process caused the reboot, you will see the SysRq : Resetting message. The oprocd logs are in one of the following locations:
/etc/oracle/oprocd or /var/opt/oracle/oprocd
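By default oprocd is started as oprocd run -t 1000 -m 500, with both values in milliseconds. The default check interval is 1 s; if there is no response within that 1 s, the system is rebooted once a further 0.5 s margin has passed. In other words, once OPROCD detects a problem it tolerates a delay of 1 s plus the 0.5 s margin, and then calls a script to reboot the system. Every reboot we hit showed SysRq : Resetting, so it was the OPROCD process rebooting the system. We reproduced this many times.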
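The timing rule described above can be written down as simple arithmetic. This is only a sketch of the rule as I understand it from the -t and -m values; the function name is hypothetical and the real check inside oprocd is not visible to us.

```python
def oprocd_would_reboot(observed_delay_ms, interval_ms=1000, margin_ms=500):
    """Model of the rule above: a reboot is triggered when the observed
    delay exceeds the check interval plus the margin.

    interval_ms and margin_ms mirror the default `-t 1000 -m 500`.
    """
    return observed_delay_ms > interval_ms + margin_ms


for delay_ms in (1000, 1400, 1600):
    print(delay_ms, oprocd_would_reboot(delay_ms))
```

With the defaults, a machine that stalls for more than about 1.5 s is at risk, which is why heavy I/O or memory pressure can be enough to trigger a reboot.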
Before 10.2.0.4, Linux relied on the hangcheck-timer module for this detection. The default settings are:
    * 9i: Assuming the default setting of "oracm misscount" is set to 220 seconds:
      hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
    * 10g/11g: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
      hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
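The relationship between these values is that hangcheck_tick plus hangcheck_margin should stay below the misscount, so that hangcheck-timer resets a hung node before the cluster gives up on it. A small sketch of that check follows; the function names are my own.

```python
def max_hang_seconds(hangcheck_tick, hangcheck_margin):
    """Longest hang tolerated before hangcheck-timer resets the node."""
    return hangcheck_tick + hangcheck_margin


def fits_misscount(hangcheck_tick, hangcheck_margin, misscount):
    """True if the hangcheck window is shorter than the misscount."""
    return max_hang_seconds(hangcheck_tick, hangcheck_margin) < misscount


# Values quoted above for 9i and for 10g/11g.
print(fits_misscount(30, 180, misscount=220))
print(fits_misscount(1, 10, misscount=30))
print(fits_misscount(1, 10, misscount=60))
```

If the misscount has been changed from its default, the hangcheck values should be rechecked against the new figure.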
The following processes can also reboot the system:
 
1 The ocssd process
 
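When the ocssd process causes a node reboot, the system log contains an entry like "Rebooted for Cluster Integrity" (for example in /var/log/messages on Linux or syslog on HP-UX), and the CRS logs contain records like those shown further below. System log locations by platform: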
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out
 
 
- Network failure or latency between nodes. It would take at least 30 consecutive
missed checkins to cause a reboot, where heartbeats are issued once per second.
 
Example of missed checkins in the CSS log:
 
WARNING: clssnmPollingThread: node <node> (1) at 50% heartbeat fatal, eviction in 29.100 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 14.960 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 13.950 seconds
 
The first thing to do is find out if the missed checkins ARE the problem or are a
result of the node going down due to other reasons. Check the messages file to see
what exact time the node went down and compare it to the time of the missed checkins.
 
- If the messages file reboot time < missed checkin time then the node eviction was
likely not due to these missed checkins.

- If the messages file reboot time > missed checkin time then the node eviction was
likely a result of the missed checkins.
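The comparison in the two rules above is easy to get backwards when reading logs under pressure, so here it is as a small sketch. The timestamps are invented for illustration, and the function name is my own.

```python
from datetime import datetime


def eviction_likely_from_missed_checkins(reboot_time, missed_checkin_time):
    """Apply the rule above: if the reboot happened after the missed
    checkins, the eviction was likely a result of them."""
    return reboot_time > missed_checkin_time


# Hypothetical timestamps, as they might be read from the messages file
# and from the CSS log.
missed_checkin = datetime(2010, 2, 5, 14, 30, 10)
reboot_after = datetime(2010, 2, 5, 14, 30, 45)
reboot_before = datetime(2010, 2, 5, 14, 29, 50)

print(eviction_likely_from_missed_checkins(reboot_after, missed_checkin))
print(eviction_likely_from_missed_checkins(reboot_before, missed_checkin))
```

Make sure both timestamps are in the same time zone and that the node clocks are in sync before trusting the result.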
 
 
- Problems writing to or reading from the CSS voting disk.
 
Example of a voting disk problem in the CSS log:
 
ERROR: clssnmDiskPingMonitorThread: voting device access hanging (160008 miliseconds)
 
- Lack of CPU resources. There are some situations which will appear to be missed
heartbeat issues, however turn out to be caused by a user running a high
sustained load average. When a machine gets too heavily loaded, the scheduling
reliability can be bad. This could cause CSS to not get scheduled in time and
thus CSS cannot get its work done. If this happens, the node is declared
not-viable for cluster work and is evicted.

- A problem with the executables (for example, removing CRS Home files)

- Misconfiguration of CRS. Possible misconfigurations:
 
- Wrong network selected as the private network for CRS (confirm with CSS log,
/etc/hosts, and ifconfig output). Make sure it is not the public or VIP
address. Look in the CSS log for strings like...
clsc_listen: (*) Listening on
(ADDRESS=(PROTOCOL=tcp)(HOST=dlsun2046)(PORT=61196))

- Putting the CSS vote file on a Netapp that's shared over some kind of public
network or otherwise excessively loaded/unreliable network. If this is the
case, you are likely to see the following message in the CSS logfile:

ERROR: clssnmDiskPingThread(): Large disk IO timeout * seconds.

If you ever see this error, then it's important to investigate why the disk
subsystem is unresponsive.
 
See section 3.2 for information on how to correct common misconfiguration
problems.
 
- Killing the "init.cssd fatal" process or "ocssd" process.

- An unexpected failure of the OCSSD process, this can be caused by any of the
above issues.

- An Oracle bug. Known bugs that can cause CSS reboots:
 
 
3 The third process is oclsomon
A problem detected by the OCLSOMON process. This can be caused by 4 things:
 
1) A thread(s) within the CSS daemon hung.
2) An OS scheduler problem.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.
 
For more detailed information, see the following documents:
265769.1 726833.1 395878.1

vip gateway 10.2.0.3 10.2.0.4

2010.01.01 9:11 PM » Author: bosonmaster »
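After upgrading a database from 10.2.0.3 to 10.2.0.4, the VIP service simply would not start. I spent a long time checking and found nothing wrong, until I finally noticed that no default gateway was configured. This database was installed a long time ago, so the gateway was never set, and 10.2.0.3 ran without any problem. Once the default gateway was set, the VIP started. It looks like this is related to changes in the RACGVIP script between 10.2.0.3 and 10.2.0.4. As the saying goes, the places you assume cannot go wrong are exactly where things go wrong. First day of the new year, and my luck was not bad.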
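A quick way to check for a default gateway on Linux before starting the VIP is to look for a default route in the kernel routing table. The sketch below parses text in the /proc/net/route format; the sample tables are invented, and the function name is my own.

```python
import socket
import struct

RTF_GATEWAY = 0x0002  # route flag: the route goes through a gateway


def default_gateways(route_table_text):
    """Return (interface, gateway_ip) for each default route in the table."""
    gateways = []
    for line in route_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        iface, destination, gateway = fields[0], fields[1], fields[2]
        flags = int(fields[3], 16)
        if destination == "00000000" and flags & RTF_GATEWAY:
            # Addresses are stored as little-endian hexadecimal.
            ip = socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
            gateways.append((iface, ip))
    return gateways


# Invented sample tables in /proc/net/route layout.
WITH_GATEWAY = """\
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 0000A8C0 00000000 0001 0 0 0 00FFFFFF 0 0 0
eth0 00000000 0100A8C0 0003 0 0 0 00000000 0 0 0
"""
WITHOUT_GATEWAY = """\
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 0000A8C0 00000000 0001 0 0 0 00FFFFFF 0 0 0
"""

print(default_gateways(WITH_GATEWAY))
print(default_gateways(WITHOUT_GATEWAY))
```

On a real node the text can be read from /proc/net/route; an empty result corresponds to the situation described above, where the VIP refused to start on 10.2.0.4.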

vip &ipc

2010.01.01 9:08 PM » Author: bosonmaster »
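Today at a customer site I upgraded a database from 10.2.0.3 to 10.2.0.4. The upgrade itself went smoothly, but when we tested pulling the network cable, VIP failover was slow. A search on METALINK turned up the following note: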
Cause
This problem is caused by the first address in the listener.ora configuration being an address that uses the TCP protocol.
 
In this circumstance, when a network cable is pulled, "lsnrctl stop" listener has to wait for TCP timeout before it can check next address. On the Solaris platform, TCP timeout is defined by tcp_ip_abort_cinterval with a default value of 180000 (3 minutes). That is why shutting down listener took almost 3.5 minutes. (TCP timeout on other platforms may vary.) The error message "Solaris Error: 145: Connection timed out" in ora.node1.LISTENER_NODE1.lsnr.log also indicates it is waiting for tcp timeout.
 
The listener.ora in this scenario is defined as:
 
 
 
LISTENER_NODE1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
      )
    )
  )

Solution
To prevent this, move the IPC address to be the first address for the listener in the listener.ora, eg:
 
LISTENER_NODE1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.10.100)(PORT = 1521)(IP = FIRST))
      )
    )
  )
 
 
When lsnrctl tries to stop the listener, it will now connect to the IPC address first, which is available during that time. It will not have to wait for tcp timeout.
 
After the above change, the VIP failover only takes 48 to 50 seconds to complete regardless of the tcp_ip_abort_cinterval setting.
 
Please note, listener.ora files newly created from 10.2.0.3 to 11.1.0.7 should have the IPC protocol as the first address in listener.ora in most cases. However, if you have upgraded from a previous release, or manually modified/copied over a listener.ora from a previous install, you may not have the IPC protocol as the first address, regardless of your version. Manual modification is required to move IPC protocol to be the first address to avoid the problem described in this note.
 
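In other words, the IPC protocol address has to come first in the listener's address list. After making the change we tested again: failover time dropped from about 2 minutes to a little over 20 seconds, which meets the application's failover requirement.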
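Since upgraded or hand-copied listener.ora files may still list TCP first, it can be worth scripting the check. The following is a minimal sketch that assumes the file defines a single listener, as in the example from the note; the function names are my own.

```python
import re


def address_protocols(listener_ora_text):
    """Return the PROTOCOL values in the order they appear in the file."""
    found = re.findall(
        r"\(\s*PROTOCOL\s*=\s*(\w+)\s*\)", listener_ora_text, flags=re.IGNORECASE
    )
    return [protocol.upper() for protocol in found]


def ipc_is_first(listener_ora_text):
    """True if the first listed address uses the IPC protocol."""
    protocols = address_protocols(listener_ora_text)
    return bool(protocols) and protocols[0] == "IPC"


BEFORE = """\
LISTENER_NODE1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
      )
    )
  )
"""
AFTER = """\
LISTENER_NODE1 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
      )
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1vip)(PORT = 1521)(IP = FIRST))
      )
    )
  )
"""

print(ipc_is_first(BEFORE), ipc_is_first(AFTER))
```

A file with several listener definitions would need to be split per listener first, which this sketch does not attempt.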