淘先锋技术网


Anything that can go wrong will go wrong.

 

Murphy's law has been proven right once again: the database went down this morning. The trace log follows:

Dump continued from file: /opt/ora11g/diag/rdbms/prodb/ORABJ/trace/ORABJ_cjq0_4189.trc

ORA-00445: background process "J000" did not start after 120 seconds

 

*** 2013-07-10 09:13:24.852

*** SESSION ID:(618.5) 2013-07-10 09:13:24.852

*** CLIENT ID:() 2013-07-10 09:13:24.852

*** SERVICE NAME:(SYS$BACKGROUND) 2013-07-10 09:13:24.852

*** MODULE NAME:() 2013-07-10 09:13:24.852

*** ACTION NAME:() 2013-07-10 09:13:24.852

Dump continued from file: /opt/ora11g/diag/rdbms/prodb/ORABJ/trace/ORABJ_cjq0_4189.trc

ORA-00445: background process "J000" did not start after 120 seconds
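When several trace files need checking, the process name and timeout can be pulled out of each ORA-00445 line mechanically. The following is a minimal sketch that assumes the message format is exactly as in the log above; the function name `parse_ora445` is hypothetical.

```shell
# Hypothetical helper: read trace text on stdin and print "<process> <seconds>"
# for every ORA-00445 line that matches the format shown in the log above.
parse_ora445() {
    sed -n 's/^ORA-00445: background process "\([^"]*\)" did not start after \([0-9][0-9]*\) seconds.*$/\1 \2/p'
}

# Possible usage (path taken from the log above):
#   parse_ora445 < /opt/ora11g/diag/rdbms/prodb/ORABJ/trace/ORABJ_cjq0_4189.trc
```

Lines that do not match the pattern are dropped, so the helper can be fed a whole trace file.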

 

Document [ID 1379200.1] describes this error as follows:

What does this message mean ?

The message indicates that we failed to spawn a new process at the Operating System level to serve the request. There are various causes for this issue. This typically occurs when there is a shortage or misconfiguration in Operating System Resources, and thereby the problem should be investigated from an OS perspective. However there are a few causes related to the Oracle Database as well.

 

The default 120 seconds (after which Oracle times out) can be extended dynamically (without a database restart) by setting the following event:

 

$ sqlplus / as sysdba

alter system set events '10281 trace name context forever, level xxx';

-- where xxx is the timeout in seconds.

eg: alter system set events '10281 trace name context forever, level 300';
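To avoid typing the level by hand, the statement can be generated by a small shell helper that rejects non-numeric timeouts. This is only a sketch: the function name `build_timeout_event` is hypothetical, and the helper prints the statement without connecting to any database.

```shell
# Hypothetical helper: print the ALTER SYSTEM statement that sets event 10281
# to the given timeout in seconds. It only prints; nothing is executed.
build_timeout_event() {
    seconds=$1
    case $seconds in
        ''|*[!0-9]*)
            echo "timeout must be a positive integer (seconds)" >&2
            return 1
            ;;
    esac
    if [ "$seconds" -le 0 ]; then
        echo "timeout must be greater than zero" >&2
        return 1
    fi
    printf "alter system set events '10281 trace name context forever, level %s';\n" "$seconds"
}

# Possible usage (assumes sqlplus is on PATH and OS authentication works):
#   build_timeout_event 300 | sqlplus -s / as sysdba
```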

 

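Our hardware does have a problem: the machine has to be rebooted every six months, otherwise the operating system hangs. The previous few times we rebooted it manually at the six-month mark, so there was no incident. This time the six months ran out while the downtime request was still going through approval, and the machine could not hold out until then; the last straw broke it. Luckily it died before I went on leave. Had it gone down while I was on the train, the consequences would have been...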

 

 

 

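Following document [ID 1379200.1], I checked the operating system parameter settings and found that some of them were set incorrectly. Adjusting them in production, however, requires rigorous testing and validation first. Moreover, the document was last updated on 2013-5-13, which suggests that Oracle itself identified this problem only recently, so its recommended fixes should not be tried lightly.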

 

1.       kernel.randomize_va_space

 

Issues caused by the Linux feature Address Space Layout Randomization (ASLR)

This problem is reported in Redhat 5 and Oracle 11.2.0.2. You can verify whether ASLR is being used as follows:

 

# /sbin/sysctl -a | grep randomize

kernel.randomize_va_space = 1

 

If the parameter is set to any value other than 0, then ASLR is in use. Refer to the following document for details:

 

Note 1345364.1: ORA-00445: Background Process "xxxx" Did Not Start After 120 Seconds

The solution is to disable ASLR.
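The check described above can be wrapped in a small function that applies the "any value other than 0" rule. This is a sketch under assumptions: the function name `aslr_status` and its optional file argument are hypothetical (the argument exists only so the logic can be exercised without touching the live kernel setting). By default it reads the standard Linux procfs entry for the parameter.

```shell
# Report whether ASLR is in use, following the rule quoted above:
# any value other than 0 means ASLR is enabled.
# The optional argument (a file to read the value from) is an assumption
# made for testability; the default is the Linux procfs entry.
aslr_status() {
    file=${1:-/proc/sys/kernel/randomize_va_space}
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 2
    fi
    value=$(tr -d '[:space:]' < "$file")
    case $value in
        ''|*[!0-9]*)
            echo "unknown"
            return 2
            ;;
    esac
    if [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (kernel.randomize_va_space = $value)"
    fi
}

# Possible usage: aslr_status
```

The function only reports the current state; it changes nothing, in keeping with the caution that production parameters need testing before they are touched.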

 

 

2.       Setting PGA_AGGREGATE_TARGET=TRUE 

The parameter pga_aggregate_target is a numeric value, not a boolean, and therefore must be set to a number to function correctly. If it is set to a text string, Oracle will try to convert it to a meaningful value, but the result may be insufficient for your environment.

 

Solution: Properly set PGA_AGGREGATE_TARGET to a numeric value.
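A value taken from a parameter file can be checked before it is applied. The sketch below assumes the value is written as plain digits with an optional K, M or G suffix, which is how sizes are commonly written for this parameter. The function name `is_numeric_size` is hypothetical.

```shell
# Hypothetical check: succeed only if the argument is a run of digits with an
# optional single K, M or G suffix (for example 1073741824 or 2G).
# A value such as TRUE fails the check.
is_numeric_size() {
    case $1 in
        '') return 1 ;;
    esac
    # Strip one optional trailing size suffix, then require digits only.
    digits=${1%[KkMmGg]}
    case $digits in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Possible usage:
#   is_numeric_size "2G"   && echo "looks numeric"
#   is_numeric_size "TRUE" || echo "not a numeric size"
```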

 

3.       Setting the PRE_PAGE_SGA to TRUE or Altering SGA_SIZE with PRE_PAGE_SGA set to TRUE

PRE_PAGE_SGA instructs Oracle to read the entire SGA into active memory at instance startup. Operating system page table entries are then prebuilt for each page of the SGA. This setting can increase the time needed for instance startup, but it is likely to decrease the time Oracle needs to reach its full performance capacity after startup. PRE_PAGE_SGA can also increase process startup duration, because every process that starts must access every page in the SGA. This can cause the PMON process to take longer to start and exceed the timeout, which is 120 seconds by default, causing the instance startup to fail.

 

Because every process that starts must access every page in the SGA when PRE_PAGE_SGA is TRUE, the overhead can be significant if your system frequently creates and destroys processes, for example by continually logging on and logging off.

 

 

Check whether PRE_PAGE_SGA is set to TRUE

--OR--

Check the generated trace for occurrences of the function ksmprepage()

 

Solution: Setting PRE_PAGE_SGA to FALSE avoids executing this code, so pages are touched only as needed rather than all at once when the process starts. This can avoid or minimize the problem; however, the underlying cause is still an Operating System resource shortage.
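The trace check mentioned above can be done with grep. The sketch below assumes the trace files end in `.trc` and sit directly in one directory, as with the path in the log at the top of this post. The function name `find_prepage_traces` is hypothetical.

```shell
# Hypothetical helper: list the .trc files in a directory whose contents
# mention ksmprepage, the function the note says to look for.
find_prepage_traces() {
    dir=$1
    [ -d "$dir" ] || return 1
    # -l prints only the names of matching files; -F matches the text literally.
    grep -lF 'ksmprepage' "$dir"/*.trc 2>/dev/null
}

# Possible usage (directory taken from the log at the top of this post):
#   find_prepage_traces /opt/ora11g/diag/rdbms/prodb/ORABJ/trace
```

Whether the parameter itself is TRUE can be checked in SQL*Plus with `show parameter pre_page_sga`.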

 

 

References:

1.       Troubleshooting Guide (TSG) - ksvcreate: Process(xxxx) creation failed / ORA-00445: background process "xxxx" did not start after n seconds [ID 1379200.1]

2.       Bug 9871302 - Windows: Cannot make new connection to database on Windows platforms with TNS-12560 [ID 9871302.8]

3.       ORA-00445: Background Process "xxxx" Did Not Start After 120 Seconds [ID 1345364.1]