Customer reported a system slowdown/process hung during small and well defined period of time and it was related to one specific process only. Standard Oracle 9i performance analyze tool - Statspack didn't show anything unusual in that time but shapshots were taking very 30 min and this issue took about 5 minutes.
As we could not replicate issue I thought about SASH (a free simulator of Oracle Active Session History) and I have installed it and configured. From that time I was able to gather and keep information about active sessions.
When next time I got detailed information about similar issue I was able to use collected data and found a problem. After investigation I figured out that process used for that business functionality stuck on “library cache lock” event. When I compared my findings from SASH with Statspack I realize why I didn't spotted it before - "library cache lock" was about 1.8 % of all locks between statspack snapshots. From overall database review there was no problem but from process (and business) perspective that lock hung important process for almost 5 minutes.
Below I described all major steps of my investigation and what was a root cause.
Let's start with SASH output (I just left 3 lines but there was many more similar lines):
Below I described all major steps of my investigation and what was a root cause.
Let's start with SASH output (I just left 3 lines but there was many more similar lines):
SQL> select sample_time,program, session_id, session_state, event, seq#, sql_id, '0' || trim(to_char(p1,'XXXXXXXXXXXXXXXXX')) "p1raw", p3 2 from v$active_session_history 3 where 4 sample_time >= to_date('18-7-2010 13:50:10','dd-mm-yyyy hh24:mi:ss') 5 and sample_time <= to_date('18-7-2010 13:54:58','dd-mm-yyyy hh24:mi:ss') 6 and event like 'library%' 7 order by sample_time desc 8 / SAMPLE_TI PROGRAM SESSION_ID SESSION_STATE EVENT SEQ# SQL_ID p1raw P3 --------- -------------- ---------- ------------- ------------------ ----- ------- ----------- --- 18-JUL-10 importantp.exe 472 WAITING library cache lock 214 1548901218 07000002AC789A98 210 18-JUL-10 otherprogr.exe 436 WAITING library cache lock 8 2855865705 07000002AC789A98 210 18-JUL-10 375 WAITING library cache lock 33 261624711 07000002AC789A98 310
First idea was about parsing problem but when I took a look on SQL it looked OK but it was just entry point to PL/SQL procedure. Next thing was to check event parameters and there was first surprise. According to Oracle documentation and Metalink note 34579.1 that event has following parameters:
- p1 is an address of required object - in my case 07000002AC789A98
- p3 is a lock type and namespace type (lock mode * 100 + namespace) - in my case 310 or 210
In Oracle definitions (metalink or decode in v$librarycache) namespace number 10 is not defined. See whole list below
• 0 SQL Area
• 1 Table / Procedure / Function / Package Header
• 2 Package Body
• 3 Trigger
• 4 Index
• 5 Cluster
• 6 Object
• 7 Pipe
• 13 Java Source
• 14 Java Resource
• 32 Java Data
• 0 SQL Area
• 1 Table / Procedure / Function / Package Header
• 2 Package Body
• 3 Trigger
• 4 Index
• 5 Cluster
• 6 Object
• 7 Pipe
• 13 Java Source
• 14 Java Resource
• 32 Java Data
So what object it is ? P1 had a address and X$KGLOB has a list of all objects (I checked that table next day but there was no restart and it was a chance that there is still that same object in shared pool as session was running 24/7):
SQL> select kglnaown "Owner", kglnaobj "Object" from x$kglob where kglhdadr='07000002AC789A98'; Owner Object ------ -------- SOFT SNDQ
What type of objects is it ?
SQL> select object_type, object_name from dba_objects where owner='SOFT' and object_name = 'SNDQ'; OBJECT_TYPE OBJECT_NAME ----------- ----------- QUEUE SNDQ
Bingo - object was a queue there was a dequeue in that PL/SQL code. My process called importantp.exe was waiting for lock on queue. Let's check lock type:
- 2 - Share mode
- 3 - Exclusive mode
That was a explanation of importantp.exe hungs - as dequeue function was called after add_subscriber function it had to wait until DDL command was finished. Add subscriber function had to wait until all previous operations in shared mode will be completed and there was typical chain of locks.
There are bugs in Metalink raised by other people describing similar situations and there is only one workaround do not try any DDL on queues during traffic peaks. This recommendation is well know and most DBAs preventing database from typical DDL operation ( like index rebuild or schema changes) during business hours. The problems starts with hidden DDL commands used in application.
There are bugs in Metalink raised by other people describing similar situations and there is only one workaround do not try any DDL on queues during traffic peaks. This recommendation is well know and most DBAs preventing database from typical DDL operation ( like index rebuild or schema changes) during business hours. The problems starts with hidden DDL commands used in application.