Thursday, July 29, 2010

Process hung on library cache lock

Customer reported a system slowdown/process hung during small and well defined period of time and it was related to one specific process only. Standard Oracle 9i performance analyze tool - Statspack didn't show anything unusual in that time but shapshots were taking very 30 min and this issue took about 5 minutes.
As we could not replicate issue I thought about SASH (a free simulator of Oracle Active Session History) and I have installed it and configured. From that time I was able to gather and keep information about active sessions.
When next time I got detailed information about similar issue I was able to use collected data and found a problem. After investigation I figured out that process used for that business functionality stuck on “library cache lock” event. When I compared my findings from SASH with Statspack I realize why I didn't spotted it before - "library cache lock" was about 1.8 % of all locks between statspack snapshots. From overall database review there was no problem but from process (and business) perspective that lock hung important process for almost 5 minutes.

Below I described all major steps of my investigation and what was a root cause.
Let's start with SASH output (I just left 3 lines but there was many more similar lines):
SQL> select sample_time,program, session_id, session_state, event, seq#, sql_id, '0' || trim(to_char(p1,'XXXXXXXXXXXXXXXXX')) "p1raw", p3
  2  from v$active_session_history
  3  where 
  4  sample_time >= to_date('18-7-2010 13:50:10','dd-mm-yyyy hh24:mi:ss')
  5  and sample_time <= to_date('18-7-2010 13:54:58','dd-mm-yyyy hh24:mi:ss')
  6  and event like 'library%'
  7  order by sample_time desc
  8  /

SAMPLE_TI PROGRAM        SESSION_ID SESSION_STATE EVENT              SEQ#  SQL_ID     p1raw             P3
--------- -------------- ---------- ------------- ------------------ ----- -------    -----------       ---
18-JUL-10 importantp.exe        472 WAITING       library cache lock  214  1548901218 07000002AC789A98  210
18-JUL-10 otherprogr.exe        436 WAITING       library cache lock    8  2855865705 07000002AC789A98  210
18-JUL-10                       375 WAITING       library cache lock   33   261624711 07000002AC789A98  310

First idea was about parsing problem but when I took a look on SQL it looked OK but it was just entry point to PL/SQL procedure. Next thing was to check event parameters and there was first surprise. According to Oracle documentation and Metalink note 34579.1 that event has following parameters:
  • p1 is an address of required object - in my case 07000002AC789A98
  • p3 is a lock type and namespace type (lock mode * 100 + namespace)  - in my case 310 or 210
In Oracle definitions (metalink or decode in v$librarycache) namespace number 10 is not defined. See whole list below
•    0 SQL Area
•    1 Table / Procedure / Function / Package Header
•    2 Package Body
•    3 Trigger
•    4 Index
•    5 Cluster
•    6 Object
•    7 Pipe
•    13 Java Source
•    14 Java Resource
•    32 Java Data
So what object it is ? P1 had a address and X$KGLOB has a list of all objects (I checked that table next day but there was no restart and it was a chance that there is still that same object in shared pool as session was running 24/7):
SQL> select kglnaown "Owner", kglnaobj "Object" from x$kglob where kglhdadr='07000002AC789A98'; 

Owner  Object
------ --------
SOFT   SNDQ
What type of objects is it ?
SQL> select object_type, object_name from dba_objects where owner='SOFT' and object_name = 'SNDQ';

OBJECT_TYPE OBJECT_NAME
----------- -----------
QUEUE       SNDQ
Bingo - object was a queue there was a dequeue in that PL/SQL code. My process called importantp.exe was waiting for lock on queue. Let's check lock type:
  • 2    - Share mode 
  • 3    - Exclusive mode 
For importantp.exe lock mode in p3 parameter was 210 which mean wait for shared lock. But queuing and dequeueing operations are running with shared locks so reason for waiting had to be related to none of them. I found it in last line of my listing there was another processes waiting for exclusive lock on that same queue. Using SQL_ID (261624711)  I discovered that there was another PL/SQL piece of code and add_subscriber function was called inside. 
That was a explanation of importantp.exe hungs - as dequeue function was called after add_subscriber function it had to wait until DDL command was finished. Add subscriber function had to wait until all previous operations in shared mode will be completed and there was typical chain of locks.

There are bugs in Metalink raised by other people describing similar situations and there is only one workaround do not try any DDL on queues during traffic peaks. This recommendation is well know and most DBAs preventing database from typical DDL operation ( like index rebuild or schema changes) during business hours. The problems starts with hidden  DDL commands used in application.


5 comments:

Taral said...

Hi Marcin,

As per Metalink Comment
"In Oracle definitions (metalink or decode in v$librarycache) namespace number 10 is not defined."

I think if we see v$librarycache in 10g r2 i can see this validation on view
from x$kglst where indx < 8 or indx = 13 or indx = 14 or indx = 32

Where if you go to 11gr2 this is removed and will show all the namespaces. Also, kglstidn column i think relate with namespace number where 10 is specified as pipe.

For 10g i don't know. May be i am wrong if then please correct me.

Marcin Przepiorowski said...

Hi,

I used "not defined" as view definition has a decode function where this namespace is undefined
- of course in 9i.

I have check now 11gR2 - thanks for tip - and there is much more namespaces

select kglstdsc, indx from x$kglst where kglsttyp = 'NAMESPACE';

and for indx 10 namespace name is QUEUE.

Taral said...

Sorry for typo i mean to say 10 is specified as queue

sicbo said...

Hi Marcin,

i have faced this issue and i followed your instructions.

but i couldn't find the object that occured ,the machine runs 24/7

the value of P3 is 2 ,so i guess i might be a package body..

so do u have any suggestions ?

thanks in advance

sicbo said...

sorry ,i forgot to tell some info to you

Database Version is Oracle RAC 11gR2 running on Solaris.