Sunday, July 18, 2010

OEM performance tab and Active Session History data.

Graphic representation of Active Session History data in Database Console or in Grid Control is causing some misunderstanding of how to read it. I have lot of questions about Y scale on OEM performance tab and now I decide to blog about it. I want to describe now OEM is using a ASH date and what we can do with this information. Anyway using SQL instead of OEM is more useful in performance tuning and can be easily focused on existing issue.

For other point of view knowledge of ASH give you opportunity to use free implementation of it developed by Kyle Hailey. (Thanks Kyle for your excellent scripts – this is great staff to learn from). By the way recently I have deployed SASH on big production database running on Oracle 9i and it help me to identify and address performance problems. As soon as I finish all these work I will blog about it too.

CPU utilization
CPU usage on performance tab is in green and in my case it has upper limit set to 2 – as I have two core CPU.


It is quite clear how to read CPU usage and what is a maximum CPU utilization. Scope of time which sessions can spend on CPU has an upper limit which is 100 % utilization of all CPU. It mean that maximum CPU time for session in sampling time can be calculated using following formula

Number of CPU x Sampling Time

So if we want to calculate a CPU usage we need to use following formula:

time consumed per all sessions / (Number of CPU x Sampling Time) * 100 %

ex.
time consumed per all sessions = 16 s
Number of CPU = 2
Sampling time = 15 s 

Util of all CPUs = 15 / (2 * 15) * 100 % = 53.3 %

How to calculate CPU utilisation using Active Session History data instead of OEM performance tab ?

Active Session History table is a results of sampling V$SESSION view every second. That mean there is no “real” waiting or executing time in V$ACTIVE_SESSION_HISTORY but just samples. We are assuming that every state in ASH view (WAITING or ON CPU) took exactly 1 s. This is important as we don’t sum waiting time from ASH view but we have to count how many 1 s events were in observed time.
Be aware that DBA_HIST_ACTIVE_SESS_HISTORY has only 1/10 data of  V$ACTIVE_SESSIN_HISTORY (only sample_id modulo 10 are copied into it) and each sample is now 10s long if we want to use it for our calculations.

Following query will calculate average CPU utilization for two minutes:
SQL> select trunc(count(*)/
  2  (min(p.value)*(max(sample_id) - min (sample_id)))*100,2) "CPU_UTIL",
  3  max(sample_id) - min (sample_id) "sampling time [s]",
  4  count(*) "time consumed per sessions [s]",
  5  min(p.value) * (max(sample_id) - min (sample_id)) "maximum possible time [s]"
  6  from v$active_session_history,
  7  (select value from v$parameter  where name = 'cpu_count') p
  8  where 1=1
  9  and sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
 10  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
 11  and session_state = 'ON CPU';

  CPU_UTIL sampling time [s] time consumed per sessions [s] maximum possible time [s]
---------- ----------------- ------------------------------ -------------------------
     29.23               118                             69                       236


But this doesn’t help us in understanding OEM graph. Refresh rate is set to 15 sec so we need to display data from that period split into 15 sec slices.
SQL> select trunc(to_char(sample_time,'SSSSS')/15) "15s_sample", trunc(count(*)/
  2  (min(p.value)*15)*100,2) "CPU_UTIL",
  3  count(*) "time consumed per sessions [s]"
  4  from v$active_session_history,
  5  (select value from v$parameter  where name = 'cpu_count') p
  6  where 1=1
  7  and sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
  8  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
  9  and session_state = 'ON CPU'
 10  group by trunc(to_char(sample_time,'SSSSS')/15)
 11  order by 1;

15s_sample   CPU_UTIL time consumed per sessions [s]
---------- ---------- ------------------------------
      2972       6.66                              2
      2973       6.66                              2
      2974      53.33                             16
      2975      23.33                              7
      2976      36.66                             11
      2977      26.66                              8
      2978      33.33                             10
      2979      43.33                             13

8 rows selected.

Now if you use any graphic tool to create a chart using 15s_sample column as X axis and "CPU util" column as Y axis you should get very similar picture to OEM chart.

Wait events

There is no easily calculated upper limit for waiting sessions. Why ? Number of waits per sample is depended on session state and number of waiting sessions. So what is a maximum for that ? Should we use only active sessions or all connected sessions ? The worst case scenario is all sessions are active and all are in waiting state. Both values multiplied each other give us a potential upper limit. 

Performance tab in OEM is calculating a average of waits in sample time divided by groups and all groups are displayed using stacked area graph.

Following SQL return data used to create following graphs
SQL> select trunc(to_char(sample_time,'SSSSS')/15) "15s_sample", wait_class, count(*)/15
  2  from v$active_session_history
  3  where sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
  4  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
  5  and session_state = 'WAITING'
  6  group by trunc(to_char(sample_time,'SSSSS')/15), wait_class order by 1,3
  7  ;

15s_sample WAIT_CLASS                                                       COUNT(*)/15
---------- ---------------------------------------------------------------- -----------
      2974 Other                                                                     .2
      2974 System I/O                                                        .533333333
      2974 Commit                                                                     2
      2974 User I/O                                                          2.06666667
      2974 Concurrency                                                       4.66666667
      2975 Application                                                       .066666667
      2975 System I/O                                                        .933333333
      2975 User I/O                                                                 3.8
      2975 Commit                                                            4.93333333
      2976 Concurrency                                                       .066666667
      2976 Application                                                       .066666667
      2976 Other                                                                     .4
      2976 System I/O                                                               1.2
      2976 Commit                                                                   5.4
      2976 User I/O                                                                 5.8
      2977 System I/O                                                        .866666667
      2977 User I/O                                                          2.93333333
      2977 Commit                                                                   5.6
      2978 Concurrency                                                       .066666667
      2978 System I/O                                                        .933333333
      2978 User I/O                                                          3.26666667
      2978 Commit                                                                     4
      2979 System I/O                                                        .866666667
      2979 User I/O                                                          1.46666667
      2979 Commit                                                            2.53333333

25 rows selected.
As results we have average time spend by all sessions divided per wait class for a 1 second.  What it mean that in sample 2976 wait class “Commit” average time was 5.4 ? It exactly mean that in that 15 seconds sampling time all active sessions were 5.4*15 = 81 times waiting for events in commit class. Is it bad or not ? This is a different question and will not be answered in that post.

Now if we take a look on sample nr 2976
2976 Application                                                       .066666667
2976 Other                                                                     .4
2976 System I/O                                                               1.2
2976 Commit                                                                   5.4
2976 User I/O                                                                 5.8

Sum of all waits give us 12.8 and this is more less what we can see as a top value on graph. A differences between graph and our calculation are related to observation period start time and how we split results into 15 sampling time. We are using a group by function based on sample_time truncated into 4 periods of seconds - using group by to_char function (based on Kyle idea).These periods are: 0 – 15, 15- 30 , 30-45, 45-60 but OEM can display data based on different 15 sec groups ex. 10 - 25, 25 - 40 so small differences can occure.

Now the question is – is this a bad thing to have high numbers of OEM graph ? And like always when we are talking about Oracle performance an answer is depends. 
What is bad for sure if our CPU usage is going over 85-90 % and stay there for a long time. In that case CPU response time will have good opportunity to increase and whole system performance can go down. What to do with waits ? In my opinion a best way is to create a baseline (even on paper) with average waiting time and react only when system is working over our baseline or what is more important when system users are claiming a performance problems.

Next time I will use drill-into functionality of OEM and will try to explain what other graphs means – unfortunately most of them are changing a scale between main performance graph and detailed ones so if you are not aware of that it can cause a lot of misunderstandings.


Regards,
Marcin

3 comments:

Anonymous said...

Shouldn't we multiple cpu_count by parallel_threads_per_cpu? Like OEM does when drawing "Maximum CPU line"

Anonymous said...

How about "Total Disk I/O Per Second" ?

Please.

Marcin Przepiorowski said...

Sorry for delay.
According to 1st question - I think it is related to type of CPU which you have.

According to 2nd:
What do you mean ? Total Disk I/O is takes as far as I know from operating system.

regards,
Marcin