Sunday, March 3, 2013

New OakTable World 2012 video is published

Just published Kevin Closson video from OakTable World 2012.

Enjoy,

Marcin

Thursday, February 21, 2013

Oracle on AIX - where's my cpu time ?

This story begins couple of weeks ago when I was asked to look on performance problem for critical processes before go-live. There were differences between run times between different test systems and between runs with 10046 trace enabled or not.

First of all I started with tracing issue. I have checked server CPU and disk utilization using NMON (tool name is important here). There wasn’t any problem with CPU utilization – or at least this is what I was thinking at that time. Tracing was enabled with level 12 (bind and wait) and it generate 2 GB of trace file for 30 min work. First I decided to limit trace level to 8 and give it another try. In the meantime, I double-checked with Oracle community on Twitter (thanks @CaryMillsap and @MartinBerx) that there is not any know Oracle bugs related to trace and performance. Changing tracing level from 12 to 8 solve issue – I mean there no difference between process running with trace and without trace and also limit size of the file to 130MB from 2 GB (don’t ask ?)

In next step I focused on analyze differences between run times on two systems. It is quite easy to do with trace files and I was hoping for short solution.

My first surprise was that profile of SQL in both trace file were almost identical – there was not OS calls (waits) and query spend whole time on CPU. Execution plan were same and amount of data was same as well. Elapsed time for both executions was different and was equal to wall clock time recorded in process logs. To clarify my finding I will focus now on one UPDATE statement. 

When I compared elapsed time with CPU time for EXEC db call I have seen a huge gap between both values – on faster system ratio between CPU time / elapsed time was around 60 % and slower one around 37 % - remember there is no OS calls (wait events)
According to all my knowledge if there is no OS calls CPU time should be equal to elapsed time with probably some difference related to measurement or rounding. If there are differences, there is a huge chance that Oracle process was per-empted from CPU and was waiting in CPU queue. Well according to NMON stats system was utilized in 30 % only so why Oracle process was to wait in queue if there are lot of CPU’s free ? So maybe NMON is reporting system utilization it in wrong way and server itself is overloaded ? That question had to be answered.
I was lucky enough and I had a chance to run Oracle processes on idle server when no one else was connected. I got small LPAR with one core and 4 logical CPU’s as AIX SMT was enabled. I created a small test 

declare
 a number;
 begin
 for i in 1..100000 loop
 select count(*) into a from dba_users;
 end loop;
 end;
/
and I was surprised with results. I doesn’t trust nmon output anymore so I run topas  observing CPU utilization – it showed around 63 % of utilization for my session ! why not 95 – 100 % ? there was nothing else running there ?
I was even more surprised when I checked trace file – here is output formatted by tkprof for query I was looking at

SQL ID: 821zd9kr9k4yr Plan Hash: 3759210642

SELECT COUNT(*) 
FROM
 DBA_USERS


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 100000      1.66       2.73          0          0          0           0
Fetch   100000     53.75      86.03          0    4200000          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   200001     55.41      88.77          0    4200000          0      100000
55.51 / 88.71 * 100 % = 62.5 %

Hey ! Dude where is my CPU time ? Did my process spend 33 s in CPU run queue ? on idle box ?

I needed more data and at that stage I had three hypothesis – Oracle bug, configuration problem or SMT option. I had funny feeling that SMT option is related to these results somehow but I checked MOS looking for bugs and asked Twitter friends to run same test on other system. (@MartinDBA  confirmed my findings - Thanks for your help).

Next step was to disable SMT option and I learn that AIX is so nice that you can do it online.  I rerun my test – Bingo – topas is showing 95 – 98 % of CPU utilization for my Oracle process and trace file is confirming it

SQL ID: 821zd9kr9k4yr Plan Hash: 3759210642

SELECT COUNT(*) 
FROM
 DBA_USERS


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 100000      2.69       2.78          0          0          0           0
Fetch   100000     85.08      88.33          0    4200000          0      100000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   200001     87.78      91.12          0    4200000          0      100000
87.78 / 91.12 * 100 % = 96.3 %

This is what I was looking for. Based on both observations I was sure that there is no bug’s in tracing functionality for this version of Oracle (11.2.0.3) and I was sure that Oracle process can spin CPU to almost 100 %. There was still question – why there is a huge gap in cpu to elapsed time with SMT enabled. I start looking for IBM documentation and bugs and after some time I found that page - Understanding Processor Utilization on Power Systems - AIX and this picture (linked from IBM wiki page)



So for new PowerPC 7 if your process is running 1 sec on logical CPU (thread)  CPU utilization (user plus system) is 0.63 sec only. If you compare this one with no SMT enabled (or previous version of CPU) you can learn that if process was running for 1 sec on logical CPU it used to be reported as 1 sec of time used by CPU in user and system mode.

How tools are calculating CPU utilization for one process for  1 sec ? you need to divide CPU time used by this processes by 1 sec. So for pre-powerpc 7 or for power pc 7 without SMT

1 s on cpu / 1 s real time * 100 % = 100 %

For powerpc7 with 4 SMT

0.63 on cpu / 1 s * 100 % = 63 %

Have I seen this number before ? Yes this is CPU time taken from trace file divided by elapsed time ! My mystery has been solved – 63 % is now 100 % at least for that version of AIX and Oracle.


Where is a problem ? 

If you are using Cary Millsap Method-R methodology to diagnose performance problems you need to remember that unaccounted time for trace files will be impacted by this feature/bug/you name it.

ps.
I think this post will be updated when I will have more details.

regards,
Marcin

Saturday, February 9, 2013

OakTable World 2012 - Greg Rahn video

New video from OakTable World 2012 is available - Greg Rahn - Beyond Relational Databases

 

regards
Marcin

Friday, February 1, 2013

New OakTable Video - Delphix presenation

New OakTable World 2012 video from Kyle Hailey presentation has been uploaded into YouTube.


regards,
Marcin

Sunday, January 27, 2013

How dNFS database clone works – part 1

There is new feature in Oracle 11.2.0.2 called dNFS clone. It has been described by Kevin Closson on his blog post - Oracle Database 11g Direct NFS Clonedb Feature  and very good configuration description has been posted by Tim Hall on his blog - Direct NFS (DNFS) Clonedb in Oracle Database 11g Release 2 (Patchset 11.2.0.2). I have played with it just after I found both blogs but never think how Oracle implemented that feature. I came back to it when my colleague asked me if I ever use that in production environment and what are performance implications of using it. 

Oracle is using Direct NFS option as Oracle Disk Manager 

Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0
(is there any other still in use ? I remember one – Veritas Quick I/O module for Veritas filesystem / volume manager). As far as I understand ODM is a library which is providing IO functionality for database. As Oracle can be used on different hardware and Oracle kernel can call ODM functions to read or write data from storage and from Oracle kernel perspective it doesn’t really matter how ODM fulfill IO request.

When I started thinking how Oracle implemented that feature, two ideas came into my mind:
  • Oracle keep in-memory map of changed blocks and know where to go to find latest block – actually I never believe in it – this too costly solution 
  • Oracle is just doing read from both sources (backup and NFS) and it is comparing which block is newer if necessary
But guessing work is against BAAG and I want to check what database is actually doing when we are using dnfs clone. I have started with two scenarios – read data which have been changed after clone has been established and read unchanged data. Thanks to Oracle instrumentation (trace 10046), strace and wireshark I see much more details now but still there is long way to have whole picture. Let me share my findings with you.

Read changed data in clone 

I added one row to scott.bonus table which was empty when I cloned database and I executed simple query – select * from scott.bonus – below is a mixture output from all three monitoring sources
EXEC #140195143601328:c=0,e=42,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=953918076,tim=2013-01-17 14:29:43:108500
WAIT #140195143601328: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=13891 tim=2013-01-17 14:29:43:108660
WAIT #140195143601328: nam='Disk file operations I/O' ela= 5 FileOperation=2 fileno=4 filetype=2 obj#=75337 tim=2013-01-17 14:29:43:108800
First WAIT line with disk OS CALL (thanks to @CaryMillsap Method R training I started to thinking about WAIT lines as I should before) has no strace nor wireshark activity and call itself is very short – 5us –so I just ignored it for time being.
statfs("/u01/app/oracle/backup/data_data_D-TESTSE_I-4169104998_TS-USERS_FNO-4_10nvidnv", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=5158559, f_bfree=2226212, f_bavail=1964171, f_files=1310720, f_ffree=1268869, f_fsid={1982750833, 2010996514}, f_namelen=255, f_frsize=4096})
open("/u01/app/oracle/backup/data_data_D-TESTSE_I-4169104998_TS-USERS_FNO-4_10nvidnv", O_RDWR|O_DSYNC) = 11
fcntl(11, F_DUPFD, 256) = 257
close(11)               = 0
fcntl(257, F_SETFD, FD_CLOEXEC) = 0
fstatfs(257, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=5158559, f_bfree=2226212, f_bavail=1964171, f_files=1310720, f_ffree=1268869, f_fsid={1982750833, 2010996514}, f_namelen=255, f_frsize=4096}) = 0
WAIT #140195143601328: nam='Disk file operations I/O' ela= 182 FileOperation=2 fileno=4 filetype=2 obj#=75337 tim=2013-01-17 14:29:43:109050
WAIT #140195143601328: nam='Disk file operations I/O' ela= 160 FileOperation=2 fileno=4 filetype=2 obj#=75337 tim=2013-01-17 14:29:43:109150
There are next two i/o related calls and above we can see that backup file has been opened , handler has been duplicated and fstatfs call has been executed. There is another mystery here I have seen couple of system calls and then two entries from Oracle trace files – not sure why it look like this – as we are dealing here on microsecond level it is hard to match tim / ela from Oracle trace file with actuall OS calls from strace – but I will try to do it later - in next post probably. Right now we now that Oracle process opened backup file and it is ready to read from it. 
Query is using full scan on table so first of all Oracle need to read segment header
sendmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\214\252\16\236\235\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364,

V3 READ Call (Reply In 44853), FH:0x8f3a5fe4 Offset:256131072 Len:8192,44846,2013-01-17 14:29:43.109264,192.168.1.60,192.168.1.20,NFS,210

poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\235\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0 \0\0\0\0\0\0004\24\0\0\0\0\0\0!z\0\1\1\0\0\0 z\0\1!z\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0I&\1\0\0\0\0\20 z\0\1\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\

V3 READ Reply (Call In 44846) Len:8192,44853,2013-01-17 14:29:43.144922,192.168.1.20,192.168.1.60,NFS,1150

recvmsg(32, 0x7fffe701bc20, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)

WAIT #140195143601328: nam='db file sequential read' ela= 36202 file#=4 block#=31266 blocks=1 obj#=75337 tim=2013-01-17 14:29:43:145410
We can see that process is using sendmsg function sending NFS READ call (captured by wireshark) and it is waiting for answer. Due to my NFS and network configuration respond has been split into two recvmsg system calls but it is one NFS READ Reply – one database block ( 8kB) has been read and this action has been registered in trace file as db file sequential read. Now process is going to execute full scan and read blocks from table
sendmsg(32, {msg_name(0)=NULL, msg_iov(2)=[{"\200\0\0\214\252\16\236\237\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364,V3 READ Call (Reply In 44888), FH:0x8f3a5fe4 Offset:256139264 Len:32768  ; 
V3 READ Call (Reply In 44888), FH:0x8f3a5fe4 Offset:256139264 Len:32768
V3 READ Call (Reply In 44862), FH:0x8f3a5fe4 Offset:256172032 Len:8192
poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\240\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0,
V3 READ Reply (Call In 44855) Len:8192,44862,2013-01-17 14:29:43.158250,192.168.1.20,192.168.1.60,NFS,1150,,,,
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\200\200\252\16\236\237\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
V3 READ Reply (Call In 44855) Len:32768,44888,2013-01-17 14:29:43.159262,192.168.1.20,192.168.1.60,NFS,1110,,,,
recvmsg(32, 0x7fffe7016aa0, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)
WAIT #140195143601328: nam='db file scattered read' ela= 13870 file#=4 block#=31267 blocks=5 obj#=75337 tim=2013-01-17 14:29:43:159580
There are two NFS READ calls and two NFS READ Reply and 5 blocks has been read – it exactly match Oracle trace file entry – db file scattered read – with blocks equal to 5. So this scenario is straight forward – Oracle is going directly to NFS to read changed blocks – but still I can’t be sure how it decide to go there. Let’s check now what will happen when we try to read unchanged blocks.

Read unchanged blocks.  

Oracle is executing query – select * from scott.salgrade – and here is a trace / strace / wireshark output. Process is executing table full scan so it need to read segment header first
sendmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\214\252\16\236\245\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364\16\216\312.\252\227\0\0\0\0\0\0\0\0\25`\ 
V3 READ Call (Reply In 44938), FH:0x8f3a5fe4 Offset:1327104 Len:8192,44930,2013-01-17 14:29:55.162780,192.168.1.60,192.168.1.20,NFS,210    
sendmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0X\252\16\236\246\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\0\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0", 92}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 92
poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\245\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\0\0
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 7996}], msg_con
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\30\252\16\236\246\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\0\0
V3 READ Reply (Call In 44930) Len:8192 44938 2013-01-17 14:29:55.163827,192.168.1.20,192.168.1.60,NFS,1178
recvmsg(32, 0x7fffe701bc20, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)         
WAIT #140195143601328: nam='db file sequential read' ela= 1323 file#=4 block#=162 blocks=1 obj#=75338 tim=2013-01-17 14:29:55:164020
At first sight nothing has been changed – Oracle is doing NFS READ call and got a response. But if you will take a look on returned data – database block is full of zeros. It isn’t a surprise as NFS files are sparse one and it should contain only changed blocks. That read activity has been recorded in trace file as db file sequential read but we still have no data - so what is going next
pread(257, "#\242\0\0\242\0\0\1\250\34\17\0\0\0\1\4\317\215\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\10\0\0\0\234\n\0\0\0\0\0\0\10\0\0\0\10\0\0\0\250\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\10\0\0\0\250\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\240\0\0\1"..., 8192, 1327104) = 8192
WAIT #140195143601328: nam='Disk file operations I/O' ela= 49 FileOperation=9 fileno=4 filetype=2 obj#=75338 tim=2013-01-17 14:29:55:164170
Disk read is performed on file with handler 257 and as we can check in first strace output this file is a backup data file. Oracle is reading segment data from here and recorded that operation in trace file as ‘Disk file operations I/O’. Based on that behaviour I think there is no map of changed blocks but process is reading NFS first and if NFS will return zeroed block it is reading backup data file. Let see rest of full scan operation.
sendmsg(32, {msg_name(0)=NULL, msg_iov(2)=[{"\200\0\0\214\252\16\236\250\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364\16\216\312.\252\227\0\0\0\0\0\0\0\0\25`\ 
V3 READ Call (Reply In 44972), FH:0x8f3a5fe4 Offset:1335296 Len:32768  ; V3 READ Call (Reply In 44978), FH:0x8f3a5fe4 Offset:1368064 Len:8192
poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\200\200\252\16\236\250\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32572}], msg_co
V3 READ Reply (Call In 44940) Len:32768 44972,2013-01-17 14:29:55.168776,192.168.1.20,192.168.1.60,NFS,1514    
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\251\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\0\0
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 7996}], msg_con
V3 READ Reply (Call In 44940) Len:8192,44978,2013-01-17 14:29:55.168790,192.168.1.20,192.168.1.60,NFS,746    
recvmsg(32, 0x7fffe7016aa0, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)           
WAIT #140195143601328: nam='db file scattered read' ela= 4762 file#=4 block#=163 blocks=5 obj#=75338 tim=2013-01-17 14:29:55:169040
Once again process send to NFS READ calls (even if segement header wasn’t on NFS) and again only zeroed blocks has been send back. Whole operation is recorded as db file scattered read but actually process still has no data.
pread(257, "\6\242\0\0\243\0\0\1\250\34\17\0\0\0\1\4\202\0\0\0\1\0\0\0J&\1\0\247\34\17\0\0\0\0\0\2\0002\0\240\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\16\0\230\37\212\37\212\37\0\0\0\0@\0\6\0\0\0`\37\0\0"..., 40960, 1335296) = 40 
WAIT #140195143601328: nam='Disk file operations I/O' ela= 6883 FileOperation=9 fileno=4 filetype=2 obj#=75338 tim=2013-01-17 14:29:55:176020      
Again process is reading file 257 (backup data file) and it is recording that activity as "Disk file operations I/O". Based on that observation I can say that Oracle is looking for data first on NFS and then in backup file. 

But it there any caching / learning algorithm in place ? I don’t think so cause when I run this query select * from dba_source where line like 'marcin' twice with flushing buffer cache between runs and execution pattern looks similar – there was NFS activity fallowed by reading backup disk file. 

This is end of part one and I will try to publish next post about same topic soon.

regards,
Marcin 

Tuesday, January 15, 2013

Oracle Linux and ESX hot add cpu and memory

I'm still quite new in Vmware ESX environment and features which I'm using now as my lab (curiosity of VMWare Guru Program).
This time I decided to test hot add CPU and memory functionality. It looks very interesting and in past years hot cpu / memory games where restricted to sophisticated hardware only - now everybody can test is at home.

My lab is basesd on Intel 4 cores and VM are running Oracle Linux 6.3 with or without Oracle Enterprise Kernel. Here are my findings:

1. Hot add CPU / memory is disabled by default and has to be enabled when VM is down. To do so you have to edit your VM configuration and go to options.

2. When feature is enabled you can ADD CPU or memory by editing your VM configuration. Keep in mind that there is no possibility to decrease number of CPU or memory.

3. Here is a table with supported options for ESX 5i and Oracle Linux with or without UEK

I tested adding a CPU to system for RedHat and UEK kernel. As you can find in table linked above adding memory is supported for RedHat kernels only.

RedHat kernel

OEL 6.3 with RedHat kernel running with 1 CPU
[root@oraclese ~]# uname -a
Linux oraclese.localdomain 2.6.32-279.5.2.el6.x86_64 #1 SMP Thu Aug 23 12:05:59 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@oraclese ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping        : 7
cpu MHz         : 3110.469
cache size      : 6144 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc up arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat epb xsaveopt pln pts dts
bogomips        : 6220.93
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
New CPU has been added in vSphere client. Below is output from messages log.
Jan 13 15:17:25 oraclese kernel: SMP alternatives: switching to SMP code
Jan 13 15:17:25 oraclese kernel: Booting Node 0 Processor 1 APIC 0x1
Jan 13 15:17:25 oraclese kernel: Disabled fast string operations
Jan 13 15:17:25 oraclese kernel: mce: CPU supports 0 MCE banks
Jan 13 15:17:25 oraclese kernel: x86 PAT enabled: cpu 1, old 0x0, new 0x7010600070106
Jan 13 15:17:25 oraclese kernel: Skipped synchronization checks as TSC is reliable.
Jan 13 15:17:25 oraclese kernel: microcode: CPU1 sig=0x206a7, pf=0x1, revision=0x14
Jan 13 15:17:25 oraclese kernel: platform microcode: firmware: requesting intel-ucode/06-2a-07
Jan 13 15:17:25 oraclese firmware.sh[2070]: Cannot find  firmware file 'intel-ucode/06-2a-07'
Jan 13 15:17:25 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:17:25 oraclese kdump: failed to stop
Jan 13 15:17:25 oraclese kdump: No crashkernel parameter specified for running kernel
Now system is running with two CPU's
[root@oraclese ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping        : 7
cpu MHz         : 3110.469
cache size      : 6144 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat epb xsaveopt pln pts dts
bogomips        : 6220.93
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping        : 7
cpu MHz         : 3110.469
cache size      : 6144 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat epb xsaveopt pln pts dts
bogomips        : 6220.93
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Adding memory - OEL 6.3 wtih RedHat kernel - system running with 1 GB of RAM
[root@oraclese ~]# uname -a
Linux oraclese.localdomain 2.6.32-279.5.2.el6.x86_64 #1 SMP Thu Aug 23 12:05:59 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@oraclese ~]# free
             total       used       free     shared    buffers     cached
Mem:       1016680     190384     826296          0       4736      77112
-/+ buffers/cache:     108536     908144
Swap:      2097144          0    2097144
Additional 1 GB has been added via vSphere Client. Below is output from messages log.
Jan 13 15:40:54 oraclese kernel: init_memory_mapping: 0000000040000000-0000000048000000
Jan 13 15:40:54 oraclese kernel: Built 1 zonelists in Node order, mobility grouping on.  Total pages: 274190
Jan 13 15:40:54 oraclese kernel: Policy zone: Normal
Jan 13 15:40:54 oraclese kernel: init_memory_mapping: 0000000048000000-0000000050000000
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kernel: init_memory_mapping: 0000000050000000-0000000058000000
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kernel: init_memory_mapping: 0000000058000000-0000000080000000
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: failed to stop
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Jan 13 15:40:54 oraclese kdump: No crashkernel parameter specified for running kernel
Now Linux reconized 2 GB of RAM
[root@oraclese ~]# free
             total       used       free     shared    buffers     cached
Mem:       2065256     211840    1853416          0       4764      77328
-/+ buffers/cache:     129748    1935508
Swap:      2097144          0    2097144

Oracle UEK release 2 kernel

OEL 6.3 running with Oracle UEK release 2
[root@oraclese ~]# uname -a
Linux oraclese.localdomain 2.6.32-300.3.1.el6uek.x86_64 #1 SMP Fri Dec 9 18:57:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@oraclese ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping        : 7
cpu MHz         : 3110.469
cache size      : 6144 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc up arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat
bogomips        : 6220.93
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
New CPU has been added in vSphere client. Below is output from messages log.
Jan 13 15:38:12 oraclese kernel: processor LNXCPU:01: registered as cooling_device1
Jan 13 15:38:12 oraclese kernel: SMP alternatives: switching to SMP code
Jan 13 15:38:12 oraclese kernel: Booting processor 1 APIC 0x1 ip 0x6000
Jan 13 15:38:12 oraclese kernel: Initializing CPU#1
Jan 13 15:38:12 oraclese kernel: CPU: Physical Processor ID: 2
Jan 13 15:38:12 oraclese kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
Jan 13 15:38:12 oraclese kernel: CPU: L2 cache: 256K
Jan 13 15:38:12 oraclese kernel: CPU: L3 cache: 6144K
Jan 13 15:38:12 oraclese kernel: CPU 1/0x1 -> Node 0
Jan 13 15:38:12 oraclese kernel: mce: CPU supports 0 MCE banks
Jan 13 15:38:12 oraclese kernel: x86 PAT enabled: cpu 1, old 0x0, new 0x7010600070106
Jan 13 15:38:12 oraclese kernel: CPU1: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz stepping 07
Jan 13 15:38:12 oraclese kernel: Skipping synchronization checks as TSC is reliable.
Jan 13 15:38:12 oraclese kernel: microcode: CPU1 sig=0x206a7, pf=0x1, revision=0x14
Jan 13 15:38:12 oraclese kernel: platform microcode: firmware: requesting intel-ucode/06-2a-07
Jan 13 15:38:12 oraclese firmware.sh[2074]: Cannot find  firmware file 'intel-ucode/06-2a-07'
Jan 13 15:38:12 oraclese kdump: kexec: failed to unload kdump kernel
Jan 13 15:38:12 oraclese kdump: failed to stop
Jan 13 15:38:12 oraclese kdump: No crashkernel parameter specified for running kernel
Now system is running with two CPU's
[root@oraclese ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping        : 7
cpu MHz         : 3110.469
cache size      : 6144 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat
bogomips        : 6220.93
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping        : 7
cpu MHz         : 3110.469
cache size      : 6144 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat
bogomips        : 6220.93
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
regards, Marcin

Sunday, January 13, 2013

Oaktable videos

Hi,

Short update - I have upload new video from OakTable World 2012 - Brendan Gregg -- USE Method for Performance Tuning.


All videos and presentations can be found here - OakTable world

regards,
Marcin