Sunday, January 27, 2013

How dNFS database clone works – part 1

There is new feature in Oracle 11.2.0.2 called dNFS clone. It has been described by Kevin Closson on his blog post - Oracle Database 11g Direct NFS Clonedb Feature  and very good configuration description has been posted by Tim Hall on his blog - Direct NFS (DNFS) Clonedb in Oracle Database 11g Release 2 (Patchset 11.2.0.2). I have played with it just after I found both blogs but never think how Oracle implemented that feature. I came back to it when my colleague asked me if I ever use that in production environment and what are performance implications of using it. 

Oracle is using Direct NFS option as Oracle Disk Manager 

Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0
(is there any other still in use ? I remember one – Veritas Quick I/O module for Veritas filesystem / volume manager). As far as I understand ODM is a library which is providing IO functionality for database. As Oracle can be used on different hardware and Oracle kernel can call ODM functions to read or write data from storage and from Oracle kernel perspective it doesn’t really matter how ODM fulfill IO request.

When I started thinking how Oracle implemented that feature, two ideas came into my mind:
  • Oracle keep in-memory map of changed blocks and know where to go to find latest block – actually I never believe in it – this too costly solution 
  • Oracle is just doing read from both sources (backup and NFS) and it is comparing which block is newer if necessary
But guessing work is against BAAG and I want to check what database is actually doing when we are using dnfs clone. I have started with two scenarios – read data which have been changed after clone has been established and read unchanged data. Thanks to Oracle instrumentation (trace 10046), strace and wireshark I see much more details now but still there is long way to have whole picture. Let me share my findings with you.

Read changed data in clone 

I added one row to scott.bonus table which was empty when I cloned database and I executed simple query – select * from scott.bonus – below is a mixture output from all three monitoring sources
EXEC #140195143601328:c=0,e=42,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=953918076,tim=2013-01-17 14:29:43:108500
WAIT #140195143601328: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=13891 tim=2013-01-17 14:29:43:108660
WAIT #140195143601328: nam='Disk file operations I/O' ela= 5 FileOperation=2 fileno=4 filetype=2 obj#=75337 tim=2013-01-17 14:29:43:108800
First WAIT line with disk OS CALL (thanks to @CaryMillsap Method R training I started to thinking about WAIT lines as I should before) has no strace nor wireshark activity and call itself is very short – 5us –so I just ignored it for time being.
statfs("/u01/app/oracle/backup/data_data_D-TESTSE_I-4169104998_TS-USERS_FNO-4_10nvidnv", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=5158559, f_bfree=2226212, f_bavail=1964171, f_files=1310720, f_ffree=1268869, f_fsid={1982750833, 2010996514}, f_namelen=255, f_frsize=4096})
open("/u01/app/oracle/backup/data_data_D-TESTSE_I-4169104998_TS-USERS_FNO-4_10nvidnv", O_RDWR|O_DSYNC) = 11
fcntl(11, F_DUPFD, 256) = 257
close(11)               = 0
fcntl(257, F_SETFD, FD_CLOEXEC) = 0
fstatfs(257, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=5158559, f_bfree=2226212, f_bavail=1964171, f_files=1310720, f_ffree=1268869, f_fsid={1982750833, 2010996514}, f_namelen=255, f_frsize=4096}) = 0
WAIT #140195143601328: nam='Disk file operations I/O' ela= 182 FileOperation=2 fileno=4 filetype=2 obj#=75337 tim=2013-01-17 14:29:43:109050
WAIT #140195143601328: nam='Disk file operations I/O' ela= 160 FileOperation=2 fileno=4 filetype=2 obj#=75337 tim=2013-01-17 14:29:43:109150
There are next two i/o related calls and above we can see that backup file has been opened , handler has been duplicated and fstatfs call has been executed. There is another mystery here I have seen couple of system calls and then two entries from Oracle trace files – not sure why it look like this – as we are dealing here on microsecond level it is hard to match tim / ela from Oracle trace file with actuall OS calls from strace – but I will try to do it later - in next post probably. Right now we now that Oracle process opened backup file and it is ready to read from it. 
Query is using full scan on table so first of all Oracle need to read segment header
sendmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\214\252\16\236\235\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364,

V3 READ Call (Reply In 44853), FH:0x8f3a5fe4 Offset:256131072 Len:8192,44846,2013-01-17 14:29:43.109264,192.168.1.60,192.168.1.20,NFS,210

poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\235\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0 \0\0\0\0\0\0004\24\0\0\0\0\0\0!z\0\1\1\0\0\0 z\0\1!z\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0I&\1\0\0\0\0\20 z\0\1\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\

V3 READ Reply (Call In 44846) Len:8192,44853,2013-01-17 14:29:43.144922,192.168.1.20,192.168.1.60,NFS,1150

recvmsg(32, 0x7fffe701bc20, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)

WAIT #140195143601328: nam='db file sequential read' ela= 36202 file#=4 block#=31266 blocks=1 obj#=75337 tim=2013-01-17 14:29:43:145410
We can see that process is using sendmsg function sending NFS READ call (captured by wireshark) and it is waiting for answer. Due to my NFS and network configuration respond has been split into two recvmsg system calls but it is one NFS READ Reply – one database block ( 8kB) has been read and this action has been registered in trace file as db file sequential read. Now process is going to execute full scan and read blocks from table
sendmsg(32, {msg_name(0)=NULL, msg_iov(2)=[{"\200\0\0\214\252\16\236\237\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364,V3 READ Call (Reply In 44888), FH:0x8f3a5fe4 Offset:256139264 Len:32768  ; 
V3 READ Call (Reply In 44888), FH:0x8f3a5fe4 Offset:256139264 Len:32768
V3 READ Call (Reply In 44862), FH:0x8f3a5fe4 Offset:256172032 Len:8192
poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\240\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0,
V3 READ Reply (Call In 44855) Len:8192,44862,2013-01-17 14:29:43.158250,192.168.1.20,192.168.1.60,NFS,1150,,,,
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\200\200\252\16\236\237\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
V3 READ Reply (Call In 44855) Len:32768,44888,2013-01-17 14:29:43.159262,192.168.1.20,192.168.1.60,NFS,1110,,,,
recvmsg(32, 0x7fffe7016aa0, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)
WAIT #140195143601328: nam='db file scattered read' ela= 13870 file#=4 block#=31267 blocks=5 obj#=75337 tim=2013-01-17 14:29:43:159580
There are two NFS READ calls and two NFS READ Reply and 5 blocks has been read – it exactly match Oracle trace file entry – db file scattered read – with blocks equal to 5. So this scenario is straight forward – Oracle is going directly to NFS to read changed blocks – but still I can’t be sure how it decide to go there. Let’s check now what will happen when we try to read unchanged blocks.

Read unchanged blocks.  

Oracle is executing query – select * from scott.salgrade – and here is a trace / strace / wireshark output. Process is executing table full scan so it need to read segment header first
sendmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\214\252\16\236\245\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364\16\216\312.\252\227\0\0\0\0\0\0\0\0\25`\ 
V3 READ Call (Reply In 44938), FH:0x8f3a5fe4 Offset:1327104 Len:8192,44930,2013-01-17 14:29:55.162780,192.168.1.60,192.168.1.20,NFS,210    
sendmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0X\252\16\236\246\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\0\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0", 92}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 92
poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\245\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\0\0
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 7996}], msg_con
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\0\30\252\16\236\246\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\0\0
V3 READ Reply (Call In 44930) Len:8192 44938 2013-01-17 14:29:55.163827,192.168.1.20,192.168.1.60,NFS,1178
recvmsg(32, 0x7fffe701bc20, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)         
WAIT #140195143601328: nam='db file sequential read' ela= 1323 file#=4 block#=162 blocks=1 obj#=75338 tim=2013-01-17 14:29:55:164020
At first sight nothing has been changed – Oracle is doing NFS READ call and got a response. But if you will take a look on returned data – database block is full of zeros. It isn’t a surprise as NFS files are sparse one and it should contain only changed blocks. That read activity has been recorded in trace file as db file sequential read but we still have no data - so what is going next
pread(257, "#\242\0\0\242\0\0\1\250\34\17\0\0\0\1\4\317\215\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\10\0\0\0\234\n\0\0\0\0\0\0\10\0\0\0\10\0\0\0\250\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\10\0\0\0\250\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\240\0\0\1"..., 8192, 1327104) = 8192
WAIT #140195143601328: nam='Disk file operations I/O' ela= 49 FileOperation=9 fileno=4 filetype=2 obj#=75338 tim=2013-01-17 14:29:55:164170
Disk read is performed on file with handler 257 and as we can check in first strace output this file is a backup data file. Oracle is reading segment data from here and recorded that operation in trace file as ‘Disk file operations I/O’. Based on that behaviour I think there is no map of changed blocks but process is reading NFS first and if NFS will return zeroed block it is reading backup data file. Let see rest of full scan operation.
sendmsg(32, {msg_name(0)=NULL, msg_iov(2)=[{"\200\0\0\214\252\16\236\250\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\3\0\0\0\6\0\0\0\1\0\0\0000\253\315\357\253\0\0\0\24oraclese.localdomain\0\0\3241\0\0\3241\0\0\0\2\0\0\3241\0\0\3242\0\0\0\0\0\0\0\0\0\0\0$\1\0\7\1\1\240\202\3\0\0\0\0\222\364\16\216\312.\252\227\0\0\0\0\0\0\0\0\25`\ 
V3 READ Call (Reply In 44972), FH:0x8f3a5fe4 Offset:1335296 Len:32768  ; V3 READ Call (Reply In 44978), FH:0x8f3a5fe4 Offset:1368064 Len:8192
poll([{fd=32, events=POLLIN}], 1, 500) = 1 ([{fd=32, revents=POLLIN}])
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0\200\200\252\16\236\250\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32572}], msg_co
V3 READ Reply (Call In 44940) Len:32768 44972,2013-01-17 14:29:55.168776,192.168.1.20,192.168.1.60,NFS,1514    
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\200\0 \200\252\16\236\251\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\201\240\0\0\0\1\0\0\3241\0\0\3241\0\0\0\0\21X \0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\227\252.\312\216\16\364\222\0\0\0\0\4\207`\25P\370\n[\0\0\0\0P\370\nZ\0\0\0\0P\370\nZP\366\241\37\0\0
recvmsg(32, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 7996}], msg_con
V3 READ Reply (Call In 44940) Len:8192,44978,2013-01-17 14:29:55.168790,192.168.1.20,192.168.1.60,NFS,746    
recvmsg(32, 0x7fffe7016aa0, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)           
WAIT #140195143601328: nam='db file scattered read' ela= 4762 file#=4 block#=163 blocks=5 obj#=75338 tim=2013-01-17 14:29:55:169040
Once again process send to NFS READ calls (even if segement header wasn’t on NFS) and again only zeroed blocks has been send back. Whole operation is recorded as db file scattered read but actually process still has no data.
pread(257, "\6\242\0\0\243\0\0\1\250\34\17\0\0\0\1\4\202\0\0\0\1\0\0\0J&\1\0\247\34\17\0\0\0\0\0\2\0002\0\240\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\16\0\230\37\212\37\212\37\0\0\0\0@\0\6\0\0\0`\37\0\0"..., 40960, 1335296) = 40 
WAIT #140195143601328: nam='Disk file operations I/O' ela= 6883 FileOperation=9 fileno=4 filetype=2 obj#=75338 tim=2013-01-17 14:29:55:176020      
Again process is reading file 257 (backup data file) and it is recording that activity as "Disk file operations I/O". Based on that observation I can say that Oracle is looking for data first on NFS and then in backup file. 

But it there any caching / learning algorithm in place ? I don’t think so cause when I run this query select * from dba_source where line like 'marcin' twice with flushing buffer cache between runs and execution pattern looks similar – there was NFS activity fallowed by reading backup disk file. 

This is end of part one and I will try to publish next post about same topic soon.

regards,
Marcin 

2 comments:

Anonymous said...

Cant wait for part two :).
Meantime could You provide some info about Your test enviroment ?
Regards
GregG

Kyle Hailey said...

cool analysis! Thanks for documenting your work here. Much appreciated.

- Kyle Hailey
dboptimizer.com