Friday, April 27, 2012

HCC on non-Exadata - How Oracle is detecting storage type

This is the next part of my HCC compression on non-Exadata series. When 11.2.0.3 was released, Oracle announced that HCC compression would be possible on the ZFS Storage Appliance and Pillar Axiom storage, and patch 13041324 was released as well. I blogged about it and was able to run HCC on the ZFS Appliance simulator and even on a default Linux NFS share. Some time later Oracle raised "Bug 13362079 - HCC compression should not be allowed on dNFS with non ZFS or Pillar", which was fixed in PSU 11.2.0.3.1 (patch 13343438). After applying that PSU I was no longer able to create an HCC compressed table.

I was wondering how Oracle checks the storage type, because as far as I know NFS itself doesn't expose that information. I compared Wireshark network dumps from the old and new Oracle versions and there was no difference. At that point I thought that maybe a new firmware upgrade for the ZFS Appliance was required, but unfortunately I couldn't download it and apply it to my simulator. After a discussion on Twitter with @GregRahn, @alexgorbachev, @AlexFatkulin and @kevinclosson, I was told that Oracle can use SNMP to check the storage type. That was a new idea, and I recalled that I had seen an SNMP related error in trace files but had ignored it, as it appeared both for 11.2.0.3 with the patch and for 11.2.0.3.1.

This time I decided to dig into it. This is the first error I found in the DBWR trace:
test_dbw0_11626.trc: [1332587314469093] skgnfs_setup_snmp:250: dlopen errno = 0, errstr = libnetsnmp.so: cannot open shared object file: No such file or directory
So either the shared library is not installed on the system (the server is running OEL 5.6) or a symlink is missing. Let's try with a new symlink:
ln -s /usr/lib64/libnetsnmp.so.10 /usr/lib64/libnetsnmp.so
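To confirm that the library now resolves, a quick sanity check can be done on the database host (a sketch, assuming the standard net-snmp RPM layout on OEL 5):
# is the net-snmp package installed and does the linker see the new symlink?
rpm -qa | grep net-snmp
ldconfig -p | grep libnetsnmp
ls -l /usr/lib64/libnetsnmp.so*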
Let's check the DBWR trace again - it looks better, the library has been found, but now the target host is not responding:
test_dbw0_3575.trc: [1334661892496086] skgnfs_query_snmp:1831: Timeout error 2 for server 10.10.10.60
I connected to my ZFS Appliance simulator and enabled SNMP using network=0.0.0.0/0 as the network filter. After restarting Oracle there were no more SNMP errors in the DBWR trace. I started Wireshark again and this is what was captured:
12:44:14.276823 IP 10.10.10.51.20671 > 10.10.10.60.snmp:  GetRequest(33)  E:sun.2.225.1.4.2.0
12:44:14.289691 IP 10.10.10.60.snmp > 10.10.10.51.20671:  GetResponse(59)  E:sun.2.225.1.4.2.0=[|snmp]
Now we can use snmpwalk to check what the ZFS Appliance simulator returns for this SNMP request:
[root@dg1 mibs]# snmpwalk -O n -v 1 -c public 10.10.10.60 .1 | grep 225
.1.3.6.1.4.1.42.2.225.1.4.1.0 = STRING: "sunstore"
.1.3.6.1.4.1.42.2.225.1.4.2.0 = STRING: "Sun ZFS Storage VirtualBox"
.1.3.6.1.4.1.42.2.225.1.4.3.0 = STRING: "2011.04.24.1.0,1-1.8"
.1.3.6.1.4.1.42.2.225.1.4.4.0 = Timeticks: (938601700) 108 days, 15:13:37.00
.1.3.6.1.4.1.42.2.225.1.4.5.0 = Timeticks: (938601700) 108 days, 15:13:37.00
.1.3.6.1.4.1.42.2.225.1.4.6.0 = Timeticks: (329800) 0:54:58.00
.1.3.6.1.4.1.42.2.225.1.4.7.0 = STRING: "f2513e14-f8c2-6d7e-fc29-bbd8078aad24"
.1.3.6.1.4.1.42.2.225.1.4.8.0 = STRING: "unknown"
.1.3.6.1.4.1.42.2.225.1.4.9.0 = STRING: "Oracle 000-0000"
.1.3.6.1.4.1.42.2.225.1.5.1.0 = STRING: "AKCS_UNCONFIGURED"
.1.3.6.1.4.1.42.2.225.1.6.1.2.1 = STRING: "zfspool/default/zfstest"
.1.3.6.1.4.1.42.2.225.1.6.1.3.1 = STRING: "zfspool"
.1.3.6.1.4.1.42.2.225.1.6.1.4.1 = STRING: "default"
.1.3.6.1.4.1.42.2.225.1.6.1.5.1 = STRING: "zfstest"
.1.3.6.1.4.1.42.2.225.1.6.1.6.1 = STRING: "/export/zfstest"
.1.3.6.1.4.1.42.2.225.1.6.1.7.1 = Counter32: 7
.1.3.6.1.4.1.42.2.225.1.6.1.8.1 = Counter32: 0
.1.3.6.1.4.1.42.2.225.1.6.1.9.1 = Counter32: 7
Oracle is asking for sun.2.225.1.4.2.0, which matches the OID .1.3.6.1.4.1.42.2.225.1.4.2.0 above. That OID returns "Sun ZFS Storage VirtualBox", which is the type of the NFS server. I suspect the word "VirtualBox" in the name is what fails the check. To confirm that, I googled for screenshots and found this link. On page 11 there is the information from a physical ZFS Appliance that I was looking for - the name looks like "Sun ZFS Storage 7xxx".
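To double-check which value Oracle actually sees, the single OID from the trace can also be queried directly (same community string and simulator address as in the snmpwalk above):
# ask only for the OID Oracle requests - it should return the same
# "Sun ZFS Storage VirtualBox" string as in the snmpwalk output above
snmpget -O n -v 1 -c public 10.10.10.60 .1.3.6.1.4.1.42.2.225.1.4.2.0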
I hope Oracle will accept the simulator name as a valid name for HCC. In my opinion it could be used for HCC evaluation, much like the dbms_compression package but on a different scale. From a business perspective no one will run a production workload on the simulator.

But is there any way to test it now? The appliance type is not configurable during the simulator installation process, but there are two options:
- libnetsnmp.so is an open source library, so it could be modified to return the expected value, but that is the hard way
- the other possibility is to realize that the Sun ZFS Appliance simulator is just a Solaris box; there is a chance that, as with many other parameters on UNIX, the type name is stored in a text file, and if you can access that file you can change it. After a few tries I was able to change it, and my simulator now returns the values below. I did this for educational purposes only, and I'm not sure if I can share all the steps I took.
In the end SNMP returns a different name:
[root@dg1 ~]# snmpwalk -O n -v 1 -c public 10.10.10.60 .1 | grep 225
.1.3.6.1.4.1.42.2.225.1.4.1.0 = STRING: "sunstore"
.1.3.6.1.4.1.42.2.225.1.4.2.0 = STRING: "Sun ZFS Storage 7420"
Now it is time to try HCC in version 11.2.0.3.1 against the reconfigured ZFS simulator - and YES, it is working again.
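A check along these lines can be used to verify it (a minimal sketch - zfstest_ts is a made-up name for a tablespace created on the dNFS-mounted share):
sqlplus -s / as sysdba << EOF
-- create an HCC compressed table on the dNFS based tablespace
create table hcc_test
  compress for query high
  tablespace zfstest_ts
  as select * from dba_objects;

-- COMPRESS_FOR should show QUERY HIGH if HCC was accepted
select compression, compress_for from dba_tables
 where table_name = 'HCC_TEST';
exit;
EOF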

Disclaimer:
According to the Oracle license, on non-Exadata systems Hybrid Columnar Compression can be used only on ZFS Storage Appliance and Pillar Axiom storage. This post is for education purposes only - to show how the database detects the storage type and how to enable HCC if you have the proper hardware.


regards,
Marcin

Thursday, April 19, 2012

Oracle and HugePages

I recently had a very bad experience running Oracle with quite a big SGA (60 GB) on RedHat 5.6 without HugePages. The host became completely unresponsive and I wanted to find the root cause.
I used the following test configuration:
Host: 96 GB RAM, 2 sockets, 12 cores, 24 threads
Oracle: 11.2.0.2
SGA_TARGET = 60 GB

I also set pre_page_sga to make sure that all SGA memory is allocated during instance startup.
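For reference, a HugePages setup for an SGA of this size looks roughly like this (a sketch only - the page count and memlock values below are derived from the 60 GB SGA and are not copied from the actual server; the meminfo output further down shows the box really had 35893 pages reserved):
# /etc/sysctl.conf - reserve 2 MB pages for the SGA: 60 GB / 2 MB = 30720, plus some headroom
vm.nr_hugepages = 30800

# /etc/security/limits.conf - allow the oracle user to lock that much memory (values in kB)
oracle soft memlock 63078400
oracle hard memlock 63078400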

I started the instance without HugePages and this is the output from /proc/meminfo:
testbox1$ cat /proc/meminfo
MemTotal:     98999832 kB
MemFree:      21527276 kB
Buffers:        933668 kB
Cached:       69720980 kB
SwapCached:          0 kB
Active:       64168548 kB
Inactive:      6802180 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     98999832 kB
LowFree:      21527276 kB
SwapTotal:     2096472 kB
SwapFree:      2096472 kB
Dirty:            1548 kB
Writeback:           0 kB
AnonPages:      316056 kB
Mapped:       62708540 kB
Slab:           599964 kB
PageTables:    3679796 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  50572388 kB
Committed_AS: 64086964 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    264440 kB
VmallocChunk: 34359473527 kB
HugePages_Total:  1000
HugePages_Free:   1000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
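The interesting line here is PageTables. A rough back-of-the-envelope calculation shows why it is so large (my own arithmetic - the session count at the end is just an illustration, nothing I measured on this box):
# every dedicated server process that touches the whole SGA builds its own page table entries
SGA_GB=60; PAGE_KB=4; PTE_BYTES=8
PAGES=$(( SGA_GB * 1024 * 1024 / PAGE_KB ))          # 15,728,640 x 4 KB pages
PER_PROC_MB=$(( PAGES * PTE_BYTES / 1024 / 1024 ))   # ~120 MB of page tables per process
echo "$PAGES pages, ~$PER_PROC_MB MB of page tables per process"
# around 30 sessions mapping the full SGA already add up to the ~3.5 GB seen above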
As you can see, the Linux kernel used around 3.5 GB of memory just for the page table structures needed to map the SGA with default 4 KB pages (the PageTables line). Now let's check the same instance with HugePages:
testbox1$ cat /proc/meminfo
MemTotal:     98999832 kB
MemFree:      14360908 kB
Buffers:        916216 kB
Cached:        9453272 kB
SwapCached:          0 kB
Active:        2191376 kB
Inactive:      8286108 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     98999832 kB
LowFree:      14360908 kB
SwapTotal:     2096472 kB
SwapFree:      2096472 kB
Dirty:            3696 kB
Writeback:           0 kB
AnonPages:      230380 kB
Mapped:          76648 kB
Slab:           594156 kB
PageTables:      14368 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  14841956 kB
Committed_AS:  1102392 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    264404 kB
VmallocChunk: 34359473527 kB
HugePages_Total: 35893
HugePages_Free:   5245
HugePages_Rsvd:     73
Hugepagesize:     2048 kB
Now it is much better - PageTables is around 15 MB. Compared with the 3.5 GB from the previous output, this proves that a non-HugePages environment wastes a lot of memory. But wasted memory is not the biggest issue here. Let's try to connect to the database and run a simple query:
testbox1$ time sqlplus -s / as sysdba << EOF 
select * from dual;
> exit; 
> EOF

D
-
X


real    0m9.645s
user    0m0.006s
sys     0m0.007s
It took 9.645 seconds to connect and run the select - so where was all that time spent?
Using strace to find the answer wasn't a 100% successful attempt - there is a gap of almost 10 seconds between the chdir and mmap calls with no system call in between, so it looks like the time is spent on CPU:
15:35:57.494002 stat("/opt/app/oracle/product/11.2.0.2/db1/lib", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0 <0.000008>
15:35:57.494044 chdir("/opt/app/oracle/product/11.2.0.2/db1/dbs") = 0 <0.000009>
15:36:07.105693 mmap(NULL, 143360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b5320e95000 <0.000014>
15:36:07.105758 mmap(NULL, 143360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b5320eb8000 <0.000007>
15:36:07.105831 mmap(NULL, 143360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b5320edb000 <0.000008>
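For the record, a trace like the one above can be captured with strace's timing options (the output file name is just an example):
# -tt prints wall clock timestamps, -T the time spent in each call, -f follows forks
strace -f -tt -T -o /tmp/sqlplus_connect.trc sqlplus -s / as sysdba << EOF
select * from dual;
exit;
EOF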
Maybe this is a problem for local (bequeath) connections only. Let's try it via the listener:
testbox1$ time sqlplus -s user1/user1@log2perf << EOF 
select * from dual;
> exit;
> EOF

D
-
X

real    0m14.340s
user    0m2.300s
sys     0m0.023s
This time it was even worse.

Same tests with HugePages:
testbox1$ time sqlplus -s / as sysdba << EOF 
select * from dual;
> exit;
> EOF

D
-
X

real    0m0.437s
user    0m0.014s
sys     0m0.001s

testbox1$ time sqlplus -s user1/user1@log2perf << EOF 
select * from dual;
> exit;
> EOF

D
-
X

real    0m2.547s
user    0m2.247s
sys     0m0.016s
With HugePages the connection time is much faster (7 to 20 times). So the first pitfall of a non-HugePages configuration with a big SGA is a connection time that is much longer than with HugePages.

For more information about connection time see update below.

I decided to go further and check database performance once a session has been established. To do so I used Kevin Closson's SLOB kit to measure the number of logical reads per second, but I had to make some modifications to this great tool.

The first modification was related to the non-HugePages environment. SLOB takes AWR snapshots before and after running the workload and ends up with an AWR report. As connection time is an issue in the non-HugePages environment, I had to make sure that SLOB takes both AWR snapshots from a single session. I added one more script, started by the semaphore like any other worker, which takes two AWR snapshots with a 10 second gap between them. To increase the workload time I also increased the number of SQL executions in the reader loop from 5000 to 25000. That increased the average run time from about 4 seconds to 18-20 seconds and lets my session take both AWR snapshots.

New stats.sql script
set serveroutput off

HOST ./mywait
exec dbms_lock.sleep(2);
exec dbms_workload_repository.create_snapshot
exec dbms_lock.sleep(10);
exec dbms_workload_repository.create_snapshot
exit
runit.sh
#!/bin/bash

if [ -z "$2" ]
then

        echo "${0}: Usage : ${0}  "
        exit
else

        WU=$1
        RU=$2
fi

./create_sem > /dev/null 2>&1

cnt=1
until [ $cnt -gt $RU ]
do
        ( sqlplus -s user${cnt}/user${cnt} @reader > /dev/null 2>&1 ) &
        (( cnt = $cnt + 1 ))
done

until [ $cnt -gt $(( WU + RU )) ]
do
        ( sqlplus -s user${cnt}/user${cnt} @writer > /dev/null 2>&1 ) &
        (( cnt = $cnt + 1 ))
done

        ( sqlplus -L / as sysdba @stats.sql > /dev/null 2>&1 ) &
echo "start sleeping"
# sleep longer to allow all sessions to connect
sleep 120 
#sleep 20
# comment old awr_snap
#sqlplus -L '/as sysdba'  @awr/awr_snap > /dev/null
echo "running slob"

B=$SECONDS
./trigger > /dev/null 2>&1

wait

(( TM =  $SECONDS - $B ))

echo "Tm $TM"

# comment old awr_snap
#sqlplus -L '/as sysdba'  @awr/awr_snap > /dev/null
echo "running report"
sqlplus -L '/as sysdba'  @awr/create_awr_report > /dev/null
After tests with the default SLOB tables I decided to increase the table size to force Oracle to use more memory for caching data blocks. I made two changes: in setup.sh the loop in the cr_seed() function was increased from 10000 to 200000 rows, and in reader.sql the random range was extended from 10000 to 200000.

setup.sh
function cr_seed () {

sqlplus -s user1/user1 <<EOF 
set echo on

CREATE TABLE seed
(
custid number(8),
c2 varchar2(128),
c3 varchar2(128),
c4 varchar2(128),
c5 varchar2(128),
c6 varchar2(128),
c7 varchar2(128),
c8 varchar2(128),
c9 varchar2(128),
c10 varchar2(128),
c11 varchar2(128),
c12 varchar2(128),
c13 varchar2(128),
c14 varchar2(128),
c15 varchar2(128),
c16 varchar2(128),
c17 varchar2(128),
c18 varchar2(128),
c19 varchar2(128),
c20 varchar2(128)
) PARALLEL PCTFREE 0 tablespace $TABLESPACE;

DECLARE
x      NUMBER :=1;
fluff  varchar2(128) := 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';

BEGIN
FOR i IN 1..200000 LOOP
        insert into seed values (x,fluff, NULL, NULL, NULL, NULL,
        NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, fluff);
        x := x + 1;

END LOOP;
COMMIT;
END;
/
exit;

EOF
}
reader.sql
set serveroutput off

HOST ./mywait

DECLARE
x NUMBER := 0;
v_r PLS_INTEGER;

BEGIN
dbms_random.initialize(UID * 7777);

FOR i IN 1..25000 LOOP
        v_r := dbms_random.value(257, 200000);
        SELECT COUNT(c2) into x FROM cf1 where custid >  v_r - 256 AND  custid < v_r;
END LOOP;

END;
/
exit 

After all this work I ran my tests. To get more reliable results I ran every test 40 times. You can see the results in the table and charts below.

Configuration                Avg (LIO/s)     Median          Std dev
NonHuge Pages, small table   8,865,085.94    8,892,305.40    185,926.15
NonHuge Pages, big table     8,276,726.18    8,298,340.70    175,300.08
Huge Pages, small table      9,398,380.94    9,377,956.80    263,674.90
Huge Pages, big table        8,575,646.65    8,597,997.20    180,219.52
LIO/s for 40 test runs - linear view

LIO/s for 40 test runs - radar view


For small tables Oracle performs about 6% more logical I/O per second in the HugePages configuration than in the non-HugePages one; for big tables the difference drops to 3.6%, but in both cases the HugePages configuration is better.

Finally, I have to point out that I didn't measure session private memory utilization (heap and PGA) during these tests, but it has to be taken into consideration if you are going to implement HugePages on your system.

regards,
Marcin

Update:

After a morning Twitter conversation with Yury Velikanov (@yvelikanov) I learned that he didn't see connection problems with a non-HugePages configuration. I ran pstack during the connection phase and here is the output:
#0  0x000000000498a2cd in ksmprepage_granule ()
#1  0x0000000004990aca in ksmgapply_validgranules ()
#2  0x000000000498a01c in ksmprepage ()
#3  0x0000000000aff997 in ksmlsge_phasetwo ()
#4  0x0000000000aff1cf in ksmlsge ()
#5  0x0000000000aff1ab in ksmlsg ()
#6  0x00000000017b547b in opiino ()
#7  0x0000000009006cba in opiodr ()
#8  0x00000000017ac9ec in opidrv ()
#9  0x0000000001e61c93 in sou2o ()
#10 0x0000000000a07a65 in opimai_real ()
#11 0x0000000001e6713c in ssthrdmain ()
#12 0x0000000000a079d1 in main ()
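For reference, a stack like this can be grabbed while the connection is still hanging (a sketch - "test" is the instance name used throughout this post, and pstack may need to be run a few times to catch the interesting phase):
( sqlplus -s / as sysdba << EOF
select * from dual;
exit;
EOF
) &
sleep 2
# the bequeath connection forks a dedicated server process named oracle<SID>
pstack $(pgrep -f oracletest | head -1)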
ksmprepage gave me the idea that the long connection time is related to pre_page_sga and how Oracle handles it. The connection time is proportional to the number of pages to check, and of course in a non-HugePages configuration the number of pages to check in that loop is much bigger. Why a loop? You can find this described in one of the notes on MOS.

When I removed the pre_page_sga parameter, connection times for the non-HugePages and HugePages configurations were similar. Now it is time to rerun the tests and compare the number of LIO/s with and without pre_page_sga. I will update this post soon.
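If you want to repeat the check, the parameter can be verified and removed like this (a sketch - pre_page_sga is a static parameter, so the instance has to be restarted for the change to take effect):
sqlplus -s / as sysdba << EOF
show parameter pre_page_sga
alter system set pre_page_sga=false scope=spfile;
shutdown immediate
startup
exit;
EOF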

Update 2 - 25 Apr 2012
I have tested the number of LIO/s with pre_page_sga set to true and false, for both the non-HugePages and HugePages environments.

Configuration                      Avg (LIO/s)     Median          Std dev
NonHuge Pages, pre_page = false    8,072,479.21    8,096,358.30    180,844.09
NonHuge Pages, pre_page = true     8,190,746.98    8,212,969.55    141,426.52
Huge Pages, pre_page = false       8,105,724.40    8,105,408.50    165,420.39
Huge Pages, pre_page = true        8,735,329.93    8,744,138.20    187,104.66

There is no big difference between pre_page_sga set to true and false in the non-HugePages environment - around 1.5%. For HugePages the difference is bigger - around 7.7% in my test. So after all these checks I think HugePages plus pre_page_sga is the winner, but remember to double check that you don't have a problem with connection time when the SGA is pre-allocated.


LIO/s for 40 test runs - linear view

LIO/s for 40 test runs - radar view