Wednesday, August 4, 2010

Column names and typing errors

Last few days I have hit myself into Oracle "feature" which was one of my favorite when I was helping developers solve strange Oracle issues. I recall one Oracle ANSI SQL syntax issue when Oracle didn't recognize that column names are duplicated in different tables and use a random one in output. This is why I always told people use column name with table name to avoid confusion. Anyway now I decided to blog about that and show why table.column_name syntax is so important.
Let's go through an example - In this syntax equal ?
select name1 from T1 where id1 in (select id1 from T2 where name2='NAME2');
select T1.name1 from T1 where T1.id1 in (select T2.id1 from T2 where name2='NAME2');
Looks OK isn't it ? But it really depend on how tables are defined. 
SQL> desc T1
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID1                                                NUMBER
 NAME1                                              VARCHAR2(100)

SQL> desc T2
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID2                                                NUMBER
 NAME2                                              VARCHAR2(100)

SQL> select id1, name1 from T1;

       ID1 NAME1
---------- -------------------------------------------------------
         1 NAME1

SQL> select id2, name2 from T2;

       ID2 NAME2
---------- -------------------------------------------------------
         2 NAME2
 
Now we can see that there is a typo in second query as there is no ID1 column in T2 table. But lets try to execute both. First we start a query with tablename.columnname syntax :
SQL> select T1.name1 from T1 where T1.id1 in (select T2.id1 from T2 where name2='NAME2');
select T1.name1 from T1 where T1.id1 in (select T2.id1 from T2 where name2='NAME2')
                                                *
ERROR at line 1:
ORA-00904: "T2"."ID1": invalid identifier
Looks OK Oracle found our error and query has not been executed. What about first example ?
SQL> select name1 from T1 where id1 in (select id1 from T2 where name2='NAME2');

NAME1
--------------------------------------------------------------------------------
NAME1

SQL>
It return one row and Oracle didn't find our typing error. Even worse IN filter is true for every T1 row but it should not - if you check rows in T1 and T2 where is no common value between  ID1 and ID2.
Why ? We are going to check if ID1 from T1 is included in set of ID1 from T2. But there is no ID1 in T2 so Oracle is using ID1 from T1 as in every correlated subquery. What this subquery return ? It will return ID1 from table T1 for every row in table T2 where name is equal to 'NAME2'. This is not what was excepted.

But is this subquery can be executed as separate query ? Of course not
SQL> select id1 from T2 where name2='NAME2';
select id1 from T2 where name2='NAME2'
       *
ERROR at line 1:
ORA-00904: "ID1": invalid identifier

I'm not sure if that problem can be classified as bug but at least is good to know how Oracle is using columns names and use tablename.columnname syntax to avoid confusion and increase code quality. Especially if there are tables with long column names and there is one letter difference between them.

regards,
Marcin

Thursday, July 29, 2010

Process hung on library cache lock

Customer reported a system slowdown/process hung during small and well defined period of time and it was related to one specific process only. Standard Oracle 9i performance analyze tool - Statspack didn't show anything unusual in that time but shapshots were taking very 30 min and this issue took about 5 minutes.
As we could not replicate issue I thought about SASH (a free simulator of Oracle Active Session History) and I have installed it and configured. From that time I was able to gather and keep information about active sessions.
When next time I got detailed information about similar issue I was able to use collected data and found a problem. After investigation I figured out that process used for that business functionality stuck on “library cache lock” event. When I compared my findings from SASH with Statspack I realize why I didn't spotted it before - "library cache lock" was about 1.8 % of all locks between statspack snapshots. From overall database review there was no problem but from process (and business) perspective that lock hung important process for almost 5 minutes.

Below I described all major steps of my investigation and what was a root cause.
Let's start with SASH output (I just left 3 lines but there was many more similar lines):
SQL> select sample_time,program, session_id, session_state, event, seq#, sql_id, '0' || trim(to_char(p1,'XXXXXXXXXXXXXXXXX')) "p1raw", p3
  2  from v$active_session_history
  3  where 
  4  sample_time >= to_date('18-7-2010 13:50:10','dd-mm-yyyy hh24:mi:ss')
  5  and sample_time <= to_date('18-7-2010 13:54:58','dd-mm-yyyy hh24:mi:ss')
  6  and event like 'library%'
  7  order by sample_time desc
  8  /

SAMPLE_TI PROGRAM        SESSION_ID SESSION_STATE EVENT              SEQ#  SQL_ID     p1raw             P3
--------- -------------- ---------- ------------- ------------------ ----- -------    -----------       ---
18-JUL-10 importantp.exe        472 WAITING       library cache lock  214  1548901218 07000002AC789A98  210
18-JUL-10 otherprogr.exe        436 WAITING       library cache lock    8  2855865705 07000002AC789A98  210
18-JUL-10                       375 WAITING       library cache lock   33   261624711 07000002AC789A98  310

First idea was about parsing problem but when I took a look on SQL it looked OK but it was just entry point to PL/SQL procedure. Next thing was to check event parameters and there was first surprise. According to Oracle documentation and Metalink note 34579.1 that event has following parameters:
  • p1 is an address of required object - in my case 07000002AC789A98
  • p3 is a lock type and namespace type (lock mode * 100 + namespace)  - in my case 310 or 210
In Oracle definitions (metalink or decode in v$librarycache) namespace number 10 is not defined. See whole list below
•    0 SQL Area
•    1 Table / Procedure / Function / Package Header
•    2 Package Body
•    3 Trigger
•    4 Index
•    5 Cluster
•    6 Object
•    7 Pipe
•    13 Java Source
•    14 Java Resource
•    32 Java Data
So what object it is ? P1 had a address and X$KGLOB has a list of all objects (I checked that table next day but there was no restart and it was a chance that there is still that same object in shared pool as session was running 24/7):
SQL> select kglnaown "Owner", kglnaobj "Object" from x$kglob where kglhdadr='07000002AC789A98'; 

Owner  Object
------ --------
SOFT   SNDQ
What type of objects is it ?
SQL> select object_type, object_name from dba_objects where owner='SOFT' and object_name = 'SNDQ';

OBJECT_TYPE OBJECT_NAME
----------- -----------
QUEUE       SNDQ
Bingo - object was a queue there was a dequeue in that PL/SQL code. My process called importantp.exe was waiting for lock on queue. Let's check lock type:
  • 2    - Share mode 
  • 3    - Exclusive mode 
For importantp.exe lock mode in p3 parameter was 210 which mean wait for shared lock. But queuing and dequeueing operations are running with shared locks so reason for waiting had to be related to none of them. I found it in last line of my listing there was another processes waiting for exclusive lock on that same queue. Using SQL_ID (261624711)  I discovered that there was another PL/SQL piece of code and add_subscriber function was called inside. 
That was a explanation of importantp.exe hungs - as dequeue function was called after add_subscriber function it had to wait until DDL command was finished. Add subscriber function had to wait until all previous operations in shared mode will be completed and there was typical chain of locks.

There are bugs in Metalink raised by other people describing similar situations and there is only one workaround do not try any DDL on queues during traffic peaks. This recommendation is well know and most DBAs preventing database from typical DDL operation ( like index rebuild or schema changes) during business hours. The problems starts with hidden  DDL commands used in application.


Sunday, July 18, 2010

OEM performance tab and Active Session History data.

Graphic representation of Active Session History data in Database Console or in Grid Control is causing some misunderstanding of how to read it. I have lot of questions about Y scale on OEM performance tab and now I decide to blog about it. I want to describe now OEM is using a ASH date and what we can do with this information. Anyway using SQL instead of OEM is more useful in performance tuning and can be easily focused on existing issue.

For other point of view knowledge of ASH give you opportunity to use free implementation of it developed by Kyle Hailey. (Thanks Kyle for your excellent scripts – this is great staff to learn from). By the way recently I have deployed SASH on big production database running on Oracle 9i and it help me to identify and address performance problems. As soon as I finish all these work I will blog about it too.

CPU utilization
CPU usage on performance tab is in green and in my case it has upper limit set to 2 – as I have two core CPU.


It is quite clear how to read CPU usage and what is a maximum CPU utilization. Scope of time which sessions can spend on CPU has an upper limit which is 100 % utilization of all CPU. It mean that maximum CPU time for session in sampling time can be calculated using following formula

Number of CPU x Sampling Time

So if we want to calculate a CPU usage we need to use following formula:

time consumed per all sessions / (Number of CPU x Sampling Time) * 100 %

ex.
time consumed per all sessions = 16 s
Number of CPU = 2
Sampling time = 15 s 

Util of all CPUs = 15 / (2 * 15) * 100 % = 53.3 %

How to calculate CPU utilisation using Active Session History data instead of OEM performance tab ?

Active Session History table is a results of sampling V$SESSION view every second. That mean there is no “real” waiting or executing time in V$ACTIVE_SESSION_HISTORY but just samples. We are assuming that every state in ASH view (WAITING or ON CPU) took exactly 1 s. This is important as we don’t sum waiting time from ASH view but we have to count how many 1 s events were in observed time.
Be aware that DBA_HIST_ACTIVE_SESS_HISTORY has only 1/10 data of  V$ACTIVE_SESSIN_HISTORY (only sample_id modulo 10 are copied into it) and each sample is now 10s long if we want to use it for our calculations.

Following query will calculate average CPU utilization for two minutes:
SQL> select trunc(count(*)/
  2  (min(p.value)*(max(sample_id) - min (sample_id)))*100,2) "CPU_UTIL",
  3  max(sample_id) - min (sample_id) "sampling time [s]",
  4  count(*) "time consumed per sessions [s]",
  5  min(p.value) * (max(sample_id) - min (sample_id)) "maximum possible time [s]"
  6  from v$active_session_history,
  7  (select value from v$parameter  where name = 'cpu_count') p
  8  where 1=1
  9  and sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
 10  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
 11  and session_state = 'ON CPU';

  CPU_UTIL sampling time [s] time consumed per sessions [s] maximum possible time [s]
---------- ----------------- ------------------------------ -------------------------
     29.23               118                             69                       236


But this doesn’t help us in understanding OEM graph. Refresh rate is set to 15 sec so we need to display data from that period split into 15 sec slices.
SQL> select trunc(to_char(sample_time,'SSSSS')/15) "15s_sample", trunc(count(*)/
  2  (min(p.value)*15)*100,2) "CPU_UTIL",
  3  count(*) "time consumed per sessions [s]"
  4  from v$active_session_history,
  5  (select value from v$parameter  where name = 'cpu_count') p
  6  where 1=1
  7  and sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
  8  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
  9  and session_state = 'ON CPU'
 10  group by trunc(to_char(sample_time,'SSSSS')/15)
 11  order by 1;

15s_sample   CPU_UTIL time consumed per sessions [s]
---------- ---------- ------------------------------
      2972       6.66                              2
      2973       6.66                              2
      2974      53.33                             16
      2975      23.33                              7
      2976      36.66                             11
      2977      26.66                              8
      2978      33.33                             10
      2979      43.33                             13

8 rows selected.

Now if you use any graphic tool to create a chart using 15s_sample column as X axis and "CPU util" column as Y axis you should get very similar picture to OEM chart.

Wait events

There is no easily calculated upper limit for waiting sessions. Why ? Number of waits per sample is depended on session state and number of waiting sessions. So what is a maximum for that ? Should we use only active sessions or all connected sessions ? The worst case scenario is all sessions are active and all are in waiting state. Both values multiplied each other give us a potential upper limit. 

Performance tab in OEM is calculating a average of waits in sample time divided by groups and all groups are displayed using stacked area graph.

Following SQL return data used to create following graphs
SQL> select trunc(to_char(sample_time,'SSSSS')/15) "15s_sample", wait_class, count(*)/15
  2  from v$active_session_history
  3  where sample_time >= to_date('18-07-2010 12:23:00','dd-mm-yyyy hh24:mi:ss')
  4  and sample_time <= to_date('18-07-2010 12:25:00','dd-mm-yyyy hh24:mi:ss')
  5  and session_state = 'WAITING'
  6  group by trunc(to_char(sample_time,'SSSSS')/15), wait_class order by 1,3
  7  ;

15s_sample WAIT_CLASS                                                       COUNT(*)/15
---------- ---------------------------------------------------------------- -----------
      2974 Other                                                                     .2
      2974 System I/O                                                        .533333333
      2974 Commit                                                                     2
      2974 User I/O                                                          2.06666667
      2974 Concurrency                                                       4.66666667
      2975 Application                                                       .066666667
      2975 System I/O                                                        .933333333
      2975 User I/O                                                                 3.8
      2975 Commit                                                            4.93333333
      2976 Concurrency                                                       .066666667
      2976 Application                                                       .066666667
      2976 Other                                                                     .4
      2976 System I/O                                                               1.2
      2976 Commit                                                                   5.4
      2976 User I/O                                                                 5.8
      2977 System I/O                                                        .866666667
      2977 User I/O                                                          2.93333333
      2977 Commit                                                                   5.6
      2978 Concurrency                                                       .066666667
      2978 System I/O                                                        .933333333
      2978 User I/O                                                          3.26666667
      2978 Commit                                                                     4
      2979 System I/O                                                        .866666667
      2979 User I/O                                                          1.46666667
      2979 Commit                                                            2.53333333

25 rows selected.
As results we have average time spend by all sessions divided per wait class for a 1 second.  What it mean that in sample 2976 wait class “Commit” average time was 5.4 ? It exactly mean that in that 15 seconds sampling time all active sessions were 5.4*15 = 81 times waiting for events in commit class. Is it bad or not ? This is a different question and will not be answered in that post.

Now if we take a look on sample nr 2976
2976 Application                                                       .066666667
2976 Other                                                                     .4
2976 System I/O                                                               1.2
2976 Commit                                                                   5.4
2976 User I/O                                                                 5.8

Sum of all waits give us 12.8 and this is more less what we can see as a top value on graph. A differences between graph and our calculation are related to observation period start time and how we split results into 15 sampling time. We are using a group by function based on sample_time truncated into 4 periods of seconds - using group by to_char function (based on Kyle idea).These periods are: 0 – 15, 15- 30 , 30-45, 45-60 but OEM can display data based on different 15 sec groups ex. 10 - 25, 25 - 40 so small differences can occure.

Now the question is – is this a bad thing to have high numbers of OEM graph ? And like always when we are talking about Oracle performance an answer is depends. 
What is bad for sure if our CPU usage is going over 85-90 % and stay there for a long time. In that case CPU response time will have good opportunity to increase and whole system performance can go down. What to do with waits ? In my opinion a best way is to create a baseline (even on paper) with average waiting time and react only when system is working over our baseline or what is more important when system users are claiming a performance problems.

Next time I will use drill-into functionality of OEM and will try to explain what other graphs means – unfortunately most of them are changing a scale between main performance graph and detailed ones so if you are not aware of that it can cause a lot of misunderstandings.


Regards,
Marcin

Wednesday, July 14, 2010

Oracle and AIX

Hi,

Very useful document about AIX configuration for Oracle Database provided IBM.

regards,
Marcin

Sunday, June 20, 2010

Grid Infrastructure - Reconfiguration / host name change

In 11g Release 2 Oracle introduced a Grid Infrastructure (GI) which can be also called Oracle Restart if it is implemented for single instance only. If database is based on Oracle ASM there is no other option and GI has to be installed. Since version 11gR2 ASM is not longer part of RDBMS but part of Grid Infrastructure.
I would like to present how to perform reconfiguration of Grid Infrastructure on one server together with change of server name and IP. Of course reconfiguration can be done in case of other problems and I want to show that it isn't very complicated.
Part 1 - Gathering current configuration

Information about current configuration of ASM and LISTENER should be well documented so in case of reconfiguration it can be easily used.

[oracle@piorovm ~]$ srvctl config asm
ASM home: /u01/app/11.2.0/grid
ASM listener: LISTENER
Spfile: +DATA_SOURCE/ASM/ASMPARAMETERFILE/REGISTRY.253.722024685
ASM diskgroup discovery string: /dev/sd*

[oracle@piorovm ~]$ srvctl config listener
Name: LISTENER
Home: /u01/app/11.2.0/grid
End points: TCP:1521 

In addition all ASM disk group names should be documented. If there is no information about group names this information could be read from ASM disks even when ASM instance is down.
In that example information from header of all disks matching string '/dev/sd*' will be displayed.

[oracle@piorovm ~]$ asmcmd
Connected to an idle instance.
ASMCMD> lsdsk -I -p /dev/sd*
Disk_Num  Disk_group   Header_Stat  Path
       0  DATA_SOURCE  MEMBER       /dev/sdb1
ASMCMD>

Part 2- Grid Control deconfiguration
Current Grid Infrastructure configuration has to be deleted. To achieve it roothas.pl script with proper attributes has to be executed. If Grid Infrastructure was working all resources will be stopped automatically before deletion. This script has to be run as root user.

[root@piorovm install]# cd /u01/app/11.2.0/grid/crs/install
[root@piorovm install]# ./roothas.pl -delete -force
2010-06-18 14:40:33: Checking for super user privileges
2010-06-18 14:40:33: User has super user privileges
2010-06-18 14:40:33: Parsing the host name
Using configuration parameter file: ./crsconfig_params
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Stop failed, or completed with errors.
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Delete failed, or completed with errors.
CRS-4544: Unable to connect to OHAS
CRS-4000: Command Stop failed, or completed with errors.
ACFS-9200: Supported
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.077012 seconds, 136 MB/s
Successfully deconfigured Oracle Restart stack

Part 3- Host rename and Grid Infrastructure configuration
In that part any required changes to host can be made. If reconfiguration was a result of errors
next new configuration of Grid Infrastructure can be done now.

In this example both IP and host name have  been changed.

[oracle@piorovm2 ~]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
10.10.30.150    piorovm2.localdomain piorovm2
[oracle@piorovm2 ~]$ hostname
piorovm2.localdomain

As a first step of Grid Infrastructure configuration "roothas.pl" script has to be executed.
It will create all local resources and allow future configuration of whole stack. This script has to be run as root user.

[root@piorovm install]# cd /u01/app/11.2.0/grid/crs/install
[root@piorovm2 install]# ./roothas.pl -verbose
2010-06-19 13:38:35: Checking for super user privileges
2010-06-19 13:38:35: User has super user privileges
2010-06-19 13:38:35: Parsing the host name
Using configuration parameter file: ./crsconfig_params
LOCAL ADD MODE
Creating OCR keys for user 'oracle', privgrp 'oinstall'..
Operation successful.
CRS-4664: Node piorovm2 successfully pinned.
Adding daemon to inittab
CRS-4123: Oracle High Availability Services has been started.
ohasd is starting

piorovm2     2010/06/19 13:41:12     /u01/app/11.2.0/grid/cdata/piorovm2/backup_20100619_134112.olr
Successfully configured Oracle Grid Infrastructure for a Standalone Server
[root@piorovm2 install]#

Next part of GI configuration will be executed from "oracle" account who is a owner of GI home.
First of all current status of all resources will be displayed.

[oracle@piorovm2 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        OFFLINE OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
[oracle@piorovm2 ~]$

Both resources CSSD and DISKMON are offline and are not automatically started. It has to be changed and auto start mode has to be enabled.
[oracle@piorovm2 ~]$ crsctl modify resource "ora.cssd" -attr "AUTO_START=1"
[oracle@piorovm2 ~]$ crsctl modify resource "ora.diskmon" -attr "AUTO_START=1"
[oracle@piorovm2 ~]$ crsctl start resource "ora.cssd"
CRS-2672: Attempting to start 'ora.cssd' on 'piorovm2'
CRS-2672: Attempting to start 'ora.diskmon' on 'piorovm2'
CRS-2676: Start of 'ora.diskmon' on 'piorovm2' succeeded
CRS-2676: Start of 'ora.cssd' on 'piorovm2' succeeded
[oracle@piorovm2 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       piorovm2
ora.diskmon
      1        ONLINE  ONLINE       piorovm2
[oracle@piorovm2 ~]$

In that step both cluster resources (in single node configuration both resources are not really cluster one) are started and other GI resources like Oracle Listener and Oracle ASM instance can be added. 
In this example host name has been changed and appropriate changes has to be done for Oracle Listener configuration. After changes configuration file should look like this.

[oracle@piorovm2 ~]$ cat /u01/app/11.2.0/grid/network/admin/listener.ora
# listener.ora Network Configuration File: /u01/app/11.2.0/grid/network/admin/listener.ora
# Generated by Oracle configuration tools.

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = piorovm2.localdomain)(PORT = 1521))
    )
  )

ADR_BASE_LISTENER = /u01/app/oracle

Listener and Oracle ASM has to be added to Grid Infrastructure and can be managed by it
Listener name LISTENER has been taken from information gathered in first part.
[oracle@piorovm2 ~]$ srvctl add listener -l LISTENER
[oracle@piorovm2 ~]$ srvctl start listener
[oracle@piorovm2 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       piorovm2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       piorovm2
ora.diskmon
      1        ONLINE  ONLINE       piorovm2

When Listener is up and running Oracle ASM instance can be added to GI. 
Listener name, disk string path and path and name of ASM parameter file has been taken from part 1.
[oracle@piorovm2 ~]$ srvctl add asm -l LISTENER -d "/dev/sd*" -p "+DATA_SOURCE/ASM/ASMPARAMETERFILE/REGISTRY.253.722024685"
[oracle@piorovm2 ~]$ srvctl start asm
[oracle@piorovm2 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA_SOURCE.dg
               ONLINE  ONLINE       piorovm2
ora.LISTENER.lsnr
               ONLINE  ONLINE       piorovm2
ora.asm
               ONLINE  ONLINE       piorovm2                 Started
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       piorovm2
ora.diskmon
      1        ONLINE  ONLINE       piorovm2
[oracle@piorovm2 ~]$ 

When both - Listener and Oracle ASM instance has been successfully added Grid Infrastructure configuration has been completed.
Oracle Grid Infrastructure and all management commands related in previous version for Oracle RAC only are now required for single instance environment with ASM and in my opinion this is a new direction of Oracle software. All differences between Oracle RAC and single instance will be minimized from maintenance and configuration point of view.

regards,
Marcin

Monday, May 24, 2010

Rename file and BAAG

Hi,
There are two reasons why I'm posting this. First I have read a interview
with Alex Gorbachev
about BAAG and second I run into BAAG related problem myself.
This was not pure BAAG problem as I used my knowledge but I forgot about one very important step - VERIFY old knowledge (based on Oracle 8i) with current release. What I did ?

A scenario was very simple - I have a "production" database (thanks God it was only my VM) running on ASM with all files in group called DATA. I cloned that database using a storage snapshot functionality and then I have renamed ASM group to different name and mounted it. So far so good. Next step was to mount a cloned database and rename a files in control file using old fashion command
ALTER DATABASE RENAME FILE '+DATA\xxxx\redo01.log' to '+DATACLONE\xxxx\redo02.log';
Command was successfully finished and control file has been updated. I have checked a alert log and what was my surprise when I saw warning that file source can't be deleted.
WARNING: Cannot delete Oracle managed file 
I didn't want to delete a source file - this file is a part of source database. It was very interested and I did another test - I have stopped a source database and run rename command for other file.
And ? Yes a source file has been deleted. What if you do it on production environment without test ?
The most worry thing is that latest SQL reference there is no information about that behavior.
This clause renames only files in the control file. It does not actually rename them on your operating system. The operating system files continue to exist, but Oracle Database no longer uses them.
This is not about ASM but if there is a different behavior for ASM it should be mentioned as well.
So there are three important things to remember:
- ALTER DATABASE RENAME FILE can delete files
- always check what you are going to do and not guess a results
- check relevant of vendor docs, White papers and other well know stuff with current release of software.

regards,
Marcin

Tuesday, May 11, 2010

Persistent ISCSI disks name

Last few days I was working on Oracle cloning possibility using storage snapshots (post will be blogged soon) and I hit a ISCSI issue at the very beginning.
I have been playing with restarting storage and Linux box and from time to time ISCSI disk has been mapped to different block devices (/dev/sdb1 instead of /dev/sdc1). I really need to keep this persistent and I found this instruction for RedHat Linux (so it should work for OEL and CentOS).
Additionally I was looking for possibility to check how IQN shared on storage is mapped to Linux block devices. There is a nice directory layout where you can check that.
[root@piorovm by-path]# pwd
/dev/disk/by-path
[root@piorovm by-path]# ls -l
total 0
lrwxrwxrwx 1 root root  9 May 11 13:01 ip-10.10.20.241:3260-iscsi-iqn.1986-03.com.sun:02:6fe1e73d-8d7a-6ee7-8446-db844ae0d0c7-lun-0 -> ../../sdc
lrwxrwxrwx 1 root root 10 May 11 13:01 ip-10.10.20.241:3260-iscsi-iqn.1986-03.com.sun:02:6fe1e73d-8d7a-6ee7-8446-db844ae0d0c7-lun-0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 May 11 13:01 ip-10.10.20.241:3260-iscsi-iqn.1986-03.com.sun:02:f6746dbb-c95f-e2dc-8bb6-c82d3ef5169a-lun-0 -> ../../sdb
lrwxrwxrwx 1 root root 10 May 11 13:01 ip-10.10.20.241:3260-iscsi-iqn.1986-03.com.sun:02:f6746dbb-c95f-e2dc-8bb6-c82d3ef5169a-lun-0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 May 11 13:00 pci-0000:00:07.1-ide-0:0 -> ../../hda
lrwxrwxrwx 1 root root 10 May 11 13:00 pci-0000:00:07.1-ide-0:0-part1 -> ../../hda1
lrwxrwxrwx 1 root root 10 May 11 13:00 pci-0000:00:07.1-ide-0:0-part2 -> ../../hda2
lrwxrwxrwx 1 root root  9 May 11 13:00 pci-0000:00:07.1-ide-1:0 -> ../../hdc
lrwxrwxrwx 1 root root  9 May 11 13:00 pci-0000:00:10.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 May 11 13:00 pci-0000:00:10.0-scsi-0:0:0:0-part1 -> ../../sda1
[root@piorovm by-path]#
When you know which IQN is pointed to which block devices we can use mentioned RedHat instruction with some additional remarks.

Here is output from my configuration
  1. Check WWID for block devices
  2. [root@piorovm ~]# scsi_id -g -s /block/sdb
    3600144f04be80ec000000c2936224200
    [root@piorovm ~]# scsi_id -g -s /block/sdc
    3600144f04be7ffdf00000c2936224200
    
  3. Go to
    [root@piorovm ~]# cd /etc/udev/rules.d/
    
    and create a file called 20-names.rules Replace my WWID with results from scsi_id command and put correct names
    [root@piorovm rules.d]# cat 20-name.rules
    KERNEL=="sd*", BUS=="scsi", PROGRAM="/sbin/scsi_id -g -s", RESULT=="3600144f04be7ffdf00000c2936224200", NAME="sdc"
    KERNEL=="sd*", BUS=="scsi", PROGRAM="/sbin/scsi_id -g -s", RESULT=="3600144f04be80ec000000c2936224200", NAME="sdb"
    
    Here is a change with RedHat doc – use double equal mark "==" instead of one equal mark "=" for compare (KERNEL, BUS,RESULT).

  4. Save file and start
    [root@piorovm ~]# start_udev
    Starting udev:                                             [  OK  ]
    [root@piorovm ~]#
    

regards,
Marcin