Saturday, March 28, 2009

PM and updates

Over the last few days I had two updates planned. Both really simple.
In one project I was responsible for everything, and the Oracle Application Server upgrade took me
1 day plus one day for testing.
In the second one we have two PMs, and after 20 hrs I'm still at work :-(
We started our work 8 hrs late because the PM decided to do security research ... yeah,
on the day of the migration, very wise, isn't it? After that we ran into some issues
(stupid ones, as always) and now I'm wondering when I will be able to go home.

BTW - if RAC Clusterware is up and running and the logs show all nodes active in the cluster,
but crs_stat hangs ... check how many packets are being lost on the interconnect interface.
If it is losing about 25% of packets, Oracle Clusterware goes crazy.
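
A quick way to put a number on that loss is to ping the other node over the private interface and read the summary line. This is only a minimal sketch: the function name loss_percent, the interface bond0 and the address 10.0.0.2 are my own placeholders, not values from this cluster.

```shell
#!/bin/sh
# Extract the packet-loss percentage from the summary line printed by ping.
# Reads ping output on stdin and prints the integer part (e.g. 25).
loss_percent() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p' | cut -d. -f1
}

# Hypothetical usage (bond0 and 10.0.0.2 are assumed names, adjust them):
#   ping -c 100 -I bond0 10.0.0.2 | loss_percent
```

Anything far above zero on a dedicated interconnect deserves a look at cables, switch ports and the bonding setup before blaming Clusterware.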

regards,
Marcin

Thursday, February 26, 2009

RAC and Linux

Hi, I have supported the Linux and Oracle combination for many years.

After my last few experiences I think that the newest Red Hat 5 and Oracle 10g require more attention and work, and you can run into some very strange issues.

1. Bonding

A new card (Intel 4-port NIC) was added to the server. All the new interfaces were configured in modprobe.conf. After that a new bonding configuration was created. So far so good. Restart the network interfaces and hurray, we have bonding interfaces.

A simple RAC reconfiguration and we have Clusterware using the bonding interface.

But unfortunately only until the server reboot.

After the reboot the order of the “ethX” interfaces changed :-(

and eth0 became eth4 and so on.

OK, let's change the configuration to the new order.

Next restart and .... ?

Yet another order of interfaces !!!

Fortunately it was not my task; anyway, my colleague found a
solution on the RH website.

After that it was OK.

Bonding was working ...

but RAC ....

see next point



2. RAC

So we have bonding up and running. According to Metalink we have to change the interfaces using oifcfg and then change the VIP configuration.
I stopped the cluster, made all the changes, restarted the cluster ...

Hurray, it is working ... yes ... until the reboot of the servers.

After the reboot only one node was up and running; on the second one Clusterware didn’t start.

A quick check of the logs, and in ocssd.log I found

[ CSSD]2009-02-20 19:43:44.758 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)
[ CSSD] 2009-02-20 19:43:44.758 [1220598112] >TRACE: clssnmRcfgMgrThread: Local Join

WTF? Node 1 is up and running, so why a local join?

Another review and ...

[ CSSD] 2009-02-20 19:43:44.758 [1147169120] >TRACE: clsc_send_msg: (0x7271b0) NS err (12571, 12560), transport (530, 113, 0)

My favourite error – no network. Quick check with ping – both interfaces, private and public, are up. So what is wrong with that box?
The problem is that bonding needs some time after the interfaces come up before it starts working.
If Clusterware starts BEFORE bonding is 100% operational, Oracle checks only once whether another node is reachable over the network and then decides to start Clusterware in a local configuration. But after that it realizes that the other node is already registered on the voting disk, so a local join is impossible.
BTW, in my opinion Clusterware should be more flexible and check for the other node more than once.

The solution from Metalink is to add a sleep of a few seconds before Clusterware is started, in /etc/init.d/init.crs:

'start')
    CMD=`$BASENAME $0`

    # If we are being invoked by the user, perform manual startup.
    # If we are being invoked as an RC script, check for autostart.
    if [ "$CMD" = "init.crs" ]; then
        $LOGMSG "Oracle Cluster Ready Services starting by user request."
        $ID/init.cssd manualstart
    else
        $ID/init.cssd autostart
    fi
    ;;


Add a 'sleep' before '$ID/init.cssd autostart', for example 20 seconds:

    else
        sleep 20
        $ID/init.cssd autostart
    fi
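
A fixed sleep is a guess about how long bonding needs. A possible alternative, which is my own sketch and not part of the Metalink note, is to poll the bonding driver's status file until the bond reports its link as up, with a timeout. The function name wait_for_bond and the bond name bond0 are assumptions, and "MII Status: up" is only a proxy: it does not prove that the other node is reachable.

```shell
#!/bin/sh
# Wait until the first "MII Status:" line in the bonding status file
# (the bond's own status, listed before the slaves) reports "up".
# Returns 0 on success, 1 if max_wait seconds pass without it.
wait_for_bond() {
    status_file=$1          # e.g. /proc/net/bonding/bond0 (bond0 is assumed)
    max_wait=${2:-60}       # timeout in seconds, 60 by default
    waited=0
    while [ "$waited" -lt "$max_wait" ]; do
        state=$(awk '/^MII Status:/ { print $3; exit }' "$status_file" 2>/dev/null)
        if [ "$state" = "up" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Hypothetical usage in place of the plain "sleep 20":
#   wait_for_bond /proc/net/bonding/bond0 60
#   $ID/init.cssd autostart
```

If the timeout expires the sketch still falls through to the autostart, so the behaviour is never worse than the fixed sleep.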


To finish, my personal opinion - previous versions of Linux and Oracle were more stable.
I hope Linux and Oracle will not end up like MS products.

regards,
Marcin

Thursday, February 12, 2009

Strange error

Hi,

A short note so I don't forget about it.

I have to investigate it more deeply and reproduce it, but today I hit a very strange issue.
I started a standby DB in read only mode, but unfortunately audit_trail was set to DB.
I got an error that I could not open the database in read only mode with that parameter.
I changed it, and after a restart I got a message that my db files needed to be recovered to be consistent :-(

Hmmm, right now I'm waiting for an airplane, but I will come back to that issue.


Update - Friday 13th :)
RTFM ...
It is a mixed environment of PA-RISC and Itanium, and the standby can't be opened in read only mode - see Metalink note 413484.1

BTW
Another example of very useful error messages !!!

Wednesday, February 4, 2009

RMAN - random errors for years

Hi,


I have been working with RMAN for 8 years and I'm still wondering why some RMAN errors are taken from /dev/random ;)

Last example:

Environment: Linux 32-bit - Oracle 10g 10.2.0.4 on ASM

Performed steps:
  1. Drop existing test DB from ASM - using drop database
  2. Copy backup from production server into test server
  3. restore controlfile from new location
  4. mount database
After that I wanted to restore the database. So I cataloged all the necessary backup pieces
in the controlfile and checked them using the list backupset command.
There was one backupset with all the datafiles, with correct status. So it is simple, let's try to restore
the DB.

RMAN>restore database;

creating datafile No=1 name=+DATA/oracle/orcl/datafile/o1_mf_system_3n5w1nky_.dbf
released channel: t1
released channel: t2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 07/04/2008 10:54:29
ORA-01180: can not create datafile 1
ORA-01110: data file 1: '+DATA/dataprd/ORCL/datafile/o1_mf_system_3n5w1nky_.dbf'

Yeah, nice error.
There are some notes on Metalink related to a duplicated incarnation and a corrupted controlfile
(BTW, one suggested solution is to recreate the controlfile from the command line before you restore the datafiles - is it possible to recreate a controlfile without datafiles ???)

Anyway, that was not my case.

The explanation is very simple - I found out that during the catalog phase RMAN scans the existing flash recovery area, and I found archive logs in a backupset from the previous (dropped) database there.
Because those archive logs belonged to a different incarnation ... the incarnation of my new database was changed too. And now we have a strange behaviour of RMAN.

RMAN> list backupset;

still displays valid backups for that incarnation

RMAN> restore database;

raises an error (see above)

Solution:

RMAN> reset database to incarnation xxx;

where xxx is the previous incarnation of the database.
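
The whole fix can be put into a small RMAN command file. The sketch below is only an illustration: the helper name rman_reset_script, the file name and the incarnation key 2 are made up, and it leaves out any channel allocation. The real key has to be read from the output of "list incarnation of database".

```shell
#!/bin/sh
# Print an RMAN command file that lists the incarnations, switches back to
# the given incarnation key and then retries the restore.
rman_reset_script() {
    inc_key=$1      # incarnation key taken from "list incarnation"
    cat <<EOF
list incarnation of database;
reset database to incarnation $inc_key;
restore database;
EOF
}

# Hypothetical usage (key 2 and the file name are example values only):
#   rman_reset_script 2 > /tmp/reset_inc.rman
#   rman target / cmdfile=/tmp/reset_inc.rman
```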

I can understand that Oracle could use a backup from a previous incarnation in the new one (but why?),
but why is there such a stupid error about datafile number 1?

Is it impossible to display something more useful, like there is no backup for that incarnation?

ps.
All the databases have the same DBID - they are clones.
I know it is a bad idea to keep one DBID for many databases, but I thought that with an RMAN catalog there would be no issue.

regards,
Marcin

Tuesday, January 27, 2009

Graceful switchover in standard edition

I have implemented a lot of standby databases on Oracle Standard Edition. Until now none of our customers had asked for a graceful switchover, but at last it happened.



I spent a few hours thinking about whether it is possible, and when a draft of a solution came to mind I did some research on Oracle Metalink and found an article dated 1999 about graceful switchover in Oracle 8 and 8i – it is interesting because it was written before Data Guard and this functionality were established (Metalink Doc ID: 76450.1 Graceful Switchover and Switchback of Oracle Standby Databases).



The findings and my original idea are very close – to switch over a database it is required to copy the online redo logs and the control file. Everything looks straightforward when we have the database files, online redo logs and control files on a filesystem. But what about a database placed on ASM, where there is no possibility to copy redo logs? Is it possible? Yes, it is. It is a kind of workaround and it requires a little more work, but I was able to perform a switchover between two databases using ASM.



The solution for ASM-based databases is to use the mirroring (multiplexing) feature for online redo logs. A new member of each group has to be placed on a filesystem and not in an ASM disk group. After that change we are able to perform a graceful switchover using the steps described in the Oracle document. At the end the additional (temporary) member of each redo group can be deleted.
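
The extra members can be added and removed with plain ALTER DATABASE statements. The helper below just generates them. It is a rough sketch under my own assumptions: the function names, the directory /u01/redo_tmp, the file naming and the group numbers 1 to 3 are all hypothetical, so check v$log and v$logfile for the real groups first.

```shell
#!/bin/sh
# Print SQL that adds one filesystem member to each listed redo log group.
add_fs_members() {
    dir=$1; shift
    for grp in "$@"; do
        printf "ALTER DATABASE ADD LOGFILE MEMBER '%s/redo%02d_fs.log' TO GROUP %d;\n" \
            "$dir" "$grp" "$grp"
    done
}

# Print SQL that drops those temporary members again after the switchover.
drop_fs_members() {
    dir=$1; shift
    for grp in "$@"; do
        printf "ALTER DATABASE DROP LOGFILE MEMBER '%s/redo%02d_fs.log';\n" \
            "$dir" "$grp"
    done
}

# Hypothetical usage (directory and group numbers are example values):
#   add_fs_members /u01/redo_tmp 1 2 3 | sqlplus -s "/ as sysdba"
```

Keep in mind that a member of the current or an active group cannot be dropped, so the cleanup may need a log switch first.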

Wednesday, January 7, 2009

Nice bug

Today I was trying to enable automatic patch update in 11g.
I opened the configuration page and typed in my email and password; when I pressed the Apply
button I saw a nice message in red:

Invalid Data - Error: apply failed ORA-12899: value too large for column
"SYSMAN"."MGMT_ARU_CREDENTIALS"."ARU_USERNAME" (actual: 80, maximum: 64)
ORA-06512: at "SYSMAN.MGMT_CREDENTIAL", line 1482 ORA-06512: at line 1

On Metalink your account is your email, so all DBAs with long first names or surnames have a problem ;)
I did some manual research and found a solution - use a shorter account name.

BTW this is said to be solved in 11.1.0.7.
From Metalink note 470696.1:

The next Release of DB Control (11.1.0.7) will include the fix. The maximum length of the Metalink Username will be 255 characters (as Metalink username can have up to 255 characters).

I will try that.

update:
Even in 11.1.0.7 on Linux I could not use my Metalink account.

Wednesday, July 16, 2008

RAC on Teide


During my last vacation I was asked to do a phone consultation
on Oracle 10g RAC database creation. So far it has been my "highest" Oracle installation,
because I was almost at the top of Pico de Teide - about 3550 m ;)
See my picture taken by my wife.