After reading the book about Oracle RAC (see previous post) I decided to test recovery of the OCR (Oracle Cluster Registry) file and the vote disk placed on ASM disk groups. This is an operation I had never done with ASM before. I was lucky enough that my ASM-based RAC installation never had such an issue, but I had done it in the past when the OCR and vote disk were placed on raw or block devices.
According to the Oracle documentation it should be straightforward, and it really is, but you need to be careful, otherwise you can hit strange errors, like a "segmentation fault" from ocrconfig when Grid Infrastructure is down.
Here are the step-by-step instructions:
- Check if you can access OCR files
[root@node1 ~]# /u01/app/11.2.0/grid/bin/ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
- Stop CRS on all nodes
[root@node1 ~]# export CRS_HOME=/u01/app/11.2.0/grid
[root@node1 ~]# $CRS_HOME/bin/crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
If you get an error like that, try the force option:
[root@node1 ~]# $CRS_HOME/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2673: Attempting to stop 'ora.evmd' on 'node1'
CRS-2673: Attempting to stop 'ora.asm' on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2677: Stop of 'ora.asm' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'node1'
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'node1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'node1'
CRS-2677: Stop of 'ora.cssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'node1'
CRS-2673: Attempting to stop 'ora.crf' on 'node1'
CRS-2677: Stop of 'ora.crf' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'node1'
CRS-2677: Stop of 'ora.diskmon' on 'node1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@node1 ~]#
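Remember that the stop (or forced stop) has to be executed as root on every node of the cluster, not only on node1. A minimal sketch, assuming the same Grid home on the remaining nodes of this example:

[root@node2 ~]# /u01/app/11.2.0/grid/bin/crsctl stop crs -f
[root@node3 ~]# /u01/app/11.2.0/grid/bin/crsctl stop crs -f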
- When the OCR file and vote disk are located on an ASM disk group, a running ASM instance is required for recovery. The current version of Oracle Grid Infrastructure (GI) can be started in exclusive mode on one node to allow the checks and the restore.
There is a slight change between versions 11.2.0.1 and 11.2.0.2 - in the first release we need to start GI in exclusive mode and then stop the CRS service (as shown below). In 11.2.0.2 we can start GI without CRS, but the old steps still work.
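If you are not sure which release a node is running, crsctl can report the version of the locally installed clusterware even while the stack is down; a quick check, assuming the same CRS_HOME as above:

[root@node1 ~]# $CRS_HOME/bin/crsctl query crs softwareversion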
Steps for 11.2.0.1
[root@node1 ~]# $CRS_HOME/bin/crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-5017: The resource action "ora.cluster_interconnect.haip start" encountered the following error:
Start action for HAIP aborted
CRS-2674: Start of 'ora.cluster_interconnect.haip' on 'node1' failed
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'node1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2674: Start of 'ora.asm' on 'node1' failed
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-4000: Command Start failed, or completed with errors.
[root@node1 ~]#
Errors are expected, so now we need to stop CRS (if it has been started):
[root@node1 ~]# $CRS_HOME/bin/crsctl stop resource ora.crsd -init
[root@node1 ~]# $CRS_HOME/bin/crsctl stop resource ora.crsd -init
CRS-2500: Cannot stop resource 'ora.crsd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
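To see which of the low-level (init) resources are actually running in exclusive mode at this point, you can list them; a quick check, assuming the same CRS_HOME:

[root@node1 ~]# $CRS_HOME/bin/crsctl stat res -t -init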
Steps for 11.2.0.2 - the -nocrs option allows us to avoid starting CRS, which would try to start resources. Issues related to the CRS resource bring down HAIP and ASM, and this is the root cause of the errors shown in the 11.2.0.1 example.
[root@node1 ~]# $CRS_HOME/bin/crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
No errors were displayed and GI has been started in exclusive mode.
- Now it is time to check if the ASM instance is running and to recreate the ASM disk groups.
[oracle@node1 ~]$ ps aux | grep pmon
oracle    8009  0.0  0.9 495804 18564 ?      Ss   20:37   0:00 asm_pmon_+ASM1
oracle    8260  0.0  0.0  61184   760 pts/3  S+   20:38   0:00 grep pmon
ASM is running. Now we can connect and create a disk group for the OCR and vote disk. Be sure that your new disk group has a proper COMPATIBLE.ASM value - OCR and voting files can only be placed on disk groups with COMPATIBLE.ASM set to 11.2 or higher.
[oracle@node1 ~]$ export ORACLE_HOME=/u01/app/11.2.0/grid
[oracle@node1 ~]$ export PATH=$PATH:$ORACLE_HOME/bin
[oracle@node1 ~]$ export ORACLE_SID=+ASM1
[oracle@node1 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.2.0 Production on Tue Aug 30 20:39:14 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> show parameter asm_di

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string
asm_diskstring                       string

SQL> alter system set asm_diskstring = '/dev/asmdisk*';

System altered.

SQL> create diskgroup DATA external redundancy disk '/dev/asmdisk11' attribute 'COMPATIBLE.ASM' = '11.2';

Diskgroup created.
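If you are not sure which devices ASM can discover, you can also query v$asm_disk from the same SQL*Plus session before creating the disk group; a quick sanity check (the column list is just an example):

SQL> select path, header_status from v$asm_disk;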
- In the next step we need to create an spfile for the ASM instance, as it will be used by the rest of the cluster to find the location of the ASM disks via 'asm_diskstring'.
First a text init.ora file is created, and then a new binary spfile is created from it on the ASM disk group.
[oracle@node1 rac-cluster]$ cat /tmp/init.ora
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
*.asm_diskstring = '/dev/asmdisk*'
[oracle@node1 rac-cluster]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.2.0 Production on Tue Aug 30 21:23:36 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> create spfile='+DATA' from pfile='/tmp/init.ora';

File created.
- When the disk group is ready, it is time to restore the OCR from a backup file.
Oracle Grid Infrastructure backs up the OCR automatically into $GRID_HOME/cdata/<cluster_name>/. Let's restore it and check if the OCR file is recognized properly afterwards.
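If you do not know which automatic backup is the most recent one, ocrconfig can list the available backups; a quick check:

[root@node1 ~]# $CRS_HOME/bin/ocrconfig -showbackup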
[root@node1 ~]# $CRS_HOME/bin/ocrconfig -restore /u01/app/11.2.0/grid/cdata/rac-cluster/backup_20110830_201118.ocr
[root@node1 ~]# $CRS_HOME/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       3132
         Available space (kbytes) :     258988
         ID                       :  660203047
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
- Now it is time to restore the vote disk. This process reads asm_diskstring from the ASM instance and places the voting files on those disks. See what happens when asm_diskstring is empty.
[root@node1 ~]# $CRS_HOME/bin/crsctl replace votedisk +DATA
CRS-4602: Failed 27 to add voting file 28652f742fc44f28bfc6d12d1412a604.
Failed to replace voting disk group with +DATA.
CRS-4000: Command Replace failed, or completed with errors.
The error message from the log file:
[cssd(7894)]CRS-1638:Unable to locate voting file with ID 1b37b25b-686c4fb4-bfb82eac-357f48df that is being added to the list of configured voting files; details at (:CSSNM00022:) in /u01/app/11.2.0/grid/log/node1/cssd/ocssd.log 2011-08-30 20:44:12.256
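The fix is simply to point asm_diskstring of the running ASM instance at the ASM devices (as was done earlier) before retrying; a minimal sketch, assuming the same device path as in this example:

SQL> alter system set asm_diskstring = '/dev/asmdisk*';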
When asm_diskstring is set up properly, it looks much better.
[root@node1 ~]# $CRS_HOME/bin/crsctl replace votedisk +DATA
Successful addition of voting disk 4ca8c2b58d394ff1bf7a9b88dd9f5fc3.
Successfully replaced voting disk group with +DATA.
CRS-4266: Voting file(s) successfully replaced
[root@node1 ~]# $CRS_HOME/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   4ca8c2b58d394ff1bf7a9b88dd9f5fc3 (/dev/asmdisk11) [DATA]
Located 1 voting disk(s).
- In the last step Grid Infrastructure needs to be restarted on all nodes.
[root@node1 ~]# $CRS_HOME/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2673: Attempting to stop 'ora.asm' on 'node1'
CRS-2677: Stop of 'ora.asm' on 'node1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'node1'
CRS-2677: Stop of 'ora.cssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'node1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'node1'
CRS-2677: Stop of 'ora.diskmon' on 'node1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@node1 ~]# $CRS_HOME/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
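CRS on the remaining nodes was stopped at the very beginning, so there it only needs to be started again; assuming the same Grid home on node2 and node3:

[root@node2 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
[root@node3 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs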
If everything went well, a result similar to the following is expected:
[root@node1 ~]# $CRS_HOME/bin/crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node3:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
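Besides the cluster check, you can also list all managed resources to confirm that databases, listeners and VIPs came back online; a quick check:

[root@node1 ~]# $CRS_HOME/bin/crsctl stat res -t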
Now Grid Infrastructure is working. If the database ASM disk groups have been destroyed as well, or everything is kept in one ASM disk group (as in my example), it is time to restore the database from backup using Oracle Recovery Manager. But that is a different story.
I hope this helps someone restore Grid Infrastructure after a disk crash.
regards,
Marcin