Troubleshoot Grid Infrastructure Startup Issues (Doc ID 1050908.1)

Case 1: OHASD does not start
Case 2: OHASD Agents do not start
Case 3: OCSSD.BIN does not start
Case 4: CRSD.BIN does not start
Case 5: GPNPD.BIN does not start
Case 6: Various other daemons do not start
Case 7: CRSD Agents do not start
Case 8: HAIP does not start
Network and Naming Resolution Verification
Log File Location, Ownership and Permission
  In Grid Infrastructure cluster environment:
  In Oracle Restart environment:
Network Socket File Location, Ownership and Permission
  In Grid Infrastructure cluster environment:
  In Oracle Restart environment:
Diagnostic file collection
References
________________________________________
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
PURPOSE
This note provides a reference for troubleshooting 11gR2 and 12c Grid Infrastructure clusterware startup issues. It applies both to new environments (during root.sh or rootupgrade.sh) and to unhealthy existing environments. To look specifically at root.sh issues, see note 1053970.1 for more information.
SCOPE
This document is intended for Clusterware/RAC Database Administrators and Oracle support engineers.
DETAILS
Start up sequence:
In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm, etc.), and crsd starts agents that start user resources (database, SCAN, listener, etc.).
For the detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1
Cluster status
To find out cluster and daemon status:
$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                     Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                     OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1
For 11.2.0.2 and above, there will be two more processes:
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1
For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:
ora.diskmon
      1        OFFLINE OFFLINE      rac1
For 12c onward, ora.storage is introduced:
ora.storage
      1        ONLINE  ONLINE       racnode1                 STABLE
To start an offline daemon - if ora.crsd is OFFLINE:
$GRID_HOME/bin/crsctl start res ora.crsd -init
Case 1: OHASD does not start
As ohasd.bin is responsible for starting all other clusterware processes directly or indirectly, it needs to start properly for the rest of the stack to come up. If ohasd.bin is not up, checking its status reports CRS-4639 (Could not contact Oracle High Availability Services); if ohasd.bin is already up, CRS-4640 is reported when another startup attempt is made; and if it fails to start, the following is reported:
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Automatic ohasd.bin start up depends on the following:
1. OS is at appropriate run level:
The OS needs to be at the specified run level before CRS will try to start up.
To find out at which run level the clusterware needs to come up:
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
The example above shows that CRS is supposed to run at run levels 3 and 5; please note that, depending on the platform, CRS comes up at a different run level.
To find out the current run level:
who -r
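As a rough illustration, the run levels can be extracted from the inittab entry and compared against the "who -r" output. The helper name ohasd_runlevels below is hypothetical and not part of any Oracle tooling; it only parses the inittab line format shown above:

```shell
# Hypothetical helper: print, one digit per line, the run levels encoded in
# the second colon-separated field of the init.ohasd inittab entry (e.g. "35").
ohasd_runlevels() {
  printf '%s\n' "$1" | awk -F: '{print $2}' | grep -o .
}

# The current run level is reported by `who -r`; compare it with these values.
ohasd_runlevels "h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null"
# prints:
# 3
# 5
```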
2. "init.ohasd run" is up
On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, the init process (pid 1, /sbin/init on Linux, Solaris and HP-UX, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:
ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
Note: Oracle Linux 6 (OL6) and Red Hat Enterprise Linux 6 (RHEL6) have deprecated inittab; instead, init.ohasd is configured through upstart in /etc/init. However, the process "/etc/init.d/init.ohasd run" should still be up.
If any rc Snncommand script (located in rcn.d, for example S98gcstartup) gets stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant Snncommand script is stuck.
Error "[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may be reported if init.ohasd fails to start on time.
If the system administrator cannot identify the reason why init.ohasd is not starting, the following can be a very short-term workaround:
cd <location-of-init.ohasd>
nohup ./init.ohasd run &
3. Clusterware auto start is enabled - it is enabled by default
By default CRS is enabled for auto start upon node reboot. To enable it:
$GRID_HOME/bin/crsctl enable crs
To verify whether it is currently enabled or not:
$GRID_HOME/bin/crsctl config crs
If the following is in the OS messages file:
Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr
then the reason is that the file does not exist or is not accessible. The cause can be that someone modified it manually, or that the wrong opatch was used to apply a GI patch (i.e. opatch for Solaris X64 used to apply a patch on Linux).
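The check the init script performs on that flag file can be sketched as below. The function name check_autostart is illustrative only; the ohasdstr path varies by platform (the syslog message above shows /var/opt/oracle/scls_scr/<node>/root/ohasdstr on Solaris; on Linux the scls_scr tree typically lives under /etc/oracle), so pass the path that exists on your system:

```shell
# Illustrative sketch: classify the state of the autostart flag file.
check_autostart() {
  if [ ! -r "$1" ]; then
    echo "missing"        # matches the "Could not access ..." syslog case
  elif grep -q '^enable' "$1"; then
    echo "enabled"        # CRS will autostart on reboot
  else
    echo "disabled"       # e.g. "crsctl disable crs" was run, or file was edited
  fi
}
```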
4. syslogd is up and the OS is able to execute the init script S96ohasd
Once S96ohasd runs, the following should appear in the OS messages file:
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.
To find out whether the OS is able to execute S96ohasd while the node is coming up, modify S96ohasd:
From:
case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."
To:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/touch /tmp/ohasd.start."`date`"
$LOGERR "Oracle HA daemon is enabled for autostart."
After a node reboot, if you don't see /tmp/ohasd.start.timestamp get created, it means the OS is stuck on some other Snn script. If you do see /tmp/ohasd.start.timestamp but not "Oracle HA daemon is enabled for autostart" in messages, likely syslogd is not fully up. In both cases, you will need to engage the system administrator to find out the issue on the OS level. For the latter case, a workaround is to "sleep" for about 2 minutes; modify ohasd:
From:
case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."
To:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/sleep 120
$LOGERR "Oracle HA daemon is enabled for autostart."
5. The filesystem where GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd is executed, the following messages should be in the OS messages file:
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"
If you see the first line but not the last line, likely the filesystem containing the GRID_HOME was not online while S96ohasd was executed.
6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid
ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root oinstall 272756736 Feb  2 18:20 rac1.olr
If the OLR is inaccessible or corrupted, likely ohasd.log will have messages similar to the following:
..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR
OR
..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR
OR
..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user
OR
ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resources, and "ocrconfig -local -manualbackup" fails
OR
..
2010-08-04 13:13:11.102: [    CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [    CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [    CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [    CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [   CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [    CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool
------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED
2010-08-04 13:13:11.113: [    CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.
The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>".
By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
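Before restoring, you need the most recent usable backup from that default location. A minimal sketch follows; latest_olr_backup is an illustrative helper name, and the restore itself must be run as root with the stack down:

```shell
# Illustrative helper: print the newest backup_*.olr file in a directory
# (default OLR backup location: $GRID_HOME/cdata/$HOST/).
latest_olr_backup() {
  ls -1t "$1"/backup_*.olr 2>/dev/null | head -1
}

# Usage sketch (as root):
#   $GRID_HOME/bin/ocrconfig -local -restore "$(latest_olr_backup $GRID_HOME/cdata/$HOSTNAME)"
```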
7. ohasd.bin is able to access network socket files:
2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))
2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state
In a Grid Infrastructure cluster environment, ohasd related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
8. ohasd.bin is able to access the log file location:
OS messages/syslog shows:
Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.
Refer to the "Log File Location, Ownership and Permission" section for example output; if the expected directory is missing, create it with proper ownership and permission.
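Recreating a missing log directory can be sketched as follows. The name fix_log_dir is illustrative; take the correct owner, group and mode from a healthy node (see the "Log File Location, Ownership and Permission" section) rather than from this example:

```shell
# Illustrative sketch: recreate a missing clusterware log directory.
# Verify owner, group and mode against a healthy node before using.
fix_log_dir() {    # fix_log_dir <dir> <owner> <group>
  mkdir -p "$1"
  chown "$2:$3" "$1"
  chmod 755 "$1"
}

# e.g.  fix_log_dir /ocw/grid/log/racnode1/ohasd root oinstall
```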
9. On SUSE Linux, ohasd may fail to start after a node reboot; refer to note 1325718.1 - OHASD not Starting After Reboot on SLES
10. OHASD fails to start, "ps -ef| grep ohasd.bin" shows ohasd.bin is started, but nothing is written to $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes; truss shows it is looping, closing non-opened file handles:
..
15058/1:  0.1995 close(2147483646) Err#9 EBADF
15058/1:  0.1996 close(2147483645) Err#9 EBADF
..
The call stack of ohasd.bin from pstack shows the following:
_close  sclssutl_closefiledescriptors  main ..
The cause is bug 11834289, which is fixed in 11.2.0.3 and above. Other symptoms of the bug are that clusterware processes may fail to start with the same call stack and truss output (looping on the OS call "close"). If the bug happens when trying to start other resources, "CRS-5802: Unable to start the agent process" could show up as well.
11. Other potential causes/solutions are listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
12. ohasd.bin started fine; however, "crsctl check crs" shows only the following and nothing else:
CRS-4638: Oracle High Availability Services is online
And "crsctl stat res -p -init" shows nothing
The cause is that the OLR is corrupted; refer to note 1193643. to restore.
13. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log
Case 2: OHASD Agents do not start
OHASD.BIN will spawn four agents/monitors to start resources:
oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
cssdagent / cssdmonitor: responsible for ora.cssd (for ocssd.bin) and ora.cssdmonitor (for cssdmonitor itself)
If ohasd.bin cannot start any of the above agents properly, the clusterware will not come to a healthy state.
1. Common causes of agent failure are that the log file or log directory for the agents don't have proper ownership or permission.
Refer to the section "Log File Location, Ownership and Permission" below for general reference.
2. If an agent binary (oraagent.bin or orarootagent.bin etc) is corrupted, the agent will not start, resulting in related resources not coming up:
2011-05-03 11:11:13.189
[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed
..
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]
The solution is to compare the agent binary with a "good" node, and restore a good copy.
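A quick way to spot a corrupted agent binary is to compare checksums with a healthy node. The helper below is an illustrative sketch that compares two local files; in practice you would compute the checksum of, say, $GRID_HOME/bin/orarootagent.bin on each node (e.g. over ssh) and compare the results before copying the binary from the good node:

```shell
# Illustrative check: do two files have identical contents?
# Reading via stdin keeps the filename out of the cksum output,
# so checksums taken on different paths/nodes compare cleanly.
same_binary() {
  [ "$(cksum < "$1")" = "$(cksum < "$2")" ] && echo "same" || echo "differs"
}
```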
Case 3: OCSSD.BIN does not start
Successful ocssd.bin startup depends on the following:
1. GPnP profile is accessible - gpnpd needs to be fully up to serve the profile
If ocssd.bin is able to get the profile successfully, likely ocssd.log will have messages similar to the following:
2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""
Otherwise messages like the following will show in ocssd.log:
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)
2. Voting Disk is accessible
In 11gR2, ocssd.bin discovers voting disks using settings from the GPnP profile; if not enough voting disks can be identified, ocssd.bin will abort itself.
2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread
ocssd.bin may not come up with the following error if all nodes failed while a voting file change was in progress:
2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0
The solution is to start ocssd.bin in exclusive mode with note 1364971.1
If the voting disk is located on a non-ASM device, ownership and permissions should be:
-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
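The mode part of that listing can be verified mechanically. vote_perm_ok is an illustrative helper (it only checks the 0640 mode; owner and group should additionally match the grid user and oinstall, as in the ls -l output above):

```shell
# Illustrative check: a non-ASM voting file should be mode 0640 (-rw-r-----).
# Owner/group (grid user / oinstall) still need to be verified with ls -l.
vote_perm_ok() {
  [ "$(stat -c '%a' "$1")" = "640" ] && echo "ok" || echo "bad"
}
```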
3. Network is functional and name resolution is working:
If ocssd.bin can't bind to any network, likely the ocssd.log will have messages like the following:
2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp
If there's a connectivity issue on the private network (including multicast being off), likely the ocssd.log will have messages like the following:
2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..   >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
To validate the network, please refer to note 1054902.1
4. Vendor clusterware is up (if using vendor clusterware)
$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode1    0
If vendor clusterware is not fully up, likely ocssd.log will have messages similar to the following:
2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon
Before the clusterware is installed, execute the command below as the grid user:
$INSTALL_SOURCE/install/lsnodes -v
5. Command "crsctl" being executed from wrong GRID_HOME
Command "crsctl" must be executed from the correct GRID_HOME to start the stack, or a message similar to the following will be reported:
2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#
Case 4: CRSD.BIN does not start
Successful crsd.bin startup depends on the following:
1. ocssd is fully up
If ocssd.bin is not fully up, crsd.log will show messages like the following:
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2. OCR is accessible
If the OCR is located on ASM, the ora.asm resource (ASM instance) must be up and the diskgroup for the OCR must be mounted; if not, likely the crsd.log will show messages like:
2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26
Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.
If the OCR is located on a non-ASM device, expected ownership and permissions are:
-rw-r----- 1 root oinstall 272756736 Feb  3 23:24 ocr
If the OCR is located on a non-ASM device and it is unavailable, likely crsd.log will show messages similar to the following:
2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
If the OCR is corrupted, likely crsd.log will show messages like the following:
2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2:  INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
If the owner or group of the grid user has changed, then even if ASM is available, likely crsd.log will show the following:
2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]
If the oracle binary in GRID_HOME has wrong ownership or permission, then even if ASM shows it is up and running, likely crsd.log will show the following:
2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]
[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge
2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact
2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact
2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26
The expected ownership and permission of the oracle binary in GRID_HOME should be:
-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle
If the OCR or its mirror is unavailable (it could be that ASM is up, but the diskgroup for the OCR/mirror is unmounted), likely crsd.log will show the following:
2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...
3. crsd.bin pid file exists and points to a running crsd.bin process
If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will show messages similar to the following:
2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [ clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [ clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [ clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [ clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai
The solution is to manually create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) as the grid user with the "touch" command, then restart the resource ora.crsd.
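The fix above can be sketched as the following commands, run as root. The GRID_HOME path, node name and grid user come from this note's examples and must be adjusted to your environment; the directory check is only a guard so the sketch is harmless outside a cluster:

```shell
# Sketch only; GRID_HOME and the "grid" user name are assumptions from this note.
GRID_HOME=${GRID_HOME:-/ocw/grid}
HOST=$(hostname -s)

if [ -d "$GRID_HOME/crs/init" ]; then
  # Create the missing dummy pid file as the grid user...
  su - grid -c "touch $GRID_HOME/crs/init/$HOST.pid"
  # ...then restart the ora.crsd resource.
  "$GRID_HOME/bin/crsctl" start res ora.crsd -init
else
  echo "GRID_HOME not found: $GRID_HOME (not a Grid Infrastructure node)"
fi
```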
If the pid file does exist and the PID in it references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will show messages similar to the following:
2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535     >> 1535 is the output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[ clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [ AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [ AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [ AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535
To verify on OS level:
ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root 1535 1 0 Mar30 ? 00:00:00 iscsid     >> Note process 1535 is not crsd.bin
The solution is to create an empty pid file and to restart
the resource ora.crsd, as root:
# > $GRID_HOME/crs/init/<racnode1>.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init
4. Network is functional and name resolution is working:
If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:
2010-02-03 23:34:28.412: [ GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [ OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [ OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [ CRSOCR][2235814832] OCR context init failure. Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [ CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44
Or:
2009-12-10 06:28:31.974: [ OCRMAS][20]proath_connect_master:1: could not connect to master clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [ OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!
Or:
2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)
To validate the network, please refer to note 1054902.1
5. The crsd executables (crsd.bin and crsd in GRID_HOME/bin) have correct ownership/permission and haven't been manually modified. A simple way to check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" with a "good" node.
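The comparison can be scripted along these lines; "racnode2" is a hypothetical healthy node, and passwordless ssh to it is assumed:

```shell
# Compare crsd ownership/permissions with a known-good node.
# GRID_HOME and GOOD_NODE are assumptions; adjust to your environment.
GRID_HOME=${GRID_HOME:-/ocw/grid}
GOOD_NODE=${GOOD_NODE:-racnode2}

# Local output (silent if this host has no Grid Infrastructure installed).
ls -l "$GRID_HOME/bin/crsd" "$GRID_HOME/bin/crsd.bin" 2>/dev/null

# Remote output from the good node; compare the two by eye or with diff.
ssh "$GOOD_NODE" "ls -l $GRID_HOME/bin/crsd $GRID_HOME/bin/crsd.bin" 2>/dev/null || true
```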
6. To troubleshoot further, refer to note 1323698.1 -
Troubleshooting CRSD Start up Issue
Case 5: GPNPD.BIN does not start
1. Name Resolution is not working
gpnpd.bin fails with the following error in gpnpd.log:
2010-05-13 12:48:11.540: [ GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [ GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [ GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [ GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"
In the above example, make sure the current node is able to ping "node2" and that there is no firewall between them.
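The connectivity check can be sketched as below; "node2" and port 9393 come from the log excerpt above and must be replaced with your peer node and port. The TCP probe uses bash's /dev/tcp pseudo-device, which is a bash-only feature:

```shell
# Reachability checks for the gpnpd connect failure above.
PEER=${PEER:-node2}
PORT=${PORT:-9393}

# ICMP reachability (ping may be absent or blocked in some environments).
ping -c 2 -W 2 "$PEER" 2>/dev/null || echo "ping to $PEER failed"

# TCP probe of the gpnpd port; a firewall dropping traffic shows up here
# as a refusal or timeout.
(exec 3<>"/dev/tcp/$PEER/$PORT") 2>/dev/null \
  && echo "port $PORT on $PEER reachable" \
  || echo "cannot reach $PEER:$PORT"
```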
2. Bug 10105195
Due to bug 10105195, gpnpd dispatch is single threaded and can be blocked by network scanning etc. The bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above; refer to note 10105195.8 for more details.
Case 6: Various other daemons do not start
Common causes:
1. Log file or directory for the daemon doesn't have
appropriate ownership or permission
If the log file or log directory for the daemon doesn't have
proper ownership or permissions, usually there is no new info in the log file
and the timestamp remains the same while the daemon tries to come up.
Refer to the section "Log File Location, Ownership and Permission" below for general reference.
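A quick scan for unexpected owners can be sketched as follows; the log directory path and the expected owners (grid and root) are assumptions taken from this note's examples:

```shell
# List anything under the CRS log tree not owned by the expected users.
# LOG_DIR and the owner names are assumptions; adjust to your environment.
LOG_DIR=${LOG_DIR:-/ocw/grid/log}

# No output means no unexpected owners; still compare the permission bits
# against a healthy node, since this only checks ownership.
find "$LOG_DIR" ! -user grid ! -user root -ls 2>/dev/null
```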
2. Network socket file doesn't have appropriate ownership or
permission
In this case, the daemon log will show messages like:
2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
2010-02-02 12:55:20.485: [ clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
3. OLR is corrupted
In this case, the daemon log will show messages like the following (this is a case where ora.ctssd fails to start):
2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [ CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [ CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [ CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [ CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [ CTSS][1]CTSS daemon aborting
The solution is to restore a good copy of the OLR; see note 1193643.1.
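The restore generally amounts to the sketch below, run as root; follow note 1193643.1 for the supported procedure. OLR backups live under $GRID_HOME/cdata/<hostname> by default, and the backup file name shown here is hypothetical:

```shell
# Sketch only -- see note 1193643.1 for the full, supported procedure.
GRID_HOME=${GRID_HOME:-/ocw/grid}
HOST=$(hostname -s)

if [ -d "$GRID_HOME/cdata/$HOST" ]; then
  # List available OLR backups, then restore the chosen one.
  ls -l "$GRID_HOME/cdata/$HOST"/*.olr
  # Hypothetical backup file name; substitute an actual backup from the listing.
  "$GRID_HOME/bin/ocrconfig" -local -restore "$GRID_HOME/cdata/$HOST/backup_$HOST.olr"
else
  echo "no OLR backup directory under $GRID_HOME/cdata"
fi
```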
Case 7: CRSD Agents do not start
CRSD.BIN spawns two agents to start up user resources; the two agents share the same names and binaries as the ohasd.bin agents:
orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and ora.gns
oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resources etc
To find out the user resource status:
$GRID_HOME/bin/crsctl stat res -t
If crsd.bin cannot start any of the above agents properly, user resources may not come up.
1. A common cause of agent failure is that the log file or log directory for the agents doesn't have proper ownership or permissions.
Refer to the section "Log File Location, Ownership and Permission" below for general reference.
2. An agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process"; refer to section "OHASD does not start" #10 for details.
Case 8: HAIP does not start
HAIP may fail to start with various errors, e.g.:
[ohasd(891)]CRS-2807:Resource 'ora.cluster_interconnect.haip' failed to start automatically.
Refer to note 1210883.1 for more details on HAIP.
Network and Naming Resolution Verification
CRS depends on a fully functional network and name
resolution. If the network or name resolution is not fully functioning, CRS may
not come up successfully.
To validate network and name resolution setup, please refer
to note 1054902.1
Log File Location, Ownership and Permission
Appropriate ownership and permission of sub-directories and
files in $GRID_HOME/log is critical for CRS components to come up properly.
In Grid Infrastructure cluster environment:
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owners rdbmsap and rdbmsar, here's what it looks like under $GRID_HOME/log in a cluster environment:
drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x 2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall 4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root oinstall 4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root root 4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root root 4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root root 4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root root 4096 Jan 26 18:14 orarootagent_root
    -rw-rw-r-- 1 root root 12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall 4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall 4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall 4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall 4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall 4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall 4096 Dec  6 09:25 evmd
    drwxr-x--- 2 grid oinstall 4096 Jan 26 21:20 gipcd
    drwxr-x--- 2 root oinstall 4096 Dec  6 09:20 gnsd
    drwxr-x--- 2 grid oinstall 4096 Jan 26 20:58 gpnpd
    drwxr-x--- 2 grid oinstall 4096 Jan 26 21:19 mdnsd
    drwxr-x--- 2 root oinstall 4096 Jan 26 21:20 ohasd
    drwxrwxr-t 5 grid oinstall 4096 Dec  6 09:34 racg
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall 4096 Jan 26 20:57 srvm
Please note that most log files in a sub-directory inherit the ownership of the parent directory. The above is just a general reference to tell whether there are unexpected recursive ownership and permission changes inside the CRS home. If you have a working node with the same version, the working node should be used as a reference.
In Oracle Restart environment:
Here's what it looks like under $GRID_HOME/log in an Oracle Restart environment:
drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x 2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x 3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 admin
    drwxrwxr-t 4 root oinstall 4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall 4096 Nov  2  2009 client
    drwxr-x--- 2 root oinstall 4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 cssd
    drwxr-x--- 2 root oinstall 4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root oinstall 4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall 4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall 4096 Oct 31  2009 srvm
Network Socket File Location, Ownership and Permission
Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle.
When a socket file has unexpected ownership or permissions, usually the daemon log file (e.g. evmd.log) will show the following:
2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))
2011-06-18 14:07:28.545: [ clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lena042DBG_EVMD))
2011-06-18 14:07:28.545: [ clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai
And the following error may be reported:
CRS-5017: The resource action "ora.evmd start"
encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..
The solution is to stop GI as root (crsctl stop crs -f), clean up the socket files, and restart GI.
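The cleanup can be sketched as the commands below, run as root. The GRID_HOME path is an assumption from this note's examples, and which socket directory exists depends on the platform; the executable check is only a guard so the sketch is a no-op outside a cluster:

```shell
# Sketch of the socket file cleanup described above; run as root.
GRID_HOME=${GRID_HOME:-/ocw/grid}

if [ -x "$GRID_HOME/bin/crsctl" ]; then
  # Stop GI forcefully, remove stale socket files from whichever
  # location this platform uses, then restart GI.
  "$GRID_HOME/bin/crsctl" stop crs -f
  rm -rf /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle 2>/dev/null
  "$GRID_HOME/bin/crsctl" start crs
else
  echo "crsctl not found under $GRID_HOME/bin (not a Grid Infrastructure node)"
fi
```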
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and cluster name eotcs:
In Grid Infrastructure cluster environment:
Below is an example output from cluster environment:
drwxrwxrwt 2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall 0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall 5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root root     0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall 0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall 0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall 0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall 0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall 0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall 0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall 0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root root     0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root root     0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root root     0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root root     0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root root     0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall 0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall 0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root root     0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root root     0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall 0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root root     0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root root     0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root root     0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall 0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth
In Oracle Restart environment:
Below is an example output from an Oracle Restart environment:
drwxrwxrwt 2 root oinstall 4096 Feb  2 21:25 .oracle
./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug 1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31 2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul 7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul 7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug 1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug 1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug 1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug 1 17:23 sprocr_local_conn_0_PROL
Diagnostic file collection
If the issue can't be identified with this note, please run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all .gz files it generates in the current directory.
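The collection step can be sketched as follows, run as root on each node; the GRID_HOME path and the choice of /tmp as the working directory are assumptions, and the guard makes the sketch a no-op outside a cluster:

```shell
# Run the diagnostic collector and list the resulting archives.
GRID_HOME=${GRID_HOME:-/ocw/grid}

if [ -x "$GRID_HOME/bin/diagcollection.sh" ]; then
  # The script writes its .gz archives to the current directory.
  cd /tmp && "$GRID_HOME/bin/diagcollection.sh"
  ls -l /tmp/*.gz
else
  echo "diagcollection.sh not found under $GRID_HOME/bin"
fi
```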
REFERENCES
NOTE:1077094.1 - How to fix the "DiscoveryString in profile.xml" or "asm_diskstring in ASM" if set wrongly
NOTE:1323698.1 - Troubleshooting CRSD Start up Issue
NOTE:1325718.1 - OHASD not Starting After Reboot on SLES
NOTE:1068835.1 - What to Do if 11gR2 Grid Infrastructure is Unhealthy
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix
BUG:10105195 - PROC-32 ACCESSING OCR; CRS DOES NOT COME UP ON NODE
NOTE:10105195.8 - Bug 10105195 - Clusterware fails to start after reboot due to gpnpd fails to start
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:1053970.1 - Troubleshooting 11.2 Grid Infrastructure root.sh Issues
NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
NOTE:1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device