MC ServiceGuard - Reasons for TOC

Thursday, December 29, 2011

Transfer of Control (TOC)

MC ServiceGuard (MCSG) will TOC a system to release system resources and to ensure data integrity. It does so in the following scenarios:

  • A two-node cluster loses heartbeat. Both nodes then race for the lock disk; the winner forms a single-node cluster and the loser will TOC.

  • To check the lock disk and heartbeat configuration for this scenario:

    # cmviewconf

    Cluster information:
    cluster name: testcluster
    version: 0
    flags: 12 (single cluster lock)
    heartbeat interval: 1.00 (seconds)
    node timeout: 8.00 (seconds)
    heartbeat connection timeout: 16.00 (seconds)
    auto start timeout: 600.00 (seconds)
    network polling interval: 2.00 (seconds)
    first lock vg name: /dev/vglock
    second lock vg name: (not configured)

    Cluster Node information:
    Node ID 1:
    Node name: node1
    first lock pv name: /dev/dsk/c0t4d4
    first lock disk interface type: c720
    Network ID 1:

    mac addr: 0x080009fd4375
    hardware path: 8/16/6
    network interface name: lan0
    subnet mask:
    ip address:
    flags: 1 (Heartbeat Network) 

    bridged net ID: 1

    # lanscan

    Hardware Station        Crd Hdw   Net-Interface NM MAC   HP-DLPI DLPI
    Path     Address        In# State NamePPA       ID Type  Support Mjr#
    8/16/6   0x080009FD4375 0   UP    lan0 snap0    1  ETHER Yes     119
    8/8/2/0  0x00108318AFEE 2   UP    lan2 snap2    2  ETHER Yes     119
    8/8/1/0  0x00108318AFED 1   UP    lan1 snap1    3  ETHER Yes     119

    # cmscancl -n node -o /tmp/scan.log

    Check the "link-level connectivity" information in /tmp/scan.log.

  • The cluster daemon, cmcld, dies for any reason.

  • Message in the log when this happens:
    Serviceguard: Unable to maintain contact with cmcld daemon. Performing TOC to ensure data integrity.

  • NODE_FAIL_FAST_ENABLED = YES is set in a package configuration file.

  • The cluster lvm daemon, cmlvmd, dies for any reason.

  • System safety time is disabled via the cmsetsafety command.

  • SERVICE_FAIL_FAST_ENABLED = YES is set (causes reboot).

  • You can confirm a TOC by searching /etc/shutdownlog for an entry like this:

    18:23 Thu Apr 24 2003. Reboot after panic: SafetyTimer expired, ...
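
The shutdownlog check can be scripted. The following is a minimal sketch, not an official Serviceguard tool: find_toc_entries is a hypothetical helper name, and it assumes that a TOC shows up as a "Reboot after panic" line mentioning SafetyTimer, as in the sample entry above. Other panic strings would need their own pattern.

```shell
#!/bin/sh
# Sketch: list shutdownlog entries that look like a Serviceguard TOC.
# find_toc_entries is a hypothetical helper, not part of Serviceguard.
# Argument 1 (optional): log file to search, default /etc/shutdownlog.
find_toc_entries() {
    log="${1:-/etc/shutdownlog}"
    # Assumption: a TOC is logged as a reboot after panic that names the
    # safety timer, matching the sample entry shown in this post.
    grep -i 'Reboot after panic' "$log" | grep -i 'SafetyTimer'
}
```

Called without arguments, find_toc_entries searches /etc/shutdownlog; pass a path to search a saved copy instead. It prints the matching lines and returns a non-zero status when none are found, so it can be used directly in an if test.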