Veritas Cluster Server on Solaris

      System Administration Guide

             by Brendan Choi

---------------------------------------------

VCS DEFAULT VALUES

Private Network Heartbeat frequency = 0.5 sec
  Can be modified in /etc/llttab in 1/100th secs.

Low-Pri Network Heartbeat frequency = 1.0 sec
  Can be modified in /etc/llttab in 1/100th secs.

Failover interval after reboot command
  (VCS 1.3 and up) = 60 sec
  Can be modified with the ShutdownTimeout system attribute (via hasys).

Resource monitoring interval (by Resource Type) = 60 sec

Monitoring an offline Resource (by Resource Type) = 300 sec

LLT dead system declaration = 21 sec
  (16 sec peer inactive + 5 sec GAB stable timeout value)
  Peer inactive can be changed using "set-timer" in /etc/llttab
  in 1/100th secs.
  Stable timeout value can be changed using "gabconfig -t".

GAB-HAd heartbeat = 15 sec  (set by VCS_GAB_TIMEOUT environment variable in
                             milliseconds; requires a restart of had)

Time GAB allows HAd to be killed before panic (IOFENCE) = 15 sec  (set by
                                                         gabconfig -f)

Max. Number of Network Heartbeats = 8

Max. Number of Disk Heartbeats = 4

VCS had engine port = 14141

VCS 2.0 Web Server port = 8181

LLT SAP value = 0xcafe
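
A minimal sketch of how several of the defaults above are tuned (the values
shown are the defaults themselves; verify the exact timer names on your
system with "lltconfig -T query"):

  # /etc/llttab entries, in 1/100ths of a second
  set-timer heartbeat:50       # private link heartbeat (0.5 sec)
  set-timer heartbeatlo:100    # low-pri link heartbeat (1.0 sec)
  set-timer peerinact:1600     # declare a peer inactive after 16 sec

  # GAB stable timeout, in milliseconds
  /sbin/gabconfig -t 5000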

-------------------------------------------------------

GABCONFIG SETTINGS OPTIONS

Running "gabconfig -l" on a node will give you the current GAB settings
for that node. These values can be changed with the gabconfig command.

EXAMPLE:

draconis # gabconfig -l
GAB Driver Configuration
Driver state         : Configured
Partition arbitration: Disabled
Control port seed    : Enabled
Halt on process death: Disabled
Missed heartbeat halt: Disabled
Halt on rejoin       : Disabled
Keep on killing      : Disabled
Restart              : Enabled
Node count           : 2
Disk HB interval (ms): 1000
Disk HB miss count   : 4
IOFENCE timeout (ms) : 15000
Stable timeout (ms)  : 5000

Here is the gabconfig option that corresponds to each setting:

Driver state            -c
Partition arbitration   -s
Control port seed       -n2 or -x
Halt on process death   -p
Missed heartbeat halt   -b
Halt on rejoin          -j
Keep on killing         -k
IOFENCE timeout (ms)    -f
Stable timeout (ms)     -t

To test "Halt on process death", do kill -9 on the had/hashadow PID. To
test "Missed heartbeat halt", do kill -23 on the had PID. In both tests,
GAB will panic your system.
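
For example, a sketch of the process-death test (run it only on a test
node, since the system will panic):

  ps -ef | grep VRTSvcs/bin/had | grep -v grep    # note the had PID
  kill -9 <had PID>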

-------------------------------------------------------

VCS PACKAGES SOLARIS

Here are the Solaris packages for VCS Version 1.1.1:

optional    VRTScsga       VERITAS Cluster Server Graphical Administrator
optional    VRTSgab        VERITAS Group Membership and Atomic Broadcast
optional    VRTSllt        VERITAS Low Latency Transport
optional    VRTSperl       VERITAS Perl for VRTSvcs
optional    VRTSvcs        VERITAS Cluster Server
optional    VRTSvcsor      VERITAS Cluster Server Oracle Enterprise Extension

NOTE: Veritas advises that versions 1.1 and 1.1.1 not be used in production.

Here are the Solaris packages for VCS Version 1.3:

optional    VRTScscm       VERITAS Cluster Server Cluster Manager
optional    VRTSgab        VERITAS Group Membership and Atomic Broadcast
optional    VRTSllt        VERITAS Low Latency Transport
optional    VRTSperl       VERITAS Perl for VRTSvcs
optional    VRTSvcs        VERITAS Cluster Server
optional    VRTSvcsor      VERITAS Cluster Server Oracle Enterprise Extension

Here are the Solaris packages for VCS Version 2.0:

optional    VRTSgab        VERITAS Group Membership and Atomic Broadcast
optional    VRTSllt        VERITAS Low Latency Transport
optional    VRTSperl       VERITAS Perl for VRTSvcs
optional    VRTSvcs        VERITAS Cluster Server
optional    VRTSvcsdb      VERITAS Cluster Server Db2udb Enterprise Extension
optional    VRTSvcsdc      VERITAS Cluster Server Documentation
optional    VRTSvcsw       VERITAS Cluster Manager (Web Console)

Here are the Solaris packages for VCS QuickStart Version 2.0:

optional    VRTSappqw      VERITAS Cluster Server Application QuickStart Wizard
optional    VRTSgab        VERITAS Group Membership and Atomic Broadcast
system      VRTSlic        VERITAS Licensing Utilities
optional    VRTSllt        VERITAS Low Latency Transport
optional    VRTSperl       VERITAS Perl for VRTSvcs
optional    VRTSvcs        VERITAS Cluster Server
optional    VRTSvcsqw      VERITAS Cluster Server QuickStart Cluster Manager (Web Console)

-----------------------------------------------

VCS INSTALLATION

Install packages in this order:

  VRTSllt
  VRTSgab
  VRTSvcs
  VRTSperl
  VRTScscm
  VRTSvcsor

Copy gabtab and llttab files from /opt/VRTSgab and /opt/VRTSllt
to /etc. Configure both files.

Create /etc/llthosts (MANDATORY for version 1.3).
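
A sketch of the steps above from the command line (assuming the packages
are in the current directory; sample file names can differ by release):

  pkgadd -d . VRTSllt VRTSgab VRTSvcs VRTSperl VRTScscm VRTSvcsor

  cp /opt/VRTSllt/llttab /etc/llttab
  cp /opt/VRTSgab/gabtab /etc/gabtab
  vi /etc/llttab /etc/gabtab     # set node name, cluster ID, links, -n count
  vi /etc/llthosts               # one "<node ID> <hostname>" line per node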

-----------------------------------------------

PROCESSES ON VCS SERVER

Some processes commonly found on a VCS node include:

    root   577     1  0   Sep 14 ?       16:53 /opt/VRTSvcs/bin/had
    root   582     1  0   Sep 14 ?        0:00 /opt/VRTSvcs/bin/hashadow
    root   601     1  0   Sep 14 ?        2:33 /opt/VRTSvcs/bin/DiskGroup/DiskGroupAgent -type DiskGroup
    root   603     1  0   Sep 14 ?        0:56 /opt/VRTSvcs/bin/IP/IPAgent -type IP
    root   605     1  0   Sep 14 ?       10:17 /opt/VRTSvcs/bin/Mount/MountAgent -type Mount
    root   607     1  0   Sep 14 ?       11:23 /opt/VRTSvcs/bin/NIC/NICAgent -type NIC
    root   609     1  0   Sep 14 ?       31:14 /opt/VRTSvcs/bin/Oracle/OracleAgent -type Oracle
    root   611     1  0   Sep 14 ?        3:34 /opt/VRTSvcs/bin/SPlex/SPlexAgent -type SPlex
    root   613     1  0   Sep 14 ?        8:06 /opt/VRTSvcs/bin/Sqlnet/SqlnetAgent -type Sqlnet
    root 20608 20580  0 12:04:03 pts/1    0:20 /opt/VRTSvcs/bin/../gui/jre1.1.6/bin/../bin/sparc/green_threads/jre -mx128m VCS

----------------------------------------------------

GENERAL NOMENCLATURE

VCS is concerned with the following components:

  Cluster Attributes
  Agents
  Systems
  Service Groups
  Resource Types
  Resources

You can cluster many basic UNIX services, such as:

  User home directories
  NIS services
  NTP time services

To cluster an application, you cluster individual services
into Service Groups.

Resource types can be "On-Off", "OnOnly" or "Persistent".
An NFS resource is "OnOnly" because NFS may be needed by
filesystems outside of VCS control. A NIC (network card)
resource is "Persistent" because VCS cannot stop or start a NIC.


-----------------------------------------------------

QUICKSTART SCRIPT

Earlier versions of VCS included a Quickstart Wizard.
After installing the VCS packages, execute the Quickstart
script from an Xterm.

   /opt/VRTSvcs/wizards/config/quick_start

------------------------------------------------------

VCS STARTUP CONFIGURATION FILES

VCS startup and stop files include:

/etc/rc2.d/S70llt                  
/etc/rc2.d/S92gab
/etc/rc3.d/S99vcs
/etc/rc0.d/K10vcs

Important VCS config files include:

/etc/VRTSvcs/conf/config/main.cf
/etc/VRTSvcs/conf/config/types.cf
/etc/llttab
/etc/gabtab
/etc/llthosts

You should not edit config files in /etc/VRTSvcs/conf/config
without bringing the Cluster down first.
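
If you do have to edit main.cf by hand, a sketch of the usual sequence
(using commands covered later in this guide):

  hastop -all -force                # stop VCS everywhere, keep services up
  vi /etc/VRTSvcs/conf/config/main.cf
  cd /etc/VRTSvcs/conf/config
  hacf -verify .                    # syntax-check the edited configuration
  hastart                           # start first on the node with the edited main.cf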

------------------------------------------------------

SHUTDOWN VCS HASTOP

To shutdown VCS on all the systems without bringing any
user services down:

  /opt/VRTSvcs/bin/hastop -all -force

NOTE: This is the way to stop the Cluster if it is open read-
      write. If the Cluster is open, you will have a .stale file.
      The next hastart will not startup service groups that were
      already offline.

To shutdown VCS and Service Groups only on the local server:

  /opt/VRTSvcs/bin/hastop -local

To shutdown VCS only on the local server, and keep services up
on the current node:

  /opt/VRTSvcs/bin/hastop -local -force

NOTE: This is the way to stop the Cluster if it is open read-
      write. If the Cluster is open, you will have a .stale file.

To shutdown locally and move Service Groups to another machine:

  /opt/VRTSvcs/bin/hastop -local -evacuate

Stopping VCS on any node will write the configuration to the
main.cf file on each node if -force is not used and Cluster
is closed (read-only).

------------------------------------------------------

LLT GAB COMMON INFORMATIONAL COMMANDS

/sbin/gabconfig -a        Verify LLT and GAB are up and running.

/sbin/lltstat -n          Show heartbeat status
/sbin/lltstat -nvv        Show the heartbeats with MAC addresses
                          for up to 32 nodes.
  NOTE: lltstat displays "Link" using the tag you use in /etc/llttab.

/sbin/lltstat -p          Show port status

/sbin/lltconfig -a list   See MAC addresses on LLT links.
/sbin/lltconfig -T query  Display heartbeat frequencies.

To test and watch LLT traffic between 2 nodes:

/opt/VRTSllt/llttest -p 1
  >transmit -n <name of other node> -c 5000
/opt/VRTSllt/llttest -p 1  (on other node)
  >receive -c 5000

/opt/VRTSllt/lltdump -f <network link device>
                          Show LLT traffic.

/opt/VRTSllt/lltshow -n <node name>   Show LLT kernel structures.

/opt/VRTSllt/dlpiping -vs <network link device>
                          Turn on your dlpiping server.

/opt/VRTSllt/dlpiping -c <network link device> <MAC address of other node>
                          Send LLT packet to other node and see response.
                          Other node must have dlpiping server running.

------------------------------------------------------

GABCONFIG LLTCONFIG SEEDING CLUSTER STARTUP LLTSTAT

GAB and LLT operate at Layer 2 (the data link layer) of the OSI stack. LLT
is a Data Link Provider Interface (DLPI) protocol.

GAB deals with:  (1) Cluster memberships
                 (2) Monitoring heartbeats
                 (3) Distributing information throughout the Cluster

LLT deals with:  (1) System ID's in the Cluster
                 (2) Setting Cluster ID's for multiple clusters.
                 (3) Tuning network heartbeat frequencies.

Heartbeat frequency is 0.5 seconds on a private network, and
1.0 seconds on a low-pri network.

Use "/sbin/lltconfig -T query" to find out your heartbeat
frequencies.

Use gabconfig to control Cluster seeding and startup.

EXAMPLE:
                                      
If the Cluster normally has 4 nodes, then /etc/gabtab should
contain:

  /sbin/gabconfig -c -n 4

VCS will then not start until all 4 nodes are up. You should execute this
on each node of the Cluster.

To start VCS with fewer nodes, run gabconfig with a lower node count.

To seed the Cluster manually if no other nodes are available, execute:

  /sbin/gabconfig -c -x

NOTE: If no other nodes are available, you must do this if you want
      to start VCS on the current node.

To see that LLT and GAB are up and running:

  /sbin/gabconfig -a

  GAB Port Memberships
  ===============================================================
  Port a gen 4b2f0011 membership 01                              
  Port h gen a6690001 membership 01

  The port "a" indicates GAB is communicating, port "h" indicates
  VCS is communicating. The "01" indicates node 0 and node 1.
  The gen strings are randomly generated numbers.


  GAB Port Memberships
  ===================================
  Port a gen a36e0003 membership 01
  Port a gen a36e0003 jeopardy 1
  Port h gen fd570002 membership 01
  Port h gen fd570002 jeopardy 1

  This output indicates one of the heartbeat links is down, so
  VCS is in jeopardy mode.


  GAB Port Memberships
  ===============================================================
  Port a gen 3a24001f membership 01                              
  Port h gen a10b0021 membership 0                               
  Port h gen a10b0021    visible ;1    

  This output indicates that GAB on node 1 has lost contact
  with its VCS daemons.


  GAB Port Memberships
  ===============================================================
  Port a gen 3a240021 membership 01                  

  This output indicates that VCS daemons are down on the current
  node, but GAB and LLT are still up.


To see the current LLT configuration:

  /sbin/lltconfig -a list

To shutdown GAB:

  /sbin/gabconfig -U

To unload GAB (or LLT) kernel module:
                                      
  modinfo | grep <gab | llt>  (to find the module number)               
                                      
  modunload -i <module number>        
                                      
To shutdown the LLT:                  

  lltconfig -U        

Commands to monitor LLT status:

  /sbin/lltstat -n        Shows heartbeat status
  /sbin/lltstat -nvv      Shows the heartbeats with MAC addresses
  /sbin/lltstat -p        Shows port status

------------------------------------------------------

NETWORK INTERFACES

For 2 Nodes, you need at least 4 interfaces on each server.
The LLT interfaces use a VCS protocol, not IP, on their own
private networks. You can have up to 8 LLT network links.

Here's a common configuration on Sun:

  hme0 ----> VCS Private LAN 0      LLT connection
  qfe0 ----> VCS Private LAN 1      LLT connection
  qfe1 ----> Server's IP
  qfe2 ----> Cluster Virtual IP (managed by VCS)

The VIP and server IP must belong to the same subnet.

Do not create /etc/hostname.hme0 or /etc/hostname.qfe0
files if those are the LLT interfaces.

Important VCS files in /etc:

   /etc/rc2.d/S70llt
   /etc/rc2.d/S92gab
   /etc/rc3.d/S99vcs

   /etc/llttab
   /etc/gabtab
   /etc/llthosts

EXAMPLES:

  Low Latency Transport configuration /etc/llttab:

     set-node        cp01
     set-cluster     3
     link hme1 /dev/hme:1 - ether - -
     link qfe0 /dev/qfe:0 - ether - -
     link-lowpri qfe4 /dev/qfe:4 - ether - -
     start

NOTE: Each VCS Cluster on a LAN must have its own ID. The "set-cluster"
      value in the /etc/llttab is the Cluster's ID number.
      The first string after "link" is a tag you can name any way you want.
      It is shown in the "lltstat" command.


  Group Membership Atomic Broadcast configuration /etc/gabtab:

    /sbin/gabconfig -c -n3

  Low Latency Hosts Table /etc/llthosts:

    1       cp01
    2       cp02
    3       cp03

These files start the LLT and GAB communications:

   /etc/rc2.d/S70llt
   /etc/rc2.d/S92gab

This symlink in /dev must exist:

   ln -s ../devices/pseudo/clone@0:llt llt

In /devices/pseudo :

crw-rw-rw-   1 root     sys       11,109 Sep 21 10:38 clone@0:llt
crw-rw-rw-   1 root     sys      143,  0 Sep 21 10:39 gab@0:gab_0
crw-rw-rw-   1 root     sys      143,  1 Feb  1 16:59 gab@0:gab_1
crw-rw-rw-   1 root     sys      143,  2 Sep 21 10:39 gab@0:gab_2
crw-rw-rw-   1 root     sys      143,  3 Sep 21 10:39 gab@0:gab_3
crw-rw-rw-   1 root     sys      143,  4 Sep 21 10:39 gab@0:gab_4
crw-rw-rw-   1 root     sys      143,  5 Sep 21 10:39 gab@0:gab_5
crw-rw-rw-   1 root     sys      143,  6 Sep 21 10:39 gab@0:gab_6
crw-rw-rw-   1 root     sys      143,  7 Sep 21 10:39 gab@0:gab_7
crw-rw-rw-   1 root     sys      143,  8 Sep 21 10:39 gab@0:gab_8
crw-rw-rw-   1 root     sys      143,  9 Sep 21 10:39 gab@0:gab_9
crw-rw-rw-   1 root     sys      143, 10 Sep 21 10:39 gab@0:gab_a
crw-rw-rw-   1 root     sys      143, 11 Sep 21 10:39 gab@0:gab_b
crw-rw-rw-   1 root     sys      143, 12 Sep 21 10:39 gab@0:gab_c
crw-rw-rw-   1 root     sys      143, 13 Sep 21 10:39 gab@0:gab_d
crw-rw-rw-   1 root     sys      143, 14 Sep 21 10:39 gab@0:gab_e
crw-rw-rw-   1 root     sys      143, 15 Sep 21 10:39 gab@0:gab_f

/etc/name_to_major (numbers differ on each system):

llt 109
gab 143


------------------------------------------------------

STARTUP VCS HASTART HACONF MAIN.CF DUMP

VCS only starts up locally on a machine. If the main.cf files differ across
nodes, you must manually start or reboot the nodes in sequence, beginning
with the node whose main.cf you want the Cluster to use.

To startup VCS:

  /opt/VRTSvcs/bin/hastart

If another node has already started and been seeded, VCS will load
that other node's main.cf into the memory of the current node.

To start VCS and treat the configuration as stale even if it is valid:

  /opt/VRTSvcs/bin/hastart -stale

This will make a .stale file throughout the Cluster.

If VCS fails to start normally, the configuration might be stale. If a .stale
file exists and you really need to start the cluster now, use the "force"
option to override the stale file and start the cluster:

  /opt/VRTSvcs/bin/hastart -force

After you start VCS on all the nodes, you must tell VCS to write
the Cluster configuration to main.cf on disks on all nodes. This will
remove the .stale file. In VCS 2.0, the .stale file is automatically
removed on a forced startup.

  /opt/VRTSvcs/bin/haconf -dump -makero

NOTE: The node that was started first will just have its main.cf file
      reloaded with the same information. The other nodes will have
      theirs updated.

Main.cf, types.cf and any include files are written to automatically when
a node joins a cluster and when the cluster changes configuration while 
it is online.

--------------------------------------------------------

HASTATUS HASYS CHECK STATUS

To verify that the Cluster is up and running:

  /opt/VRTSvcs/bin/hastatus  (will show a real-time output of
                               VCS events)

  /opt/VRTSvcs/bin/hastatus -sum

  /opt/VRTSvcs/bin/hasys -display

--------------------------------------------------------

START AND STOP SERVICE GROUPS

You can manually start (online) and stop (offline)
Service Groups on a given server.

  hagrp -online <service group> -sys <host name>

  hagrp -offline <service group> -sys <host name>

--------------------------------------------------------

HAGRP SWITCH MIGRATE FREEZE SERVICE GROUPS FAILOVER

To switch (move) a Service Group from one Node to another:

   hagrp -switch <Group name> -to <Hostname of other Node>

To freeze a Service Group:

   hagrp -freeze <Service Group> -persistent

--------------------------------------------------------

BINARIES MAN PAGES PATH

The man pages for VCS are in the following directories:

  /opt/VRTSllt/man
  /opt/VRTSgab/man
  /opt/VRTSvcs/man

Most of the binaries are stored in:

  /opt/VRTSvcs/bin

------------------------------------------------------

COMMON MONITORING DISPLAY COMMANDS

hastatus -summary                  Show current status of the VCS Cluster

hasys -list                        List all Systems in the Cluster
hasys -display                     Get detailed information about each System

hagrp -list                        List all Service Groups
hagrp -resources <Service Group>   List all Resources of a Service Group
hagrp -dep <Service Group>         List a Service Group's dependencies
hagrp -display <Service Group>     Get detailed information about a Service
                                   Group
haagent -list                      List all Agents
haagent -display <Agent>           Get information about an Agent

hatype -list                       List all Resource Types
hatype -display <Resource Type>    Get detailed information about a Resource
                                   Type
hatype -resources <Resource Type>  List all Resources of a Resource Type

hares -list                        List all Resources
hares -dep <Resource>              List a Resource's dependencies
hares -display <Resource>          Get detailed information about a Resource

haclus -display                    List attributes and attribute values of the
                                   Cluster

------------------------------------------------------

VCS COMMAND SET PROCESSES

Most commands are stored in /opt/VRTSvcs/bin.


hagrp           Evacuate Service Groups from a Node
                Check groups, group resources, dependencies,
                  attributes
                Start, stop, switch, freeze, unfreeze, disable, enable,
                  and flush groups; disable and enable resources in a group

hasys           Check Node parameters
                List Nodes in the Cluster, attributes, resource types,
                  resources, attributes of resources
                Freeze, thaw node

haconf          Dump HA configuration

hauser          Manage VCS user accounts

hastatus        Check Cluster status

haclus          Check Cluster attributes

hares           Check resources
                Online and offline a resource, offline and propagate to
                children, probe, clear faulted resource

haagent         List agents, agent status, start and stop agents

hastop          Stop VCS

hastart         Start VCS

hagui           Change Cluster configuration

hacf            Generate main.cf file. Verify the local configuration

haremajor       Change Major number on shared block device

gabconfig       Check status of the GAB

gabdisk         Control GAB Heartbeat Disks  (VCS 1.1.x)

gabdiskx         Control GAB Heartbeat Disks
gabdiskhb        Control GAB Heartbeat Disks

lltstat         Check status of the link

rsync           Distribute agent code to other Nodes


Other processes:

had             The VCS engine itself. This is a high-priority
                real-time (RT) process. It can still get swapped
                out or sleep in a kernel system call.

hashadow        Monitors and restarts the VCS engine.

halink          Monitors communication links in the Cluster.

------------------------------------------------------

HACF CONFIGURATION

To verify the current configuration (works even if VCS is down):

  cd /etc/VRTSvcs/conf/config
  hacf -verify .

To generate a main.cf file:

  hacf -generate

To generate a main.cmd from a main.cf:

  hacf -cftocmd .

To generate a main.cf from a main.cmd:

  hacf -cmdtocf .

------------------------------------------------------

HACONF CONFIGURATION FILE MAIN.CF

To set the VCS configuration file (main.cf) to read-write:

  haconf -makerw

NOTE: This will create the .stale file.

To set the VCS configuration file to read-only (writing the in-memory configuration to disk):

  haconf -dump -makero

EXAMPLES:

To add a VCS user:

  haconf -makerw

  hauser -add <username>

  haconf -dump -makero

To add a new system called "sysa" to a group's SystemList
with a priority of 2:

  haconf -makerw

  hagrp -modify group1 SystemList -add sysa 2

  haconf -dump -makero

------------------------------------------------------

RHOSTS ROOT ACCESS

Add .rhosts files for user root to Nodes for transparent
rsh access between Nodes.

To add the root user to VCS:

  haconf -makerw
  hauser -add root
  haconf -dump -makero

To change the VCS root password:

  haconf -makerw
  hauser -update root
  haconf -dump -makero

------------------------------------------------------

HASYS SHUTDOWN REBOOT FAILOVER

Starting with VCS 1.3, a reboot will cause a failover
if the server goes offline (completes the shutdown) within 
a specified amount of time (default is 60 seconds).

To change this amount of time, execute for each node:

  haconf -makerw
  hasys -modify <system name> ShutdownTimeout <seconds>
  haconf -dump -makero

If you don't want a failover during a reboot, set the
time to 0.
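
For example, to disable failover on reboot for a hypothetical node "sysa":

  haconf -makerw
  hasys -modify sysa ShutdownTimeout 0
  haconf -dump -makero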

------------------------------------------------------

HAREMAJOR VERITAS VOLUMES MAJOR MINOR NUMBERS

If a disk partition or volume is to be exported over NFS (e.g.,
for a high availability NFS server), then the major and minor numbers
on all nodes must match.

Veritas Volumes:

To change the Major numbers of a Veritas Volume to be the same
as the Major numbers on the other node:

  haremajor -vx <vxio major number> <vxspec major number>

You can find these Major numbers by doing 'grep vx /etc/name_to_major'
on the other node.
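
For example (a sketch; the major numbers shown are made up and differ on
every system):

  # on the node whose numbers you want to match:
  grep vx /etc/name_to_major
    vxio 40
    vxspec 41

  # on the node being changed:
  haremajor -vx 40 41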

If the minor numbers of a Veritas Volume do not match, you must
use "vxdg" with the "reminor" option.

Disk Partitions:

To make the major numbers match, execute:

  haremajor -sd <new major number>

If the minor numbers of a disk partition do not match, you must
make the instance numbers in /etc/path_to_inst match.

After doing all this, execute:

  reboot -- -rv

------------------------------------------------------

VCS AGENTS

Agents are stored under /opt/VRTSvcs/bin.

Typical agents may include:

CLARiiON  (commercial)
Disk
DiskGroup
ElifNone
FileNone
FileOnOff
FileOnOnly
IP
IPMultiNIC
Mount
MultiNICA
NFS      (used by NFS server)
NIC
Oracle  (Part of Oracle Agent - commercial)
Phantom
Process
Proxy
ServiceGroupHB
Share    (used by NFS server)
Sqlnet (Part of Oracle Agent - commercial)
Volume

These agents can appear in the process table running like this:

/opt/VRTSvcs/bin/Volume/VolumeAgent -type Volume
/opt/VRTSvcs/bin/MultiNICA/MultiNICAAgent -type MultiNICA
/opt/VRTSvcs/bin/Sqlnet/SqlnetAgent -type Sqlnet
/opt/VRTSvcs/bin/Oracle/OracleAgent -type Oracle
/opt/VRTSvcs/bin/IPMultiNIC/IPMultiNICAgent -type IPMultiNIC
/opt/VRTSvcs/bin/DiskGroup/DiskGroupAgent -type DiskGroup
/opt/VRTSvcs/bin/Mount/MountAgent -type Mount
/opt/VRTSvcs/bin/Wig/WigAgent -type Wig

-------------------------------------------------------

SUN OBP SEND BREAK SERIAL PORT CORRUPTION

A Sun machine will halt the processor and drop to the OBP prompt if a
STOP-A or a BREAK signal is sent from the serial console. This can cause
VCS to corrupt data when the machine is brought back online.

To prevent this from happening, add the following line in
/etc/default/kbd:

  KEYBOARD_ABORT=disable

Also, on some Sun Enterprise machines, you can switch the Key to
the Padlock position to secure it from dropping accidentally to OBP.

-------------------------------------------------------

SHARING DISKS INITIATOR IDS

<<< THIS SECTION TO BE UPDATED >>>

If 2 Nodes are sharing the same disks on the same SCSI
bus, their SCSI host adapters must be assigned unique
SCSI "initiator" ID's.

The default SCSI initiator ID is 7.

To set the SCSI initiator ID on a system to 5, do the
following at the OBP:

  ok setenv scsi-initiator-id 5
  ok boot -r

-------------------------------------------------------

REMOVE VCS SOFTWARE

To remove the VCS packages, execute these commands:

   /opt/VRTSvcs/wizards/config/quick_start -b
   rsh <Node hostname> 'sh /opt/VRTSvcs/wizards/config/quick_start -b'
   pkgrm <VCS packages>
   rm -rf /etc/VRTSvcs /var/VRTSvcs
   init 6

------------------------------------------------------

SYNTAX MAIN.CF FILE

The main.cf is structured like this:

  * include clauses

  * cluster definition

  * system definitions

  * snmp definition

  * service group definitions

    * resource type definitions

      * resource definitions

    * resource dependency clauses

    * service group dependency clauses


Here's a template of what main.cf looks like:

####

include "types.cf"
include "<Another types file>.cf"
.
.
.

cluster <Cluster name> (
        UserNames = { root = <Encrypted password> }
        CounterInterval = 5
        Factor = { runque = 5, memory = 1, disk = 10, cpu = 25,
                 network = 5 }
        MaxFactor = { runque = 100, memory = 10, disk = 100, cpu = 100,
                 network = 100 }
        )

system <Hostname of the primary node>

system <Hostname of the failover node>

snmp vcs (
        TrapList = { 1 = "A new system has joined the VCS Cluster",
                 2 = "An existing system has changed its state",
                 3 = "A service group has changed its state",
                 4 = "One or more heartbeat links has gone down",
                 5 = "An HA service has done a manual restart",
                 6 = "An HA service has been manually idled",
                 7 = "An HA service has been successfully started" }
        )

group <Service Group Name> (
        SystemList = { <Hostname of primary node>, <Hostname of failover node> }
        AutoStartList = { <Hostname of primary node> }
        )

        <Resource Type> <Resource> (
                    <Attribute of Resource> = <Attribute value>
                    <Attribute of Resource> = <Attribute value>
                    <Attribute of Resource> = <Attribute value>
                    .
                    .
                    .
                    )
         .
         .
         .


         <Resource Type> requires <Resource Type>
         .
         .
         .

         
         // resource dependency tree
         //
         //    group <Service Group name>
         //    {
         //    <Resource Type> <Resource>
         //        {
         //        <Resource Type> <Resource>
         //        .
         //        .
         //        .
         //            {
         //            <Resource Type> <Resource>
         //            }
         //        }
         //    <Resource Type> <Resource>
         //    }

---------------------------------------------------

MAIN.CF RESOURCES ATTRIBUTE VALUES

By default, VCS monitors online resources every 60 seconds
and offline resources every 300 seconds. These are user
configurable.

Each Resource Type has Attributes and Values you can set.
Resources can be added in any order within a Service Group.
Here are some examples of common values in main.cf.


*** Service Groups:

group oragrpa (
        SystemList = { cp01, cp02 }
        AutoStart = 0
        AutoStartList = { cp01 }
        PreOnline = 1
        )

  + AutoStart determines whether the Service Group will start automatically
    after the machines in the AutoStartList are rebooted.
  + PreOnline determines whether a preonline script is executed.


*** Veritas Volume Manager Disk Groups:

DiskGroup external00 (
                DiskGroup = external00
                )

  + DiskGroup is the Veritas Volume Manager Disk Group name.


*** IP MultiNIC Virtual IP:

IPMultiNIC ip_cpdb01 (
                Address = "151.144.128.107"
                NetMask = "255.255.0.0"
                MultiNICResName = mnic_oragrpa
                )

  + Address is the VIP address.


*** Mount points:

Mount u_u01_a (
                MountPoint = "/u/u01"
                BlockDevice = "/dev/vx/dsk/external00/u_u01"
                FSType = vxfs
                MountOpt = rw
                )


*** MultiNIC IP's:

MultiNICA mnic_oragrpa (
                Device @cp01 = { hme0 = "151.144.128.101",
                         qfe4 = "151.144.128.101" }
                Device @cp02 = { hme0 = "151.144.128.102",
                         qfe4 = "151.144.128.102" }
                NetMask = "255.255.0.0"
                NetworkHosts @cp01 = { "151.144.128.1", "151.144.128.102",
                         "151.144.128.104" }
                NetworkHosts @cp02 = { "151.144.128.1", "151.144.128.101",
                         "151.144.128.104" }
                )

  + MultiNICA is the Resource Type that comes with VCS.

  + These are the interfaces and IP's for the nodes in this Service Group.

  + NetworkHosts includes IP's of interfaces to ping to see if the NIC is up.
    One of the IP's is the default router, 151.144.128.1.

  + The "@" sign indicates this attribute will only be applied to this
    specific system.


*** Normal IP's:

IP group1_ip1 (
       Device = hme0
       Address = "192.168.1.1"
       )


*** Normal NIC's:

NIC group1_nic1 (
        Device = hme0
        NetworkType = ether
        )


*** Oracle Database:

Oracle cawccs02 (
                Sid = cawccs02
                Owner = oracle
                Home = "/usr/apps/oracle/product/8.1.6"
                Pfile = "/usr/apps/oracle/product/8.1.6/dbs/initcawccs02.ora"
                User = vcs
                Pword = vcs
                Table = vcs
                MonScript = "./bin/Oracle/SqlTest.pl"
                )

  + The Resource here is "cawccs02" which is the Oracle instance name.
  + VCS requires that the Oracle DBA create a table called "vcs" for VCS
    to monitor the Database instance.


*** Sqlnet Listener:

Sqlnet listenera (
                Owner = oracle
                Home = "/usr/apps/oracle/product/8.1.6"
                TnsAdmin = "/usr/apps/oracle/product/8.1.6/network/admin"
                Listener = LISTENER
                MonScript = "./bin/Sqlnet/LsnrTest.pl"
                )

  + The values for the Oracle and Sqlnet resources are items you have
    to get from the Oracle DBA.

---------------------------------------------------

SYNTAX TYPES.CF FILE

Here's a template of what types.cf looks like:

######

type <Resource Type> (
             static str ArgList[] = { <attribute>, <attribute>, ... }
             NameRule = resource.<attribute>
             static str Operations = <value>
             static int NumThreads = <value>
             static int OnlineRetryLimit = <value>
             str <attribute>
             str <attribute> = <value>
             int <attribute> = <value>
             int <attribute>
)

.
.
.
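
For example, a simple bundled type such as FileOnOff looks roughly like this
(check the types.cf shipped with your version for the exact definition):

type FileOnOff (
             static str ArgList[] = { PathName }
             NameRule = resource.PathName
             str PathName
)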

---------------------------------------------------

GROUP TYPES

A Failover Group can only be online on one system.

A Parallel Group can be online on multiple systems.

Groups can be brought online in three ways:

  1. Command was issued
  2. Reboot
  3. Failover
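
Whether a Group is a Failover or Parallel Group is controlled by the Group's
Parallel attribute. A sketch of setting it from the command line (the group
name is hypothetical):

  haconf -makerw
  hagrp -modify mygroup Parallel 1
  haconf -dump -makero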

---------------------------------------------------

COMMUNICATIONS CHANNELS

VCS nodes communicate to each other in several ways.

1. Network channels (up to 8).
2. Communication or Service Group Heartbeat Disks.
   GAB controls disk-based communications.

NOTE: Heartbeat disks CANNOT carry cluster state information.
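
A sketch of adding a heartbeat disk region with gabdiskhb (the device path
and start block here are made up; check the gabdiskhb man page for your
release):

  /sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 16 -p a    # add the region on port a
  /sbin/gabdiskhb -l                                 # list configured regions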

---------------------------------------------------

NETWORK AND DISK HEARTBEATS

Here's a matrix of what happens when you are losing
network and disk heartbeat links.

1 net
0 dhb
      Jeopardy and Regular Memberships.

1 net-low-pri
0 dhb
      Jeopardy and Regular Memberships. This is the only time a
      low-pri link carries cluster status information.

0 net
1 dhb
      Jeopardy, and VCS splits your Cluster into mini-Clusters.
      That's because disk heartbeats can't carry cluster status
      information.

1 net
1 dhb
      No Jeopardy, since you have 2 links. Everything is fine for now.

1 net
1 net-low-pri
      No Jeopardy, since you have 2 links. Everything is fine for now.

NOTES: VCS versions before 1.3.0 may panic a system upon rejoin
       from a cluster split. Here were the rules:

       1. On a 2 node cluster, the node with the highest LLT host ID
          panics.
       2. On multinode cluster, the largest mini-cluster stays up.
       3. On multinode cluster splitting into equal size
          clusters, cluster with lowest node number stays up.

       In VCS 1.3.0, had is restarted on the node with the highest
       LLT host ID.

       4. If low-pri or regular links are on the same VLAN or network,
          unique Service Access Point (SAP) values must be used.

          EXAMPLE:

          set-node 0 
          set-cluster 0 
          link-lowpri mylink0 /dev/hme:0 - ether - -
          link-lowpri mylink1 /dev/hme:1 - ether 0xcaf0 - 
          link mylink4 /dev/qfe:4 - ether - - 
          link mylink5 /dev/qfe:5 - ether - - 
          start

          To change the SAP value while VCS is online, freeze the Service
          Groups then do this:

          lltconfig -u mylink1
          lltconfig -l -t mylink1 -d /dev/hme:1 -b ether -s 0xcaf0
          
          The other low-pri link can stay at the default 0xcafe value
          or it can be changed to another value also.

---------------------------------------------------

SEEDING HEARTBEATS FAILOVER JEOPARDY

Seeding is the act of admitting a system into the Cluster. By default,
systems are not seeded when they come up; they must be seeded by other
nodes, or else seed themselves (if the System Administrator allows it).

1. Assume ALL nodes boot with "gabconfig -c -nX", where X is the number of
nodes in your cluster.

If all nodes are booting up, and ALL X nodes come up AND send heartbeats,
"auto-seeding" begins on the first node that came up, VCS starts on that node,
loads that node's main.cf into memory, and VCS starts on all other nodes. In
other words, ALL nodes must be up before VCS starts automatically. Once all
nodes are up and send heartbeats, VCS starts.

If less than X nodes come up (e.g. one of the nodes crashes), the cluster
does not auto-seed, and VCS does not start on any node. You must manually
seed the cluster. Run "gabconfig -c -x", or "gabconfig -c -nY", where Y < X,
on the node you want to load the main.cf from. That node will seed itself, VCS
will start and load that node's main.cf, and other nodes will begin seeding.
VCS will then start on the other nodes.


2. Assume a node boots with "gabconfig -c -x" and all others have
"gabconfig -c -nX".

Assume these other nodes are down. The  "gabconfig -c -x" node will come up,
seed itself, and start VCS. The other nodes will come up and be seeded by this
cluster.

If the "gabconfig -c -nX" nodes come up first, they will wait for the
"gabconfig -c -x" node to come up. If the "gabconfig -c -x" node can't come
up, you must do "gabconfig -c -x" on one of the nodes that is up.


3. A node is seeded if and only if (1) there is another node that is already
seeded, or (2) all nodes in the cluster are up, or (3) it boots with
"gabconfig -c -x" or that command is executed on some node.


**** Heartbeats and Jeopardy ****

If only 1 heartbeat link remains, automatic failover after a system crash
is disabled. The cluster is in "jeopardy". You can still do manual failovers,
and VCS will still failover on individual service group failures. But it
is still very important that you replace or fix the broken link.

That's because in this condition, VCS cannot decide, when the next and last
heartbeat failure occurs, whether a system actually crashed or
whether it's just the heartbeat that failed (e.g., another loose cable).
When the next and last heartbeat is dead (because of cable failure or server
failure), VCS is designed to split the cluster into mini-clusters, each
capable of operating on its own. So, if there is no pro-active measure 
taken, when the last heartbeat is dead, each node will fire up the service
groups, thinking the other node is dead. In this case, a split-brain will
develop, with more than one system writing to the same storage, causing data
corruption. VCS must therefore act pro-actively to prevent a split-brain.
It does this by disabling automatic failover between the nodes until the
second heartbeat link is restored.

When an application crashes, VCS agents will detect this and failover.
When a system crashes, VCS will detect all heartbeats as being down
at the same time, the agents will sense the application is down, and
VCS will then do a failover. When only one or a few heartbeat links are down,
but more than one heartbeat is still up, and VCS agents sense the
application is still up, VCS will do automatic failover. But if only one
heartbeat remains, and the agents say the application is up, VCS will then
disable automatic failover if a system crashes (until a heartbeat is restored).
This is a pro-active measure by VCS.

So, if your heartbeats consist of only 2 network links, and one link fails,
automatic failover due to a system crash is disabled. It's better to add
one of the public links as a low-priority 3rd heartbeat, or add
a 3rd dedicated heartbeat/communications cable.

If the remaining link goes down, VCS partitions the nodes into mini-clusters.
All failover types are disabled. You MUST, at this point, shut down VCS on all
nodes BEFORE restoring ANY heartbeat links, otherwise VCS will panic your nodes
(you will get a core dump to debug from). You can disable this with
"/sbin/gabconfig -r". Starting with VCS 1.3.0, VCS will restart on the
node with the higher ID instead of panicking. To enable halt on rejoin, do
"/sbin/gabconfig -j".

---------------------------------------------------

AGENT ENTRY POINTS

Entry points can be in C++ or scripts in perl or shell. If your entry points
are scripts, you can use /opt/VRTSvcs/bin/ScriptAgent to build your agent.

Mandatory entry points:

VCSAgStartup (must be in C++)
monitor (return 100 if offline, 110 if online)

Optional entry points:

online (return value 0 or 1)
offline  (return value 0 or 1)
clean (return 0 clean, 1 not clean)
attr_changed
open
close
shutdown

Example agent binary source code:

  #include "VCSAgApi.h"

  // Shutdown entry point implemented in C++.
  void my_shutdown() {
  ...
  }

  // VCSAgStartup is mandatory; it registers the entry points.
  // Entry points left NULL are implemented as scripts in
  // /opt/VRTSvcs/bin/<agent>/<entry point>.
  void VCSAgStartup() {
      VCSAgEntryPointStruct ep;
      ep.open = NULL;
      ep.online = NULL;
      ep.offline = NULL;
      ep.monitor = NULL;
      ep.attr_changed = NULL;
      ep.clean = NULL;
      ep.close = NULL;
      ep.shutdown = my_shutdown;
      VCSAgSetEntryPoints(ep);
  }

  The entry point scripts are located in
  /opt/VRTSvcs/bin/<agent>/<entry point>.
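
A sketch of what a script entry point can look like (the agent name and
ArgList here are hypothetical; VCS passes the resource name first, then the
ArgList attribute values):

  #!/bin/sh
  # /opt/VRTSvcs/bin/MyApp/monitor -- hypothetical monitor entry point
  RESNAME=$1      # resource name (unused here)
  PIDFILE=$2      # first ArgList attribute of this hypothetical type
  if [ -f "$PIDFILE" ] && kill -0 `cat "$PIDFILE"` 2>/dev/null; then
          exit 110        # resource is online
  else
          exit 100        # resource is offline
  fi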

----------------------------------------------------

SERVICE GROUP CANNOT START TRICKS

To start a Service Group, execute:

   hagrp -online <service group> -sys <host name>

If a Service Group cannot start up on any nodes, try one of these
tricks:

1. Check out which systems have the AutoDisabled set to 1.

   hasys -display

   Execute the following for each of those systems:
  
   hagrp -autoenable <service group> -sys <host name>

2. Clear any faults for the service group.

   hagrp -clear <service group> -sys <host name>

3. Clear any faulted resource.

   hares -clear <resource> -sys <host name>

--------------------------------------------------------------

MAINTENANCE OFFLINE RESOURCE

Here's an example of how to do some filesystem maintenance,
while keeping the Service Group up (partially online). This will
involve offlining a resource without affecting other resources or
bringing the service group down.

This procedure is detailed in Veritas TechNote ID: 232192

  1.  haconf -makerw
  2.  hagrp -freeze <service group> -persistent
  3.  haconf -dump -makero

      Now do maintenance, e.g. unmount a filesystem.

      If you don't want resources monitored during maintenance,
      just do this before the maintenance:

        hagrp -disableresources <service group>

      After maintenance, remount your filesystems.

  4. haconf -makerw

  5. hagrp -unfreeze <service group> -persistent

     If you disabled resources,

     hagrp -enableresources <service group>

  6. haconf -dump -makero

     Find out which resources are still down.

  7. hastatus -sum
  8. hares -clear <mount resource>
  9. hares -online <mount resource> -sys <host name>

     Verify the Service Group is completely up.

  10. hastatus -sum

----------------------------------------------------

FILEONOFF CLUSTER MAIN.CF

Here's a main.cf for a simple 3-node Cluster where the 2
Service Groups bring a file online and check to see if that file
exists.


############
include "types.cf"
include "OracleTypes.cf"

cluster cpdb (
        UserNames = { veritas = cD9MAPjJQm6go }
        CounterInterval = 5
        Factor = { runque = 5, memory = 1, disk = 10, cpu = 25,
                 network = 5 }
        MaxFactor = { runque = 100, memory = 10, disk = 100, cpu = 100,
                 network = 100 }
        )

system cp01

system cp02

system cp03

snmp vcs (
        TrapList = { 1 = "A new system has joined the VCS Cluster",
                 2 = "An existing system has changed its state",
                 3 = "A service group has changed its state",
                 4 = "One or more heartbeat links has gone down",
                 5 = "An HA service has done a manual restart",
                 6 = "An HA service has been manually idled",
                 7 = "An HA service has been successfully started" }
        )

group oragrpa (
        SystemList = { cp01, cp02 }
        AutoStart = 0
        AutoStartList = { cp01 }
        PreOnline = 1
        )

        FileOnOff filea (
                PathName = "/var/tmp/tempa"
                )



        // resource dependency tree
        //
        //      group oragrpa
        //      {
        //      FileOnOff filea
        //      }


group oragrpb (
        SystemList = { cp02, cp03 }
        AutoStart = 0
        AutoStartList = { cp03 }
        PreOnline = 1
        )

        FileOnOff fileb (
                PathName = "/var/tmp/tempb"
                )



        // resource dependency tree
        //
        //      group oragrpb
        //      {
        //      FileOnOff fileb
        //      }

---------------------------------------------------------------------------

FILEONOFF CLUSTER MAIN.CF CLI

Here's an example of how to build a simple 2-node cluster using
FileOnOff agent entirely from the command line. Service Group is
"bchoitest", and the systems are "anandraj" and "bogota".

  hagrp -add bchoitest
  hagrp -modify bchoitest SystemList anandraj 0 bogota 1
  hagrp -modify bchoitest AutoStartList anandraj
  hares -add filea FileOnOff bchoitest
  hares -modify filea PathName "/tmp/brendan"
  hagrp -enableresources bchoitest
    or
  hares -modify filea Enabled 1
  hagrp -online bchoitest -sys anandraj

To add another FileOnOff resource:
  
  hares -add fileb FileOnOff bchoitest
  hares -modify fileb Enabled 1
  hares -modify fileb PathName "/tmp/brendan"
  hares -online fileb -sys anandraj

To link the resources in a resource dependency (making
fileb depend on filea):

  hares -link fileb filea

    The syntax here is "hares -link <parent> <child>".

---------------------------------------------------------------------------

VIOLATIONS

Concurrency Violation   This occurs when a resource in a failover
                        service group is online on more than one system.

---------------------------------------------------------------------------

TRIGGER SCRIPTS

Event trigger scripts have to be stored in this directory in order
for them to work:

  /opt/VRTSvcs/bin/triggers

Sample trigger scripts are stored in:

  /opt/VRTSvcs/bin/sample_triggers

Trigger scripts include:

nfs_restart
preonline
violation
injeopardy
nofailover
postoffline
postonline
resfault
resnotoff
sysoffline
resstatechange (VCS 1.3.0P1 and VCS 2.0)

VCS 1.3.0 Patch1 introduced the resstatechange event trigger. It can be
enabled with:

  hagrp -modify <service group> TriggerResStateChange 1

Or modify the script and place it in /opt/VRTSvcs/bin/triggers.
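
A sketch of a minimal trigger (the argument handling is deliberately
generic; copy the matching script from /opt/VRTSvcs/bin/sample_triggers to
get the exact argument list for each trigger):

  #!/bin/sh
  # /opt/VRTSvcs/bin/triggers/resstatechange -- hypothetical example that
  # just records each invocation; VCS passes the event details as arguments.
  echo "`date` resstatechange: $*" >> /var/VRTSvcs/log/resstatechange_trigger.log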

The following is from the VCS 1.3.0 User's Guide:

Event triggers are invoked on the system where the event occurred, with the
following exceptions:

*  The InJeopardy, SysOffline, and NoFailover event triggers are invoked
   from the lowest-numbered system in RUNNING state.

*  The Violation event trigger is invoked from all systems on which the
   service group was brought partially or fully online.

---------------------------------------------------------------------------

PARENT CHILD RESOURCES GROUPS DEPENDENCIES

NOTE: The Veritas definitions for "Parent" and "Child" are quite
      backward. They are not intuitive.

Resources and their dependencies form a "graph", with Parent resources
at the "root", and Child resources as "leaves".


   System A   Parent (Root)
       |
       |
       |
       |
   System B   Child (Leaf)
       |
       |
       |
       |
   System C

In this diagram, System A depends on B (A is the parent and B is the
child), and B depends on C (B is the parent relative to C).

Under Online Group Dependencies, the Parent resources or group
must wait for the Child to come online first before bringing itself
online.

Under Offline Group Dependencies, the Parent must wait for the Child to
go offline before bringing itself online.

From the Veritas Cluster Server Users Guide:

* Online local                        
The child group must be online on a system before the parent group can go
online on the same system.            

* Online global                       
The child group must be online on a system before the parent group can go
online on any system.                 

* Online remote                       
The child group must be online on a system before the parent group can go
online on a different system.         
                                      
* Offline local                       
The child group must be offline on a system before the parent group can go
online on that same system, and vice versa.

-------------------------------------------------------------

NOTES ON SERVICE GROUP DEPENDENCIES

Rules:
1. Parent can have ONLY one Child, but Child can have multiple Parents.
2. An SGD tree can have up to 3 levels maximum.

GroupA depends on GroupB = GroupA is parent of Group B, GroupB is child of
GroupA

EXAMPLE: an application (parent) depends on a database (child)

Categories of Dependencies = online, offline
Locations of Dependencies = local, global, remote
Types of Dependencies = soft, firm

Online SGD = parent must wait for child to come up before it can come up
Offline SGD = parent must wait for child to be offline before it can come
up, and vice versa

Local = Parent on same system
Global = Parent on any system
Remote = Parent on any other system (on any system other than the local one)

Soft = Parent may or may not automatically fail over if the Child dies.

* Online Local Soft = Parent fails over to the same system the Child fails
over to. If the Parent dies and the Child does not, the Parent cannot fail
over to another system. If the Child cannot fail over, the Parent stays
where it is.
* Online Global Soft = Parent will NOT fail over when the Child fails over.
If the Parent dies and the Child does not, the Parent can fail over to
another system.
* Online Remote Soft = If the Child fails over to the Parent's system, the
Parent fails over to another system. If the Child fails over to a system
other than the Parent's, the Parent stays where it is.

Firm = Parent MUST be taken offline if the Child dies. The Child cannot be
taken offline while the Parent is online. The Parent remains offline and
cannot fail over if the Child dies and cannot come back online.

* Online Local Firm = Parent fails over to the system the Child fails over to.
* Online Global Firm = Parent fails over to any system when the Child fails over.
* Online Remote Firm = Parent fails over to a system, but NOT the one the
  Child fails over to.

Offline Local = If the Child fails over to the Parent's system, the Parent
fails over to another system. The Parent can only be online on a system
where the Child is offline.

EXAMPLES: main.cf syntax

The "requires..." statements must come after the resource declarations
and before the resource dependency statements.

Online local firm: "requires group GroupB online local firm"
Online global soft: "requires group GroupB online global soft"
Online remote soft: "requires group GroupB online remote soft"
Offline local: "requires group GroupB offline local"

"Online remote" & "Offline local" are very similar, except "Offline
local" doesn't require the child to be online anywhere.

NOTE: * Parallel parent/parallel child not supported in online global
        or online remote.
      * Parallel parent/failover child not supported in online local.
      * Parallel child/failover parent is supported in online local, but
        the failover parent group's name must be lexically before the
        child group name. That's because VCS onlines service groups in
        alphabetical order. See Veritas Support TechNote 237239.


EXAMPLE:

      hagrp -link bchoitest apache1 offline local

      hagrp -dep
      #Parent      Child      Relationship
      bchoitest    apache1    offline local

      Inside the main.cf in the Parent Group:

      requires group apache1 offline local

------------------------------------------------------------

LOGS TAGS ERROR MESSAGES

VCS logs are stored in:

  /var/VRTSvcs/log

The logs show errors for the VCS engine and Resource Types.

EXAMPLE:

-rw-rw-rw-   1 root     other      22122 Aug 29 08:03 Application_A.log
-rw-rw-rw-   1 root     root        9559 Aug 15 13:02 DiskGroup_A.log
-rw-rw-rw-   1 root     other        296 Jul 17 17:55 DiskGroup_ipm_A.log
-rw-rw-rw-   1 root     root         746 Aug 17 16:27 FileOnOff_A.log
-rw-rw-rw-   1 root     root         609 Jun 19 18:55 IP_A.log
-rw-rw-rw-   1 root     root        1130 Jul 21 14:33 Mount_A.log
-rw-rw-rw-   1 root     other       5218 May 14 13:16 NFS_A.log
-rw-rw-rw-   1 root     root        7320 Aug 15 12:59 NIC_A.log
-rw-rw-rw-   1 root     other    1042266 Aug 23 10:46 Oracle_A.log
-rw-rw-rw-   1 root     root         149 Mar 20 13:10 Oracle_ipm_A.log
-rw-rw-rw-   1 root     other        238 Jun  1 13:07 Process_A.log
-rw-rw-rw-   1 root     other       2812 Mar 21 11:45 ServiceGroupHB_A.log
-rw-rw-rw-   1 root     root        6438 Jun 19 18:55 Sqlnet_A.log
-rw-rw-rw-   1 root     root         145 Mar 20 13:10 Sqlnet_ipm_A.log
-rw-r--r--   1 root     other    16362650 Aug 31 08:58 engine_A.log
-rw-r--r--   1 root     other        313 Mar 20 13:11 hacf-err_A.log
-rw-rw-rw-   1 root     root        1615 Jun 29 16:30 hashadow-err_A.log
-rw-r--r--   1 root     other    2743342 Aug  1 17:12 hashadow_A.log
drwxrwxr-x   2 root     sys         3072 Aug 27 12:41 tmp

These tags appear in the engine log.

  TAG_A: VCS internal message. Contact Customer Support.
  TAG_B: Messages indicating errors and exceptions.
  TAG_C: Messages indicating warnings.
  TAG_D: Messages indicating normal operations.
  TAG_E: Messages from agents indicating status, etc.

You can increase the log levels (get TAG F-Z messages) by changing
the LogLevel Resource Type attribute. Default is "error". You can
choose "none", "all", "debug", or "info".

  hatype -modify <Resource Type> LogLevel <option>

------------------------------------------------------------

VCS MISCELLANEOUS INFORMATION

Here is some miscellaneous information about VCS:

1. VCS does not officially support both PCI and SBus on the same
   shared device. This could be more of a Sun restriction.

2. VCS Web Edition comes with an Apache agent.

3. The VCS VIP can only be created on an interface that is already plumbed.

4. VCS 1.3 supports Solaris 8.

5. Resources (not Resource Types) cannot have the same name within a Cluster.
   So if you have 2 Service Groups in a Cluster and they both use similar
   resources, you must name them differently.

6. Only manually edit the main.cf file when the Cluster has stopped.
   
7. Use "hastop -all -force" in case you mess up the Cluster before writing
   to main.cf; this will keep the main.cf file unchanged. This will also
   give you a .stale file if your cluster was in read-write mode.

8. Until you tell VCS to write to main.cf, it will update a backup of
   the main.cf instead (as you make changes).

9. Sun Trunking 1.1.2 using IP Source/IP Destination policy should work
    with VCS.

10. VCS has a current limit of 32 nodes per cluster.

11. GAB in version 1.1.2 is incompatible with GAB in 1.3, so if you
    are upgrading VCS, upgrade all nodes.

12. Oracle DBA's need to edit the following Oracle files during VCS setup:

    $ORACLE_HOME/network/admin/tnsnames.ora
    $ORACLE_HOME/network/admin/listener.ora

    If there are multiple instances within the cluster, each instance needs
    its own listener name in listener.ora.

13. VCS 2.0 supports Solaris eri FastEthernet driver.

----------------------------------------------------------------

UPGRADE MAINTENANCE PROCEDURE

Here's a procedure to upgrade VCS or shutdown VCS during
hardware maintenance.

1. Open, freeze each Service Group, and close the VCS config.

   haconf -makerw
   hagrp -freeze <Service Group> -persistent
   haconf -dump -makero

2. Shutdown VCS but keep services up.

   hastop -all -force

3. Confirm VCS has shut down on each system.

   gabconfig -a

4. Confirm GAB is not running on any disks.

   gabdisk -l  (use this if upgrading from VCS 1.1.x)

   gabdiskhb -l
   gabdiskx -l

   If it is, remove it from the disks on each system.

   gabdisk -d  (use this if upgrading from VCS 1.1.x)

   gabdiskhb -d
   gabdiskx -d

5. Shutdown GAB and confirm it's down on each system.

   gabconfig -U
   gabconfig -a

6. Identify the GAB kernel module number and unload it
   from each system.

   modinfo | grep gab
   modunload -i <GAB module number>

7. Shutdown LLT. On each system, type:

   lltconfig -U

   Enter "y" if any questions are asked.

8. Identify the LLT kernel module number and unload it from
   each system.

   modinfo | grep llt
   modunload -i <LLT module number>

9. Rename VCS startup and stop scripts on each system.

   cd /etc/rc2.d
   mv S70llt s70llt
   mv S92gab s92gab
   cd /etc/rc3.d
   mv S99vcs s99vcs
   cd /etc/rc0.d
   mv K10vcs k10vcs

10. Make a backup copy of /etc/VRTSvcs/conf/config/main.cf.
    Make a backup copy of /etc/VRTSvcs/conf/config/types.cf.

    Starting with VCS 1.3.0, preonline and other trigger scripts must
    be in /opt/VRTSvcs/bin/triggers. Also, all preonline scripts in
    previous versions (such as VCS 1.1.2) must now be combined in one
    preonline script.

11. Remove old VCS packages.

    pkgrm VRTScsga VRTSvcs VRTSgab VRTSllt VRTSperl VRTSvcswz

    If you are upgrading from 1.0.1 or 1.0.2, you must also remove the package
    VRTSsnmp, and any packages containing a .2 extension, such as VRTScsga.2,
    VRTSvcs.2, etc.

    Also remove any agent packages such as VRTSvcsix (Informix),
    VRTSvcsnb (NetBackup), VRTSvcssor (Oracle), and VRTSvcssy (Sybase).

    Install new VCS packages.

    Restore your main.cf and types.cf files.

12. Start LLT, GAB and VCS.

    cd /etc/rc2.d
    mv s70llt S70llt
    mv s92gab S92gab
    cd /etc/rc3.d
    mv s99vcs S99vcs
    cd /etc/rc0.d
    mv k10vcs K10vcs

    /etc/rc2.d/S70llt start
    /etc/rc2.d/S92gab
    /etc/rc3.d/S99vcs start

13. Check on status of VCS.

    hastatus
    hastatus -sum

14. Unfreeze all Service Groups.

    haconf -makerw
    hagrp -unfreeze <Service Group> -persistent
    haconf -dump -makero
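
If the cluster has many Service Groups, a small loop can freeze (and later
unfreeze) them all in one pass. This is only a sketch; it assumes that
"hagrp -list" prints the group name in the first column (as it does in
VCS 1.3 and 2.0) and that it is run as root on one node.

    # Freeze every Service Group before maintenance
    haconf -makerw
    for grp in `hagrp -list | awk '{print $1}' | sort -u`
    do
        hagrp -freeze $grp -persistent
    done
    haconf -dump -makero

    # To unfreeze after maintenance, run the same loop with
    # "hagrp -unfreeze $grp -persistent".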

------------------------------------------------

USING VXEXPLORER SCRIPT

When you call Veritas Tech Support for a VCS problem,
they may have you download a script to run and send back
information on your cluster. Here's the procedure:

The URL is ftp://ftp.veritas.com/pub/support/vxexplore.tar.Z,
but you can also get it this way (a scripted, non-interactive download
is sketched at the end of this section).

1. ftp ftp.veritas.com
2. Login as anonymous.
3. Use your e-mail address as your password.
4. cd /pub/support
5. bin
6. get vxexplore.tar.Z
7. Once downloaded, copy the file to all nodes.
   On each node, uncompress and un-tar the file:
   zcat vxexplore.tar.Z | tar xvf -

8. cd VRTSexplorer
   Read the README file.

   Run the VRTSexplorer script on each node.
   ./VRTSexplorer

   Make sure the output filename has the CASE ID number.

   EXAMPLE:   VRTSexplorer_999999999.tar.Z

   Now upload the file to ftp.veritas.com.

9. ftp ftp.veritas.com
10. Login as anonymous.
11. Use your e-mail address as your password.
12. cd /incoming
13. bin
14. put <VRTSexplorer output filename>
    Upload the output files from your other nodes as well.
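
If you prefer a non-interactive transfer, the download can be scripted with
an ftp here-document. This is only a sketch; it assumes anonymous ftp is
allowed from your host, and the e-mail address is a placeholder.

# Scripted anonymous download of vxexplore.tar.Z
ftp -n ftp.veritas.com <<EOF
user anonymous your_email@yourdomain.com
binary
cd /pub/support
get vxexplore.tar.Z
bye
EOF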

--------------------------------------------------

BUNDLED AGENTS ATTRIBUTES

The following are Agents bundled with VCS 1.3.0
and the Resource attributes for each
Resource Type. These attributes are listed in types.cf.
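
To see the values these attributes hold on a running cluster, query the
Resource Type or an individual Resource (the resource name below is just
an example):

    hatype -display Mount
    hares -display export1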

Application
  User
  StartProgram
  StopProgram
  CleanProgram
  MonitorProgram
  PidFiles
  MonitorProcesses

Disk
  Partition (required)

DiskGroup
  DiskGroup (required)
  StartVolumes
  StopVolumes

DiskReservation
  Disks (required)
  FailFast
  ConfigPercentage
  ProbeInterval

ElifNone
  PathName (required)

FileNone
  PathName (required)

FileOnOff
  PathName (required) 

FileOnOnly
  PathName (required)

IP
  Address (required)
  Device (required)
  ArpDelay
  IfconfigTwice
  NetMask
  Options

IPMultiNIC
  Address (required)
  MultiNICResName (required)
  ArpDelay
  IfconfigTwice
  NetMask
  Options

Mount
  BlockDevice (required)
  MountPoint (required)
  FSType (required)
  FsckOpt
  MountOpt
  SnapUmount

MultiNICA
  Device (required)
  ArpDelay
  HandshakeInterval
  IfconfigTwice
  NetMask
  NetworkHosts
  Options
  PingOptimize
  RouteOptions

NFS
  Nservers

NIC
  Device (required)
  PingOptimize
  NetworkHosts
  NetworkType

Phantom

Process
  PathName (required)
  Arguments

Proxy
  TargetResName (required)
  TargetSysName

ServiceGroupHB
  Disks (required)
  AllOrNone

Share
  PathName (required)
  Options

Volume
  DiskGroup (required)
  Volume

--------------------------------------------

ENTERPRISE STORAGE AGENTS ATTRIBUTES

Here are Agents not included with VCS that
you have to purchase, and their resource
attributes. These attributes are also listed
in types.cf.

Apache
  ServerRoot
  PidFile
  IPAddr
  Port
  TestFile

Informix
  Server (required)
  Home (required)
  ConfigFile (required)
  Version (required)
  MonScript

NetApp (storage agent)

NetBackup

Oracle
  Oracle Agent
    Sid (required)
    Owner (required)
    Home (required)
    Pfile (required)
    User
    PWord
    Table
    MonScript
  Sqlnet Agent
    Owner (required)
    Home (required)
    TnsAdmin (required)
    Listener (required)
    MonScript

PCNetlink

SuiteSpot

Sun Internet Mail Server (SIMS)

Sybase
  SQL Server Agent
    Server (required)
    Owner (required)
    Home (required)
    Version (required)
    SA (required)
    SApswd (required)
    User
    UPword
    Db
    Table
    MonScript
  Backup Server Agent
    Server (required)
    Owner (required)
    Home (required)
    Version (required)
    Backupserver (required)
    SA (required)
    SApswd (required)

-----------------------------------------------------------------

VCS SYSTEM CLUSTER SERVICE GROUP RESOURCE TYPE SNMP ATTRIBUTES

Everything in VCS has attributes. Here is a list
of attributes from VCS 1.3.
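
Each of these can also be queried individually with the -value option;
the object names below are only examples.

  haclus -value ClusterName
  hasys -value sysA ShutdownTimeout
  hagrp -value groupx FailOverPolicy
  hares -value export1 Critical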

Cluster Attributes (haclus -display):

 ClusterName
 CompareRSM (for internal use only)
 CounterInterval
 DumpingMembership
 EngineClass
 EnginePriority
 Factor (for internal use only)
 GlobalCounter
 GroupLimit
 LinkMonitoring
 LoadSampling (for internal use only)
 LogSize
 MajorVersion
 MaxFactor (for internal use only)
 MinorVersion
 PrintMsg (for internal use only)
 ProcessClass
 ProcessPriority
 ReadOnly
 ResourceLimit
 SourceFile
 TypeLimit
 UserNames
 
Systems Attributes (hasys -display):

 AgentsStopped (for internal use only)
 ConfigBlockCount
 ConfigCheckSum
 ConfigDiskState
 ConfigFile
 ConfigInfoCnt (for internal use only)
 ConfigModDate
 DiskHbDown
 Frozen
 GUIIPAddr
 LinkHbDown
 LLTNodeId (for internal use only)
 Load
 LoadRaw
 MajorVersion
 MinorVersion
 NodeId
 OnGrpCnt
 ShutdownTimeout
 SourceFile
 SysInfo
 SysName
 SysState
 TFrozen
 TRSE (for internal use only)
 UpDownState
 UserInt
 UserStr
 
Service Groups Attributes (hagrp -display):

 ActiveCount
 AutoDisabled
 AutoFailOver
 AutoRestart
 AutoStart
 AutoStartList
 CurrentCount
 Enabled
 Evacuating (for internal use only)
 ExtMonApp
 ExtMonArgs
 Failover (for internal use only)
 FailOverPolicy
 FromQ (for internal use only)
 Frozen
 IntentOnline
 LastSuccess (for internal use only)
 ManualOps
 MigrateQ (for internal use only)
 NumRetries (for internal use only)
 OnlineRetryInterval
 OnlineRetryLimit
 Parallel
 PathCount
 PreOffline (for internal use only)
 PreOnline
 PreOfflining (for internal use only)
 PreOnlining (for internal use only)
 Priority
 PrintTree
 ProbesPending
 Responding (for internal use only)
 Restart (for internal use only)
 SourceFile
 State
 SystemList
 SystemZones
 TargetCount (for internal use only)
 TFrozen
 ToQ (for internal use only)
 TriggerEvent (for internal use only)
 TypeDependencies
 UserIntGlobal
 UserStrGlobal
 UserIntLocal
 UserStrLocal
 
Resource Types Attributes (hatype -display):

 These are common to all Resource Types.
 You can change values for each Resource Type.

 AgentClass
 AgentFailedOn
 AgentPriority
 AgentReplyTimeout
 AgentStartTimeout
 ArgList
 AttrChangedTimeout
 CleanTimeout
 CloseTimeout
 ConfInterval
 FaultOnMonitorTimeouts
 LogLevel
 MonitorInterval
 MonitorTimeout
 NameRule
 NumThreads
 OfflineMonitorInterval
 OfflineTimeout
 OnlineRetryLimit
 OnlineTimeout
 OnlineWaitLimit
 OpenTimeout
 Operations
 RestartLimit
 ScriptClass
 ScriptPriority
 SourceFile
 ToleranceLimit
 
Resources Attributes (hares -display):
 
 NOTE: These are only some of the ones common
       to many kinds of resources. See the types.cf
       file for type-specific attributes for resources.

 ArgListValues
 AutoStart
 ConfidenceLevel
 Critical
 Enabled
 Flags
 Group
 IState
 LastOnline
 MonitorOnly
 Name (for internal use only)
 Path (for internal use only)
 ResourceOwner
 Signaled (for internal use only)
 Start (for internal use only)
 State
 TriggerEvent (for internal use only)
 Type

SNMP (predefined):

 Enabled
 IPAddr
 Port
 SnmpName
 SourceFile
 TrapList

-----------------------------------------------

VCS NODENAME

If you change your server's hostname often, it might be a good idea to
let VCS use its own names for your nodes.

To use VCS's own nodenames instead of hostnames for the nodes in a Cluster,
create an /etc/VRTSvcs/conf/sysname file on each node and reference it in
/etc/llttab. Use these VCS nodenames in main.cf.

EXAMPLE:

/etc/llthosts:
0  sysA
1  sysB

/etc/VRTSvcs/conf/sysname on the node that will be sysA:
sysA

/etc/VRTSvcs/conf/sysname on the node that will be sysB:
sysB

/etc/llttab:
set-node  /etc/VRTSvcs/conf/sysname
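
A quick way to create the per-node sysname files is with echo; run each
command on the matching node.

  # On the node that will be sysA:
  echo sysA > /etc/VRTSvcs/conf/sysname

  # On the node that will be sysB:
  echo sysB > /etc/VRTSvcs/conf/sysname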

-------------------------------------------------

RESOURCE ATTRIBUTE DATA TYPES DIMENSIONS

Resource attributes are either type-independent or type-specific.
Type-specific attributes appear in types.cf.

Resources have 3 system created attributes that determine failover
behavior. They are user modifiable.

  Critical = 1    If resource or its children fault, service group
                  faults and fails over.

  AutoStart = 1   Command to bring service group online will also
                  bring resource online.

  Enabled   = 0   If 0, the resource is not monitored by its agent. This
                  is the default value if the resource is added via the
                  CLI. Enabled = 1 if the resource is defined in main.cf.

Static attributes are same for all resources within a resource type.

All attributes have a "definition" (Data Type) and a "value" (Dimension).
A global attribute value is the same throughout the cluster.
A local attribute value applies only to a specific node.

Attribute data types are:

1. str (string)   DEFAULT Data Type. 

                  Strings may be enclosed in double quotes; a literal
                  backslash is written \\ and a literal double quote is \".
                  No quotes are needed if the string begins with a letter
                  and contains only letters, numbers, dashes and underscores.

   EXAMPLE:       Adding a string-scalar value to BlockDevice attribute
                  for a Mount resource called "export1".

                  hares -modify export1 BlockDevice "/dev/sharedg/voloracle"

2. int (integer)  Base 10. 0-9.

3. boolean        0 (false) or 1 (true).

   EXAMPLE:       hagrp -modify Group1 Parallel 1

                  This is a "boolean integer".


Attribute Dimensions are:

1. scalar         DEFAULT dimension.

                  Only 1 value.


2. vector         Ordered list of values, denoted by [] in types.cf file.
                  Values are indexed by positive integers, starting at 0.

   EXAMPLES of string-vectors:

            Dependencies[] = { Mount, Disk, DiskGroup }
              Dependencies is the attribute.

            NetworkHosts = { "166.93.2.1", "166.99.1.2" }

            NetworkHosts @cp01 = { "151.144.128.1", "151.144.128.102",
                                   "151.144.128.104" }
            NetworkHosts @cp02 = { "151.144.128.1", "151.144.128.101",
                                   "151.144.128.104" }
   The equivalent command lines would be:
   hares -local mul_nic NetworkHosts
   hares -modify mul_nic NetworkHosts 151.144.128.1 151.144.128.102 \
         151.144.128.104 -sys cp01
   hares -modify mul_nic NetworkHosts 151.144.128.1 151.144.128.101 \
         151.144.128.104 -sys cp02


3. keylist        Unordered list of unique strings.

   EXAMPLE: AutoStartList = { sysa, sysb, sysc }
              AutoStartList is the attribute.  This is a "string keylist".

   hagrp -modify Group1 AutoStartList Server1 Server2

4. association    Unordered list of name-value pairs, denoted by {} in
                  types.cf file.
              
   EXAMPLE: SystemList{} = { sysa=1, sysb=2, sysc=3 }
              SystemList is the attribute.
              This is a "string association".

   hagrp -modify <Object> <Attribute> -add <Key> <Value> <Key> <Value> ...
   hagrp -modify Group1 SystemList -add Server1 0 Server2 1

   To update the SystemList, do this:

   hagrp -modify Group1 SystemList -update Server2 0 Server1 1

EXAMPLE: Adding MultiNICA and IPMultiNIC resources.

Suppose we wanted these IPMultiNIC and MultiNICA resources in main.cf:

IPMultiNIC ip_mul_nic (
      Address = "10.10.10.4"
      NetMask = "255.255.255.192"
      MultiNICResName = mul_nic
      )

NOTES:  Address is a string-scalar.
        NetMask is a string-scalar.
        MultiNICResName is a string-scalar.

MultiNICA  mul_nic (
      Device @sysA = { hme0 = "10.10.10.1", qfe0 = "10.10.10.1" }
      Device @sysB = { hme0 = "10.10.10.2", qfe0 = "10.10.10.2" }
      NetMask = "255.255.255.192"
      ArpDelay = 5
      Options = trailers
      RouteOptions = "default 10.10.10.5 0"
      IfconfigTwice = 1
      )

NOTES:  Device is a string-association.
        NetMask is a string-scalar.
        ArpDelay is an integer-scalar.
        Options is a string-scalar.
        RouteOptions is a string-scalar.
        IfconfigTwice is an integer-scalar.

1. Open the Cluster.

     haconf -makerw

2. Add the MultiNICA resource "mul_nic".

     hares -add mul_nic MultiNICA groupx

3. Make the mul_nic resource local to the systems.

     hares -local mul_nic Device

4. Add the attribute and values and enable resource.

     hares -modify mul_nic Device -add hme0 10.10.10.1  -sys sysA
     hares -modify mul_nic Device -add qfe0 10.10.10.1 -sys sysA
     hares -modify mul_nic Device -add hme0 10.10.10.2 -sys sysB
     hares -modify mul_nic Device -add qfe0 10.10.10.2 -sys sysB
     hares -modify mul_nic NetMask 255.255.255.192
     hares -modify mul_nic ArpDelay 5
     hares -modify mul_nic Options trailers
     hares -modify mul_nic RouteOptions "default 10.10.10.5 0"
     hares -modify mul_nic IfconfigTwice 1
     hares -modify mul_nic Enabled 1  

5. Add the IPMultiNIC resource "ip_mul_nic", add attributes and values,
   and enable it.

     hares -add ip_mul_nic IPMultiNIC groupx
     hares -modify ip_mul_nic Address 10.10.10.4
     hares -modify ip_mul_nic NetMask 255.255.255.192
     hares -modify ip_mul_nic MultiNICResName mul_nic
     hares -modify ip_mul_nic Enabled 1

6. Make the IPMultiNIC resource dependent on the MultiNICA resource.

     hares -link ip_mul_nic mul_nic

7. Close the Cluster.

     haconf -dump -makero

-----------------------------------------------------------------

RESOURCE MONITORING SERVICE GROUP FAILOVER TIMES
                                     
Here are important attributes to be aware of in regards to
resource monitoring and failovers. Default values are shown.


*** Resource Type Attributes ***

"Agent level" timimg:

AgentReplyTimeout = 130  Time the engine waits to receive a heartbeat
                         from an Agent (Resource Type) before restarting
                         the Agent.

AgentStartTimeout = 60   Time engine waits after starting an Agent, for
                         an initial Agent handshake before restarting an
                         Agent.

AttrChangedTimeout = 60  Maximum time within which the "attr_changed" entry
                         point must complete before it is terminated.

CloseTimeout = 60    Maximum time within which the Close entry point
                     must complete before it is terminated.

CleanTimeout = 60    Maximum time within which the Clean entry point
                     must complete before it is terminated.

ConfInterval = 600   The amount of time a resource must stay online
                     before previous faults and restart attempts are
                     ignored by the Agent.

                     Basically, how long a resource must stay up before
                     VCS will "forget" that it had faulted recently; used
                     by RestartLimit. If the resource stays up past
                     ConfInterval, VCS "resets" the count kept by
                     RestartLimit. If the resource faults within
                     ConfInterval, VCS looks at RestartLimit.


FaultOnMonitorTimeouts = 4   Number of consecutive times a monitor must time
                             out before the Resource is faulted.
                             Set this to 0 if you don't want monitor
                             failures to indicate a resource fault.


MonitorInterval = 60  Time between monitoring attempts for an online or
                      transitioning resource.

                      This is the answer to the common question "how often
                      does a resource get monitored?"


MonitorTimeout = 60   Time the Agent allows a monitor entry point to run
                      before timing it out if it hangs. This is how long
                      the Agent allows for monitoring a resource.


OfflineMonitorInterval = 300   This is how often an offline resource gets
                               monitored. Same purpose as MonitorInterval.

OfflineTimeout = 300   Maximum time an Agent allows for offlining.

OnlineTimeout = 300   Maximum time an Agent allows for onlining.

    EXAMPLE:  If fsck's take too long, make this bigger
              for the Mount Resource Type, otherwise VCS will have
              trouble mounting filesystems after a crash.


OnlineRetryLimit = 0   Number of times to retry an online if an attempt
 (for Resource Types)  to online has failed. Used only when a clean
                       entry point exists.

OnlineWaitLimit = 2    Number of monitor intervals after an online procedure
                       has been completed (online script/entry point exits)
                       before monitor sees a failure and reports it as a
                       failure.

                       Basically, how much time VCS will give the service to
                       come completely online (even after online exits) well
                       enough for it to be monitored.

                       Increase this if an application takes a long time to be
                       completely ready for monitoring even after the online
                       script/entry point completes.


OpenTimeout = 60       Maximum time an Agent allows for opening.

RestartLimit = 0     Number of times the Agent tries to restart a
                     resource that has faulted or is having problems
                     coming online.

                     Basically number of times VCS will try to
                     restart a resource without faulting the
                     Service Group.


ToleranceLimit = 0   Number of offlines that an Agent monitor must
                     declare before a resource is faulted. By default,
                     a monitor declaring offline will also declare the
                     resource to be faulted.
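
Any of these Resource Type attributes can be changed with "hatype -modify".
For example, to raise OnlineTimeout for the Mount Resource Type (the value
600 is only an illustration):

    haconf -makerw
    hatype -modify Mount OnlineTimeout 600
    haconf -dump -makero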


*** Resource Attributes ***

MonitorOnly   If 1, the resource can't be brought online or offline.
              VCS sets this to 1 if the Service Group is frozen. There
              is no way to set this directly.

Critical      If 0, don't fail the Service Group if this resource fails.
              Default is 1.


*** Service Group Attributes ***

OnlineRetryInterval = 0  Time window in seconds: if a Service Group that
                         has already faulted and restarted on the same
                         system faults again within this interval, it is
                         failed over to another system instead.

                         This prevents the Service Group from continuously
                         faulting and restarting on the same system; used
                         with OnlineRetryLimit for service groups.

OnlineRetryLimit = 0  Number of times VCS will try to restart a faulted
 (for Service Groups) Service Group on the same system before failing it over
                      to another system.
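
Both Service Group values above are set with "hagrp -modify"; the group
name and numbers below are only examples.

    haconf -makerw
    hagrp -modify groupx OnlineRetryLimit 3
    hagrp -modify groupx OnlineRetryInterval 180
    haconf -dump -makero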


------------------------------------------------------------

HAUSER VCS 2.0

In VCS 2.0, you can add users, but they must be able
to write a temp file to /var/VRTSvcs/lock.
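
For example, to add a user from the command line (the username is just an
example; hauser prompts for the password):

  haconf -makerw
  hauser -add admin2
  haconf -dump -makero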

Also, if you want to disable password authentication by VCS, just
do:

  haclus -modify AllowNativeCliUsers 1

-------------------------------------------------------------

HAD VCS VERSION

To find the version of had you are running,

  had -version

-------------------------------------------------------------

VCS 2.0 NEW FEATURES

VCS 2.0 introduces some new features:

1. Web console
2. User privileges
3. SMTP and SNMP notification
4. Workload balancing at the system level.
5. New Java console
6. ClusterService service group
7. Single-node support
8. New licensing scheme
9. Internationalization
10. Preonline IP check
11. hamsg command to display logs
12. LinkHbStatus and DiskHbStatus system attributes; LinkMonitoring
    cluster attribute is no longer used.
13. Password encryption
14. SRM can be the scheduling class for VCS processes.
15. VCS 1.3 enterprise agents can run on VCS 2.0.
16. New resstatechange event trigger.
17. Service group can come up with a disabled resource.
18. Sun eri FastEthernet driver supported.
19. No restriction on Sybase resources per service group.
20. New NotifierMngr bundled agent.
21. New attributes:

Cluster Attributes:

  Administrators
  AllowNativeCliUsers
  ClusterLocation
  ClusterOwner
  ClusterType
  HacliUserLevel
  LockMemory
  Notifier
  Operators
  VCSMode

Service Group Attributes:

  Administrators
  AutoStartIfPartial
  AutoStartPolicy
  Evacuate
  Evacuating
  GroupOwner
  Operators
  Prerequisites
  TriggerResStateChange

System Attributes:

  AvailableCapacity
  Capacity
  DiskHbStatus
  LinkHbStatus
  Limits
  Location
  LoadTimeCounter
  LoadTimeThreshold
  LoadWarningLevel
  SystemOwner

Resource Type Attributes:

  LogFileSize
  LogTags

--------------------------------------------

WEB GUI URLS

For VCS 2.0 and VCS 2.0 QuickStart, the URLs for
the web GUI interfaces are:

VCS 2.0     http://<cluster IP>:8181/vcs

VCSQS 2.0   http://<cluster IP>:8181/vcsqs

---------------------------------------------

VCS QUICKSTART COMMAND LINE

The VCS QuickStart has a limited command line set.

These are all the commands:

        vcsqs -start
        vcsqs -stop [-shutdown] [-all] [-evacuate]
        vcsqs -grp [<group>]
        vcsqs -res [<resource>]
        vcsqs -config [<resource>]
        vcsqs -sys
        vcsqs -online <group> -sys <system>
        vcsqs -offline <group> [-sys <system>]
        vcsqs -switch <group> [-sys <system>]
        vcsqs -freeze <group>
        vcsqs -unfreeze <group>
        vcsqs -clear <group> [-sys <system>]
        vcsqs -flush <group> [-sys <system>]
        vcsqs -autoenable <group> -sys <system>
        vcsqs -users
        vcsqs -addadmin <username>
        vcsqs -addguest <username>
        vcsqs -deleteuser <username>
        vcsqs -updateuser <username>
        vcsqs -intervals [<type>]
        vcsqs -modifyinterval <type> <value>
        vcsqs -version
        vcsqs -help

-----------------------------------------------------

VCS 2.0 WORKLOAD MANAGEMENT

VCS 2.0 has new "service group workload management" features.

First, we must distinguish between "AutoStartPolicy" and "FailOverPolicy"
for Service Groups coming online or failing over.

The new AutoStartPolicy is either Order (default),
Priority or Load. Order is order in the AutoStartList, Priority is
priority in the SystemList.

The FailOverPolicy is either Priority (default), RoundRobin or
Load. Priority is priority in the SystemList. RoundRobin means
VCS picks the system in the SystemList with the fewest
"online" Service Groups.

Before a Service Group comes online it must consider the systems
in

  (a) the AutoStartList if it is autostarting
  (b) the SystemList if this is a failover

and then

  (1) System Zones (soft restriction)
  (2) System Limits and Service Group Prerequisites (hard restriction)
  (3) System Capacity and Service Group Load (soft restriction)

Steps 1-3 are done serially during VCS startup, with the System
being chosen in lexical or canonical order. After this sequence,
the actual onlining process begins (VCS will then check on Service
Group Dependencies). Onlining of service groups is done in parallel.

1. System Zones (SystemZones), a Service Group attribute.
   The zones are numbers created by the user.

EXAMPLE (2 zones called 0 and 1):

   SystemZones = { LgSvr1=0, LgSvr2=0, MedSvr1=1, MedSvr2=1 }

   VCS chooses the zone based on the servers currently available in
   the AutoStartList or SystemList, or on which zone the Service Group
   is currently running in.
   If there are no more machines available in that System Zone, VCS
   will look at the System Limits and Service Group Prerequisites
   of machines in other System Zones. On subsequent failures, the
   Service Group will then stay in the new System Zone.

2. System Limits (Limits), a System attribute,
   and Service Group Prerequisites (Prerequisites), a Service Group
   attribute, for the Service Group.
   These names and values are created by the user.

EXAMPLE:

   System attribute:

   Limits = { ShrMemSeg=20, Semaphores=10, Processors=12, GroupWeight=1 }

   Service Group attribute:

   Prerequisites = { ShrMemSeg=10, Semaphores=5, Processors=6, GroupWeight=1 }

   Basically, VCS rations the Limits among the Service Groups already
   online, so there must be enough of these "Limits" left for another
   Service Group trying to come online. If there are not enough "Limits"
   left on the system, the Service Group can't come online on it. Limits
   and Prerequisites are hard restrictions.

   If VCS can't decide based on Limits and Prerequisites, it will go on
   and look at Load and Capacity, provided FailOverPolicy or AutoStartPolicy
   is Load.
   
3. Available Capacity (AvailableCapacity), total System Capacity
   (Capacity), both System attributes, and Service Group Load
   (Load), a Service Group attribute.

When either AutoStartPolicy or FailOverPolicy is "Load", we
must talk about Available Capacity, Capacity and current load.

Before VCS 2.0, "Load" was a System attribute set by
something outside of VCS via "hasys -load <value>". Now, Load
is a Service Group attribute. A "system load" can still be set by
"hasys -load" in VCS 2.0, but the attribute it sets is now
"DynamicLoad" (Dynamic System Load).

In any case, when VCS must look at capacity and load to bring up
a service group, it does this:

AvailableCapacity = Capacity - Current System Load

In other words,

AvailableCapacity = Capacity - (sum of all Load values of all online
                                Service Groups)

or

AvailableCapacity = Capacity - DynamicLoad

"Capacity" and "Load" are values defined by the user.
"AvailableCapacity" is calculated by VCS.

Here are 2 System attributes that are used by the new "loadwarning"
trigger (defaults are shown):

LoadWarningLevel = 80      Percentage of Capacity that VCS considers
                           critical.

LoadTimeThreshold = 600    Time in seconds system must remain at or
                           above LoadWarningLevel before loadwarning
                           trigger is fired.
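
Here is a sketch of setting these values from the command line. The system
and group names and the numbers are only examples; Limits and Prerequisites
are associations, so they use the -add syntax.

  haconf -makerw
  hasys -modify LgSvr1 Capacity 200
  hasys -modify LgSvr1 Limits -add ShrMemSeg 20
  hagrp -modify groupx Load 100
  hagrp -modify groupx Prerequisites -add ShrMemSeg 10
  haconf -dump -makero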

--------------------------------------------------------

MULTINICA MONITOR VCS 1.3 2.0

Here is what the MultiNICA monitor script does in VCS 1.3 (and the changes
in VCS 2.0).

NOTE: $NumNICs = 2 x (largest number of NIC's for MultiNIC on any node)
      Therefore, if one of the nodes has 2 NIC's for MultiNIC, $NumNICs = 4.
      $NumNICs = 4 in the vast majority of cases since most nodes will dedicate 
      2 NIC's for MultiNIC.

$NumNICs is used to make an ordered list of NIC devices to test.

In the following example, hme0 and qfe0 are the MultiNIC devices, the
NetworkHost is 192.168.1.1 and the Broadcast IP is 192.168.1.255.


1. The script finds the values to all the attributes from arguments passed to
   it ( see hares -display <MultiNICA resource> to see the ArgListValues ).

2. The script finds the active device NIC using GetActiveDevice, which uses
   "ifconfig -a" and the Base IP.


3. The script checks if the NIC is really active.

(1) If NetworkHosts is not used, run PacketCountTest.

(a) If PingOptimize=1, run netstat -in using GetCurrentStats to compare
    current network traffic with previous network traffic.

(b) If PingOptimize=0, run netstat -in to get current traffic, then ping the
    Broadcast IP 5 times, then run netstat -in again, then compare traffic.


(2) If NetworkHosts is used, ping a NetworkHost. Return success if any of
    the hosts is pingable.

    NOTE: In VCS 2.0, if ping fails in this part, set $PingOptimize = 0.

************************************************

4. If failure is detected, try the following test up to 5 times (RetryLimit=5).
   Do the following up to 5 times:

(1) If NetworkHosts is not used, ping the Broadcast IP 5 times, then run
    PacketCountTest.

    NOTE: In VCS 2.0, NetworkHosts is NOT tested in this part, however,
          the "ping Broadcast IP 5 times" still applies, and the rest of
          this section (PacketCountTest) still applies.

(a) If PingOptimize=1, run netstat -in using GetCurrentStats to compare
    current network traffic with previous network traffic.

(b) If PingOptimize=0, run netstat -in to get current traffic, then ping
    the Broadcast IP 5 times, then run netstat -in again, then compare
    traffic.


(2) If NetworkHosts is used, ping a NetworkHost. Return success if any of
    the hosts is pingable.

    NOTE: In VCS 2.0, this part is omitted.

    See TechNote 243100 for more information on changes in VCS 2.0 for
    this section.


************************************************

5. If last test in Step 4 was successful, then exit 110 (resource is online).
   If failure is still detected, and MonitorOnly=1, then exit 100
   (resource is offline), but if MonitorOnly=0, then begin failover and do
   the following:

(1) Echo "Device hme0 FAILED" and "Acquired a WRITE Lock".

    Find the Logical IPMultiNIC IP addresses and store them in a
    table (StoreIPAddresses).

(2) Echo "Bringing down IP addresses".

    Run BringDownIPAddresses. This brings down the Logical IP's and if this
    is Solaris 8, unplumbs the Logical NIC's. The Base IP is then brought
    down and the NIC unplumbed.

(3) Plumb the next NIC. Bring up the new NIC with the Base IP.

    Echo "Trying to online Device qfe0".

(a) If IfconfigTwice=1, bring down the NIC and bring it up with the
    Base IP again.

(b) If $ArpDelay > 0, SLEEP for $ArpDelay seconds, then run ifconfig on the
    NIC to bring up the Broadcast IP.  This helps update the ARP tables.

(c) Add routes from RouteOptions if any.

    NOTE: In VCS 2.0, set $use_broadcast_ping = 0 before proceeding to
          Step "5. (4)".


(4) Do the following test loop *up to* $HandshakeInterval/10 times, but
    exit the test once there is a success:

(a) Echo "Sleeping 5 seconds". SLEEP 5 seconds.

(b) Update the ARP tables on the subnet by doing ifconfig on the NIC to
    bring up the Broadcast IP.

(c) If NetworkHosts is not used, echo "Pinging Broadcast address
    192.168.1.255 on Device qfe0, iteration XX".

    Run PacketCountTest.

    NOTE: In VCS 2.0, the test is "If NetworkHosts is not used *or*
          $use_broadcast_ping = 1, run PacketCountTest."

(i) If PingOptimize=1, run netstat -in using GetCurrentStats to compare
    current network traffic with previous network traffic.

(ii) If PingOptimize=0, run netstat -in to get current traffic, then ping
     the Broadcast IP 5 times, then run netstat -in again, then compare
     traffic.

(d) If NetworkHosts is used, echo "Pinging 192.168.1.1 with Device qfe0
    configured: iteration 1", and ping a NetworkHost. Return success if
    any of the hosts is pingable.

(e) If ping tests fail, continue with the Step "5. (4)" test loop.

    NOTE: In VCS 2.0, set $use_broadcast_ping = 1.


(5) If there is no success after all the tests, echo "Tried the PingTest
    10 times" and "The network did not respond", and set PingReallyFailed=1.
    Echo "Giving up on NIC qfe0".

    Bring down and unplumb the NIC. If there is a 3rd NIC, repeat above
    steps from "5. (3)", otherwise echo "No more Devices configured. All
    devices are down.  Returning offline" and exit 100 (resource offline).

    NOTE: In VCS 2.0, there is an added check to see if the next NIC in the
          ordered list of NIC's is actually the current active NIC that has
          failed. In VCS 1.3, the monitor script would return to the failover
          loop in STEP 5.(3) but immediately exit with a false ONLINE. See
          Incident 68872.

**********************************************

6. If failover is successful, migrate the Logical IP's by running
   MigrateIPAddresses and checking the Logical IP addresses table.
   Do the following for each Logical IP:

(a) Plumb logical interface (if Solaris 8) and bring up the Logical IP.

(b) If IfconfigTwice=1, bring down the Logical IP and bring it up again.

(c) If $ArpDelay > 0, SLEEP for $ArpDelay seconds, then run ifconfig on
    the NIC to bring up the Broadcast IP. This updates the ARP tables.
(d) Echo "Migrated to Device qfe0" and "Releasing Lock".

---------------------------------------------------------------------

MULTINICA PINGOPTIMIZE NETWORKHOSTS

Basically, MultiNICA tests for NIC failure (or network failure)
in one of three ways, depending on how it is configured.

1. NetworkHosts NOT being used, PingOptimize = 1. These are defaults.

In this case, MultiNICA simply depends on incoming packets to tell whether
a NIC is okay or not. If the packet count increases, it considers the NIC
up; if the packet count does not increase, it considers the NIC down. It
will check a few times before declaring the NIC dead and beginning failover
to the next NIC.

This is the simplest MultiNICA monitoring. On very quiet networks, it might
generate a false reading. It might do a NIC to NIC failover if it sees no
packets coming in after several retries.


2. NetworkHosts is being used.

In this case, the MultiNICA monitor script ignores what you have for
PingOptimize. Whether you have PingOptimize = 0 or 1 will have no effect on
the monitoring. MultiNICA will ping the IP address(es) listed in NetworkHosts
attribute to determine if a NIC is up or down. It doesn't depend on knowing
the Broadcast address to test a failed NIC (plumb up and bring up the Base
IP). This is a pretty popular option as it allows you to test connectivity
to specific hosts (like a router).


3. NetworkHosts NOT being used, PingOptimize = 0.

In this case, the MultiNICA will try to find out the broadcast address of
the NIC and ping it to generate network traffic. It then runs a packet count
test like in #1 above. Because network traffic is generated by the monitor
script, this test is sometimes more reliable. This option is not too popular
in places where the network admins don't want a lot of broadcast traffic
being generated.

In the VCS 2.0 and 1.3 monitor scripts, if #3 is what you have, the MultiNICA
will not plumb up and online your Base IP if the interface is already
unplumbed (e.g., by VCS).

---------------------------------------------------------------------

DISKGROUP NOAUTOIMPORT IMPORT DEPORT

The DiskGroup Agent online script imports Disk Groups with a -t option.
This sets the "noautoimport" flag to "on", so that on reboots, the
Disk Group will not be automatically imported by VxVM outside of VCS.

Because of this, the Agent will first detect if a Disk Group it
needs to import has already been imported, and if it was imported
without the -t option. If the Disk Group was imported without the
-t option, the online will intentionally fail. This is a safety feature
to prevent split brains.

If a Disk Group was imported without the -t option, the "noautoimport"
flag will be set to "off". The Disk Group should be deported before
allowing VCS to import it.
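
For example, if a Disk Group was imported by hand without the -t option,
you could hand it back to VCS like this (the Disk Group, resource and
system names are only examples):

  # Deport the manually imported Disk Group
  vxdg deport oradg

  # Let VCS bring the DiskGroup resource online; its online script runs
  # "vxdg -t import" for you
  hares -online oradg_dg -sys sysA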

To see the flag, do this:

  vxprint -at <Disk Group>

EXAMPLE:

  #vxprint -at oradg
  Disk group: oradg

  dg   oradg tutil0=" tutil1=" tutil2=" import_id=0.2635 real_name=oradg
  comment="putil0=" putil1=" putil2=" dgid=965068944.1537.anandraj
  rid=0.1025 update_tid=0.1030 disabled=off noautoimport=on nconfig=default
  nlog=default base_minor=70000 version=60 activation=read-write
  diskdetpolicy=invalid movedgname= movedgid= move_tid=0.0

  The "noautoimport" flag is "on", and that is what should be seen
  in Disk Groups controlled by VCS.

------------------------------------------------------------------

GAB LLT PORTS

a       GAB internal use
b       I/O Fencing
d       ODM (Oracle Disk Manager)
f       CFS (VxFS cluster feature)
h       VCS engine (HAD)
j       vxclk monitor port
k       vxclk synch port
l       vxtd (SCRP) port
m       vxtd replication port
o       vcsmm (Oracle RAC/OPS membership module)
q       qlog (VxFS QuickLog)
s       Storage Appliance
t       Storage Appliance
u       CVM (Volume Manager cluster feature)
v       CVM
w       CVM
x       GAB test user client
z       GAB test kernel client
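
To see which of these ports currently have membership on a node, run:

  gabconfig -a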

--------------------------------------------------

SCRIPTPRIORITY NICE VALUES PROCESS

Here is a table showing equivalent nice values between 
ScriptPriority, ps and top.

ScriptPriority   "ps -elf NIce value"    "top nice value"

60                0                      -20
20                14                     -6
0                 20                      0
-20               26                      6
-60               39                      20
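
To check the nice value a running VCS process actually has, look at the NI
column of ps (the grep pattern is just an example):

  ps -elf | grep had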

--------------------------------------------------

VERITAS TCP PORTS

Here are some TCP ports used by VCS and related products:

8181  VCS and GCM web server
14141 VCS engine
14142 VCS engine test
14143 gabsim
14144 notifier 
14145 GCM port 
14147 GCM slave port 
14149 tdd Traffic Director port 
14150 cmd server 
14151 GCM DNS 
14152 GCM messenger 
14153 VCS Simulator 
15151 VCS GAB TCP port

---------------------------------------------------

SNMP MONITORING NOTIFIERMNGR AGENT


VCS 2.0 introduced a new Agent called NotifierMngr. This agent
can send SNMP traps to your site's SNMP console.

Here is an example of a main.cf on ClusterA sending SNMP traps
to server "draconis".

       NotifierMngr ntfr (
                PathName = "/opt/VRTSvcs/bin/notifier"
                SnmpConsoles = { draconis = Information }
                SmtpServer = localhost
                SmtpRecipients = { "root@localhost" = SevereError }
                )
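
Here is a sketch of the equivalent commands, assuming the resource is added
to the ClusterService group (SnmpConsoles and SmtpRecipients are
associations, so they use the -add syntax):

  haconf -makerw
  hares -add ntfr NotifierMngr ClusterService
  hares -modify ntfr PathName "/opt/VRTSvcs/bin/notifier"
  hares -modify ntfr SnmpConsoles -add draconis Information
  hares -modify ntfr SmtpServer localhost
  hares -modify ntfr SmtpRecipients -add root@localhost SevereError
  hares -modify ntfr Enabled 1
  haconf -dump -makero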

On ClusterA, I can run commands like:

  hagrp -online testGroup -sys tonic
  hagrp -switch testGroup -to gin

On draconis, I can run the Solaris snoop command and see the SNMP
traffic coming from ClusterA:

draconis # snoop -P -d hme0 port 162
Using device /dev/hme (non promiscuous)
         gin -> draconis     UDP D=162 S=39012 LEN=408
         gin -> draconis     UDP D=162 S=39012 LEN=439
         gin -> draconis     UDP D=162 S=39012 LEN=409
         gin -> draconis     UDP D=162 S=39012 LEN=406


You can also download SNMP software from the internet like UCD-snmp
for Solaris 8, and watch the trap information:

draconis # ./snmptrapd -P -e -d -n
2002-08-27 10:07:59 UCD-snmp version 4.2.3 Started.

Received 431 bytes from 10.140.16.16:39012
0000: 30 82 01 AB  02 01 01 04  06 70 75 62  6C 69 63 A7    0........public.
0016: 82 01 9C 02  01 01 02 01  00 02 01 00  30 82 01 8F    ............0...
0032: 30 82 00 0F  06 08 2B 06  01 02 01 01  03 00 43 03    0.....+.......C.
0048: 1F 47 D0 30  82 00 1B 06  0A 2B 06 01  06 03 01 01    .G.0.....+......
0064: 04 01 00 06  0D 2B 06 01  04 01 8A 16  03 08 0A 02    .....+..........
0080: 02 07 30 82  00 11 06 0C  2B 06 01 04  01 8A 16 03    ..0.....+.......

You can download such a tool from places like here:

http://sourceforge.net/project/showfiles.php?group_id=12694

If your SNMP console is *inside* the cluster, the SNMP traps may be sent
through the loopback 127.0.0.1 device. Solaris snoop cannot listen on that
device, so you won't see any SNMP traffic using snoop.


----------------------------------------------------------

LLT PACKETS SNOOP NETWORK

You can use a packet sniffer, like Solaris's snoop, to see if VCS
is flooding your network with LLT packets.

For example, you can do this on a machine outside of the cluster:

# snoop -Pr -t d -d eri0 -x 0,36 ethertype 0xcafe
Using device /dev/eri (non promiscuous)
  0.00000            ? -> (broadcast)  ETHER Type=CAFE (Unknown), size = 70 bytes

           0: ffff ffff ffff 0800 20cf 7e73 cafe 0101    ........ .~s.þ..
          16: f40a 0000 0001 ffff 8c00 0000 0038 0000    .............8..
          32: 0000 8000                                  ....

  0.01630            ? -> (broadcast)  ETHER Type=CAFE (Unknown), size = 70 bytes

           0: ffff ffff ffff 0800 20ab 4f1d cafe 0101    ........ .O..þ..
          16: 0b0a 0000 0001 ffff 8c00 0000 0038 0000    .............8..
          32: 0000 8000                                  ....

The 12 characters to the left of the string "cafe" is the source MAC address
for that packet. So in the example above, you may want to look for
machines with interfaces with MAC addresses of 08:00:20:cf:7e:73 and
08:00:20:ab:4f:1d.

You can find other pieces of information in the line at offset 16 of the
snoop output. For example, in the second packet above, we find the following:

1. Source cluster-ID = 0b
2. The "0a" means this is a heartbeat packet
3. Source node-ID = 0001

By default, LLT packets are 70 bytes in size.

-----------------------------------------------------------------