I've recently set up a bunch of machines for IP multipathing (recent as of this article, Nov 28, 2001) by following the Sun blueprint paper. I thought I would share a simpler step-by-step approach.
/etc/hosts
298.178.99.137 host-int0
298.178.99.138 host-int1
298.178.99.139 host-ext0 host-dummy
298.178.99.140 host-ext1 host.eng.auburn.edu
In this example, the first two IPs are fixed (internal) to the NICs, and the second two are floating. The last one is the one we tie to the machine name, for programs whose licensing is restricted to a particular hostname. (Always tie the hostname to one of the public/external/failover addresses.)
At the beginning you'll have one network interface (the secondary) that is unconfigured, and another that would initially look something like this:
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 298.178.99.141 netmask fffffff0 broadcast 298.178.99.143
        ether 8:0:20:ff:5b:e2
You need to configure the secondary interface and make it have a unique ether address that is persistent across reboots. I like to take the address of the hme0 (or eri0 or whatever) card and add 1 to the last octet.
# eeprom 'local-mac-address?=true'
# /sbin/ifconfig hme1 plumb
# /sbin/ifconfig hme1 ether 8:0:20:ff:5b:e3
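The "add 1 to the last octet" step can be scripted. The following is only a sketch: next_mac is a hypothetical helper name, not a Solaris command, and it assumes a POSIX-style shell such as ksh or bash (the stock Solaris 8 /bin/sh may not accept the $(( )) syntax). It wraps from ff to 0 without carrying into the previous octet, so pick an address by hand in that case.

```shell
# Hypothetical helper (not part of Solaris): print a MAC address with 1
# added to the last octet, in the unpadded style that ifconfig prints.
next_mac() {
    mac=$1
    prefix=${mac%:*}                  # everything before the last colon
    last=${mac##*:}                   # last octet, hex digits without 0x
    next=$(( (0x$last + 1) % 256 ))   # wraps ff -> 0, no carry
    printf '%s:%x\n' "$prefix" "$next"
}

# Example with the hme0 address shown above
next_mac 8:0:20:ff:5b:e2
```

The printed value is what you would pass to the ifconfig ether command for the secondary interface.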
You can pretty much copy these two files as is and modify them slightly to fit your naming conventions, in the same way that you set up the /etc/hosts file above.
/etc/hostname.hme0
host-int0 netmask + broadcast + group production deprecated -failover up \
addif host-ext0 netmask + broadcast + failover up
/etc/hostname.hme1
host-int1 netmask + broadcast + group production deprecated -failover up \
addif host-ext1 netmask + broadcast + failover up
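Since the two files differ only in the hostnames, a small generator makes the pattern explicit. This is a sketch under assumptions: ipmp_hostname_file is a hypothetical helper name, and it only prints text in the layout shown above; it does not touch /etc.

```shell
# Hypothetical helper: print the contents of an /etc/hostname.<if> file
# for one member of a multipathing group.
#   $1 = fixed (internal) hostname   $2 = floating hostname   $3 = group name
ipmp_hostname_file() {
    # Fixed address: "deprecated" keeps it from being chosen as a source
    # address for normal traffic, "-failover" pins it to this NIC.
    printf '%s netmask + broadcast + group %s deprecated -failover up \\\n' "$1" "$3"
    # Floating address: "failover" lets it move to the other NIC.
    printf 'addif %s netmask + broadcast + failover up\n' "$2"
}

# Example: the contents for hme0 as used in this article
ipmp_hostname_file host-int0 host-ext0 production
```

After checking the output, you could redirect it into /etc/hostname.hme0 (and the host-int1/host-ext1 variant into /etc/hostname.hme1).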
/etc/default/mpathd has a default failure detection time (FAILURE_DETECTION_TIME) of 10000 ms. This means that it should take at most 10 seconds to detect a failure and successfully fail over an interface. I like to set this to 2500. In my experience with IP multipathing, values below that produce a steady stream of syslog messages complaining that the number is too low. If you change this file, you will have to restart mpathd. Now is as good a time as any to either restart mpathd or start it for the first time if it is not already running.
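Changing the detection time amounts to editing one line. The sketch below assumes the setting appears in the file as a line starting with FAILURE_DETECTION_TIME=; set_detection_time is a hypothetical helper name, and it takes the file path as an argument so you can try it on a copy first.

```shell
# Hypothetical helper: rewrite FAILURE_DETECTION_TIME (milliseconds) in an
# mpathd config file. $1 = path to the file, $2 = new value in ms.
set_detection_time() {
    file=$1
    ms=$2
    tmp="$file.new.$$"
    # Write the edited copy, then copy it back over the original so the
    # original file keeps its owner and permissions.
    sed "s/^FAILURE_DETECTION_TIME=.*/FAILURE_DETECTION_TIME=$ms/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (as root, on the real host):
#   set_detection_time /etc/default/mpathd 2500
#   pkill -HUP in.mpathd    # should make a running daemon reread the file
```

Sending SIGHUP is one way to have in.mpathd pick up the change without stopping it; if the daemon is not running yet, start it instead.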
When do you want a default route, and when do you want in.rdisc? It boils down to the same choice as on a non-multipathed host. Do you have one router on your LAN or do you have several? If you have only one (or a pair using HSRP or another failover protocol), then you can use a default route. If you have more than one router, then you want to use in.rdisc, much as you would use routed on a non-multipathed host. Make sure you have router discovery announcements enabled on your routers in this situation.
TIP
Plug each physical interface into a separate switch to make
effective use of multipathing. After that there are several ways
you can configure your high availability. You can plug each
switch into 2 routers and use HSRP to do router failover. In this
case, having the Sun use a default route would be fine. Or, you could
have each switch singly connected to a specific router on the same LAN,
and run in.rdisc on the Sun to detect these interfaces and
perform failover. A typical configuration is illustrated at right.
This is the easy part. Copy and paste your /etc/hostname.hme* files to ifconfig commands as below:
# /sbin/ifconfig hme1 host-int1 netmask + broadcast + group production deprecated -failover up \
addif host-ext1 netmask + broadcast + failover up
# /sbin/ifconfig hme0 host-int0 netmask + broadcast + group production deprecated -failover up \
addif host-ext0 netmask + broadcast + failover up
Occasionally you will see messages like this in your syslog files:
Nov 29 16:02:10 host.eng.auburn.edu in.mpathd[32]: [ID 398532 daemon.error] Cannot meet requested failure detection time of 2500 ms on (inet eri0) new failure detection time is 5922 ms
Nov 29 16:12:29 host.eng.auburn.edu in.mpathd[32]: [ID 122137 daemon.error] Improved failure detection time 3644 ms
Nov 29 16:12:29 host.eng.auburn.edu in.mpathd[32]: [ID 122137 daemon.error] Improved failure detection time 2500 ms
I find that they are largely ignorable. Failover still works.
There is a known issue with Solaris 8 IP multipathing where both interfaces can fail under high load if a particular patch is not installed. A reboot will not fix the situation; you must have patch 108528-15 (or later).
When you have a failure event of some kind, you'll see a message like this:
Nov 21 23:03:58 host.eng.auburn.edu in.mpathd[266]: [ID 832587 daemon.error] Successfully failed over from NIC eri1 to NIC eri0
When it comes back, you'll see one like this:
Nov 23 15:25:00 host.eng.auburn.edu in.mpathd[266]: [ID 620804 daemon.error] Successfully failed back to NIC eri0
If you see one like this, it's time to run to the switch closet:
Nov 23 15:23:56 host.eng.auburn.edu in.mpathd[266]: [ID 168056 daemon.error] All Interfaces in group production have failed
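If you want to pick these three events out of a busy log, a small filter is enough. This is a sketch only: mpathd_events is a hypothetical helper name, and it matches just the message texts quoted in this article, so other in.mpathd messages pass through silently.

```shell
# Hypothetical filter: read syslog lines on stdin and print one word per
# in.mpathd event of the kinds shown in this article.
mpathd_events() {
    while IFS= read -r line; do
        case $line in
            *in.mpathd*"All Interfaces in group"*"have failed"*) echo "GROUP-DOWN" ;;
            *in.mpathd*"Successfully failed over"*)              echo "FAILOVER" ;;
            *in.mpathd*"Successfully failed back"*)              echo "FAILBACK" ;;
        esac
    done
}

# Example, assuming syslog writes to the usual Solaris location:
#   mpathd_events < /var/adm/messages
```

A GROUP-DOWN line is the one worth paging someone about; the other two are normal during a cable or switch hiccup.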
Take the opportunity to test it out. Unplug one of your Cat5+ cables and watch failover work. Run a continuous ping to the machine. It's rather nice.
Now that you know how to set up resilient balancing links, you might be interested in how to set up a group with only one public failover interface.
The advantages of this are
# cat /etc/hostname.hme0
DUMMY1 netmask + broadcast + \
group production deprecated -failover up \
addif REALNAME netmask + broadcast + failover up
# cat /etc/hostname.hme1
DUMMY2 netmask + broadcast + \
group production deprecated -failover standby up
# cat /etc/hosts
#
# Internet host table
#
127.0.0.1 localhost
192.168.10.10 REALNAME loghost
192.168.10.11 DUMMY1
192.168.10.12 DUMMY2
#
What does this do? It sets up two dummy (private) IP addresses that are fixed to the interfaces. It sets up a failover group named production. It adds the IP for REALNAME to the group and marks it as the failover IP that will be migrated, and hme1 is set as the standby interface. In most situations, hme0 will be used to transmit and receive packets. In the case of a failure (interface, switch, cable, router, etc.), the IP for REALNAME will migrate to the hme1 interface. When hme0 recovers, the IP will migrate back.