In an LDoms configuration, a service domain handles hardware resources such as network and storage on behalf of its guests. The simplest configuration uses a single primary domain, which acts as the service domain for all of the guest domains.

It is possible to configure SPARC servers with redundant service domains (e.g. primary and secondary).

Each service domain is responsible for half of the hardware devices, which allows the guests to continue functioning even if one service domain is offline. Technically, each of these domains is both an I/O domain (which has direct access to hardware) and a service domain (which provides services to other domains).

The main benefits of service domain redundancy are:

  • The ability to patch and restart each service domain without affecting guests
  • The ability to cope non-disruptively with a small class of faults such as kernel panics

The main drawbacks are:

  • Additional work is required to configure the secondary domain. For example, in hardware configurations without enough internal disks, you may need to boot the secondary domain from the SAN
  • There is duplicated work on an ongoing basis whenever you configure or modify guest domains
  • The mpgroup disk path redundancy feature is easy to set up and use, but not nearly as well tested in practice as standard Solaris multipathing with MPxIO
  • Network redundancy is not automatic, so you need to configure IPMP in the guest domains. With probe-based failure detection, this can significantly increase the number of required IP addresses. Link-based failure detection is an option, but requires special configuration of the virtual switch and may be vulnerable to bugs in the virtual switch layer
  • Multiple service domains may not be supported by higher-level LDoms management software
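
To make the duplicated per-guest work more concrete, here is a minimal Python sketch that builds (but does not run) the ldm commands for one guest with an mpgroup disk and one virtual network device per service domain. The service names (primary-vds0, secondary-vsw0 and so on), the mpgroup naming scheme, and the device paths are assumptions chosen for illustration, so adjust them to your own configuration. The linkprop=phys-state setting is what link-based IPMP relies on, and it assumes each virtual switch is backed by a physical network device.

```python
def redundant_guest_commands(guest, volume, backends):
    """Build ldm commands giving `guest` redundant disk and network paths.

    `backends` maps each service domain name to the device path under
    which that domain sees the same underlying LUN.
    Service names such as "<domain>-vds0" and "<domain>-vsw0" are
    assumed naming conventions, not something ldm requires.
    """
    service_domains = list(backends)
    mpgroup = f"{guest}-{volume}"  # hypothetical mpgroup naming scheme
    commands = []

    # Export the same backend once per service domain, all in one mpgroup.
    for sd in service_domains:
        commands.append(
            f"ldm add-vdsdev mpgroup={mpgroup} {backends[sd]} {volume}@{sd}-vds0"
        )

    # The guest gets a single virtual disk, added through the first service domain.
    commands.append(
        f"ldm add-vdisk {volume} {volume}@{service_domains[0]}-vds0 {guest}"
    )

    # One virtual network device per service domain, reporting physical link state.
    for index, sd in enumerate(service_domains):
        commands.append(
            f"ldm add-vnet linkprop=phys-state vnet{index} {sd}-vsw0 {guest}"
        )

    return commands


if __name__ == "__main__":
    example = {
        "primary": "/dev/dsk/c0t1d0s2",    # hypothetical path in the primary domain
        "secondary": "/dev/dsk/c1t1d0s2",  # hypothetical path in the secondary domain
    }
    for command in redundant_guest_commands("ldg1", "vol1", example):
        print(command)
```

Inside the guest, the two network interfaces still have to be placed in an IPMP group. That step is not covered by this sketch.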

Alternatives:

  • The live migration feature also gives you the ability to patch and restart a service domain without affecting guests: simply evacuate the physical server, perform the work, then migrate guests back
  • High availability software (e.g. cluster) at the application level can provide the same flexibility
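
The live migration alternative can be sketched in the same way. The following Python sketch only builds the ldm migrate-domain commands needed to evacuate a list of guests to another server, with an optional dry run (-n) before each real migration. The guest names and the target host name are hypothetical, and the sketch assumes the target server has enough free resources to accept every guest.

```python
def evacuation_commands(guests, target_host, dry_run_first=True):
    """Build ldm commands that migrate every guest to `target_host`.

    With dry_run_first=True, each real migration is preceded by a
    dry run (-n), which checks whether the migration would succeed
    without actually moving the domain.
    """
    commands = []
    for guest in guests:
        if dry_run_first:
            commands.append(f"ldm migrate-domain -n {guest} {target_host}")
        commands.append(f"ldm migrate-domain {guest} {target_host}")
    return commands


if __name__ == "__main__":
    # Hypothetical guest and host names, for illustration only.
    for command in evacuation_commands(["ldg1", "ldg2"], "sparc-b"):
        print(command)
```

Migrating the guests back after the maintenance window uses the same command, run on the other server with the original host as the target.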

Useful links: