Failed heartbeat unnoticed in Distributed Application

Author : Ingmar Verheij

Server down

System Center Operations Manager (SCOM) monitors the health of systems with an agent. One of the most basic checks it performs is a health check of the agent itself, part of which is a heartbeat between the agent and the RMS (Root Management Server). If the heartbeat is missed three times in a row (the number is configurable), the agent is considered unavailable.

Health Service Heartbeat Failure
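The missed-heartbeat rule described above can be sketched as a small model. This is only an illustration of the threshold logic, not SCOM code: the class `HeartbeatTracker`, its method names, and the 60-second interval are assumptions made for the example; only the threshold of three missed heartbeats comes from the text.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTracker:
    """Illustrative model of heartbeat tracking for one agent (hypothetical, not SCOM code)."""

    interval_seconds: float = 60.0  # assumed heartbeat interval
    missed_allowed: int = 3         # threshold from the text (configurable)
    last_seen: float = 0.0          # timestamp of the last received heartbeat

    def record_heartbeat(self, now: float) -> None:
        # Remember when the most recent heartbeat arrived.
        self.last_seen = now

    def missed_heartbeats(self, now: float) -> int:
        # Number of whole intervals that have passed without a heartbeat.
        return int((now - self.last_seen) // self.interval_seconds)

    def is_available(self, now: float) -> bool:
        # The agent counts as unavailable once the threshold is reached.
        return self.missed_heartbeats(now) < self.missed_allowed


tracker = HeartbeatTracker()
tracker.record_heartbeat(now=100.0)
available_early = tracker.is_available(now=279.0)  # two intervals missed
available_late = tracker.is_available(now=280.0)   # three intervals missed
```

With these assumed values the agent is still considered available after two missed intervals and is flagged as unavailable once the third interval has passed.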

An alert is generated and (if configured) a notification is sent to inform the administrator that there is a problem.

But if a Distributed Application is configured to monitor a chain of components, this failure remains unnoticed.

Node state 'Healthy'

Nodes that are unmonitored are grey and appear to be ‘Healthy’, which is strange for a node whose heartbeat has not been received for quite some time.

Unnoticed heartbeat failure

Operations Manager assumes that if a node is unavailable because the heartbeat is lost, none of its child objects should be monitored. This prevents alerts from child objects that are probably just as unavailable as the parent, but it also places the whole node in an ‘unmonitored’ state.

Distributed application state 'Okay'

The effect of putting a node in an ‘unmonitored’ state is that a parent node in a distributed application, containing one or more agents, no longer takes the health of that machine into account. In other words, if the heartbeat is lost, the parent node still reports its state as Okay.

Distributed application state 'Error'

To prevent a parent node (containing one or more agents) from ignoring the health of child nodes that are unmonitored, an override can be configured. With the override, an unmonitored child node can be made to result in a warning or an error on the parent.
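The difference between the default behaviour and the override can be illustrated with a simplified rollup function. This is a sketch under stated assumptions, not how Operations Manager is implemented: the `Health` enum, the `rollup` function, and its `unavailable_as` parameter are hypothetical names, and a worst-state rollup policy is assumed.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Health(IntEnum):
    """Simplified health states, ordered from best to worst (hypothetical model)."""
    UNMONITORED = 0
    HEALTHY = 1
    WARNING = 2
    ERROR = 3


def rollup(children: Iterable[Health],
           unavailable_as: Optional[Health] = None) -> Health:
    """Return the worst child state (assumed worst-of policy).

    Without an override (unavailable_as is None) unmonitored children are
    skipped, so they cannot affect the parent. With an override they are
    counted as the given state, for example WARNING or ERROR.
    """
    effective = []
    for state in children:
        if state is Health.UNMONITORED:
            if unavailable_as is None:
                continue  # default: an unmonitored child is ignored
            state = unavailable_as  # override: treat it as warning or error
        effective.append(state)
    # If nothing is left to evaluate, the parent itself is unmonitored.
    return max(effective, default=Health.UNMONITORED)


children = [Health.HEALTHY, Health.UNMONITORED]
default_state = rollup(children)                               # Health.HEALTHY
override_state = rollup(children, unavailable_as=Health.ERROR)  # Health.ERROR
```

In this model a node with one healthy agent and one agent that lost its heartbeat stays healthy by default, and only turns to an error state once the unmonitored child is explicitly mapped to an error.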

Configure override

The override should be configured on the node that contains the agents. As an example I’ve created a Distributed Application with the name ‘Test’ that contains a node ‘Application Servers’. This node contains two agents: VCTX101 and VCTX110.

Distributed Application

Select the node and click ‘Configure Health Rollup’. Here you can configure overrides for the node.

At the bottom you will find an override for the monitor ‘Monitoring unavailable’. The default option is ‘Monitoring Unavailable’, which prevents an unmonitored child node from affecting the state of the node. By enabling the override and setting the value to ‘Rollup monitoring unavailable as error’, an unmonitored child node will place the node in an error state.

Override - Monitoring Unavailable - Monitoring Unavailable

Override - Monitoring Unavailable - Rollup monitoring unavailable as error