Monitoring system health with Node Problem Detector

Starting with Milestone 77, Container-Optimized OS includes the Node Problem Detector agent. You can use this feature to monitor the system health of COS instances. Node Problem Detector monitors the instance health and reports health-related metrics to Cloud Monitoring, including capacity and error metrics that you can then visualize with Google Cloud Observability dashboards. Collected metrics from the default configuration are free. Google will use aggregated metrics to understand node problems and improve the reliability of Container-Optimized OS.

The agent is pre-configured with the set of metrics to export. Customizing reported metrics for the built-in agent is not supported at this time. Node Problem Detector is open-source software. You can review its source code and configurations in their respective source repositories.

Enabling health monitoring

The Node Problem Detector agent is disabled by default at boot time. You can enable this feature by using:

cloud-init
startup script
metadata
user-defined guest policies

Using a startup script

You can enable Node Problem Detector by using a startup script that runs systemctl start node-problem-detector when the instance boots.
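As a minimal sketch, the commands below write such a startup script to a local file. The file name enable-npd.sh is an arbitrary choice for illustration; the only line taken from this page is the systemctl start node-problem-detector step.

```shell
# Write a minimal startup script to a local file (the file name is arbitrary).
cat > enable-npd.sh <<'EOF'
#!/bin/bash
# Starts the Node Problem Detector systemd unit when the startup script runs.
systemctl start node-problem-detector
EOF

# Optional sanity check: parse the script without executing it.
bash -n enable-npd.sh && echo "enable-npd.sh: syntax OK"
```

The file can then be supplied when the instance is created, for example by adding --metadata-from-file=startup-script=enable-npd.sh to a gcloud compute instances create command.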

Using cloud-init

The cloud-init example explains the basics of configuring a Container-Optimized OS instance. You can use cloud-init to enable health monitoring with the following cloud-config example:

#cloud-config

runcmd:
- systemctl start node-problem-detector

Using metadata

In Container-Optimized OS Milestone 88 and later, the Node Problem Detector can also be enabled by setting the value of google-monitoring-enabled to true in the custom metadata section.

To enable monitoring when creating an instance:

gcloud compute instances create VM_NAME \
    --image=IMAGE \
    --image-project=cos-cloud \
    --metadata=google-monitoring-enabled=true

Replace the following:

VM_NAME: name of the new VM
IMAGE: a specific version of a public Container-Optimized OS image. For example, --image=cos-113-18244-85-29.

To enable monitoring on an existing instance:

gcloud compute instances add-metadata VM_NAME \
    --metadata=google-monitoring-enabled=true

Replace VM_NAME with the name of the VM.

Starting in Milestone 97, monitoring can also be enabled through project metadata:

gcloud compute project-info add-metadata \
    --metadata google-monitoring-enabled=true

After the command runs, the node-problem-detector service is enabled.
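One way to confirm that the service is running, whichever method you used to enable it, is to query systemd from a shell on the instance (for example, over SSH). The following is a sketch only; the function name check_npd is hypothetical and is not part of Container-Optimized OS.

```shell
# Hypothetical helper: report whether the node-problem-detector unit is active.
# Intended to be pasted into a shell on the instance itself.
check_npd() {
  # "is-active --quiet" prints nothing and signals the state via its exit code.
  if systemctl is-active --quiet node-problem-detector; then
    echo "node-problem-detector is active"
  else
    echo "node-problem-detector is not active"
    return 1
  fi
}
```

After defining the function, run check_npd. A non-zero exit status means that systemd does not report the unit as active.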

Using user-defined guest policies

Container-Optimized OS includes the OS Config agent, which uses OS system utilities to maintain the VM state that is specified in the guest policy. For details about guest policies, see Enable OS Config agent and Create a guest policy. The following guest policy enables the Node Problem Detector agent on all instances.

recipes:
- name: recipe-enable-npd
  desiredState: INSTALLED
  installSteps:
  - scriptRun:
      interpreter: SHELL
      script: |-
        #!/bin/bash
        systemctl start node-problem-detector

Viewing the collected metrics

Node Problem Detector reports a list of metrics against a Compute Engine instance monitored resource. The metrics are documented on Monitoring metrics list, prefixed with compute.googleapis.com/guest/. You can view the collected metrics using Monitoring Metrics Explorer:

In the Google Cloud console, go to Monitoring.
In the Monitoring navigation pane, click Metrics explorer.
For the resource type, select Compute Engine VM instance.
Select a metric, for example "Problem Count".
You should see charts and statistics on the right side. To view the result for a specific Container-Optimized OS instance, set the filter to "instance_id=[INSTANCE_ID]", replacing [INSTANCE_ID] with the ID for the desired instance.
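The filter above needs the instance's numeric ID rather than its name. As a sketch, the ID can be looked up with gcloud compute instances describe and formatted as the filter string. The function name npd_instance_filter and its arguments are hypothetical, and the example assumes that the gcloud CLI is installed and authenticated.

```shell
# Hypothetical helper: print the Metrics Explorer filter for one VM.
# Usage: npd_instance_filter VM_NAME ZONE
npd_instance_filter() {
  local vm="$1" zone="$2" id
  # --format="value(id)" prints only the instance's numeric ID.
  id="$(gcloud compute instances describe "$vm" \
    --zone="$zone" --format="value(id)")" || return 1
  echo "instance_id=${id}"
}
```

For example, npd_instance_filter VM_NAME ZONE prints a string of the form instance_id=[INSTANCE_ID] that you can paste into the filter field.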

Disabling health monitoring

To disable the service after enabling it through cloud-config or a startup script, remove the systemctl start node-problem-detector step, and then reboot the Container-Optimized OS instance. If you enabled the service through metadata, set the google-monitoring-enabled key to false.
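For the metadata case, the same add-metadata command that is used to enable monitoring can set the key to false. The wrapper below is a sketch: the function name set_npd_monitoring is hypothetical, and it only validates the value before calling gcloud compute instances add-metadata.

```shell
# Hypothetical helper: set google-monitoring-enabled on an existing instance.
# Usage: set_npd_monitoring VM_NAME true|false
set_npd_monitoring() {
  local vm="$1" value="$2"
  # Reject anything other than the two values described on this page.
  case "$value" in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 2 ;;
  esac
  gcloud compute instances add-metadata "$vm" \
    --metadata="google-monitoring-enabled=${value}"
}
```

For example, set_npd_monitoring VM_NAME false sets the key to false for that instance. gcloud uses your default zone unless you also specify one.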