The check_mk Administration Guide

Version: 1.0.35
Copyright Mathias Kettner 2009

Installation

After extracting the distribution archive of check_mk you'll find a file INSTALL which describes the installation procedure. This manual assumes that you have already installed check_mk as a normal user into the default paths. This means that all components of check_mk are installed into a directory check_mk in your home directory. This manual assumes that your username is linux and that your home directory is /home/linux.

When installing check_mk as root or from the RPM or DEB package, the installation complies with the Linux Filesystem Hierarchy Standard (FHS) and uses directories below /usr, /etc and /var.

Getting started

Many aspects of check_mk can be configured and tested without a running Nagios installation or even without Nagios being installed at all. To get started as easily as possible, we will first focus on those aspects and in a second step integrate check_mk with Nagios.

Check_mk is first and foremost a replacement for NRPE, check_snmp, NSClient++ and similar Nagios agents, for the plugins locally run by those agents, and for the counterparts on the Nagios server that communicate with those agents.

check_mk comes with its own agents and can even make use of existing SNMP agents. Our first step will be the installation of the Linux agent on a Linux host of your choice, which will be used for testing. This host can of course be the Nagios host itself, since it makes sense to monitor its parameters as well.

Installation of the Linux agent

The agent for Linux consists of only two files: a shell script called check_mk_agent.linux and a configuration file for xinetd, both of which can be found in the subdirectory agents. (Xinetd is an improved version of the classical inetd and is standard on most current Linux distributions. You need to install either inetd or xinetd (recommended) on all of your target Linux hosts.)

Please install the file check_mk_agent.linux on your target host as /usr/bin/check_mk_agent (drop the .linux). You should be able to execute the agent simply by calling it from the command line. It can be run as non-root user but some diagnostic information can only be retrieved if it is run as root. The output of check_mk_agent looks like this (abbreviated):

<<<mknagios>>>
Version: 1.0.35
<<<df>>>
/dev/sda1     ext3     1008888    223832    733808      24% /
/dev/sdc1     ext3     1032088    284648    695012      30% /lib/modules
<<<ps>>>
init [3]
/sbin/syslogd
/sbin/klogd -x
/usr/sbin/cron
/sbin/getty 38400 tty2
/sbin/getty 38400 tty3
/sbin/getty 38400 tty4
/sbin/getty 38400 tty5
/sbin/getty 38400 tty6
/sbin/getty 38400 tty1
sshd: linux [priv]
sshd: linux@ttyp0
-sh
/usr/sbin/sshd

The configuration of xinetd is simple: Copy the file xinetd.conf into /etc/xinetd.d as check_mk and (re-)start xinetd with /etc/init.d/xinetd restart. You might also want to have a look into that file and adapt it to your needs. Do not forget to activate the start script of xinetd (for example with chkconfig xinetd on on SuSE/RedHat). Otherwise xinetd will not be started automatically at the next reboot.
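A minimal sketch of those steps, assuming the distribution archive was extracted to /home/linux/check_mk as in this manual (the chkconfig call is the SuSE/RedHat variant mentioned above):

root@host:/ # cp /home/linux/check_mk/agents/xinetd.conf /etc/xinetd.d/check_mk
root@host:/ # /etc/init.d/xinetd restart
root@host:/ # chkconfig xinetd on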

If everything went fine, you should be able to retrieve the output of the agent by connecting to TCP port 6556 from the Nagios host. You can test this with telnet hostname 6556. Note that if you want to test this on the Nagios server itself via localhost, the IP address 127.0.0.1 has to be allowed in xinetd.conf. You can also check whether TCP port 6556 has been opened with netstat:

root@host:/ # netstat -ltn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:6556            0.0.0.0:*               LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN

Accessing the agent via check_mk

After a successful installation of the agent you can retrieve the data on the Nagios host via check_mk. This is done with bin/check_mk -d localhost and should show the complete data from the agent.
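For example (output abbreviated; it is the same data the agent printed when called locally):

linux@host:~/check_mk$ bin/check_mk -d localhost
<<<mknagios>>>
Version: 1.0.35
<<<df>>>
/dev/sda1     ext3     1008888    223832    733808      24% /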

Configuration of Hosts

In order to perform actual checks with check_mk you need some configuration. This is done in the configuration file main.mk, which is found in ~/check_mk (if you used the standard paths while installing). The syntax of that file is Python. This does not mean that you need to know how to write programs in Python: the file simply contains a list of variable definitions, which must be written in Python syntax.

If check_mk finds a file named main.mk in the current working directory, it loads that one instead of the standard one. The directory ~/check_mk is also always scanned and preferred over the standard path. You may even explicitly specify a configuration file with the option -c.

The most important and also the only mandatory variable is named all_hosts and lists all hosts to be monitored with check_mk. That variable is specified as a Python list of strings. You can either write it in a single line:

all_hosts = [ 'localhost', 'zabcsrv1', 'zabcsrv2' ]

Or you might want to list each host in a line of its own:

all_hosts = [ 
  'localhost', 
  'zabcsrv1', 
  'zabcsrv2' 
]

NOTE: The opening square bracket must be in the same line as the variable name!

Auto-Detection of items to be checked

After you have entered all those hosts where an agent has been installed, you can scan those hosts for items to be checked. This is done with the option -I (like Inventory). To scan only localhost call bin/check_mk -I alltcp localhost:

linux@host:~/check_mk$ bin/check_mk -I alltcp localhost
No new checks of type oracle_asm_dg.
No new checks of type vms_md.
1 new checks written to /home/linux/check_mk/var/autochecks/cpu.loads-2009-03-25_14.54.44.cfg
No new checks of type lsi.disk.
No new checks of type statgrab_net.link.
No new checks of type multipath.
No new checks of type oracle_tbs.
No new checks of type statgrab_net.params.
1 new checks written to /home/linux/check_mk/var/autochecks/cpu.threads-2009-03-25_14.54.44.cfg
2 new checks written to /home/linux/check_mk/var/autochecks/diskstat-2009-03-25_14.54.44.cfg
2 new checks written to /home/linux/check_mk/var/autochecks/df-2009-03-25_14.54.44.cfg
No new checks of type ipmi.
No new checks of type vms_sys.mem.
No new checks of type local.
No new checks of type statgrab_net.ctr.
No new checks of type lsi.array.
No new checks of type vms_df.
1 new checks written to /home/linux/check_mk/var/autochecks/mem.used-2009-03-25_14.54.44.cfg
No new checks of type netif.params.
No new checks of type services.
No new checks of type tsm_stgpool.
No new checks of type winperf.diskstat.
No new checks of type vms_sys.util.
1 new checks written to /home/linux/check_mk/var/autochecks/netctr.combined-2009-03-25_14.54.44.cfg
No new checks of type md.
1 new checks written to /home/linux/check_mk/var/autochecks/kernel.util-2009-03-25_14.54.44.cfg
No new checks of type oracle_inst.
No new checks of type logwatch.
No new checks of type netif.link.
No new checks of type oracle_asm_disk.
No new checks of type vms_netif.

check_mk creates configurations for the detected items in the subdirectory var/autochecks. Those files will never be overwritten by check_mk. You may edit them in order to change warning/critical levels for some items or remove items not needed any longer. For each check type a separate file will be created. The following one is for checking space on filesystems:

# /home/linux/check_mk/var/autochecks/df-2009-03-25_14.54.44.cfg
[
  # === localhost ===
  ("localhost", "df", '/', filesystem_default_levels), # 32
  ("localhost", "df", '/lib/modules', filesystem_default_levels), # 30
]
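If you want, say, explicit levels of 90% and 95% for the root filesystem instead of the defaults, you could edit that line as follows (a sketch; for df the parameters are a pair of warning and critical levels in percent, as in the manual configuration examples shown later in this manual):

  ("localhost", "df", '/', (90, 95)), # 32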

Some checks are not detected automatically because they cannot be. We'll see examples for that later.

Performing a check dry run

Once your host is inventorized you can perform a dry-run check even without Nagios running. This is done by calling check_mk with the options -n and -v which are usually combined into -nv. Also specify the host to be checked:

linux@host:~/check_mk$ bin/check_mk -nv localhost
Getting info cpu from host localhost (127.0.0.1)
CPU load             OK - 0.00
Number of threads    OK - 36 threads
fs_/                 WARN - 85% used (warning at 80%)
fs_/lib/modules      OK - 30% used
Disk IO read         OK - 0.0MB/s (in last 1237989779 secs)
Disk IO write        OK - 0.0MB/s (in last 1237989779 secs)
CPU utilization      OK - user:  0%, system:  0%, wait:  0%
Memory used          OK - 16.0% of RAM (47 MB) used by processes
NIC eth0 counters    OK - Receive: 0.00 MB/sec - Send: 0.00 MB/sec
OK - Version 16, Successfully processed 9 host infos

Integration into Nagios

In this manual we assume that you already have a running Nagios installation. Check_mk has been tested with Nagios version 3 only (it might also work with Nagios 2, but that has never been tested). We assume that you store your Nagios object configuration files in the directory /etc/nagios/conf.d.

The Nagios integration is done by adding two Nagios configuration files and restarting Nagios. The first one can be created automatically by check_mk and defines all hosts listed in all_hosts and all checks that are done by check_mk. This file is created by calling check_mk with the options -H and -S - usually combined into -HS. check_mk writes the configuration file to standard output (abbreviated):

linux@host:~/check_mk$ bin/check_mk -HS
#
# Automatically created by check_mk at Wed Mar 25 16:16:08 2009
# Do not edit
#

# Active host checks for MK plugin

# ----------------------------------------------------
# localhost
# ----------------------------------------------------

define service {
    use                      check_mk_passive_perf
    host_name                localhost
    service_description      fs_/
    check_command            mknagios-df
}

define service {
    use                      check_mk_passive_perf
    host_name                localhost
    service_description      Number of threads
    check_command            mknagios-cpu.threads
}

# ... left out most service definitions ...

define host {
    use            check_mk_host
    host_name      localhost
    alias          localhost
    host_groups    +check_mk
    address        127.0.0.1
}

# default hostgroup
define hostgroup {
    hostgroup_name check_mk
    alias          Default hostgroup for check_mk-Hosts
}

Simply redirect the output into a file and copy that file into your Nagios configuration directory, for example /etc/nagios/conf.d/check_mk_data.cfg.
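A minimal sketch of those two steps, using the paths assumed in this manual:

linux@host:~/check_mk$ bin/check_mk -HS > check_mk_data.cfg
root@host:/ # cp /home/linux/check_mk/check_mk_data.cfg /etc/nagios/conf.d/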

The second needed configuration file defines templates that are referred to by the first file. You find a template for this in doc/check_mk_templates.cfg and can copy it to /etc/nagios/conf.d/. Please edit it to your needs. In particular you have to define a method for doing host checks such as check_icmp. The template only does a dummy check which is always successful.
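The copy itself could look like this (again assuming the standard installation path of this manual):

root@host:/ # cp /home/linux/check_mk/doc/check_mk_templates.cfg /etc/nagios/conf.d/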

After creating those two files you can restart Nagios. Now you should see your new host with the autodetected services in your Nagios web interface. If Nagios does not start, please make sure that you do not have a duplicate host definition. If you configure a host with check_mk then its Nagios object definition will be created by check_mk. If you already have a manual definition for that host then please remove it. This is not really a restriction, since you may keep any pre-existing checks for that host.

Submission of check results to Nagios

As you might remember from the introduction, check_mk sends all check results via the Nagios command pipe as passive service checks to Nagios. In order for that to work, you have to make sure that the user running check_mk has write access to the Nagios command pipe.

You can check write access to the pipe by calling check_mk without any options other than the hostname to be checked. The following example shows a case with a permission problem:

Cannot write to nagios command pipe /var/lib/nagios3/rw/nagios.cmd: 
[Errno 13] Permission denied: '/var/lib/nagios3/rw/nagios.cmd'
OK - Version 16, Successfully processed 9 host infos
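The path to the command pipe can be adjusted in main.mk with the variable nagios_command_pipe_path, which is described later in this manual. A minimal sketch, using the path shown in the error message above:

nagios_command_pipe_path = '/var/lib/nagios3/rw/nagios.cmd'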

If everything is working, the check_mk checks should now appear in the Nagios web interface.

Administration Reference

After this short introduction for getting started, the main part of this manual will show the different aspects of check_mk in detail.

Manually configured services

As we have seen above, check_mk provides a convenient method for automatically scanning for items to check. In some circumstances, however, a manual configuration is the better or even the only solution. Services to be checked can be configured in main.mk. This is done by defining the variable checks, which is a list of check entries. Each entry is a 4-tuple consisting of the hostname (or a list of hostnames), the check type, the check item (None for checks without an item) and the check parameters - typically the warning and critical levels, as the following examples show.

Call check_mk -L for a complete list of all checks supported by your installation of check_mk.

Here is an example of manually configured checks:

checks = [
  ( 'lnxsrv01', 'cpu.threads',  None,  ( 200.0, 400.0 ) ),
  ( 'lnxsrv02', 'cpu.threads',  None,  ( 300.0, 600.0 ) ) ]

If you have a list of hosts with the same parameters for a certain check then you can group them together into a single line:

all_hosts = [ 'lnxsrv01', 'lnxsrv02' ]

checks = [
  ( all_hosts, 'cpu.threads',  None,  ( 200.0, 400.0 ) )
]

It is also possible to define exceptions. Let's assume that for all hosts the number of threads should not exceed 200, but for lnxsrv01 a value of 500 is OK. This can be configured like this:

checks = [
  ( all_hosts,  'cpu.threads',  None,  ( 200.0, 400.0 ) ),
  ( 'lnxsrv01', 'cpu.threads',  None,  ( 500.0, 800.0 ) ),
]

NOTE: The general case has to be defined before the exceptions. Later entries of the same host/check type/check item override previous ones.

Management of autodetected checks

Scanning for new checks

The inventory feature of check_mk has been presented in the previous chapter. It is an easy method to scan one or several hosts for new items that need to be checked. The inventory is triggered with the option -I. The syntax is:

check_mk -I CHECKTYPE1,CHECKTYPE2,CHECKTYPE3 HOST1 HOST2 HOST3

You can get a list of all available check types by calling check_mk with the option -L:

Available check types:

                    plugin   performance  inventory
Name                type     data         possible    service description
-------------------------------------------------------------------------
cisco_fan           snmp     no           yes         %s
cisco_locif         snmp     yes          yes         Port %s
cisco_power         snmp     no           yes         %s
cisco_temp          snmp     yes          yes         %s
cpu.loads           tcp      yes          yes         CPU load
cpu.threads         tcp      yes          yes         Number of threads
df                  tcp      yes          yes         fs_%s
df_netapp           snmp     yes          yes         fs_%s
diskstat            tcp      yes          yes         Disk IO %s
fc_brocade_port     snmp     yes          yes         PORT %s
ipmi                tcp      yes          yes         IPMI Sensor %s
ironport_misc       snmp     yes          yes         %s
kernel              tcp      yes          no          Kernel %s
....

The following example checks for new filesystems on the hosts lnxsrv01 and lnxsrv02:

linux@host:~/check_mk$ bin/check_mk -I df lnxsrv01 lnxsrv02

You can scan for several check types at the same time by separating them with commas:

linux@host:~/check_mk$ bin/check_mk -I df,ipmi lnxsrv01 lnxsrv02

Note: The list of hosts is separated with spaces. It is also possible to scan all hosts defined in all_hosts at once. Simply leave out the list of hostnames. The following example scans for new filesystems on all hosts:

linux@host:~/check_mk$ bin/check_mk -I df

If you want to exclude some hosts from the inventory you can list them in the variable noinv_hosts in main.mk. Those hosts will never be scanned by the inventory:

noinv_hosts = [ "zbghcim", "zbghcim01", "zbghcim02" ]

SNMP-based checks can also be inventorized. This will be explained later when we discuss SNMP in general.

The special tag alltcp is used to scan for all non-SNMP checks. The following command checks for all new check items on all hosts:

linux@host:~/check_mk$ bin/check_mk -I alltcp

What happens with existing checks?

One point is important to understand: the inventory never finds items that are already being checked for a specific host - either because of a previous inventory or because of a manual configuration. Let's look at an example. We assume that you want to check the root filesystem on all hosts with levels of 90% and 95% for warning and critical. You can do this with the following explicit check definition:

checks = [
  ( all_hosts,  'df', '/', (90, 95)),
  # some other definitions follow...
]

If you now run the inventory it will not generate any entries for the check type df with the item / anymore. The same holds for all filesystems that have been found on previous inventories.

Removing inventorized checks

The day will come when an item previously found by an inventory no longer exists. Let's assume that on host lnxsrv01 the filesystem /oradata has been removed. The check of this filesystem will enter an UNKNOWN state. In order to remove it from Nagios, find the file in var/autochecks containing that check item and remove the item by simply deleting the line.

# /home/linux/check_mk/var/autochecks/df-2009-03-25_14.54.44.cfg
[
  # === lnxsrv01 ===
  ("localhost", "df", '/', filesystem_default_levels),
# delete the following line:
  ("localhost", "df", '/oradata, filesystem_default_levels),
]

Of course it is also possible to remove a complete file in var/autochecks. After any change in the list of checks do not forget to regenerate the Nagios configuration files with check_mk -HS and restart Nagios.

Host/Service-groups and Host/Service-Contactgroups

Hostgroups

Check_mk fully automates the generation of all Nagios configuration files needed for the hosts and services checked via check_mk. As we have already shown, the host configuration data is created with the option -H.

What we haven't shown yet is the possibility of assigning hosts to host groups and contact groups. This is done by defining the two configuration variables host_groups and host_contactgroups. They are lists of mapping entries. Let's look at an example:

host_groups = [
 ( 'munich', [ 'zabc02', 'zmucna04' ] ),
 ( 'rome',   [ 'zabc02', 'zromxy01', 'zromab01' ] ),
]

This assigns the hosts zabc02 and zmucna04 to the Nagios host group munich. The entry for rome is similar. Note that the host zabc02 is in both host groups. The host groups themselves are not created by check_mk. You have to configure them manually in your Nagios configuration.

Hosts that are not assigned to any host group are automatically assigned to a default host group, which is automatically created by check_mk. The name of this group is check_mk but you can redefine it to another value by defining the variable default_host_group. This way you get a consistent configuration even without manually defining any host groups. Remember: Nagios requires each host to be in at least one host group.
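A minimal sketch of such a redefinition in main.mk (the group name is just an example):

default_host_group = 'default-hosts'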

Usage of Python variables and lists

Remember that the syntax of the configuration file is Python. This means that you can use Python variables to store lists of hosts and make the configuration more convenient:

# put all munich hosts in one list...
munich_hosts = [ 'zabc02', 'zmucna04' ]

# ... and the rome hosts into another
rome_hosts = [ 'zabc02', 'zromxy01', 'zromab01' ]

# join the two lists together for the definition of all hosts:
all_hosts = munich_hosts + rome_hosts

# now re-use the lists for the definition of the host groups
host_groups = [
 ( 'munich', munich_hosts ),
 ( 'rome',   rome_hosts ),
]

Host Contact Groups

If you want to map all hosts of a host group to the same contact or contact group then you can simply configure this manually in your Nagios configuration. A more flexible way is to define host contact groups in the variable host_contactgroups. The syntax is the same as with the host groups:

host_contactgroups = [
 ( 'unixadmins',  unix_hosts ),
 ( 'networkteam', [ 'zfw1' ] )
]

Service Groups

The configuration of Nagios service groups is similar - but not entirely identical. It is done by defining the variable service_groups as a list of triples. Each triple consists of a service group name, a list of host patterns and a list of service patterns. Consider the following example:

service_groups = [
 ( 'logwatch-all',  [ 'srv' ], [ 'LOG'] ),
 # other definitions follow here...
]

This entry maps all services whose descriptions begin with LOG on hosts whose names begin with srv to the service group logwatch-all. The service group definition itself is not created by check_mk; you have to configure it manually.

If you want the definition to match all hosts, simply specify an empty hostname:

service_groups = [
 ( 'logwatch-all',  [ '' ], [ 'LOG'] ),
 # other definitions follow here...
]

Of course it is also possible to make use of variables as in previous examples:

service_groups = [
 ( 'logfiles-linux',  linux_hosts, [ 'LOG'] ),
 # other definitions follow here...
]

The following example puts all services beginning with Disk or fs_ into one group:

service_groups = [
 ( 'disk',  [ '' ], [ 'Disk ', 'fs_'] ),
 # other definitions follow here...
]

Exceptions can be denoted with an exclamation mark. It is important to know that the list of service patterns is evaluated from left to right. If the service name in question matches a pattern, the service is included in the service group and the evaluation stops. If a pattern beginning with ! matches the service name, then the service is excluded from the group and the evaluation also stops. If none of the patterns match, then the service is not included in the group. The following definition puts all services except those beginning with LOG into the service group nolog:

service_groups = [
 ( 'nolog', [ '' ], [ '!LOG', ''] ),
 # other definitions follow here...
]

The patterns allow you to use extended regular expressions. The expression .* matches an arbitrary string. That way you can make an infix match. Here is a definition of a service group of all services containing R15 on all hosts containing lnx:

service_groups = [
 ( 'r15', [ '.*lnx' ], [ '.*R15'] ),
 # other definitions follow here...
]

Service Contact Groups

The definition of service contact groups is entirely analogous to that of the service groups. The variable you need to define is service_contactgroups. Here is a more complex example:

service_contactgroups = [
  ( "bia-blades", hostnames_biablades,  [ "" ]),
  ( "filer",      hostnames_filer,      [ "" ]),
  ( "ldap",       [ "zbiza04", "zbiza050" ], [ "!IPMI ", "!NIC ", "!RAID ", 
                                               a"!Multipath ", "" ] ),
  ( "oracle",     [ "" ],               servicenames_oracle),
  ( "os-linux",   [ "!Tablespace ", "!DB_", "!ASM ", "!FBA ", "" ]),
  ( "san",        hostnames_san,        [ "" ]),
  ( "sap-wac",    hostnames_sap_wac,    [ "" ] ),
  ( "sap-sil",    hostnames_sap_sil,    [ "" ] ),
  ( "tapelibs",   hostnames_tapelibs,   [ "" ]),
]

General parameters in main.mk

Contacting the agent

The following global settings can be done in main.mk:

mknagios_port: TCP port used for connecting to the agent. The default is 6556. If you change this to another value please make sure that you change it accordingly on all agents as well.

tcp_connect_timeout: Timeout in seconds for building up the TCP connection to the agent. The default is 3.0. This is not a timeout for the complete data transfer from the agent.

mknagios_min_version: Minimal version of the agent that is expected. The default is 0 which means that any version is fine. If an agent reports a version that is less than mknagios_min_version, the service check_mk itself enters a WARNING state.
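In main.mk these settings are plain variable assignments, for example (a sketch; 6556 is the documented default port, the timeout value is just an example):

mknagios_port       = 6556   # default
tcp_connect_timeout = 5.0    # default is 3.0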

Directories

The directories check_mk uses are configured at installation time. With the one exception of the path to the configuration file itself all directories and paths can be redefined in main.mk:

check_mk_configdir: check_mk will - after reading the main configuration file - look into this directory for files ending with .cfg and read them as well.

checks_dir: Directory where check_mk looks for check plugins

modules_dir: Directory with extension modules for check_mk

mknagios_autochecksdir: Directory where to put and look for autodetected checks.

precompiled_hostchecks_dir: Directory for precompiled host checks

counters_directory: Directory where current values of performance counters are kept. This directory will contain one file per host.

tcp_cache_dir: Directory where the output from the TCP-based agents is cached. One file per host contains the most recent output of the host's agent. The cache files are used for a faster inventory.

rrd_path: Directory for round robin databases. This is needed for direct RRD updates.

nagios_command_pipe_path: Path to the Nagios command pipe.
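Such a path can be redefined in main.mk like any other variable, for example (a sketch with a hypothetical directory):

tcp_cache_dir = '/var/cache/check_mk'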

Templates to use for Nagios configuration

When creating the configuration files for Nagios, all host and service definitions refer to templates. This gives you more flexibility in your configuration. You may override the names of those templates by defining the following variables:

pingonly_template (Default: 'check_mk_pingonly'): Host template for hosts that are only pinged (hosts that do not have at least one configured service check other than check_mk).

host_template (Default: 'check_mk_host'): Template used for hosts that have at least one check.

cluster_template (Default: 'check_mk_cluster'): Template used for cluster hosts.

active_service_template (Default: 'check_mk_active'): Used for the service definitions which actively call check_mk - once for each host.

passive_service_template (Default: 'check_mk_passive'): Template used for services that do not provide performance data.

passive_service_template_perf (Default: 'check_mk_passive_perf'): Template used for services that do provide performance data.
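For example, to let hosts with at least one check use a template of your own (the template name is just an example; the template itself must exist in your Nagios configuration):

host_template = 'my_check_mk_host'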