Monitoring a Windows machine via Nagios

In this post, I detail how to set up Nagios to monitor a Windows machine via NSClient++.

Prerequisites:

  • A working Nagios install, configured on a server with a fixed IP
  • The latest NSClient++ MSI Installer

NOTE: At the time of writing, the latest version of NSClient++ is 0.4.2, which is recommended because 0.4.1 has issues binding to IPv4 addresses.

NSClient++ Installation:
Run the installer
Choose your preferred installation type
(Typical should suffice. Custom and Complete allow for additional functionality such as client-side Lua and Python scripting, but that functionality is outside the scope of this post)
Choose to install a sample configuration
(The default is functional and works well for our purposes)
Enter the IP of your Nagios server under Allowed Hosts
Enter an NSClient password if desired
Enable common check plugins and nsclient server
Finish the install

NSClient++ Configuration:
(Windows XP)
Right click on My Computer and select Manage
Go to Services and Applications -> Services

(Windows 7)
Open the start menu
Type 'Service' and select the 'Services' item

(Windows 8)
Open the start screen
Type 'Service' and select 'View local services'

From the services menu, locate NSClient++
Right click on it and select Properties
Open the Log On tab and check the 'Allow service to interact with desktop' box. This gives NSClient++ access to the data it will be monitoring.

As part of the installation, NSClient++ should have added the appropriate exceptions to the Windows Firewall. If you're using a third-party firewall on the client side, you'll need to open port 12489.

Testing:
First, ensure that NSClient++ runs in test mode by opening it from the newly added Start menu shortcut.
A command prompt window should appear with NSClient++'s output.
NOTE: On our test machine, version 0.4.2 had a character encoding issue that caused some garbage characters to be displayed in the command line, but the service itself ran with no issues.

Check the output for any errors about binding to an IP or listening on a port. If you find any, ensure that no other program is listening on port 12489 and try again. Missing file warnings can be safely ignored.

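From a terminal session on your Nagios server, find the check_nt binary. Common locations are /usr/local/nagios and /usr/local/nagios/libexec.
Run the check_nt binary as follows, substituting the client's IP (if you set an NSClient password during installation, also pass it with -s):
check_nt -H (Client IP) -p 12489 -v MEMUSE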

If this outputs memory usage stats, move on to the next section.
If it outputs 'Could not fetch information from server', you'll need to double-check that port 12489 is open. This can be done with telnet:
telnet (Client IP) 12489
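
If telnet isn't installed on the Nagios server, the same reachability test can be sketched in Python using only the standard library. This is a minimal sketch rather than part of the original setup: the function name port_is_open and its default timeout are my own choices, and it only confirms that something accepts TCP connections on the port, not that NSClient++ is answering correctly.

```python
import socket


def port_is_open(host, port=12489, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling port_is_open with your client's IP returns True when the port accepts connections. A False result points to a firewall rule, a stopped NSClient++ service, or a wrong address.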

Nagios configuration:
Nagios 3 comes with some pre-configured example settings for Windows servers, located (by default) in /usr/local/nagios/etc/objects/windows.cfg
These will work fine for the purposes of this tutorial, so edit your nagios.cfg (located at /usr/local/nagios/etc/nagios.cfg by default) and uncomment the following line:
cfg_file=/usr/local/nagios/etc/objects/windows.cfg
Next, open windows.cfg
Modify the sample server definition (windows-server) with your server's host_name, alias and address.
Modify the sample service definitions to use your server's new host_name

In theory, Nagios should now be set up to monitor your Windows machine's NSClient++ version, uptime, CPU load, memory usage, drive space, explorer.exe status and W3SVC service status. However, we found that this was not the case.

Troubleshooting Nagios:
When setting up on our test machine, all of our services would return 'Could not fetch information from server' despite the (successful) testing detailed above. This turned out to be an issue with the command definition for check_nt.

If you open /etc/nagios-plugins/config/nt.cfg, you'll notice that there are two command definitions: check_nt and check_nscp. The reason for the above errors is that the check_nt command definition doesn't pass in a port parameter, and the check_nt binary uses port 1248 by default rather than the 12489 that NSClient++ listens on. check_nscp, on the other hand, has the correct definition, so there are a few possible solutions:

  1. Modify the check_nt command definition to use port 12489
  2. Modify the service definitions in windows.cfg to use check_nscp
  3. Explicitly define a custom command for each NSClient++ service

In our test machine's case, we went with option 3. Some of the check_nt commands (MEMUSE, CPULOAD, etc.) take upwards of two parameters, and the definitions for check_nt and check_nscp only pass in one and two parameters respectively. This was causing invalid parameter errors using the existing windows.cfg setup, and adding more $ARG$ entries wasn't working (presumably due to the presence of -w, -c, etc. parameter prefixes).

So to fix this, we ended up using something along the lines of the following:

nt.cfg:

define command{
    command_name    check_nt_disk
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v USEDDISKSPACE -l c -w 80 -c 90
}

define command{
    command_name    check_nt_cpu
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v CPULOAD -l 5,80,90
}

define command{
    command_name    check_nt_uptime
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v UPTIME
}

windows.cfg:

define service{
    use                     generic-service
    host_name               winserver
    service_description     C:\ Disk Space
    check_command           check_nt_disk
}

define service{
    use                     generic-service
    host_name               winserver
    service_description     CPU Load
    check_command           check_nt_cpu
}

define service{
    use                     generic-service
    host_name               winserver
    service_description     Uptime
    check_command           check_nt_uptime
}

It is worth noting that you lose the ability to parametrise your Nagios settings per server with this setup. In hindsight, you could get around this by hardcoding the parameter prefixes into the new command definitions and using Nagios' $ARG1$, $ARG2$, etc. macros to pass through the actual values, like so:

nt.cfg:

define command{
    command_name    check_nt_disk
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v USEDDISKSPACE -l $ARG1$ -w $ARG2$ -c $ARG3$
}

windows.cfg:

define service{
    use                     generic-service
    host_name               winserver
    service_description     C:\ Disk Space
    check_command           check_nt_disk!c!80!90
}
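
To see how the '!'-separated values in check_command end up in the command line, here is a small Python illustration of the substitution. It is a simplified sketch of the idea, not Nagios' actual macro processor: the function expand_check_command is hypothetical, 192.0.2.10 is a placeholder client address, and $USER1$ is deliberately left unexpanded because its value comes from your resource file.

```python
import re


def expand_check_command(check_command, command_line, host_address):
    """Expand $HOSTADDRESS$ and $ARGn$ macros, roughly as Nagios would."""
    # The command name comes first; each '!' starts a new argument.
    _name, *args = check_command.split("!")

    def substitute(match):
        index = int(match.group(1))
        # Assumption: an $ARGn$ with no matching argument becomes empty.
        return args[index - 1] if 1 <= index <= len(args) else ""

    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    return re.sub(r"\$ARG(\d+)\$", substitute, expanded)


command_line = (
    "$USER1$/check_nt -H $HOSTADDRESS$ -p 12489 "
    "-v USEDDISKSPACE -l $ARG1$ -w $ARG2$ -c $ARG3$"
)
# 192.0.2.10 is a placeholder client address.
print(expand_check_command("check_nt_disk!c!80!90", command_line, "192.0.2.10"))
```

The printed line ends in "-l c -w 80 -c 90", matching the hardcoded command from earlier, while other hosts can reuse the same command definition with a different drive letter or thresholds.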

After modifying the command and service definitions accordingly, you should be ready to go. Run 'service nagios restart' from your server's terminal session, fix any config file errors it reports, and then check the web interface. Give Nagios a few minutes to query the machine for the first time, and your services should start showing up green.

Comments

Oh that's brilliant!! I'd been bashing my head against the desk for quite awhile (using a 0.4.1xx client) before I came across this post. Client 0.4.2xxx worked a treat as did the explanation you've given here. Thanks for clarifying my understanding of what's going on with the windows client (and for saving my forehead from further damage!)
