Nagios is a very powerful monitoring tool that tests services such as SMTP, PING, and many others through plugins, so you know whether your platforms are still operational.
When something fails, it can alert you in several ways. Try it, you'll see, it's magical :-)
Configuration, on the other hand, is more complicated. Nagios is difficult to approach at first, but once you understand it you save a lot of time and can deploy new services with ease.
apache2.conf
This is the standard Debian configuration for apache2. It allows you to access Nagios via http://server/nagios3. It's up to you to modify the configuration or not:
# apache configuration for nagios 3.x
# note to users of nagios 1.x and 2.x:
#	throughout this file are commented out sections which preserve
#	backwards compatibility with bookmarks/config for older nagios versions.
#	simply look for lines following "nagios 1.x:" and "nagios 2.x" comments.

ScriptAlias /cgi-bin/nagios3 /usr/lib/cgi-bin/nagios3
ScriptAlias /nagios3/cgi-bin /usr/lib/cgi-bin/nagios3
# nagios 1.x:
#ScriptAlias /cgi-bin/nagios /usr/lib/cgi-bin/nagios3
#ScriptAlias /nagios/cgi-bin /usr/lib/cgi-bin/nagios3
# nagios 2.x:
#ScriptAlias /cgi-bin/nagios2 /usr/lib/cgi-bin/nagios3
#ScriptAlias /nagios2/cgi-bin /usr/lib/cgi-bin/nagios3

# Where the stylesheets (config files) reside
Alias /nagios3/stylesheets /etc/nagios3/stylesheets
# nagios 1.x:
#Alias /nagios/stylesheets /etc/nagios3/stylesheets
# nagios 2.x:
#Alias /nagios2/stylesheets /etc/nagios3/stylesheets

# Where the HTML pages live
Alias /nagios3 /usr/share/nagios3/htdocs
# nagios 2.x:
#Alias /nagios2 /usr/share/nagios3/htdocs
# nagios 1.x:
#Alias /nagios /usr/share/nagios3/htdocs

<DirectoryMatch (/usr/share/nagios3/htdocs|/usr/lib/cgi-bin/nagios3|/etc/nagios3/stylesheets)>
	Options FollowSymLinks

	DirectoryIndex index.php

	AllowOverride AuthConfig
	Order Allow,Deny
	Allow From All

	AuthName "Nagios Access"
	AuthType Basic
	AuthUserFile /etc/nagios3/htpasswd.users
	# nagios 1.x:
	#AuthUserFile /etc/nagios/htpasswd.users
	require valid-user
</DirectoryMatch>

# Enable this ScriptAlias if you want to enable the grouplist patch.
# See http://apan.sourceforge.net/download.html for more info
# It allows you to see a clickable list of all hostgroups in the
# left pane of the Nagios web interface
# XXX This is not tested for nagios 2.x use at your own peril
#ScriptAlias /nagios3/side.html /usr/lib/cgi-bin/nagios3/grouplist.cgi
# nagios 1.x:
#ScriptAlias /nagios/side.html /usr/lib/cgi-bin/nagios3/grouplist.cgi
cgi.cfg
I'll spare you all the comments in this file and show you the configuration I use. This is the configuration for the Nagios web interface:
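For reference, a trimmed-down cgi.cfg might look like the sketch below. The directive names are standard cgi.cfg settings, but every value (the paths, which follow the Debian nagios3 layout used above, the nagiosadmin user, and the refresh rate) is an assumption to adapt, not the author's actual file:

```cfg
# Hypothetical minimal cgi.cfg; values are assumptions
main_config_file=/etc/nagios3/nagios.cfg
physical_html_path=/usr/share/nagios3/htdocs
url_html_path=/nagios3

# Require the HTTP login configured in apache2.conf
use_authentication=1

# Users allowed to view everything and to submit commands
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
authorized_for_all_service_commands=nagiosadmin

# Refresh the status pages every 90 seconds
refresh_rate=90
```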
This file allows you to define special commands for specific checks. For example, if you have developed plugins, you will need to create an associated command:
I mentioned custom commands earlier. Here's one I created to test MySQL replication. It lets me run the replication check through the check_mysqlrep command:
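As an illustration only, a command definition of this kind could look like the sketch below. The plugin file name and its options are hypothetical (they stand in for whatever your own replication script expects); only the command_name comes from the text above:

```cfg
# Sketch of a custom command; plugin name and flags are hypothetical
define command {
    command_name  check_mysqlrep
    # Full path to the plugin, then the arguments given after "!" in check_command
    command_line  /usr/lib/nagios/plugins/check_mysqlrep.sh -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
}
```

A service would then call it with something like check_command check_mysqlrep!60!120, where the two values land in $ARG1$ and $ARG2$.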
# Generic host definition template - This is NOT a real host, just a template!
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
check_command check-host-alive
max_check_attempts 10
notification_interval 0
notification_period 24x7
notification_options d,u,r
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
_PROCWARN 150
_PROCCRIT 200
}
conf.d/generic_service.cfg
This file is used to provide the generic configuration for services, notification intervals, etc... I'll let you look at the official documentation:
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_interval 0 ; Only send notifications on status change by default.
is_volatile 0
check_period 24x7
normal_check_interval 3
retry_check_interval 1
max_check_attempts 3
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
conf.d/timeperiods.cfg
This file is used to define templates for notification periods:
###############################################################################
# timeperiods.cfg
###############################################################################
# This defines a timeperiod where all times are valid for checks,
# notifications, etc. The classic "24x7" support nightmare. :-)
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
# Here is a slightly friendlier period during work hours
define timeperiod{
timeperiod_name workhours
alias Standard Work Hours
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
}
# The complement of workhours
define timeperiod{
timeperiod_name nonworkhours
alias Non-Work Hours
sunday 00:00-24:00
monday 00:00-09:00,17:00-24:00
tuesday 00:00-09:00,17:00-24:00
wednesday 00:00-09:00,17:00-24:00
thursday 00:00-09:00,17:00-24:00
friday 00:00-09:00,17:00-24:00
saturday 00:00-24:00
}
# This one is a favorite: never :)
define timeperiod{
timeperiod_name never
alias Never
}
# end of file
conf.d/hostgroups/unix-srv.cfg
Here is an example configuration for a hostgroup. You will then simply associate a host with this hostgroup to inherit all the services described below:
# Some generic hostgroup definitions
define hostgroup {
hostgroup_name unix-srv
alias Unix servers
}
# Define a service to check the disk space of the root partition
# on the local machine. Warning if < 20% free, critical if
# < 10% free space on partition.
define service{
use generic-service
hostgroup_name unix-srv
service_description Disk Space
check_command check_nrpe!check_all_disks!20%!10%
}
# Define a service to check the number of currently logged in
# users on each host. Warning if > 2 users, critical
# if > 3 users.
define service{
use generic-service ; Name of service template to use
hostgroup_name unix-srv
service_description Current Users
check_command check_nrpe!check_users!2!3
}
# Define a service to check the number of currently running procs
# on each host. Thresholds come from the host's _PROCWARN and
# _PROCCRIT custom variables (150 and 200 in generic-host).
define service{
use generic-service ; Name of service template to use
hostgroup_name unix-srv
service_description Total Processes
check_command check_nrpe!check_procs!$_HOSTPROCWARN$!$_HOSTPROCCRIT$
}
# Check Zombie process
define service{
use generic-service ; Name of service template to use
hostgroup_name unix-srv
service_description Zombie Processes
check_command check_nrpe!check_zombie_procs!2!3
}
# Define a service to check the load on the local machine.
define service{
use generic-service ; Name of service template to use
hostgroup_name unix-srv
service_description Current Load
check_command check_nrpe!check_load!5.0,4.0,3.0!10.0,6.0,4.0
}
# check that ssh services are running
define service{
use generic-service
hostgroup_name unix-srv
service_description SSH Servers
check_command check_ssh
}
conf.d/hosts/serveur.cfg
And here I declare a host and associate unix-srv declared above so it inherits the services above:
define host{
use generic-host
host_name server.deimos.fr
alias server
address server.deimos.fr
hostgroups unix-srv
_PROCWARN 280
_PROCCRIT 350
}
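Here I use custom variables (_PROCWARN and _PROCCRIT) that override the default values set in generic-host (see the documentation for more information). In host definitions the variables are written with a leading underscore only; the HOST prefix ($_HOSTPROCWARN$ and $_HOSTPROCCRIT$) is added only where the values are used as macros, in the command or check_command declaration.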
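To make the substitution concrete, here is how the macros would resolve with the values shown in this article. This is a worked trace written as comments, not an additional file to create:

```cfg
# Host definition (custom variable, leading underscore only):
#     _PROCWARN 280
#     _PROCCRIT 350
#
# Service definition (macro form, prefixed with _HOST):
#     check_command check_nrpe!check_procs!$_HOSTPROCWARN$!$_HOSTPROCCRIT$
#
# For server.deimos.fr this resolves to:
#     check_nrpe!check_procs!280!350
#
# A host that does not set the variables inherits 150 and 200
# from the generic-host template:
#     check_nrpe!check_procs!150!200
```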
If you want a more complete configuration, I've attached an archive with a more complete Nagios3 configuration: Nagios configuration
Addons
NRPE
We need to configure the NRPE check to handle multiple arguments. Otherwise, we will be limited to just one. Edit the following file and add the necessary number of arguments:
# this command runs a program $ARG1$ with arguments $ARG2$
define command {
command_name check_nrpe
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$ $ARG8$ $ARG9$
}
# this command runs a program $ARG1$ with no arguments
define command {
command_name check_nrpe_1arg
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
Here I've gone up to 9; Nagios itself supports argument macros up to $ARG32$.
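On the monitored host, the NRPE daemon needs matching command definitions that accept those arguments. The sketch below assumes the Debian plugin path and plugin options that seemed reasonable for the services used in this article; adapt both to your setup. Note that the numbering shifts: what Nagios passes as $ARG2$ and $ARG3$ after -a arrives on the NRPE side as $ARG1$ and $ARG2$.

```cfg
# /etc/nagios/nrpe.cfg on the monitored host (sketch; paths assume Debian)
# $ARG1$ and $ARG2$ receive the values passed after -a by check_nrpe
command[check_users]=/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s Z
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e
```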
Hiding "non-OK" alerts
There is a simple way to stop displaying alerts that get in the way. If, like me, you have nearly 1500 alerts and a 107 cm screen dedicated to Nagios, some alerts take up too much space and cannot be resolved quickly (e.g., a weekly backup check that failed).
The solution goes through the web interface: acknowledge the services you no longer want to see. They will then be hidden; when they return to OK, the acknowledgement is cleared and they are displayed again.
Then in your browser URL, you'll need to modify it a bit to ask it not to display acknowledged alerts. By looking through the CGI sources, you can find this kind of information:
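As a hedged example, the URL below asks status.cgi for non-OK services that have not been acknowledged. The numeric values are bit flags as I recall them from the CGI headers (4 = WARNING, 8 = UNKNOWN, 16 = CRITICAL for servicestatustypes; 8 = unacknowledged for serviceprops), so check them against the sources of your version. The host name "server" and the /cgi-bin/nagios3 path follow the Apache configuration shown earlier:

```text
# WARNING (4) + UNKNOWN (8) + CRITICAL (16) = 28
# serviceprops=8 keeps only problems that are NOT acknowledged
http://server/cgi-bin/nagios3/status.cgi?host=all&servicestatustypes=28&serviceprops=8
```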
In some cases, you may have certain checks that temporarily store information on the Nagios server and you want to be able to execute actions from the Nagios interface. For this, there's the 'action_url' option where we can give a URL to a CGI that will execute what we want, perhaps with options.
To start, we'll create our CGI. Here's a minimalist example where I delete a temporary file:
#!/usr/bin/perl
use CGI;

$query = CGI->new();
$host = $query->param("host");

# Send the HTTP header first so that every message below is valid CGI output
print "Content-type: text/html\n\n";

# Avoid inputting special characters that would crash the program
if ($host =~ /\`|\~|\@|\#|\$|\%|\^|\&|\*|\(|\)|\:|\=|\+|\"|\'|\;|\<|\>/) {
    print "Illegal special chars detected. Exit\n";
    exit(1);
}

print "<HTML>\n";
print "<HEAD><Title>Removing $host temporary file</Title>\n";
print "<LINK REL='stylesheet' TYPE='text/css' HREF='/nagios/stylesheets/common.css'><LINK REL='stylesheet' TYPE='text/css' HREF='/nagios/stylesheets/status.css'>\n";
print "</HEAD><BODY>\n";
print "Removing $host Interface Network Flapping temporary file...";

# Delete the temporary state file if it exists and report the result
if (-f "/tmp/iface_state_$host.txt") {
    unlink("/tmp/iface_state_$host.txt") or print "FAIL<br />/tmp/iface_state_$host.txt: $!\n" and exit(1);
    print "OK\n";
} else {
    print "FAIL<br />/tmp/iface_state_$host.txt: No such file or directory\n";
}

print "</body></html>\n";
And then in the configuration of the service in question, I insert my 'action_url':
define service{
use generic-services-ulsysnet
hostgroup_name network
service_description Interface Network Flapping
check_period 24x7
notification_period 24x7
_SNMP_PORT 161
_SNMP_COMMUNITY public
_DURATION 86400
check_command check_interface_flapping
# For Thruk & Nagios
# action_url ../../cgi-bin/nagios3/remove.cgi?host=$HOSTADDRESS$
# For Nagios only
action_url remove.cgi?host=$HOSTADDRESS$
}
All you have to do now is reload Nagios.
Sending SMS alerts via Free Mobile
Thanks to Free Mobile for offering the possibility to send SMS via an API (which can be activated from the web interface of your account). How does it work? Simply create a scripts folder in the configuration folder and put the content of this script in it:
#!/bin/bash
message="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "Naemon $1: $2 on $3$4 ($5)")"
curl --insecure "https://smsapi.free-mobile.fr/sendmsg?user=<userid>&pass=<password>&msg=$message"
Adapt the 'userid' and 'password' fields with your personal settings (these credentials are available from the Free Mobile web interface). Now create the associated contacts and contactgroups:
define contact {
contact_name oncall-sms ; Short name of user
alias oncall-sms ; Full name of user
use contact-sms ; Inherit default values from the contact-sms template
}
define contactgroup {
contactgroup_name admins-sms
alias Nagios Administrators SMS
members oncall-sms
}
Add the contact template, which references the notification commands that call the script:
define contact {
name contact-sms ; The name of this contact template
host_notification_commands notify-host-by-sms ; send host notifications via SMS
host_notification_options d,u,r ; notify on DOWN, UNREACHABLE and recovery
host_notification_period 24x7 ; host notifications can be sent anytime
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
service_notification_commands notify-service-by-sms ; send service notifications via SMS
service_notification_options u,c,r ; notify on UNKNOWN, CRITICAL and recovery
service_notification_period 24x7 ; service notifications can be sent anytime
}
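The contact template above refers to notify-host-by-sms and notify-service-by-sms. A minimal sketch of those two commands could look like this. The script path and the way the five arguments are mapped onto the script's $1 to $5 are assumptions on my part; the macros themselves are standard Nagios notification macros:

```cfg
# Hypothetical script path; the five arguments fill $1..$5 of the script
define command {
    command_name  notify-host-by-sms
    # $4 is left empty for hosts, so the message reads "<type>: <state> on <host> (<output>)"
    command_line  /etc/nagios3/scripts/sms.sh "$NOTIFICATIONTYPE$" "$HOSTSTATE$" "$HOSTNAME$" "" "$HOSTOUTPUT$"
}

define command {
    command_name  notify-service-by-sms
    # $3$4 becomes "<host>/<service>" in the SMS text
    command_line  /etc/nagios3/scripts/sms.sh "$NOTIFICATIONTYPE$" "$SERVICESTATE$" "$HOSTNAME$" "/$SERVICEDESC$" "$SERVICEOUTPUT$"
}
```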
And finally the service escalation, to notify everyone by SMS if there hasn't been action taken quickly enough:
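A service escalation of that kind might be sketched as follows. The hostgroup and contact groups reuse names from this article, while the notification numbers and the interval are assumptions to tune. Keep in mind that the generic-service template shown earlier sets notification_interval to 0, which sends a single notification; an escalation starting at a later notification number presumably requires a non-zero interval on the service to ever be reached.

```cfg
# Sketch: from the 3rd notification onwards, also notify the SMS group
define serviceescalation {
    hostgroup_name         unix-srv
    service_description    *                  ; every service of the hostgroup
    first_notification     3                  ; escalation starts at notification #3
    last_notification      0                  ; 0 = keep escalating until recovery
    notification_interval  30                 ; minutes between notifications while escalated
    contact_groups         admins,admins-sms
}
```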
Of course, this is just an example, and you will certainly need to adapt it to your needs.
FAQ
No output returned from plugin
Solution 1
If you encounter this type of error, it's because you haven't activated the arguments in the NRPE configuration. In the NRPE configuration file /etc/nagios/nrpe.cfg, change this value:
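The setting in question is most likely dont_blame_nrpe, which controls whether the NRPE daemon accepts command arguments from the Nagios server. A sketch of the change, assuming your NRPE package was built with argument support:

```cfg
# /etc/nagios/nrpe.cfg on the monitored host
# 0 = refuse arguments sent with "check_nrpe -a", 1 = accept them
dont_blame_nrpe=1
```

Restart the NRPE daemon afterwards so the new value is taken into account.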
I had this problem and searched for a while before finding the solution. It comes from NRPE: in the configuration of the host in question, replace something like this:
You need to check if the mail command is present in command.cfg and especially with the correct PATH. For example with Debian, I encountered a small problem and this simple symbolic link solved the problem:
This is probably due to a permission error (plugin execution) or because you are not indicating the complete path of the executable you want to launch. Forget $USER1$, it doesn't work very well for me. After a few Nagios reloads, it starts to malfunction, so use the full path.
CHECK_NRPE: Error - Could not complete SSL handshake.
I bet it's because you don't have the permissions! A small telnet on port 5666 of your server will tell you a lot. Then modify the allowed_hosts line of nrpe.cfg and everything should be back in order :-)
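For illustration, the allowed_hosts line lists the addresses permitted to talk to the NRPE daemon. The second address below is a placeholder from the documentation range and stands for your Nagios server:

```cfg
# /etc/nagios/nrpe.cfg on the monitored host
# Comma-separated list, no spaces; 192.0.2.10 is a hypothetical Nagios server IP
allowed_hosts=127.0.0.1,192.0.2.10
```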
Monitoring WordPress
WordPress has a small peculiarity with check_http: you need to make the check follow redirects (option "follow"). Here's an example check:
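A possible way to write such a check is sketched below. The command name check_http_follow, the site name www.example.org and the service description are hypothetical; the -f follow option is the check_http redirect-following mode mentioned above:

```cfg
# Sketch: HTTP check that follows redirects (names are hypothetical)
define command {
    command_name  check_http_follow
    # -H: virtual host to request, -u: URL path, -f follow: follow redirects
    command_line  /usr/lib/nagios/plugins/check_http -H $ARG1$ -u $ARG2$ -f follow
}

define service {
    use                  generic-service
    host_name            server.deimos.fr
    service_description  WordPress
    check_command        check_http_follow!www.example.org!/
}
```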
My pending checks are still there even after a reboot
It's possible to have issues on your Nagios machine and the checks that need to be done remain active, even after a reboot. This is simply because Nagios keeps them in memory to be able to re-execute them later. To purge this queue, modify this in the Nagios configuration, restart it, then change the parameter back to 1:
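The parameter is probably use_retained_scheduling_info in the main configuration file, which tells Nagios whether to restore the scheduled check times saved before the restart. This is my assumption based on the description above, so verify it against your nagios.cfg:

```cfg
# /etc/nagios3/nagios.cfg (path assumes the Debian nagios3 layout)
# 0 = forget the saved check schedule at startup, 1 = restore it
use_retained_scheduling_info=0
```

After restarting Nagios with the value at 0, set it back to 1 and restart once more to return to the normal behavior.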