
Better abstractions for puppet & icinga/nagios/shinken
Open, Medium, Public

Description

Right now, provisioning new monitoring alert checks is unnecessarily burdensome. Moreover, it's currently very inconsistent between local & remote checks (master & NRPE), making it confusing and hard to switch from one type to the other.

We currently have:

  • monitoring::service, thin wrapper over the nagios_service native puppet resource, to be consumed by naggen on the Icinga master
  • nrpe::check, creates a config file snippet under /etc/nagios/nrpe.d on the target host. The plugin is *not* provisioned by the definition; one has to explicitly place it under /usr/local/lib/nagios/plugins and potentially write a sudo definition as well.
  • nrpe::monitor_service, thin wrapper over monitoring::service with nrpe_check + nrpe::check
  • nagios_common::check_command, copies plugins to /usr/lib/nagios/plugins/$title and creates /etc/icinga/commands/$title.cfg, but with many hardcoded assumptions that make it difficult to use outside of nagios_common::commands and the monitoring "masters" (icinga & shinken).
  • manually setting up checkcommands by writing Nagios config text to modules/nagios_common/files/check_commands/$title.cfg instead of having a native puppet definition.

Deploying a new check from the monitoring master to a target host needs:

  1. Copying the check to modules/nagios_common/files/check_commands/$title
  2. Writing a new modules/nagios_common/files/check_commands/$title.cfg by hand (or, alternatively, adding it to modules/nagios_common/files/checkcommands.cfg), which is usually just a silly, mostly unnecessary abstraction over the actual check parameters using positional arguments.
  3. Editing modules/nagios_common/manifests/commands.pp and adding it to the list
  4. Invoking monitoring::service from the role class or module with the check_command that was defined in (2).
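
For illustration, steps 2 and 4 end up looking roughly like the sketch below. The check name check_example and its thresholds are made up, and the description and check_command parameters are assumed from how monitoring::service is typically invoked; treat this as a sketch, not a copy of anything in the tree.

```puppet
# Step 2 (hand-written Nagios config, shown here only as a comment).
# $USER1$, $HOSTADDRESS$ and $ARGn$ are standard Nagios macros;
# check_example and its -w/-c options are hypothetical.
#
#   define command {
#       command_name  check_example
#       command_line  $USER1$/check_example -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
#   }

# Step 4: the caller has to know that "10" is the warning threshold and
# "20" the critical one, because arguments are passed positionally,
# separated by "!".
monitoring::service { 'example':
    description   => 'Example service health',
    check_command => 'check_example!10!20',
}
```

The caller sees neither the plugin's real options nor the meaning of each position without opening the .cfg file, which is the abstraction step 2 complains about.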

Deploying a new NRPE check requires a whole different process that involves placing the plugin under /usr/local/lib/nagios/plugins with a separate File resource & using nrpe::monitor_service. base::monitoring::host is a good example of this.

Switching a check from local to remote is a PITA. One basically has to repeat all the steps, deal with /usr/lib/nagios vs. /usr/local/lib/icinga, rewrite positional arguments into proper arguments again, make sure the NRPE check isn't exposed on the Icinga master so that the two monitoring::service resources won't clash, etc.

This is too difficult (it took me a while to fully grasp and, even when I did, it was hard to recall well enough to document above). We should abstract all this away and provide a sensible API inside our tree.

Ideas are welcome, but I vote that we should just have: a) a single definition to deploy a new check, whether it's local or remote, that would just DTRT based on Puppet dependencies; b) a single definition to use it, with a boolean parameter that specifies whether it should run locally or remotely.
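
To make the proposal concrete, here is a rough sketch of what the consumer side could look like. Everything in it is hypothetical: monitoring::plugin, monitoring::check and all of their parameters are invented names for (a) and (b), and the boolean is called nrpe only to avoid deciding here which side counts as "local".

```puppet
# HYPOTHETICAL API: neither definition exists in the tree today.

# (a) One definition to deploy the check plugin. Where the file lands
# (monitoring master, target host, or both) would be worked out from
# Puppet dependencies rather than spelled out by the caller.
monitoring::plugin { 'check_example':
    source => 'puppet:///modules/example/check_example',
}

# (b) One definition to use it, with named arguments instead of
# positional ones and a single boolean to pick the execution mode.
monitoring::check { 'example':
    description => 'Example service health',
    plugin      => 'check_example',
    arguments   => '--warning 10 --critical 20',
    nrpe        => true,  # false would run the same check from the master
}
```

With something along these lines, switching a check between the two modes would be a one-line change rather than a repeat of the whole deployment process.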

Event Timeline

faidon raised the priority of this task from to Medium.
faidon updated the task description.
faidon added projects: ops-core, observability.
faidon added subscribers: faidon, yuvipanda, akosiaris.