The History and Evolution of a Monitoring System

(Jul 2021, as of SMG v1.3)

Part 1: The world before

(2010) AWS, Django and Rails

I joined Smule as a “Ruby/Python Server Developer” in 2010.

I already had a lot of experience with most of the mainstream platforms at the time - C/C++/Java/.NET. I left a cozy Architect position in the Bulgarian branch of a multi-national Enterprise company ( still - a great one ;) ) to join a 20-person, less than 2 years old Bay Area startup, as a contractor working remotely from home in Bulgaria. That might not make whole lot of sense but it had two parts - I would work directly for a US company (and not necessarily limited to Bulgarian standards in terms of the money I could potentially make) but more importantly - I would be paid to write code in Ruby which I would otherwise do for fun :)

In any case this was all made possible thanks to a colleague from the previous company who introduced Smule to me and also me to Smule (thank you Alex Kishinevsky!), though eventually he left soon after I joined. At some point we were just two server engineers handling everything - writing new code but also maintaining the existing stuff and also deploying to AWS and verifying it works (I may have been the first Smule “DBA”). You could call us “Dev Ops” or “full stack engineers”. I should mention Michael Wang at this point (the other “server guy”) - he is one of the best programmers I know but also one of the best Linux sysadmins at the same time. I learned tons from him (and am still learning these days) and will always be grateful for the “free knowledge” I got from him :)

This was the time when Rails was becoming popular. We were targeting to build our next generation Platform and replacing our existing Python/Django based server (might be called a “monolith” these days) with several Ruby based services (not sure if the term “micro-service” was popular or even existed at the time). This was all running in AWS Linux machines but also using AWS infrastructure like ELBs, SQS (RDS did not exist yet at the time).

We didn’t have an “operations team” at all at the time. Monitoring was mostly handled by e-mail notifications generated by some scripts checking for stuff (or built into the app server systems). We did have some Munin graphs which Michael had setup at the time but the (relatively small) size of the infrastructure did not require too much automation and processes. Whoever would see an alert e-mail would jump on it.

However with the growth we were realizing that we had some real scalability issues, the main one probably being the prohibitively expensive AWS bills. Maybe we were ahead of our time back then …

(2012) Welcome to the Data Center world

Somewhere in 2011-2012 there was a change of direction - the Smule infrastructure was moving from AWS into our own data center.

Instead of using The Cloud we were becoming The Cloud (which apparently turned out to be possible to implement cost-efficiently and ultimately allowed Smule to grow and still exist to this day).

At that point it was decided that we are going to build our (another) next generation platform using Java and the (awesome) Play Framework (was v1.x at the time). Play has somewhat similar to Rails project structure (I guess not very different from most MVC frameworks) which was making it easier to port/re-implement the part of the previously Rails-served APIs we wanted to keep and continue using.

Somewhere at that time (somewhat naturally, being one the persons more familiar with all the now “legacy” systems we were running) I officially became part of the new “operations” team - lead by my current boss (and friend for life) Parker Ranney. He is the other person I have learned the most from about sysadmin stuff and how to efficiently run large scale systems.

With growth, our infrastructure was becoming large. It was no longer sustainable to have a bunch of scripts which you run against some fresh AWS host to install everything needed on it for the role you want it to serve. This was a time when “configuration management” systems like Puppet and Chef were becoming very popular. I was tasked to evaluate these and eventually convert our existing setup scripts into a more manageable “Infrastructure as code” system using one of these. Eventually we decided to use Chef (partially because the Chef recipes are technically Ruby code and it was easy for me to get up to speed).

We also needed some better/actual monitoring systems than what we had before. At a high level there could be two approaches to monitoring:

“active” - the on-call person staring at dashboards/graphs and diving into systems when something wrong shows up in the graphs. A graphing system (like the popular at the time Cacti and the already old MRTG) could do the job for that.
“passive” - instead of someone staring at screens we rely on checks and alerts to tell us that something is wrong via e-mail or phone/message. Nagios would be the choice for this type of monitoring at the time.

Of course none of these is sufficient by itself in any large and complex system and we would need both. Normally one would know what is broken from the alert but in order to identify root causes one would use the graphs (and logs of course but that is a completely different topic).

And then what would happen is that we get a Nagios alert about something, then find the corresponding/related to that alert graph(s) in the Graphing system to better understand the issue (e.g. for disk space/memory usage alerts it helps a lot know if these were growing slowly over time or spiked very quickly). I didn’t particularly like the fact that these were two different systems but that was the current state of monitoring. At that time Prometheus and Grafana did not exist yet. Graphite (a no longer very popular system) was just becoming popular but wasn’t necessarily very stable (and apparently had various issues as a lot of people were complaining from it).

This is when I became the “Chef guy” and the “Monitoring guy” around the same time. I was not necessarily very happy about the later initially. It could be a very laborious and error prone job to add and remove monitoring to our hosts and services. And then its a bit like a defender in (European) football - you can play great most of the time but a single mistake can easily render the whole effort a disaster … But I realized that with Chef, since I had the list of all Nodes and all Roles they had in a central database I could at least mostly automate myself out the laborious part of the “Monitoring guy” job.

I inherited a manually maintained Nagios configuration from a temporary contractor (I was a “permanent” one :) ) who brought up our initial set of machines in our first Data Center in Mountain View, Bay Area. I converted most of it to Chef ERB templates (where the hosts and hostgroups lists were generated from Chef instead of listed manually) and eventually the entire Nagios config (hosts and services) would be re-generated on every chef run and reflecting the current status of all Chef Nodes and Roles/Services we have brought up on them.

The graphing part was slightly trickier at the time. Cacti (probably the most popular sysadmin charting software at the time) expected you to go a UI and configure hosts and what templates to apply to them. There were some options about APIs/script-ability but it was certainly not as convenient as plain text config files generated by Chef. However MRTG on the other hand (an even older and in general - more limited system) was more like Nagios - you configure it what to poll via plain text config and it does it. Originally created for monitoring network equipment (and talking SNMP “natively”) MRTG could also be used to graph any numeric value outputted by a shell command. Technically - it could keep track and plot max 2 lines per graph (corresponding to network In/Out stats, but ppl could label them however they want). MRTG is normally run via a cron job every 5 minutes during which it is expected to complete the polling for all stats it has configured.

All MRTG objects must have an unique ID which maps directly to a filename on disk. This imposes some limits on how big these ids can be and what characters they can contain but none of these would be a big deal for a sysadmin tool. MRTG could work in two modes - either using its own original/native “log” format or RRDtool-based format. In the former mode it would also produce graphs for everything it polled where in the later mode it would not produce graphs at all. The idea would be for people to create/use their own front-end/UI in top of the rrd files. There were some open source options about such UI at the time but these seemed too limited and under-developed. For a long time I was running MRTG using its log-based format and graphing everything which had the benefit that I didn’t need any fancy UI besides MRTG’s indexmaker-generated static html pages but the huge draw-back that its very inefficient. I am pretty sure that significant % of the graphs it was generating were never seen by a human eye.

In addition to many graphs, we wanted some kind of Web UI to be able to browse them and were using MRTG’s indexmaker to generate “indexes” - HTML pages containing a selected subset of all the stats MRTG would be polling for. I would have a chef recipe+template generating the “main” HTML page, listing all “Configured Indexes”. Indexmaker itself works using filtering based on regex matches against the MRTG object ids. That may sound like a big limitation but if one has good object ids (as coded in the templates, e.g. host._hostname_.snmp.cpu_wait or host._hostname_.mysql.replication_lag) it is fairly easy to define per-host and per-service indexes as needed. So the “Configured Indexes” list was mostly automatically generated and we had an extra list of “manually defined” ones for special stuff we would explicitly care about (e.g. the switch ports connecting our Data Center with the world via a bunch of ISPs). This would be a fairly simple list in our chef-repo consisting of an “uid” (to become the index html file name), a human friendly “Title” and the regex to be passed to indexmaker.

Because all of this was either configured via config file or was generated using a command (following a config file update) that was making it quite easy to implement similar strategy as with Nagios: On every Chef run, the chef-client will dump the list of all Nodes from the chef-master and depending on their roles it would apply “MRTG templates”. I would have one base template for all Nodes (covering basic system stats like CPU/Memory/Disk/Network usage). Then depending on the roles I would add/remove one or more templates for any given hosts via chef.

With this setup being the “Monitoring guy” became less of an issue - everything was mostly automatic. If anyone asked for plotting stats from a new type of service it would be a one-time effort for me to create the appropriate Chef MRTG config template and possibly - one or two script commands wrapping the respective service native client for convenience and to extract only the numbers I would care about. Then apply that to nodes based on their (usually new type of) role in the config generation recipe and be done with it.

Side note: at some point during early development SMG would actually be used as MRTG UI/Front-end - being able to use MRTG-updated RRDtool files to display graphs from them. Of course that required me to have identical MRTG and SMG configs (and disable SMG’s internal run scheduler via the play/application conf). I had such setup for a while to validate that my SMG graphs will look the same as their MRTG counter-parts. Of course that wasn’t very practical to maintain in the long run so never done it after. Technically - its should still be possible as of today though.

Around the same time I was tasked to create (and then maintain) our “Jamon graphs” scripts - using RRDtool to update and graph stats emitted by our Java application servers in a specific json(-like) format in their logs:

Our java devs can instrument any method they care about by just annotating it with a special annotation. As a result they get stats (like hits per sec, average time, etc) for any instrumented call in the application logs every X minutes and they can inspect these later for issues/slowness etc. We wanted graphs from all of these but also aggregations - the individual stats would come from each app server separately and we also wanted an overall view from all app servers about how some API behaves across all application servers. These ended up being a lot of stats (eventually - millions with the growth) This was all written in Ruby and was running hourly against the previous hour logs. Eventually it would be unable to parse and process all previous hour stats logs from the entire fleet of app servers within an hour. This would be among the first systems to break because of increased data/processing volume due to growth at many points in time. For a while we were able to handle it with various script optimizations (and throwing beefier hardware at it).

Growing more

Somewhere in 2014 we were seeing the effects of an upcoming “Hockey stick” growth.

We were expanding our hardware footprint by a lot and already had plans to expand into multiple Data Centers around the world. My poor MRTG and Ruby+rrdtool -based systems were having troubles keeping up with the ever increasing number of stats to monitor. My every 5 minutes MRTG runs would fail to complete for 5 minutes (especially during busy times when the polled services also respond slower than normally). It was in theory possible to parallelize the MRTG runs but that would mean for me to have a more complex logic in Chef about how to split (or “shard”) all the hosts and services.

My Ruby/Jamon scripts were also approaching their limits even after switching all writes to go via rrdcached. This was a huge boost (mostly eliminating the disk I/O bottleneck) yet at some point just parsing all the logs and the aggregation of stats across the systems would be unable to process the “previous hour” logs within the hour.

This is when SMG was born.

Part 2: SMG was born (end of 2014)

… to replace MRTG

Smule has this awesome tradition to run internal company “hackathons” (as we call them “Smackathons”). The company would schedule a couple of days where everyone can work on any idea/project they want to (alone or within a team) and the “best” projects as per jury and popular votes would get awards.

Somewhere around that time I started reading about Scala and liked what I read - Scala has a lot in common with Ruby (which at least from my perspective is to be able to quickly code stuff but also enjoy doing it) but also one huge benefit - it is just as efficient as Java is and Ruby is unfortunately nowhere near that. SMG (still unnamed at the time) was one of two projects I had in mind for the Smackathon and was going to use Scala (and the Play framework) to learn the language. Ultimately I ended up doing the other project so SMG was the second ever project I started in Scala :)

Here is the initial rough list of requirements/features I had in mind when I started with the Smule Grapher (a.k.a. SMG)

Needed to have a transition plan

from the MRTG-based system and ideally - preserve history.

Decided to follow my MRTG-based strategy about unique and meaningful object ids - e.g. class.host.service.metric and indexes based on these object ids. I had full control over the ids via the templates so would use ones which are meaningful enough (and often - don’t need a human-friendly title) but also easy to filter using simple prefix/suffix and regex filters. With good object IDs structure SMG will build a meaningful “automatic indexes”. Considering the dot a “level separator” SMG can build automatic Index Trees - the top level ones would contain the first part of the ID (“class” in the above example) and under each top-level tree node there will be sub-trees built the same way. In the class.host… example that would be an index for each “host” under the top-level “class” index.
Plain text configs defining everything which needs to be polled (and then rendered/displayed on demand). Easy to generate from Chef using its ERB templates.
Using arbitrary external (bash) commands to get data from monitoring services, in MRTG-script compatible format. I was planning to convert my MRTG Chef ERB templates to SMG ones one-to-one, at least initially.
easy to add “one-off” graphs for random stuff via $include and “drop in” directories. Adding a new graph would be as simple as adding the following config:
```
  - id: network.external.google.resp-time
    # title: Response time from google.com
    command: curl -o /dev/null -f -w '%{time_total}\n' google.com
    vars:
      - label: total_time
        mu: s
```
This defines a single object with id network.external.google.resp-time, the corresponding curl command which will output its response time in seconds and in the vars array defines a single “variable” which will be tracked. That variable will have a label “total_time” when displayed in the graph and the measurement unit (mu) of “s” (from seconds) which will be appended to the “reference” (avg/max/etc) values displayed in the graphs legends. This is a “RRD Object” and can have many additional properties (like e.g. the commented out title) to fine tune its behavior. This object does not define its interval so the default (60 seconds unless changed globally) will be used.

These objects would (optionally) support pretty much all features MRTG config objects would support but also extra features only available in RRDtool (and not necessarily MRTG alone). For example it is common (the “norm”) for NetOps teams to measure traffic bandwidth in “bits per second” however the Kernel (and thus SNMP in modern systems) will report the traffic counters in bytes. So in MRTG one would use a special “bits” config value telling it to multiply by 8 but in SMG (and RRDtool) that could be achieved with a feature called CDEFs where an arbitrary RPN expression is applied to the stored in the rrd file value and the result is actually what is graphed (for the “bits” case the SMG expression looks like "cdef: $ds,8,*").
I chose yaml because I find it easy to write. At the time it was far from the popularity it has today (thanks to docker and kubernetes but these technologies were yet to become popular at the time)
RRDtool would allow me to import our history easily and is doing nice (and accurate) graphs so that made it the primary candidate for the “database backend”. Being the core of the SMG graphing system it deserves its own section.

RRDtool

RRDtool has some very nice features for a time series database:

Fixed file sizes - one does not need to worry about increased disk usage with RRDtool files as data accumulates with time. The files are allocated in their full size initially and data is stored in the so-called Round Robin Archives (RRAs). The relatively small disk footprint allows for easy regular backups.
Automatically aggregating older data over larger periods (losing some granularity in the process) but being able to keep track of the trends over very long periods (like years) using very limited disk space.
These RRAs are defined using special syntax (check rrdtool docs for details) and SMG has some built-in logic to define some sane default RRAs (which tend to depend on the object update interval). For example the default SMG RRA for the “60 seconds” (every minute) interval would be something like this:
- keep 4 days worth of data at the max resolution (every minute data points)
- keep 8 days at 5-minute average granularity
- keep 4 weeks at 30-minute average granularity
- keep 120 days at 2h-average granularity
- keep 2 years worth of data at 6h-average granularity
- keep 4 years worth of data at 24h-average granularity
Such RRA results in about 121K RRD file if it holds just one variable but all of these are configurable using custom RRA defs in the SMG config. People can trade-off space usage for data granularity any way they want in this way.
Native support for Counters, including properly handling integer overflows. A lot of system (and other) stats come as counters - ever incremented values where the current absolute value is not necessarily meaningful but having two values (v0, v1) and the time interval (dT) between their measurement it is easy to calculate the rate at which the value has changed during that time interval ((v1 - v0) / dT). Counters are generally easy and cheap to maintain on the update side and commonly used for monitoring (e.g. in the Linux Kernel). But an ever increasing 32 bit counter will soon or later overflow the 2^32 barrier and instead of increasing it will become a small number (from where to start increasing again). RRDtool if configured properly will detect these overflows and can derive the correct value considering the previous (huge) value and the integer overflow.
RRDtool can produce graph images (PNGs) on its own. These are relatively small and although might be somewhat old-school for people used to heavyweight JS-based charts these days they have some benefits like easily copying and pasting an image in Slack/Gmail etc. Also we often want to see many graphs on the same page - having the browser display 100s PNG images in one page is no big deal for the modern browsers but try plotting 100s independent graphs using JS in the same page and your browser will be struggling with that.

RRDtool (and thus - SMG) will use SI prefixes on the measurement units when displaying very large (or very small) numbers which would otherwise require lots of digits to display in full within the somewhat limited image space.
Efficient updates using rrdcached - rrdcached is a background service (coming with RRDtool) which when enabled will listen on a socket and accept writes from rrdtool. The benefit is that rrdcached does not write immediately to disk but will batch/aggregate multiple writes for given time period into one write. This behavior is configurable but can greatly reduce the Disk I/O requirements for running a large scale SMG instance.

Note that even with rrdcached if you do all updates using commands like rrdtool –socket … update … there will still be a certain bottleneck/limit about how many you can do in a given minute - forking a new process to do every db write is not necessarily the most efficient way to do db writes. That wasn’t a problem for SMG for a long time (the polling of remote services would always be the bottleneck) but eventually (much later) I changed SMG to be able to use the rrdcached protocol directly and also batch the updates itself before sending them to the rrdcached socket. With that I am yet to see the db writes to become a bottleneck.

All of the above would allow me to trivially translate my MRTG-based Chef recipes to the new format as they would be mostly (logically) compatible. But the goal was to create something better than my MRTG -based system so I had a few more in mind:

Support shorter polling intervals and also support multiple intervals within the same system

We wanted most of our graphs to update every minute instead of every 5 minutes as it was the case before.

This implied that I would have to run multiple polling commands in parallel using dedicated JVM thread pools running external commands first to fetch the data from the polled service and then use the numeric output to update the RRD files using rrdtool. The Play framework (which we were already using for our application servers) is built on top of Akka - an awesome Actor-based system for handling concurrency and safely deal with multi-threading challenges. Scala is actually the primary language for the Play framework so it makes a lot of stuff actually easier than using Java. So that made it a natural choice for me to build SMG on (and never had regrets about that since then).

While we normally wanted most of the stats polled as often as it is feasible (like - at the time - every minute), in some cases stats simply come with their own different/specific time intervals. For example with some log based statistics (and hourly log rotation) it would make sense to run the poller every hour (vs every minute) for these specific stats.

With our MRTG-based system I had to keep separate config for “hourly” graphs (to pass to the hourly poller) but that had a draw-back, I could not use indexmaker to produce an Index including both every-5minutes graphs and hourly graphs on the same page. So I wanted SMG to support different polling intervals for its objects but be able to display them together within the same UI page(s). I also wanted that the (possibly slower) hourly run commands to not impact/slow down the every-minute polling so would use separate (configurable) thread pool for each unique interval defined via an object in the SMG config. That way the hourly polling run thread pool could be throttled by just giving it a very few threads.

Config reload

Since my graphing system config would be somewhat dynamic (mostly - generated by Chef) I wanted to have a simple way to notify SMG from Chef that the config has changed and it needs to be reloaded and without the need of a full restart. SMG itself will parse the entire config into an immutable structure. Having an immutable config greatly simplifies any thread safety issues around access and modifications (its immutable so - no modifications). So on a config reload SMG would parse the entire disk config from scratch, generating a new immutable version of the config and then atomically replace the global-like reference to the “current” config.

More than two lines per graph

MRTG (being a network router monitoring tool) has this annoying limitation of max two (in/out) variables per graph. RRDtool itself doesn’t have such limitation so that one was not a big deal.

However it was actually a good deal to be able to replicate the popular at the time Cacti “templates” in SMG - for example there was a standard “Linux host” template showing the most relevant Kernel stats for a given system and ordered in a meaningful way. Like a CPU usage graph showing all “types” (user/wait/idle/…) of CPU usages stacked on top of each other where their sum would always be “100% x number of CPU cores”.

Later on I also replicated some MySQL Cacti templates in SMG, which ultimately convinced our DBAs to abandon the Cacti instance which they were running themselves (but had to add/remove hosts manually, was OK until the number of databases/shards exploded with growth).

And people would be familiar with these templates and seeing them ported into SMG would make them like it :)

For me personally (never being a “professional sysadmin” before Smule) was a great learning experience to replicate these and/or create new templates for new types of service in SMG - the truth is that one needs to understand the service being monitored in order to confidently monitor it.

Multi - Data Center support

At this point we were already expanding our operation into multiple DCs across the world. I wanted my system to be able to provide a single Central UI from where to be able to browse the graphs from all data centers.

A naive approach would be to try to use a single instance in one of the DCs and monitor everything from there. That has multitude of issues including slowness (to poll for stuff across the world) and scalability issues - no matter how efficient a monitoring system is it would have a certain limit of the number of target servers it can poll every X seconds.

Instead SMG exposes almost all “read” operations available in the UI via an internal JSON-based APIs. Each Data Center instance would cover its local Nodes/Services (and managed by the per-dc Chef instance). Then the Central UI would have the other DC instance(s) configured as “remotes” and will pull and cache their configuration locally. The remote/“worker” instances have the “central” one configured as a remote too but with a flag that it is a worker instance. In that case when the remote DC instance gets its config reloaded it will notify the “central” instance about the event and the central instance will refresh its cache by pulling the conf again.

To avoid potential object id conflicts across the instances from multiple locations, all the remote objects are referenced in the local/central UI instance with their original (local) ids, prefixed with a “@_remote-id_.” prefix.

I decided to stick to an old-school request/response UI with server side generated HTML (using Play’s Twirl templates). SMG was going to be a sysadmin tool where the pattern would be that a relatively few sysadmins would normally be browsing a relatively small subset of potentially millions of time series. The polling/update part was for sure more challenging and I decided to avoid jumping the “pure JavaScript” UI bandwagon to avoid myself complications. Or in other words - the (somewhat arguably) less efficient on the server side UI would be much less likely to kill SMG than the polling/updates part.

In addition I wanted share-able UI views. If I was seeing something interesting in some Dashboard view I wanted to be able to show it to others by just copy/pasting the current URL from the browser in a chat channel or an e-mail and they should see the same as they open it.

At a high level the SMG “Dashboard” page (/dash with no params) would display a “Filter + Graph options” form and the “first page” of the (possibly very big) list of all the graphs defined in config and in the order they were defined. The display is in the form of a grid of graphs with some small amount of info above each. People would use the filter to select what subset (from all) graphs they want to see, submit the form and get the result. The filter form params would translate to GET params making the resulting URL fully share-able.

The actual filter is essentially regex based (like the mentioned MRTG/indexmaker filters) but supports some extra “syntactic sugar” filters like prefix and suffix (not a regex so dots wouldn’t be special and match any char). There is also a “regex exclude” filter available making it easy to write filters like “everything matching this but except some special stuff”. The filter would allow the end-user to set the page “rows” and “cols” which would define the grid and ultimately the page size. Obviously the larger the page size the slower it would be to generate that many graphs and then for the browser to load them. The default SMG page is defined as “6 cols” (if you have wide enough screen …) and “10 rows” meaning 60 graphs which is a good trade-off between usability/speed and the number of items you can gasp with a single page scroll. Of course - this is all a matter of taste and configurable.

Of course one of the most important options when displaying graphs would be the period you want to look at. One observation I had was that in 99% of the cases we would be looking at graphs starting at some time ago and ending “now”. I didn’t want to bother with fancy JS-based date/time pickers and decided that I will use suffixed numbers (which rrdtool already supports) representing the “Period since” with a default value of 24h. I would support suffixes like 1y, 5m, 2w, 24h, 50M, 60 (no suffix - seconds). For the rare cases when e.g. I actually wanted to look at say 24h worth of data but from a week ago I would add the “Period length” option supporting the same suffixed value. So I could set a “Period since” = 1w and “Period length” = 24h and achieve what I wanted in the example.

There is also an additional caveat when dealing with old data and RRD files - old data is aggregated into lower resolution data points. So if you try to see a day worth of data from an year ago you may only see 4 (6-hour-average) or even just 1 (24-hour-average) data point, depending on the RRDs/RRAs structure.

A side note: At some point (later) I was working with a vendor to evaluate some software which could potentially replace our own open-source based home-grown solution. The vendor solution did not ultimately win the evaluation (wasn’t much worse but couldn’t beat our own existing one) and their engineers wanted to see some system stats from the machines used in the evaluation. We couldn’t allow them direct access to our data center but the fact that SMG is using server-side generated HTML, PNG files for graphs and little to no JavaScript for rendering meant that I could do a simple browser “Save page as …” action, send the resulting html file + assets dir as a tarball to the vendor engineers and they could open it and see it pretty much the same way it is visible in the SMG UI. That would be quite challenging to achieve with a pure JS UI -based solution.

A secondary side note (and disclaimer) - do verify the source of the files you would send that way from SMG to a third-party. I haven’t tested it in that way for a while (and it wouldn’t have such issues at the time) but do make sure you are not accidentally leaking some sensitive info that way.

The same form properties (and also http params) would all map to properties in the SMG Index objects. So if anyone would come up with an interesting/useful filter/link I could quickly convert that to a yaml-format Index definition, which might be as simple as like this:

- id: ^hosts._my_host_
  title: "_my_host_ graphs"
  px: "host._my_host_."

I would follow a convention where only (and all) host -related stats will have an id starting with the “host._hostname_.” prefix. Given that no two hosts (Chef nodes) would share the same hostname the above effectively defines a “per-host” Index page.

Then I would add it to my Chef-managed SMG yaml templates and it would be deployed for me to SMG and made available in the “Configured Indexes” view for future use. That “Configured Indexes” view can actually be considered the “main” SMG page and this is what opens up if you open the root url of a SMG instance. As mentioned my old MRTG-based system already had lots of indexes - for each host, for every role etc and these would be mostly a list of regex filters. I simply changed my Chef code to output SMG config Index objects with the same regex filters instead of running the indexmaker commands and had everything ported over.

I also decided that I want some more structure in my “main” page so added a special property to Indexes - a parent Index. So the “Configured Indexes” page actually displays the “top level” indexes (ones which do not have a parent) and groups any “child” indexes under their parents. The page can display more than one level in addition to the top-level but that can be unpractical with too many indexes - the page might become huge.

Graph aliases (View Objects)

I had a use case to have a UI page with some selected graphs out of a large group of otherwise generated by chef and/or a script objects where I didn’t always have full control on the object ids (e.g. the ids for the ports of a network router device). These would look like network.edge01.ethXX (where XX would be some number). I could have a regex filter listing all ids I care about like (network.edge01.ethXX|network.edge01.ethYY|…) but that could end up being a very long regex.

So I decided that it makes sense to implement a special new type of Graph objects. These wouldn’t have their own interval and command to fetch data but would instead reference another RRD Object effectively being an alias (an additional name). In the above case I could alias the network.edge01.ethXX to something like netext.edge01.vendor_name and then after defining such aliases for all ports/vendors I care about I would define a simple prefix-based filter based on the aliases prefix (“px: netext.” in the example).

These objects would be “free” on the update side but on the display side in addition to being aliases simplifying filtering these could apply transformations to the referenced RRD Update object values - including only using subset of the available vars, re-ordering them or even applying some arithmetic calculation to them using cdef/RPN expressions.

Aggregate functions, grouping and sorting

We used to have some MRTG stats computed from the outputs of others. That involved some ugly hacks using temp disk files for every polling run. I wanted SMG to be able to do better and I knew that with RRDtool and RPN expressions I could have graphs based on the output of pretty much arbitrary arithmetic calculation involving existing RRD objects values.

A common use case was to be able to see the sum of traffic from all of our “External” (internet-facing) links in one graph. A sudden drop in this graph would mean that something is likely very wrong. A sudden spike would also be worth checking as these rarely come up just because of organic growth. But the nature of these individual links is such that traffic would sometimes switch from one to another provider (usually because of issues outside us). So its not uncommon to have a sudden drop or spike in traffic on any given link. As long as the “SUM” graph of those is not showing issues I knew our users are there and using the app.

Eventually we wanted such aggregations to be possible across the graphs from multiple (remote DC) SMG instances (the “Cross-remote” UI checkbox). The way this works is not extremely efficient - SMG will download all remote RRDs needed for the aggregation and apply it locally for display. This can be slow if too many RRD files are involved but that was a rare case and few tens of 100-200K RRD files wouldn’t be a problem to download. Note that “local” (vs Cross-remote) aggregations happen locally on the SMG instances and are much faster and more efficient. Reality is that the later (local aggregations) is a much more common use case so the Cross-remote aggregation was never a performance problem.

So I implemented a bunch of “aggregate functions” in SMG UI:

GROUP - display the grouped lines all in one graph
STACK - same as GROUP but have the graphs stacked on top of each other.
SUM - display the sum(s) of all graphs of the aggregated together graphs
SUMN - same as SUM but treat NaN values as 0.0 (with SUM any NaN value will result in a gap in the graph). This was useful in the context of the mentioned use case with the External network links. If we added a new provider with no history data the SUM of all network links would show no history (would be all NaNs until the last provider was added). This is where one could use SUMN which would treat the missing provider data as 0s and correctly show history. One might ask what is the SUM use case then which would be a good question. The fact is that when you want to watch a graph in “real time” and possibly refreshing it like crazy you will hit a situation where only some of the summ-ed objects will have a value for the “current”/last timestamp and some of them will be yet to be updated. In that case the ones yet to be updated would report a NaN and the if treated as 0.0 it would show up as a sudden drop in the last data point (always scary to see …). The correct behavior is to use SUM and see no data for the last data point (until its actually available). So the rule of thumb was (and still is) - use SUMN for “historical data” over long periods (where the last data point is likely some longer-than the update interval average value) and use SUM for everything else.
AVG - that’s similar to SUM but the resulting (sum) value would be divided by the count of involved objects effectively calculating average(s)
MAX - plot the max value from all the aggregated graphs
MIN - plot the min value from all the aggregated graphs

The way this worked initially was to only group together graphs with identical “var definitions” - essentially the same number of vars (yaml maps) where the yaml maps would have to be identical (normally - this meant that they were defined from the same Chef template definition). Later this was extended to allow me to have more options on how to group the displayed graphs, including object id suffix/prefix and (since very recently) object labels/tags. The available types of grouping are listed in the “Group By” drop down selection box.

This grouping can be used not only to produce merged/aggregate graphs from others but to also simply re-order them in groups and then optionally - Sort them by values (within the groups).

It is worth mentioning that all of the UI aggregate functions work only over the currently displayed objects (or “current page”). The reason for that is to avoid users accidentally trying to group/sort/aggregate all (possibly millions) defined objects which otherwise match the UI filter. In short this means that in order for these to work as expected all of the involved in the grouping graphs must be visible in the first UI page. If you try to use them on a multi-page filter result SMG will ignore the invisible on the current page objects, still do the aggregation but also display an error message telling you that its probably wrong. The way around this is to simply increase the page size via the rows/cols filter props and get everything you wanted aggregated on one page. Rationale was that if SMG can generate that many graphs and your browser would be able to display them all within one UI request - it would be safe to tell SMG to actually try to aggregate them. In the sort by value case involves reading from all involved RRDs separately from generating graphs from them and with cross-remote aggregations all of these would have to be downloaded first.

This (the grouping/sorting) is an area where SMG can be considered under-developed even to this day but I never bothered to make it more advanced (and complex) because the available options would cover pretty much all use cases I had. Some common examples include grouping all graphs for hosts in a role buy their “standard” system stats next to each other to spot misbehaving ones. Similarly - with the stats from multiple identical Haproxy/LBs or MySQL/Db server shards - easy to see outliers once grouped together. Another use case would be to sum together the CPU usages from all systems in given role - its fairly easy to see how much capacity one has by looking at the green (“cpu idle”) area in the stacked sum.

Aggregate Update objects

Having these Aggregate functions (and “Aggregate Object Views”) was nice and solved a lot of the use cases we had but also had some issues. With the example of the External network graphs sometimes we would turn off and remove a provider. If I removed the object that would change the SUM history picture (the traffic previously on that link would be subtracted). I could in theory keep these objects forever (updating them with 0s) just for the sake of keeping history but they would still pollute my non-aggregate “External network” view with empty graphs.

There was another use case with similar issue - we run multiple Haproxy LBs in various points of our infrastructure. Often there would be a bunch of these running identical configs and traffic would be somehow balanced across them. We did want the SUM of these stats from any group of such identical LBs. The same issue would show up with these - if we remove/replace some instance, the history would be messed up.

My old MRTG-based system did not have these “display-time” aggregations available at all but I was able to work around that by using ugly hacks where in addition to outputting the individual stat the MRTG script would also output it in a temp file. Later in the polling run (and after the individual nodes polls had completed) I would run a special command which would know which objects (temp files) to aggregate and that would read all of them and output the aggregated value.

I could hack something like that using SMG’s external commands too but decided that this use case is common enough to deserve its own type of objects - “Aggregate RRD Objects”. These are defined in SMG config as all other SMG objects and their object id would start with the “+” char (otherwise disallowed in object ids). Instead of a command to execute every update interval these define a list of regular RRD objects and an operation (one of the Aggregate Function above) to use to combine their “last” values on every run.

In order for this to work Aggregate objects are always updated at the end of the polling run (in a separate “run stage”) and after all individual included objects had their values fetched and updated in their own RRD files (and the values cached in SMG memory for such aggregation purposes)

Later this concept was extended to support a special RPN agg opp and the resulting aggregate object values could be calculated from any other “last” values using arbitrary arithmetic expressions.

Strict about “overlapping runs”

I wanted the polling always to be able to complete within the interval time, i.e. to avoid overlapping runs of the interval pollers. Having overlapping runs would mean that I am not meeting my “Service Level Objectives” about my system polling for values every minute. It would be an indicator that it needed to be scaled - whether horizontally and/or via optimizations or ultimately - by somehow logically splitting the Data Center into two (or more) parts in the Chef recipe generating the configs, likely based on roles.

Strict timeouts

With monitoring its not uncommon for a service being polled to be unresponsive/slow and client connections to get “stuck”. We don’t want our poller to stay blocked on an external command forever and we want to enforce strict timeouts. So every command defined in SMG config has a timeout associated with it (half the interval unless explicitly set via “timeout” setting).

Initially I had that implemented with a Java-based timeout which was supposed to kill the polling child process once hit. However I noticed that this wasn’t always the case and I would see “stuck” poll command processes piling up forever. For whatever reason Java (“8”, less than an year old at the time) was leaking child processes … at that point I decided to wrap/prepend all external bash commands SMG runs with the GNU timeout command. For example the above example curl command would actually become “timeout 30 curl -o /dev/null -s -f -w ‘%{time_total}’ https://www.google.com”. Although this adds an extra fork per polling command it proved to be extremely robust - I have never seen GNU timeout failing to terminate misbehaving child processes.

Period-over-period

There was an (already old at the time) Facebook developers presentation about their operations and monitoring stuff (which unfortunately I am having trouble to find now). It was talking about how the traffic would follow certain daily/weekly patterns following the natural human cycles. People don’t use our apps when asleep but might use them more over the weekend compared to working days, obvious stuff like that.

This was certainly the case for Smule - I was already familiar with our “hat”-like (or “boa digesting an elephant” from the Little Prince if you want) shape of a lot of our 24-hours graphs.

So the Facebook folks were suggesting to somehow always overlay your “previous period” when looking at charts for given “human cycle” period (like day(s), week(s) etc). SMG does that by default and in almost all charts (except for STACK) will display the “previous” period with dotted lines together with the requested period charts. It becomes very obvious when you have a stat “drop” or a “spike” but also smaller drops or increases. SMG does have a check-box to disable the PeriodOverPeriod lines but in reality the only times I had to use that option were when wanting to share images with third-party vendors (and wouldn’t want to have to explain to them what the dotted lines mean …)

The Plugin system

As mentioned above one of the systems constantly bugging me as the “Monitoring guy” would be our application servers instrumentation stats or our “Jamon” system. My ruby scripts handling all the parsing and updates were approaching their limits - I had to do something about these.

However these would not fit the standard “run external command”/“update rrd file with output” polling cycle SMG would be using. The stats would have to be parsed from the previous hour logs and the rrd files needed to be updated following the timeline (and with the timestamps) of their generation. Also I wouldn’t always know the objects (coming from the logs) beforehand - new ones would simply appear there, i.e. these could be quite dynamic.

There was also another use case which didn’t fit the “run external command”/“update rrd file with output” cycle and that is JMX monitoring. While it is possible to implement fetching JMX values using external command it would not be very efficient - the establishment of the JMX connection tends to be orders of magnitude slower than fetching the values. It would be much more efficient to somehow persist the JMX connections between the fetch calls.

Both of these wouldn’t fit the (IMHO at the time - neat) concept of simply running commands to get data for updates. At that point SMG was already getting close to stable in its goal of replacing my original MRTG-based system and I didn’t want to pollute the code too much with new stuff and risk breaking it.

I decided that there will likely be other similar use cases in the future so I would make SMG extensible using a plugin system.

The SMG core defines a Plugin interface (a Scala trait) which could expose its own custom list of Graph objects, Indexes etc, similar to the regular SMG objects defined in yaml. Plugins can reference and use classes form the Core SMG system but the Core SMG system does not know anything about plugins besides their names and only interacts with plugins using the Plugin trait.

All plugins (Scala classes) would have a standard constructor accepting the plugin string “id” (to be used as a name/reference by SMG), the plugin “interval”, the “plugin config file” (a string) and a reference to the internal SMG “Config” service. All available plugins are listed in the Play application conf as a list of

{ id = ..., class = "com.smule.smgplugins....", interval = ..., config = "..." }

objects. Plugins would be instantiated using standard Java reflection means (Class.forName(…).getConstructor(…).getInstance(…)).

At startup time SMG will go over that list and load all plugins where the file pointed by the “config” value exists (it doesn’t care about its contents - it would be up to the plugin to parse and make sense of it). That would allow me to have a single unified SMG build (with all available plugins bundled) but only actually enable a subset of them and without modifying application.conf. I.e. I could have these managed by chef and based on the role my SMG server was having. I was planning to run the Jamon stuff on dedicated machine anyway (there was no way I could co-host it with the system stats) and wanted an easy way to turn on/off that functionality via the Chef-managed /etc/smg dir.

Plugins can implement a run() method and the SMG scheduler will call it every “interval” seconds, along with the regular poll/update runs. Inside this run() method there could be arbitrary logic to somehow fetch data and update rrd files based on that logic. But anything else too …

Plugins would also be subscribed as “config reload listeners” so whenever a Config reload happened these would be notified to potentially re-read their own conf files.

Initially that was sufficient for me to implement the replacement of my Ruby/RRDtool-based scripts and finally had some headroom in the capacity of the system. I was also able to implement the fist version of the JMX plugin using the same framework. That was later rewritten to use a different approach but at the time it was doing the job (albeit somewhat ugly code-wise).

Over time the Plugin interface was expanded providing even more ways to extend SMG (including display plugins, monitor check plugins, “custom commands” plugins and what not). In a lot of cases later I would resort to implementing some new feature as a Plugin initially (and possibly - somewhat quick-and-dirty), greatly limiting the risk of breaking existing functionality (which would be running in production). If I liked the result I could decide to actually merge the new feature in the core codebase (or just clean it up and leave it as an enabled by default plugin, if that makes sense).

Later (when SMG was open sourced) this approach yielded additional benefits - for example I didn’t want to open-source my Jamon plugin. It would be way too Smule-specific (and worse - possibly leak some info about our systems I wouldn’t want to make public). It would have been a likely significant effort for me to separate that part if it was built as part of the SMG core. Because it was a segregated plugin I could keep it outside the open source repo and have a custom build process which would build the open source version, then build my proprietary plugins (depending on the open source build) and then bundle all of these in a package suitable for Smule internal usage.

The JSGraph and Calc plugins

The Plugin UI interface would be simple - Every Plugin would have its own menu item and the plugin would “own” the HTML content of its page view, in the form of a string which the SMG Core retrieves on user request. This would work via the Plugin Trait htmlContent method which would also accept the http parameters passed so I could implement the full http request/response handling for the plugin UI URL entirely within the Plugin.

In addition Plugins could expose PluginActions which would be applicable to individual graphs and SMG will display these actions as a small links above the graphs in the meta-data section. When clicked the action and the object it was clicked on would be passed to the appropriate plugin which would know how to handle it.

With these two I could implement the “Zoom” action. PNGs are great for efficiency but sometimes it is nice to be able to have interactive JS graph with the exact numbers for specific timestamps to show up on hover. So I decided to create the JSGraph plugin - it would expose the “Zoom” action to any Graph object.

The plugin itself would expose a single JavaScript based HTML page in its htmlContent method. The JavaScript on the page would “know” the object id and its csv fetch URL/API and could get the timeseries from the server and plot them however it likes. The Zoom action would simply plot the time series in line charts and show some metadata around that.

But with that I realized that I could do some more extra actions based on the timeseries data. One example is the “Histogram” action - you can get the values grouped in (user defined) buckets and see the histogram of entries for these buckets.

Later I also added one more action which is quite useful - the “Deriv” action. That would plot the “First Derivative” graph of the time series. The way it works is quite simple, its almost the same as the Zoom action but instead of plotting the timeseries as retrieved from the csv fetch API it will first calculate the (v2 - v1) / (t2 - t1) derivative values for every two adjacent data points and plot these numbers instead. Note that this transformation can be applied more than once, e.g. applying it twice will yield the “Second Derivative” function and so on. The plugin actually supports that but it is too rare use case and I decided to not expose it as a PluginAction so currently only “Deriv” action is visible by default.

The first version of that plugin was using Highcharts which is a nice library but unfortunately - not an open source one, so I couldn’t bundle it and redistribute it with SMG later when open sourcing. At that point I rewrote it using the JS version of the plot.ly library and this is what is used to this day.

The other very useful Plugin I implemented back then was the “Calc” plugin. I already had most of the aggregations we would need covered using the mentioned one-click aggregate functions. But there would be cases where I wanted to compute arbitrary complex expressions from a bunch of graphs. RRDtool supports arbitrary RPN expressions and I was considering to simply expose UI for people to write these. Yet they are convenient for computers to process but not very convenient for humans to write and understand. So I decided that I can actually make the UI accept human-friendly arithmetic expressions consisting of graph object ids, numbers, arithmetic operations (+-/*) and parentheses for ensuring operators precedence. I decided to leave (computer languages style) operator precedence for a future enhancement and simply apply the operations left to right but starting with the innermost parentheses and processing “outwards”. That type of processing actually also makes it very easy to translate the human-friendly expression to a RRDtool-friendly RPN Expression. Because an object id can actually map to more than one variable (timeseries) the 0-based index of that variable can be specified as a number in square brackets after the object id. If omitted - the first ([0]) variable will be used.

As an example the expression to calculate the cache hit % (or cache efficiency) for given CDN can be calculated based on the egress (user) traffic and the ingress (origin) traffic via a formula like this:

(1 - ( cdn._provider_.ingress[0] / cdn._provider_.egress[0] )) * 100

That would translate to the following RPN Expression:

1,cdn._provider_.ingress[0],cdn._provider_.egress[0],/,-,100,*

So yeah - obviously its easier to write human-friendly expressions :)

The Calc Plugin UI is far from great and has always been considered “Beta” (albeit being one of the oldest plugins implemented). From one perspective - it is a bit rough around the edges but it does the job it is supposed to do and the use cases for it are not that frequent. If I ever decide to make it much more user-friendly I will likely move that functionality in the SMG core vs trying to work around the Plugin Trait interface.

The JSGraph plugin on the other hand is unlikely to move to the SMG Core. It is certainly subject to multiple different implementations which can be based on different JS libraries and via the PluginActions it is already reasonably integrated with the core UI.

“Inspect”

One of the first features I implemented in my system was to be able to “inspect” the individual objects internal representation/state.

Putting too much structure into that would make me have to change it whenever I changed something about objects (quite often especially in the beginning). So most of my objects would implement an “inspect” method dumping its internal state into a string. Separately, I could use Scala’s toString method on my “case class” config objects for that too.

The actual inspect page would simply display a bunch of such “inspect” strings reflecting the object but later - its parent object(s), alert and notifications configs etc.

To this day this is the most under-developed feature in SMG but some day I may decide to make it more user-friendly. Technically there should be no valid use cases to check that besides debugging and all the info is already present in some form in other parts of the UI so not really a high priority ;)

Nagios checks for errors in the Graphing system

SMG would be a new thing, no one would trust it to be a replacement of the existing and battle-tested (albeit old-school) MRTG -based system from day 0. So I was quite diligent in logging any unexpected/error conditions in the SMG logs (and avoid logging too much under normal/“everything works” situations). Then I had a Nagios check which would check these logs every minute for any ERRORs and alert us that something was not right. That would include any (not necessarily unexpected) polling failures. Soon after I realized that whenever some service goes down (and we get the due Nagios alerts) we would also get a SMG alert that the polling for graphing is failing. Not necessarily great but at least initially this was helping reinforce my trust in the system.

The actual migration took a while - I had my new SMG -based graphing system running in parallel with the existing MRTG-based one for months (fixing all bugs discovered along the way) before eventually gaining enough confidence in it to ditch the old MRTG-based system.

Part 3: Checks, Alerts and Notifications - replacing Nagios (2015)

At that time we were still using Nagios (one instance per data center) for alerting but SMG was now our “official” graphing system.

One observation I had was that more or less the same polling was happening for graphs and monitoring. As mentioned Nagios had a check for SMG log errors … which was coming whenever (and sometimes before) some actual service Nagios alert came.

Nagios itself was (and still is) a well established monitoring system. It would be known in its reliability in “nagging” people about problems. And reliability would by far be the most important property of any monitoring system because its the “last line of defense” against bugs but also operational issues. Nagios was apparently built by people who knew what is in a Data Center and how to address the needs of the sysadmins maintaining it with the ultimate goal of minimizing downtime. It had a bunch of concepts about the monitoring process which would make a lot of sense.

Nagios features and monitoring process

Nagios has the “Host” concept (ping command check) and Services concept (checked by external commands following certain protocol). Services would be “attached” to hosts either directly or more commonly via hostgroups - lists of (in my case - usually generated by chef) hosts belonging to given role and thus - hostgroup. If a host is down you get a “Host down” alert (services on that host are not even necessarily checked) and if its up, individual services and stats are checked and we would get individual services alerts as desired.

In Nagios any Host can be in UP or DOWN state as determined by a ping command. Services can be in one of the following states - OK (all is good), WARNING (something is not quite right), CRITICAL (something is definitely wrong) and UNKNOWN (“no idea” or plugin timeout).

Technically Nagios also has an (optionally enabled via “flap detection”) FLAPPING state. If a service goes from good to bad and back to good state too many times per given interval it would be declared in FLAPPING state and no further alerts would be sent (until its no longer flapping, however that is determined). This can be useful to limit alert spam to some extent but not very effectively. On the other hand it does bring the risk to miss alerts (which is very bad for the “Monitoring guy”) and that did bite us on at least one occasion. Ultimately we decided not to use flap detection so the FLAPPING state would not be present in our Nagios setup. I also didn’t implemented such feature in SMG - it has other (and potentially safer but more effective) means to deal with alert spam.

Normally when a service or host is in bad state Nagios would repeat the alert every 2 hours (and until the service recovers in which case it would send a “RECOVERY” e-mail/alert). Of course that could be annoying but it would imply a process where someone (on-call person) will have to Acknowledge the problem (in the Nagios UI) which in turn will stop the repeating alert messages. When that happens Nagios will send an “ACKNOWLEDGMENT” alert and the service/host state will become “Acknowledged”. The Acknowledged state would automatically clear on recovery - so a subsequent failure will ought to trigger a new alert.

Another use case for wanting to disable alerts would be planned maintenance - if you are going to work on something and know its going to be down for some time you want to prevent alerts upfront so that everyone else will not freak out. In that case Nagios would allow you to “Schedule downtime” for given host or service. That option will force you to enter a period for the downtime. That is actually great as even if you forget to “Unschedule the downtime” after the work is done it will eventually expire. The “permanent” disabling of notifications option would be much riskier and we would normally avoid it (it did bite us at least once to forget to unsilence a silenced for maintenance service). With the mentioned Flap detection disabled we would sometimes have case where a service is going bad and recovering in cycles. In that case Acknowledgment doesn’t help a lot as it clears itself on the next recovery. So that would be another use case to schedule downtime for a service and then maybe switch into “active” mode (looking at graphs and logs) to find a way to fix it.

As one can imagine, in the role of the “Monitoring guy” I ended up writing tons of custom “Nagios plugins” for custom checks. As scary as this may sound, it would be about writing arbitrary scripts which follow a simple protocol:

If the target conditions imply that there should be a CRITICAL state (often comparing some fetched stat value to defined “critical” threshold) - exit with 2
If the target conditions imply that there should be a WARNING state (often comparing some fetched stat value to defined “warning” threshold) - exit with 1
If all is good exit with 0 and print an “OK” message
Any other exit code would result in UNKNOWN Nagios state but that would be considered more like a plugin bug (we would treat these as the same alert level as CRITICAL states). A timing out Nagios plugin would also result in an UNKNOWN error.

The two most-important (IMHO) properties of a Nagios plugin/script would be:

To be reasonably fast - slow checks would slow down the Nagios poller(s) which in turn could eventually increase the actual interval between individual service checks beyond the “check every minute” requirement. Usually these would be involving network calls which would normally be the slowest part. There were some issues with that - we wanted some Nagios alerts for stats coming via expensive and slow analytics queries. These queries could take around a minute to complete but whenever it exceeded 60 seconds Nagios would kill it (that was my timeout). In addition some log-parsing based statistics would take even longer time to generate and doing that synchronously within the Nagios plugin would be problematic. I decided that I can simply change my slow plugins into cron jobs (deployed via chef) but instead of using an exit code to indicate check success/failure it would output a temp text file where the first few characters of the first line would be one of “OK”, “WARN” or “CRIT”. The remaining output would be the same as if it was a regular Nagios plugin. Then a dedicated Nagios plugin (configured in Nagios) would check that file (likely over ssh), make sure it was recently updated and use the contents to determine the exit code (and also output them). If the temp file didn’t start with one of the supported prefixes, that would mean UNKNOWN Nagios error.
To be reliable … - For a service check to be “wrong” it would mean one of two things - “the service is up but the check is failing” (“false positive”) or “the service is down but the check is succeeding” (“false negative”). The other cases where the check correctly reflects the actual status of the services are good as far as the check script is concerned. And there is a big difference between the two “wrong” cases - while the “false positive” case can be annoying it can never do the same damage as a “false negative” can do. In other words if you get a false alarm you may be pissed off at that but if the service is up then its just the “Monitoring guy” fault and he needs to fix his checks. But if your service is down for a day (or a weekend) and you don’t even know about it because your check is happily swallowing errors this could be a disaster. Still the “Monitoring guy” fault but this time with much worse consequences - some companies can go out of business if down for days …

In any case a wise approach when writing monitoring scripts (and monitoring code in general) would be:

Keep it simple and don’t swallow any errors. Bail out with an error as soon as something unexpected is happening.

How exactly that is handled would depend on the platform/language the script is being written in. For example using Bash can be tricky with that regard (hint: always use set -e) so I would normally avoid using it in favor of slightly “safer” in my book languages like Ruby or Python. I should mention Perl too - although certainly not “safe” (for me) the majority of the upstream Nagios plugins would be written in Perl. So if I needed a slightly modified existing Perl plugin I could allow myself to tweak a copy of it but for any non-trivial changes I’d rather rewrite in Ruby or Python. The choice between these two would be mainly depending on the availability and stability of certain client libraries for the service I would be checking.

Very often a Nagios check plugin would look something like the following simplified pseudo code:

res = fetch_data_from_host_service(host,port) 
# error out if failed to fetch/parse,
# usually with critical/exit code 2/CRIT

if res.stat_under_check > crit_thresh
  print CRITICAL ...
  exit 2
else if res.stat_under_check > warn_thresh
  print WARNING ...
  exit 1
else
  print OK ...
  exit 0
end

One observation from that was that if I had multiple checks against values from the same host/service each of these would call fetch_data_from_host_service on its own which is not necessarily the most efficient way (remote calls can be slow). Also if such service with multiple checks goes down we would get alerts for all the checks. Note that it would be possible to combine these multiple checks into just one plugin. That could address the concerns but would be against the “keep it simple” principle - the if condition would need to have much more branches and multiple thresholds would have to be supplied.

IMHO a better approach would be to separate these two - ideally the fetch_data_from_host_service would be called just once per polling interval and then we could have multiple checks based on that (cached) result. This is the approach i took later in SMG.

In addition to having “default” system CPU/Mem/Disk Usage etc checks for every host, we would have (Chef role based) service specific checks (like Haproxy LB up/down, Haproxy connection queues, MySQL up/down, MySQL replication) but also lots of “one-off” checks. A lot of these were the result from various production issues and whenever we identified a root cause for some issue we would likely add a specific Nagios check for that condition so that whenever it was causing issues again we would know immediately. The net result (then with thousands of systems across multiple DCs) was that we would get a lot of Nagios alerts …

Reality is such that if you want to know about and get an e-mail for every (even minor) issue happening across a fleet of a few thousand servers you should be prepared to deal with a lot of alert spam. Nagios has some ways to address that:

Normally a single failure event will not trigger an immediate alert. At that point the service is in a “soft” error state. If the error recovers on the next check run it will not trigger an alert. Only if a service is in an error state for 3 (configurable) consecutive runs it will result in a “hard” error and notifications (e-mail and any other configured) would be triggered. This alone greatly reduces the alert spam - soft errors would only be visible in Nagios UI and logs if they recover before becoming hard. Note: It was still a good idea to review the Nagios log on a regular basis - too many instances of a soft error might indicate an issue.
The distinction between WARNINGs and CRITICALs was also useful for us. At a high level we considered WARNINGS as things which need to be checked but not necessarily urgent (i.e nothing is broken yet, but there are signs that it might break soon). Often these would recover by themselves and in such cases it might be sufficient to adjust some alert thresholds (and if these happen too often). Critical alerts in turn would be reserved for stuff which potentially required immediate attention - stuff like a host or service being down and needs manual attention to bring back. These would also trigger a PagerDuty alert so the on-call person would have to respond within 5 minutes.

As one can imagine we would get a lot more warnings than critical alerts (often an alert would first be in a WARNING state for a while before becoming CRITICAL). While originally the entire “extended operations” (sysadmins and server developers) would get all alerts at some point the noise was too much (and the group was growing) so we started sending the WARNINGs only to the smaller groups of sysadmins.

SMG Monitoring configuration

In addition to the draw-backs mentioned above about having two monitoring systems, Nagios checks have the issue that they would be “stateless”. For example I wanted a Nagios check to verify that our current traffic levels have not dropped below certain % from yesterday’s traffic at the same time. Just defining an absolute value for such check (alert us if below X Gbps) doesn’t work well as our peak traffic volume could be 3 times the lowest point for the day. Of course I had the traffic data in my SMG system so it was not too hard to implement a Nagios plugin which gets CSV data from SMG over HTTP and compare the recent vs 24-hours old numbers. With time more such (“stateful”) checks would pop up and I had to keep them in sync between the two systems (graphing and alerting). That could be laborious and error-prone and the “Monitoring guy” wouldn’t like it.

Eventually I added some features in SMG to help me with that and as with most new (and quick-and-dirty) features it would start as a SMG Plugin. For a while I was adding features including extending the Plugin interface to support “Monitoring plugins”. In the very first versions these would simply run every interval, do a bunch of checks as configured and output “status files” which Nagios in turn would poll and possibly trigger alerts.

My problems in the “Monitoring guy” role would not be that important though. It was more important that we weren’t as efficient as we could be in quickly identifying and troubleshooting our production issues. The on-call person would get a Nagios alert which meant that they would have to go to the respective DC Nagios instance UI, acknowledge the error and then more often than not - go to the SMG UI for troubleshooting. For example with a lot of check types (including common stuff like disk and mem usage) you want to know how the respective stat reached its alert threshold value - like was it slowly growing over days or just spiked in the last 5 minutes. In my mind a better monitoring system would send me an alert e-mail containing links to relevant graph(s) and the ability to acknowledge the alerts form the same links/UI.

And then the ultimate issue which was popping up with Nagios was that it was approaching the time where it could not keep up with all services and checks we wanted it to do in the desired frequency (which was - every minute). This could mean delays in the expected time between the event happening and the alert triggering and that’s bad. I was considering to “shard” the hosts in the bigger Data Centers onto two Nagios instances to address that but the usability of the system would certainly not improve with that.

With the knowledge that about 90% of the stuff Nagios was polling for was also polled for in SMG I decided that I could try to implement a Nagios replacement within SMG and simplify “Monitoring guy” job but even more importantly - improve our ability as an operations team to identify and troubleshoot issues.

That is all good but replacing a battle tested solution like Nagios is still far from a trivial task. I thought that in order for my system to be adopted by the team it would have to support everything Nagios was doing for us (but not necessarily everything Nagios was doing in general). And then it would have to be better in at least some ways for people to like it and adopt it.

Here are some features I had in mind when I started working on the SMG “Monitoring” subsystem

Command trees

Nagios’s host -> services relation is nice (defining a dependency - like services depend on their host being up) but in SMG I wanted this to be more generic and actually support multiple such dependency levels and not just two. A natural structure to define parent -> children relationship would be a “Tree” of “commands”.

SMG already had the “command” notion in every graph it was managing - values would be retrieved via a bash “fetch command”. I decided that I could extend my SMG RRD Object configurations to support a “pre-fetch” definition - a string id pointing to a special new type of global object called $pre_fetch. An example such object definition could look like this:

- $pre_fetch:
  id: host._myhost_.ping
  command: ping -c 1 _myhost_ip_
  timeout: 5

Having such top-level “ping” command would be the equivalent of the Nagios “Host” concept. If that command failed I would detect that the host is probably down. Note that I could (and actually did) use a slightly more resilient ping command if I thought that a single icmp packet dropped is too weak signal to conclude that the host is down - something like

ping -c 1 -w 2 _myhost_ip_ || ping -c 1 _myhost_ip_

i.e. it would try twice, the first time with a two seconds timeout.

Then I could define the top-level ping command children - these could be RRD Objects but also other pre-fetch commands. Both of these support a pre_fetch attribute which can have the value of the id of a $pre_fetch global - host._myhost_.ping in the above example. For example normally we would fetch most of our systems OS stats via SNMP using snmpget commands. These would include stuff like CPU/Memory/Disk/etc usage and we would monitor a bunch of these for every host in Nagios. However if SNMP was down or not responding on some host we would get a bunch of alerts with the same (and not necessarily clear) root cause. Instead I would prefer to get a single alert telling me that SNMP is down. The way to address that in my new system is to introduce a $pre-fetch command which checks that SNMP is up and do not even try to run the individual stats fetch commands. This could look like that:

- $pre_fetch:
  id: host._myhost_.snmp 
  # pre_fetch being a "pointer" to parent id
  pre_fetch: host._myhost_.ping
  command: snmpget -v2c -c_community_ _myhost_ip_ sysDescr.0

Then I could have my “leaf” (RRD Update) Objects parented off that command, like this:

- id: host._myhost_.snmp.some_stat1
  pre_fetch: host._myhost_.snmp # pointer to parent
  command: snmpget -v2c -c_community_ _myhost_ip_ some_stat1_oid.0
  vars:
    - label: some_stat1

- id: host._myhost_.snmp.some_stat2
  pre_fetch: host._myhost_.snmp # pointer to parent
  command: snmpget -v2c -c_community_ _myhost_ip_ some_stat2_oid.0
  vars:
    - label: some_stat2

Or at a high level every host and the services on it would be defined as a tree which might look like this:

host._myhost_.ping
    host._myhost_.snmp
       host._myhost_.snmp.some_stat1
       host._myhost_.snmp.some_stat2
       ... 
    host._myhost_.mysql
        ...
    host._myhost_.jmx
        ...
    ...

Because every non-root node in such tree would be attached to a parent via the id “pointer” it would be possible to create configurations with cycles. These would be invalid and SMG will reject such configuration. Cycles are prevented via a simple limit of the number of the nested tree levels supported. Currently limit is 10 and I have never used more than 4 or 5 levels.

That looked good to me - I had a way to define service/objects dependencies and limit the detection and alert to the place it actually fails. But it was somewhat flawed the way I wrote the commands above.

First - it would be possible for SNMP to fail right between the “host._myhost_.snmp” command and the “host._myhost_.snmp.some_statX” commands. That would still result in a bunch of child commands all failing. Yet that was not so common case to be worried too much about it. The other issue however was worse - I was not actually reducing the number of remote calls for the checks - instead I added an extra call to be executed before the others.

But then the idea came to me - what if instead of running all these snmpgets I run just one but with all the values (oids) I care about in one shot and then redirect the output to a file? I could actually use that output in the child commands to parse the needed SNMP values with no extra remote calls. It would look something like this:

- $pre_fetch:
  id: host._myhost_.snmp 
  pre_fetch: host._myhost_.ping
  command: snmpget -v2c -c_community_ _myhost_ip_ some_stat1_oid.0 some_stat2_oid.0 ... >/tmp/host-_myhost_-snmp.out

- id: host._myhost_.snmp.some_stat1:
  pre_fetch: host._myhost_.snmp
  command: grep some_stat1_oid.0 /tmp/host-_myhost_-snmp.out | awk '{print $2}'
  vars:
    - label: some_stat1

- id: host._myhost_.snmp.some_stat2:
  pre_fetch: host._myhost_.snmp
  command: grep some_stat2_oid.0 /tmp/host-_myhost_-snmp.out | awk '{print $2}'
  vars:
    - label: some_stat2

Of course the “grep”s in these example RRD objects definitions would actually be slightly more complex scripts to validate and parse the SNMP output but the idea should be obvious. Also this is still not the most efficient way to do this stuff - much later SMG would be enhanced to support passing parent (pre_fetch) command output to child commands directly from memory but at the time the solution was already great for me. I had eliminated a lot of network calls consolidating them which significantly improved the throughput of my system (how many commands/checks it could run per minute)

The Command trees eventually got their own UI page in SMG under the “Monitor” -> “Run Trees” menu.

Flexible alerts configurations

In SMG all RRD Update objects (the leafs of the commands tree) ultimately represent one of more (timestamped) series of numbers. The vast majority of the checks based on numbers would be to compare the values against pre-defined thresholds. In addition I wanted to have separate “warning” and “critical” thresholds to be able to map my Nagios checks thresholds into the new config. So a value could be in one of the following states matching the Nagios ones - OK, WARNING or CRITICAL.

In order to accommodate these I came up with the following config properties to define value-based alert thresholds:

alert-_level_-_op_: _threshold_value_

In the above level would be one of “warn” or “crit”, op would be a string representing arithmetic comparison, one of “gt”, “gte”, “lt”, “lte”, “eq” or “neq”. So an example alert config for a stat representing the memory usage in % on given host could look like this:

alert-warn-gt: 90
alert-crit-gt: 95

That would translate to “send me a critical alert if memory usage is above 95%” and “send me a warning alert if above 90%” where a critical threshold would always have a precedence over a warning threshold (the later would not even be evaluated if the critical condition is met).

Then the question would be where to put this config. I wanted to be flexible and (similar to the graphing subsystem) be easy to add one-off checks (on possibly a one-off graph). For these it would make sense to be defined inline with the graph object definitions, together with the value label/measurement unit etc. I would also use inline definitions in templates - e.g. the base “system” template containing the “host memory usage” stat would likely contain similar to the above definition. Here is how an inline definition would look like:

- id: host._myhost_.snmp.mem
  command: get_snmp_mem_percent.sh _myhost_ip_
  vars:
    - label: mem used
      mu: B
      alert-warn-gt: 90
      alert-crit-gt: 95

In addition I wanted to be able to apply such thresholds on selected groups of graphs values. The standard way for SMG to group a bunch of graphs was the “Index” concept and I decided to reuse that. But these define groups of objects/graphs and we want to apply the value-based thresholds to individual time series/vars in these graphs. I added an alerts property available to any index. That property itself is a list of alert definitions, like this:

- id: ^hosts._my_host_
  title: "_my_host_ graphs"
  px: "host._myhost_."
  alerts:
    - label: mem used
      alert-warn-gt: 90
      alert-crit-gt: 95
    - label: 0
      ...
    ... (possibly more alert definitions)

Each of these alert definitions would map to any variable in the RRD Update objects matched by the index filter which matches the alerts entry “label” property. In the above case - from all objects matching the “host._myhost_.” prefix pick variables which have “mem used” label in their vars definition. Which would most likely be equivalent to the above inline definition. There were some special cases where matching by label might not work well (label is technically optional in object var definitions) so I made it that if the Index alerts definition label is an (unquoted) number it would be treated as a variable index. So a “label: 0” would match the first variable from all matching objects.

These two options would give me enough flexibility to define value -based alerts. The Index option meant that sometimes I had to define an Index just for alert purposes but not necessarily useful to be displayed in the “Configured Indexes” page. For that purpose I introduced Hidden Indexes - these are defined the same way as regular one with the difference that the id must start with a ‘~’ instead of ‘^’. Only difference in behavior is that these are simply not shown in the UI where “regular” Indexes are.

Note that with this flexibility it was possible to define multiple different thresholds for the same value at given level. That would not be a big deal - any applicable thresholds would be evaluated and the one resulting in “higher” (would be CRITICAL) severity will actually trigger the alert. If there are more than one - it doesn’t mater which will trigger it - would still be the same alert level.

This was all good as far as value states are concerned. However there was an extra state needed - the case where a command would fail and there would be no numbers to compare with thresholds. That could be the RRD Update object command but also any of the parent commands in the “Commands tree” up to the root ping command (if the host is down). I decided to map that to the Nagios UNKNOWN state and originally named it like that which was somewhat unfortunate and often a source of confusion over time. Eventually (much later) I renamed that to be the “FAILED” state - slightly more accurately representing the fact that a check/data retrieve command has failed.

I also had one more state in mind - “internal SMG error”, or SMGERR. That would be a special state indicating issue with SMG and/or possibly a capacity one. Such example would be “Overlapping runs” - e.g. the every minute poller did not complete for 1 minute which if happening often means that SMG is approaching its limits.

I thought that I need to be able to order my states by severity and a question would be where to put the FAILED state compared to WARNING and CRITICAL. I decided to go with this order:

OK - all is good - all fetch commands are succeeding and values are within expected ranges
WARNING - something is approaching a certain limit and might break soon
FAILED/(UNKNOWN originally) - a parent/fetch command is failing for some reason - likely a service down (but also possibly a bug in the command)
CRITICAL - we know that all parent fetch commands succeeded and we do know that something is broken (no uncertainty)
SMGERR - SMG is misbehaving, the “Monitoring guy” needs to look into it.

The FAILED/CRITICAL order might be an arguable choice but it made sense to me and ultimately - wouldn’t be as important. I would never have “conflicting” cases where both states would be applicable and I need to choose one, for warn vs crit its actually normal to have both matching their condition in which case priority matters.

Also ultimately what would be more important for me is who will get the notification and via what means - e.g. PagerDuty should be reserved for issues requiring immediate attention. That would require its own somewhat orthogonal configuration.

Flexible notifications configuration

Nagios has reasonably flexible (and not necessarily simple) way to configure Notifications depending on host and/or services (groups) and alert level which we were using so I needed my SMG configuration to support the same use cases (or at least - the subset we were using)

I decided that I will use external commands for SMG notifications (which technically is the same with Nagios). These would have to follow certain protocol and accept the notification info/meta-data via environment variables passed down by SMG. We would be using mainly two notifications methods at the time - e-mail and for subset of the alerts - PagerDuty. I wrote a quick “mail” command wrapper and also modified the PagerDuty-provided Nagios integration script to work for SMG and bundled these within SMG. So for example defining a mail notification command I would use something like this in the SMG config:

- $notify-command:
  id: mail-asen
  command: smgscripts/notif-mail.sh asen@smule.com

At least initially I wanted to get all alerts so it made sense to me to configure default/global notifications recipients, based on the alert severity. These would look like this:

- $notify-crit: mail-oncall,notify-pagerduty
- $notify-warn: mail-oncall
- $notify-fail: ... (originally - $notify-unkn)

- $notify-strikes: 3
- $notify-backoff: 7200

The last two would allow me to port two other Nagios configurations we were using:

max_check_attempts - how many times an alert needs to be repeated to become HARD (from SOFT) and send notifications. In SMG that would be named “notify-strikes” and the default value would follow the “3 strikes rule”.
notification_interval - after what time to re-send alerts for still not recovered services. In SMG that would be named “notify-backoff” and set in seconds, so the above value actually means 2 hours.

That would probably cover 90% of the needs of our ops team at the time yet it was not sufficient for all. I needed to be able to override these somehow for certain objects and send certain alerts to specific teams to take care of.

Also - the notify-fail recipient would be questionable. At the time we already had our existing Nagios configs to not trigger PD alerts on warnings but a “host down” type alert would become questionable. We already had our infrastructure built in a way that there was almost no single host which could take an entire production service down with it. So if a single host going down and if part of a large farm - it should not necessarily trigger a PagerDuty alert. Yet - some important hosts like a Database or LB VIP going down should actually trigger a PD alert, ASAP.

Similar to the alert- configs I decided to support my notify- confs in two places - inline within RRD Update objects definition but also via Indexes.

For the value-based alert levels (warn/crit) it made sense to be defined together with the alert-(warn/crit) definitions, so either inline with the object vars or in an index alerts definition. The notify-fail would be at the “command” level, whether in a RRD Update object or a pre-fetch command.

Or reusing the above examples it could look like this inline within an object:

- id: host._myhost_.snmp.mem:
  command: get_snmp_mem_percent.sh _myhost_ip_
  notify-fail: mail-netops
  vars:
    - label: mem used
      mu: B
      alert-warn-gt: 90
      alert-crit-gt: 95
      notify-warn: mail-sysadmin
      notify-crit: mail-sysadmin,notify-pagerduty

Or an Index:

- id: ^hosts._my_host_
  title: "_my_host_ graphs"
  px: "host._my_host_."
  notify-fail: mail-netops
  alerts:
    - label: mem used
      alert-warn-gt: 90
      alert-crit-gt: 95
      notify-warn: mail-sysadmin
      notify-crit: mail-sysadmin,notify-pagerduty

I could also override the global notify-strikes and notify-backoff there. There might be cases we wanted alert on every failure (which would mean to set notify-strikes to 1 and the alert becomes HARD immediately). In other cases (with some flaky and unimportant services) I might want to set that to 10 so that I’d get alert only if the service is down for at least full 10 minutes.

Similar to the alert definitions it was now possible to define multiple conflicting recipient lists for given alert state. The conflict resolution would work like this:

If there are any notify- configs specified via Index or inline with the Object and possibly many of them, the recipient lists will be combined. I.e. any of the notify-commands listed in any of the configs would be included. This is trickier with notify-strikes and notify-backoff as I need to pick one vs combining them. Erroring out on the safe side (false positives are better than false negatives) I would pick the smallest notify-strikes set and the shortest notify-backoff interval set.
If there are no notify- configs specified for the given Object/vars (neither inline nor via Index) - use the globals.
If there are also no globals - don’t send notifications for that alert severity.

Yet there was one extra case I wanted to be able to cover - I would have my default/global $notify- recipients set, including for notify-fail. In some very rare cases (maybe some development machine which gets broken on a daily basis) I would want to permanently disable all notifications for given Object or group of objects. So I added a special boolean flag named “notify-disable”. With regards to conflicts resolution that would be a special case - I decided that notify-disable set to true once will override any other confs (likely not setting it at all). That could be dangerous so I would very rarely use it. SMG would have better means of temporarily silencing hosts and this “permanent” solution would be more like a special case/“escape hatch”.

With these I had most of the Nagios notifications use cases covered, now needed to figure out how to actually implement all of this on top of my already reasonably stable at the time graphing system.

The SMG “Data Stream”

I started to implement all of this as The “Monitoring plugin”. At that point I also thought that it might be a good idea to be able to “stream” all the data (mostly numbers) SMG is gathering to an external destination or for some funky types of processing (including Anomaly detection, described later) or maybe an analytics database. I also realized that the entire Monitoring system could be built on top of such “stream”.

It would consist of 3 types of events:

“Data” Events - for every successful leaf (RRD Update) command, the resulting numbers/values will be sent as “Data events” representing the object id and the numbers just fetched for that object.
“Command” Events - for every command from the Commands tree (including objects commands) executed there would be either an OK event or FAILED event. The later would also contain some error details, including the failed command stderr/stdout output. SMG would also make sure that on every interval/run every configured command would result in a Command event (that way the listener could potentially detect “missing” events). So if a command failed that would result in a FAILED event and while its child commands would not be actually executed SMG will still emit the same FAILED event for each of them where the child events would have a special “inherited” flag so that the listener could distinguish these as such.
“Interval” Events - These would be sent once for each poll interval run and could indicate the mentioned SMGERR (Overlapping runs, or maybe a bug in SMG if it processed less commands than it was supposed to, any kind of discrepancies).

So any SMG Plugin could register itself (using the Config Service) as a Data Feed Listener on startup and it would start receiving them asynchronously via dedicated methods (“callbacks”) implemented in the plugin.

But soon I realized that implementing everything I wanted as a Plugin (and without making the Plugin interface way too fat) will not be feasible - ultimately I wanted a tight integration of the Monitoring and Graphs UI and eventually it had to be refactored and become part of the SMG Core in the form of a Monitoring Service. That Monitoring service would still work as a Data Feed Listener though and Plugins can still register themselves as such.

Years later I was able to implement a Plugin forwarding all SMG “metrics” to InfluxDb trivially using the same concept, and any kind of data forwarding would be trivial to implement.

The Monitor states trees

I already had the “Command Trees” structure as built from the SMG yaml configuration and that had one important property - it was constructed during config reload and would be immutable for the duration of its life time which is until the next config reload. Having it that way would help me worry very little about thread safety when running these commands in parallel and know that their dependencies and properties are not changing. I only needed to ensure that the same LocalConfig object would be used for the duration of the entire interval run and I knew that everything would be valid and consistent. A config reload would update the “current” global reference to the LocalConfig object with a new one but that one would not be used until the next interval run. It would also be a single reference update vs updating individual members of the structure.

I decided that it makes sense to have similar structure for the Monitor States, more or less replicating the Command Trees. The structure itself would be reasonably stable - it would only change on config reload which is always synchronized/single threaded. But its members (nodes) would contain the mutable recent states for the given object and these would update every minute. So I had to be a bit more careful with these especially at config reload time but overall I was able to write it without the need of too much synchronization (Akka/Actors help a lot with that) so I didn’t really loose throughput in my graphing system by adding the monitoring stuff.

Without getting into too much implementation details my Monitor states could be of one of a few types roughly matching the internal SMG object types (pre-fetch commands, rrd objects, “aggregate” objects) and all would have some common properties making them suitable for displaying them together in a unified UI. Each Monitor state object would keep a list of recent state values (OK/WARN/CRIT/FAIL/..) with a size matching at least the notify-strikes value. That would help it determine when the state has switched from a SOFT error to a HARD error and notification is due. That was pretty much all the state I needed (a short list of recent states for every command/update object in the system) so the memory footprint of my implementation would not be too big.

The leafs of the Monitor state tree would be the individual “variable states” - ones actually matching the timeseries/graphs. These would normally be in one of the “value states” (OK/WARNING/CRITICAL) but could also inherit the FAILED state of any failed command along the parents chain. Any other (non-leaf) tree nodes could be in either OK or FAILED state (for some of them it could also be inherited).

Sounds simple? Well it is not - there are a lot of edge cases to handle. Like for example if I had a leaf (value) state at critical level and with 2 “strikes” (i.e. still SOFT but would become HARD on the next run if unchanged) and then a parent state fails, suddenly the leaf state will be with 3 consecutive bad states so a question arises - should the leaf state trigger an alert (being “HARD” now) or not? There would be other similar questions and in order to have reasonably simple and reliable logic and implementation I decided to follow a simple rule - an inherited alert state will never trigger alert but inherited previous states will still count against the notify-strikes threshold. What that meant that in a worst case scenario (and with notify-strikes: 3) there could be up to 5 (2 * N - 1) consecutive non-OK states before actually some of them becoming HARD and triggering alert. In the above example these could be: CRIT, CRIT, FAILED (inherited), FAILED (inherited) and then the 5th (unless OK/RECOVERY) would be either again a CRIT/WARN value state triggering the respective alert or now the FAILED command would have failed for the 3d time triggering its own HARD alert.

The “Interval run”

I have mentioned the “interval runs” a bunch of times but didn’t explain how these would work.

At the beginning of every interval SMG would send one message to the single Update (dispatcher) Actor for every root node in the command trees. The actor would run these asynchronously using the configured (per-interval) thread pool and on completion the commands would notify back the Update Actor with the result of the command. If successful that will send an OK data message and trigger the execution of the child commands and if a failure - would send the FAILED data messages for itself and all children.

The idea was to run as much commands as possible in a short interval while still strictly ordered as per the trees structure (children after the parent). That way an individual slow command would not slow everything else down. Note that the limited thread pool meant that ultimately enough slow commands can still actually slow everything down.

With this approach I noticed an issue - the poller would “obsess” over each host one after the other (executing a bunch of probes against the same host at the same time, shortly after the top-level ping). This was undesirable in my view (sometimes even causing false positives) so I decided to add some control about how child commands in a given tree will be executed. To avoid the “obsessing” issue I decided that by default child fetch commands will be executed in sequence. Then I added a new parameter to the pre_fetch command objects named “child_conc” (from “child concurrency”) which would allow me to parallelize the children of any pre-fetch command as much as I need but have it with a default value of 1 - meaning to run these in a single thread - so sequentially. The leaf (update) objects however would still be parallelized to the maximum allowed by the thread pool size.

Anomaly detection and plugin checks

I had ideas and attempts about implementing “Anomaly detection” even before planning to replace Nagios with SMG. The original version of that would be the “Spiker” plugin - named after its ability to detect “spikes” and “drops” in certain stats. The way this would work would be on every check, for every “interesting” object/graph it would do a rrdtool fetch to get a “long” period (normally - the last 24 hours) of data and then based on these numbers it would calculate some statistics about the numbers, things like average/max/std deviation etc.

To determine if there was an anomaly I would calculate these stats for a “short” recent period, e.g. 30 min. Then calculate the same stats for every such remaining “short” period for the full 24 hours of data (in the 24h example that would mean 47 sets of stats). Then I would compare the recent “short” period stats with each of these remaining/previous “short” period stats and if it didn’t “look like” any of them - that would mean an anomaly. Of course the “look like” statement can be very subjective. Even humans staring at the same graph would not always agree whether what they see happening in the last 30 minutes is an anomaly or not. I would use a threshold value (normally 1.5) where in order for a “short” period to be considered different (i.e. does not “look like”) form another one, all of the stats of the first one should be at least that much “times” different (whether bigger or smaller). If any of the stat is within the 1.5 “times” range it meant that they are similar enough and there is no anomaly.

Also the same time series graph can look quite differently depending on the time scale. Something which looks like a minor increase in a 24-hours graph can be an actual spike in a yearly graph. So the actual short/long periods would be configurable, together with the mentioned threshold value.

But there was an issue. My system was designed for high throughput and fast writes and not necessarily optimal for doing heavy mass-read operations. Sysadmins would normally only read small subsets of the rrds for graphing via the UI. And if I wanted anomaly detection for all of my stats it would mean that on every Check Interval all the stats I write will also be read from the RRD files together with their 24h history. I could mitigate that by making the “anomaly checks” less frequent (e.g. check every 30 minutes instead of every minute) but I still didn’t like it - I wanted it to be “live”, telling me “you have a spike now” vs “you had a spike 30 minutes ago”.

I was thinking that I can do better with this type of anomaly detection, especially for “live” spikes detection over reasonably short periods and do it all in memory. I already had all data points in the monitoring service before they even go to the rrd files. But for keeping 24 hours of every-minute stats I would need to keep a list of 1440 data points for each stat. I didn’t like the memory requirements overhead this could impose but I could do better - I didn’t actually need all older data points. I could instead only keep the stats used when comparing two chunks whether they “look like” each other, effectively compressing the data - instead of 30 data points I would only keep their avg/min/max/stddev values - about an order of magnitude less data to keep in memory.

I also decided that I will extend my (mostly “borrowed” from Nagios) states (OK/WARNING/…) with one more state called ANOMALY. That would be another value-only state only applicable to the leafs of the monitor states tree and would be sorted between OK and WARNING in the states “severity” sorting (i.e. the least severe among the non-OK states). This also introduced a new “notify-anom” config value, to be able to send alerts for that error state if I wanted to.

Initially I implemented the “real-time” anomaly detection as part of the Monitoring service, mainly because I already had a mutable structure with all update objects in the form of the Monitor states tree. But my implementation was certainly not the only possible implementation for Anomaly detection. It was even likely far from the best one. I thought that it would make sense to refactor that in a plugin - so that I could even possibly have multiple different implementations and compare.

I would hook my Plugin as just another Data Feed listener. It would ignore all but the value state events and for them it would keep the necessary in-memory structures helping it determine anomalies.

I also had to extend the syntax of my alert-… definitions to support something beyond simple value <-> threshold comparisons. So far I had alert-warn-op and alert-crit-op and the initial implementation would use the alert-anom: … syntax for that. But then I decided that I should be able to implement arbitrary checks in SMG plugins which would not be limited to Anomaly detection or only ANOMALY states. The syntax should be extensible and possibly plugin-dependent. I decided that all plugin check “thresholds” (in whatever form they would be defined) will be defined a special new alert-p-pluginId-… property (p - from plugin). The pluginId part after alert-p- would identify the plugin implementing the check and the portion after it would be plugin-specific. For the Anomaly check, after I extracted it to a “mon” plugin the syntax to define Anomaly detection thresholds would be like this (applied to variables same as the other alert- properties):

alert-p-mon-anom: "1.5:30m:30h"

That would translate to “similarity threshold” of 1.5 (times), applied over a “short” interval chunk covering the last 30 minutes and compared to all other “short” intervals within the 30h “long” interval.

With that (alert-p-plugin-check_name) syntax the mentioned “Period over Period” check thresholds definition would look like this:

alert-p-mon-pop: "24h-5M:lt:0.8:0.3"

That would translate - compare the last value with the same value 24h ago but at 5-minute average (the 5M). If the value is less than (lt) 0.8 times form the 24h-ago value - trigger a WARNING, if the value is less than 0.3 - trigger a CRITICAL.

For me the bottom line was that I ended up with a nice way to extend SMG value-based checks with anything I could come up with based on the time series. And this could be done via a Plugin allowing me to do crazy experiments without worrying too much that I will break the core SMG functionality. Such plugins can be enabled and disabled (if causing issues) at deploy/startup time too.

The Event log

I wanted a trace-able log of all monitor state changes where I could also see SOFT errors, similar to the Nagios log. My implementation would be simple - I would batch “events” in (the EventLog Actor) memory for up to 30 seconds and then append them to a daily “json lines” file in a dedicated directory. The access pattern for those would be “give me logs since X time ago” with a filter and limit so I would just read all necessary daily files up to the X point in time and then apply the filter on all messages.

Surely not the most efficient to read and search implementation but good-enough for that purpose and also very fast as far as writes would be concerned, highly unlikely to become the bottleneck of my system’s write throughput.

The Notifications service

Once the “Monitoring service” detected a HARD error it will send an Actor message to the “Notification service/actor”. That one has a few responsibilities:

it keeps track of all “active alerts” and their last notification times.
the last notification time is used to enforce the “notify-backoff” setting - do not repeat notifications until the backoff expires.
the “active alerts” (including all recipients) part is needed to be able to send RECOVERY messages to all alerted recipients

The actual notifications would have their own “severity levels” matching the alert state levels. One difference would be that the “OK” state is actually named “RECOVERY” in the notifications (normally you only get these after a problem is solved). There is one extra notification type - ACKNOWLEDGEMENT, which would be sent to all already alerted recipients when someone acknowledges a problem in the UI (indicating to them that someone is working on it).

These concepts exist in Nagios and we would rely on them in our workflow so I had them in SMG too. But I also needed two extra Nagios cases covered - acknowledging active alerts and silencing (possibly not yet active) alerts.

I already explained how these work in Nagios (where “silencing” maps to Nagios “scheduled downtime”) so I tried to implement them the same way in SMG so that our workflow could remain unchanged.

And there was one extra case I wanted to cover. In some situations someone would silence a host for two hours, then possibly rebuild it and do some extra work before it is ready to be monitored and back into production usage. If that work involved temporarily removing from Chef, there was a chance that it will also disappear from the monitoring configuration. With that the “silencing” would be lost and people would complain that they get alerts from hosts they supposedly silenced. Note that this would be the same with my Nagios setup and a side effect from Chef being the source of truth about which nodes exist and need to be monitored. This could happen even more often later when I ended up “sharding” some of my SMG instances and monitored nodes could move between shards as their roles change. For that I implemented “Sticky silencing” - you define a standard SMG regex-based filter which will silence any currently matching but also future (not-yet existing) matching objects, and as with the regular silencing requires you to supply a limited time period for the silencing.

The actual alert notifications would contain the object meta-data, a “link” to the object graph(s) (can be many graphs when a parent command is failing) but also links to all “relevant indexes” - these are all indexes which would include the alerting object. And for example since I had Indexes defined for every chef role in the system, as a side effect I can see all chef roles the alerting host/service has from within the alert message.

It is very convenient to be able to click on a link in the alert e-mail and get directly to a graph showing the problem and how it developed over time. The Monitoring UI would also allow me to acknowledge the problem directly from there.

The Monitoring UI

Of course I needed to integrate all of that with my existing and relatively simple UI.

I didn’t want to clutter the UI too much so decided to opt for representing the displayed graphs “monitor states” as small SVG squares with color matching their current monitor states (non-ok states would also have a symbol in them to assist color-blind people). I would use green for the OK state, light blue for ANOMALY, yellow for WARNING, brown for FAILED and red for CRITICAL. I would put 3 such squares stacked in a column and these would map to the 3 most-recent states of the RRD object’s one or more time series. In addition to the “value states”, these could also be in an (inherited) FAILED state.

I also wanted the monitor state details to show when I hover these squares with the mouse. Initially I would simply embed the state text in the SVG but that suddenly made my html pages big and slow due to the large amount of hidden text. So I refactored that and the tooltip text is not loaded with the page but would be loaded on-demand form the server via a JS call when hovered.

The /dash (graphs) UI would also allow me to Acknowledge (if any problems) and/or Silence alerts for given period. Similarly to aggregate functions this would only apply to currently displayed graphs on the page, to avoid someone accidentally silencing everything.

Yet that wouldn’t be sufficient, I needed a few more dedicated pages so I added the “Monitor” menu item. That would land on the “Problems” page - listing all current non-OK states. This is where one would see all current problems at a glance. It has some filtering by severity/remote and by default will not show any soft errors and silenced/acknowledged states. These could be displayed via checkbox clicks. Normally - that page should be mostly empty (showing “no issues”) and this would be the first place to go to see “what is broken” if in doubt.

There would be also an Event log page where all recent alert events can be seen and with some filtering.

The “State trees” page allows one to browse the “Monitor states tree” and has some filtering abilities. This is one of the intended places to silence alerts upfront from and this is the place to apply the mentioned “sticky silence”.

There is also the “Silenced states” page showing all currently silenced objects whether in good/OK or another state.

The Monitor section also hosts a few other pages including the “Run trees” (to browse the Command trees, relatively static), “Alert conditions” (listing all defined alert thresholds), “Notification commands” (listing all notify- commands) and the “Heatmap” (somewhat under-developed view of all states, condensed in groups and represented by a SVG square). Some of these were added later though.

Alerts throttling

We have had many cases to get alert “explosions”. These could be caused by actual outages but a classic mishap could be to have the monitoring server disconnected from the network for a few minutes. That would result in all Nagios (or SMG) checks to fail and trigger e-mail alerts. The actual e-mails will be received by the local postfix relay but will not be delivered immediately (cause the network is down). And then once the network is restored, Postfix would happily deliver thousands of alert notifications (followed by the same number of recoveries).

That could apparently happen with Nagios and my newly developed and not yet very stable system would be even more likely to trigger alert explosions (e.g. due to bugs on my side).

I decided to implement a special new feature in SMG to address that. I would define a maximum “alert rate”, by default - max 10 alerts per 10 minutes. Once the SMG instance detects that it is about to exceed that rate (already triggered enough notifications in the current period) it will trigger a special THROTTLED notification. After that it will stop sending more until the period is over. At that point it will send another special notification - UNTHROTTLED. That one will contain the number of suppressed alerts during the throttled period and also list their subjects/objects. After that newly triggered alerts will again start to result in notifications.

The idea is that when you get a THROTTLED alert, chances are that many things are broken and you should be switching to “active” mode and looking for the issue in graphs and logs. At that point one would normally go to the Monitor page where all alerts (throttled or not) will be visible.

Note that the throttling period is not a “moving 10 minutes” window but rather reset at fixed 10 minutes periods starting at XX:00, XX:10, …, XX:50 in any given XX hour. This means that it is actually possible to get max 20 (2x) alerts in any given 10 minutes (covering two adjacent fixed 10-minute periods). That is not a big deal for me - the goal of this feature is to prevent alert “explosions” (like many hundreds) and not to be strictly accurate about the “maximum alert rate”. And having a fixed window actually makes the implementation trivial and hard to get wrong. Rates calculated based on “moving window” are certainly more involving and error-prone to implement.

Of course this behavior poses some risks about missing alerts. Because of this it is configurable and can be disabled.

Later it was also enhanced to apply the throttling on a per-recipient basis. That way throttling of lower severity messages can not “hide” higher severity messages as long as they have different recipient lists.

Actually replacing Nagios

Monitoring is about confidence - I would run Nagios and SMG in parallel for very long. Initially I would be the only one getting the SMG alerts. Once I was reasonably confident that I was getting the same alerts from SMG as the ones I would get from Nagios I asked a few fellow engineers to subscribe them for the SMG alerts too (essentially getting double amounts of alerts) until it got to a state where I could actually replace some Nagios checks with SMG checks.

Over time more and more checks would move from Nagios to SMG. Initially I was considering only value-based checks for moving but eventually all of them would be subject to. In order to emulate some “binary” Nagios check as a “value state” I would just convert the Nagios plugin exit code to be the actual graph value (0,1,2).

Yet we still run Nagios these days. But its main purpose today is to monitor … SMG :) Yes - I wouldn’t trust a single monitoring system which no one checks automatically (no matter how confident I am in SMG’s stability).

Part 4: Open source and polishing

Open source (2016)

I was graciously allowed by Smule to open source SMG in 2016. My personal motivation was partially to be able to use it elsewhere if I ever decided to leave Smule - I was still legally a contractor at the time, albeit a “permanent” one. This was also a chance for me (and I guess - Smule) to give something back to the open source world, given that me and everyone else relies on so many great open source tools. Because of that I decided to use the very liberal “Apache 2” license - which as per my understanding is “do what you want with it, don’t bug me (legally) if it doesn’t work for you and it would be nice to mention me if you redistribute it”. Its quite possible that this understanding is not exactly accurate in legal terms but that would be my intention :)

Of course I didn’t want to leak any internal and potentially sensitive Smule operations info and I wouldn’t bet my life that there was no such info in my SMG repository history at the time. I made the somewhat harsh decision to wipe the entire git history and import the “current state” at the time as the first commit of a brand new repo.

But there was more - I had the mentioned earlier “Jamon” plugin and I didn’t want to open source that - it has a bunch of Smule-specific assumptions and was never intended to be open sourced. So the Open Source build wouldn’t work “as is” for me and Smule.

So I created a new repo in our internal private git server which would have the open source version added as a git subtree, under a smg/ directory. Then any custom plugins would be a separate sbt project under a smgplugins/ directory. The top level build dir would contain its own sbt build project parenting the mentioned two sub projects. At a high level the build process would look like this:

build the open source version under smg/ using sbt stage
build the custom smgplugins/ project (which depends on the smg/ one)
copy the smgplugins/ jars in the open source build stage lib/ dir and package the resulting stage dir into a tgz (and later - a Docker container).

And that would be it. With that setup I could experiment with some stuff in a “private” plugin until it gets into a shape for sharing and then I could move it to the open source version. Note that I never change the smg/ subtree inside the internal repo and would never push back changes to upstream from there - it is a one-way relation, the custom build is based on unmodified upstream build and just adds some plugins in the form of an extra jar and some conf.

With that sorted out I was able to continuously develop SMG both as an open source project and an internal tool.

Various small and not so small features

The following period can be tracked via the (now public) git history. Below are some features I thought are worth mentioning but these are not necessarily listed in the order they were added.

Use POST for too long URLs

SMG normally uses GET requests to submit its /dash page filter form which results in shareable URLs. But in some instances the filter URLs get very long.

For example I have cases of large hostgroups (like media processing farms) for which I want to be able to summarize stats. Normally I would define an index (generated by Chef) which would have a regex filter containing the full list of nodes. So if my nodes were named a1,a2,…a300, my index definition could look like this:

- id: ^my.group.of.300.hosts
  rx: "^host\\.(a1|a2|...|a300)\\.""

Now, with 300 (or more) node names chances are that this rx: string will be a very long one. Normally that doesn’t matter much - it is generated for me and the /dash URL would look like /dash?ix=my.group.of.300.hosts as per the index id. Yet there are cases where such filter could “leak” into the URL (including via the “Merge with user filter” UI feature allowing one to get an index-defined filter populated in the user filter form for them).

Another case would be to apply an aggregate function to a lot of objects and then click on the image to load the “Show” page. Since the resulting URL needs to contain the full list of the aggregate object “members” it can become very long.

In such cases I would sometimes hit the browser limits on how long a URL can be so I enhanced all of the places I knew could result in too long URLs to detect the situation using some simple javascript and switch to use POST instead. Such views are unfortunately not share-able but still - they are not broken (and its usually easy to share the steps to get to such a view).

Since the “max url length” limit can depend on a lot of things (including http proxies along the way), the actual value is configurable in SMG.

Search and auto-complete

Initially the “Configured Indexes” page would cover most of my team’s needs for discovering Indexes and graphs. That is still the “main” SMG page as of today.

But I thought that I could have a faster (than /dash) search UI which would only list indexes and objects but not actually display graphs for them (which would be slower).

This is when I implemented the Search page. It represents a fairly large search input box and a couple of buttons - “Search” and “Graphs text regex filter”.

The first one will treat the input box as search terms and will look into all local and cached remote objects “text representations” and yield indexes and objects matching them where indexes would be displayed first and individual objects after. For me that is a faster way to find Indexes (the previous alternative would be browser search in the Configured Indexes page but that is only limited to actually listed there indexes, which are normally limited by levels)

The “Graphs text regex filter” would directly post the form to the /dash page as a “Text regex” (trx) filter. The alternative to that is to load the “All graphs” page (/dash with no filter) and use the trx field there. But that is slow and involves displaying a the full first page from that view before using the filter which is not necessarily very fast.

In addition to that I decided that I can implement some simple auto-complete in the filter forms. I would tokenize all object ids (as separated by dots) and keep all unique tokens as auto-complete suggestions. This could work on multiple levels - considering all adjacent tokens as “larger” ones in pairs, triplets etc up to max level to avoid too much memory usage.

The “search database” would actually be simple in-memory data structures convenient to be used for search and auto-complete reasonably efficiently. And similar to the SMG config it would also be immutable for the duration if its lifetime - it is rebuilt from scratch on every config reload. This can be slow on a “Central UI” instance talking to many “worker” instances so it happens asynchronously, after config reload.

Horizontal scaling - (External) “Sharding”

I knew that eventually a single SMG instance can not keep up with all the stats we want it to track for an entire large data center.

On the other hand Chef can still do fine with a lot of Nodes and I have never had to run more than one Chef server per data center. That meant that I could decide to logically split a data center to two, where for every host chef will decide which host belongs to which part or “shard”.

In our case we have a large number of application servers and related infrastructure but also large number of content distribution related nodes, Analytics and what not.

So I would define two SMG roles, e.g. smg_appservers and smg_default. Both roles would run the same SMG config recipe which will dump all target nodes from chef together with their roles and metadata. Then for every target node and based on its own role and the target roles it will determine if the target node “belongs” to this instance or not.

I would list all “appserver-related” roles in a big if condition in the recipe which will determine if the node has any such role and if so it is deemed to be in the “smg_appserver” group. All others would be in the “smg_default” group.

Then if the node does belong to “this” SMG instance (group matching smg role) that will result in generating yaml files resulting in SMG monitoring the objects defined in these yamls. And if it does not belong - any configs related to that node are deleted. The actual yaml files are one or more per node (normally - one for the system stats and any additional per-service ones) and follow a naming convention including the node name and suitable for managing automatically,

Note that if we change a node roles in this setup it is possible for its monitoring to “move” from one of these SMG instances to the other. That means that history can be lost but normally a node history only makes sense in the context of a given role so that’s not a big deal.

My Chef recipe would also need to handle the “node deleted” cases but once that was sorted out in the recipe I had a pretty much automatic “sharding” system to logically split the DC into two and handle each part monitoring by a dedicated SMG instance.

I have never had a case to need more than two “system” instances per a DC even in the largest one we have had. And not all of them would even require two SMG instances.

So that would be pretty much good-enough for my “horizontal scaling” needs. But I still had room for “vertical scaling”, in the form of batch updates.

Batch updates

In most of the use cases I had and with rrdcached doing the I/O I was rarely having troubles with the updates. Yet I had some instances dedicated to haproxy/LBs stats where the polling wasn’t supposed to be that expensive. Yet they were slower in processing than what I was expecting. It was apparent that with the large number of updates doing all the updates using a fork and then exec rrdtool was becoming the bottleneck.

So I decided to make it possible to bypass the rrdtool update commands altogether - rrdcached uses trivial text based protocol for the updates so I added an option for SMG to do the updates in larger batches (e.g. 1000 or more) which would be sent to rrdcached in a single external socat command.

That could easily eliminate the fork/exec bottleneck but there were more to address.

pass_data

In most of my templates I was leveraging using local temp files to pass data from parent to child commands. The pattern would often be for the parent command to do a remote call and store the output into a temp file and then all the child/object commands would each parse that and extract the data from it.

That usage was so common that I decided that I can support it in SMG by adding a new flag to the pre-fetch commands named pass_data. If set to true SMG will keep the command stdout output in memory and pass it to all child commands directly from memory without any disk I/O involved.

Later with the “plugin commands” concept that could be further optimized to also parse the data just once.

ignorets and data_delay

Normally SMG will update the data with the timestamp of the highest level parent command which doesn’t have an “ignorets” setting. That setting is normally set on the top-level ping command as that is not actually getting data. Then the children of that are normally fetching the data and this is the timestamp we want to use for the updates that will happen shortly. But that assumes that the data retrieved is actually “live” and valid for the time of polling.

I had cases with CDN vendor stats where these would come with certain delay (e.g. 30 min) so I needed a way to tell SMG to update the data with some offset in the past. That offset would be set via the data_delay property which is set at the update object level and represents a number of seconds.

Randomized delays

I mentioned some cases where I would update some stats hourly e.g. based on previous hour logs or other reports based on such hourly logs. In the later case these reports might take some time to generate and come a few minutes after the hourly rotation. So originally I would have a script run by an hourly update object which would first sleep for a few minutes, waiting for the log-based reports to complete and then get numbers for RRD updates from these reports.

Yet sleeping for long and blocking one of the per-interval pool threads is not great. But since I was using Akka Actor messages to trigger the commands execution and these support delayed sending, it wasn’t too hard for me to implement the delay option.

That would be a number in seconds which has to be smaller than the interval, so for an interval: 60 the delay can be 59 seconds or less. That sets a fixed delay to the time the command will run.

I would set that to 600 for my hourly stats relying on reports and these would be scheduled to run around the 10th minute of the hour.

But that number can also be negative which means to randomize the delay up to the “interval” minus the absolute value of the negative number. E.g. with “interval: 60” and “delay: -1”, the actual delay will be a random number of seconds between 0 and 59 (60 - 1) seconds.

That can also be used to spread some checks at random times within the poll interval if that is desired.

rrdcheck

With RRD tool and files, one defines the structure of the file upfront and that doesn’t change. The actual structure represents the number of vars/timeseries together with their allowed min/max values and of course - for each of them - a few RRAs.

All other SMG object meta-data is technically part of the SMG configuration and can be easily changed but the file structure - not.

Eventually it would be possible for the file structure and config to get somewhat out of sync - e.g. if i change the rra_defs for an already existing object, that will do nothing. Same with the min/max values, all of these are part of the RRD file and do not change for the lifetime of the file.

I decided that I need a way to find and fix such discrepancies and implemented the RRDCheck plugin which can scan all local SMG objects for such issues (discrepancies between config and RRD structure) and is able to fix them.

The actual fix involves re-creating the rrd from scratch using the new conf but copying over the data from the existing RRD.

Save states

My in-memory data based monitoring checks would be quite efficient and fast. But since they are in memory - they are lost on SMG restart.

SMG tends to be restarted rarely - almost exclusively for a new version deploy. Yet when that happens I still don’t want to lose all silenced/acknowledged states and ideally also want to preserve the recent state values too.

To address this I implemented json serialization of the all monitor states and saving that to disk on shutdown. Then on startup it will load all state from disk and preserve all monitor states from before the restart.

All this serialization and loading can take a while so it may take an entire minute to do a SMG restart with lots of states. Yet that wouldn’t happen as often so it would be OK. I may decide to spend some time optimizing that (certainly some possibilities there) but that has never been a high priority for me.

In addition I exposed the “save states to disk” function via API. So its possible to schedule regular saves (e.g. hourly, via cron) to protect ourselves from losing any significant amount of human initiated actions (like silence/acknowledge) in the unlikely case that SMG crashes (but also cases like power reset)

JMX plugin

The initial version of the JMX plugin was quite fat - JMX objects would be defined in the (separate from SMG’s) JMX Plugin config in its own custom format. The Plugin code would take care of parsing this config and then generating and exporting SMG objects through the respective Plugin API.

But it would re-implement large portions of the yaml config parsing and then large portion of the polling logic. I didn’t like that whole lot but it was good enough for a while - I had native JMX support in my monitoring system which wasn’t available with a lot of monitoring systems at the time.

Later I was able to address these issues with the “plugin commands” concept when I essentially re-implemented the JMX plugin.

Guaranteed notifications delivery

SMG uses external commands to actually send notifications, whether via e-mail, PagerDuty etc. Of course these commands can fail. And since you can define multiple such commands for given alert it is possible that some will fail and some will succeed.

In the initial implementation I was considering the notification “success” if at least one of the notifications commands succeeded. For a long period that would be mostly a single command so the notion of “at least one” is kind of irrelevant. Also our main notification commands would be quite resilient:

the mail command would send a mail to a local postfix relay which will work even if the network is disconnected (and then you will get a lot of spam from Postfix once the network is restored)
the PagerDuty script is ported from the Nagios version and the way it works is that if it fails to send the alert it will write it to a local file/“queue”. Then one can setup a */5 cron job to run the script again with an option telling it to process the “queue”, until it eventually succeeds.

Once a notification is considered “success” the notification will not be triggered again until the specified notify-backoff time has passed.

But with the team growth and increase of the notification commands involved I decided to improve the implementation. Instead of treating any successful command as a “notification success”, I would track these (and their backoff time) separately.

So if I get an alert which needs to go out through two commands and one of them succeeds but the other fails, only the first one will be marked as success and the backoff time will start for it. The failed one will be considered failed and on the next minute/poll interval if the state has not changed it will again try to execute that failing command. So it will essentially retry until the notification command succeeds or the alert clears.

I have been thinking of other use cases for this feature, like maybe even running a command to restart some failed service automatically or in general - react automatically to monitoring events with arbitrary commands and with some predictable error handling.

Mute/Unmute

Sometimes although rarely (and usually when there are no other options) we would intentionally do short lived “maintenance windows” where we would intentionally stop the end-user traffic, do the desired maintenance and turn traffic back on.

That could cause a lot of alerts from various systems and knowing them all upfront and silencing them wouldn’t be feasible. With Nagios (and SMG before the Mute function) we would simply stop it before the maintenance, and then start it back after the maintenance. But that’s not great either cause we end up “flying half blind” during the maintenance window, where normally I would like to see the alerts in the UI, just don’t get notifications/e-mails.

For that I implemented a “global” (per-instance) Mute button available in the Monitor UI. That does just that - prevents all notifications to go out. During such period all notifications commands are “short-circuited” and assumed successful without actually being run. This matters for calculating backoffs etc.

Of course this can be dangerous so the recommended usage of the Mute button is indeed to be used only for planned maintenances or ongoing outages where you are aware of whats the cause and don’t need more alerts to tell you that something is broken.

Custom dashboards

We would normally have our most-important (KPI if you want) Indexes near the top of the Configured Indexes page and we would normally check these first when we suspect issues or just to see how we are doing. Unfortunately in order to view a few separate indexes one would have to open them in a few tabs and was not necessarily easy to get such unrelated indexes on the same /dash page.

Also people wanted to see an “Operations Dashboard” on a big screen where one could see such KPI metrics at a glance. So I decided that I will have “Custom dashboards”. The Custom Dashboard would represent a page with “items” displayed as a CSS “flex” grid. The actual “items” could be (lists of) graphs defined either individually or via Indexes so ultimately - could display graphs from unrelated existing indexes on the same page. But these could also include stuff like “monitor states” item (showing somewhat condensed view of current non-OK states), “monitor log” item and also including external images or even entire pages via iframe item. There would also be a “container item” which could in turn hold other items and helping with creating the desired layout.

A Custom Dashboard is defined via a “- $cdash:” definition in the yaml and can end up quite big. There is an example in the SMG git repo.

These would show up as a SMG “Menu items”, on the left side of the “Search” menu item.

A stable system

This was also a time with long period of SMG stability. There were long periods with no SMG changes at all or only very minor ones. A lot of minor (or UI-only, like the Custom Dashboards) changes would only be applicable only to the “Central UI” instance I was running so I wouldn’t even bother updating all worker instances with every latest build. I am pretty sure that I had SMG instances with uptimes above 1.5 years at some point.

But that was about to change :)

Part 5: The world has changed

In the mean time Smule grew - both in terms of users but also in terms of a team size. We now have a full-blown office in Sofia Bulgaria with more than 150 people just here. And as of 2019 I am also a happy Smule Bulgaria employee.

I surely understand people moving to a new company and frowning upon the idea of a “homegrown” monitoring solution especially one a bit rough around the edges. And they have likely used fancier Grafana UIs in previous companies and a lot of them would hype about Prometheus.

But guess what - I am still the “Monitoring guy” and I am still able to handle most (if not all) of our production systems monitoring configuration by myself. And with help from SMG that certainly does no take all of my time (for reasonably long periods I do nothing about it as its all automatic). And I find it funny that people are happy to forget about Prometheus and Grafana once they realize that they could get the monitoring done for them by someone else :)

Still Prometheus seems to have taken over the Monitoring world and is the most popular tool today. It deserves its own section.

Some notes on Prometheus

It certainly has some advantages over SMG including a query language used to pull and aggregate data in more flexible ways. Its TSDB is very efficient and can keep up with millions of metrics. And in general Prometheus is probably the best tool in the world for monitoring newly built http (micro-) services … and nothing else. Because Prometheus only talks HTTP(s) and requires the response to be in a specific “OpenMetrics” format.

What this means that to support anything which was not specifically built to be monitored by Prometheus one needs “exporters”. These would be local agents/services exposing the necessary http end-point in the necessary format, making it convenient for Prometheus to “scrape”.

These exporters themselves would probably talk to the service (thinking about stuff like redis/mysql) using its native protocol to get its built-in statistics and also verify that its up and running in the process.

I don’t know what everyone thinks about this but for me personally it is flawed - there is now another point of failure in the monitoring system and it is less reliable than it could possibly be. This can be a source of false positives - the service is up but the exporter is down which is actually not a big deal. The real problem would be if the service is down but the exporter reports it up and reporting - a false negative. One could say that this would be a bug in the exporter but imagine the following situation: you are monitoring a backend service like redis or mysql and some sysadmin accidentally closes the service port on the local firewall (or as another example - mis-configures the service to only accept local connections). The exporter will still be able to connect and correctly report stats but your application servers can no longer access their database … And I don’t think that would be a bug in the exporter - more like a design flaw.

Proper service monitoring should involve the same host/port/protocol as the actual service clients and ideally using their native clients. It should be the same or as close as possible as real clients. Period.

At a higher level it may sound very nice to unify and simplify the monitoring concept using “metrics” but unfortunately that’s not how the real world works. That unification layer ultimately aims to hide a lot of the complexity involved in monitoring but hiding it does not actually remove it - it is there to bite us and obfuscating it does not necessarily help.

I have some other issues with Prometheus - in order to use effectively you need to use a bunch of other services too. That alone rings a bell in my mind - multiple points of failure.

Prometheus does not send alerts on its own - you need a separate service for that. In addition Prometheus does not do old data aggregation the same way rrdtool does. A lot of people use InfluxDb with Prometheus to address that - it does have built-in aggregation and compression of old data. But that’s yet another dependency and a point of failure.

There is another potential trap with people used to Prometheus and its ability to swallow millions of these metrics. People tend to expose so many stats that chances are that you end up with a few important metrics and a million of things to filter out and ignore. Of course that can be considered and arguable opinion so feel free to disagree :)

Yet there is more - the Prometheus graphing UI is not great and most people don’t use it directly. Instead people use Grafana - a fancy JS based graphing front-end supporting different backends including Prometheus. That by itself is not necessarily a problem, albeit being another point of failure (that would be the least important one IMHO compared to the others). Using JS-based charts can make even modern browsers struggle if you want to display too many of them and it is certainly cheaper to display PNGs.

But the ultimate issue for me with Grafana is that someone has to create and maintain these dashboards, via the UI. There are options to import these from json-based config/sources but certainly not as convenient as text files on disk … for me that is not an advancement but more like we are back to the Cacti days.

Note that with the tons of metrics developer can now (and tend to) expose, the Dashboard is the “thing” which now encodes the knowledge about which are important stats.

Suddenly we have (at least) 3 separate systems with their own configurations which in turn need to be kept in sync. And at least one of these systems expect people to click on UI to manage them.

Compared this with SMG where I can define what to monitor, what are the important stats to view and what to get alerts about within a single template file. Times easier to maintain but also much less prone to errors.

For me that can no longer be the “Monitoring guy” job - it needs a “Monitoring team” to handle all of that and IMHO with all chances to be with degraded quality - I wouldn’t envy that team’s laborious and error-prone job. And I am certainly not going to be the “Monitoring guy” managing such system.

The Scrape plugin

Whether I like Prometheus or not - it doesn’t matter. The world is adopting it and tons of services these days come with built-in /metrics end-point and that that’s how they expect to be monitored. I needed a good way in SMG to support monitoring via such end-points.

In Prometheus one does not define what to expect from the end-point. It will accept whatever it gets and update its database accordingly. That doesn’t exactly fit with SMG’s strategy where you define what to expect form given service and if something is missing - chances are that this might be a problem (which is how monitoring should be defined IMHO). But again whether I like something or not does not change the world and I needed a plan for that. I decided that my new “Scrape” plugin would pull the /metrics data and generate SMG YAML configs for all the objects it finds into a dedicated scrape-private.d directory which the regular SMG config will $include and use.

The actual generated config would have one pre-fetch command to get the metrics, possibly using curl as an external command, then for every stat in the metrics output there would be a single SMG RRD Object defined. Its command would parse the metrics, find the metric it was defined for and output its current stat values. But it wasn’t very efficient to parse the data for every RRD Object (often there would be thousands of these defined via single /metrics end-point).

Somewhere at that time I decided that I have enough use cases to extend my Plugin interface to support “plugin commands”. Also extended my internal SMG “ParentData” concept to support structured data (via simple reference into JVM objects) in addition to the so-far String-only (stdin/stdout) passed from parent command to a child command. The use case for that would be to parse all the stats in a single plugin-implemented “command” which would result in a simple (immutable) Scala Map of stat names (keys) -> stat values. Then I could pass that “ParentData” to (another) plugin-implemented child command to actually do the map lookup and get the data. All of the individual leaf commands (RRD Objects) in my commands tree will only run such efficient in-memory hash lookup “commands”. The Plugin commands would have a special syntax - starting with a column and the plugin id like :plugin followed by command parameters. The way the generated “commands tree” would be structured would look like that:

command: curl -f .../metrcis
    command: :scrape parse
        command: :scrape get metrics_stat_1
        command: :scrape get metrics_stat_2
        ...
        command: :scrape get metrics_stat_1000
        ...

Eventually I decided that I can remove the curl command from the “picture” and do the HTTP GET + parsing in one shot, where the http part would be implemented Play’s native HTTP client. That’s probably a bit more efficient but I will leave it to the reader to say which is more reliable (and to be trusted) determining that a http end-point is up and successfully getting data from it - a curl command or the built-in http client. With SMG I have the option to choose on case by case basis :)

Yet there is one more issue to be solved. The Prometheus format has a stat “name” followed by key->value “labels”. These labels can represent dimensions but ultimately they are a unique string within the metrics output mapping to the stat numeric value. Yet SMG does not have “dimensions” in its database - there are object ids which hold one or more time series values. So I decided to just map the entire name+label string to a unique SMG-compatible object id. That can work by simply replacing all commas (and equals signs) with dots and also removing any quotes and other disallowed characters are replaced with underscore. That is resulting in a (possibly long) unique ID which can be constructed from the Prometheus stat more or less unambiguously. In the (very rare) case of two stats from the same /metrics data resulting in conflicting SMG object ids I would append a ._N suffix where N would be the positional number of the id among the other conflicting ones.

This still has a few potential issues:

Some metrics SMG object ids can become too long to fit within the OS filename size limit (and these become rrd filenames)
Although Prometheus docs explicitly state that people should avoid using high-cardinality labels (dimensions) my real life experience shows that this suggestions is not necessarily always followed.

Because of these the SMG Scrape plugin would have a special option named labels_in_uids. If set to false, SMG will not include any labels in the object ids but replace them with positional ._N suffixes. This lets the “Monitoring guy” decide whether specific /metrics end-point will have its object ids include labels or not, depending on their format.

This all works albeit being a bit hacky. But it more or less suffers from the same issues Prometheus has - like it will blindly (generate configs for and) scrape any /metrics end-point you throw at it but that alone does not solve our Monitoring needs - in a lot of cases I still have no clue about which are the important stats to display in a dashboard and define alerts for.

That still requires human knowledge, and that can in theory be encoded in SMG using Indexes but I was hoping that I can do better … and my attempt at that would come a bit later with the autoconf plugin

Hashed rrd sub-dirs

With the Scrape plugin in place I would find myself in a “graphs explosion” situation. I guess it is partially because of Prometheus’ quality to be extremely efficient in scrapping and storing tons of metrics developers would end up abusing that a lot. I can understand the line of thinking where since its now very cheap to add a metrics counter and then to store it it is always safe to opt for adding a new metric if in doubt. You know - its easy to ignore something but not possible to get it if not there.

Well I don’t necessarily agree that it is “easy” to ignore “some thing” if these “things” are the overwhelming majority of all. Maybe it is expected that someone will create a Dashboard only showing the important ones … and maybe I am ranting pointlessly.

But the graphs explosion was there and I needed to deal with it. Up to now it was good-enough to just keep all rrds lumped into a single dir. SMG never actually lists that dir (its like a key-value storage) but still - it is not a good idea to have more than 100K files lumped into a single dir.

To address this I implemented a new SMG config feature named $rrd_dir_levels which would be a column-separated list of numbers. Each number would represent a “level” and the number will indicate the character length of the dir level. An example could look like this:

- $rrd_dir_levels: "1:1"

With such global in place, when SMG needs to determine the RRD filename on disk for given object, it would calculate the MD5 hash of the object id and then take N (1 in th example) characters from it for each level and construct the directory name form these. For a hash like a0b1c2… that would mean a sub-dir named a/0/ … The “1:1” example means that there could be up to 16 * 16 = 256 sub-dirs created given enough objects and reasonably random MD5 values distribution.

This is a common approach I’ve seen used in popular open source software (including Nginx, Postfix) to handle tons of files under one “root” directory - and its not like I invented it :)

One caveat is that currently SMG does not support changing this value. If you change it after objects are created, chances are that they will be lost. Of course it would be possible to implement such a feature in SMG but I never had a use case for it. Later with K8s I would just know to start with something like “1:1” from the start and that is present in the example SMG k8s deployment yamls, part of the open source repo.

Docker

These days it is modern to distribute software using container images. That’s actually required for stuff to run in Kubernetes. Previously I would release SMG as a simple tgz - you only need Java to run the application after unpacking the tgz.

However in order to use it effectively you need to install all the clients for services you want to monitor, stuff like a redis client, mysql client etc. In my view this shouldn’t be a big deal for companies actually running and relying on these services. But with the container concept I can actually bundle a lot of clients in the image and cover all use cases I care about. And its trivial to add more clients by creating a new image based on the base one and installing there.

So my “official” SMG image is quite “fat”. I think that’s a fine price to pay for the convenience and reliability you get. And in a Kubernetes environment I use SMG as my troubleshooting/“jump” pod. I get a shell there and can explore the in-cluster network and services having all the network service clients I need handy.

The image expects you to mount two directories - a config and a data dir, where the first would be managed externally (e.g. by Chef via generated configs) and the data dir is normally managed by SMG itself and that will include all timeseries data in the form of RRD files.

The CommonCommands plugin

In the mean time Covid came to us. With limited travel and entertainment options I ended up having some time to optimize some stuff.

With the “plugin commands” I had implemented for my Scrape plugin and the ability to pass structured data (vs just the stdout string) from parent to child commands I realized that I could optimize a lot of my templates if i had some built-in “commands” replacing the need to use actual external command for parsing and outputting numbers. So I implemented the CommonCommands plugin.

Initially i would implement simple replacements for the common grep/cut/awk/sed bash stuff I was using to parse the parent stdout data and output the specific numbers we care about from it. After parsing the data would be kept in a convenient lookup format (a Map) so the leaf commands outputting data would only have to do a map lookup vs the same parsing for each instance.

Commands would look like that:

command: :cc _subcommand_ _params_...

Soon after I implemented native csv parsing support in that plugin which greatly helped improve my haproxy LBs stats monitoring throughput (to the point that I could drop a few dedicated to haproxy stats monitoring SMG instances).

Another useful sub-command would be the :cc rpn … sub-command. That can compute arbitrary expressions based on input values passed from a parent command as a list of numbers. The use case for that was to calculate the cache hit % from some CDN vendors providing separate “end-user” and “origin” traffic numbers and we wanted percentage.

This plugin is likely going to be extended a lot over time - the original “bash commands” concept is still perfectly fine to use at smaller scale and at larger scale any inefficiencies that come up with size can be addressed via a built-in “command”.

JMX Plugin revisited

I mentioned that my original JMX plugin implementation had some issues.

With the “plugin commands” concept I actually scrapped most of that and rewrote it from scratch. I could fit the JMX monitoring again with the parent/child commands pattern. The plugin would expose two commands - con and get:

command: :jmx con _host_:_port_
    command: :jmx get _host_:_port_ _jmx_path/value1_
    command: :jmx get _host_:_port_ _jmx_path/value2_
    ...

The “con” command would instruct the JMX Plugin to check if it already has alive connection to this host:port and if not - establish one. If that fails, the JMX server is likely down so that will result in FAILED state and the child commands will not be run at all.

The “get” commands will actually use that connection to retrieve values from the respective MBeans etc.

Internally the JMX Plugin would keep JMX connections persistent and keep all connection handles in a Map keyed by host:port. It also keeps an up to date “last used” timestamp for each connection. Then on every plugin run it would scan these active connections, find ones unused for at least 3 intervals and does a cleanup including closing the connection. That’s all the Plugin “run” will do, the actual polling and object definitions are now in the SMG core and the plugin is used via these two commands.

The current state of the plugin can still use some polishing but is already a huge improvement over the original quite bloated version.

Kubernetes and the Kube plugin

These days its modern to deploy workloads on Kubernetes cluster. I am not going to explain what Kubernetes (K8s) is - I am not the biggest expert and certainly not a fanboy.

Yet a lot of people are excited about it and we also wanted to experiment with it in Smule. I ended up setting up a couple of clusters from scratch and got reasonably familiar with the platform.

As far as Monitoring, Kubernetes supposedly has that solved - k8s makes it easy to deploy a few components (“micro-services”) and make them talk to each other (“orchestration”). That nicely fits with the mentioned above common Prometheus setup which would normally include Alert Manager and Grafana.

These are often wired together by an “operator” or a “helm chart” making them somewhat looking like a single integrated component. Yet they are not. And most of my rants about Prometheus and Grafana above are still valid IMHO.

Other than that Kubernetes does change the process about how one provisions, configures and runs services. Our “Chef ways” wouldn’t be applicable any more. We would still use Chef to bootstrap nodes and have them join a cluster. But what actually runs on these nodes after that is no longer controlled by Chef.

With Chef we would encode the installation and configuration process using one or more recipes and a bunch of config files and/or ERB templates used to configure it.

With Containers (and by extension - K8s) the software installation part is handled by the container build process (often - docker) and it is described in the form of a Dockerfile. It is generally trivial to “translate” a chef recipe to a Dockerfile and normally - not too hard to translate a Dockerfile to a chef recipe (only extra complexity with Chef is that one needs to think about the current state, before the change, with Dockerfiles, everything is applied cleanly as if “from scratch”). So from that perspective moving a chef-deployed service to K8s wouldn’t be such a big deal.

The service “Configuration” part however is encoded in the Kubernetes deployment yamls in some form (often ConfigMaps, but there are other options too). So to migrate some Chef-managed service to K8s I would just have the installation part in the Dockerfile and the configuration part to the K8s yamls and still not a big deal.

I will leave it to the reader to decide whether such separation is a good or a bad thing - as with a lot of things in life it possibly has benefits but also possibly - has some costs associated with it. I think the idea is that the development teams could build the containers and then “operations” teams will own the deployment yamls. Whether this works well is an organizational questions (totally outside scope of this writing), my gut feeling is that “developers throwing containers over the fence to operations” does not end well.

But there is one more fundamental change for me and my Monitoring system - I no longer have my central Chef database knowing all nodes and what is expected to run on each of them (which also means - it knows what to monitor on them).

Of course Kubernetes has its own and more advanced way to achieve that - since it actually makes the decisions (based on the yaml descriptors) about what to run where it also has that information available via convenient APIs. Internally K8s stores that information in an etcd cluster (which normally represents the K8s “masters”) and makes it available via HTTP-based APIs (and protected via Role-based access).

Prometheus itself has a way to use these APIs and “auto-discover” what to monitor. A naive approach is to have it just test all pod/endpoint/service/node (k8s concepts) ports whether they will successfully return HTTP response on a /metrics URL. That doesn’t work well and you could see some weird errors in the logs caused by Prometheus trying to make http calls to a non-http service port. A better and widely adopted approach is to instead use K8s annotations (which are sort of an extension point for any K8s objects and these could contain custom key/value pairs). So instead of trying to scrape everything it would only scrape annotated objects. The annotations itself could customize the scrape target port and URL path and could look like this:

  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9999"
    prometheus.io/path: "/metrics"

The actual annotation keys are configurable in Prometheus but the above are kind of a default / “standard”.

Of course the port would still have to be a http end-point returning metrics at some customizable path. Pretty much all of the “system” K8s services (and any “cloud-native” ones) are usually exposing such monitoring end-points and this is kind of a standard.

I already had my Scrape plugin capable of dealing with Prometheus-type of metrics so I just needed the auto-discovery part. And of course as with a lot of other stuff that would be a Plugin, at least initially. I also needed a Scala/Java client library to be able to talk to the K8s system APIs and went ahead with fabric8.io - a great library working fine for me and without too many surprises.

So my new “Kube” plugin would work like this:

On every run (which by default is every minute), it will list all K8s objects it was configured to list.
If not configured to filter by annotation it would work the same way as the Prometheus naive approach - probe all ports for metrics and use ones which result in such.
When configured to use annotations (which is the recommended way) it will use the annotated objects and the specified port and path.
For every such discovered host/port/path to monitor it would generate a “scrape target” config - a yaml which the Scrape plugin understands and based on which it will actually generate SMG yamls resulting in monitored objects.

That wasn’t too hard to implement and I was able to get it basically working quite quickly. I don’t necessarily love the fact that the Kube Plugin is generating configs for the Scrape Plugin which in turn is generating actual SMG configs. But I guess with K8s its all about layers on top of other layers so not a huge deal.

I also thought that it would be a good idea to have graphs from all “kubectl top …” numbers. K8s has an API for that and fabric8 supports it so I introduced some special :kube plugin commands to be available for use in templates. I made these result in OpenMetrics compatible format so I could hook a command like :kube top-nodes (there are also top-pods, top-conts) as a scrape plugin command, and the Scrape plugin would handle the SMG config generation from there.

One property of K8s object instances is that these tend to be quite dynamic. The “Kubernetes way” of making a service healthy would be to start a new instance elsewhere (possibly “auto-scaling”) and scrapping misbehaving pods. Normally any stats coming from given pod would include the pod name (to be distinguished from other pods with the same metrics). These names however contain some random-looking strings ensuring the object instance uniqueness. E.g. a Pod managed by a Deployment named my-service could be named like my-service-01234567-qwert. How many variable components an object name has would depend on the K8s object type. If used “as is” that would mean that I would have a new object generated under the new name for every deploy and effectively lose history. I am not exactly sure how that is handled in Prometheus and Grafana but since the number of extra name components is known (depending on the object type), I could simply strip these and replace them with a numeric index to get some more “stable” object ids which also get long term history. So in the above example my-service-01234567-qwert would become my-service._1.

That was OK-ish eventually but I was still not entirely happy about it. I had a lot of Indexes automatically generated to put some structure on all the metrics stats so it was still browseable. But often I would be lost in the thousands of metrics exposed by some service pod without much clue what is actually important. I guess that with Prometheus and Grafana that part was now moved to Grafana where people are supposed to click on UI to create and tweak Dashboards. Of course there are tons of readily available Grafana templates created and shared by people and dedicated to various service types (including tons of “Kubernetes System Dashboards”). Yet these rarely work 100% out of the box and do require some tweaking which for me is again almost the same as going back to the Cacti days.

The Autoconf plugin

The original SMG idea was to implement a monitoring system which can be managed easily and almost exclusively via Chef or another “Configuration management” system having information about all hosts and services but also having an easy way to output configuration files based on templates which SMG would use. We have been using Chef but SMG should be possible to use with any similar system like Puppet or the more modern Ansible.

In all of these cases the actual SMG config templates will reside in the Configuration management repo and more often than not the templates could have some custom logic in addition to simply doing variable substitutions. With Chef one could code pretty much anything ruby can do in the ERB templates. I had idea about creating some generic and open-source templates and even had some bundled examples already. But the intended usage of those would be to use something like sed to do variable substitutions and that would be it.

For example I wanted a generic “host” template but not all Linux hosts are the same. All machines have cpu and memory (and probably - some disks) but the number of CPU cores and disk configurations can certainly vary across machines. With Chef I already had information about the hardware (in the form of node attributes populated by its “ohai” system) so I would use that information to generate the proper template for each host which as mentioned can depend on hardware configurations. For that a simple variable substitution will not be sufficient - I also need at least “if” conditions and loops. I looked around and decided to use the Scalate Scala template engine for that. It supports a few different syntaxes but one of them is very close to the Chef ERB templates I was using so that would make it slightly easier to port any Chef templates I might want to.

I already had some Scala code to generate SMG configs built into SMG - that is exactly what the Scrape plugin would do. But that was only dealing with the specific fixed OpenMetrics format and would produce “uniform” lists of graphs from them.

But I thought to myself that if I had real template engine built into SMG I could do better than that and automatically generate SMG configs based on templates and some “runtime data” which can come from an arbitrary command, just like any SMG object or pre-fetch command. In theory that could even generalize the Scrape plugin case - the “runtime data” will be retrieved using the scrape command and the Scalate template combined with that data could result in a SMG yaml config file. That has not yet happened as of this writing but is certainly something which is lurking in the back of my head.

But the autoconf plugin was getting into shape. Its config would be a list of “targets”, similar to the Scrape plugin. But instead of “metrics” URL it would have to provide a template name (essentially the “type” of the service) and a command. It can also contain arbitrary yaml objects which are passed directly to the template as the “context”.

Such target configuration can be “static” and does not require “runtime data” (and a command) at all or “dynamic” which uses the output of a command (usually - a list of strings from stdout) as the “runtime data”.

For example “static” templates could be used for systems like redis and mysql where you know beforehand what stats to expect and need objects for. E.g. the result of redis info command can easily be translated into a “static” template because the key/values you will find there will probably have the same keys on every redis install with the same version. But OTOH for a generic Node template it needs to be “dynamic” and reflect the hardware configuration of the node. Thinking about how to do that (and use SNMP which would be our default choice for system stats monitoring internally) I realized that the main variable parts would be the disk and network configurations and I could get these using a couple of snmpwalk commands like this:

snmpwalk -v2c -c$COMMUNITY -mall $HOST hrStorage
snmpwalk -v2c -c$COMMUNITY -mall $HOST if

So I would wrap these two in a command named smgscripts/snmp-walk-storage-network.sh and use that as the “runtime data” command. The actual template would parse these on its own and generate the necessary disk space and network bandwidth graphs accordingly (the rest is pretty static).

Eventually we started testing and using some Cloud provider -provisioned machines in various Points Of Presence (POPs) across the world. We would use these to add small scale proxies closer to the end-users with hopefully better connectivity from them. These POPs would sometimes be very small footprint (possibly a single VIP backed by a couple of hosts in a HA pair) so It wouldn’t make sense to setup a full chef+SMG monitoring infrastructure for these. In such cases I would monitor these from the closest Data Center we have to them. In at least one such provider SNMP was filtered at the network level and I wouldn’t be able to poll my remote instances for system stats over SNMP. But these days a lot of people use Prometheus’ node_exporter to get the same/similar stats which are otherwise available over SNMP. I could access the node exporter http port so I had a way to get system stats just didn’t have the proper template.

This is when I created the node-exporter template. The result from that template is more or less the same as the result from the SNMP based template (originally “borrowed” from Cacti itself).

Eventually I implemented some more templates, including ones covering haproxy stats, redis and mysql and a bunch of others are in progress.

The actual plugin conf defines some sane defaults but it is still possible to tweak these by editing that file (or overwriting it via k8s/docker configs). It also defines a list of targets which can be actual targets (a bunch of commented out examples there) or $include definitions allowing me to include some “drop in” directories where I could add/remove target yamls (or possibly added/removed by Chef or K8s). The defaults would be these:

- include: "/etc/smg/autoconf.d/*.{yml,yaml}"
- include: "/opt/smg/data/conf/autoconf.d/*.{yml,yaml}"

The first one is intended to be used in “bare metal” setups and the second one - to be used in containerized environment where it might make sense to have just one SMG volume vs separate config and data volumes.

The plugin configuration also defines one more very important property - the conf_output_dir (default is conf_output_dir: “/opt/smg/data/conf/autoconf-private.d”). Similar to the Scrape plugin’s conf output dir, this one must be included by the regular /etc/smg/config.yml conf. This is already the case in the “official” SMG container image - the generated by the Dockerfile conf will $include that dir too.

As I mentioned the autoconf plugin could potentially consume large portion of the scrape plugin (the scrape plugin SMG config generation can be just one of the many templates autoconf supports) and it follows similar patterns. With that in mind I decided that I can extend my Kube plugin to fully support autoconf templates. This would be an extension to the scrape annotations support I already had (prometheus.io/scrape: “true”). So in addition to these I could add some new smg-specific annotations, here is how an example using my haproxy template:

  annotations:
    smg.autoconf/template: "haproxy"
    smg.autoconf/runtime_data: "true"
    smg.autoconf/port: "stats"
    smg.autoconf/command: "curl -sS -f 'http://%node_host%:%port%/haproxy?stats;csv'"
    smg.autoconf/filter_rxx: "stats.frontend"

This implies that haproxy is configured to serve stats on the /haproxy?stats URL on the “stats” port as defined in the k8s yamls. At run-time the Kube plugin will replace %node_host% with the respective Pod/Endpoint/Service IP and %port% with the numeric port named “stats”. The annotations syntax is flexible - the stuff after smg.autoconf can encode multiple autoconf target configs within the same annotations block, e.g. smg.autoconf-1/… vs smg.autoconf-2/… The part after the / is the actual context property name but that can also encode list values and map values using special prefixes like _list and _map (and such types of context can be passed to the template).

The documentation around that is certainly a bit lacking but updating is on the roadmap.

Part 6: What is next

Honestly - I don’t know for sure yet.

Almost all of the features SMG has were created because there were real world use cases for them. A lot of those would be about simplifying my job as the “Monitoring guy”, i.e. config generation and maintenance. But a lot of them would actually come because of requests or questions from some team member about how to do certain things.

So I can definitely say that so far the SMG “Product Managers” have been Smule’s extended operations and server development team members - thank you all for that!

I do have some plans … the autoconf plugin can use more templates for more services. Still need to add “first class” json support in the Common Commands plugin (and likely more built-in commands through it).

The documentation can also use some refreshments. While I have tried to keep it reasonably up to date and there should be no wrong info there, its quite possible that not all features are well documented. And the more recent plugins including Scrape/Kube/Autoconf may not have any documentation at all (so these are definitely a TODO item).

Also I have been thinking about adding a “syslog plugin” - essentially being able to extract and keep track of log-based time-series and do that in “near-time” (vs doing it over rolled over previous hour logs).

And eventually - I will probably need to add some form of authentication/authorization to SMG. If we don’t count the “silence” and “acknowledge” actions, SMG is essentially a read-only system. Everything it “does” is determined by the config files on the local disk which only someone with ssh access (or an automation system) can change. Because it is trivial to setup a http reverse proxy in front of it (in fact that is the recommended way to run it - have the images served by apache or nginx from disk) one can implement authentication there. Yet with larger teams there are still use cases to have the “user” concept in SMG. If not else - to track who silenced and/or acknowledged given state. So that is a TODO item.

Other ideas are welcome too :)

At this point I am not aware of anyone outside Smule to be using SMG. I guess that is largely my fault as I never really tried to popularize it - I have always been building it for myself - to help me do my “Monitoring guy” job, efficiently and reliably.

But its also possible that this is not true and someone is actually using it - I wouldn’t know that but would love to hear about it. Feel free to create a github issue and share any positive or negative feedback from using SMG - I would be happy to try to address any glitches you may have encountered :)

In any case as of this day I am still the “Monitoring guy” at Smule and thanks to SMG it certainly does not take up all of my time to be in that role.

That’s it for now and thank you for reading that far.

Jul, 2021

- asen