August 23, 2016

Basic site monitoring with Riemann

Riemann is a general-purpose event processing system, but its most typical application is as a place to send and generate metrics about applications. I recently set up a Riemann server for my personal projects, and I feel like I’ve stepped up my devops game by 1000%.

Or, at the very least, I feel like I’ll know about it as soon as one of my sites goes down.

There are plenty of articles going over the virtues of Riemann, but this one aims to be a practical guide to getting started for those that are already convinced. This short article will cover:

  • How to install and configure Riemann server on an Ubuntu Digital Ocean box
  • How to set up Riemann’s system health monitoring
  • How to write basic configuration files to notify yourself via email when something’s up

Installing Riemann Server

If you’re a chronic hobbyist like me, it’s probably worth your while to drop $5 per month on a dedicated Riemann server. Just head over to Digital Ocean (disclaimer: referral link) or the VPS provider of your choice and spin up an Ubuntu 16.04 instance if you want to follow along directly.

Before installing Riemann, you need Java. Just run apt install default-jre on your server.

Installing Riemann is covered in its quickstart guide. This will install a Riemann server and a systemd service to run it automatically. If you’re like me and just want a deb instead of compiling and installing it yourself, you can get a URL for that from the Chef recipe. Just make sure you update the version number in the URL. Then, run dpkg -i <.deb file>.

Viewing your first event

To view events, we’ll need a copy of Riemann’s command-line client tools, which are available using gem. Java is already in place from the previous step, so we only have to install Ruby:

$ apt install ruby

After all that is done, we can use gem to install some extra Riemann tools:

$ gem install riemann-tools riemann-dash riemann-client --no-document

At this point, to see the magic happen, you first have to create a file called “config.rb” and pop the following in it:

set :bind, "0.0.0.0"

Then, in the same directory, run

$ riemann-dash

Since you haven’t set up ufw yet, head to http://<vps_ip>:4567/ and you’ll be able to see a dashboard there.

I’ll take a moment to note here that Riemann’s dashboard is very functional but super weird to configure, relying primarily on keyboard shortcuts and window-splitting. If you have a boss you want to impress, take the time to install something like Grafana. You could back it with InfluxDB, receiving events from Riemann.

Anyhow, once you get a view of that glorious dashboard, your first order of business is to correct the server address. In the grey text input at the top right, replace “127.0.0.1” with your VPS’s IP.

Then, bumble your way into creating a new view (cmd+click the big “Riemann”, ctrl+shift+left, cmd+click empty space, e), select log, and enter whatever you like for the title and “true” for the query. This will show you all events as they come into Riemann. You should see a bunch of events that Riemann reports about itself by default, which is good! It means your installation is working. Once you’re up and using Riemann properly, this view will be useless because all you’ll see is junk flying by, but it’s nice for getting started.

Setting up monitoring for Riemann

You’re supposed to put your own oxygen mask on first, right? Let’s make sure our server can keep tabs on its own health.

Press “s” to save your dashboard with the log window, then open up a new SSH session (or tmux window or whatever) so you can run another command on your server while the dashboard is still up:

$ riemann-health

Now, you should see a bunch more events flying by your log, with names like “memory”, “disk /”, and “cpu”. These are what we want to see. To see just one of these in the log, select it with cmd+click, press “e”, and change the query to (for example) service = "cpu". Now, you should only see the CPU events.
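The same query language works outside the dashboard, too. Here’s a minimal sketch that runs a query against Riemann’s index from Ruby, using the riemann-client gem installed earlier. The matching_lines helper and the RIEMANN_HOST environment variable are made up for this example, and it assumes port 5555 on your Riemann server is reachable from wherever you run it:

```ruby
# Hypothetical helper: runs a Riemann query and formats each matching
# event as one line of text. `client` is anything that answers
# `client[query]` with a list of events, such as a Riemann::Client.
def matching_lines(client, query)
  client[query].map do |event|
    "#{event.host} #{event.service} #{event.state} #{event.metric}"
  end
end

# Only talks to a real server when RIEMANN_HOST is set. That variable is
# an assumption of this sketch, not something Riemann itself looks at.
if ENV["RIEMANN_HOST"]
  require "riemann/client"
  client = Riemann::Client.new(host: ENV["RIEMANN_HOST"], port: 5555, timeout: 5)
  puts matching_lines(client, 'service = "cpu"')
end
```

Swap in any query you’ve tried in the dashboard, such as true for everything currently in the index.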

Spend a little time rearranging your dashboard (you can add a new dashboard page with the + at top right). I recommend creating a Gauge and a Flot for each of CPU, Memory, Disk /, and Load. Remember to save your work!

We’ll always want riemann-health running, so let’s set up a systemd service for it. Create the file /etc/systemd/system/riemann-health.service and enter the following:

[Unit]
Description=Riemann Health
After=network.target

[Service]
ExecStart=/usr/local/bin/riemann-health
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then, you can enable the service so it starts at boot, and start it right away, using $ systemctl enable --now riemann-health.

Making the dashboard permanent

You want to be able to see that pretty dashboard, right? Here’s a systemd service you can put in /etc/systemd/system/riemann-dash.service:

[Unit]
Description=Riemann Dashboard
After=network.target

[Service]
ExecStart=/usr/local/bin/riemann-dash
Restart=on-failure

[Install]
WantedBy=multi-user.target

However, unless you happened to create the config.rb file from before in the root directory (/), which is where systemd runs services by default, this will not bind to 0.0.0.0, but to localhost (127.0.0.1). That’s just fine, because for a variety of reasons you’ll want to run it behind nginx.

Install Nginx now. Also, let’s install the apache2-utils package to get the htpasswd utility:

$ apt install nginx apache2-utils

You’ll want some authentication, so let’s create a .htpasswd file for use with HTTP basic auth:

$ htpasswd -c /etc/nginx/.htpasswd <your_username>

Follow the prompts to set your password. After that, edit your /etc/nginx/sites-available/default file. Here’s a basic configuration (replace the whole file with the following):

server {
    listen 80 default_server;
    listen [::]:80 default_server;

    location / {
        auth_basic "Riemann Dashboard";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://localhost:4567;
        proxy_pass_request_headers      on;
    }
}

Now, assuming the riemann-dash and nginx services are both running, you can head to http://<your_ip_or_hostname>/ and see the dashboard only after logging in. Hurrah!
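If you’d rather check the auth from a script than from a browser, here’s a rough sketch using Ruby’s standard library. It requests the dashboard with and without credentials; you’d expect a 401 without them and a 200 with them. The status_for helper and the DASHBOARD_URL, DASHBOARD_USER and DASHBOARD_PASSWORD environment variables are names I made up for the example:

```ruby
require "net/http"
require "uri"

# Returns the HTTP status code (as an Integer) for a GET request to `url`,
# optionally sending HTTP basic auth credentials.
def status_for(url, user: nil, password: nil)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password) if user
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.code.to_i
end

# Only runs when DASHBOARD_URL is set (an assumption of this sketch).
if ENV["DASHBOARD_URL"]
  url = ENV["DASHBOARD_URL"]
  puts "without credentials: #{status_for(url)}"
  puts "with credentials:    #{status_for(url, user: ENV['DASHBOARD_USER'], password: ENV['DASHBOARD_PASSWORD'])}"
end
```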

Enable SSL for the dashboard

You can get a free SSL cert from Let’s Encrypt using certbot; follow their instructions to do so. You should end up with a cert somewhere in /etc/letsencrypt/.

It’s easy enough to configure nginx to demand SSL. Replace your configuration with this:

server {
    listen 80 default_server;
    listen [::]:80 default_server;
    location / {
        rewrite ^(.*)$ https://example.com$1 permanent;
    }

}

server {

    # SSL configuration
    listen 443 ssl default_server;
    listen [::]:443 ssl default_server;

    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
    ssl_trusted_certificate /etc/letsencrypt/live/example.com/fullchain.pem;

    # Index
    index index.html index.htm index.nginx-debian.html;

    location / {
        auth_basic "Riemann Dashboard";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://localhost:4567;
        proxy_pass_request_headers      on;
    }
}

Remember to replace the example.com with your domain.

However, we’re not done. The dashboard uses websockets, and you’ll start to get errors because it defaults to ws: connections. Fortunately, we can use nginx to proxy the unsecured websocket connections as well, upgrading them to wss: connections in the process. Just add the following to your nginx config file:

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

upstream riemann_ws {
    server 127.0.0.1:5556;
}

server {
    listen 5566 ssl;
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
    ssl_trusted_certificate /etc/letsencrypt/live/example.com/fullchain.pem;

    location / {
        proxy_pass http://riemann_ws;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}

Now, you need to change your Riemann dashboard to use port 5566 instead of 5556. Just change the port in the textbox at top right on your dashboard (and remember to save).

Finally, you should make sure any ports that aren’t needed are locked down with ufw (5555 is Riemann’s client port):

$ ufw default deny
$ ufw allow ssh
$ ufw allow http
$ ufw allow https
$ ufw allow 5555
$ ufw deny 5556
$ ufw allow 5566
$ ufw enable
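To confirm the firewall does what you expect, it helps to probe the ports from a different machine. Here’s a small sketch using Ruby’s standard library; port_open? and the RIEMANN_HOST environment variable are names I made up. Since ufw’s deny silently drops packets, a blocked port shows up as a timeout rather than an immediate refusal, so the check for 5556 takes a couple of seconds:

```ruby
require "socket"

# Returns true if a TCP connection to host:port succeeds within `timeout`
# seconds. Any failure (refused, timed out, DNS error) counts as closed.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue StandardError
  false
end

# Only runs when RIEMANN_HOST is set (an assumption of this sketch).
# The expectations mirror the ufw rules above.
if ENV["RIEMANN_HOST"]
  { 5555 => true, 5556 => false, 5566 => true }.each do |port, expected|
    actual = port_open?(ENV["RIEMANN_HOST"], port)
    verdict = actual == expected ? "as expected" : "UNEXPECTED (open=#{actual})"
    puts format("%-5d %s", port, verdict)
  end
end
```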

There you have it! You’ve got yourself a reasonably secure Riemann dashboard with auth. Now, to make it useful.

Adding monitoring to other servers

(You may need to install a newer Ruby on older servers for this part. Brightbox’s Ubuntu repos helped me out a lot with this.)

Any server can run riemann-health and report back to your new Riemann server. Here’s the general procedure:

  1. Install riemann-tools: $ sudo gem install riemann-tools
  2. Create a service to run riemann-health on startup, with the argument --host <my_riemann_server_host>
  3. Set up your Riemann dashboard to show stats from that server.

You’ve already seen the systemd config for riemann-health; here’s a matching upstart one:

description "Riemann health"

start on runlevel [2345]
stop on runlevel [!2345]

respawn

exec /usr/local/bin/riemann-health --host my-riemann-server.com

Monitoring nginx

You can monitor Nginx using the riemann-nginx-status utility included in the riemann-tools package, which works just like riemann-health. The only difference is that you’ll have to set up a stub status endpoint on the nginx server you want to monitor. Just add the following server declaration to your nginx config:

server {
    listen 127.0.0.1:9000;
    location "/status" {
        stub_status on;
    }
}

Then, set up a service to run riemann-nginx-status --host your_riemann_server.com --uri http://localhost:9000/status on your server and you’ll start getting metrics from it (if you don’t care about metrics, just pass a URI to monitor and it’ll only send “ok” messages).
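If you’re curious where those metrics come from, the stub status page is just a few lines of plain text containing seven counters. The sketch below is not riemann-nginx-status’s actual code, just an illustration of reading that page yourself; parse_stub_status and the NGINX_STATUS_URL environment variable are hypothetical names:

```ruby
require "net/http"
require "uri"

# Parses the text of an nginx stub_status page, which looks like:
#
#   Active connections: 291
#   server accepts handled requests
#    16630948 16630948 31070465
#   Reading: 6 Writing: 179 Waiting: 106
#
# Returns a hash of the seven counters, or nil if the body doesn't look
# like a stub_status page.
def parse_stub_status(body)
  numbers = body.scan(/\d+/).map(&:to_i)
  return nil unless body.start_with?("Active connections:") && numbers.size == 7

  keys = %i[active accepted handled requests reading writing waiting]
  keys.zip(numbers).to_h
end

# Only fetches a real page when NGINX_STATUS_URL is set (an assumption of
# this sketch), e.g. http://localhost:9000/status on the monitored server.
if ENV["NGINX_STATUS_URL"]
  puts parse_stub_status(Net::HTTP.get(URI(ENV["NGINX_STATUS_URL"]))).inspect
end
```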

Email notifications

To get notifications by email, you’ll need to install sendmail on your Riemann server (apt install sendmail) and edit Riemann’s configuration.

Riemann’s configuration file is in fact a Clojure source file. This may or may not mean anything to you, so just in case, Riemann’s website has a quick introduction.

The config file being a full-fledged language means that you can do pretty complex things with your event streams, but the most common ones involve aggregations and notifications. Here’s an example: open up /etc/riemann/riemann.config and add this at the top:

(require '[riemann.email :refer :all])

Then, replace the existing streams section with this:

(streams
 (default :ttl 60

  ; Index all events immediately.
  (index)

  ; Log expired events.
  (expired
   (fn [event] (info "expired" event)))

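  ; Track state changes across all "nginx health" events (including the
  ; "ok" ones, so a recovery followed by a new failure alerts again),
  ; then only alert when the new state is "critical".
  (where (service "nginx health")
         (changed-state {:init "ok"}
          (where (state "critical")
                 (rollup 1 3600
                         ((mailer {:from "riemann@<my_riemann_server_host>"}) "<your_email_address>")))))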
  ))

The last part can be read approximately as follows:

“Watch events where the service is ‘nginx health’, and whenever the state changes to ‘critical’, email me the first event right away and then a summary of any further ones over the following hour.”
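To make that reading concrete, here’s a toy model in Ruby of the state-change part of the rule. It is not how Riemann implements its streams, and it ignores the hourly rollup entirely; alerting_events is a made-up name, and events are plain hashes. It only shows which events in a sequence would count as a change into “critical”, tracked per host:

```ruby
# Toy model of "alert when 'nginx health' changes state to critical".
# Takes a list of event hashes and returns the ones that would alert.
def alerting_events(events, init: "ok")
  # Last seen state per [host, service]; unseen pairs start at `init`.
  last_state = Hash.new(init)

  events.select do |event|
    next false unless event[:service] == "nginx health"

    key = [event[:host], event[:service]]
    changed = event[:state] != last_state[key]
    last_state[key] = event[:state]

    changed && event[:state] == "critical"
  end
end

states = %w[ok critical critical ok critical]
events = states.map { |s| { host: "web1", service: "nginx health", state: s } }
puts alerting_events(events).size
```

With the sequence above, the second and fifth events count as changes into “critical”; the third one doesn’t, because the state was already “critical”.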

Further Thoughts

  • Since Riemann’s monitoring data expires so quickly, if you want to keep a longer-term dataset, consider forwarding from Riemann to a time-series store like InfluxDB or Graphite.

  • For more reliable email alerts, set up SMTP, Mailgun, or something like it.

  • The other main use of Riemann is monitoring application-specific metrics. If you want to keep track of apps you’re running you can use a Riemann client library and have it report things back. For example, Later for Reddit reports the status of each post attempt, which is aggregated into posts per hour and posts per day for my amusement (and debugging). It also forwards any error-status log messages to be handled.
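As a rough idea of what reporting from an app can look like, here’s a sketch using the riemann-client gem. The job_event helper, the “myapp job” service name, the tag and the RIEMANN_HOST environment variable are all invented for the example; this is not how Later for Reddit does it:

```ruby
# Hypothetical event builder: describes the outcome of one job as a
# Riemann event. The service name and tag are made up for this sketch.
def job_event(succeeded, description)
  {
    service: "myapp job",
    state: succeeded ? "ok" : "critical",
    metric: 1,
    description: description,
    tags: ["myapp"],
    ttl: 300
  }
end

# Only talks to a real server when RIEMANN_HOST is set (an assumption of
# this sketch). 5555 is Riemann's client port.
if ENV["RIEMANN_HOST"]
  require "riemann/client"
  client = Riemann::Client.new(host: ENV["RIEMANN_HOST"], port: 5555, timeout: 5)
  client << job_event(true, "posted successfully")
end
```

Once events like these are arriving, you can query and alert on them exactly like the health events above, for example with service = "myapp job" in a dashboard view.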

You can find lots more examples on Riemann’s howto. The hard part is getting the server up and running; once you have it there you’ll find all sorts of uses for it, I promise. That’s all for now!