<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Mark's Blog</title><link>http://qq.is</link><description>Technical ramblings somewhere between development and operations.</description><lastBuildDate>Wed, 22 May 2013 16:23:06 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Blogging to Victory</title><link>http://qq.is/article/blogging-to-victory</link><description>&lt;p&gt;Welcome to my new blog.&lt;/p&gt;
&lt;p&gt;Sure, I've been doing this writing online thing for about a thousand
years, but this is the first time I've actually decided to make a
career-appropriate online presence wherein I talk about technical things
and actually make an attempt at writing regularly and covering a variety
of topics related to the technical side of things.&lt;/p&gt;
&lt;p&gt;This will not be a blog about my family, cat, house, or other things
except as they relate directly to the subject at hand. Those of you who
are afraid of that kind of content -- never fear. You don't have to
get any closer to the author than is absolutely necessary. This should
keep the signal-to-noise ratio high and make for a much better existence
for those of us who really just want to get some information and move on
about our day.&lt;/p&gt;
&lt;p&gt;This will, however, be a blog that is written by a real person and
presents facts with as much opinion as is necessary. Is memcached really
the solution to all that ails you? I don't know if it is, but it's a
damn nice tool and I will probably recommend it over leeches for curing
the common cold. You're guaranteed to get things as I see them -- no
sugar coating here -- and the best I can promise is that I will take
correction and instruction well. Let me know when I fuck it up.&lt;/p&gt;
&lt;p&gt;Oh yeah, and language -- I actually swear and like it. If you're opposed
to the use of certain four letter words, you best be considering
where you're reading this stuff at. I work for startups and I love
a high-adrenaline, fast-paced environment -- there tends to be foul
language.&lt;/p&gt;
&lt;p&gt;I should probably do a bit of talking now about my own history and what
my credentials are. I don't know. That always seems to be kind of a
waste of time. As it turns out, I've worked at a few places such as
Google, Mozilla, Danga/Six Apart (LiveJournal), StumbleUpon, and Bump
Technologies. I've founded an open source project (Dreamwidth Studios)
that is successful. I've even spoken at conferences (OSCON, Web 2.0
Expo, linux.conf.au). I go by the nickname xb95 on Freenode.&lt;/p&gt;
&lt;p&gt;Watch this space, there's more to come!&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/blogging-to-victory</guid><pubDate>Wed, 12 Oct 2011 08:00:00 GMT</pubDate></item><item><title>An easy to use Nagios API</title><link>http://qq.is/article/announcing-a-nagios-api</link><description>&lt;p&gt;Over the years I've been doing systems administration, I've spent
a lot of time (really, quite a lot) writing tools that pull data from
Nagios or try to make it do what I want. Command line apps to schedule
downtimes, IRC bots that parrot alerts, email/SMS gateways, status web
pages, etc etc.&lt;/p&gt;
&lt;p&gt;Every time I take on a project like this, I usually go through three
phases: first, lamenting that I don't have the code from the last time
I did it; second, weeping over the atrocious mid-90s look, feel, and
implementation of Nagios; and finally, actually sitting down and doing
whatever it is I need to do.&lt;/p&gt;
&lt;p&gt;It's time to cut out the first two steps. Enter nagios-api: a
REST-like, JSON API for Nagios. This allows you to quickly and easily
build command line tools, web interfaces, and other code that interfaces
with Nagios - without having to actually interface with Nagios. Leave
that to me.&lt;/p&gt;
&lt;p&gt;If you want to go check it out now, the code is available here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/xb95/nagios-api"&gt;https://github.com/xb95/nagios-api&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Right now it's fairly simple and only lets you do a few things: get
the state (enough to implement a status page), schedule/cancel downtimes
(90% of what I have to do from the command line anyway), and tail the
log (the final 10% of what I'm typically up to).&lt;/p&gt;
&lt;p&gt;This is implemented on top of the &lt;a href="https://github.com/jamwt/diesel"&gt;Diesel
framework&lt;/a&gt; by Jamie Turner et al.
Since coming to Bump and discovering Diesel, implementing this kind of
network/loop driven system in Python has gone from 'annoying' to 'so
easy I can do it in my sleep'. Seriously good stuff.&lt;/p&gt;
&lt;p&gt;Future plans: add a lot more functionality, of course. There are many
verbs in the Nagios language and I want to be able to support most or
all of them. I'm sure much of that will come as I need to implement
them, and of course, from contributions by other people.&lt;/p&gt;
&lt;p&gt;And finally, of course, I want to replace Nagios with an entirely
new system. I've been doing some work on that on the side, but
I'll talk about that another day. Ideally, whatever interface
the nagios-api project settles on will be translatable to the new
replacement monitoring system I'm working on. That way any tools
written against this API will just continue to work against whatever the
other system is when it's done.&lt;/p&gt;
&lt;p&gt;Feedback is, as always, very welcome.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/announcing-a-nagios-api</guid><pubDate>Mon, 17 Oct 2011 14:00:00 GMT</pubDate></item><item><title>Using Git Flow on GitHub</title><link>http://qq.is/article/git-flow-on-github</link><description>&lt;p&gt;A while ago while at StumbleUpon we looked at using &lt;a href="https://github.com/nvie/gitflow"&gt;git
flow&lt;/a&gt;, an implementation of
the workflow outlined in the post &lt;a href="http://nvie.com/posts/a-successful-git-branching-model/"&gt;A Successful Git Branching
Model&lt;/a&gt;. It
looked really interesting and I wanted to try it but never got around to
it.&lt;/p&gt;
&lt;p&gt;Lately, I have. Thanks to this very helpful post on the subject, I
have now worked this tool into my daily open source work. It actually
integrates quite well with GitHub, now that I've gotten down a very
functional workflow.&lt;/p&gt;
&lt;p&gt;The first thing that I've done is forced myself to have very good
hygiene in my repository. I no longer develop on master and smash everything
together; I use feature branches for all of my development. Branches
are extremely cheap in git, so there is no real excuse not to separate
things out. It makes for a little more merging down the road, but that
trade-off seems worth it. Particularly when you are talking about
GitHub, which allows you to easily share code and contribute back to
other projects.&lt;/p&gt;
&lt;p&gt;Today I decided to contribute back some of the recent Perlbal changes
I had made. For this blog post, we'll look at one of those -- a small
change to add a DEFAULT command. (What the change does exactly doesn't
matter for the purpose of this post.)&lt;/p&gt;
&lt;h1&gt;Set Up (Forking, Config, Git Flow)&lt;/h1&gt;
&lt;p&gt;The code I wanted to change is on GitHub already, in the repository
&lt;a href="https://github.com/perlbal/Perlbal"&gt;perlbal/Perlbal&lt;/a&gt;. I clicked the
"Fork" button and a few moments later GitHub gave me my own copy of the
code to do with as I will.&lt;/p&gt;
&lt;p&gt;Most of you have probably used git before, so you won't be surprised by
the next step:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git clone git@github.com:xb95/Perlbal.git
$ cd Perlbal/
$ git remote add upstream git@github.com:perlbal/Perlbal.git
$ git flow init
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Executed on my local Linux installation, these commands checked out and
set up the code for me to interact with. The clone landed in a &lt;code&gt;Perlbal/&lt;/code&gt; folder
under the current working directory, and the &lt;code&gt;upstream&lt;/code&gt;
remote now points back at the source for Perlbal so I can pull down
changes they make later.&lt;/p&gt;
&lt;p&gt;Finally, &lt;code&gt;git flow init&lt;/code&gt; initializes the Git Flow system. I recommend you accept all
of the defaults; they are reasonable and work well.&lt;/p&gt;
&lt;h1&gt;Write Some Code&lt;/h1&gt;
&lt;p&gt;Now we're ready to kick some code. From inside of the &lt;code&gt;Perlbal/&lt;/code&gt; folder,
you can instruct Git Flow that you are about to start developing on
something. The only thing you have to decide right now is what to call
it. For today's example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git flow feature start default-command
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A few moments later, you will have a new branch named
&lt;code&gt;feature/default-command&lt;/code&gt;. It does a little other magic in the
background and some sanity checks to make sure you don't have
uncommitted changes, but otherwise, it's mostly just creating a branch
for you.&lt;/p&gt;
&lt;p&gt;Now do your development and commit, just like normal. (This part
I assume you know how to do and am not going to spend any time
discussing.) The only thing to keep in mind is that you need to make
sure all of your commits stay on the feature branch you're on.&lt;/p&gt;
&lt;h1&gt;Submit to GitHub&lt;/h1&gt;
&lt;p&gt;At this point, you would normally use Git Flow to finish the feature: it
would merge your changes back into the development branch, and you could
share that with other people. In our case, however, since we're using
GitHub and we want to upstream this change, we actually want to leave
the feature branch open.&lt;/p&gt;
&lt;p&gt;This is important: &lt;em&gt;do not use &lt;code&gt;git flow feature finish&lt;/code&gt; yet!&lt;/em&gt; If you
do, I'm not sure how to recover from that situation. (I'd love to know,
if anybody out there has some good advice on the matter.)&lt;/p&gt;
&lt;p&gt;What you should actually do is this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git flow feature publish default-command
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command causes the branch you've created and worked on to be
created on the origin which is, in our case, &lt;code&gt;xb95/Perlbal.git&lt;/code&gt; on
GitHub. That's exactly what we want. A few moments later, you can see
the branch has been created if you visit the GitHub UI. Great!&lt;/p&gt;
&lt;p&gt;Now select your branch on GitHub and click the "Pull Request" button.
You will be taken to a page that allows you to start building your pull
request. Importantly, you should see that it only shows the commits that
you have made to this feature -- nothing else!&lt;/p&gt;
&lt;p&gt;Once you click the "Send pull request" button, you will have finished
what is, to me, the cleanest and easiest way to work on code and send it
upstream I have yet found. Bravo!&lt;/p&gt;
&lt;h1&gt;Write More Code&lt;/h1&gt;
&lt;p&gt;This process couldn't be simpler. If your pull request results in
someone asking for some changes, you can do that just like you were
doing your earlier development.&lt;/p&gt;
&lt;p&gt;First, make sure that you're on the right branch. If you have been
working on other projects in the meantime, you can switch back to your
feature branch like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git checkout feature/default-command
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that you're on the branch, go ahead and make your changes. Do
whatever you need to do and then commit them. Finally, push your changes
up to your fork on GitHub:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git push origin HEAD
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now if you go look at your pull request, you'll see that your commits
have shown up automatically since it is tracking your branch. When
you're ready for the upstream author to look at your changes again, it's
best to comment and let them know.&lt;/p&gt;
&lt;h1&gt;Cleanup&lt;/h1&gt;
&lt;p&gt;Great, now you've finished and the upstream author has accepted your
pull request. You're done, so you can consider closing off that branch
so that it doesn't continue to clutter up your UI.&lt;/p&gt;
&lt;p&gt;If you're ready to do that, let's go back to the repository. First
make sure you are on the right branch (see the above section). When
everything looks good and you're ready, tell git flow you're finished
with this feature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git flow feature finish default-command
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything looks good locally, you can now delete the branch on
GitHub:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git push origin --delete feature/default-command
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Done. You have now cleaned up your local branch as well as the remote
GitHub branch. That's all there is to it.&lt;/p&gt;
&lt;h1&gt;Next Steps&lt;/h1&gt;
&lt;p&gt;That's it for today. Thanks for reading and please let me know where
this guide can be improved. I hope to maintain it as a living document
to help people.&lt;/p&gt;
&lt;p&gt;And of course -- your next step is to go write some code! I look forward
to seeing some patches now. :)&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/git-flow-on-github</guid><pubDate>Sun, 23 Oct 2011 13:06:00 GMT</pubDate></item><item><title>Using the Nagios API</title><link>http://qq.is/article/using-the-nagios-api</link><description>&lt;p&gt;This entry talks about how to set up and test the &lt;a href="https://github.com/xb95/nagios-api"&gt;Nagios
API&lt;/a&gt; in your environment. We cover
the CLI and also using it from the web.&lt;/p&gt;
&lt;h1&gt;Getting the Code&lt;/h1&gt;
&lt;p&gt;For now, this project doesn't support packaging or have a setup.py file,
so you will have to do it by hand. This isn't very hard, but since the
project is in such a state of growth, it's easier this way.&lt;/p&gt;
&lt;p&gt;First, you need to check out the code on your local Nagios server. The
API daemon needs to run on a machine where it has access to the files
that Nagios creates -- the status file, log file, and external commands
pipe. This should be on your central Nagios server.&lt;/p&gt;
&lt;p&gt;If you are using Nagios in distributed mode, you want to run the daemon
on the central machine that receives all of the distributed check
results. I.e., the machine that sends the alerts.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git clone git://github.com/xb95/nagios-api.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That command creates a directory appropriately named &lt;code&gt;nagios-api&lt;/code&gt;.
Inside this directory are several executables, some documentation, and a
library directory.&lt;/p&gt;
&lt;h1&gt;Test Run&lt;/h1&gt;
&lt;p&gt;Before we can run it, we have to figure out where your Nagios
installation is stashing the files we need. Most of these are probably
in &lt;code&gt;/etc/nagios3/nagios.cfg&lt;/code&gt;, so open that file up and look for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;status_file&lt;/code&gt; is the main file we need; that's where Nagios writes
   out the giant dump of status.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;log_file&lt;/code&gt; is a running tally of everything Nagios is thinking. This
   is optional, but the API daemon can follow this and allow other people
   to access log information through the API.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;command_file&lt;/code&gt; is only relevant if you have &lt;code&gt;check_external_commands&lt;/code&gt;
   on, but since most of us do, you should probably have this configured.
   The API will use this pipe to write out commands. If you don't give this
   to the API, it will still operate -- but in read-only mode.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
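&lt;p&gt;If you'd rather not eyeball the config, here is a minimal sketch that pulls those three directives out for you. It assumes the usual &lt;code&gt;key=value&lt;/code&gt; layout with &lt;code&gt;#&lt;/code&gt; comment lines; the function name and the sample paths are made up for illustration and are not part of nagios-api.&lt;/p&gt;

```python
# Sketch: pull the three paths nagios-api needs out of a Nagios main config.
# Assumes the usual "key=value" layout with "#" comment lines.

WANTED = ("status_file", "log_file", "command_file")


def find_nagios_paths(config_text):
    """Return a dict mapping each wanted directive to its configured path."""
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() in WANTED:
            found[key.strip()] = value.strip()
    return found


if __name__ == "__main__":
    # Hypothetical sample; on a real server read /etc/nagios3/nagios.cfg instead.
    sample = """
# main config
log_file=/var/log/nagios3/nagios.log
status_file=/var/cache/nagios3/status.dat
check_external_commands=1
command_file=/var/lib/nagios3/rw/nagios.cmd
"""
    for name, path in sorted(find_nagios_paths(sample).items()):
        print(name, path)
```

&lt;p&gt;Any directive missing from the result is simply not set in that file, which for &lt;code&gt;command_file&lt;/code&gt; means you will be running the API read-only.&lt;/p&gt;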
&lt;p&gt;With these three configuration options, you can now run the API daemon:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-api -s STATUS_FILE -c COMMAND_FILE -l LOG_FILE
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What you should see next is something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Thu Oct 27 14:03:20 2011] {nagios-api:info} Listening on port 6315, starting to rock and roll!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you do -- congrats! The API daemon is now up and running. If you
don't, the most likely culprit is that it can't find one of the
files you indicated. It will also fail if it can't bind to port 6315.
(You can change the port with &lt;code&gt;-p PORT_NUMBER&lt;/code&gt;.)&lt;/p&gt;
&lt;h1&gt;Testing the API&lt;/h1&gt;
&lt;p&gt;Great. The daemon is up and running ... now what? Well, let's make sure
that it worked. Let's break out the CLI program, &lt;code&gt;nagios-cli&lt;/code&gt;, and use
it. This should work:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli hosts
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything is working, you should see a list of all of the hosts
defined in your Nagios configuration. This isn't particularly exciting
information, so let's use the raw mode and see exactly what the global
state object says:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli --raw state | python -mjson.tool
{
    "absinthe": {
        "active_checks_enabled": "1",
        "comments": {},
        "current_state": "0",
        "downtimes": {},
        "last_check": "1319742372",
        "last_hard_state": "0",
        "last_notification": "0",
        "notifications_enabled": "1",
        "plugin_output": "PING OK - Packet loss = 0%, RTA = 0.19 ms",
        "problem_has_been_acknowledged": "0",
        "scheduled_downtime_depth": "0",
        "services": {
            "Adaptec RAID": {
                "active_checks_enabled": "1",
                "comments": {},
                "current_state": "0",
                "downtimes": {},
                "last_check": "1319742359",
                "last_hard_state": "0",
                "last_notification": "0",
                "notifications_enabled": "1",
                "plugin_output": "Logical Device 0 Optimal,Controller Optimal,Battery Status ZMM Optimal",
                "problem_has_been_acknowledged": "0",
                "scheduled_downtime_depth": "0"
            },
            "PING": {
                "active_checks_enabled": "1",
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now! Data! You will see a rather large JSON format dump showing a lot
of information about every host, service, comment, and downtime defined
in Nagios right now. It is updated automatically from the status object
Nagios writes out every ~10 seconds.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;nagios-cli&lt;/code&gt; tool in raw mode is simply doing an HTTP request for
us. The above output could also be retrieved with a simple GET:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ curl http://localhost:6315/state | python -mjson.tool
&lt;/code&gt;&lt;/pre&gt;
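&lt;p&gt;The same request is easy to script. The sketch below is only an illustration: it assumes the HTTP response wraps the state in the &lt;code&gt;success&lt;/code&gt;/&lt;code&gt;content&lt;/code&gt; envelope that the browser example later in this post relies on, and it uses the Nagios convention that a host state of 0 means UP. The function names and the base URL are mine, not part of nagios-api.&lt;/p&gt;

```python
# Sketch: fetch /state from nagios-api and list hosts that are not UP.
# Assumes the JSON envelope has "success" and "content" keys, with one
# entry per host under "content".
import json
from urllib.request import urlopen


def hosts_not_up(envelope):
    """Return sorted names of hosts whose current_state is not "0" (UP)."""
    if not envelope.get("success"):
        return []
    content = envelope.get("content", {})
    return sorted(
        name for name, host in content.items()
        if str(host.get("current_state")) != "0"
    )


def fetch_state(base_url="http://localhost:6315"):
    """GET the global state object; base_url is a placeholder."""
    with urlopen(base_url + "/state", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Canned data so the sketch runs without a daemon; swap in fetch_state().
    sample = {
        "success": True,
        "content": {
            "absinthe": {"current_state": "0"},
            "web01": {"current_state": "1"},
        },
    }
    print(hosts_not_up(sample))
```

&lt;p&gt;Note that the values come back as strings, as in the dump above, which is why the comparison is against &lt;code&gt;"0"&lt;/code&gt; rather than a number.&lt;/p&gt;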
&lt;p&gt;The next thing you might want to do is actually do something with the
CLI -- say, schedule a downtime for a host that you're about to do an
upgrade on. First, let's see what options the CLI has:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli -h
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As of this writing there are downtime related options and then two
status commands for viewing hosts and services. More will be added
later, but let's play with the downtimes.&lt;/p&gt;
&lt;p&gt;First, let's pick one of our hosts to operate on. Let's pretend that
&lt;code&gt;web01&lt;/code&gt; needs an upgrade. From the CLI, we can easily put it into
downtime:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli schedule-downtime web01 4h
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Put simply, that command schedules a four-hour fixed downtime starting
immediately for the host &lt;code&gt;web01&lt;/code&gt;. If you wanted to put in downtimes
for the host and all of the services on it, you can do that with the
&lt;code&gt;--recursive&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli schedule-downtime web01 4h -r
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To see all of the options this command supports:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli schedule-downtime -h
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Later, when we're done with the upgrade, we can cancel that downtime:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli cancel-downtime web01
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or cancel the downtime for the host and any services that are in downtime:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-cli cancel-downtime web01 -r
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's a short and easy introduction to the CLI.&lt;/p&gt;
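&lt;p&gt;For the curious: a fixed host downtime like the one above ultimately reaches Nagios as a single line written to the command pipe. The sketch below formats such a line following the documented &lt;code&gt;SCHEDULE_HOST_DOWNTIME&lt;/code&gt; external command. It is not lifted from nagios-api, so treat the function name, author, and comment as placeholders.&lt;/p&gt;

```python
# Sketch: format the Nagios external command for a fixed host downtime.
# The field order follows the documented SCHEDULE_HOST_DOWNTIME command;
# how nagios-api builds it internally is not shown here.
import time


def host_downtime_command(host, hours, author, comment, now=None):
    """Return one command line suitable for the Nagios command_file pipe."""
    start = int(time.time() if now is None else now)
    duration = int(hours * 3600)
    end = start + duration
    fixed = 1       # fixed downtime: runs from start to end
    trigger_id = 0  # not triggered by another downtime
    fields = [
        "SCHEDULE_HOST_DOWNTIME", host, start, end,
        fixed, trigger_id, duration, author, comment,
    ]
    return "[%d] %s\n" % (start, ";".join(str(f) for f in fields))


if __name__ == "__main__":
    # "web01" matches the example above; author and comment are made up.
    print(host_downtime_command("web01", 4, "xb95", "kernel upgrade"), end="")
```

&lt;p&gt;Having to remember that field order by hand is a large part of why a CLI in front of it is so pleasant.&lt;/p&gt;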
&lt;h1&gt;Using the API from the Web&lt;/h1&gt;
&lt;p&gt;If you're like me, the CLI is your one-stop shop for everything. I
generally work from the terminal because I can express whatever I need
easily and manipulate the text with a million and one tools for every
occasion. That's great.&lt;/p&gt;
&lt;p&gt;Sometimes, though, I just want a web GUI. I don't really want to spend
a lot of time debating the finer points of CLIs and GUIs, but here you
don't have to -- the API is a RESTful JSON system because it works great
from the command line &lt;em&gt;and&lt;/em&gt; the web browser.&lt;/p&gt;
&lt;p&gt;For now, let's kill the running &lt;code&gt;nagios-api&lt;/code&gt; and give it a new command
line option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ./nagios-api -s STATUS_FILE -c COMMAND_FILE -l LOG_FILE -o \*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(You have to escape the asterisk, at least in Bash.)
The &lt;code&gt;-o&lt;/code&gt; parameter instructs the daemon to send out an
&lt;code&gt;Access-Control-Allow-Origin&lt;/code&gt; header with every response. This
header is part of the relatively new &lt;a href="http://www.w3.org/TR/cors/"&gt;Cross-Origin Resource
Sharing&lt;/a&gt; spec.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A Short History Lesson&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For many years, your web browser has been locked in a box that only
allows JavaScript and other dynamic tools to talk to the same origin
that served them. I.e., if you load JavaScript from foo.com on port
80, any HTTP requests that code makes &lt;em&gt;must&lt;/em&gt; target foo.com on port 80.&lt;/p&gt;
&lt;p&gt;This is called the &lt;a href="http://en.wikipedia.org/wiki/Same_origin_policy"&gt;same origin
policy&lt;/a&gt; and has
been a cornerstone of Internet security for many years. It was a
very smart idea that makes a lot of sense, but in the modern day of
"dynamic everything!", it has posed some interesting challenges to web
developers.&lt;/p&gt;
&lt;p&gt;Anyway, this has changed recently with the introduction of the CORS spec
linked above. This spec is supported in recent versions of most major
browsers (Opera, as of this writing, does not support it) and allows us to write JavaScript
that targets the Nagios API, even if that API is running on a different
host or port. (Which it undoubtedly is.)&lt;/p&gt;
&lt;/blockquote&gt;
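&lt;p&gt;To make the header's role concrete, here is a rough sketch of the comparison a browser performs for a simple cross-origin request. It deliberately ignores credentials and preflight requests, which add more rules, and the function name and the dashboard origin are hypothetical.&lt;/p&gt;

```python
# Sketch: the comparison a browser makes for a simple cross-origin request.
# Ignores credentials and preflight requests, which add further rules.


def origin_allowed(allow_origin_header, page_origin):
    """Return True if a script served from page_origin may read the response."""
    if allow_origin_header is None:
        return False  # no header: the same origin policy applies
    value = allow_origin_header.strip()
    return value == "*" or value == page_origin


if __name__ == "__main__":
    # "-o *" from the command above lets any page read the API.
    print(origin_allowed("*", "http://jquery.com"))
    # A stricter, hypothetical value only lets one dashboard origin in.
    print(origin_allowed("http://dash.example.com", "http://jquery.com"))
```

&lt;p&gt;The wildcard is convenient for testing, but a single named origin is the safer thing to send once you know where your dashboard lives.&lt;/p&gt;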
&lt;p&gt;Now your API is configured to export the appropriate header (in this
case, "allow everybody") and you can write JavaScript that targets the
API. Let's test this out.&lt;/p&gt;
&lt;p&gt;First, you need to be able to reach your Nagios server from your
browser. Try navigating to it on the port you configured (default is
6315), and you should see something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"content": "Invalid request URI", "success": false}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don't see that, then you should stop here and figure out what's
wrong. Are you on the right network? VPN up? You know your configuration
better than I do.&lt;/p&gt;
&lt;p&gt;Once that works, now navigate your browser to
&lt;a href="http://jquery.com"&gt;jquery.com&lt;/a&gt;. We use this site because the next step
requires the jQuery library, and the easiest way to make sure it's
loaded is just to go to their site in the browser.&lt;/p&gt;
&lt;p&gt;Now, fire up your browser's development console. I'm only familiar with
this in Chrome; if you use Firefox or Safari, you will have to modify
these instructions.&lt;/p&gt;
&lt;p&gt;In the development console, you can paste the following code to define a
little processing function that we're going to call shortly.&lt;/p&gt;
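&lt;pre&gt;&lt;code&gt;// Callback for $.getJSON: data is the parsed JSON response.
function get_status(data, t, j) {
    // Bail out if the API reported a failure.
    if (!data.success) return;
    // data.content is keyed by host name; print each host's check output.
    for (var host in data.content) {
        console.log(host + ' ' + data.content[host].plugin_output);
    }
}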
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure to hit enter. Okay, now we can actually hit the API and do
something. Adjust the following snippet with the appropriate URL, then
paste it in and hit enter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$.getJSON('http://my-nagios-server:6315/state', get_status);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see, very shortly, a dump of all of the hosts in your Nagios
system with the most recent output from whatever host check you use. In
my case I see a bunch of PING results.&lt;/p&gt;
&lt;p&gt;And that's it! You can access the API from your browser.&lt;/p&gt;
&lt;h1&gt;Productionizing&lt;/h1&gt;
&lt;p&gt;To make sure that your API stays up and running, I would suggest you
consider the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Monitor the API with Nagios so you are alerted if it crashes. Since
   it's a JSON server, you can do an HTTP check to make sure that it
   responds to a simple command. Alternately you can just do a TCP port
   check.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use something like &lt;a href="https://github.com/jamwt/angel"&gt;Angel&lt;/a&gt; or
   supervisor (I can never find a good link to it). Basically, something
   that runs the daemon and restarts it if it crashes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you are going to use the API from the web, you will want to consider
   setting an appropriate Access-Control-Allow-Origin header. See above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Documentation. Because every operations team should document the
   daemons and other systems they have running.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
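&lt;p&gt;The HTTP check from the first suggestion could look something like the sketch below. It follows the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL) and assumes the &lt;code&gt;success&lt;/code&gt; flag seen in the responses above; the URL, function names, and messages are placeholders of mine.&lt;/p&gt;

```python
# Sketch: a Nagios-style check for the API daemon. Exit codes follow the
# standard plugin convention (0 OK, 2 CRITICAL). The URL is a placeholder.
import json
from urllib.request import urlopen

OK, CRITICAL = 0, 2


def evaluate(body):
    """Turn a response body into an (exit_code, message) pair."""
    try:
        data = json.loads(body)
    except ValueError:
        return CRITICAL, "API CRITICAL - response is not JSON"
    if isinstance(data, dict) and data.get("success"):
        return OK, "API OK - state request succeeded"
    return CRITICAL, "API CRITICAL - success flag missing or false"


def check_api(url="http://localhost:6315/state"):
    """Fetch the URL and evaluate the body; network errors are CRITICAL."""
    try:
        with urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8")
    except OSError as exc:  # covers refused connections and timeouts
        return CRITICAL, "API CRITICAL - %s" % exc
    return evaluate(body)


if __name__ == "__main__":
    # Canned bodies so the sketch runs anywhere; a real plugin would call
    # check_api(), print the message, and pass the code to sys.exit().
    for body in ('{"success": true, "content": {}}', "not json"):
        print(evaluate(body))
```

&lt;p&gt;Keeping the evaluation separate from the fetch makes it easy to test the logic without a running daemon.&lt;/p&gt;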
&lt;p&gt;That's it. Now the API should be resilient to failure and allow you to
depend on it in the rest of your infrastructure.&lt;/p&gt;
&lt;h1&gt;Further Development&lt;/h1&gt;
&lt;p&gt;The future of the Nagios API is somewhat dependent on what the community
needs. For my own purposes, it already does everything I need. Certainly
over time I will need a few more functions to be implemented, but that's
easy.&lt;/p&gt;
&lt;p&gt;Most of my future plans involve the Next Generation of Monitoring
Software, whatever it ends up being called, which is a Nagios
replacement that I've had cooking in my head for years now. I'll be
writing more about that soon, though.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/using-the-nagios-api</guid><pubDate>Thu, 27 Oct 2011 12:45:00 GMT</pubDate></item><item><title>Next Generation Monitoring</title><link>http://qq.is/article/next-gen-monitoring</link><description>&lt;p&gt;I want to talk a bit about the thoughts in my head about building a
new monitoring system to replace &lt;a href="http://nagios.org/"&gt;Nagios&lt;/a&gt;. This is
something that I've been thinking about for years and years, but finally
I'm getting enough internal momentum to actually make it happen. First,
let's dive in and look at the existing landscape of monitoring tools (as
I know them).&lt;/p&gt;
&lt;h1&gt;Define Your Terms&lt;/h1&gt;
&lt;p&gt;For the purpose of this blog post, I define "monitoring" loosely as
the act of gathering information about your services &lt;em&gt;for the express
purpose of alerting you when there's a problem&lt;/em&gt;. The other side of
things, where you are creating pretty graphs to see how your servers
and services are behaving over time is what I will call &lt;strong&gt;performance
trending/analysis&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In short, Nagios is a monitoring system in that when your host goes
down, it pages you. &lt;a href="http://cacti.net/"&gt;Cacti&lt;/a&gt;, on the other hand, is a
performance analysis system that lets you keep track of how much RAM you
have free, etc.&lt;/p&gt;
&lt;p&gt;Many systems are both, too. But for the sake of this blog post,
I'm mostly focusing on the monitoring side of the equation. If you
want a good recommendation for performance analysis, please see
&lt;a href="http://opentsdb.net/"&gt;OpenTSDB&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Monitoring Today&lt;/h1&gt;
&lt;p&gt;There are, it seems, two main approaches to monitoring: Nagios and
everything else. Nagios is a fairly simple, relatively easy to use
system that is good at doing a few things and doesn't really have many
bells or whistles and doesn't do much else beyond monitoring your
services.&lt;/p&gt;
&lt;p&gt;Everything else seems to be a "Nagios and then some" system, providing
some manner of bells and whistles that the traditional Nagios
installation doesn't provide. That's fine, I don't really mind
functionality, but it really gets away from the thing that I really
need: something to let me know when my shit is broken.&lt;/p&gt;
&lt;p&gt;I've spent a while over the years using Nagios, but every so often
I go out and do a survey of the landscape. Sadly, the state of
the art really hasn't changed a lot in ... well, years. You have
&lt;a href="http://zabbix.com/"&gt;Zabbix&lt;/a&gt;, &lt;a href="http://opennms.com/"&gt;OpenNMS&lt;/a&gt;,
&lt;a href="http://zenoss.com/"&gt;Zenoss&lt;/a&gt;, &lt;a href="http://hyperic.com"&gt;Hyperic&lt;/a&gt;,
&lt;a href="http://icinga.org/"&gt;Icinga&lt;/a&gt;, &lt;a href="http://opsview.com/"&gt;Opsview&lt;/a&gt;, and I
might be missing a few...&lt;/p&gt;
&lt;p&gt;And, honestly, they're all probably good and accomplish the basic goals,
but what they don't do for me is allow me to quickly and easily, with
a minimum of fuss and nonsense, just monitor my infrastructure. I want
something simple and easy to use. No surprises. A nearly flat learning
curve. A UI that works. A CLI. (Preferably one that works, too!)&lt;/p&gt;
&lt;p&gt;These tools are Enterprise. They've got sales reps, marketing videos,
VM appliances, and some of them are even built to do Windows, Unix,
Solaris, and VMS! It's great, I'm positive they fill needs that people
have and I don't think they're bad products. They're really just not
what I'm looking for. Far too big for my needs.&lt;/p&gt;
&lt;p&gt;The only thing that comes close to meeting my needs (forget my wants) is
Nagios Core.&lt;/p&gt;
&lt;h1&gt;So, why not Nagios Core?&lt;/h1&gt;
&lt;p&gt;Because the HTML it generates looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;table border=0 width=100% cellspacing=0 cellpadding=0&amp;gt;
&amp;lt;tr&amp;gt;
&amp;lt;td align=left valign=top width=33%&amp;gt;
&amp;lt;TABLE CLASS='infoBox' BORDER=1 CELLSPACING=0 CELLPADDING=0&amp;gt;
&amp;lt;TR&amp;gt;&amp;lt;TD CLASS='infoBox'&amp;gt;
&amp;lt;DIV CLASS='infoBoxTitle'&amp;gt;Current Network Status&amp;lt;/DIV&amp;gt;
Last Updated: Wed Nov 2 02:26:21 CDT 2011&amp;lt;BR&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Okay, a little more seriously: because it's basically
&lt;a href="http://en.wikipedia.org/wiki/Damaged_good"&gt;crippleware&lt;/a&gt;. Nagios Core
has been held back to the state it was in nearly a decade ago so
that the company can differentiate its enterprise offering, &lt;a href="http://nagios.com/products/nagiosxi"&gt;Nagios
XI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm all for the company making money -- that's great -- but their
decision to leave the open source version of the product back in the
stone age makes it so that I can't really use it to meet my needs. Over
the years I've put hundreds of my own hours into efforts that I really
shouldn't have had to because the system lacks so much that I need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A functioning CLI. Doesn't exist. I'm starting to write one,
but I really shouldn't have to.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A UI that is at all modern. The code above demonstrates, but if you
actually interact with Nagios Core, you'll pretty quickly regret it.
It's hard to use and has arcane, confusing commentary. Just try to
schedule a downtime and do it right the first time!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Nice to have:&lt;/em&gt; An API that I can integrate with. I would like to build
my own UIs or dashboards, so please give me access to your data in a
reasonable fashion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reasonable behavior -- this is a very personal opinion, but Nagios
does a few things that confuse and consternate me.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To be fair -- Nagios is still, in my opinion, the only system that
allows me to get a monitoring environment up and running in an hour or
two of hacking. A basic setup is easy to accomplish and worth having.
I've used the software for many years now and I still choose it over
everything else, so it's not all bad.&lt;/p&gt;
&lt;p&gt;In fact, I recommend it if you're not sure what to use. It is currently
the best system out there for monitoring your infrastructure.&lt;/p&gt;
&lt;h1&gt;The Wheel, Again&lt;/h1&gt;
&lt;p&gt;Of course, I wouldn't have started this blog post if all I wanted to do
was bash Nagios. I really don't intend to be that hard on it. It's a
good system, it's just old and getting older. Today's infrastructures
demand a new, more interoperable monitoring system, and that's what I
want to talk about here.&lt;/p&gt;
&lt;p&gt;I'm starting to put together a design for building a monitoring system.
I have a few key points that I am keeping in mind while doing this, but
they're things that I think should resonate with many of you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Prioritize simple. I'm not building an Enterprise(TM) solution here,
I'm building for the busy sysadmin who needs to make sure things are
working. Configuration and usage should be damn easy. So should setup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Keep it minimal. The core of this project can be defined as "make
software that tells me if my shit is broken". Other functionality can be
added by other software -- which I may or may not write, but won't be
part of the core.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Integrate with everybody. Provide a functional API that allows people
to write web interfaces, shell scripts, or whatever they want. I will
provide libraries to do just that, too, to make it easier to get
started.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those are my main three points right now: write something simple, make
it handle the few things it should, and allow other people to bolt
things on if they want. Add a widget to your dashboard that shows the
availability of a service? Great, that's a simple HTTP query that will
return JSON for you to consume. Make a shell script silence alerts?
Easy.&lt;/p&gt;
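&lt;p&gt;To make the dashboard widget idea concrete, here is a rough Python sketch
of consuming such a response. Nothing here exists yet: the JSON field names
(&lt;code&gt;service&lt;/code&gt;, &lt;code&gt;checks_total&lt;/code&gt;, &lt;code&gt;checks_ok&lt;/code&gt;) are made up
for illustration, and the body is a hard-coded string standing in for whatever
the HTTP query would return.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

# Hypothetical response body; the field names are invented for this sketch.
response_body = '{"service": "web", "checks_total": 1440, "checks_ok": 1437}'

# Decode the JSON and turn the raw counts into an availability percentage.
status = json.loads(response_body)
availability = 100.0 * status["checks_ok"] / status["checks_total"]
print(f'{status["service"]}: {availability:.2f}% available')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The point is only that the consumer side stays this small if the API hands
back plain JSON.&lt;/p&gt;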
&lt;h1&gt;Implementation Notes&lt;/h1&gt;
&lt;p&gt;I've spent a lot of time considering my options here, and as
much as I love Perl, these days I'm a Python guy. I'm going
to stick to Python for now. I will probably also use the
&lt;a href="https://github.com/jamwt/diesel"&gt;Diesel&lt;/a&gt; library. That provides a lot
of network service and microthread functionality that certainly makes my
life a lot easier.&lt;/p&gt;
&lt;p&gt;Another goal (this may not be in v1, I'm not sure) is also to make it
so that the system can run on N machines for redundancy. These days,
there's very little reason to run your monitoring system in one place.
Why not run it on five machines and just have them sort out how to divvy
up the work? This is the way many things are moving, and I see no reason
that monitoring systems can't as well.&lt;/p&gt;
&lt;p&gt;In the name of allowing people to do some interesting and complicated
things with the system, I really want to support a full event system.
While this is actually not particularly complicated for a monitoring
system, it has a lot of implications for the rest of the ecosystem.&lt;/p&gt;
&lt;p&gt;For example, let's say that we have an event that fires when the
monitoring system has determined that a host is down. Next, we give
people the ability to write plugins for the monitoring system that can
listen to events. Alternatively, we could allow people to subscribe to events
using a pubsub-type model of some sort.&lt;/p&gt;
&lt;p&gt;Either way, someone could potentially write code that does a database
failover when the system detects that a database has gone down. Or maybe
they have code to automatically restart a process, reboot a server, etc
etc. The list of possibilities is endless and it doesn't compromise the
vision to build a simple system -- you never have to touch it. The power
is there, though.&lt;/p&gt;
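&lt;p&gt;As a sketch of what the subscribe side might look like, here is a tiny
in-process publish/subscribe registry in Python. It is an illustration under
my own assumptions, not a design decision: the &lt;code&gt;EventBus&lt;/code&gt; class, the
&lt;code&gt;host_down&lt;/code&gt; event name, and the &lt;code&gt;fail_over_database&lt;/code&gt; handler are
all hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import defaultdict


class EventBus:
    """Tiny in-process publish/subscribe registry (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        # Register a callable to run whenever event_name fires.
        self._handlers[event_name].append(handler)

    def publish(self, event_name, **details):
        # Call every subscriber and collect whatever each one returns.
        return [handler(**details) for handler in self._handlers[event_name]]


def fail_over_database(host):
    # Placeholder for real failover logic.
    return f"would fail over away from {host}"


bus = EventBus()
bus.subscribe("host_down", fail_over_database)
results = bus.publish("host_down", host="db01")
print(results)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If nobody subscribes, &lt;code&gt;publish&lt;/code&gt; simply calls nothing, which is the
"you never have to touch it" property described above.&lt;/p&gt;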
&lt;h1&gt;Closing Thoughts&lt;/h1&gt;
&lt;p&gt;Monitoring is a really interesting subject to me. It seems to me that
the state of the art is really pretty woeful when you consider how
important our infrastructure is these days. Most people use a handful
of tools they've cobbled together combined with a few dozen scripts of
their own and nobody ever seems to have a really great handle on it.&lt;/p&gt;
&lt;p&gt;It would be good to simplify this and, to some extent, standardize it.
The LAMP stack has nearly been commoditized at this point, giving rise
to services like &lt;a href="http://heroku.com/"&gt;Heroku&lt;/a&gt; that allow you to just
write code and not worry about your backend. Those are great and for
those who can use them -- awesome. I envy you a bit.&lt;/p&gt;
&lt;p&gt;For the rest of us, though: I think it's high time to improve the state
of things. I welcome your feedback as I (continue to) embark on this
crusade.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/next-gen-monitoring</guid><pubDate>Tue, 01 Nov 2011 23:50:00 GMT</pubDate></item><item><title>SSH key forwarding and screen/tmux</title><link>http://qq.is/article/ssh-keys-through-screen</link><description>&lt;p&gt;&lt;em&gt;If you just want the answer, skip to the end. This is written as an
educational post and has a lot more detail than just how to solve this
problem. Thanks!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;If you're like me, you spend a lot of time connected to various servers.
In any given day I'm using a dozen or more servers to accomplish
whatever it is I'm setting out to do. I'm also bouncing between networks
-- wired and wireless, typically, but also sometimes the wireless
drops, or I want to walk across the building, or even dare to go home
sometimes.&lt;/p&gt;
&lt;p&gt;For years now, I've been taking advantage of
&lt;a href="http://www.gnu.org/s/screen/"&gt;screen&lt;/a&gt; (and more recently, a newer
system called &lt;a href="http://tmux.sourceforge.net/"&gt;tmux&lt;/a&gt;) to allow me to keep
state when I'm reconnecting from various locations. If you haven't used
it, it's well worth the time to learn one of these tools.&lt;/p&gt;
&lt;p&gt;Next time you launch that six hour job and realize, three hours later,
that it's time to go home -- no problem. You can just leave it running
in the screen session and reconnect tomorrow or from home or wherever
you go next. No status lost.&lt;/p&gt;
&lt;p&gt;The biggest problem with using screen is that, unless you have properly
configured everything, you often run into a problem with SSH key
forwarding.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Before We Begin&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To really follow along here, you're going to need the machine
you're working on, a remote machine that you will connect to, and an
SSH key. Setting up SSH key access to your server is beyond the scope
of this particular tutorial.&lt;/p&gt;
&lt;p&gt;Really, you will need to have two or more machines in your production
environment, because this is an advanced technique designed for
places where you have to connect to many servers.&lt;/p&gt;
&lt;p&gt;I assume that you have SSH key forwarding working already. You should
be able to &lt;code&gt;ssh user@host&lt;/code&gt; and not have to type a password (except
maybe your SSH key passphrase).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;How SSH works, in brief&lt;/h1&gt;
&lt;p&gt;SSH is a layered system. If you are familiar with the &lt;a href="http://en.wikipedia.org/wiki/OSI_model"&gt;OSI
model&lt;/a&gt;, you know that there are
different layers that build up the networking stack that we're familiar
with. When you connect to a web site, the stack usually looks something
like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Layers 5-7&lt;/em&gt;: HTTP in your browser (Chrome, Firefox, Safari, IE, etc...)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Layer 4&lt;/em&gt;: TCP (provides reliable, ordered delivery of bytes)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Layer 3&lt;/em&gt;: IP (allows two machines to talk to each other across the Internet)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Layer 2&lt;/em&gt;: Ethernet (your NIC on your computer)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Layer 1&lt;/em&gt;: CAT-5/6 cable (or other physical connection)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each layer has its own set of responsibilities and allows the layers on
top of it to operate without knowing the intricacies of how everything
else works. When you want to connect to 8.8.8.8 on port 53, you don't
care that this involves an extremely complex system involving everything
from routing to physically sending electrical impulses. It just works.&lt;/p&gt;
&lt;p&gt;SSH has its own layers. When you fire up an SSH connection to a machine,
you are really establishing several things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The SSH transport layer&lt;/li&gt;
&lt;li&gt;User authentication to the remote machine&lt;/li&gt;
&lt;li&gt;A plethora of distinct SSH channels for moving data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The transport and authentication layers are responsible for establishing
your initial connection to the remote server. Once that's done, SSH
gives you channels for moving data back and forth. This is very similar
to how TCP gives you the ability to send data to a specific port -- the
underlying network layer (IP, layer 3 in the OSI model) doesn't have that
concept or care.&lt;/p&gt;
&lt;p&gt;SSH uses a single TCP connection to a host to allow you to do many
things over that single connection. If you are using port forwarding,
SSH still uses a single TCP connection and multiplexes your forwarded
connections, your shell, and whatever else you're doing all through the
same pipe.&lt;/p&gt;
&lt;h1&gt;The problem statement&lt;/h1&gt;
&lt;p&gt;Now let's move to forwarding. In our example today, we're going to be
using three machines. Your laptop will be named &lt;code&gt;laptop&lt;/code&gt; (original, I
know) and you will be first connecting to the machine named &lt;code&gt;gateway&lt;/code&gt;.
You have a screen session on that machine and you want to then connect
to &lt;code&gt;web01&lt;/code&gt; and all of your other servers.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mark@laptop:~$ ssh gateway
mark@gateway:~$
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you type that command, SSH gets busy establishing a transport layer
and performing user authentication. Since we're not debugging auth right
now, let's just assume it works.&lt;/p&gt;
&lt;p&gt;You are now presented with a shell on your remote machine. From this
bare shell, you can connect off to your webserver and it should just
work:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mark@gateway:~$ ssh web01
mark@web01:~$
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Done. That was easy. If you just want to do this, there's really not
much you have to do. Assuming your original SSH client is forwarding,
you should be able to hop that to the next server.&lt;/p&gt;
&lt;p&gt;But let's go back to our gateway machine and fire up screen...&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mark@web01:~$ exit
mark@gateway:~$ screen
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you will be back in a shell, but you will be inside of screen. I am
also not going to give you a screen tutorial in this blog post. I will
assume that you know how to basically use screen -- attach, detach, and
reattach are all you really need to know for this.&lt;/p&gt;
&lt;p&gt;From inside of screen, now SSH to your webserver. &lt;em&gt;It works!&lt;/em&gt; But wait,
you haven't done anything to configure anything yet! That's right, it'll
work ... for now. Go ahead and detach from screen (detach -- don't
terminate!) and then log out of your gateway machine.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mark@gateway:~$ ^ad
[detached from 23038.main]

mark@gateway:~$ exit
mark@laptop:~$
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You are now back on your laptop, but your screen is still running.
Reconnect to gateway and reattach your screen and then try to connect to
your web server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mark@laptop:~$ ssh gateway
mark@gateway:~$ screen -r
mark@gateway:~$ ssh web01
mark@web01's password:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You get a password prompt -- you aren't allowed in! How did this happen?&lt;/p&gt;
&lt;h1&gt;SSH forwarding, how it works&lt;/h1&gt;
&lt;p&gt;On &lt;code&gt;gateway&lt;/code&gt;, after establishing the SSH connection, take a look at the
environment of your shell:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mark@gateway:~$ env | grep SSH
SSH_CLIENT=68.38.123.35 45926 22
SSH_TTY=/dev/pts/0
SSH_CONNECTION=68.38.123.35 45926 10.1.35.23 22
SSH_AUTH_SOCK=/tmp/ssh-hRNwjA1342/agent.1342
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The important one here is &lt;code&gt;SSH_AUTH_SOCK&lt;/code&gt; which is currently set to some
file in &lt;code&gt;/tmp&lt;/code&gt;. If you examine this file, you'll see that it's a Unix
domain socket -- and is connected to the particular instance of &lt;code&gt;ssh&lt;/code&gt;
that you connected in on. Importantly, &lt;em&gt;this changes every time you
connect&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;As soon as you log out, that particular socket file is gone. Now, if
you go and reattach your screen, you'll see the problem. It has the
environment from when screen was &lt;em&gt;originally&lt;/em&gt; launched -- which could
have been weeks ago. That particular socket is long since dead.&lt;/p&gt;
&lt;p&gt;From inside of screen, your shell has no idea that there is a real SSH
authentication socket somewhere else. It just knows that the one you
have told it to use doesn't exist.&lt;/p&gt;
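&lt;p&gt;If you want to check this for yourself, a few lines of Python can tell you
whether the path in &lt;code&gt;SSH_AUTH_SOCK&lt;/code&gt; still points at a socket. This is
only a sketch: the function name &lt;code&gt;points_at_socket&lt;/code&gt; is mine, and a
socket file existing doesn't prove that an agent is still answering on it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import stat


def points_at_socket(path):
    """Return True if path (following symlinks) is an existing Unix socket."""
    if not path:
        return False
    try:
        mode = os.stat(path).st_mode  # os.stat follows symlinks
    except OSError:
        return False  # the file is gone, as after logging out
    return stat.S_ISSOCK(mode)


sock = os.environ.get("SSH_AUTH_SOCK", "")
verdict = "looks usable" if points_at_socket(sock) else "stale or missing"
print(sock or "(unset)", verdict)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it inside a reattached screen session and it should report the stale
path that the session inherited when it was first launched.&lt;/p&gt;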
&lt;h1&gt;Solving the crisis&lt;/h1&gt;
&lt;p&gt;There are several ways of solving this problem. I believe the following
to be the easiest and most reliable of the ones I've tried. This works
in &lt;code&gt;bash&lt;/code&gt; and &lt;code&gt;zsh&lt;/code&gt; and probably will work in other shells as well.&lt;/p&gt;
&lt;p&gt;Solution: since we know the problem has to do with knowing where the
currently live SSH authentication socket is, let's just put it in a
predictable place!&lt;/p&gt;
&lt;p&gt;In your &lt;code&gt;.bashrc&lt;/code&gt; or &lt;code&gt;.zshrc&lt;/code&gt; file, add the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Predictable SSH authentication socket location.
SOCK="/tmp/ssh-agent-$USER-screen"
# Only relink when an agent socket is set and it isn't already our symlink.
if [ -n "$SSH_AUTH_SOCK" ] &amp;amp;&amp;amp; [ "$SSH_AUTH_SOCK" != "$SOCK" ]
then
    rm -f "$SOCK"
    ln -sf "$SSH_AUTH_SOCK" "$SOCK"
    export SSH_AUTH_SOCK="$SOCK"
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's it. Make sure to put this on every machine that you intend to
connect through, then you're done. SSH to &lt;code&gt;gateway&lt;/code&gt;, reconnect to your
screen, and you can immediately SSH over to &lt;code&gt;web01&lt;/code&gt; or wherever you want
to go. It just works.&lt;/p&gt;
&lt;p&gt;All this code does, when you first SSH in to the machine, is set your
&lt;code&gt;SSH_AUTH_SOCK&lt;/code&gt; variable to a predictable value. It's a symlink that
points to whatever your current SSH authentication socket happens to be.
Every time you SSH in to this machine, that symlink gets rebuilt.&lt;/p&gt;
&lt;p&gt;Inside of screen, the environment never has to change. It dereferences
the symlink to find the correct socket and just works. No matter how
many times you reconnect.&lt;/p&gt;
&lt;h1&gt;Conclusion and room for improvement&lt;/h1&gt;
&lt;p&gt;It took me a while to settle on this method. Originally I tried
something fancy with getting screen/tmux to automatically import
the environment of the shell I was attaching from, but that proved
hard/impossible.&lt;/p&gt;
&lt;p&gt;I also tried building a wrapper around the SSH command to automatically
set the right environment variables. That turned out to work OK but
was clumsy and hard to maintain between different machines. It also
required building more and more wrappers to get other commands to work
and ultimately proved unsustainable.&lt;/p&gt;
&lt;p&gt;This particular solution came from, I'm pretty sure, somewhere else on
the Internet. I would attribute if I remembered where I got the idea
from. It's simple and just works.&lt;/p&gt;
&lt;p&gt;The only trouble I've had is when I leave a terminal up at home, then go
to work and connect from there (overwriting the symlink), and then when
I get back home I have to close that terminal. I can't just use it. This
happens so rarely that I haven't tried to engineer a fix to it. Let me
know if you come up with one, though.&lt;/p&gt;
&lt;p&gt;Thanks for reading. I hope this improves your systems administration
experience.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/ssh-keys-through-screen</guid><pubDate>Thu, 17 Nov 2011 15:48:00 GMT</pubDate></item><item><title>MongoDB disappoints me again.</title><link>http://qq.is/article/mongodb-disappoints-me-again</link><description>&lt;p&gt;At my employer we use &lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt; for one of our
core databases. I had never worked with it before I got here, but now
I'm responsible for maintaining it so I have spent some decent amount of
time banging on it and learning about it.&lt;/p&gt;
&lt;p&gt;I'm impressed with the ease of use, configuration, and general
maintenance. It seems to do things in a reasonably sane fashion most
of the time. I am happy to recommend it to people with small to medium
infrastructures who want to focus more on the application development
and worry less about the administration overhead on the backend. For the
most part, MongoDB just works.&lt;/p&gt;
&lt;p&gt;There are a few things that make me less happy with the system, though,
and lead me to recommend against using it for highly critical systems or
once you pass a certain size. That brings us to today.&lt;/p&gt;
&lt;p&gt;Last week, there was an odd issue where we restarted one of our MongoDB
instances and when it came back up, some of the journal files were owned
by root. This caused the database to stop processing the journal and it
started falling behind. It also couldn't download further journal data
from the master, so it was effectively doing no work.&lt;/p&gt;
&lt;p&gt;Our monitoring didn't catch it (it wasn't yet replicating so it
wasn't showing any replication lag), so it went a while without being
noticed. When I finally did realize it was broken, I fixed the ownership
of the files and restarted it. A while later, I checked back on the
status and saw that the replication state was &lt;code&gt;RECOVERING&lt;/code&gt;. Great! I
went about my business content in the knowledge that it was now
recovering from the problem and would be back up to speed at some point.&lt;/p&gt;
&lt;p&gt;That was Thursday. Today, the machine has still not recovered and seems
to be falling farther and farther behind. That's odd. We aren't doing so
many writes on this cluster that I would expect it to be that overloaded
-- and the other replica members aren't having these issues. In fact,
as I started to dig into it, I realized that it was doing &lt;em&gt;no useful
work at all&lt;/em&gt; -- not progressing even a tiny bit.&lt;/p&gt;
&lt;p&gt;I ended up in the log files and found:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Mon Jan 30 11:59:03 [replica set sync] replSet error RS102 too stale to catch up, at least from blahblahblah:27018
Mon Jan 30 11:59:03 [replica set sync] replSet our last optime : Jan 21 11:00:02 4f1aef12:d4
Mon Jan 30 11:59:03 [replica set sync] replSet oldest at blahblahblah:27018 : Jan 29 06:05:59 4f253627:90
Mon Jan 30 11:59:03 [replica set sync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
Mon Jan 30 11:59:03 [replica set sync] replSet error RS102 too stale to catch up
Mon Jan 30 11:59:03 [replica set sync] replSet RECOVERING
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is pretty obvious -- it was too far behind the master when it tried
to recover, so the master doesn't have enough journal data to send it
and it can't ever just come back up and recover. That's fine. I've been
a MySQL DBA long enough to know that this happens in any replicated
system. No foul here.&lt;/p&gt;
&lt;p&gt;The problem, though, is that MongoDB uses the state &lt;code&gt;RECOVERING&lt;/code&gt;. That
word has a very well understood meaning -- that something has happened
and that whatever it was will be over at some point in the future. It is
currently recovering from the failure. &lt;em&gt;It's really not, though!&lt;/em&gt; This
instance will &lt;strong&gt;never&lt;/strong&gt; recover from the state that it is in. A more
appropriate word would be &lt;code&gt;FAILED&lt;/code&gt; or &lt;code&gt;ERROR&lt;/code&gt; or something that actually
indicates that there is a problem that requires manual intervention!&lt;/p&gt;
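&lt;p&gt;Until the state name changes, one workaround is to have monitoring look
past the reported state. Here is a rough Python sketch of that idea, using two
of the log lines from above. The function name and the &lt;code&gt;FAILED&lt;/code&gt; label
are my own inventions, not anything MongoDB provides.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Two of the log lines shown above.
log_lines = [
    "Mon Jan 30 11:59:03 [replica set sync] replSet error RS102 too stale to catch up",
    "Mon Jan 30 11:59:03 [replica set sync] replSet RECOVERING",
]


def needs_manual_resync(lines):
    """True if any line carries the RS102 'too stale' error."""
    return any("error RS102" in line for line in lines)


# Treat the member as failed when RS102 shows up, whatever state it reports.
reported_state = "RECOVERING"
effective_state = "FAILED" if needs_manual_resync(log_lines) else reported_state
print(effective_state)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Matching on a log string is fragile, which is exactly why the status field
itself should tell the truth.&lt;/p&gt;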
&lt;p&gt;I appreciate that MongoDB is a system that lends itself to ease of
use and is very nice to set up. That's great. But if you want to be
successful at companies with real traffic and usage, you have to build
something that is reasonably sane for sysadmins to maintain. Our lives
are already complicated enough with trying to manage dozens of systems
built in thousands of ways -- if your system lies to me, I'm not going
to feel comfortable with it and sure as heck won't recommend it to other
companies!&lt;/p&gt;
&lt;p&gt;The status fields of any system &lt;strong&gt;must&lt;/strong&gt; be accurate. When you
execute a &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt; on MySQL, the &lt;code&gt;Slave_IO_Running&lt;/code&gt; and
&lt;code&gt;Slave_SQL_Running&lt;/code&gt; columns need to be correct! If they're wrong, you
suddenly can't trust the system and that takes it from a well-behaved
system that is sane to administer to a black hole of fail that is
going to bite you in the ass at some point.&lt;/p&gt;
&lt;p&gt;For this and other reasons, we're in the process of moving off
of MongoDB. It was a great system when we were smaller, but
we're beyond that now. We need systems that we don't have to
fight. (To that end, I have a lot of positive things to say about
&lt;a href="http://riak.com/"&gt;Riak&lt;/a&gt;. That's a subject for a different day, though.)&lt;/p&gt;
&lt;p&gt;End of rant.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/mongodb-disappoints-me-again</guid><pubDate>Mon, 30 Jan 2012 10:35:00 GMT</pubDate></item><item><title>Amazon Glacier</title><link>http://qq.is/article/amazon-glacier</link><description>&lt;p&gt;Today, Amazon announced a new service being provided under the AWS
umbrella: &lt;a href="http://aws.amazon.com/glacier/"&gt;Amazon Glacier&lt;/a&gt;. In summary,
this is a service designed to replace off-site archival storage,
commonly used for backups and long-term storage of infrequently accessed
data.&lt;/p&gt;
&lt;h1&gt;The Use Case&lt;/h1&gt;
&lt;p&gt;Glacier is not for your standard backups. This is designed for storing
the long-term versions of backups that you only ever fall back to in
case of major catastrophe. As an example, I'm considering storing my
MySQL archives in Glacier. This wouldn't be my only backup; I maintain
last week's backup locally in my data center.&lt;/p&gt;
&lt;p&gt;In case of machine failure or operator error, I can restore from that
backup plus binlogs to get back up to right before the failure. Glacier
is not involved. Where this service comes in is if, somehow, my database
dies, my backup is deleted, and the mirrored database (slave or
standby-master) is also wiped.&lt;/p&gt;
&lt;p&gt;If three full copies of my data go away, that's a catastrophe and I
will have to restore from Glacier.&lt;/p&gt;
&lt;p&gt;To date, most of us have bitten the bullet and used Amazon's S3 for
this, even though the cost for this service is quite exorbitant. At my
day job, we also use &lt;a href="http://www.tarsnap.com/"&gt;Tarsnap&lt;/a&gt; -- an encrypted
data storage service that is backed by S3. While a fantastic service,
the cost of storing many terabytes of data really starts to add up.&lt;/p&gt;
&lt;p&gt;S3 also provides a lot of functionality that isn't needed for doing
off-site archival backups. While the CDN-like nature of S3 is great, I
really don't care if my backups are easily downloadable by HTTP. I'd
actually rather they weren't -- which, thankfully, S3 lets you do. You
still end up feeling slightly like you're misusing this service and, in
effect, paying for functionality you just don't need.&lt;/p&gt;
&lt;h1&gt;Cost Effectiveness&lt;/h1&gt;
&lt;p&gt;Whenever Amazon announces a product, my first step is to understand
the product and see if it's useful to me. This one is. Next, the big
question: is it cost effective for my purposes? Let's try to figure that
out.&lt;/p&gt;
&lt;p&gt;For this back-of-the-envelope comparison, let's imagine we have 10TB
of data we want to archive and store. This includes some number of
copies of the database, a bunch of files we want to keep "just in case",
etc. We expect that this data will only ever need to be used as a last
line of defense.&lt;/p&gt;
&lt;p&gt;In Amazon Glacier, storing 10TB costs &lt;strong&gt;$102.40 per month&lt;/strong&gt; (10,240 GB
at $0.01/GB).&lt;/p&gt;
&lt;p&gt;(Compare this to Amazon S3 which would cost about USD $1126 per
month. Glacier is roughly 9% of the cost of S3.)&lt;/p&gt;
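&lt;p&gt;The arithmetic is easy to check. This Python sketch uses only the numbers
quoted above: the $0.01/GB Glacier price and my rough S3 estimate of $1126 per
month, which I am taking as given rather than recomputing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GLACIER_USD_PER_GB_MONTH = 0.01  # price quoted above
S3_MONTHLY_USD = 1126.0          # the rough S3 figure quoted above

stored_gb = 10 * 1024            # 10TB expressed in GB
glacier_monthly = stored_gb * GLACIER_USD_PER_GB_MONTH
share_of_s3 = glacier_monthly / S3_MONTHLY_USD

print(f"Glacier: ${glacier_monthly:.2f} per month")
print(f"Share of the S3 cost: {share_of_s3:.1%}")
&lt;/code&gt;&lt;/pre&gt;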
&lt;p&gt;But what about comparing this to hosting it yourself? Let's assume that
you are going to build out your own hardware and store it in a data
center. There are a number of ways you can go to accomplish this, and
I'm going to be generous with the discounts and pricing.&lt;/p&gt;
&lt;p&gt;The best price I can find on a tape storage system puts the box
at slightly over USD $3000 for a machine capable of storing 18TB
uncompressed. (I'm assuming the 10TB above is compressed already and you
won't get much out of storage-level compression.)&lt;/p&gt;
&lt;p&gt;Assuming even a 50% discount (which you almost certainly wouldn't get),
that's still USD $1500 for raw storage for 18TB of data. This is just
the machine, though, now you need to put it somewhere. If you store it
in your office, the cost might be negligible -- but now you don't really
have secure backups. All of your company's data is now beholden to the
security of your physical location -- which may or may not be good.&lt;/p&gt;
&lt;p&gt;If you want to colocate your backup server, you've got an awkward
situation -- unless you already have multiple data centers, storing it
in your existing location means it isn't off-site. If you have to rent a
spot somewhere else, it will be at least $100/month to get this machine
online and powered. By the time you've bought the hardware and put it
somewhere, you've &lt;em&gt;well exceeded the cost of Amazon Glacier for this
dataset&lt;/em&gt;.&lt;/p&gt;
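&lt;p&gt;Here is a quick Python sketch of that comparison. It uses only the figures
above (the $1500 discounted tape unit, $100/month for colocation, and Glacier
at $0.01/GB for 10TB) and assumes the self-hosted box has no other costs at
all, which is generous to self-hosting.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Figures from the text above; everything else is assumed to be free.
hardware_usd = 1500.0
colo_usd_per_month = 100.0
glacier_usd_per_month = 10 * 1024 * 0.01


def self_hosted_total(months):
    # One-time hardware purchase plus the monthly colocation fee.
    return hardware_usd + colo_usd_per_month * months


def glacier_total(months):
    # Storage is the only charge counted here.
    return glacier_usd_per_month * months


for months in (12, 36, 60):
    print(f"{months:2d} months: self-hosted ${self_hosted_total(months):,.2f}"
          f" vs Glacier ${glacier_total(months):,.2f}")

# Months until the hardware outlay is paid back by the monthly difference.
break_even_months = hardware_usd / (glacier_usd_per_month - colo_usd_per_month)
print(f"break-even after about {break_even_months:.0f} months")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With only $2.40 a month separating the two recurring costs, the hardware
would take decades to pay for itself.&lt;/p&gt;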
&lt;p&gt;Let's not even talk about the cost for tapes, the labor required when
you have hardware failures, and other such issues. For once, I can say
that Amazon's pricing on a service is well below what you could achieve
yourself for this use.&lt;/p&gt;
&lt;p&gt;The other option is rotational media -- but the cost for that is more
than tape. Disks also tend to have a higher failure rate than tape in my
experience, driving up your costs in labor and spares-on-hand.&lt;/p&gt;
&lt;p&gt;Even if you somehow managed to get the cost of one system down low
enough to be competitive, now you've built a system that is perhaps
slightly cheaper, but not redundant. Glacier is replicated across
multiple data centers and to multiple locations in each facility. I
can't imagine any way in which an end-user can beat that. Amazon has a
huge economy of scale.&lt;/p&gt;
&lt;p&gt;Amazon Glacier is easily cheaper than hosting your own backups, not to
mention more convenient.&lt;/p&gt;
&lt;h1&gt;Who It's Not For&lt;/h1&gt;
&lt;p&gt;So, then, why wouldn't everybody use this service?&lt;/p&gt;
&lt;p&gt;For one, you may already have facilities in several locations and spare
power. In that case, adding a server won't change your opex appreciably, and sinking
a little into capex is often a better plan for most businesses than
increased opex. This also assumes that you have people going to those
facilities already, so the added cost of having someone swap in a tape
is pretty minimal.&lt;/p&gt;
&lt;p&gt;Also, at scale -- Glacier is a linear service. 100TB costs ten times as
much as 10TB, but that's not the case if you're doing it yourself. At
some point when you can start buying petabyte-level storage, you almost
certainly already have the infrastructure such that you won't save much
money by using Amazon.&lt;/p&gt;
&lt;p&gt;Finally, security. Whatever data you submit to Amazon is encrypted
in-transit, and Amazon says it encrypts stored data on its end as well, but
it holds those keys, not you. You lose
a little control of the security of your data. You could encrypt it
locally before sending, but that requires some effort on your end. It's
not terribly hard, but it does require some consideration.&lt;/p&gt;
&lt;p&gt;In my experience, this rules out large companies that wouldn't consider
using Amazon's services anyway. For the rest of us who work in startups
or small to medium businesses, though, Glacier looks great.&lt;/p&gt;
&lt;h1&gt;Some Caveats on Pricing and Usage&lt;/h1&gt;
&lt;p&gt;One thing that is important to mention: this is the equivalent of Iron
Mountain or similar long-term archival storage. It's like a glacier --
large, slow-moving, and very, very frozen.&lt;/p&gt;
&lt;p&gt;Technically, this means that you need to be storing things you
don't intend to retrieve very frequently. In fact, if you over-use
retrieval, it will cost you to get your data back out: &lt;a href="http://aws.amazon.com/glacier/#pricing"&gt;Amazon Glacier
Pricing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Importantly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You can retrieve &lt;strong&gt;up to 5%&lt;/strong&gt; of your stored-data monthly, for
free. More than 5% requires you to pay, and this &lt;em&gt;starts at&lt;/em&gt; USD $0.01/GB.
(This 5% is slightly misleading, too, as you are actually limited to
retrieving 5% per-month, but no more than 1/30th of that per day. In
other words, if you have 10TB stored as in our example, you can only
retrieve about 17GB/day before you start paying.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you delete something that has been stored for less than 90 days,
you pay a USD $0.03/GB fee.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This last point is important, as it means that you are promising Amazon
that you will be storing data for at least three months. If you don't,
you will be paying for three months of storage &lt;em&gt;anyway&lt;/em&gt;.&lt;/p&gt;
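&lt;p&gt;Both caveats reduce to simple arithmetic. This Python sketch works them
through for the 10TB example, using only the prices listed above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;stored_gb = 10 * 1024            # the 10TB example from above
storage_usd_per_gb_month = 0.01
early_delete_usd_per_gb = 0.03

# Free retrieval: 5% of stored data per month, spread evenly over 30 days.
free_monthly_gb = stored_gb * 0.05
free_daily_gb = free_monthly_gb / 30
print(f"free retrieval: {free_monthly_gb:.0f} GB/month, about {free_daily_gb:.1f} GB/day")

# The early-deletion fee expressed as months of ordinary storage.
months_of_storage = early_delete_usd_per_gb / storage_usd_per_gb_month
print(f"early deletion costs {months_of_storage:.0f} months of storage per GB")
&lt;/code&gt;&lt;/pre&gt;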
&lt;p&gt;That said, I will be moving my archive backups to Glacier. It looks good
and the caveats are well within reason. Kudos to Amazon for providing a
useful service that really fills a need.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/amazon-glacier</guid><pubDate>Tue, 21 Aug 2012 10:50:00 GMT</pubDate></item><item><title>Singularity, an Introduction</title><link>http://qq.is/article/singularity-intro</link><description>&lt;p&gt;Today I want to talk about Singularity, a system I've been developing to
help with certain administration/operation related tasks. Some time ago
I wrote about my ideas on a new monitoring system -- this is not that.
This may be able to do that, but right now this is something else.&lt;/p&gt;
&lt;p&gt;Singularity is, in essence, a software agent that you run on all of
your servers. It gives you certain functionality that I find really nice
to have. Nothing that is earth-shattering -- yes, you can get this same
functionality through other systems, but there is nothing I've found
that works as easily and completely as Singularity. Let me show you what
I mean.&lt;/p&gt;
&lt;h1&gt;Singularity as Remote Execution&lt;/h1&gt;
&lt;p&gt;Originally I wanted something faster than &lt;a href="http://fabfile.org/"&gt;Fabric&lt;/a&gt;.
It's a fantastic system and very flexible, but it uses SSH and it's
serial. I don't need SSH here (it's an entirely internal network) and I
want it to be parallel. Above a certain point, serial is just way too
slow!&lt;/p&gt;
&lt;p&gt;Singularity lets you execute something on a remote host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -H app1 exec /usr/bin/blah
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or multiple:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -H app1,app2,app3 exec /usr/bin/blah
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or perhaps you want to do something globally:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -A exec "service puppetd start"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, you can specify roles. If you assign a machine to a role (and
a machine can have many roles), then you can execute things on those
roles. I use this for, say, our Riak nodes, App nodes, etc.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -H app1 add_role app
$ sng-client -H app2 add_role app
$ sng-client -R app exec /usr/bin/blah
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That final command executes on app1 and app2.&lt;/p&gt;
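&lt;p&gt;If it helps to picture the role bookkeeping, here is a tiny Python
sketch of the idea. This is not Singularity's code (that is written in Go);
the &lt;code&gt;RoleRegistry&lt;/code&gt; class and its method names are hypothetical
and exist only to show how a role fans out to hosts.&lt;/p&gt;

```python
from collections import defaultdict


class RoleRegistry:
    """Hypothetical in-memory stand-in for role bookkeeping; not Singularity's code."""

    def __init__(self):
        self._roles_by_host = defaultdict(set)

    def add_role(self, host, role):
        # A machine can hold many roles, so each host maps to a set of roles.
        self._roles_by_host[host].add(role)

    def hosts_for_role(self, role):
        # Every host holding the role is a target, like `-R app` above.
        return sorted(
            host for host, roles in self._roles_by_host.items() if role in roles
        )


registry = RoleRegistry()
registry.add_role("app1", "app")
registry.add_role("app2", "app")
registry.add_role("riak1", "riak")
print(registry.hosts_for_role("app"))
```

&lt;p&gt;Asking for the app role yields app1 and app2 and leaves riak1 alone,
which is the same fan-out the final command above relies on.&lt;/p&gt;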
&lt;h1&gt;Singularity as Locking Service&lt;/h1&gt;
&lt;p&gt;A pattern I use a lot: I want cron to start something if it isn't
running, and otherwise do nothing. This is easily done with
any init script that supports a status command -- or you can check for
a pid file -- or you can use a purpose-built tool that does locking on
the filesystem.&lt;/p&gt;
&lt;p&gt;All of these will work, but you will have to figure out how you want to
do it. Singularity lets you do it easily:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -L mylock exec /usr/bin/somecommand
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will attempt to get the local (i.e., on this machine only) lock
called mylock and, if successful, will then run that command. That's
great, nothing special...&lt;/p&gt;
&lt;p&gt;Well, now realize that you can do it remotely, acquiring a lock on the
remote machine and only running if that lock can be acquired.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -H app1 -L mylock exec /usr/bin/compact-files
&lt;/code&gt;&lt;/pre&gt;
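&lt;p&gt;For a sense of what "get the lock, then run" means on a single
machine, here is a minimal Python sketch using a non-blocking
&lt;code&gt;flock&lt;/code&gt; on a lock file. It assumes a Unix host, and it is not
how Singularity implements locks; the function name and the lock file
location are made up for the example.&lt;/p&gt;

```python
import fcntl
import os
import subprocess
import tempfile


def run_with_local_lock(lock_name, command, lock_dir=None):
    """Run command only if the named lock is free.

    Returns the command's exit code, or None when the lock is already held.
    Illustrative only: the lock file layout here is an assumption.
    """
    lock_dir = lock_dir or tempfile.gettempdir()
    path = os.path.join(lock_dir, lock_name + ".lock")
    with open(path, "w") as handle:
        try:
            # LOCK_NB makes the attempt fail right away instead of waiting.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when the file is closed at the end of the block.
        return subprocess.call(command)
```

&lt;p&gt;Calling &lt;code&gt;run_with_local_lock("mylock", ["/usr/bin/somecommand"])&lt;/code&gt;
from cron every minute gives the "start it unless it's already running"
behaviour: overlapping invocations simply return without doing anything.&lt;/p&gt;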
&lt;p&gt;You can also use &lt;em&gt;global locks&lt;/em&gt;, which can only be held once across the
entire infrastructure. (We use &lt;a href="https://github.com/ha/doozerd"&gt;doozer&lt;/a&gt;
for the central locking/Paxos service.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sng-client -G globalmylock exec /do/something/big
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Global locks can be useful for cron jobs. Imagine if you have the same
cron job on your four app nodes, and you need there to be only one copy
of it running anywhere globally. It's an important payment job. You tell
Singularity this, and only one of those nodes will ever run your job.&lt;/p&gt;
&lt;p&gt;If the machine running your job goes away, then one of the other cron
jobs will succeed and start up since that global lock will no longer be
claimed.&lt;/p&gt;
&lt;h1&gt;Borrowing from Puppet&lt;/h1&gt;
&lt;p&gt;Another interesting thing Singularity does, though it isn't fully
exposed yet, is use Facter, a tool from &lt;a href="http://puppetlabs.com/"&gt;Puppet&lt;/a&gt;.
Facter gathers a lot of information about the machine it
runs on and exports RAM, disks, OS, and other useful information.&lt;/p&gt;
&lt;p&gt;This information will allow Singularity to make intelligent choices
about where to put processes. (More on that later when we talk about my
plans for the future of this project.)&lt;/p&gt;
&lt;p&gt;This information also allows us to export inventory style information.
Ever wanted to build a UI that shows what kind of hardware you have, but
didn't want to go through the work of keeping it up to date? Singularity
is already gathering all of the information you need automatically and
collating it.&lt;/p&gt;
&lt;h1&gt;Under the Hood&lt;/h1&gt;
&lt;p&gt;This project is written in &lt;a href="http://golang.org/"&gt;Go&lt;/a&gt; and uses
&lt;a href="http://www.zeromq.org/"&gt;ZeroMQ&lt;/a&gt; and &lt;a href="https://developers.google.com/protocol-buffers/"&gt;Protocol
Buffers&lt;/a&gt; internally for
all communication. This helps ensure reliability and will eventually
ensure speed and flexibility.&lt;/p&gt;
&lt;p&gt;The Go language is a really good fit for this kind of systems project.
Low footprint, compiled distribution, fast execution, and the built-in
concurrency is fantastic. If you haven't used Go, I recommend you give
it a shot.&lt;/p&gt;
&lt;p&gt;Architecturally, the doozer Paxos service sits in the
middle. You can configure doozer as an HA system with failover. The
Singularity agents then connect to your doozer cloud and use that to
coordinate what they're doing -- e.g., to make sure only one of the
agents is running the global scheduler.&lt;/p&gt;
&lt;p&gt;Everything is designed with distribution in mind. There are global lock
clearers that make sure that if a machine crashes, locks are released.
Or if a machine is taken offline, it gets removed from the cloud of
machines in Singularity.&lt;/p&gt;
&lt;h1&gt;Singularity -- Soon&lt;/h1&gt;
&lt;p&gt;Once I started hacking on this project, I realized that there are
so many things we do in operations that we could just replace with
something like Singularity and make our lives so much easier. For
example, cron -- it's an archaic system that we all love to hate, but
it could be so much better. Rather than just building a better cron, a
request like "I want this job to run, but it could run on any app node"
seems a better fit for something like an integrated inventory/cron
system.&lt;/p&gt;
&lt;p&gt;Soon, you will be able to give Singularity configurations to run, and it
will manage them for you. I.e., you could do something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;log_rotate:
    role: app
    command: /usr/sbin/logrotate
    daily: 2am
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That example is easily understood, but you can already do that with
cron. More interesting is if you add in some of the other features and
things that Singularity can do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;profiler:
    local_lock: profiler
    command: /usr/sbin/profiler
    every: 1m
    constraint:
        - load_avg.1m &amp;lt; 3
        - cpu.idle &amp;gt; 20%
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example configuration specifies a profiler that runs every minute.
However, only one copy ever runs at a time -- if a run takes more than a
minute, the lock constraint fails and you don't end up stacking up profilers.
Additionally, it only runs on machines with a 1-minute load average
under 3 and more than 20% idle CPU.&lt;/p&gt;
&lt;p&gt;That would be a little more difficult to do in standard cron.&lt;/p&gt;
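&lt;p&gt;Since none of this exists yet, here is only a rough Python sketch of
the decision the scheduler would make for the profiler job: run when the
lock is free, the load is under 3, and the CPU is more than 20% idle. The
function name, the shape of the metrics dictionary, and the default
thresholds are all assumptions for illustration.&lt;/p&gt;

```python
def should_run(lock_is_free, metrics, max_load_1m=3.0, min_idle_pct=20.0):
    """Return True only when the lock is free and both constraints hold."""
    # Constraint: 1-minute load average must be under the limit.
    load_ok = max_load_1m > metrics["load_avg.1m"]
    # Constraint: idle CPU must be above the minimum percentage.
    idle_ok = metrics["cpu.idle"] > min_idle_pct
    return lock_is_free and load_ok and idle_ok


# A lightly loaded machine with the profiler lock free is eligible.
print(should_run(True, {"load_avg.1m": 1.2, "cpu.idle": 55.0}))
```

&lt;p&gt;If the previous profiler run is still holding the lock, the same call
with &lt;code&gt;lock_is_free=False&lt;/code&gt; comes back false and the job is skipped
for that minute.&lt;/p&gt;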
&lt;p&gt;I have some more ideas for this system. Events, chaining inputs
and outputs, integration with &lt;a href="http://opentsdb.net/"&gt;OpenTSDB&lt;/a&gt; for
monitoring, &lt;a href="http://pagerduty.com"&gt;PagerDuty&lt;/a&gt; for alerting, etc. The
future is exciting.&lt;/p&gt;
&lt;h1&gt;Source and Development&lt;/h1&gt;
&lt;p&gt;The code is available on GitHub:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/xb95/singularity"&gt;https://github.com/xb95/singularity&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There is no documentation and a lot of gotchas. I am writing this post
to help sort out my thoughts, and to get something online. You are
welcome to play with it if you want, and feedback is always welcome.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/singularity-intro</guid><pubDate>Tue, 13 Nov 2012 17:21:00 GMT</pubDate></item><item><title>Handling an Outage</title><link>http://qq.is/article/outage-handling</link><description>&lt;p&gt;Last night the colocation provider I use for Dreamwidth,
&lt;a href="http://www.serverbeach.com/"&gt;ServerBeach&lt;/a&gt;, was down for nearly four
hours from 0000 CST to about 0330 CST. This blog post is a customer-side
postmortem about the company's handling of this outage.&lt;/p&gt;
&lt;h1&gt;Outage Notification&lt;/h1&gt;
&lt;p&gt;I generally classify outages as trivial (seconds to, say, 2-3 minutes),
minor (3-10 minutes), or major (10+ minutes). I will refer to these in
the rest of this post as the handling does change -- to help balance
resolution time with customer comfort, mostly. Of course, it's worth
mentioning that one size does not fit all, and what I find as best
practices might be somewhat different if you have an SLA or if you are
in another industry.&lt;/p&gt;
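&lt;p&gt;If you wanted to encode those buckets in a tool, a minimal Python
sketch might look like this. The cutoffs are the rough ones I gave above
(3 and 10 minutes), and the function name is made up for the example; tune
both to your own situation.&lt;/p&gt;

```python
def classify_outage(minutes_down):
    """Bucket an outage by duration using the rough thresholds from this post."""
    if minutes_down >= 10:
        return "major"
    if minutes_down >= 3:
        return "minor"
    return "trivial"


for minutes in (1, 5, 210):
    print(minutes, classify_outage(minutes))
```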
&lt;p&gt;Outage handling starts when you are notified of a problem. This usually
happens at the bottom somewhere -- a customer service rep, a customer,
or your monitoring solution will alert someone to the problem. There are
pretty good odds that this is someone who is external to the company or
someone who has nothing to do with your technical operations.&lt;/p&gt;
&lt;p&gt;For Dreamwidth, I never notice an outage first. It's always a user or
one of our volunteers or other staff who will find out that we're down
before I ever do. Even Nagios isn't as fast as a human who is actively
using the site.&lt;/p&gt;
&lt;p&gt;In some situations, too, your monitoring system won't work. In last
night's outage, the problem was that the entire data center went off
the air. Our monitoring system, being purely internal, had no way of
alerting us. (We have in the past had external monitoring, but I found
Pingdom unreliable and other options proved too expensive. Maybe there
are better options now?) Even if your monitoring system is working fine,
some outages just aren't noticed by it. We've all had situations where
something isn't correctly monitored or is giving false positives!&lt;/p&gt;
&lt;p&gt;For these reasons, it is important to provide a method for users
(internal and external) to advise of an outage. Dreamwidth has
a Twitter account that end users can talk to, and our mid- to
senior-level volunteers and staff all have phone numbers for our systems
administrators and know that they can call us 24/7 to advise of an
outage.&lt;/p&gt;
&lt;p&gt;That's how I found out we were down last night: one of our employees
called me and advised me within minutes of the site being down. By that
point of course, the outage was considered a minor outage since it had
been more than a few minutes.&lt;/p&gt;
&lt;p&gt;We have trained our users that, during downtimes, our Twitter account
is the place to go for updates. We put up a message advising them that
we were down and we were investigating and had no ETA. This took less
than a minute of our time and the effects were immediate -- users knew
we were aware, that we were on the problem, and they could relax. They
responded by being pleasant and thankful and went off to do other things
on the Internet instead of continually refreshing the site and getting
more and more angry.&lt;/p&gt;
&lt;p&gt;The effect this small bit of information can have on your customers
is hard to overstate: it is the difference between a bad experience
where your customer debates finding another provider and one where the
customer feels confidence in you and that you are on top of the problem.
Internally, it doesn't matter what's going on, &lt;strong&gt;let your customers know
you're aware&lt;/strong&gt;. Be calm and confident, but communicate!&lt;/p&gt;
&lt;p&gt;Now the caveat I mentioned about outage sizes: I first check to see if
the problem is something I can resolve in a minute or less. I.e., if
it's a trivial outage, it's better to get the site up immediately and
then post a notification that it was down. However, if I realize the
outage is at least a minor outage, then it's vital to notify people that
there is a problem.&lt;/p&gt;
&lt;p&gt;This gives us the first two pieces of a solid outage handling process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Have a way for users/staff to notify you of downtimes&lt;/li&gt;
&lt;li&gt;Acknowledge the downtime immediately if it's minor (3+ minutes)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ServerBeach failed on both fronts. I had no way of notifying them of
a downtime except calling their tech support line, which I tried to
do, but their phone lines were failing (not picking up at all) and I
couldn't get through. I ultimately was able to reach PEER1 (the parent
company) but they couldn't really help me.&lt;/p&gt;
&lt;p&gt;I could have assumed they knew about the outage, but it would have only
been an assumption -- and it's not a good business practice for me to
just assume that an outage is being fixed! -- so I had to keep trying
for 20 minutes to reach somebody. That was a huge waste of my time, and
all because they didn't provide notification that they were aware of a
problem.&lt;/p&gt;
&lt;p&gt;The first notification I can find was nearly an &lt;strong&gt;hour&lt;/strong&gt; after the
downtime started. Completely unacceptable. The entire data center was
offline -- thousands of customers -- and they took an hour to let us
know.&lt;/p&gt;
&lt;h1&gt;Ongoing Outages&lt;/h1&gt;
&lt;p&gt;In this case, the outage was a long one. Nearly four hours of
downtime. Outages of that caliber start to get very unnerving for the
users, because now you've graduated from "annoyance" to "potentially
catastrophic". Why is it taking four hours to come back up? Was there a
fire or flood? Meteor strike? Did the government come in and seize the
building because of Mega?&lt;/p&gt;
&lt;p&gt;At this point someone needs to be on point for communication. It should
be someone who can, every so often (I find 30-60 minutes is frequent
enough) post and let users know that you're still aware of the problem
and, yes, you're still working on it. That holds even if, as in
Dreamwidth's case, you have no information and are just crossing your
fingers that your provider (ServerBeach, for us) will fix things
sometime soon. Your very presence is
comforting to your users and lets them know that they're
important. That feeling is extremely valuable to have -- if you don't
encourage goodwill, the lack thereof will be bad for your business.&lt;/p&gt;
&lt;p&gt;In this outage, I got most of my information from other customers
on Twitter. I followed the #peer1 and #serverbeach hashtags and was
collecting information from other people who were customers. &lt;strong&gt;This
is stupid and bad!&lt;/strong&gt; I shouldn't have to rely on other &lt;em&gt;customers&lt;/em&gt; to
give me information about what's going on. It makes the company look
incompetent. Seriously.&lt;/p&gt;
&lt;p&gt;It was &lt;strong&gt;two hours&lt;/strong&gt; after the outage started before there was official
information about the problem. Unfortunately, this information was in
a place I never looked -- because the person I did get on the phone
earlier told me to look at the PEER1 network status forums, which
are separate from the ServerBeach status forums. (My
management portal and branding are all PEER1, but because this data
center was acquired via ServerBeach, it has a different area for
status updates.) This is also ridiculous.&lt;/p&gt;
&lt;p&gt;The important parts of the process here:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Have a predictable place to find status updates&lt;/li&gt;
&lt;li&gt;Keep users informed of status and ETA (if available)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first point cannot be stressed enough. Users need to know where to
go to find things out. Staff needs to know, too, so they can give the
right information to users. Giving a customer wrong information is worse
than no information. I spent the whole night thinking that ServerBeach
never posted anything -- which was wrong: they had, even if it was way
too slow.&lt;/p&gt;
&lt;p&gt;In retrospect, now that I'm reading the outage thread on ServerBeach's
side, once they got the ball rolling they were following a good flow
for updating. They posted every 30-45 minutes and updated with as much
information as they had, which was great. Kudos to them for having a
good flow once things got going.&lt;/p&gt;
&lt;h1&gt;Outage Closing&lt;/h1&gt;
&lt;p&gt;The end of an outage should be handled with the same ideas repeated. Let
people know that you're back up, then let them know what happened in as
much detail as you have and advise if you will be giving a postmortem.
Commit to follow-through so that people know what to expect.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Post an end-of-outage notification, advise if there will be a
postmortem&lt;/li&gt;
&lt;li&gt;If providing a postmortem later, make it predictably
located and linked&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ServerBeach did well here (excepting of course that I didn't know where
to find this information): they posted an update at the end, said what
happened, why it caused an outage, and what they would be doing to fix
it. This is a good response -- although since they mentioned a number of
things that were unexplained or unclear, it will need to be followed up
with a response when those things are clarified.&lt;/p&gt;
&lt;h1&gt;Communication&lt;/h1&gt;
&lt;p&gt;It is my opinion that service providers should overcommunicate. You will
almost never fail by telling the users exactly what is going on, and you
will probably find that people are remarkably forgiving if they feel
included in the process. Because of this outage, Dreamwidth was down
for nearly four hours, but our users were polite and &lt;strong&gt;thankful&lt;/strong&gt;. Just
because we let them know what was going on.&lt;/p&gt;
&lt;p&gt;ServerBeach's outage was bad, but everything breaks sometime. The
real fault here is their handling of the notification process and how
disconnected they were from the users. This is inexcusable in a service
provider, particularly these days when hosting providers are a dime
a dozen and competition is not so much about price. For that matter,
I'd pay a premium to ensure that I am hosted with a service that can
actually communicate when something is going on.&lt;/p&gt;
&lt;p&gt;Now, can someone tell ServerBeach to post a postmortem about their
handling of the communication during this outage? :-)&lt;/p&gt;
&lt;p&gt;End of rant.&lt;/p&gt;
&lt;h1&gt;Update&lt;/h1&gt;
&lt;p&gt;I just received a call from Dax Moreno, Director of Customer Experience.
He reached out to talk about the outage. I was able to convey most of
this content to him in a less ranty, more constructive way. (Or I hope
it came across that way!)&lt;/p&gt;
&lt;p&gt;Major points to ServerBeach/PEER1 for reaching out like that. It is
never good to have an outage, and the handling at the beginning leaves
much to be desired (which Dax agreed with), but having a personal
contact from someone nets a huge gain in goodwill from the customer
and makes them feel more in control of what is, by its nature, an
uncontrollable experience.&lt;/p&gt;
&lt;p&gt;Good on 'em for that, then.&lt;/p&gt;</description><guid isPermaLink="true">http://qq.is/article/outage-handling</guid><pubDate>Tue, 12 Feb 2013 15:00:00 GMT</pubDate></item></channel></rss>