Handling an Outage
Last night the colocation provider I use for Dreamwidth, ServerBeach, was down for nearly four hours from 0000 CST to about 0330 CST. This blog post is a customer-side postmortem about the company's handling of this outage.
Outage Notification
I generally classify outages as trivial (seconds to, say, 2-3 minutes), minor (3-10 minutes), or major (10+ minutes). I will refer to these categories throughout the rest of this post, because the handling changes with them -- mostly to help balance resolution time with customer comfort. Of course, one size does not fit all, and what I consider best practices might be somewhat different if you have an SLA or if you are in another industry.
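The three categories can be written down as a small function. This is only a sketch: the function name is made up for illustration, and treating exactly 3 and exactly 10 minutes as the start of the next category is an assumption, since the ranges above are deliberately loose.

```python
def classify_outage(duration_seconds: float) -> str:
    """Classify an outage by how long it has lasted so far."""
    minutes = duration_seconds / 60
    if minutes < 3:
        return "trivial"   # fix it first, tell people afterwards
    if minutes < 10:
        return "minor"     # acknowledge it publicly right away
    return "major"


if __name__ == "__main__":
    # Last night's outage ran for roughly three and a half hours.
    print(classify_outage(3.5 * 3600))
```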
Outage handling starts when you are notified of a problem. This usually happens at the bottom somewhere -- a customer service rep, a customer, or your monitoring solution will alert someone to the problem. There are pretty good odds that this is someone who is external to the company or someone who has nothing to do with your technical operations.
For Dreamwidth, I never notice an outage first. It's always a user or one of our volunteers or other staff who will find out that we're down before I ever do. Even Nagios isn't as fast as a human who is actively using the site.
In some situations, too, your monitoring system won't work. In last night's outage, the problem was that the entire data center went off the air. Our monitoring system, being purely internal, had no way of alerting us. (We have in the past had external monitoring, but I found Pingdom unreliable and other options proved too expensive. Maybe there are better options now?) Even if your monitoring system is working fine, some outages just aren't noticed by it. We've all had situations where something isn't correctly monitored or is giving false positives!
For these reasons, it is important to provide a method for users (internal and external) to advise of an outage. Dreamwidth has a Twitter account that end users can talk to, and our mid- to senior-level volunteers and staff all have phone numbers for our systems administrators and know that they can call us 24/7 to advise of an outage.
That's how I found out we were down last night: one of our employees called me and advised me within minutes of the site being down. By that point of course, the outage was considered a minor outage since it had been more than a few minutes.
We have trained our users that, during downtimes, our Twitter account is the place to go for updates. We put up a message advising them that we were down and we were investigating and had no ETA. This took less than a minute of our time and the effects were immediate -- users knew we were aware, that we were on the problem, and they could relax. They responded by being pleasant and thankful and went off to do other things on the Internet instead of continually refreshing the site and getting more and more angry.
The effect this small bit of information can have on your customers is hard to overstate: it is the difference between a bad experience where your customer debates finding another provider and one where the customer feels confident in you and trusts that you are on top of the problem. It doesn't matter what's going on internally: let your customers know you're aware. Be calm and confident, but communicate!
Now the caveat I mentioned about outage sizes: I first check to see if the problem is something I can resolve in a minute or less. I.e., if it's a trivial outage, it's better to get the site up immediately and then post a notification that it was down. However, if I realize the outage is at least a minor outage, then it's vital to notify people that there is a problem.
This gives us the first two pieces of a solid outage handling process:
- Have a way for users/staff to notify you of downtimes
- Acknowledge the downtime immediately if it's at least minor (3+ minutes)
ServerBeach failed on both fronts. I had no way of notifying them of a downtime except calling their tech support line, which I tried to do, but their phone lines were failing (not picking up at all) and I couldn't get through. I ultimately was able to reach PEER1 (the parent company) but they couldn't really help me.
I could have assumed they knew about the outage, but it would have only been an assumption -- and it's not a good business practice for me to just assume that an outage is being fixed! -- so I had to keep trying for 20 minutes to reach somebody. That was a huge waste of my time, and all because they didn't provide notification that they were aware of a problem.
The first notification I can find was nearly an hour after the downtime started. Completely unacceptable. The entire data center was offline -- thousands of customers -- and they took an hour to let us know.
Ongoing Outages
In this case, the outage was a long one. Nearly four hours of downtime. Outages of that caliber start to get very unnerving for the users, because now you've graduated from "annoyance" to "potentially catastrophic". Why is it taking four hours to come back up? Was there a fire or flood? Meteor strike? Did the government come in and seize the building because of Mega?
At this point someone needs to be on point for communication. It should be someone who can, every so often (I find 30-60 minutes is frequent enough), post and let users know that you're still aware of the problem and, yes, you're still working on it. That holds even when, as in Dreamwidth's case, you have no information and are just crossing your fingers that ServerBeach will fix things sometime soon. Your very presence is comforting to your users and lets them know that they're important. That feeling is extremely valuable to have -- if you don't encourage goodwill, the lack thereof will be bad for your business.
In this outage, I got most of my information from other customers on Twitter. I followed the #peer1 and #serverbeach hashtags and collected information from other customers. This is stupid and bad! I shouldn't have to rely on other customers to give me information about what's going on. It makes the company look incompetent. Seriously.
It was two hours after the outage started before there was official information about the problem. Unfortunately, this information was in a place I never looked -- because the person I did get on the phone earlier told me to look at the PEER1 network status forums, which are separate from the ServerBeach status forums. (My management portal and branding are all PEER1, but because this data center was acquired via ServerBeach, it has a different area for status updates.) This is also ridiculous.
The important parts of the process here:
- Have a predictable place to find status updates
- Keep users informed of status and ETA (if available)
The first point cannot be stressed enough. Users need to know where to go to find things out. Staff need to know, too, so they can give the right information to users. Giving a customer wrong information is worse than giving no information. I spent the whole night thinking that ServerBeach never posted anything -- which was wrong. They had, even if it was way too slow.
In retrospect, now that I'm reading the outage thread on ServerBeach's side, once they got the ball rolling they were following a good flow for updating. They posted every 30-45 minutes and updated with as much information as they had, which was great. Kudos to them for having a good flow once things got going.
Outage Closing
The end of an outage should be handled with the same ideas repeated. Let people know that you're back up, then let them know what happened in as much detail as you have and advise if you will be giving a postmortem. Commit to follow-through so that people know what to expect.
- Post an end-of-outage notification, advise if there will be a postmortem
- If providing a postmortem later, make it predictably located and linked
ServerBeach did well here (excepting of course that I didn't know where to find this information): they posted an update at the end, said what happened, why it caused an outage, and what they would be doing to fix it. This is a good response -- although since they mentioned a number of things that were unexplained or unclear, it will need to be followed up with a response when those things are clarified.
Communication
It is my opinion that service providers should overcommunicate. You will almost never fail by telling the users exactly what is going on, and you will probably find that people are remarkably forgiving if they feel included in the process. Because of this outage, Dreamwidth was down for nearly four hours, but our users were polite and thankful. Just because we let them know what was going on.
ServerBeach's outage was bad, but everything breaks sometime. The real fault here is their handling of the notification process and how disconnected they were from the users. This is inexcusable in a service provider, particularly these days when hosting providers are a dime a dozen and competition is not so much about price. For that matter, I'd pay a premium to ensure that I am hosted with a service that can actually communicate when something is going on.
Now, can someone tell ServerBeach to post a postmortem about their handling of the communication during this outage? :-)
End of rant.
Update
I just received a call from Dax Moreno, Director of Customer Experience. He reached out to talk about the outage. I was able to convey most of this content to him in a less ranty, more constructive way. (Or I hope it came across that way!)
Major points to ServerBeach/PEER1 for reaching out like that. It is never good to have an outage, and the handling at the beginning leaves much to be desired (which Dax agreed with), but having a personal contact from someone nets a huge gain in goodwill from the customer and makes them feel more in control of what is, by its nature, an uncontrollable experience.
Good on 'em for that, then.
posted on 2013-02-12 at 15:00
Singularity, an Introduction
Today I want to talk about Singularity, a system I've been developing to help with certain administration/operation related tasks. Some time ago I wrote about my ideas on a new monitoring system -- this is not that. This may be able to do that, but right now this is something else.
Singularity is, in essence, a software agent that you run on all of your servers. It gives you certain functionality that I find really nice to have. Nothing that is earth-shattering -- yes, you can get this same functionality through other systems, but there is nothing I've found that works as easily and completely as Singularity. Let me show you what I mean.
Singularity as Remote Execution
Originally I wanted something faster than Fabric. It's a fantastic system and very flexible, but it uses SSH and it's serial. I don't need SSH here (it's an entirely internal network) and I want it to be parallel. Above a certain point, serial is just way too slow!
Singularity lets you execute something on a remote host:
$ sng-client -H app1 exec /usr/bin/blah
Or multiple:
$ sng-client -H app1,app2,app3 exec /usr/bin/blah
Or perhaps you want to do something globally:
$ sng-client -A exec "service puppetd start"
Finally, you can specify roles. If you assign a machine to a role (and a machine can have many roles), then you can execute things on those roles. I use this for, say, our Riak nodes, App nodes, etc.
$ sng-client -H app1 add_role app
$ sng-client -H app2 add_role app
$ sng-client -R app exec /usr/sbin/blah
That final command executes on app1 and app2.
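To make the role idea concrete, here is a rough Python sketch of resolving a role to its hosts and then fanning a command out to them in parallel rather than serially. It is not how Singularity does it (the agent is written in Go and talks over ZeroMQ); the inventory, the function names, and the stand-in `run_on_host` are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inventory: host name -> set of roles (a machine can have many).
ROLES = {
    "app1": {"app"},
    "app2": {"app", "cron"},
    "riak1": {"riak"},
}


def hosts_for_role(role, inventory=ROLES):
    """Return the sorted list of hosts that carry the given role."""
    return sorted(host for host, roles in inventory.items() if role in roles)


def run_on_host(host, command):
    """Stand-in for remote execution; a real agent would run the command."""
    return f"{host}: would run {command}"


def run_parallel(hosts, command):
    """Dispatch to every host concurrently instead of one at a time."""
    if not hosts:
        return []
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        # pool.map returns results in the same order as the input hosts.
        return list(pool.map(lambda h: run_on_host(h, command), hosts))


if __name__ == "__main__":
    for line in run_parallel(hosts_for_role("app"), "/usr/sbin/blah"):
        print(line)
```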
Singularity as Locking Service
A pattern I use a lot: I want cron to start something if it's offline, but otherwise do nothing. This is easily done with any init script that supports a status command -- or you can check for a pid file -- or you can use a purpose-built tool to do locking on the filesystem.
All of these will work, but you will have to figure out how you want to do it. Singularity lets you do it easily:
$ sng-client -L mylock exec /usr/bin/somecommand
This will attempt to get the local (i.e., on this machine only) lock called mylock and, if successful, will then run that command. That's great, nothing special...
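For comparison, this is roughly what the do-it-yourself filesystem version of "take the lock, then run" looks like. It is a sketch using a non-blocking `flock` on a Unix system, not Singularity's implementation; the function name and the lock file location are assumptions.

```python
import fcntl
import os
import subprocess
import tempfile


def run_with_local_lock(lock_name, argv, lock_dir=None):
    """Run argv only if the named lock is free.

    Returns the command's exit code, or None if the lock was already held.
    """
    lock_dir = lock_dir or tempfile.gettempdir()
    path = os.path.join(lock_dir, f"{lock_name}.lock")
    with open(path, "w") as handle:
        try:
            # LOCK_NB makes this fail immediately instead of waiting.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is dropped when the file is closed at the end of the block.
        return subprocess.run(argv).returncode
```

A cron entry could call this every minute; overlapping runs simply return `None` and exit.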
Well, now realize that you can do it remotely, fetching a lock on the remote machine and only running if the lock can be gotten.
$ sng-client -H app1 -L mylock exec /usr/bin/compact-files
You can also use global locks, which can only be held once across the entire infrastructure. (We use doozer for the central locking/PAXOS service.)
$ sng-client -G globalmylock exec /do/something/big
Global locks can be useful for cron jobs. Imagine if you have the same cron job on your four app nodes, and you need there to be only one copy of it running anywhere globally. It's an important payment job. You tell Singularity this, and only one of those nodes will ever run your job.
If the machine running your job goes away, then one of the other cron jobs will succeed and start up since that global lock will no longer be claimed.
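The behavior described here can be modeled in a few lines. The class below is an in-memory stand-in for a central lock service, written only to illustrate the semantics; it is not doozer's API, and the names are hypothetical.

```python
class GlobalLockTable:
    """In-memory stand-in for a central lock service (not doozer's API)."""

    def __init__(self):
        self._owners = {}  # lock name -> node currently holding it

    def acquire(self, name, node):
        """Grant the lock if it is free or already held by this node."""
        holder = self._owners.setdefault(name, node)
        return holder == node

    def release_node(self, node):
        """Drop every lock held by a node, e.g. after it goes offline."""
        self._owners = {k: v for k, v in self._owners.items() if v != node}


if __name__ == "__main__":
    locks = GlobalLockTable()
    nodes = ["app1", "app2", "app3", "app4"]
    # All four cron jobs fire; only one gets the lock and runs the job.
    print([n for n in nodes if locks.acquire("payment-job", n)])
    # The holder disappears, so the next attempt elsewhere succeeds.
    locks.release_node("app1")
    print([n for n in nodes[1:] if locks.acquire("payment-job", n)])
```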
Borrowing from Puppet
Another interesting thing about Singularity, though it isn't fully exposed yet, is that it depends on Facter, a program from the Puppet project. Facter gathers a lot of information about the machine it runs on and exports RAM, disks, OS, and other useful details.
This information will allow Singularity to make intelligent choices about where to put processes. (More on that later when we talk about my plans for the future of this project.)
This information also allows us to export inventory style information. Ever wanted to build a UI that shows what kind of hardware you have, but didn't want to go through the work of keeping it up to date? Singularity is already gathering all of the information you need automatically and collating it.
Under the Hood
This project is written in Go and uses ZeroMQ and Protocol Buffers internally for all communication. This helps ensure reliability and will eventually ensure speed and flexibility.
The Go language is a really good fit for this kind of systems project. Low footprint, compiled distribution, fast execution, and the built-in concurrency is fantastic. If you haven't used Go, I recommend you give it a shot.
In terms of organization, the doozer PAXOS service sits in the middle. You can configure doozer as an HA system with failover. The Singularity agents then connect to your doozer cloud and use that to coordinate what they're doing -- i.e., to make sure only one of the agents is running the global scheduler.
Everything is designed with distribution in mind. There are global lock clearers that make sure that if a machine crashes, locks are released. Or if a machine is taken offline, it gets removed from the cloud of machines in Singularity.
Singularity -- Soon
Once I started hacking on this project, I realized that there are so many things we do in operations that we could just replace with something like Singularity and make our lives so much easier. For example, cron -- it's an archaic system that we all love to hate, but it could be so much better. A requirement like "I want this job to run, but it could run on any app node" seems a better fit for an integrated inventory/cron system than for just a better cron.
Soon, you will be able to give Singularity configurations to run, and it will manage them for you. I.e., you could do something like this:
log_rotate:
    role: app
    command: /usr/sbin/logrotate
    daily: 2am
That example is easily understood, but you can already do that with cron. More interesting is if you add in some of the other features and things that Singularity can do:
profiler:
    local_lock: profiler
    command: /usr/sbin/profiler
    every: 1m
    constraint:
        - load_avg.1m < 3
        - cpu.idle > 20%
This example configuration specifies a profiler that runs every minute, but only ever one at a time -- if a run takes more than a minute, the lock constraint fails and you don't end up stacking up profilers. Additionally, it specifies that the job should only run on machines with a 1-minute load average under 3 and more than 20% idle CPU.
That would be a little more difficult to do in standard cron.
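Since this configuration format doesn't exist yet, the following is only a guess at how such constraints might be checked: each one is parsed as "name, comparison, number" and compared against facts measured on the machine. The grammar, the function names, and the convention that percentages are given as plain numbers are all assumptions.

```python
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}
# name, comparison operator, numeric threshold with an optional trailing %
PATTERN = re.compile(r"^\s*([\w.]+)\s*(<=|>=|<|>)\s*([\d.]+)%?\s*$")


def constraint_holds(expr, facts):
    """Evaluate one 'name op number' constraint against measured facts."""
    match = PATTERN.match(expr)
    if match is None:
        raise ValueError(f"cannot parse constraint: {expr!r}")
    name, op, threshold = match.groups()
    return OPS[op](facts[name], float(threshold))


def may_run(constraints, facts):
    """A job may run only when every constraint holds."""
    return all(constraint_holds(c, facts) for c in constraints)


if __name__ == "__main__":
    constraints = ["load_avg.1m < 3", "cpu.idle > 20%"]
    print(may_run(constraints, {"load_avg.1m": 1.2, "cpu.idle": 55.0}))
```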
I have some more ideas for this system. Events, chaining inputs and outputs, integration with OpenTSDB for monitoring, PagerDuty for alerting, etc. The future is exciting.
Source and Development
The code is available on GitHub:
https://github.com/xb95/singularity
There is no documentation and a lot of gotchas. I am writing this post to help sort out my thoughts, and to get something online. You are welcome to play with it if you want, and feedback is always welcome.
posted on 2012-11-13 at 17:21
Amazon Glacier
Today, Amazon announced a new service being provided under the AWS umbrella: Amazon Glacier. In summary, this is a service designed to replace off-site archival storage, commonly used for backups and long-term storage of infrequently accessed data.
The Use Case
Glacier is not for your standard backups. This is designed for storing the long-term versions of backups that you only ever fall back to in case of major catastrophe. As an example, I'm considering storing my MySQL archives in Glacier. This wouldn't be my only backup; I maintain last week's backup locally in my data center.
In case of machine failure or operator error, I can restore from that backup plus binlogs to get back up to right before the failure. Glacier is not involved. Where this service comes in is if, somehow, my database dies, my backup is deleted, and the mirrored database (slave or standby-master) is also wiped.
If three full copies of my data go away, that's a catastrophe and I will have to restore from Glacier.
To date, most of us have bitten the bullet and used Amazon's S3 for this, even though the cost for this service is quite exorbitant. At my day job, we also use Tarsnap -- an encrypted data storage service that is backed by S3. While a fantastic service, the cost of storing many terabytes of data really starts to add up.
S3 also provides a lot of functionality that isn't needed for doing off-site archival backups. While the CDN-like nature of S3 is great, I really don't care if my backups are easily downloadable by HTTP. I'd actually rather they weren't -- which, thankfully, S3 lets you do. You still end up feeling slightly like you're misusing this service and, in effect, paying for functionality you just don't need.
Cost Effectiveness
Whenever Amazon announces a product, my first step is to understand the product and see if it's useful to me. This one is. Next, the big question: is it cost effective for my purposes? Let's try to figure that out.
For this back-of-the-envelope comparison, let's imagine we have 10TB of data we want to archive and store. This includes some number of copies of the database, a bunch of files we want to keep "just in case", etc. We expect that this data will only ever need to be used as a last line of defense.
In Amazon Glacier, storing 10TB costs $102.40 per month (10,240 GB at $0.01/GB).
(Compare this to Amazon S3, which would cost about USD $1126 per month. Glacier is roughly 9% of the cost of S3.)
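The arithmetic is easy to reproduce. The sketch below uses only the figures quoted above (10TB, $0.01/GB per month, about $1126 per month for S3); the function and constant names are made up for illustration.

```python
GB_PER_TB = 1024
GLACIER_USD_PER_GB_MONTH = 0.01  # Glacier price quoted above
S3_MONTHLY_USD_FOR_10TB = 1126   # S3 figure quoted above


def glacier_monthly_cost(terabytes):
    """Monthly Glacier storage cost in USD for the given number of TB."""
    return round(terabytes * GB_PER_TB * GLACIER_USD_PER_GB_MONTH, 2)


def percent_of_s3(terabytes=10):
    """Glacier's monthly cost as a percentage of the quoted S3 cost."""
    return round(100 * glacier_monthly_cost(terabytes) / S3_MONTHLY_USD_FOR_10TB, 1)


if __name__ == "__main__":
    print(glacier_monthly_cost(10))  # 102.4
    print(percent_of_s3())           # 9.1
```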
But what about comparing this to hosting it yourself? Let's assume that you are going to build out your own hardware and store it in a data center. There are a number of ways you can go to accomplish this, and I'm going to be generous with the discounts and pricing.
The best price I can find on a tape storage system puts the box at slightly over USD $3000 for a machine capable of storing 18TB uncompressed. (I'm assuming the 10TB above is compressed already and you won't get much out of storage-level compression.)
Assuming even a 50% discount (which you almost certainly wouldn't get), that's still USD $1500 for raw storage for 18TB of data. This is just the machine, though. Now you need to put it somewhere. If you store it in your office, the cost might be negligible -- but now you don't really have secure backups. All of your company's data is now beholden to the security of your physical location -- which may or may not be good.
If you want to colocate your backup server, you've got an awkward situation -- unless you already have multiple data centers, storing it in your existing location means it isn't off-site. If you have to rent a spot somewhere else, it will be at least $100/month to get this machine online and powered. By the time you've bought the hardware and put it somewhere, you've well exceeded the cost of Amazon Glacier for this dataset.
Let's not even talk about the cost for tapes, the labor required when you have hardware failures, and other such issues. For once, I can say that Amazon's pricing on a service is well below what you could achieve yourself for this use.
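To put rough numbers on that, here is a sketch comparing cumulative cost over time. It assumes the generous figures from above ($1500 for discounted hardware, $100/month for colocation, 10TB in Glacier at $0.01/GB per month) and, like the text, leaves out tapes and labor entirely; the function names are illustrative.

```python
GB_PER_TB = 1024
GLACIER_USD_PER_GB_MONTH = 0.01


def self_hosted_total_usd(months, hardware_usd=1500, colo_usd_per_month=100):
    """Hardware up front plus colocation; tapes and labor are NOT included."""
    return hardware_usd + colo_usd_per_month * months


def glacier_total_usd(months, terabytes=10):
    """Cumulative Glacier storage cost for the same period."""
    return round(months * terabytes * GB_PER_TB * GLACIER_USD_PER_GB_MONTH, 2)


if __name__ == "__main__":
    for months in (12, 36):
        print(months, self_hosted_total_usd(months), glacier_total_usd(months))
```

Because the monthly colocation fee alone is about the same as the Glacier bill, the up-front hardware cost is essentially never recovered under these assumptions.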
The other option is rotational media -- but the cost for that is more than tape. Disks also tend to have a higher failure rate than tape in my experience, driving up your costs in labor and spares-on-hand.
Even if you somehow managed to get the cost of one system down low enough to be competitive, now you've built a system that is perhaps slightly cheaper, but not redundant. Glacier is replicated across multiple data centers and to multiple locations in each facility. I can't imagine any way in which an end-user can beat that. Amazon has a huge economy of scale.
Amazon Glacier is easily cheaper than hosting your own backups, not to mention more convenient.
Who It's Not For
So, then, why wouldn't everybody use this service?
For one, you may already have facilities in several locations and spare power. In that case, adding a server won't change your opex appreciably, and sinking a little into capex is often a better plan for most businesses than increased opex. This also assumes that you have people going to those facilities already, so the added cost of having someone swap in a tape is pretty minimal.
Also, at scale -- Glacier is a linear service. 100TB costs ten times as much as 10TB, but that's not the case if you're doing it yourself. At some point when you can start buying petabyte-level storage, you almost certainly already have the infrastructure such that you won't save much money by using Amazon.
Finally, security. Whatever data you submit to Amazon is encrypted in-transit, but they don't encrypt your data on their end. You lose a little control of the security of your data. You could encrypt it locally before sending, but that requires some effort on your end. It's not terribly hard, but it does require some consideration.
In my experience, this rules out large companies that wouldn't consider using Amazon's services anyway. For the rest of us who work in startups or small to medium businesses, though, Glacier looks great.
Some Caveats on Pricing and Usage
One thing that is important to mention: this is the equivalent of Iron Mountain or similar long-term archival storage services. It's like a glacier -- large, slow-moving, and very, very frozen.
Technically, this means that you need to be storing things you don't intend to retrieve very frequently. In fact, if you over-use retrieval, it will cost you to get your data back out (see the Amazon Glacier Pricing page).
Importantly:
- You can retrieve up to 5% of your stored-data monthly, for free. More than 5% requires you to pay, and this starts at USD $0.01/GB. (This 5% is slightly misleading, too, as you are actually limited to retrieving 5% per-month, but no more than 1/30th of that per day. In other words, if you have 10TB stored as in our example, you can only retrieve about 17GB/day before you start paying.)
- If you delete something that has been stored for less than 90 days, you pay a USD $0.03/GB fee.
This last point is important, as it means that you are promising Amazon that you will be storing data for at least three months. If you don't, you will be paying for three months of storage anyway.
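Both caveats reduce to simple arithmetic. This sketch uses only the numbers given above (5% per month spread over 30 days, and a $0.03/GB early-deletion fee); the function names and the 30-day month are assumptions for illustration.

```python
GB_PER_TB = 1024


def free_retrieval_gb_per_day(stored_tb, monthly_fraction=0.05, days=30):
    """Daily free retrieval allowance in GB for a given amount stored."""
    return stored_tb * GB_PER_TB * monthly_fraction / days


def early_deletion_fee_usd(deleted_gb, fee_per_gb=0.03):
    """Fee for deleting data that has been stored for less than 90 days."""
    return round(deleted_gb * fee_per_gb, 2)


if __name__ == "__main__":
    # 10TB stored: 512GB free per month, spread across 30 days.
    print(round(free_retrieval_gb_per_day(10), 1))  # 17.1
    print(early_deletion_fee_usd(100))              # 3.0
```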
That said, I will be moving my archive backups to Glacier. It looks good and the caveats are well within reason. Kudos to Amazon for providing a useful service that really fills a need.
posted on 2012-08-21 at 10:50