DevOps 101: Best Practices for Optimizing and Automating Your Infrastructure

Whether you’re coming across the term for the first time, or you’ve been listing it on your LinkedIn profile for a couple of years now, it’s likely that you are still faced with the question: What exactly do we mean by DevOps?

Whatever your situation, the good news is that if you like automating things and want to see what’s going on with those tech hipsters and their fancy conference T-shirts, you’ve come to the right place to learn more!

In this post I’ll attempt to outline some of the common themes across the best environments practising DevOps, and communicate some of the top DevOps practices you can bring to your own dev environment.

DevOps Defined

As a term, ‘DevOps’ is only slightly less ambiguous than the word ‘Cloud’. When pressed to give a concise definition, the best I can come up with is that DevOps is the marrying of process, infrastructure, and product.

Actually that’s a terrible definition, but it sounds nice when you say it, so I’ll try to do better as we go along.

Put another way, DevOps is basically the cool-kids way of stringing stuff together with shell scripts. It exists because we can now wrap things in shell scripts that we used to only dream of; the world is now programmable at a much larger scale, and we have many new tools and techniques for taking advantage of it.

So, why is that important? With this newfound power of continuous automation and integration, the exciting promise of DevOps is that it can let a small team of developers multiply their effectiveness and compete with much larger teams who find themselves more encumbered with traditional processes.

And that, in and of itself, has extraordinary value.

1. The DevOps Philosophy: Use Things You Can Program, and Program the Things You Use

Once upon a time, expanding capacity meant buying new servers, racking these servers, potentially reconfiguring the networking to accommodate the new servers, and then, if you were ahead of the curve, installing an image onto these servers and making it fit into your existing fleet.

For a surprising number of people, these are still the realities of life. Needless to say, all of this meant that to be passably efficient, you had to do all of this well, without disrupting your existing systems.

Of course, the simplest system to manage and maintain is the one you don’t, and if you’re running in an agile environment, this is easily one of your biggest advantages.

With the emergence of the “as-a-Service” family (can we call them aaSes? are we there yet?), you can choose the tradeoff between the level of customization you need and the time and capacity you have for this kind of system management.

This of course brings us to one of my favourite things to have come out of the DevOps movement: We’re now at the point where we need to seriously talk about competitive advantage and opportunity cost.

The core reasoning comes down to this: You’re in a business that hopefully does something more efficiently than everyone else, or at least better than most. That is the core of your business model.

Mobify, for example, is amazing at adapting web experiences for mobile devices; that’s what we focus on. We have clients and customers because they’ve realized they’re better at their own business than they would be at ours.

Odds are you are not better at cleaning your office than you are at carrying out your core business model, so you hire cleaning services — economists will rightfully object that there are nuances to relative opportunity costs, but I’m simplifying here for illustration.

Likewise, you are probably less efficient at managing the complex and demanding tasks that go along with network and datacenter maintenance than a typical IaaS provider.

What this means for those of us who are selling something other than our incredible ability to rack and provision servers is that there is some logical tradeoff where it makes more sense for us to leverage the services of those who are best at it.

LAMP-style stacks are boring, but Heroku solved that problem for all of us. Email has been a nightmare to configure for as long as I can remember, but Google was kind enough to free us from that drudgery. Moving servers around in data-centres at 2AM is some kind of Sisyphean punishment for abhorrent behaviour and offence given to the gods of Olympus, and I for one am happy to pass that off to the Rackspaces and Amazons.

What’s more important is that all of these services come with APIs, and we can write those small shell scripts to bring up new capacity at the click of a button, whenever we want it.
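Those shell scripts often boil down to very little logic. Here’s a minimal sketch of the decision at the heart of a “bring up capacity at the click of a button” script — the instance names are made up, and the actual launch and terminate calls would go to your provider’s SDK, which I’ve deliberately left out:

```python
# Minimal sketch: given the fleet you have and the size you want,
# decide what to terminate and what to launch. The real API calls
# (boto3, your provider's SDK, etc.) would consume this plan.

def scaling_actions(current_ids, desired_count):
    """Return (ids_to_terminate, number_to_launch) for a fleet."""
    if len(current_ids) > desired_count:
        return current_ids[desired_count:], 0
    return [], desired_count - len(current_ids)

# Shrinking a three-node web tier down to two:
to_kill, to_launch = scaling_actions(["web-1", "web-2", "web-3"], 2)
```

The point isn’t the code, it’s that this decision used to be a purchase order and a trip to the data centre, and is now a function call.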

The advantages here go well beyond rapid scaling, though that’s nice too. We’re now at a point where we deal with problems by burning things to the ground and rebuilding. It doesn’t matter if there’s a weird behaviour on server X-761331-b7, just destroy it and rebuild with the configuration manager of your choice.

“Ah”, you say, “you could certainly move everything over, but it would cost you a fortune!”. For some of you that’s certainly true, but for most I would argue that the time you’re saving more than makes up for the cost.

If your business model is truly competitive, then hopefully you have a way to convert your time to money; even if you’re not on the core product or service team, I’d wager that your time could likely better be spent automating and tooling something new for them.

Netflix is a great example of a team of mad DevOps gurus leveraging Amazon to the hilt, and then going one step further and sharing a bunch of their tools on GitHub.

Or maybe you’re thinking: “Ah, but my servers are special. I do [insert whatever voodoo you’ve got going], and I couldn’t possibly do that on a rented VM”. I understand. Your boxes are special, and I don’t understand your problem domain.

That being said, the menagerie of infrastructure available is growing in diversity all the time. I would really encourage you to take another look.

Best practice takeaway #1: Rent wherever you can and get on with your business. They do it better, and ain’t nobody got time for that.

2. Backups, Version Control, and Clusters, oh my…

Alright, so you’re hopefully convinced that having an infrastructure that you can program is superior to whatever else you’re doing. Now we can start talking about all of the wonderful things you get for free because of it.

Let’s start with backups. If you’re not making backups already, you’re either a purely functional programmer and knight of the lambda calculus, or you know your shame and are only reading this to avoid confronting the horror of the imminent catastrophe in your future.

Servers can now be expressed as configuration files, which means they can go into version control, which is a value proposition I hope you can appreciate.

This means new servers can be programmed to come online with preconfigured state, as defined from a central, version-controlled authority; they can download whatever state you have in the rest of the cluster and hit the ground running. Beyond that, we’re now in a world where the cost of having standby capacity is a non-issue.
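To make the “servers as configuration files” idea concrete, here’s a toy sketch: a declarative description of a server that could live in version control, and a function that turns it into the commands a fresh box would run. The package names and commands are invented for illustration, not a real recipe:

```python
# Sketch of "server as data": the spec is what you commit and review;
# the commands are derived from it, never hand-typed on a live box.

WEB_SERVER = {
    "packages": ["nginx", "certbot"],
    "services": ["nginx"],
}

def provision_commands(spec):
    """Expand a declarative server spec into the shell commands to run."""
    cmds = [f"apt-get install -y {pkg}" for pkg in spec["packages"]]
    cmds += [f"systemctl enable --now {svc}" for svc in spec["services"]]
    return cmds
```

Because the spec is plain data, a diff on it tells you exactly how two servers differ — which is the version-control value proposition in miniature.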

You want to up the number of servers behind your load-balancer? Do it. There’s no requisitioning process. If you’re on Amazon, let them do it for you. For once, you can sleep through the emergency.

What’s better is that now you can architect greedily.

By way of example, when you were young and naive, you probably had a hard drive somewhere and experienced the joy of losing all your data when the thing failed. Then you learned, and made a backup, swearing that would never happen again. If you were a keener, maybe you had a job that would run those backups for you automatically. Then you learned about RAID. RAID was cool, wasn’t it? Until they were all stolen.

Eventually, a stable solution would have to look like a storage cluster with geographically separated nodes, much like what many corporate environments run on. A node failure can now be routine; in fact, on many systems, you can have the servers tell the vendor that a component failed and they’ll send a technician to replace it without you having to do a thing. Neato.

Guess what? You can do the same thing with your services.

One of the most beautiful network diagrams I’ve ever seen was the one put out by the Obama re-election campaign. That was a paragon of DevOps if I’ve ever seen one. A key takeaway was the clustered nature of the system, and this is a pattern seen in other pioneers of modern architecture (again, think Netflix).

Even in smaller environments where you only have a handful of servers, looking at your architecture from a cluster perspective will give you plenty more flexibility than you would otherwise have, and will seriously redefine what your backup policies will have to look like.

If any of this is new to you, you’re probably eyeing the level of complication all of these changes introduce, perhaps even thinking it might outweigh the benefits. True, you’d have to learn something new, but like with most things, the benefits outweigh the pain (the obvious exception being Esperanto; no one ever benefitted from knowing that).

Best practice takeaway #2: Treat your server configuration like developers treat code. Clusters are the new black. Rapid scaling isn’t just about service spikes.

3. Control Your Environment

Speaking of configuration management, let’s have a quick word about that.

This is a vein of DevOps that can be traced back to shortly after the beginning of the Unix Epoch, when we had shell scripts to take a fresh install and move it into something usable. People have been doing this for ages, often with tools they wrote themselves or that were maintained within the company or university.

Things have advanced. Though there are a handful of options to choose from, tools like Puppet and Chef are rapidly becoming integrated into platform and infrastructure vendors, and it would certainly be to your advantage to look into them.

There’s a second strand, which we’ll call the school of the Golden Image, which favours clones of a single and well maintained system image. This branch also has a long history, but with the advent of virtual machines has pretty much exploded, making Star Wars references obligatory in nearly any DevOps talk.

Generally, people look down on Golden Images, as it usually results in a terrible mutant horde intent on unleashing the demons of entropy on their creator, or to be more precise, the state of the machine tends to change over time and become an unmanageable mess.

That being said, I’ve seen it pulled off successfully, and if that’s what you have to do to enjoy rapid scaling, it’s better than abusing your poor systems team with caffeine and sleep deprivation. Simply scripting the components that need to be customized per-environment is often enough.

What’s important is that you have some automated way of rapidly bringing up more of the systems you need, and of making sure that they’re configured appropriately. Plenty of teams have had to create strange and unusual processes to manage this in the past, but now is a great time to take a second look at some of those old systems, and then do the right thing and kill them with fire.
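The property that makes tools like Puppet and Chef safe to run over and over is idempotency: you describe the state you want, and the tool only acts when reality differs. Here’s a toy version of that “ensure” pattern, operating on config-file lines — the sshd options are just familiar examples, not a recommendation:

```python
# Sketch of the idempotent "ensure" pattern configuration management
# tools are built on: repeated runs converge to the same state.

def ensure_line(lines, wanted):
    """Return (new_lines, changed); only changes when the line is missing."""
    if wanted in lines:
        return lines, False
    return lines + [wanted], True

conf, changed = ensure_line(["PermitRootLogin no"], "PasswordAuthentication no")
# Running it again reports no change -- that's the idempotency guarantee,
# and it's what lets you rerun your whole configuration on every boot.
conf, changed_again = ensure_line(conf, "PasswordAuthentication no")
```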

Best practice takeaway #3: Configuration management is cool, you should use it, but if you’re stuck with a Golden Image model, we can still be friends.

4. Oh Right, the Developers

So yes, systems are fun, but that’s only half the story, and this would really just be an operations story if it weren’t for the tools you need to start thinking about for your developers.

Going along with our theme of replacing process and procedure with shell scripts, let’s take a quick look at automated testing, continuous integration, and continuous deployment. These are among the practices at the heart of modern agile practice, and are really the reason we have DevOps to begin with.

Testing is something we all do, right? I’ll just pretend you said yes, because now you can do things like have unit tests run as part of your version control system (pro tip: GitHub makes this super easy).
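If your codebase has nothing like this yet, the bar to clear is genuinely low. A test a CI service can run on every push is just a function with assertions in it — here with a made-up `slugify` helper standing in for your actual code:

```python
# A trivially small test of the kind a CI service runs on every push.
# "slugify" is an invented example function, not from any real project.

def slugify(title):
    return "-".join(title.lower().split())

def test_slugify():
    assert slugify("DevOps 101") == "devops-101"
    assert slugify("  Control   Your Environment ") == "control-your-environment"
```

Point a test runner at a file of functions like this, wire the runner into your hosted repository, and failing commits stop reaching production by themselves.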

Automated test runs are a huge boon to the workflow, and can be used with a continuous integration suite to push out changes to your staging and production environments. Aside from catching fat-fingered errors in your code commits, your developers can take a huge load off of your QA department, and fold a lot of those responsibilities back into their process.

A test suite or series of test suites, combined with tools like Fabric, lets you start to automate and wrap a lot of your process into commands that can do most of the heavy lifting for you, like some kind of steroid-infused version of bash aliases.
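In that spirit, here’s a sketch of wrapping a multi-step process into a single `deploy()` command. Real Fabric would run these steps over SSH on your servers; to keep this self-contained I just shell out locally, and the steps are harmless `echo` stand-ins for whatever your actual process is:

```python
import subprocess

def run(cmd):
    """Run one step, fail loudly if it breaks, and capture its output."""
    result = subprocess.run(cmd, shell=True, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()

def deploy():
    # Placeholder steps; in practice: run tests, build, restart services.
    steps = ["echo running tests", "echo building release", "echo restarting app"]
    return [run(step) for step in steps]
```

Because `check=True` raises on the first failing step, a botched deploy stops itself instead of limping on — which is exactly the kind of judgment you want taken out of human hands at 2AM.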

Having operations-interfacing tools as part of the development process also lets you tie any other sort of big-picture automation into the build. If you have a web component, maybe you want to run a cache purge. Or maybe your build should require a database backup before running, or send some kind of notification. For the sheer novelty of it, maybe you’ve connected an Arduino up to put on some kind of Rube Goldberg display when certain parameters are met.

Let your inner-geek shine. The point is that you code your process and implement it, whatever that looks like.

Best practice takeaway #4: You’ve heard this before — automated testing, continuous integration, and continuous deployment are at the heart of DevOps, start here.

5. Huff and Puff, and Blow the House Down

Of course, what would be the point of all of these lovely tools, redundancies, and procedures if you didn’t have confidence in them?

It’s important that everyone involved trust the tools to do exactly what they’re supposed to do, so that when push comes to shove, you’re not afraid to pull the switch. This is where fire drills come in.

You wouldn’t ship your code without testing, (ahem), and you shouldn’t build out your systems without first making sure they’ll perform under their edge cases as planned.

This one is of course a harder sell to whoever it is that conducts your performance reviews; wanting to pull the rug out from under your main systems, cry havoc, and let loose the chaos monkeys of war sounds like something for which you press charges rather than applaud.

However, these are systems designed to fail and recover, and if they can’t do that then trying to pretend that it’s not an issue is only compounding the problem.

This is not the Tao of DevOps. Be brave. Pull the plug. sudo rm -rf /*. See what happens. Blame the intern. Then, like a majestic phoenix, watch your system rise from the ashes and restore itself to all of its former glory. Then smile.

Simulating failure, or causing it, is the best way to identify brittle components and to purge them.
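The mechanics of a fire drill can be as simple as this toy: a service that dies some fraction of the time, and a supervisor that restarts it. Both are invented for illustration — the drill is checking that your real equivalent of `supervise` actually brings things back:

```python
import random

def flaky_service(fail_rate, rng):
    """A stand-in service that dies with the given probability."""
    if rng.random() < fail_rate:
        raise RuntimeError("service died")
    return "ok"

def supervise(fail_rate, attempts, seed=0):
    """Restart the service until it comes up or we run out of patience."""
    rng = random.Random(seed)  # seeded so drills are reproducible
    for _ in range(attempts):
        try:
            return flaky_service(fail_rate, rng)
        except RuntimeError:
            continue  # the "restart": try to bring it up again
    return "gave up"
```

Crank `fail_rate` to 1.0 and you find out what your system does when recovery itself fails — which is usually the more interesting discovery.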

Do the same with security. Plenty of us have had the joy of working with pen-testers — those beautiful deviants who get paid to do unspeakable things to your precious servers. Even if you aren’t prepared to budget for the cost of one of these digital dominatrices, you have an idea of what would happen if they had their way with you, and you can simulate that easily enough. What’s nice about these tests is that you get the chance to channel all of that malevolent angst you’ve built up against your uncooperative systems and imagine what terrible evils you could inflict. It’s therapy and responsible administration all in one.

Best practice takeaway #5: Plan for failure, then fail, and get back up again. Repeat. Whip it good.

Kind Words of Parting

Again, I need to stress that despite O’Reilly’s remarkable turnaround time on publications, DevOps is not yet a practice with a dogma, but rather an emerging and exciting collection of practices being embraced by some of the smartest and most capable tech shops around.

The fundamental spirit is one of flexibility, agility, and automation.

There are definitely practical ways to bring these tools into your current environment that don’t require some kind of intervening cataclysm.

The old caveat of automating anything you’ve had to do more than twice can start you in the right direction; just keep in mind that there’s an end goal and nudge it all towards an integrated framework.

If there’s nothing in the shop approaching automated testing, consider starting at the top with integration tests and slowly pushing the code back down the ladder. Maybe buy the developers drinks to lull them into a false sense of confidence, then expense them as cost of infrastructure.