New Project: CoI

I’ve posted the code for a project that I’ve been thinking about for a while: CoI.

At this point the project is a draft, an early work in progress, but I wanted to actually get it started and write some code, something I haven’t done enough of lately.

The goal with CoI is to have a single place to record and track incident post-mortems. I’ve worked at quite a few places, and most had terrible post-mortem practices that left things unresolved, untracked, and unfixed, which has driven me crazy.

If you know that something can cause a production outage, because it already has, and you’ve identified the fix, should you really accept that fix being thrown into a team’s backlog and just… left there? It’s not a new and exciting feature. It’s not something that is going to move the needle for customer adoption. It’s probably just not all that interesting. That fix can sit ignored by the engineering team and project managers for months, and while it waits to be addressed your site is still vulnerable.

The intent with CoI is to surface those action items and establish clear ownership, both of the original incident and of the work identified to prevent it from happening again. While people have come up with ways to do this in other issue-tracking systems, I’ve seen those attempts fail.
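To make that concrete, here’s a rough Ruby sketch of the kind of record I have in mind. This is purely illustrative, not CoI’s actual schema; the point is just that the incident and every follow-up action item carry an explicit owner and status so nothing can quietly disappear into a backlog.

```ruby
# Illustrative only -- not CoI's real data model.
Incident   = Struct.new(:title, :occurred_at, :owner, :action_items, keyword_init: true)
ActionItem = Struct.new(:description, :owner, :status, keyword_init: true)

outage = Incident.new(
  title: "Checkout outage",
  occurred_at: Time.new(2019, 3, 4),
  owner: "payments-team",
  action_items: [
    ActionItem.new(
      description: "Add a circuit breaker around the payment gateway client",
      owner: "payments-team",
      status: :open
    )
  ]
)

# Anything still open stays surfaced until someone actually closes it out.
open_items = outage.action_items.reject { |item| item.status == :done }
puts open_items.map(&:description)
```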

In any case, the draft is up, and I plan to work on it occasionally to build it into something more ready to use.

-Nathan

Project release: resque-state gem

I’m posting this late, as the code has been available for a bit now, but I’ve published my first Ruby gem (a fork) on GitHub: resque-state. It adds features to the original gem (resque-status), including more interactive controls that let you run semi-interactive jobs via Resque. The biggest additions were pause and revert functionality.

This project came from something I built (and hope to eventually publish) that runs automated rolling deployments to AWS. The pause functionality gave me the ability to let a user do a one-box or canary release ahead of a full roll, as well as the ability to pause a job that might be having trouble. An engineer can launch a deployment to an Auto Scaling Group (ASG) that initially adds just a single machine. Once that instance is healthy, the job pauses and waits for the engineer to give the deployment the green light to continue. The pause/unpause functionality became one of the critical features for enabling safer production releases.
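As a rough illustration of how that flow fits into a Resque job: this is a sketch rather than an excerpt from my deployment tool, and the pause call and module name are assumptions about the resque-state API rather than its confirmed interface (the `at` and `completed` helpers come from the original resque-status plugin).

```ruby
require 'resque'
# Also assumes the resque-state fork is loaded (exact require path per its README).

class CanaryDeployJob
  # Module name assumed for the fork; the original gem uses Resque::Plugins::Status.
  include Resque::Plugins::State

  def perform
    asg   = options['asg_name']
    total = options['instance_count']

    at(1, total, "Launching one-box instance into #{asg}")
    launch_instances(asg, 1)
    wait_until_healthy(asg)

    # Stop here and wait for an engineer to unpause before the full roll.
    # `pause!` is a stand-in name for the gem's pause mechanism.
    pause!("One-box healthy in #{asg}; unpause to continue the rollout")

    at(2, total, "Rolling the remaining #{total - 1} instances")
    launch_instances(asg, total - 1)
    completed("Deployment to #{asg} finished")
  end

  private

  # Placeholders for the AWS-specific work the real tool does.
  def launch_instances(asg, count); end
  def wait_until_healthy(asg); end
end
```

Enqueuing would follow the usual resque-status pattern, something like `CanaryDeployJob.create('asg_name' => 'web-asg', 'instance_count' => 10)`, with the engineer unpausing from the Resque web UI or console once the canary looks good.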

The most recent addition is a revert feature. You could accomplish something similar with on_failure, but I thought that would be overloading that functionality a bit; I believe these are two different cases. If a job fails you may not want to undo its work, because the failure may have been fatal for the job process but not something that actually needs to be reverted. Maybe there was a network blip the automation didn’t handle well, or perhaps you’re able to course-correct without pulling back what was done. Revert gives you a separate path for cases where you specifically want to undo the work. It can be triggered from the paused state (for example, a one-box deployment that turns out to be bad) as well as while the job is running.
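To sketch the distinction (again with hypothetical names; check the gem’s README for the real hooks), the idea is that the undo logic lives in its own handler instead of being wedged into the failure path:

```ruby
require 'resque'
# Assumes the resque-state fork is loaded; `on_revert` below is a stand-in
# name for whatever revert hook the gem actually exposes.

class RollingDeployJob
  include Resque::Plugins::State # module name assumed for the fork

  def perform
    # ... launch the one-box, pause for sign-off, finish the roll ...
  end

  # Runs only when someone explicitly requests a revert, e.g. after deciding
  # a paused one-box is no good. This is where the undo work lives.
  def on_revert
    terminate_new_instances(options['asg_name'])
    restore_previous_launch_config(options['asg_name'])
  end

  # Standard Resque failure hook: a failure doesn't automatically undo the
  # work, since it might be a blip you can fix forward and resume from.
  def self.on_failure(exception, *args)
    # notify on-call, leave the deployed instances alone
  end

  private

  def terminate_new_instances(asg); end
  def restore_previous_launch_config(asg); end
end
```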

PRs and constructive feedback are welcome. 🙂

Automation is wonderful. People… not so much.

I’ve been working on automating a previously manual process riddled with potential for human error, and recently I’ve found myself referencing this article from Doug Seven as the poster child for how quickly things can go wrong when the process isn’t very good or your understanding of the system is incomplete.

Thorough change-management practices and peer-reviewed automation can save not only your time but sometimes your job too. If you can automate a process and remove potential human error, that’s nearly always the right path. Software can still fail, but software can be tested and peer reviewed. It’s much harder to peer review every decision and click someone has to make when deploying something into production.

If you haven’t read this before… this is a level of failure you don’t get to see too often.

Knightmare: A DevOps Cautionary Tale