Project release: resque-state gem

I’m posting this late as the code has been available for a bit now but I’ve published my first Ruby Gem (fork) on Github; resque-state. It adds more features to the original gem (resque-status) including more interactive-like controls to allow you to run semi-interactive jobs via Resque. The biggest addition was adding pause and revert functionality.

This project came from something I built (and hope to eventually publish) that runs automated rolling deployments to AWS. What the pause functionality gave me was the ability to let a user do a one-box or canary ahead of a full roll as well as the ability to pause a job that might be having troubles. This lets an engineer launch a deployment to an Auto Scaling Group (ASG) and initially add just a single machine. Once that instance is healthy the job then pauses and waits for the engineer to give the deployment the green light to continue. The pause/unpause functionality became one of the critical features to enable safer production releases.

Just added was a revert feature. You could accomplish something similar with on_failure but I thought that might be overloading that functionality a bit. I believe these are two different cases. If a job fails you may not want to undo it because the failure may have been fatal for the job process but not something that actually needs reverted. Maybe there was a network blip the automation didn’t handle well or perhaps you’re able to course-correct without actually pulling back what was done. Revert gives you a separate path for cases where you specifically want to pull back what was done. This can be done from the paused state (for example; a deployment one-box that is no good) as well as just while the job is running.

PRs and constructive feedback are welcome. 🙂

Automation is wonderful. People… not so much.

I’ve been working on automating a previously manual process riddled with potential human error and recently I’ve found myself referencing this article from Doug Seven as the poster child for how things can go wrong quickly when the process isn’t very good or your understanding of the system isn’t complete.

Thorough change management practices and peer-reviewed automation can be something that not only saves your time but sometimes your job too. If you can automate it and remove potential human error that’s nearly always the right path. Software can still fail but you can test software and do peer reviews. It’s a little harder to peer review every decision and click someone has to deal with when deploying something into production.

If you haven’t read this before… this is a level of failure you don’t get to see too often.

Knightmare: A DevOps Cautionary Tale