Engineering
May 9, 2022

JavaScript Framework Updates Suck, How to Make Them Suck(less)


Okay, updating a JavaScript framework you work with doesn't always suck. Even major version releases with breaking changes can be managed effectively and result in a smooth transition. Typically, the benefits are proportional to the effort, and teams will roll up their sleeves and get on with it.

The cost-benefit analysis gets trickier when the upgrade is really a rewrite of the framework. Developers might remember moving to Angular 2, a re-architected, ground-up rewrite of the popular AngularJS framework.

Lob found itself in a similar situation with hapi, an open-source Node.js framework used to build powerful and scalable web applications. We were running v16 when v17 was announced. The release notes describe v17 as a new framework because it fundamentally changes how business logic interfaces with the framework. The main change, and the motivation for the release, was replacing callbacks with a fully async/await interface. Few would argue against the advantages of this shift, but the result was dozens upon dozens of breaking changes. At Lob it meant hundreds, because our list of dependencies was long. The upgrade sat on the back burner, and as 17 turned to 18, then 20, we realized it was time to git-er-done.
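To give a sense of the shift, here is a minimal before/after sketch of a v16 callback-style handler versus its v17 async/await equivalent (the route and the fetchPostcard helper are hypothetical, and both snippets assume a configured server instance):

```javascript
// hapi v16: the handler receives a reply() callback
server.route({
  method: 'GET',
  path: '/postcards/{id}',
  handler: function (request, reply) {
    fetchPostcard(request.params.id, (err, postcard) => {
      if (err) {
        return reply(err);       // errors go through reply()
      }
      return reply(postcard);    // and so does the successful payload
    });
  }
});

// hapi v17+: the handler is async and simply returns the response
server.route({
  method: 'GET',
  path: '/postcards/{id}',
  handler: async (request, h) => {
    // a thrown error (or rejected promise) becomes the error response
    return fetchPostcard(request.params.id);
  }
});
```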

Let’s look at ways to minimize the “suck” when tackling a long overdue upgrade.

To skip versions, or not to skip

Delaying a framework upgrade can mean falling several versions behind. You might be tempted to jump straight to the latest version, but consider how that might play out. Most of the community migrated from the version you are on to the very next one, so upgrade material will likely cover moving from version C to D, not from C to G. Every developer's best friend, Stack Overflow, probably has questions (and answers) about issues arising from a C-to-D migration. Tread carefully here.

At Lob, we set out to upgrade hapi from v16 to v17 and found the task was enormous: it spanned 13 repos, several third-party libraries, and over 100 plugins. A team of four engineers worked on the project, with other departments contributing. For a sense of scale, a typical upgrade, like the subsequent hapi v17 to v18, required just one engineer. Be sure to resource your team appropriately.

Almost every request handler in our environment was going to break. The changes were mostly syntactic, but once they were made, all of the tests had to be updated accordingly; we had several hundred.
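The same shift shows up in the tests: in v16, server.inject took a callback, while in v17 it returns a promise. A minimal sketch, assuming a lab/code-style test where expect is in scope:

```javascript
// hapi v16: inject results arrive in a callback
server.inject({ method: 'GET', url: '/postcards/123' }, (res) => {
  expect(res.statusCode).to.equal(200);
  done();
});

// hapi v17+: inject returns a promise, so the test itself becomes async
const res = await server.inject({ method: 'GET', url: '/postcards/123' });
expect(res.statusCode).to.equal(200);
```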

All plugins from hapi’s ecosystem also required an upgrade to work with v17. We had a number of custom plugins we’d written that needed our attention, along with third-party plugins we either had to upgrade or replace.

Our update process was as follows:

  • Decide whether to upgrade or replace each third-party plugin
  • Update our internal plugins
  • Update all the route handlers and tests

We did this one endpoint at a time (postcards, then letters, and so on).

Here's an example of updating an internal plugin from v16 to v17+ (see the sketch after this list). We broke each update into multiple commits:

  • One for updating the code
  • One for the admittedly more difficult task of updating the build tooling
  • One to enable GitHub Actions to test PRs
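For the first of those commits, the code change itself, the heart of the work was reshaping the plugin's registration. A minimal sketch using a hypothetical internal plugin (the name and route are made up):

```javascript
// hapi v16: a register function that signals completion via next()
exports.register = function (server, options, next) {
  server.route({
    method: 'GET',
    path: '/healthz',
    handler: (request, reply) => reply('ok')
  });
  return next();
};

exports.register.attributes = {
  name: 'lob-internal-plugin',
  version: '1.0.0'
};

// hapi v17+: a plain object with name, version, and an async register
module.exports = {
  name: 'lob-internal-plugin',
  version: '1.0.0',
  register: async (server, options) => {
    server.route({
      method: 'GET',
      path: '/healthz',
      handler: () => 'ok'
    });
  }
};
```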

Shoulda Woulda Coulda

In retrospect, Software Engineering Manager Sowmitra Nalla said that if he had to do it all over again, he would have written a find-and-replace script; with that approach we could have upgraded a repo in about two days. At the time, though, the thinking was that with a number of engineers on the upgrade we could churn through it faster than we could build a tool. The goal was also to improve Lob's API performance, not to upgrade the entire engineering organization's stack.
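We never built that script, but a sketch of what it might look like is below. The regex patterns, the ./lib directory, and the assumption that handlers follow a consistent style are all hypothetical, and anything the patterns miss still needs a human pass:

```javascript
// Hypothetical codemod: automate the most mechanical v16 -> v17 edits.
const fs = require('fs');
const path = require('path');

const rewrite = (source) =>
  source
    // handler signature: function (request, reply) -> async (request, h)
    .replace(/function\s*\(request,\s*reply\)/g, 'async (request, h)')
    // simple success replies: return reply(x); -> return x;
    .replace(/return\s+reply\(([^)]*)\);/g, 'return $1;');

const walk = (dir) => {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory() && entry.name !== 'node_modules') {
      walk(full);
    } else if (entry.isFile() && full.endsWith('.js')) {
      fs.writeFileSync(full, rewrite(fs.readFileSync(full, 'utf8')));
    }
  }
};

walk('./lib');
```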

Deployment Strategy

Rather than pause all deployments to our API for several weeks while we upgraded, we decided to spin up a hapi v17 stack side-by-side with the existing v16 one, an approach we dubbed “double-rainbow” and jokingly represented in Slack with the puking-rainbows emoji.

“We did a type of canary deployment but with 'feature flags' at the route level. Normal feature flags are at the app level; our toggles were at the load balancer level. Depending on which REST paths we wanted to route, we would drive traffic appropriately,” said Nalla.
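Our toggles lived in the load balancer, but the logic they encoded boils down to something like this sketch (the paths, weights, and cluster names are illustrative, not our actual configuration):

```javascript
// Illustrative route-level traffic split: a per-path weight decides how much
// traffic reaches the new hapi v17 cluster.
const v17Weights = {
  '/v1/postcards': 0.05, // 5% of postcard traffic to v17
  '/v1/letters': 0.0     // letters still fully on v16
};

const pickBackend = (requestPath) => {
  const prefix = Object.keys(v17Weights).find((p) => requestPath.startsWith(p));
  const weight = prefix ? v17Weights[prefix] : 0;
  return Math.random() < weight ? 'hapi-v17-cluster' : 'hapi-v16-cluster';
};

console.log(pickBackend('/v1/postcards/abc123')); // usually 'hapi-v16-cluster'
```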

We started with 5% of traffic going to the new stack and used a dashboard to compare errors, CPU, and other metrics. As soon as we saw an error, we would divert traffic back to the current stack and investigate the problem. Because we were diverting only a small percentage of traffic to mitigate risk, we saw a very small number of errors, and we assumed a few errors here and there were to be expected. We learned that was not quite right: instead of just looking at the number of errors, we needed to look at the percentage of errors. If the error rate rises in one cluster relative to the other, something else is going on. We did not forget that when we upgraded to hapi v18 and v20.
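The takeaway is easiest to see with made-up numbers: the canary can show fewer absolute errors and still be the unhealthier cluster.

```javascript
// Comparing error *rates* per cluster rather than raw counts (numbers invented).
const clusters = {
  'hapi-v16': { requests: 95000, errors: 19 },
  'hapi-v17': { requests: 5000, errors: 4 }
};

for (const [name, { requests, errors }] of Object.entries(clusters)) {
  const rate = ((errors / requests) * 100).toFixed(2);
  console.log(`${name}: ${errors} errors (${rate}% of requests)`);
}
// hapi-v16: 19 errors (0.02% of requests)
// hapi-v17: 4 errors (0.08% of requests)  <- fewer errors, but 4x the error rate
```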

We had a major incident early on that resulted in all traffic being diverted back to v16. As it turned out, one of the internal libraries being upgraded had two versions, and changes we had made on an earlier version were never merged back in. Working from the main branch, which was running the “latest” version of that library, is what led to the incident.

Even in the best-executed project, unforeseen errors can happen. Fortunately, the rollout strategy limited the interruption while we debugged, and we then resumed traffic to v17. We did end up combing through all the other plugins to ensure this was a one-off mistake, an arduous but necessary task.

What results did we achieve?

We saw an incredible 100% improvement in API throughput (requests per second). At first we saw some scary dips in our graphs, but we realized they were a side effect of testing how many connections each container holds to the database. The results of those tests also showed that better connection handling on the database side would increase throughput further.

Conclusion

While admittedly pretty painful, the upgrade was absolutely worth it. The positive impact on the performance of Lob's API is the most obvious benefit, but on the whole it also made our teams more efficient moving forward.

hapi v18 included minor improvements for performance and standards compliance, followed by v20, another small release. These less significant changes certainly meant quicker subsequent upgrades for us, but we also applied the processes we had put in place, along with the lessons learned from the initial upgrade.

The project was a powerful reminder to take the time upfront for better estimation. (Check out Why Developers Suck at Software Estimation and How to Fix It.) How many resources do you really need? Are there patterns of duplicative work, and if so, would automation or a tool help? We followed a uniform process for updating each plugin; that consistency made the work as efficient as possible under the circumstances. Our “double-rainbow” deployment allowed for a smoother cutover and the opportunity to debug without customer impact (and we learned to prioritize the percentage of errors over the raw number).

We will definitely employ these methods to make similar upgrades less sucky—and hope you can too.

Major thanks to Software Engineering Manager Sowmitra Nalla for his contribution to this story.
