…in which we live-migrate a customer to high-scale VPC with zero downtime
Continuing from where we left off in part 1...
Three (Not so) Secret Weapons
Most practitioners of cloud-scale software have had the experience of having a great load test, only to find the results do not hold in real-world production. This is a danger not to be taken lightly, and we did three things to make sure that didn’t happen to us on Migration Day.
- We took advantage of capture-replay. We had already built a toolset to capture a subset of actual production traffic (in this case, traffic for our migrating customer). Now we replayed it to the new VPC.
- We used around-the-clock soak testing. Instead of subjecting the new environment to peak traffic for several hours per day, we ran the load test around the clock – for four straight days. We discovered and fixed several additional issues, and validated that the system could recover (by taking key services entirely offline – Simian Army-style) under peak traffic.
- We leveraged Kubernetes, which takes infrastructure-as-code and platform-as-code to a whole new level.
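To make the capture-replay idea concrete, here is a minimal sketch, not our actual toolset. It assumes captured requests carry a time offset and a tenant identifier; `CapturedRequest`, `replay`, and the `send` callback are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CapturedRequest:
    # All fields are illustrative assumptions about what a capture might store.
    offset_s: float  # seconds since the start of the capture window
    tenant: str      # which customer the request belongs to
    method: str
    path: str


def replay(
    requests: Iterable[CapturedRequest],
    tenant: str,
    send: Callable[[CapturedRequest], None],
    speedup: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Replay one tenant's captured requests in time order.

    Gaps between requests are preserved, divided by `speedup`.
    Returns the number of requests replayed.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Keep only the migrating customer's traffic, ordered by capture time.
    selected = sorted(
        (r for r in requests if r.tenant == tenant),
        key=lambda r: r.offset_s,
    )
    previous = selected[0].offset_s if selected else 0.0
    for req in selected:
        gap = (req.offset_s - previous) / speedup
        if gap > 0:
            sleep(gap)
        send(req)  # in a real tool this would issue the request to the new VPC
        previous = req.offset_s
    return len(selected)
```

Passing `sleep` and `send` as parameters keeps the sketch testable without a network or real delays. A production tool would also need to handle concurrency, authentication, and request bodies.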
Replicating high-performance infrastructure between environments can be a challenge. Leveraging Kubernetes can help turn the variation into code and make the process automated and more repeatable. Even with Kubernetes, we encountered a few key issues during scalability testing that required us to investigate why one cluster was performing well while another was not. The biggest obstacle was an issue well known to the community: it is very important to provide node-local DNS caching to minimize the stress of DNS lookups when calling external services. Our previous solution had this critical capability provided at the infrastructure level, and after discovering that it did not carry over when building a new cluster, we chose to run the node-local cache as a Kubernetes pod. Having this capability in place helped make the process of creating future clusters more repeatable.
We carefully analyzed the customer’s traffic patterns, but we did not want to rely solely on the simulation results on Migration Day. Some of the patterns we saw in the multi-tenant environment included spikes of a magnitude at which auto-scaling could not scale up rapidly enough to prevent increased latency. Based on that, and other unknowns, we decided to slightly over-provision the system for Migration Day. Then we gradually scaled it down over the next few days until it could be fully managed by the Kubernetes auto-scaling policies, which were carefully tuned during the load and performance test phases. This allowed us to avoid any negative impact on the end-user experience on Migration Day.
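The over-provision-then-step-down approach can be sketched as simple arithmetic. This is an illustration under assumed numbers, not the figures we used: the function names, the headroom fraction, and the replica counts are all hypothetical.

```python
import math
from typing import List


def overprovisioned_replicas(expected_peak: int, headroom: float) -> int:
    """Replica count for Migration Day: expected peak plus a safety margin.

    `headroom` is a fraction, e.g. 0.25 for 25% extra capacity (an assumed value).
    """
    if expected_peak < 0 or headroom < 0:
        raise ValueError("expected_peak and headroom must be non-negative")
    return math.ceil(expected_peak * (1.0 + headroom))


def scale_down_schedule(start: int, target: int, steps: int) -> List[int]:
    """Replica counts after each step of a roughly linear scale-down.

    The last entry always equals `target`, the level at which the
    auto-scaler is assumed to take over.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if target > start:
        raise ValueError("target must not exceed start")
    span = start - target
    # Integer arithmetic avoids rounding surprises and ends exactly at target.
    return [start - (span * i) // steps for i in range(1, steps + 1)]
```

For example, `scale_down_schedule(20, 8, 3)` yields one replica count per day over three days, ending at the assumed auto-scaler baseline of 8.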
We wanted to ensure that our customer was fully aware of the infrastructure costs they would incur in the VPC solution, so they could budget appropriately. We also wanted to make sure that the VPC provided a cost-effective solution. Based on those considerations, VPC planning started with cost modeling, and we continued to revisit assumptions as the scalability testing evolved. This kept the model up to date and also provided us with guidance for some of the architecture choices.
End-User Experience Monitoring
In addition to the infrastructure and APM monitoring, we also monitored the actual end-user experience by running synthetic tests over the network, as if they came from an actual person, device, or mobile app. The tests emulate use cases such as registration, posting and retrieval of datapoints, signing into GUI consoles, and performing actions through the mobile apps. The tests run at a high frequency to continuously monitor the customer experience and collect metrics. We were able to monitor the impact on end-user experience during the large-scale migration through the various phases of the project, and we tweaked the plans as necessary. From the end-user perspective, the migration was uneventful.
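A synthetic test of this kind boils down to running a scripted sequence of user actions and recording whether each one succeeded and how long it took. The sketch below assumes each step is a plain callable; `StepResult` and `run_synthetic_test` are hypothetical names, and real steps would make network calls against the platform.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepResult:
    name: str
    ok: bool
    latency_s: float


def run_synthetic_test(
    steps: Dict[str, Callable[[], None]],
    clock: Callable[[], float] = time.perf_counter,
) -> List[StepResult]:
    """Run the steps in order, timing each one.

    A step counts as failed if it raises. The run stops at the first
    failure, on the assumption that later steps depend on earlier ones
    (e.g. posting a datapoint requires a registered device).
    """
    results: List[StepResult] = []
    for name, step in steps.items():
        started = clock()
        try:
            step()
            ok = True
        except Exception:
            ok = False
        results.append(StepResult(name, ok, clock() - started))
        if not ok:
            break
    return results
```

Running such a function on a short interval and exporting the `StepResult` values as metrics gives a continuous view of what end users would experience.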
Planning for Rollback
Critical for reliability is the assumption that any change or action may need to be rolled back. The bigger and more impactful the change, the more important this principle is. Our plan included multiple checkpoints for a potential rollback, and it spelled out how to minimize the impact and what would need to happen to return to a previous “happy” state.
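One way to picture checkpointed rollback is a list of phases, each paired with the action that undoes it. The following is a minimal sketch under that assumption, not our migration script; `run_with_rollback` and the phase tuples are hypothetical.

```python
from typing import Callable, List, Tuple

# (phase name, action that applies the phase, action that undoes it)
Phase = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(phases: List[Phase]) -> List[str]:
    """Apply phases in order; on failure, undo completed phases in reverse.

    Returns a log of what happened. Assumes a failed phase leaves nothing
    of its own to undo, which a real plan would have to verify.
    """
    log: List[str] = []
    completed: List[Phase] = []
    for phase in phases:
        name, apply, _undo = phase
        try:
            apply()
        except Exception:
            log.append(f"failed:{name}")
            # Walk back through the checkpoints, most recent first.
            for done_name, _apply, undo in reversed(completed):
                undo()
                log.append(f"rolled_back:{done_name}")
            return log
        completed.append(phase)
        log.append(f"done:{name}")
    return log
```

Each completed phase acts as a checkpoint: the rollback path is known before the phase runs, rather than improvised after something breaks.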
One of the final keys to success was working with our customer on extensive end-to-end acceptance testing of the VPC environment, device behavior, and the migration process. Critically, this revealed that not all devices respond to a change in DNS routing in the way we expected from our research. Most devices only re-resolved DNS when they received a reset instruction. However, some devices picked up new DNS settings earlier than expected, which led them to communicate with services in the VPC and the multi-tenant environment simultaneously. Having faced this challenge in previous migrations, we applied our existing solution of sending an equivalent instruction to both sides, ensuring that all devices received a timely reset and migrated to the new VPC, whichever path they took.
By Migration Day, all the hard work of preparation had been distilled into a detailed script and an extensive checklist. The engineers assembled in a large conference room-turned-command center and opened a phone bridge with our customer’s team, collectively gathered for the “Big Day”. The teams checked off the pre-flight checklists, executed the migration script, and comfortably powered through the Ready (data migration) and Set (bringing up and validating services) phases.
In the Go phase (re-routing the device population using DNS), the teams stayed alert as a fleet of hundreds of thousands of devices migrated in a span of minutes. We managed this process effectively by using “canaries” as our early warning system. The canaries, a set of designated devices within reach of Ayla and our customer, were the first devices to be migrated in the Go phase, and we looked at them as a proxy for customer experience across the device fleet.
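The canary check can be thought of as a gate in front of the fleet-wide cutover. The sketch below is an assumption about how such a gate might be expressed, not a description of our tooling; `canary_gate`, the device identifiers, and the threshold are hypothetical.

```python
from typing import Dict


def canary_gate(
    canary_health: Dict[str, bool],
    min_healthy_fraction: float = 1.0,
) -> bool:
    """Decide whether to continue migrating the rest of the fleet.

    `canary_health` maps a canary device identifier to whether it was
    observed working after migration. With no canaries reporting there is
    no evidence either way, so the gate stays closed.
    """
    if not canary_health:
        return False
    healthy = sum(1 for ok in canary_health.values() if ok)
    return healthy / len(canary_health) >= min_healthy_fraction
```

Defaulting the threshold to 1.0 means a single unhealthy canary holds the migration until someone investigates, which suits a small set of hand-picked devices.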
Learnings and Takeaways
Thank you for taking the time to read about our journey in getting a customer onto their new VPC at scale. This migration at scale helped underscore the value of the best practices we followed:
- Thoroughly researching data models, device, and application behavior
- Creating a detailed plan spanning devices, applications, data migration, infrastructure, monitoring, as well as planning for capacity, contingencies, and cost
- Repeatedly practicing the migration plan and testing with real devices to find issues
- Performing extensive scalability testing, and leveraging capture/replay along with 24x7 soak
- Acceptance testing with customers and “canary” end-users to identify real-world issues
Some learnings from this experience that we will use to make future migrations even smoother:
- Devices respond to networking changes in a variety of ways: plan for it
- Different versions of applications can behave differently: include all iterations in acceptance tests
- Extensive preparation is key to having a smooth migration day
A VPC can help customers with a large fleet of deployed devices. By tuning the VPC to their specific workload, we can help customers optimize their infrastructure cost while improving performance, resulting in a more consistent user experience.
If you have thoughts or questions about our migration diary, please leave us a comment. We’d love to engage with you. If you are new to Ayla, we encourage you to start with a free trial.