…in which we live-migrate a customer to a high-scale VPC with zero downtime

At Ayla, we frequently help our customers through an IoT journey that starts with a small-scale proof-of-concept and takes them to full-scale deployment with hundreds of thousands to millions of devices in the field. Operating at that scale brings its own challenges, and a VPC (Virtual Private Cloud) deployment can often help.

One of our customers created an awesome product with rapid growth and a highly engaged user base. As the product’s adoption reached several hundred thousand users, it became clear that migrating the users to an Ayla VPC would help improve the customer experience and reduce operational cost at this scale. The Ayla VPC (an instance of the Ayla cloud platform dedicated to the customer) allows us to tune the platform to the specific customer’s scenario, making experience-critical aspects more responsive and less-critical aspects more cost-efficient. Scalability is particularly interesting here, because each device has a background mode that sends data points several times an hour and an active mode that sends multiple data points per second. During peak times, when many devices are in active mode at the same time, data velocity approaches 10,000 transactions per second (TPS).

This story starts weeks before an epic Migration Day involving hundreds of thousands of devices, around 10,000 requests per second, and zero downtime, with a plan and a drive to anticipate every eventuality…



Migration Day Game Plan

First, we built a high-level Game Plan for Migration Day, making sure we had everything we needed for it to go smoothly. The key steps were:

  1. Ready: Get all the data and infrastructure ready to receive traffic in the VPC.
  2. Set: Bring up and validate all the services at scale so we are ready for real traffic.
  3. Go: Re-route the device and application traffic to the VPC (using the power of DNS).
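The Go step depends on knowing where a hostname actually resolves at any given moment. Below is a minimal sketch of how such a check could look, assuming the VPC's public addresses are known in advance. The function names, the status labels, and the example addresses are hypothetical and are not taken from Ayla's tooling. Keep in mind that clients and intermediate resolvers may cache the old record until its TTL expires, so a check from one machine only reflects that machine's view.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def cutover_status(
    hostname: str,
    vpc_addresses: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> str:
    """Classify where hostname currently points (hypothetical labels).

    "vpc"        every resolved address belongs to the VPC
    "mixed"      some addresses belong to the VPC, some do not
    "legacy"     no resolved address belongs to the VPC
    "unresolved" the resolver returned nothing
    """
    resolved = resolver(hostname)
    vpc = set(vpc_addresses)
    if not resolved:
        return "unresolved"
    if resolved <= vpc:
        return "vpc"
    if resolved & vpc:
        return "mixed"
    return "legacy"
```

The resolver is passed in as a parameter so the classification logic can be exercised against a test pool (or a fake resolver) without touching production DNS.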

    And then the real work started…

    Devices and Applications

    Working backwards from Go, we partnered with the customer to research the devices and applications in the field, what code they ran, and how the devices and applications would react to a change in the DNS definition. It’s a good thing we started early, because this analysis led us to some changes we needed to make to the applications (and we needed time to let end users adopt the changes).

    Taking nothing in the real world for granted, we worked with the customer to create an internal test pool of devices and applications wired to URLs created specifically for testing. This test pool was migrated back and forth, over and over, during the preparation phase until all glitches in the Game Plan were found and resolved.

    Data

    Next, we created a plan to migrate the data, and exercised it over and over until we got it right. This was especially challenging because we needed to separate the VPC customer’s data out of a multi-tenant instance of the Ayla Cloud Platform. A simple database/data cluster replication was not going to work in this case. 

    We learned a few lessons along the way, some surprising and some less so:

    1. Work on a detached replica – Once the data volume gets reasonably large (~100 GB for SQL, ~1 TB for NoSQL), and there are large data operations to perform, working on a system that’s not experiencing real load will make everything more predictable.
    2. Warm up the caches – The power of transparent caching in modern databases can be taken for granted, until you run large queries through a database with a cold cache. For some steps, we found that warming up the caches by first running SELECT * FROM large_table > /dev/null (yes, really) sped up the overall process by minutes.
    3. Break down large steps – High-level operations such as dump/import and snapshot/restore can save a lot of time in data engineering. However, they can also create long poles in your process with variability – and with little control. Prepare to break down long tasks that happen “auto-magically.” Or, at least have a backup plan if they take too long.
    4. Know your data models – A lot of thought went into understanding how the data models fit together, and how data changes with user actions. This enabled us to design the data migration process around operations we had determined to be safe for these specific sets of data.
    5. Know your data’s limits – Sometimes suspending a capability you can live without is better than designing extra complexity. In this case, we reduced scope and risk by suspending new user and device registrations for a critical window of time on Migration Day.
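Lessons 2 and 3 can be sketched together: read a table once to warm the cache, then pull one tenant's rows out of a shared table in small, resumable batches instead of one opaque dump/import. This is only an illustration under stated assumptions. It uses Python's built-in sqlite3 module as a stand-in for the real databases, and the `datapoints` table, its columns, and both function names are hypothetical rather than part of the Ayla data model.

```python
import sqlite3


def warm_cache(conn: sqlite3.Connection, table: str) -> int:
    """Read every row once and discard it (the SELECT * > /dev/null trick).

    `table` must be a trusted identifier; it is interpolated into the SQL.
    Returns the number of rows read.
    """
    rows_read = 0
    for _ in conn.execute(f"SELECT * FROM {table}"):
        rows_read += 1
    return rows_read


def copy_tenant_in_batches(
    src: sqlite3.Connection,
    dst: sqlite3.Connection,
    tenant_id: str,
    batch_size: int = 1000,
) -> int:
    """Copy one tenant's rows from a shared table, batch_size rows at a time.

    Uses keyset pagination (id > last_id) so each batch is a short query
    and the copy can resume from the last committed id.
    Returns the number of rows copied.
    """
    last_id = 0
    copied = 0
    while True:
        rows = src.execute(
            "SELECT id, tenant_id, payload FROM datapoints "
            "WHERE tenant_id = ? AND id > ? ORDER BY id LIMIT ?",
            (tenant_id, last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        with dst:  # commit each batch as its own transaction
            dst.executemany(
                "INSERT INTO datapoints (id, tenant_id, payload) "
                "VALUES (?, ?, ?)",
                rows,
            )
        last_id = rows[-1][0]
        copied += len(rows)
```

Small batches trade some raw throughput for predictability: each step has a bounded duration, progress is visible, and a slow step can be paused or retried without restarting the whole copy.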

    Scalability

    Scalability was a particular challenge for this effort. Most highly scalable systems (including Ayla’s biggest multi-tenant environments) grow into their scale, starting small and then getting 3x the traffic, then 10x the traffic, etc. This VPC had to be ready for around 10,000 TPS on its first day on the job.
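To see how a figure of that order can arise from the two device modes, here is a back-of-the-envelope sketch. Every input below is a made-up, illustrative assumption (the actual fleet size, share of active devices, and per-device rates are not published here); only the shape of the calculation is the point.

```python
def estimate_tps(
    active_devices: int,
    background_devices: int,
    active_rate_per_s: float,
    background_per_hour: float,
) -> float:
    """Combined data point rate (per second) for the two device modes."""
    active_tps = active_devices * active_rate_per_s
    background_tps = background_devices * background_per_hour / 3600
    return active_tps + background_tps


# Hypothetical peak: 3,000 devices in active mode at 3 data points/s,
# 297,000 devices in background mode at 4 data points/hour.
peak = estimate_tps(
    active_devices=3_000,
    background_devices=297_000,
    active_rate_per_s=3,
    background_per_hour=4,
)
print(f"{peak:.0f} TPS")  # 9330 TPS with these illustrative inputs
```

With inputs like these, the small share of devices in active mode accounts for almost all of the load, which is why the peak number of simultaneously active devices matters far more than the total fleet size.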


    We used a set of extensive Gatling scenarios to simulate that load. To make the test more realistic, the customer provided deep insight on how their products use Ayla, and we used that insight in combination with production logs to create a load test scenario that emulates our customer’s unique workload.
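One way to turn production logs into a workload mix is to count request types and scale their shares to the target rate. The sketch below assumes a simplified log format of one "METHOD path" pair per line, and the endpoint paths are hypothetical placeholders, not the Ayla API. Feeding the resulting rates into a Gatling simulation is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

RequestKey = Tuple[str, str]  # (HTTP method, path)


def request_mix(log_lines: Iterable[str]) -> Dict[RequestKey, float]:
    """Return each request type's share of total traffic (shares sum to 1)."""
    counts: Counter = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        counts[(parts[0], parts[1])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def target_rates(
    mix: Dict[RequestKey, float], total_tps: float
) -> Dict[RequestKey, float]:
    """Scale a traffic mix to per-request-type rates at a target total TPS."""
    return {key: share * total_tps for key, share in mix.items()}


# Hypothetical log sample: three data point writes per property read.
sample = ["POST /datapoints"] * 3 + ["GET /properties", ""]
rates = target_rates(request_mix(sample), total_tps=10_000)
```

In practice the raw counts would be adjusted with what the customer knows about device behavior, for example how the mix shifts when many devices enter active mode at once.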

    After a dry-run data migration (also good data migration practice), we started the scalability flywheel in the VPC, but under load test instead of organic load. We found that most of our scalability learnings were captured in our infrastructure and platform automation, and translated well to the VPC. However, we did discover and fix some bottlenecks along the way, through tuning, adjustments to scaling policy details, and selective upgrades of key low-level services.

    This gave us increasing confidence that the VPC was up to the task. In order to be sure, we needed to dig deeper, leading us to three (not so secret) weapons that we leveraged to great effect. 

    Join us tomorrow for Part 2 of Diary of an Ayla VPC Migration.