9 Lessons Learned Migrating From Heroku To Kubernetes with Zero Downtime

For the first couple of years, we ran Periscope Data on Heroku. As a young startup, Heroku was a fantastic platform because it significantly reduced our operational complexity, letting us focus almost all our time on shipping product. We were able to get started quickly because setting up the DevOps pipelines to move our application from development to production was easy, and we didn't need to create and maintain a continuous delivery system in-house. We required minimal infrastructure for the app to be live, and could easily scale the app as needed.

This was our architecture on Heroku:

At the time, that was just what we needed. We were a small team trying to find product/market fit. We were focused on building and iterating on our product. But as we grew and our focus shifted to scaling the product, working on Heroku became challenging.

Why we left Heroku

As we grew, so did our infrastructure requirements. Heroku wasn't able to keep up with all of them, which is why we moved to hosting Kubernetes ourselves. Here are some of the new requirements that drove our decision:

Requirement #1: Granular control over server resources

Heroku offers 4 distinct types of dynos: 1x, 2x, Perf M, Perf L. Being locked into these four options is very limiting when you’re working at scale. We needed much finer control over our resources to economically scale our application's infrastructure.

Requirement #2: Reliable scheduling of background processes

The Heroku scheduler is a very minimalistic application. We had limited control over how frequently we could schedule our jobs (daily, hourly or every 10 minutes). In addition, there was no good way to control versions: if a job needed to be disabled, we would lose the job definition. Worst of all, it was a "best effort" service from Heroku — sometimes the scheduler would skip runs, and that was expected behavior.

Requirement #3: No databases or backend microservices should be internet-facing

Very early on, we moved our Postgres databases out of Heroku. That gave us better control over the databases. We needed superuser privileges and better security: Heroku databases could not be firewalled, and we were unwilling to allow our database to be internet-facing. Since private networks on Heroku were unusually expensive, we ended up sending our database connections from Heroku to RDS (in our VPC) through SSH tunnels. This increased operational overhead and impacted performance. We wanted to add connection pooling using pgbouncer, but that, along with the SSH tunnels, meant another layer of complexity. Each new microservice needed a similar networking setup, and the overhead became too much to manage.

Requirement #4: Minimal external dependencies affecting product uptime

Any dependency on an outside tool creates a ceiling for the performance of a system. With Heroku, this meant a ceiling on the uptime guarantee that we could provide to our customers. And Heroku's uptime was below what we wanted to offer our customers long-term.

Heroku also impacted our release schedule. We shipped code multiple times a day, but we made it a practice to never ship when any relevant Heroku component had a yellow or red status. We'd been burned by compounding Heroku outages with our own mistakes and were no longer willing to risk shipping during a Heroku event, even for seemingly unrelated issues. This often led to us delaying or skipping deploys, which reduced our ability to ship product to customers.

Requirement #5: Infrastructure should be developed with scalability in mind

With every customization and add-on, we were building our infrastructure the Heroku way. The sooner we moved out of Heroku, the easier it would be to migrate the application to a more scalable architecture that could serve us long term. That included managing the operational infrastructure, monitoring, continuous integration & deployment pipelines, etc. In the long run, it was always going to be more expensive to not manage our own core operations.

While we'd still recommend Heroku to any young startup, it was no longer the right platform for us. It was time to bring more of the Periscope Data infrastructure in-house, with fewer external dependencies and more direct control.

Why we chose Kubernetes

After we decided to bring our infrastructure in-house, the decision to use Kubernetes was a relatively easy one. We wanted to move to a container-based platform for the flexibility it provides in building and managing a cloud-native application.

Kubernetes offers an excellent set of tools to manage containerized applications. You can think of it as managing a desired state for the containers. Here are some of the features we were excited about:

  • Scalability: Applications can be easily scaled up by increasing the resources allocated to them. Scaling out is as easy as increasing the number of replicas for a controller. Horizontal autoscalers can scale automatically based on resource metrics.
  • Versioning: A new version of the application creates a new replica set when deployed; in case of a regression, it can be easily rolled back to the old version.
  • Distribution: A rolling update strategy allows you to release a new version of the application n pods at a time, thus limiting the number of unavailable containers at any given time (see the sketch after this list).
  • Load Balancing: Kubernetes gives containers their own IP addresses and a single DNS name for a set of containers, with the possibility of load-balancing across them.
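
To make the rolling update point concrete, here is a minimal sketch of a Deployment that rolls out a few pods at a time. The names, image and numbers are placeholders for illustration, not a spec we actually ran.

# Hypothetical Deployment sketch illustrating a rolling update strategy.
# All names, images and numbers are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 20
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2   # at most 2 pods may be unavailable during the rollout
      maxSurge: 2         # at most 2 extra pods may be created during the rollout
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:v2   # placeholder image tag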

We had been testing Kubernetes with our services internally. After getting a taste of the power it offered, it was clear to us why Kubernetes is quickly becoming the gold standard of the industry.

The Migration

Building a plan

At the time, our web app had over 50 background jobs and about half as many microservices. Our web app is Rails, and the background jobs are mainly written in Ruby with a Rake interface. Our microservices are in Go, with the exception of one, which is in Java. We compile our web assets using Yarn.

As part of this migration, we decided to move our microservices first. This would lay the foundation for a live Kubernetes setup, and subsequently an internal setup for development and test. Moving the microservices first was also lower risk; it's much safer to route to microservices hosted at different domains than to redo the networking in front of the web servers. We would also need to select a DevOps toolchain to set up a build and release pipeline for Kubernetes and to build and store Docker images.

Next, we would move our background jobs to the new system. This would allow us to start addressing the ruby side of our application.

The last step would be to migrate the web app. This would involve compiling web assets, routing and serving HTTP requests, builds and releases, and load testing for performance. The key challenge here would be managing the rollout and moving live traffic from Heroku to Kubernetes.

How did we do it?

A comparable architecture of our system in Kubernetes is shown below, and the sections that follow cover some of the lessons we learned during the migration.

Lesson #1: A reverse proxy app can be a powerful tool to manage HTTP requests

The Heroku router returned error codes such as H12 (Request timeout), H15 (Idle connection) and H18 (Server Request Interrupted), which were useful in hiding complexity from the alert configs. For example, our “Too many request timeouts” alert was configured as: count:1m(code='H12') > 50.

If the Rails app took longer than 30 seconds to respond, the Heroku router returned a 503 before the request completed. We decided to implement something similar on our end using an HAProxy app. The key challenge here involved setting the right HTTP options based on the desired behavior and tuning the client, request and response timeouts between the proxy and the load balancer. This gave us very efficient management of requests.
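
As a rough illustration of the kind of tuning involved, here is a minimal HAProxy sketch with the relevant HTTP options and timeouts. The values, ports and backend names are assumptions for this example, not our production configuration.

# Hypothetical HAProxy sketch; values, ports and names are examples only.
defaults
    mode http
    # close server-side connections after each response
    option http-server-close
    # retry the request on another server if the connection fails
    option redispatch
    # time allowed to establish a TCP connection to a backend
    timeout connect 5s
    # inactivity timeouts on the client and server sides;
    # 30s mirrors the Heroku router's request timeout
    timeout client 30s
    timeout server 30s
    # limit how long a client may take to send the full request
    timeout http-request 15s

frontend www
    bind *:8080
    default_backend rails_web

backend rails_web
    server web1 web.internal:3000 check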

Early on, we ran into an HAProxy issue where requests would fail because the proxy app was unable to resolve the server host. It turned out that for DNS resolution to happen correctly at runtime, we needed to configure health checks for the proxy app.
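
For reference, runtime DNS resolution in HAProxy can be sketched as below; the nameserver address and hostnames are placeholders, and the exact behavior depends on the HAProxy version.

# Hypothetical sketch of runtime DNS resolution in HAProxy;
# the nameserver address and hostnames are placeholders.
resolvers internal_dns
    nameserver dns1 10.0.0.2:53
    resolve_retries 3
    timeout retry 1s
    # how long a resolved address remains valid before re-resolution
    hold valid 10s

backend kube
    # "check" enables health checks; in older HAProxy versions, the server
    # hostname is only re-resolved at runtime in conjunction with health checks
    server web1 web.example.internal:80 check resolvers internal_dns resolve-prefer ipv4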

Lesson #2: Horizontally scalable services don't always scale across different deployments

While moving our first background job to Kubernetes, we ran into a concurrency bug that would result in the same job running multiple times. In our product, users can clone dashboards and doing so schedules a background job for the cloning process. The job was being picked up by multiple microservice instances and the dashboard would be cloned multiple times. The problem was rooted in our use of an env variable in our homegrown job runner to disambiguate between running jobs and those that were killed during a release process. The job runner thread would lock a job with a specific release version to ensure it didn’t get picked up by other threads. Our jobs in Kubernetes had a different version format than on Heroku. The versioning logic that managed locking became too complex, so we eventually switched to a much cleaner solution that would make it version-agnostic. This prompted us to re-evaluate all Heroku env variables a lot more carefully.
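
To illustrate the direction of the version-agnostic approach, here is a hypothetical Ruby sketch; the JobRecord model, its columns and the runner token are invented for this example and are not our actual job runner.

# Hypothetical sketch of version-agnostic job locking. JobRecord and its
# columns are invented for illustration; this is not our actual job runner.
require 'securerandom'
require 'socket'

class JobRunner
  # A token unique to this runner process replaces the release-version string,
  # so locking no longer depends on how Heroku or Kubernetes formats versions.
  RUNNER_TOKEN = "#{Socket.gethostname}-#{Process.pid}-#{SecureRandom.hex(4)}".freeze

  def claim(job_id)
    # Atomically claim the job only if no other runner currently holds it.
    JobRecord.where(id: job_id, locked_by: nil)
             .update_all(locked_by: RUNNER_TOKEN, locked_at: Time.now) == 1
  end
end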

Lesson #3: A carefully chosen ratio of concurrency resources can lead to a highly optimized setup

One of the experiments we undertook during this migration was finding the right balance of pod size (CPU and memory), the number of Phusion Passenger application processes, and the number of threads per process. This depended somewhat on the type of requests: long-running requests behave differently from short requests, and CPU-bound requests have a different usage pattern than IO-bound requests. The types of 5xx errors from the proxy app, along with the number of LivenessProbe/ReadinessProbe failures on the pods, were a good indicator of whether a combination was worth pursuing. We were using 8 Perf L dynos with 4 processes and 7 threads each on Heroku, and we switched to 75 smaller pods on Kubernetes with 1 process and 12 threads each. Smaller pods minimized the impact to customers when a pod went down due to an error state. The HAProxy docs were a great resource for understanding the errors, and various articles online helped our team come up with a good starting point for the ratios of resources, processes and threads.

Lesson #4: Achieve zero downtime during migration by using a reverse proxy app

Before the web app went live on our new system, it was put through load testing via JMeter. Since no amount of testing could truly simulate live traffic, we decided to dynamically send traffic to both the Heroku and Kubernetes systems using a percentage-based rollout. This gave us an ideal platform to manage traffic in real time and make adjustments as needed. We created a reverse proxy app using HAProxy to sit in front of the web application. This somewhat simulated the Heroku router functionality, but with potentially different behavior and some added latency. The proxy allowed us to redirect partial traffic to Kubernetes using our cookie-based routing logic (described below). Introducing the proxy app in front of all live traffic was one of the most nerve-wracking moments of this migration.

In the Periscope Data app, customers have the option to create multiple "sites," and an authorized user has access to one or more sites. We used site-level cookies to implement routing: we checked each request to see if it contained a Kubernetes-enabling cookie and, if it did, routed the request to the Kubernetes server. By default, traffic initially went to Heroku. This made routing simple and fast. The request-level timeouts on the proxy app required some tuning, but we soon had a working system for traffic redirects. We also maintained a blacklist to exclude sites that had specific networking needs until we were able to address them. We could simply change a flag for sites or change the rollout percentage, and the next requests for the affected sites would be routed to Kube.

The cookie-based logic looks something like this:

# Set the routing cookie when the site is flagged for Kubernetes; clear it otherwise.
if kube_enabled && !cookies[:periscope_web_kube].present?
  cookies[:periscope_web_kube] = { value: 1, expires: 1.year.from_now }
elsif !kube_enabled
  cookies.delete(:periscope_web_kube)
end

And the proxy logic:

# if cookie is present, then redirect to Kubernetes
acl cookie_found hdr_sub(cookie) periscope_web_kube
use_backend kube if cookie_found

Over a period of 2 weeks, we rolled out our web application from being almost entirely in Heroku to almost entirely in Kubernetes.

Breakdown of traffic during the transition from Heroku (purple) to Kubernetes (green)

Lesson #5: Managing releases to two production environments is highly error prone

For a significant amount of time, we deployed to both Heroku and Kubernetes to ensure we would have a fallback. Since the methodology and the time needed to deploy to the two systems were different, it was difficult to control exactly when a new deploy would be fully available to all requests. The application needed to account for the possibility that different versions of the code could be active for a significant amount of time. This meant that every deploy had to be backward compatible. We put features behind flags so they could be turned on and off in a predictable manner. Database migrations were deployed as a separate, earlier release from the corresponding code. If there was a regression, rollbacks happened on both systems. Deployments generally took longer early on, and were prone to errors because of the dependencies involved.

Lesson #6: Database connection pooling is a great optimization for conserving database resources

Since Heroku did blue/green deploys, the number of database connections almost doubled for a brief period of time during deployment. Because we were sending partial traffic to Kube, that further added to the connection count. This resulted in some instability and we experienced an increased number of 4xx and 5xx errors from the application. During this time, we manually maintained a running count of the number of expected database connections at any given point. This was an error-prone process. Eventually, we created a separate pgbouncer service for pooled connections and that resolved our connection count issues.
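
For reference, a pooled setup with pgbouncer can be sketched roughly as below; the database name, host and pool sizes are placeholders, not our actual configuration.

; Hypothetical pgbouncer.ini sketch; names, hosts and sizes are placeholders.
[databases]
periscope = host=db.example.internal port=5432 dbname=periscope

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; return server connections to the pool after each transaction
pool_mode = transaction
; server connections kept open per database/user pair
default_pool_size = 20
; total application clients allowed to connect to the pooler
max_client_conn = 2000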

Lesson #7: Make sure to explicitly specify CPU and memory request and limit values

We monitor our infrastructure using Datadog and Scalyr. For each application we moved, we started with a slightly overprovisioned set of resources, then scaled down where possible. Although it was difficult to get exactly equivalent metrics across platforms (CPU share vs. CPU core vs. vCPU), it was very easy to scale in or out with Kubernetes. The docker.cpu.usage and kubernetes.memory.usage metrics were very useful in determining resource utilization.

One issue we ran into early on: when the request, the limit, or both were not specified for a replication controller, resource usage on the host became unpredictable and destabilized the system. Explicitly specifying resource requests and limits was the simple fix that worked.
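
A minimal sketch of what that looks like for a replication controller follows; the names, image and values are placeholders, not our production settings.

# Hypothetical ReplicationController sketch with explicit resource
# requests and limits; names, image and values are placeholders.
apiVersion: v1
kind: ReplicationController
metadata:
  name: web
spec:
  replicas: 75
  selector:
    app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:latest   # placeholder image
        resources:
          requests:
            cpu: 500m        # guaranteed share, used by the scheduler
            memory: 1Gi
          limits:
            cpu: "1"         # hard cap; exceeding it causes CPU throttling
            memory: 2Gi      # exceeding this gets the container OOM-killed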

Lesson #8: CPU throttling looks like 5xx errors

We used New Relic to track our application-level metrics. The key metrics we tracked during the migration were the Average Response Time and the Request Queuing Time per request. Since we had introduced a reverse proxy app in front of the requests, we knew it would add some latency. We carefully monitored the request queuing times and tuned the proxy configuration (mainly HTTP options and timeouts) to minimize them.

We also tracked the number of 4xx and 5xx error responses from the application. As a starting point, we wanted to ensure these numbers were comparable to what we saw on Heroku. Our requests seemed to be more CPU-bound, and when there weren't enough resources to support the demand, we would see failing requests with 500s, 502s, 503s and 504s. Simply adding more replicas solved the issue (a horizontal autoscaler would surely have come in handy).
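
For completeness, a horizontal autoscaler of the kind mentioned above can be sketched like this; the target name and thresholds are placeholders, not a configuration we ran at the time.

# Hypothetical HorizontalPodAutoscaler sketch; names and thresholds are placeholders.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: web
  minReplicas: 50
  maxReplicas: 100
  # add replicas when average CPU utilization across pods exceeds 70%
  targetCPUUtilizationPercentage: 70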

CPU throttling when there are not enough resources to handle requests

Lesson #9: Be proactive with training and enablement

An often ignored (but very important) thing to plan for is spreading knowledge of the new systems internally. This needed to happen at the same pace as the migration so that internal teams could support the live systems while other teams continued to develop new features. Wikis and scripts were created to make operational tasks repeatable. On-call alerts were evaluated and reworked to prevent false positives.

Periscope Data on Kubernetes

We've completed the move to Kubernetes and done a lot more with it, which we'll cover in future articles. We still use Heroku for one thing: PR Apps. We've always appreciated how well Heroku manages those, and setting up an environment to do the same internally would be rather tricky. Bringing the operational infrastructure in-house is a huge undertaking. It may be less efficient early on, but it has opened the gates for scaling Periscope Data to what it is today and has laid the groundwork for future growth and optimization.

Thanks for this project go to team Stabilitude (Jason Friedman, Jeff Watts, Shobhit Garg, Chris Tice, Ajay Sharma, Ilge Akkaya, Stephen Hsiao) for their contributions.


Want to discuss this article? Join the Periscope Data Community!

Dharmesh Shah
Dharmesh has played various roles in developing software for over 15 years. He is passionate about his work and seeks continuous improvement in products and processes around him. Dharmesh is a senior backend generalist at Periscope Data.