UPDATE 2/14: We received some helpful suggestions that we have added to our simulations; see below.
UPDATE 2/18: See the bottom of this annotation.
Choice-of-two routing is the naive method from before, with a twist: when a request is assigned to a random dyno that turns out to be busy, the request is reassigned to a second random dyno, with no regard for whether that second dyno is busy.
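In code, this policy is only a few lines. A minimal sketch (my own illustration, not the post's simulation code; `busy` is a hypothetical per-dyno flag):

```python
import random

def choice_of_two_route(busy, rng=random):
    """Pick a random dyno; if it is busy, reassign to a second random dyno
    with no regard for whether that one is busy too. `busy` is a list of
    booleans, one per dyno; returns the chosen dyno's index.
    (Hypothetical sketch, not the post's actual R simulation.)"""
    n = len(busy)
    first = rng.randrange(n)
    if not busy[first]:
        return first
    second = rng.randrange(n - 1)  # a *different* random dyno
    return second + 1 if second >= first else second
```

Note that the second pick ignores the `busy` flag entirely, which is what separates this from the "power of two choices" variant discussed further down.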
Unicorn allows a single dyno to process multiple requests simultaneously, and each dyno distributes its requests to its workers in the intelligent style. For example, say dyno D has two workers, A and B. If request R is sent to D while both A and B are busy, D will assign R to whichever of A or B becomes available first.
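That per-dyno worker pool can be sketched by tracking when each worker next frees up; `unicorn_dyno_wait` and its defaults below are hypothetical illustrations, not Heroku's router or the post's R code:

```python
import random

def unicorn_dyno_wait(requests, n_dynos=10, workers_per_dyno=2, seed=0):
    """Naive routing across dynos, but each dyno runs several Unicorn-style
    workers sharing one queue: a request that finds every worker busy waits
    for whichever worker frees up first. `requests` is a list of
    (arrival_time, service_time) pairs sorted by arrival time.
    Returns the average queue wait. (Made-up defaults, for illustration.)"""
    rng = random.Random(seed)
    # free_at[d][w]: the time worker w on dyno d next becomes idle
    free_at = [[0.0] * workers_per_dyno for _ in range(n_dynos)]
    total_wait = 0.0
    for t, s in requests:
        d = rng.randrange(n_dynos)                 # router still picks at random
        w = min(range(workers_per_dyno), key=lambda i: free_at[d][i])
        start = max(t, free_at[d][w])              # first available worker
        total_wait += start - t
        free_at[d][w] = start + s
    return total_wait / len(requests)
```

The key line is the `min(...)` over the dyno's workers: the dyno, not the router, is doing the intelligent assignment.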
We also received a lot of feedback that our response times are too slow, and we need to “kill the tail” of our slow responses. We wholeheartedly agree, and we work every day to improve it, but it seems that given any even remotely realistic distribution of request times, the random routing becomes a problem as you scale.
Here are some graphs that show the four routing styles: intelligent, naive, naive with choice of two, and naive with Unicorn. The first graph uses Rap Genius's empirical distribution of request times (median 46 ms, 99th percentile 3,144 ms).
This second graph uses a simulated Weibull distribution with median 50 ms, 99th percentile 537 ms, and 99.9th percentile 898 ms.
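For the curious, a Weibull matching those quantiles can be pinned down from the median and 99th percentile alone, using the quantile function Q(p) = λ(−ln(1−p))^(1/k); the implied 99.9th percentile then lands close to the 898 ms above. A sketch of that derivation (my own, not necessarily how the simulation was parameterized):

```python
import math

def weibull_from_quantiles(median, p99):
    """Solve for Weibull shape k and scale lam from two quantiles, using
    Q(p) = lam * (-ln(1 - p))**(1/k). Inputs below match the post's
    simulated distribution (median 50 ms, 99th percentile 537 ms)."""
    # Q(p99) / Q(median) = (ln 100 / ln 2)**(1/k)  ->  solve for k
    k = math.log(math.log(100) / math.log(2)) / math.log(p99 / median)
    lam = median / math.log(2) ** (1 / k)
    return k, lam

k, lam = weibull_from_quantiles(50, 537)
p999 = lam * math.log(1000) ** (1 / k)  # 99.9th percentile implied by the fit
```

The shape comes out below 1 (a heavy right tail), and `p999` lands within a few milliseconds of the 898 ms figure, so the three quoted quantiles are mutually consistent.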
The graphs show that both choice of two and Unicorn offer meaningful improvements over naive routing, but even with the much faster response-time profile, you will still see non-trivial queueing as your traffic grows. Here's a final graph with the faster response-time profile, but with requests per minute doubled to 18,000.
Peter Bailis writes:
FWIW, I’m pretty sure that your “Choice of Two” load balancing differs from standard “Power of Two Choices” load balancing
The key difference is that, in standard “Power of Two Choices” balancing, you query two Dynos and send the request to the less loaded: “each request is placed in the least loaded of d >= 2 Dynos chosen independently and uniformly at random”
So I updated the results to include this "power of two" approach. Choice of two and power of two queue the same percentage of requests, which makes sense: in either case, if one of the two sampled dynos is free, the request gets assigned to that dyno.
However, the power-of-two method lowers the average queue time: when both selected dynos are busy, the "choice" method picks between them at random, while the "power" method assigns the request to the dyno with fewer requests queued. (Fewer requests queued does not guarantee that dyno will free up first, but more often than not it will.)
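The difference shows up in a toy side-by-side simulation; the fixed-interval arrivals and all parameters below are made up for illustration and are not the post's R simulation:

```python
import random

def simulate(tie_break, n_requests=50_000, n_dynos=50,
             service=0.05, gap=0.002, seed=1):
    """Toy simulation: requests arrive at fixed intervals and each samples
    two distinct dynos. If both are busy the request counts as queued, and
    the tie-break decides where it waits: 'power' picks the dyno that frees
    up sooner, 'choice' picks one of the two at random.
    (Hypothetical parameters, not the post's workload.)"""
    rng = random.Random(seed)
    free_at = [0.0] * n_dynos  # when each dyno next becomes idle
    queued, total_wait, t = 0, 0.0, 0.0
    for _ in range(n_requests):
        a, b = rng.sample(range(n_dynos), 2)
        if free_at[a] <= t or free_at[b] <= t:
            d = a if free_at[a] <= t else b       # a free dyno: no queueing
        else:
            queued += 1
            if tie_break == "power":
                d = min(a, b, key=lambda x: free_at[x])
            else:
                d = rng.choice([a, b])
        start = max(t, free_at[d])
        total_wait += start - t
        free_at[d] = start + service
        t += gap
    return queued / n_requests, total_wait / n_requests
```

Running both tie-breaks at the same load shows nearly identical queued percentages but a lower average wait for `"power"`, matching the observation above.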
Here are results using a scenario with 9,000 requests per minute, following our simulated Weibull distribution:
The up-to-date R code is available on GitHub.
This annotation appears on "Heroku's Ugly Secret" by James Somers.