Hotstar and the art of managing traffic spikes

Ajit Mohan has over the past year or so invested heavily in tech resources

MUMBAI: Ajit Mohan sits back in his chair on the 26th floor of Urmi Estate in Mumbai, his chest swelling with pride as he reads what he has just posted on LinkedIn. “Reading about YouTube TV crashing for the England vs Croatia game and being reminded again that scaling for live is no accident. Feel proud about Hotstar Tech and our VIVO IPL 2018 scale,” the post states.

What the CEO of Hotstar is referring to is the huge spike of 10.33 million concurrent viewers that India’s leading video streamer handled during the IPL 2018 final.

“Over the past three and a half years, we have built live tech that is truly world class and that can handle massive surges in traffic. It is not about just scaling the video infrastructure; it is about making sure all parts of our tech can scale, including the gaming and social TV experience. I do think we have built something unique and special in live tech and we are proud that the bar has been set by an Indian service,” he says.

In fact, Ajit has over the past year or so invested heavily in tech resources – in terms of teams, in-house hardware, monitoring tools and the like. So much so that most of Hotstar’s tech today is run by its own engineers, with very little reliance on third parties.

Today Hotstar’s command centre in Mumbai hosts more than 100 techies, most of them youngsters between the ages of 23 and 35. “It’s the youngsters who are driving leapfrogs in innovation,” says Star India managing director Sanjay Gupta.

Cubicles are buzzing with data scientists, programmers and hardware and software geeks peering at screens, monitoring hotspots where traffic is unusually high and ensuring that Hotstar stays up at all times. “We want to be and probably are the gold standard in streaming experience – not just in India but the world as well,” says Ajit.

It is this almost maniacal obsession with giving Hotstar users a consistent streaming experience, whether they are watching live cricket or shows from its linear channels, that has made it the envy of the likes of Netflix CEO Reed Hastings, who has referred to it on several occasions during investor calls and briefings.

Compare that with larger companies such as Australia’s Optus, which simply collapsed, unable to bear the weight of a few thousand subscribers during the group phase of the FIFA World Cup 2018. Customers were subjected to repeated dropouts or blurred, low-quality streaming, with the spinning progress wheel going for minutes on end. They came out in hordes slamming the service, labelling it #FloptusSport. So much so that Optus was forced to turn off the pay button, give subscribers free access until 31 August and even issue refunds. It will also be offering customers the first three rounds of the Premier League for free.

Another major that simply disintegrated during the current football frenzy was media tech titan Google’s YouTube TV, which costs viewers a hefty $40 a month. Customers were once again left frustrated when the service got logjammed, unable to handle the thousands of concurrent live streams. YouTube apologised profusely, but to no avail. Soccer fans took it to the cleaners. Tweeted one of them: “..it’s completely down. If Google can’t keep it online in a surge like this, nobody can.”

Google engineers could probably try knocking at Hotstar’s doors and learn a trick or two from Ajit and his tech team.  That would probably give their customers a better video experience.

Akash Saksena, one of the Hotstar engineers, posted on a blog what went into making Hotstar the smooth streamer it turned out to be during the World Cup. Read on to find out more.

Your cloud provider also has physical limits on how much you can auto-scale. Work with them closely and ensure you make the right projections ahead of time. Even then, nothing can make it better for you if you are inefficient per server. This calls for rabid tuning of all your system parameters. Moving from development to production environments requires knowledge of what hardware your code will run on and then tweaking it to suit that system. Be lean on your single server and yield results with more room to scale horizontally. Review all your configurations with a fine-tooth comb; it’ll save you the blushes in production. Each system must be tuned specifically to the traffic pattern and hardware you choose to run it on.
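The post does not include code, but the idea of tuning to the hardware the code actually runs on can be sketched roughly in Python. The function name, the rule of thumb and all the numbers below are illustrative assumptions, not Hotstar’s real settings.

    import os

    def tuned_server_settings(avg_request_ms: int = 40, target_cpu_util: float = 0.7) -> dict:
        """Derive per-host settings from the machine the process runs on.

        avg_request_ms and target_cpu_util are illustrative knobs; the point is
        that the config is computed per host, not copied from a dev environment.
        """
        cores = os.cpu_count() or 1
        # Rough rule of thumb: a couple of workers per core, capped so a single
        # box stays lean and extra capacity comes from scaling horizontally.
        workers = min(2 * cores + 1, 32)
        # Requests one worker can absorb per second at the target utilisation.
        per_worker_rps = int((1000 / avg_request_ms) * target_cpu_util)
        return {
            "workers": workers,
            "max_connections_per_worker": 4 * per_worker_rps,
            "expected_rps_per_host": workers * per_worker_rps,
        }

    if __name__ == "__main__":
        print(tuned_server_settings())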

No Dumb Clients
At Indian cricket scale, we cannot afford to have clients that rely completely on the server systems to make decisions. Tsunamis can overwhelm the back-end. Retries will make the problems worse. Clients must be smart about inferring when things don’t look right, and add “jitter” to the requests they make to the servers. Caching, exponential back-offs and panic protocols all come together to ensure a seamless customer experience.
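A minimal sketch of the kind of “smart” retry behaviour described above, assuming a generic fetch callable; the back-off constants are made up for illustration.

    import random
    import time

    def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
        """'Full jitter' exponential back-off: wait a random amount up to an
        exponentially growing cap, so millions of clients don't retry in lockstep."""
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    def fetch_with_retries(fetch, max_attempts: int = 5):
        """fetch is any callable that raises on failure (e.g. a wrapped HTTP GET)."""
        for attempt in range(max_attempts):
            try:
                return fetch()
            except Exception:
                if attempt == max_attempts - 1:
                    raise          # give up only after the last attempt
                time.sleep(backoff_with_jitter(attempt))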

Three pillars

Our platform has three core pillars: the subscription engine, the metadata engine and our streaming infrastructure. Each of these has unique scale needs and was tweaked separately. We built pessimistic traffic models for each of them, on the basis of which we came up with ladders that controlled server farms depending on the estimated concurrency. Knowing what your key pillars are and what kind of patterns they are going to experience is pivotal when it comes to tuning. One size does not fit all.
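One way to picture such a ladder is a simple lookup from estimated concurrency to a pre-provisioned fleet size. The thresholds and server counts below are invented for illustration and are not Hotstar’s real numbers.

    # Each rung maps an estimated concurrency ceiling to a pre-warmed fleet size.
    LADDER = [
        (1_000_000, 200),     # up to 1M concurrent viewers -> 200 servers (illustrative)
        (3_000_000, 600),
        (6_000_000, 1_200),
        (10_000_000, 2_000),
    ]

    def servers_for_concurrency(estimated_concurrency: int) -> int:
        """Pick the fleet size for the rung that covers the estimated concurrency."""
        for ceiling, servers in LADDER:
            if estimated_concurrency <= ceiling:
                return servers
        # Beyond the last rung, fall back to the largest pre-provisioned fleet.
        return LADDER[-1][1]

    print(servers_for_concurrency(4_500_000))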

Once Only

Scaling effectively at this magnitude means that you drive away as much traffic as possible from the origin servers. Depending on your business patterns, using caching strategies on the serving layer as well as smart TTL controls on the client end, you can give your server systems breathing room.
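A rough sketch of serving-layer caching with a TTL, so repeated identical requests never reach the origin within the window; the ttl_cache decorator and the five-second TTL are assumptions for illustration.

    import time
    from functools import wraps

    def ttl_cache(ttl_seconds: float):
        """Cache a function's result for ttl_seconds per argument tuple."""
        def decorator(fn):
            store = {}

            @wraps(fn)
            def wrapper(*args):
                now = time.monotonic()
                hit = store.get(args)
                if hit and now - hit[1] < ttl_seconds:
                    return hit[0]          # served from cache, origin untouched
                value = fn(*args)
                store[args] = (value, now)
                return value
            return wrapper
        return decorator

    @ttl_cache(ttl_seconds=5.0)
    def match_metadata(match_id: str) -> dict:
        # Placeholder for the expensive origin call.
        return {"match_id": match_id, "status": "live"}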

Reject Early

Security is a key tenet, and we leverage this layer to also drive away traffic that doesn’t need to come to us at the top of the funnel. Using a combination of whitelisting and industry best practices, we drive away a lot of spurious traffic up front.
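The checks themselves are not described in the post, but an edge filter of this kind might look roughly like the sketch below; the allowlists and the signature flag are hypothetical stand-ins, not Hotstar’s rules.

    ALLOWED_PATH_PREFIXES = ("/api/", "/play/", "/subscribe/")   # illustrative allowlist
    ALLOWED_CLIENTS = {"android", "ios", "web"}

    def reject_early(path: str, client_id: str, signature_ok: bool) -> bool:
        """Return True if the request should be dropped at the edge, before it
        consumes any origin capacity. Checks are deliberately cheap."""
        if not signature_ok:                          # unsigned or tampered request
            return True
        if client_id not in ALLOWED_CLIENTS:          # unknown client build
            return True
        if not path.startswith(ALLOWED_PATH_PREFIXES):
            return True
        return False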

The Telescope

Like any other subscription platform, we’re ultimately beholden to the processing rates that our payment partners provide us. Sometimes during a peak, this might mean adding a jitter to our funnel to allow customers to follow through at an acceptable rate to enable a higher success rate overall. Again, these funnels / telescopes are designed keeping in mind the traffic patterns that your platform will experience. Often these decisions will need to involve customer experience changes to account for being gentler on the infrastructure.
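A crude illustration of such a funnel: admit checkout attempts no faster than the payment partner can process, and add a small random delay to spread the rest out. The PaymentFunnel class and the 200-per-second limit are assumptions for the sketch, not real figures.

    import random
    import threading
    import time

    class PaymentFunnel:
        """Pace checkout attempts to the partner's processing rate, with jitter."""

        def __init__(self, max_per_second: int):
            self.interval = 1.0 / max_per_second
            self.next_slot = time.monotonic()
            self.lock = threading.Lock()

        def wait_for_slot(self) -> float:
            with self.lock:
                now = time.monotonic()
                self.next_slot = max(self.next_slot, now) + self.interval
                delay = self.next_slot - now
            # Small jitter so requests released at the same instant don't collide.
            delay += random.uniform(0, self.interval)
            time.sleep(delay)
            return delay

    funnel = PaymentFunnel(max_per_second=200)   # illustrative partner limit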

The Latency Problem

As the leading OTT player in India, we’ve been steadily making improvements to our streaming infrastructure. It remains a simple motto: leaner on the wire, faster than broadcast. As simple as this sounds, it’s one of the most complex things to get right. Through the year we have brought down our latency numbers from being roughly 55s behind broadcast to approximately 15–20s behind broadcast, and only a couple of seconds behind on our re-done web platform.

This was a result of highly meticulous measurement of how much time each segment of our encoding workflow took, and then tweaking operations and encoder settings to do better. We did this by profiling the workflow and instrumenting each segment. This is another classical tenet: tuning cannot happen without instrumentation.
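A bare-bones version of that kind of instrumentation might look like this; the stage names and sleep times are placeholders, not the real encoding workflow.

    import time
    from collections import defaultdict
    from contextlib import contextmanager

    stage_timings = defaultdict(list)

    @contextmanager
    def timed_stage(name: str):
        """Record wall-clock time spent in one segment of the workflow."""
        start = time.monotonic()
        try:
            yield
        finally:
            stage_timings[name].append(time.monotonic() - start)

    # Illustrative pipeline; the real stages and their costs come from production.
    with timed_stage("ingest"):
        time.sleep(0.01)
    with timed_stage("encode"):
        time.sleep(0.03)
    with timed_stage("package"):
        time.sleep(0.02)

    for stage, samples in stage_timings.items():
        print(f"{stage}: {1000 * sum(samples) / len(samples):.1f} ms avg")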

We continue to tweak bit-rate settings to provide an uncompromised experience to our customers while at the same time being efficient in bandwidth consumption for Indian conditions.

Lower latencies and smarter use of player controls to provide a smooth viewing experience also help produce smoother traffic patterns, as fewer customers repeat the funnel, which would otherwise cause a lot of ripple through the whole system with its retries and the consequent additional events.

Server Morghulis (aka Client Resiliency)

The Hotstar client applications have been downloaded several hundred million times so far. Suffice it to say that when game time comes, millions of folks are using Hotstar as their primary screen. Dealing with such high concurrency means that we cannot think of a classical coupling of client with the back-end infrastructure.

We build our client applications to be resilient and to degrade gracefully. While we maintain a very high degree of availability, we also prepare for the worst by reviewing all the client-server interactions and either gracefully indicating that the servers are experiencing high load or flipping a variety of panic switches in the infrastructure. These switches indicate to our client applications that they should ease off momentarily, using either exponential back-off or sometimes a custom back-off depending on the interaction, so as to build jitter into the system and give the back-end infrastructure time to heal.
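A simplified sketch of how a client might honour such a switch, assuming the flags arrive as a small dictionary from a lightweight config endpoint; the flag names and back-off caps are invented for illustration.

    import random
    import time

    def next_backoff(server_flags: dict, attempt: int) -> float:
        """Return how long the client should wait before its next non-essential call.

        server_flags is an assumed shape ({"panic": bool, "custom_backoff_s": float});
        the real switch names in Hotstar's infrastructure are not public."""
        if not server_flags.get("panic"):
            return 0.0
        custom = server_flags.get("custom_backoff_s")
        if custom is not None:
            base = custom                      # fixed, server-chosen pause
        else:
            base = min(120.0, 2.0 ** attempt)  # exponential back-off, capped at 2 min
        # Jitter spreads retries out so recovery traffic doesn't arrive as a spike.
        return random.uniform(0.5 * base, base)

    time.sleep(next_backoff({"panic": True}, attempt=3))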

While the application has many capabilities, our primary function is to render video to our customers reliably. If things don’t look completely in control, specific functionality can degrade gracefully and keep the primary function unaffected.

Ensure that the primary function always works and build resiliency around server touch-points. Not every failure is fatal, and using intelligent caching with the right TTLs can buy a lot of precious headroom. This is an important tenet.
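As a final illustration, a secondary feature can be wrapped so its failure never touches playback; the render_social_panel function below is a hypothetical example, not Hotstar’s code.

    def render_social_panel(fetch_live_feed, cached_feed):
        """Secondary feature (e.g. a social panel): if its back-end is struggling,
        fall back to a cached or empty view instead of blocking video playback."""
        try:
            return fetch_live_feed()
        except Exception:
            # Non-fatal: the primary function (video) is untouched.
            return cached_feed or {"items": [], "degraded": True}

    def failing_feed():
        raise RuntimeError("social back-end overloaded")   # simulated failure

    print(render_social_panel(failing_feed, cached_feed=None))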