New Details Revealed About Netflix Data Platform

Netflix’s chief executive played down any aggressive shift to more interactive content or the metaverse during an earnings call on Jan. 20, saying the company wouldn’t be in anything “for the sake of a press release.” The company, which saw $50 billion wiped from its market value after missing fourth-quarter subscriber growth projections, appears poised to face tough questions about new monetization options amid increasingly intense competition in streaming.

A nascent shift to more interactive forms of media is already putting pressure on the company’s complex cloud-native data platform, which is heavily based on open-source software like Apache Druid, Apache Flink, Apache Iceberg and Apache Kafka. The company’s engineers recently revealed the development of an innovative new tool that they use to detect and manage failing workloads in production environments.

In 2021, Netflix launched a mobile game streaming service and now offers 10 games.

But asked “how much” harder Netflix executives want to push into mobile gaming, the metaverse, or more interactive experiences, co-CEO Ted Sarandos said: “We have to be different in this area. There’s no point in just being in it. It’s very dilutive of the whole proposal.”

“When [our] mobile gaming is the world leader… and we are at [where we are at with] movie today, two of the top 10 for our games, so you might be asking, ‘OK, what’s next’. Because we are definitely ‘crawl, walk, run’,” he said on the earnings call.

His response came as Netflix engineers revealed they were working on a “self-diagnosis and correction system” called Pensive for what they described as one of the most complex data platforms in the cloud, “on which our data scientists and engineers run batch and stream workloads”, noting that as Netflix enters the world of gaming, pressure on its batch workflows and real-time data pipelines is growing rapidly.

“The data platform is built on multiple distributed systems, and due to the inherent nature of these systems, it is inevitable that these workloads will periodically experience outages,” engineers Vikram Srivastava and Marcelo Mayworm noted in a January 14 blog post.

Netflix has always been refreshingly open about its underlying digital infrastructure, and has been a major contributor to open source, releasing tools spanning runtime containers, libraries, and services that power microservices, cybersecurity tools like Security Monkey, and the distributed big data orchestration service Genie.

“At our scale,” the two engineers noted in their blog, “even a tiny percentage of downed workloads can generate a substantial operational support load for the data platform team when troubleshooting involves manual steps. And we can’t ignore the productivity impact this has on users of the data platform.” (To contextualize this scale: Netflix streams media, through a customized interface, to over 221 million subscribers worldwide.)

Their team created a tool called “Pensive” that supports self-diagnosis and troubleshooting on Netflix’s batch and streaming workloads. For batch workflows, for example, the tool collects the logs of failed jobs and extracts their stack traces. As they detailed: “Pensive relies on a regular expression-based rules engine that has been curated over time. Rules encode information indicating whether an error is due to a platform problem or a user bug and whether the error is transient or not.” (If the error is transient, the scheduler retries the step.)
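The kind of regular expression-based rules engine described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the `Rule` fields, the example patterns and their classifications, and the `should_retry` helper are all hypothetical and chosen only to show how a rule might encode error source and transience.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    pattern: re.Pattern   # regex matched against the stack trace
    source: str           # "platform" or "user" (hypothetical labels)
    transient: bool       # True if a retry is likely to succeed


# Hypothetical rules for illustration only; the real rule set is not public.
RULES = [
    Rule(re.compile(r"Connection (reset|refused|timed out)"), "platform", True),
    Rule(re.compile(r"java\.lang\.OutOfMemoryError"), "user", False),
    Rule(re.compile(r"AnalysisException: .*cannot resolve"), "user", False),
]


def classify(stack_trace: str) -> Optional[Rule]:
    """Return the first rule whose pattern occurs in the stack trace, if any."""
    for rule in RULES:
        if rule.pattern.search(stack_trace):
            return rule
    return None  # unknown error: no rule matched


def should_retry(stack_trace: str) -> bool:
    """A scheduler would retry only when a matching rule marks the error transient."""
    rule = classify(stack_trace)
    return rule is not None and rule.transient
```

In this sketch, an unmatched stack trace returns `None`, which corresponds to the "unknown error" case that would need a new rule.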

What Netflix is doing now is feeding unknown errors into a machine-learning process that can come up with new regular expressions for common errors, the two said, admitting it is a work in progress: classifying the source of each error, and whether it is transient in nature, is not yet automated. “In the future, we are looking to automate this process.”
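To give a feel for how candidate regular expressions might be derived from unmatched errors, here is a deliberately simple stand-in. It is not machine learning and not the process the engineers describe: it just masks numbers and hex values, groups identical templates, and turns frequent templates into regexes. The function names and the `min_count` threshold are assumptions made for this sketch.

```python
import re
from collections import Counter

# Tokens treated as variable parts of a message: hex literals and integers.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def to_template(message: str) -> str:
    """Replace variable tokens with a placeholder so similar errors group together."""
    return _VARIABLE.sub("<VAR>", message)


def propose_patterns(unmatched, min_count=2):
    """Return (regex, count) pairs for templates seen at least min_count times."""
    counts = Counter(to_template(m) for m in unmatched)
    proposals = []
    for template, n in counts.most_common():
        if n < min_count:
            continue
        # Escape the fixed text and let \w+ stand in for each masked token.
        regex = r"\w+".join(re.escape(part) for part in template.split("<VAR>"))
        proposals.append((regex, n))
    return proposals
```

Each proposed pattern would still need a person to decide whether the error is a platform problem or a user bug, and whether it is transient.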

[Figure: Pensive, Netflix’s data platform diagnosis and remediation tool. Credit: Netflix]

Real-time stream processing tasks in the Netflix Data Platform are supported by Apache Flink.

“Most Flink jobs run under a managed platform called Keystone, which abstracts away the details of the underlying Flink job and allows users to consume data from Apache Kafka feeds and publish it to different data stores like Elasticsearch and Apache Iceberg on AWS S3,” the blog notes.

Pensive uses Kafka and the Apache Druid real-time analytics database to identify streaming errors that can usually be diagnosed fairly quickly once isolated. “Once the individual diagnoses are stored in a Druid table, our monitoring and alerting system called Atlas performs aggregations every minute and sends alerts when there is a sudden increase in the number of failures due to platform errors. This has led to a dramatic reduction in the time needed to detect hardware issues or bugs in recently deployed data platform software,” they noted.
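The per-minute aggregation and spike alerting described above can be illustrated with a small, self-contained sketch. It does not use the Druid or Atlas APIs; the `(timestamp, source)` record shape, the trailing-window baseline, and the `window`, `factor` and `min_failures` parameters are all assumptions chosen for illustration.

```python
from collections import Counter


def platform_failures_per_minute(diagnoses):
    """Count platform-caused failures per minute.

    diagnoses: iterable of (epoch_seconds, source) pairs, where source is
    "platform" or "user" (a hypothetical record shape).
    """
    counts = Counter()
    for ts, source in diagnoses:
        if source == "platform":
            counts[ts // 60] += 1  # bucket by whole minute
    return counts


def spike_minutes(counts, window=5, factor=3.0, min_failures=5):
    """Return minutes whose count jumps well above the trailing average."""
    alerts = []
    for minute in sorted(counts):
        # Average over the previous `window` minutes; missing minutes count as 0.
        baseline = sum(counts.get(minute - i, 0) for i in range(1, window + 1)) / window
        current = counts[minute]
        if current >= min_failures and current > factor * baseline:
            alerts.append(minute)
    return alerts
```

The `min_failures` floor is there so that a quiet period followed by a single failure does not trigger an alert; a real alerting rule would likely be tuned quite differently.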

The team wants to extend Pensive’s reach beyond failed jobs and optimize it to determine why jobs have become slow, they said. (We’ll keep an eye out for any future open-sourcing of what could be an extremely useful tool.) They also want to refine it so they can “automatically configure batch workflows to complete successfully or become faster and use fewer resources when possible. An example where this can significantly help is Spark jobs, where memory tuning is a big challenge.”

The last quarter brought Netflix to a total of 221.8 million subscribers, but growth is slowing sharply: in 2020, it added 37 million new subscribers; in 2021 it added half of that, 18 million, under pressure from Amazon Prime, Hulu, Apple TV, Disney+, YouTube TV and other streaming rivals. Net profit for Netflix was $607 million for the quarter.
