Data Processing in Netflix Using Kafka And Apache Chukwa

When you click on a video Netflix starts processing data in various terms and it takes less than a nanosecond. Let’s discuss how the evolution pipeline works on Netflix. 

Netflix uses Kafka and Apache Chukwe to ingest the data which is produced in a different part of the system. Netflix provides almost 500B data events that consume 1.3 PB/day and 8 million events that consume 24 GB/Second during peak time. These events include information like:

  • Error logs
  • UI activities
  • Performance events
  • Video viewing activities
  • Troubleshooting and diagnostic events.

Apache Chukwe is an open-source data collection system for collecting logs or events from a distributed system. It is built on top of HDFS and Map-reduce framework. It comes with Hadoop’s scalability and robustness features.

  • It includes a lot of powerful and flexible toolkits to display, monitor, and analyze the result.
  • Chukwe collects the events from different parts of the system and from Chukwe you can do monitoring and analysis or you can use the dashboard to view the events.
  • Chukwe writes the event in the Hadoop file sequence format (S3). After that Big Data team processes these S3 Hadoop files and writes Hive in Parquet data format.
  • This process is called batch processing which basically scans the whole data at the hourly or daily frequency.

To upload online events to EMR/S3, Chukwa also provide traffic to Kafka (the main gate in real-time data processing).

  • Kafka is responsible for moving data from fronting Kafka to various sinks: S3, Elasticsearch, and secondary Kafka.
  • Routing of these messages is done using the Apache Samja framework.
  • Traffic sent by the Chukwe can be full or filtered streams so sometimes you may have to apply further filtering on the Kafka streams.
  • That is the reason we consider the router to take from one Kafka topic to a different Kafka topic.  

System Design Netflix | A Complete Architecture

Designing Netflix is a quite common question of system design rounds in interviews. In the world of streaming services, Netflix stands as a monopoly, captivating millions of viewers worldwide with its vast library of content delivered seamlessly to screens of all sizes. Behind this seemingly effortless experience lies a nicely crafted system design. In this article, we will study Netflix’s system design.

Important Topics for the Netflix System Design

  • Requirements of Netflix System Design
  • High-Level Design of Netflix System Design
    • Microservices Architecture of Netflix 
  • Low Level Design of Netflix System Design
    • How Does Netflix Onboard a Movie/Video?
    • How Netflix balance the high traffic load
    • EV Cache
    • Data Processing in Netflix Using Kafka And Apache Chukwa
    • Elastic Search
    • Apache Spark For Movie Recommendation
  • Database Design of Netflix System Design

Similar Reads

1. Requirements of Netflix System Design

1.1. Functional Requirements...

2. High-Level Design of Netflix System Design

We all are familiar with Netflix services. It handles large categories of movies and television content and users pay the monthly rent to access these contents. Netflix has 180M+ subscribers in 200+ countries....

2.1. Microservices Architecture of Netflix

Netflix’s architectural style is built as a collection of services. This is known as microservices architecture and this power all of the APIs needed for applications and Web apps. When the request arrives at the endpoint it calls the other microservices for required data and these microservices can also request the data from different microservices. After that, a complete response for the API request is sent back to the endpoint....

3. Low Level Design of Netflix System Design

3.1. How Does Netflix Onboard a Movie/Video?...

3.1. How Does Netflix Onboard a Movie/Video?

Netflix receives very high-quality videos and content from the production houses, so before serving the videos to the users it does some preprocessing....

3.2. How Netflix balance the high traffic load

1. Elastic Load Balancer...

3.3. EV Cache

In most applications, some amount of data is frequently used. For faster response, these data can be cached in so many endpoints and it can be fetched from the cache instead of the original server. This reduces the load from the original server but the problem is if the node goes down all the cache goes down and this can hit the performance of the application....

3.4. Data Processing in Netflix Using Kafka And Apache Chukwa

When you click on a video Netflix starts processing data in various terms and it takes less than a nanosecond. Let’s discuss how the evolution pipeline works on Netflix....

3.5. Elastic Search

In recent years we have seen massive growth in using Elasticsearch within Netflix. Netflix is running approximately 150 clusters of elastic search and 3, 500 hosts with instances. Netflix is using elastic search for data visualization, customer support, and for some error detection in the system....

3.6. Apache Spark For Movie Recommendation

Netflix uses Apache Spark and Machine learning for Movie recommendations. Let’s understand how it works with an example....

4. Database Design of Netflix System Design

Netflix uses two different databases i.e. MySQL(RDBMS) and Cassandra(NoSQL) for different purposes....

Contact Us