Data Processing in Netflix Using Kafka And Apache Chukwa
When you click on a video Netflix starts processing data in various terms and it takes less than a nanosecond. Let’s discuss how the evolution pipeline works on Netflix.
Netflix uses Kafka and Apache Chukwe to ingest the data which is produced in a different part of the system. Netflix provides almost 500B data events that consume 1.3 PB/day and 8 million events that consume 24 GB/Second during peak time. These events include information like:
- Error logs
- UI activities
- Performance events
- Video viewing activities
- Troubleshooting and diagnostic events.
Apache Chukwe is an open-source data collection system for collecting logs or events from a distributed system. It is built on top of HDFS and Map-reduce framework. It comes with Hadoop’s scalability and robustness features.
- It includes a lot of powerful and flexible toolkits to display, monitor, and analyze the result.
- Chukwe collects the events from different parts of the system and from Chukwe you can do monitoring and analysis or you can use the dashboard to view the events.
- Chukwe writes the event in the Hadoop file sequence format (S3). After that Big Data team processes these S3 Hadoop files and writes Hive in Parquet data format.
- This process is called batch processing which basically scans the whole data at the hourly or daily frequency.
To upload online events to EMR/S3, Chukwa also provide traffic to Kafka (the main gate in real-time data processing).
- Kafka is responsible for moving data from fronting Kafka to various sinks: S3, Elasticsearch, and secondary Kafka.
- Routing of these messages is done using the Apache Samja framework.
- Traffic sent by the Chukwe can be full or filtered streams so sometimes you may have to apply further filtering on the Kafka streams.
- That is the reason we consider the router to take from one Kafka topic to a different Kafka topic.
System Design Netflix | A Complete Architecture
Designing Netflix is a quite common question of system design rounds in interviews. In the world of streaming services, Netflix stands as a monopoly, captivating millions of viewers worldwide with its vast library of content delivered seamlessly to screens of all sizes. Behind this seemingly effortless experience lies a nicely crafted system design. In this article, we will study Netflix’s system design.
Important Topics for the Netflix System Design
- Requirements of Netflix System Design
- High-Level Design of Netflix System Design
- Microservices Architecture of Netflix
- Low Level Design of Netflix System Design
- How Does Netflix Onboard a Movie/Video?
- How Netflix balance the high traffic load
- EV Cache
- Data Processing in Netflix Using Kafka And Apache Chukwa
- Elastic Search
- Apache Spark For Movie Recommendation
- Database Design of Netflix System Design
Contact Us