How Netflix Scales its API with GraphQL

Netflix is a subscription-based streaming service that lets users watch TV shows and movies on any internet-connected device. It is popular for its exclusive content and 4K streaming. But how does it manage all of this behind the scenes, and how does it serve so many users seamlessly?

Many things make this possible, and GraphQL is one of them. GraphQL is an open-source query language and server-side runtime that specifies how clients interact with application programming interfaces (APIs). In this article, we will look at how Netflix scales its API with GraphQL in detail.

Introduction

  • Netflix became popular for its streaming service and reached a very large number of users. It had to scale its services to accommodate this growth, and dealing with the increasing complexity of its data and the relationships within it became a challenge.
  • According to a 2020 Netflix TechBlog post, the Netflix API team found the Apollo Federation specification to be the ideal way to scale their GraphQL architecture.
  • In this model, the individual GraphQL schemas become subgraphs, each owned by one service, and the subgraphs are composed into a unified supergraph. This let Netflix keep its integrated “Consumer Edge” API, the single entry point that client apps talk to. As a result, teams could deliver faster without compromising the user experience.
  • Netflix does not publish official documentation for this API, but developers can access data such as movie reviews and ratings from the Netflix data catalog.

History

  • Netflix launched its public API on October 1st, 2008, together with a blog and code samples for developers. Developers built applications on this API, such as InstaWatcher and WhichFlicks.
  • The Netflix app previously used a different graph API technology called Falcor. When Falcor was built in 2012, GraphQL was not yet publicly available (it was open-sourced in 2015). The two share similar concepts, but by 2020 GraphQL was far more popular, so Netflix moved to GraphQL.
  • Netflix uses federation in its API. Federation is a way of breaking the API into pieces that can be developed independently, with each piece handling a single domain. By July 2019, Netflix had started building a GraphQL gateway based on Apollo’s reference implementation.
  • The gateway was written in Kotlin, a JVM language, which gave the team access to Netflix’s Java ecosystem along with language features for efficient fetching. Use of the federated graph has grown rapidly over the years.

GraphQL and Federation

  • GraphQL is an open-source query language and server-side runtime built around the idea of “get exactly what you asked for”, with no under-fetching or over-fetching of data. Think of a GraphQL query as a grocery list for an API.
  • The client specifies the data it needs, and the server (the grocer) delivers just that. GraphQL is often seen as a successor to REST APIs. It makes it easier to gather data from multiple sources, and it describes data with a type system instead of exposing multiple endpoints.
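
The “get exactly what you asked for” idea can be illustrated with a minimal sketch. This is not how a GraphQL server is implemented; the `MOVIE` record, its field names, and the `select_fields` helper are hypothetical and exist only to show that the response contains the requested fields and nothing more.

```python
# Hypothetical catalogue record; the field names are illustrative only.
MOVIE = {
    "id": 1,
    "title": "Example Movie",
    "description": "A placeholder description.",
    "rating": "PG",
    "cast": ["Actor A", "Actor B"],
}


def select_fields(record, requested):
    """Return only the requested fields, mimicking a GraphQL selection set."""
    missing = [name for name in requested if name not in record]
    if missing:
        # A real GraphQL server rejects unknown fields during validation.
        raise KeyError(f"unknown fields: {missing}")
    return {name: record[name] for name in requested}


# The client asks for two fields and receives exactly those two.
result = select_fields(MOVIE, ["title", "rating"])
print(result)
```

A REST endpoint would typically return the whole record here, including `description` and `cast`, even if the client only needed the title and rating.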

The principle behind Federation

  • First, federation breaks a large API into smaller, independent services, called microservices, each focused on a specific data domain such as user data or movie data. Each microservice can then be developed, scaled, and updated independently, which increases flexibility.
  • A central gateway then collects the data from all of these microservices and combines their individual schemas into a single unified GraphQL schema for the whole API.

Note: Federation is all about combining multiple GraphQL APIs into a single, federated graph. This federated graph lets clients interact with multiple APIs through a single request.

How is Federation implemented and used at Netflix?

When the Netflix app opens, the first screen we see is the LOLOMO, short for list of lists of movies: the home page, filled with rows of recommendations and popular TV shows and movies. It is built by fetching data from many microservices, such as:

  • Service that returns a list of top 10 movies or series.
  • Artwork service that provides personalised images (like posters or thumbnails) for each movie.
  • Movie metadata service that returns the movie titles, actor details, and descriptions.
  • LOLOMO service that decides which lists make up the user’s home page, based on the user’s preferences and activity.

As the image below shows, these microservices are called DGSs. DGS (Domain Graph Service) is an in-house framework that Netflix developed for building GraphQL services. When Netflix started using GraphQL and federation, no Java framework was mature enough to use at Netflix scale. So they took the low-level graphql-java library and extended it with features such as code generation for schema types and support for federation. At its core, though, a DGS is just a Java microservice with a GraphQL endpoint and a schema.

For example, the LOLOMO DGS defines a Show type with only a title. The images DGS can then extend that Show type by adding an artwork URL. The two DGSs operate independently and do not share information with each other. Each one only needs to publish its own schema to the federated gateway, which can communicate with any DGS through its GraphQL endpoint.
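
The Show example can be sketched as a toy schema composition. This is an illustration under stated assumptions, not the DGS framework or the Apollo composition algorithm: schemas are reduced to plain dictionaries, the `compose` function and the service names are hypothetical, and the shared `id` field is an assumption added so that the two definitions of `Show` have something in common to join on.

```python
# Hypothetical subgraph schemas, reduced to {type name: {field name: field type}}.
# The shared "id" field is an assumption; the article only mentions title and artwork.
LOLOMO_SCHEMA = {"Show": {"id": "ID", "title": "String"}}
IMAGES_SCHEMA = {"Show": {"id": "ID", "artworkUrl": "String"}}


def compose(subgraphs):
    """Merge per-service schemas and record which service declared each field first."""
    supergraph = {}
    owners = {}
    for service, schema in subgraphs.items():
        for type_name, fields in schema.items():
            merged = supergraph.setdefault(type_name, {})
            for field, field_type in fields.items():
                if field in merged and merged[field] != field_type:
                    raise ValueError(f"conflicting types for {type_name}.{field}")
                merged[field] = field_type
                # The first service to declare a field is treated as its owner.
                owners.setdefault((type_name, field), service)
    return supergraph, owners


supergraph, owners = compose({"lolomo": LOLOMO_SCHEMA, "images": IMAGES_SCHEMA})
print(supergraph["Show"])
print(owners[("Show", "artworkUrl")])
```

Neither service knows about the other; only the composition step sees both schemas, which mirrors how the DGSs stay independent.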

How Netflix Scales its API with GraphQL?

  • Federation starts by breaking the API into pieces that can be developed independently, usually one per domain, and each piece is usually implemented by domain experts. A graph-aware gateway, the central junction in this architecture, then ties the pieces together into a single API.
  • The gateway itself contains no business logic. It follows a declarative configuration that tells it which data comes from which service. This is the federation that Netflix uses for scaling.
  • A federated architecture usually has three components: graph services, a schema registry, and a graph gateway. The graph services are GraphQL servers. Each one exposes only its portion of the overall schema and publishes it to the schema registry.
  • The registry holds the schemas of all the services. The gateway takes a single query from the client and breaks it into sub-queries that are then executed against the graph services. It processes each request in two phases: query planning and query plan execution.
  • Query planning examines the client request and groups the requested fields by the service that provides them. Query plan execution traverses the plan from the root node, running its steps in parallel or in sequence as dependencies allow, and merges the results into the overall response.
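
The planning step described above can be sketched as follows. This is a simplified illustration, not the planner used by Apollo Gateway or Netflix: the `FIELD_OWNERS` map stands in for what a schema registry might provide, and the service names and the plan's dictionary layout are assumptions.

```python
# Hypothetical ownership map, standing in for data from a schema registry.
FIELD_OWNERS = {
    "recommendedVideos": "recommendations",
    "title": "video",
    "description": "video",
    "boxartUrl": "images",
}


def plan_query(root_field, child_fields, owners=FIELD_OWNERS):
    """Build a two-stage plan: fetch the root field, then group child fields by service."""
    by_service = {}
    for field in child_fields:
        by_service.setdefault(owners[field], []).append(field)
    return {
        "fetch": {"service": owners[root_field], "fields": [root_field]},
        # These fetches depend only on the root result, so they may run in parallel.
        "then_parallel": [
            {"service": service, "fields": fields}
            for service, fields in by_service.items()
        ],
    }


plan = plan_query("recommendedVideos", ["title", "boxartUrl", "description"])
print(plan)
```

Grouping by service means each graph service receives one sub-query containing all of its fields, rather than one request per field.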

Example

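A sample GraphQL schema corresponding to the above image can be written as:

type Query {
  # Root field: the videos recommended for the current user.
  recommendedVideos(first: Int): [Video]
}

type Video {
  videoId: Int
  title: String
  description: String
  boxartUrl: String
  rating: Rating
  matchScore: Int
  # A trailer is itself a Video, so selections can be nested.
  trailer: Video
}

# The enum values are omitted in this example; a valid GraphQL enum
# must declare at least one value.
enum Rating {}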

Using GraphQL for the consumer Netflix App

  • Consider a really simple graph API. Starting at the root of the graph, which GraphQL calls the Query type, we can fetch the recommended videos for a user and then step through each of those videos.
  • The Video type has more fields that we can fetch, such as title and rating. The key takeaway is that the client chooses the fields it wants, and can then follow relationships and recursively select fields from related objects.
  • The actual Netflix graph is far more complex than this, with many more fields and relationships. To keep it manageable, it is broken up into smaller parts.
  • With GraphQL Federation, each distinct domain or logically meaningful portion of the graph is served by a different service.
  • The API aggregation layer then composes these into a single unified graph. In short, the big picture is broken into smaller pieces that fit together like a puzzle.
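
Following relationships and selecting fields recursively can be sketched like this. The `VIDEOS` data and the `resolve` function are hypothetical, and the nested dictionary is only a stand-in for a GraphQL selection set; a real server resolves fields through its schema and resolvers.

```python
# Hypothetical in-memory graph: videos keyed by ID, with trailers referenced by ID.
VIDEOS = {
    1: {"videoId": 1, "title": "Feature", "matchScore": 97, "trailer": 2},
    2: {"videoId": 2, "title": "Feature (Trailer)", "matchScore": 97, "trailer": None},
}


def resolve(video_id, selection):
    """Resolve a nested selection such as {"title": True, "trailer": {"title": True}}."""
    if video_id is None:
        return None
    video = VIDEOS[video_id]
    result = {}
    for field, sub_selection in selection.items():
        if isinstance(sub_selection, dict):
            # A nested selection means the field is a relationship to another Video.
            result[field] = resolve(video[field], sub_selection)
        else:
            result[field] = video[field]
    return result


data = resolve(1, {"title": True, "trailer": {"title": True}})
print(data)
```

Because `trailer` is itself a `Video`, the same function handles it, which is the recursion the schema's `trailer: Video` field implies.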

Query Plan And Query Plan Execution

  • Look again at our initial schema. Suppose we want the top 10 videos for a user and, for each one, the title and the box art image to display. We have to fetch the top recommended videos first, because we need their video IDs to know which titles and image URLs to fetch. The query plan is built around this dependency.
  • The recommended videos are fetched first; then, at the same time, titles are fetched from the video service and box art URLs from the images service. Traversing and executing these fetch nodes is referred to as query plan execution.
  • In a nutshell, the server parses and validates the query and creates a data retrieval plan based on dependencies: the query plan. Retrieving the data according to that plan, in parallel where possible, and gathering it into the response is query plan execution.
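
The execution order described above can be sketched with Python's `asyncio`. The three `fetch_*` coroutines are hypothetical stand-ins for calls to the recommendations, video, and images services, and the data is made up; only the ordering (one dependent fetch, then two concurrent fetches, then a merge) reflects the text.

```python
import asyncio

# Hypothetical stand-ins for service data; real services would be network calls.
TITLES = {1: "Show A", 2: "Show B"}
BOXART = {1: "https://example.com/a.jpg", 2: "https://example.com/b.jpg"}


async def fetch_recommended(first):
    await asyncio.sleep(0)  # placeholder for network latency
    return [1, 2][:first]


async def fetch_titles(video_ids):
    await asyncio.sleep(0)
    return {vid: TITLES[vid] for vid in video_ids}


async def fetch_boxart(video_ids):
    await asyncio.sleep(0)
    return {vid: BOXART[vid] for vid in video_ids}


async def execute_plan(first):
    # Stage 1: the video IDs are needed before anything else can be fetched.
    video_ids = await fetch_recommended(first)
    # Stage 2: titles and box art are independent, so fetch them concurrently.
    titles, boxart = await asyncio.gather(
        fetch_titles(video_ids), fetch_boxart(video_ids)
    )
    # Merge the partial results into one response, preserving recommendation order.
    return [
        {"videoId": vid, "title": titles[vid], "boxartUrl": boxart[vid]}
        for vid in video_ids
    ]


response = asyncio.run(execute_plan(first=2))
print(response)
```

With real network calls, the second stage takes roughly as long as the slower of the two fetches rather than the sum of both.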

Limitation of Federation

  • Only Apollo Gateway is a ready-to-use, self-hosted federation gateway implementation; the rest are still under development and not fully functional.
  • It has limited support for custom directives (instructions within GraphQL to increase functionality).

    a. No built-in mechanism for federated directives.

    b. Per-service directives, if any, get removed by the gateway.

    c. Workarounds (temporary solutions) exist but are unsupported.

  • Service startup: the “Hello World” setup assumes that the services are already running when the gateway starts, which is not ideal for disaster scenarios.
  • Type naming conflicts: some names, such as “Service”, are used by the tooling and cannot be used for anything else.
  • It does not currently support subscriptions.

Conclusion

In conclusion, Netflix’s adoption of GraphQL and the Apollo Federation specification for its API architecture has been instrumental in scaling its services to reach a large user base. By breaking the API into smaller, independent services and using a central gateway to compose them into a unified graph, Netflix has achieved faster delivery without compromising the user experience. Although federation has some limitations, GraphQL and federation have enabled Netflix to handle the growing complexity of its data and its relationships, supporting the continued success of its streaming service.
