Stop Trying to Make Fetch Happen: Or, Make It Happen with Dataloader + GraphQL

By Alex Knowles, Senior Software Engineer

At CityBase, we are creating a platform that transforms the way local governments do business with their constituents. It’s an ambitious undertaking that requires our engineers to find creative solutions using the latest technologies. Previous CityBase software engineering blog posts have touched on some of the technologies and methodologies we use — such as Elixir, Agile development, and microservice architecture. In this post, I’ll discuss another: GraphQL.

GraphQL was developed internally by Facebook in 2012. It was open-sourced in 2015, and since then, many organizations have adopted it — such as Airbnb, Netflix, and PayPal.  GraphQL’s creators set out to solve several problems they were experiencing with existing web API methodologies, like REST.

One of these problems is overfetching, where an API response includes more data than the API consumer actually needs. Another is underfetching (closely related to the N+1 problem), where an API consumer must make a series of requests to get the data it requires. GraphQL addresses these inefficiencies by letting the API consumer request exactly the data it needs in a single request.

Pretty neat, right? Although GraphQL seems to “magically” solve these fetching problems for you, there’s a hidden gotcha. If you are using GraphQL without something called Dataloader, you may be trading solutions in one part of your app for problems in another.

The Problem with GraphQL Fetching

Before we can dig into what Dataloader is, let’s back up and understand the problem first.  As an example, we’ll imagine a schema for finding information about a local municipality. Here’s a query for a public official:

This might yield a result that looks like:

We can query for details about Leslie, such as her ‘name’ and ‘office’, but what if we could also discover connections to other employees, such as her ‘directReports’?

And the result:

Because Andy, April, and Tom are also Employees, it means that we could also request their ‘office’ field:
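A sketch of what such a query might look like, written as an Elixir heredoc so that it could be handed to Absinthe. The operation name and the employee ID are taken from later in this post; the exact field and argument spellings (`employee(id:)`, `directReports`, `office`) and the `MyApp.Schema` module are assumptions for illustration, not the original snippet.

```elixir
# Hypothetical reconstruction of the query; field and argument names are assumed.
query = """
query LesliesDirectReportsWithOffices {
  employee(id: "1000") {
    name
    office {
      name
    }
    directReports {
      name
      office {
        name
      }
    }
  }
}
"""

# With an Absinthe schema module (named MyApp.Schema here only for
# illustration), the document could be executed with:
#
#     Absinthe.run(query, MyApp.Schema)
IO.puts(query)
```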

The result:

Not surprisingly, all four of our Pawnee employees work in the same office, but it wouldn’t be hard to imagine an organization with employees distributed across multiple offices. Another thing to point out is that the same “Parks and Recreation” record is appearing in four places.

In a case like this, how does GraphQL go about fetching the “Parks and Recreation” data structure? Does it fetch the data once and render it in four places?  Or does it fetch and re-fetch the same record over and over? The answer: it depends.

Trying to Make Fetch Happen: A First Try at a GraphQL Resolver

While we might hope that GraphQL could magically do the smartest and most efficient thing, the reality is that it’s up to us to inform GraphQL of the best way to access our data layer.  GraphQL’s powerful query language and intuitive type system will get us a lot “for free” on the API side. But when it comes to fetching data — whether from a database, flat file, or a network service call — GraphQL only knows what we tell it. The mapping from an externally facing schema to the data storage structures underlying it is called a resolver.

Here is what a naive resolver for the Employee ‘office’ field might look like:

This is a very straightforward mapping. An Employee contains a foreign key to an Office. This resolver passes the `office_id` key to the `Core.office_by_id/1` function, which returns a collection of fields like `name` and `url`. This function accesses a data layer in some way — perhaps a database lookup or a remote service call — and is relatively expensive.
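As a rough sketch, such a resolver might look like the following. Only `Core.office_by_id/1` and the `office_id`, `name`, and `url` fields come from the text above; the `Resolvers` module, the `office_for_employee/3` name, the office ID `10`, and the stubbed data are placeholders so that the example runs without a real data layer.

```elixir
defmodule Core do
  # Stand-in for the real data layer. In practice this would be a database
  # lookup or a remote service call; the record below is made up.
  @offices %{
    10 => %{name: "Parks and Recreation", url: "https://example.com/parks"}
  }

  def office_by_id(office_id) do
    Map.fetch!(@offices, office_id)
  end
end

defmodule Resolvers do
  # Absinthe-style resolver: it receives the parent (an Employee), the field
  # arguments, and the resolution struct. It fetches eagerly, on every call.
  def office_for_employee(%{office_id: office_id}, _args, _resolution) do
    {:ok, Core.office_by_id(office_id)}
  end
end

# Hypothetical Employee record with a foreign key to an Office.
leslie = %{id: "1000", name: "Leslie", office_id: 10}

{:ok, office} = Resolvers.office_for_employee(leslie, %{}, %{})
IO.inspect(office.name)
```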

Too Much Fetch

Although this implementation makes sense, it can have some unintended consequences when you account for how GraphQL invokes resolvers. Going back to our query “LesliesDirectReportsWithOffices”, imagine how GraphQL would descend from the query’s “root” (Leslie) down to each “leaf” (Andy’s office, April’s office, and Tom’s office).

  1. Resolve an Employee for ID “1000”: this returns Leslie, including her `name` and `office_id` fields
  2. Resolve an Office for Leslie: using the function shown above, this returns the “Parks and Recreation” record, including fields like `name` and `url`.
  3. Resolve a list of Employees for Leslie’s direct reports: a `direct_reports_for_employee/3` resolver retrieves Employee records for Andy, April, and Tom based on their association with Leslie.
  4. Resolve an Office for Andy: using the function shown above, this returns the “Parks and Recreation” record, including fields like `name` and `url`.
  5. Resolve an Office for April: using the function shown above, this returns the “Parks and Recreation” record, including fields like `name` and `url`.
  6. Resolve an Office for Tom: using the function shown above, this returns the “Parks and Recreation” record, including fields like `name` and `url`.

At steps 1, 2, and 3, accessing the data layer is unavoidable. Steps 4, 5, and 6, however, are re-fetching information that we already have from step 2. 
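To make the cost of the steps above concrete, here is a small sketch that counts trips to a stubbed data layer when the naive resolver runs once per employee. The `FetchCounter` agent, the `NaiveResolver` module, and the shared office ID `10` are assumptions made for the demonstration.

```elixir
# Counts how many times the (stubbed) data layer is hit.
{:ok, _pid} = Agent.start_link(fn -> 0 end, name: FetchCounter)

defmodule Core do
  # Stand-in for the expensive lookup; every call counts as one trip.
  def office_by_id(office_id) do
    Agent.update(FetchCounter, &(&1 + 1))
    %{id: office_id, name: "Parks and Recreation"}
  end
end

defmodule NaiveResolver do
  def office_for_employee(%{office_id: office_id}, _args, _resolution) do
    {:ok, Core.office_by_id(office_id)}
  end
end

# Hypothetical records: all four employees point at the same office.
leslie = %{name: "Leslie", office_id: 10}
reports = for name <- ["Andy", "April", "Tom"], do: %{name: name, office_id: 10}

# Steps 2, 4, 5, and 6: one office resolution per employee.
for employee <- [leslie | reports] do
  {:ok, _office} = NaiveResolver.office_for_employee(employee, %{}, %{})
end

fetches = Agent.get(FetchCounter, & &1)
IO.puts("office fetches: #{fetches}")
# prints "office fetches: 4"
```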

There must be a better way!

The Solution: Dataloader to the Rescue!

Our first attempt at a resolver was:

  • eager: it fetches data right away
  • forgetful: it re-fetches data that it already fetched

But what if we had something that was:

  • lazy: waits until the last possible moment to fetch data
  • resourceful: caches fetched data for later

This is what we get from Dataloader. Here’s a version of the same resolver using Dataloader:

Before we dig into what this code does, let’s revisit how GraphQL would resolve the “LesliesDirectReportsWithOffices” query using this implementation.

  1. Resolve an Employee for ID “1000”: this returns Leslie, including her `name` and `office_id` fields
  2. Load a request for Leslie’s Office: using the function shown above, “load” a request for an office using an ID key
  3. Resolve a list of Employees for Leslie’s direct reports: similarly, a `direct_reports_for_employee/3` resolver retrieves Employee records for Andy, April, and Tom based on their association with Leslie.
  4. Run any pending requests: the request for Leslie’s office from step 2 is now “run” — this is where the data is actually fetched. When the fetch completes, a callback function “gets” the data to complete the resolution of Leslie’s Office.
  5. Load a request for Andy’s Office: using the function shown above, “load” a request for an office using an ID key 
  6. Load a request for April’s Office: using the function shown above, “load” a request for an office using an ID key
  7. Load a request for Tom’s Office: using the function shown above, “load” a request for an office using an ID key
  8. Run any pending requests: the requests for Andy’s, April’s, and Tom’s offices all use the same key. That key also matches the data already fetched in step 4, so there are no pending requests.
    • The callback for Andy’s Office resolver is run: it “gets” the cached data to complete the resolution of Andy’s Office.
    • The callback for April’s Office resolver is run: it “gets” the cached data to complete the resolution of April’s Office.
    • The callback for Tom’s Office resolver is run: it “gets” the cached data to complete the resolution of Tom’s Office.

In this case, we make 1 trip instead of 4 to fetch the “Parks and Recreation” office data!

The “load” part of step 2 happens on line 8 of the code snippet. The function accepts a `loader` from the GraphQL context and invokes `Dataloader.load/4`. Instead of immediately fetching data for `office_id`, the key is added to an `:office_by_id` batch.

Next, on line 9, a callback is supplied via `on_load/2`, a helper that Absinthe provides in `Absinthe.Resolution.Helpers`. This anonymous function is not called right away; it is saved for later.
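The load-then-callback shape described above might be sketched as follows. This is a reconstruction under assumptions, so its line positions need not match the line references in the text: the `Resolvers` module, the function name, and the use of `Core` as the Dataloader source name are all hypothetical, and `on_load/2` is imported here from `Absinthe.Resolution.Helpers`, where Absinthe defines it. `Mix.install/1` is used only so the sketch can run as a standalone script.

```elixir
# Standalone-script setup; in a real project these would be deps in mix.exs.
Mix.install([{:absinthe, "~> 1.7"}, {:dataloader, "~> 2.0"}])

defmodule Resolvers do
  # on_load/2 is provided by Absinthe's resolution helpers.
  import Absinthe.Resolution.Helpers, only: [on_load: 2]

  # Assumes a Dataloader was placed in the context under :loader and that a
  # source was registered under the (hypothetical) name Core.
  def office_for_employee(%{office_id: office_id}, _args, %{context: %{loader: loader}}) do
    loader
    # "load": queue the key in the :office_by_id batch; nothing is fetched yet.
    |> Dataloader.load(Core, :office_by_id, office_id)
    # Register the callback that "gets" the record once the batch has run.
    |> on_load(fn loader ->
      {:ok, Dataloader.get(loader, Core, :office_by_id, office_id)}
    end)
  end
end

# A throwaway key-value source so the resolver can be called outside a schema.
source =
  Dataloader.KV.new(fn :office_by_id, ids ->
    Map.new(ids, fn id -> {id, %{name: "Parks and Recreation"}} end)
  end)

loader = Dataloader.new() |> Dataloader.add_source(Core, source)

result =
  Resolvers.office_for_employee(%{office_id: 10}, %{}, %{context: %{loader: loader}})

# The resolver does not return {:ok, office}; it hands Absinthe a deferred
# middleware instruction that carries the loader and the saved callback.
IO.inspect(elem(result, 0))
```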

Just Enough Fetch

GraphQL determines when it can no longer defer data fetching, at which point it calls `Dataloader.run/1` (not shown here). In step 4, there is a request pending, so the request for Leslie’s office is completed along with any other keys that may have been batched — potentially in a single operation. The results are cached and then almost immediately retrieved when GraphQL invokes the resolver callback that was registered in step 2.

During steps 5, 6, and 7, the resolvers load the office IDs for Andy, April, and Tom. Dataloader is smart enough to recognize that this key has already been loaded. In step 8, Dataloader recognizes that there are no pending requests because the loaded key already has a cached result. The callback functions can “get” the result from memory rather than from additional trips to the data layer.
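The load, run, and get sequence can also be tried by hand, outside of GraphQL, using Dataloader's key-value source. The following is a minimal sketch, assuming a stubbed `Core.fetch/2` batch function, a made-up office ID `10`, and a `FetchCounter` agent that exists only to count trips to the data layer.

```elixir
# Standalone-script setup; in a real project this would be a dep in mix.exs.
Mix.install([{:dataloader, "~> 2.0"}])

{:ok, _pid} = Agent.start_link(fn -> 0 end, name: FetchCounter)

defmodule Core do
  # Made-up data standing in for the real data layer.
  @offices %{10 => %{name: "Parks and Recreation"}}

  # Dataloader.KV calls this with the batch key and a MapSet of pending IDs,
  # and expects a map of ID => record back. Each call is one trip.
  def fetch(:office_by_id, ids) do
    Agent.update(FetchCounter, &(&1 + 1))
    Map.new(ids, fn id -> {id, Map.get(@offices, id)} end)
  end
end

loader =
  Dataloader.new()
  |> Dataloader.add_source(Core, Dataloader.KV.new(&Core.fetch/2))

# Step 2: queue Leslie's office. Step 4: run the pending batch, then "get" it.
loader = loader |> Dataloader.load(Core, :office_by_id, 10) |> Dataloader.run()
leslies_office = Dataloader.get(loader, Core, :office_by_id, 10)

# Steps 5-7: Andy, April, and Tom each load the same key.
loader =
  Enum.reduce(["Andy", "April", "Tom"], loader, fn _employee, acc ->
    Dataloader.load(acc, Core, :office_by_id, 10)
  end)

# Step 8: check for pending work, run, and "get" the record again.
IO.inspect(Dataloader.pending_batches?(loader), label: "pending batches")
loader = Dataloader.run(loader)
report_office = Dataloader.get(loader, Core, :office_by_id, 10)

IO.inspect(leslies_office == report_office, label: "same record")
IO.inspect(Agent.get(FetchCounter, & &1), label: "trips to the data layer")
```

In an Absinthe application you would not call `Dataloader.run/1` yourself; the `Absinthe.Middleware.Dataloader` plugin does so between resolution passes.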

A Perfect Pairing: Dataloader + GraphQL

If you’re using GraphQL in your web applications, you should know about Dataloader and what it can do. (For more on this, you can check out this video of my talk “GraphQL + Elixir = Absinthe,” which I recently gave at a Chicago Elixir Meetup.)

As with any application, technology solutions for government and utilities must scale to meet demand without compromising performance. For instance, your electricity company might see minimal traffic to its applications throughout the month, with a massive surge in users during peak billing times. By using tools like Dataloader, you’re enabling your systems to work smarter, not harder. As a result, your system will run efficiently to deliver a seamless customer experience, even when experiencing high traffic.
