In-band authentication is like letting anyone come into your house, but locking every cabinet and drawer to keep your belongings safe.
I recently spent weeks following a Google rabbit hole into the best practices for GraphQL authentication.
I found a lot of information, but the advice was conflicting. At times, I thought the advice was downright dangerous, especially if you’re working with highly sensitive data.
My goal with this post is to help you understand what your choices are and the upsides and potential downsides of each option. I’ll also share the setup I used to handle authentication in what appears to be the safest manner.
Before we dive in, I’ll assume you’re familiar with common GraphQL terms like “query,” “mutation,” “resolver,” and “context.” But if you’re not (or if you just need a refresher), check out this awesome introduction to GraphQL guide.
A search for answers about GraphQL authentication
My journey down the rabbit hole began when I was building a production-ready HIPAA-compliant version of a healthcare application prototype.
As with many prototypes, security hadn’t been a top priority. The current authentication code had been copied from one of the many tutorials that come up when you search for “GraphQL authentication.” It wasn’t wrong, but I worried it wasn’t secure enough to protect the highly sensitive data we were dealing with in real-world settings.
At first, I couldn’t understand how the app even logged users in.
There were some unusual patterns in the /graphql endpoint that raised red flags. In particular, there was an array of strings (GraphQL query names like login and securityQuestion) saved to a variable called whitelisted_queries.
Any query that matched whitelisted_queries was allowed to bypass the authentication guard and execute unauthenticated.
How was the list of whitelisted_queries populated and maintained? Was it secure? How easy would it be to bypass the list of allowed queries?
Was there something fundamental about GraphQL that I didn’t understand?
Like many programmers, I turned to Google to find out more. Most of the resources I found applied only to prototypes. There were very few recommendations for how to handle real-world applications.
Understanding the two approaches to authentication in GraphQL
In my initial research, I found that there are two basic approaches to authentication for GraphQL APIs.
- “In-band”: Authentication implemented within the GraphQL API
- “Out-of-band”: Authentication implemented using a separate API endpoint or service
In-band authentication
In an in-band authentication system, the same GraphQL API that’s used for the rest of the application also manages the login and signup processes. It accomplishes this by using mutations like login and signup in the GraphQL API.
An in-band implementation would have the following:
- Creating credentials: A signup mutation accepts a username and password and creates a user
- Verifying credentials: A login mutation accepts a username and password
- Issuing tokens: The outcome of the login will be a token (a secure, unique string that the client can use to make requests)
- Verifying tokens: Identify which tokens in the headers are valid and reject those that are not
- Loading user data: Find the user identified by the token and make it available within the resolved query by putting it into the GraphQL query context
Out-of-band authentication
With an out-of-band implementation, authentication is not handled in your GraphQL API. Instead, it takes place at a separate endpoint in the application (maybe a RESTful endpoint) or through a completely separate external authentication provider like Okta or Auth0. These providers check credentials and issue a token (almost certainly a JWT) that is then sent with requests to the GraphQL API.
In this setup, the API is only accessible to authenticated users who are then loaded into the context before query execution ever begins.
An out-of-band implementation only needs the endpoint at /graphql to do the following:
- Verify tokens: Identify which tokens in the headers are valid and reject those that are not
- Load user data: Find the user identified by the token and make it available within the resolved query by putting it into the GraphQL query context
The separate authentication endpoints handle the remaining responsibilities of creating and verifying credentials and issuing tokens.
Note: It’s possible that a GraphQL API could handle authenticated and unauthenticated clients, regardless of the validation method. But I’ll ignore these cases since they likely require more complex decisions to keep the application secure than just choosing between in-band and out-of-band authentication.
Authentication in GraphQL can be a confusing topic
When trying to decide which approach was best for this project, nearly all of the tutorials I found recommended in-band authentication. For example, my first Google result for “graphQL ruby authentication” was a How to GraphQL tutorial describing a full implementation of authentication as GraphQL mutations (in-band authentication). I got similar results for many of the other programming languages I tried.
Even some of the more “official” documentation I found, like Apollo’s article on authentication and authorization, suggested extracting authentication from HTTP headers and putting the user into the query context, which leaves you uninformed about the actual implementation.
In the fine print, almost every tutorial suggesting the in-band approach said something to the effect of “this is not a production-ready implementation” or “for a real project, you probably want a separate authentication service.”
The red flags continued when I found other official documentation that avoided the topic of authentication almost to a fault. For example, searching the otherwise excellent graphql.org for “authentication” returns only two results.
One is a link to Authentication and Express Middleware, which suggests performing authentication just like REST, using the same middleware. The other hit for “authentication” is this terse and easy-to-overlook sentence:
“GraphQL should be placed after all authentication middleware, so that you have access to the same session and user information you would in your HTTP endpoint handlers.”
What sealed the deal for me was this line from the official documentation for the Ruby GraphQL implementation:
“In general, authentication is not addressed in GraphQL at all. Instead, your controller should get the current user based on the HTTP request (eg, an HTTP header or a cookie) and provide that information to the GraphQL query.”
Despite all the recommendations for in-band authentication, I arrived at the conclusion that, in most cases, you should handle authentication outside of GraphQL.
Finding the flaws of in-band authentication
The healthcare application I was productizing handled authentication in GraphQL, using mutations like login and signup. The mutations were as expected for an in-band authentication pattern. As we established earlier, all of the login- and signup-related mutations were accessible without authentication.
To protect the API, the /graphql endpoint loaded users via the HTTP Authorization token, rejecting unauthenticated clients before calling the GraphQL query processor for all protected queries.
However, before authenticating the request, the operation_name parameter was compared with the allowed list of mutations and queries. If the name matched one of the allowed signup or login mutation names, it was allowed to bypass authentication checks.
There was a major flaw in that implementation.
The operation_name parameter is set by the client. A minimally knowledgeable bad actor could bypass the allowed query filter by sending one of the permitted operation names. This was easy since it was the same list of query names you used to log in or sign up. Trusting user input is a reliable signature of a major security vulnerability.
For example, if a bad actor sends an operation name like Login (which is normally allowed, as the operation_name parameter usually matches the mutation name), but then performs a different mutation or query (let’s use nefariousAccess in this example), they can bypass the authentication system entirely and run almost anything.
Example:
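The original example did not survive in this copy of the article, so what follows is only a minimal sketch of the kind of request described above. The guard function, the contents of the allowlist, and the nefariousAccess field are hypothetical stand-ins, not the application’s actual code.

```python
import json

# Hypothetical allowlist, mirroring the whitelisted_queries array described above.
WHITELISTED_QUERIES = {"login", "signup", "securityQuestion"}


def bypasses_authentication(request_body: str) -> bool:
    """Flawed guard: trusts the operation name supplied by the client."""
    payload = json.loads(request_body)
    return payload.get("operationName") in WHITELISTED_QUERIES


# The operation is *named* login, but the field it selects is something else.
malicious_request = json.dumps({
    "operationName": "login",
    "query": "mutation login { nefariousAccess(id: 1) { secretData } }",
})

# The guard only looks at the name, so this request would skip authentication.
print(bypasses_authentication(malicious_request))  # True
```

Nothing in the guard ever looks at the query text, which is why renaming the operation is enough to get past it.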
Trying to find a solution
To confirm my theory that we needed to switch to out-of-band authentication while trying to find a workaround to our current setup, I first wrote a test showing that the current security processes could be bypassed using something like the above example.
By learning how our implementation (in this case graphql-ruby) parsed the fields being accessed, I wrote a simple temporary fix that checked what actual root-level fields were in the mutation or query. This allowed me to authorize access only to the intended mutations and queries by name instead of trusting their client-declared operation_name.
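A rough sketch of that idea, assuming the GraphQL library has already parsed the document and can hand over the names of the root-level fields. The ParsedOperation type and the PUBLIC_ROOT_FIELDS set are hypothetical; the real graphql-ruby AST looks different, and this is not the fix we shipped.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical set of root fields that may run without a logged-in user.
PUBLIC_ROOT_FIELDS = frozenset({"login", "signup", "securityQuestion"})


@dataclass(frozen=True)
class ParsedOperation:
    """Stand-in for what a GraphQL parser reports about one operation."""
    operation_name: str           # chosen by the client, so never trusted
    root_fields: Tuple[str, ...]  # fields actually selected at the top level


def may_skip_authentication(operation: ParsedOperation) -> bool:
    """Allow unauthenticated access only if every root field is public."""
    if not operation.root_fields:
        return False
    return all(name in PUBLIC_ROOT_FIELDS for name in operation.root_fields)


honest = ParsedOperation("login", ("login",))
disguised = ParsedOperation("login", ("nefariousAccess",))
mixed = ParsedOperation("login", ("login", "nefariousAccess"))

print(may_skip_authentication(honest))     # True
print(may_skip_authentication(disguised))  # False
print(may_skip_authentication(mixed))      # False
```

Requiring every root field to be public matters: a request that mixes login with a protected field has to be rejected as a whole.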
This rudimentary fix, not one that I would endorse as a solution, starts to reveal the complexity of allowing partial unauthenticated access.
Once a query or mutation was allowed to process unauthenticated, it encountered the normal authorization system that handled the rest of the queries. The public login/signup mutations purposely bypassed the authorization checks since no user was loaded.
Other strange issues were encountered on the way “out” of the mutation. When the mutation fails, the user is still not authenticated, which means certain return payload types also need to be public. In this arrangement, the return type of authentication, which may contain sensitive details, was considered unauthenticated by the API. Its return value could not be verified as belonging to the correct user since one did not exist yet for this request.
Even more confusing, when one of the public mutations succeeded, the user was still not in the GraphQL query context as they should be in a normal request. That causes the authorization system to get confused when rendering the response payload. If you’ve secured the return types correctly, the authorization system does not want to return the user’s token when the context shows there is no authenticated user.
To get around this, most in-band tutorials will have you set the user into the context mid-execution. This is generally considered bad practice in GraphQL. The context is not meant to be changed during query execution because, among many other reasons, the order of the resolution of fields is not guaranteed.
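One way to make that rule hard to break by accident is to build the context once and hand resolvers a read-only view of it. This is only a sketch of the idea in Python; build_context and the shape of the user record are assumptions for illustration, not part of any GraphQL library.

```python
from types import MappingProxyType


def build_context(current_user):
    """Build the per-request context once, before execution starts."""
    return MappingProxyType({"current_user": current_user})


context = build_context({"id": 42, "name": "Ada"})

try:
    # What an in-band login resolver might attempt mid-execution.
    context["current_user"] = {"id": 99}
except TypeError as error:
    print("context is read-only:", error)
```

The proxy only protects the top-level mapping. Objects stored inside it can still be mutated, so this is a guard rail rather than a guarantee.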
Understanding the downsides of potential “fixes”
If you make all of the accommodations above and your authentication is working, then you may wonder: What’s the big deal? There seem to be many pros to this setup: Your authentication is in GraphQL and you don’t have to buy or develop a separate system. Your clients also need only one library to connect to you, instead of adding a separate non-GraphQL REST client.
So with some extra care, an in-band implementation is fine, right?
Nope, there are still a number of potential landmines with this implementation that are difficult to avoid.
1. The in-band auth mutations are more separate than they seem.
One theoretical advantage of in-band authentication is that it’s all contained within the same GraphQL API. However, in practice, it acts more like a separate subset of the API, yet it lacks clear boundaries that help isolation. Actual client implementations need to treat some of the mutations and queries differently because of their different authentication behaviors and error handling.
You gain very little by keeping them all under one endpoint because every unauthenticated mutation must be treated differently than the authenticated ones. This puts the app at risk of data leaks caused by engineers who are unaware of this special treatment. The special attention needed every time a type or mutation is updated may scare engineers away, allowing that part of the code to languish unmaintained, or worse, make the entire GraphQL API a scary place to work for fear of accidental data leaks. It only takes one leak to ruin confidence.
2. There is an additional, highly complex surface area that must be protected from attacks.
As is clear from the problems with our implementation, there’s a lot of surface area for mistakes. Checking the wrong variable, operation_name, instead of the actual fields accessed by the query led to a massive hole in our security. Even checking fields is a risky way to filter queries.
These mistakes are much easier to make when you try to control public access in an ultrafine-grained way.
3. The entire server-side API implementation becomes more complicated.
With in-band authentication, most of the API needs to at least consider the possibility of a null and dangerously unauthenticated user during the request. To achieve this, roles and authorization must be rock solid and carefully error tolerant.
In practice this is troublingly difficult. By their very nature, the mutations that handle unauthenticated users are more complex. Likewise, the queries that should only respond for authenticated users must be cautious. The most troubling behavior I saw was that resolvers would sometimes run successfully, performing some business logic, before blocking the result because the return field required authentication.
To avoid leaks, we had to resort to the blunt approach of raising exceptions. Unfortunately, using the authorization system to reject unauthenticated queries by use of raised exceptions is just the sort of abstraction-breaking exceptions-as-flow-control programming that makes code difficult to read and write. This can expose the app to even more risk as fear of complexity and poor understanding wreak havoc on code quality and standards.
Further complicating matters is the convention of returning null for unauthorized fields. This is an elegant convention for a well-designed GraphQL authorization system, but it makes things more complicated and confusing for clients that could access the resource, but actually need to authenticate. How does the client know if it needs to log in or if the resource simply doesn’t exist or is not authorized? The only solution I could see required two separate paths for every unauthorized field in the query: return null for logged-in users or raise an exception for unauthenticated users.
This results in a reimplementation of the HTTP 401 Unauthorized status sloppily spliced into GraphQL. The way this was done in our prototype used special signaling for a certain type of failure that meant “unauthenticated” and would trigger a 401 error.
4. Clients become more complicated, not less.
The whole rationale for putting authentication in-band is to avoid complicating the client with multiple styles of endpoint (for example, REST and GraphQL). However, to do its job right, the client needs to call certain queries without authentication and/or change the authentication between retries.
A simpler out-of-band implementation allows separate networking layers. The logic handles situations where HTTP error codes are returned at a lower level in the networking library, triggering authentication flows that don’t need any understanding of GraphQL. All GraphQL requests stay within the world of GraphQL, blissfully ignorant of HTTP.
The separation of concerns makes out-of-band clients, even iOS and Android apps, easier to develop despite requiring two service layers (e.g. for Apollo client and Alamofire).
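To illustrate that lower networking layer, here is a small sketch of a transport wrapper that reacts to a 401 by re-authenticating and retrying. The send, get_token, and refresh_token callables are hypothetical hooks that a real client would wire to its HTTP library and its authentication provider.

```python
from typing import Callable, Tuple

# (status_code, body) pair returned by the hypothetical transport.
Response = Tuple[int, str]


def send_with_reauth(
    send: Callable[[str], Response],
    get_token: Callable[[], str],
    refresh_token: Callable[[], str],
    max_retries: int = 1,
) -> Response:
    """Send a request, re-authenticating and retrying when the server replies 401."""
    response = send(get_token())
    retries = 0
    while response[0] == 401 and retries < max_retries:
        # Run the out-of-band authentication flow, then try again.
        response = send(refresh_token())
        retries += 1
    return response
```

Nothing here needs to know the request body is GraphQL. The wrapper only looks at the HTTP status code, so the GraphQL client above it never has to think about authentication.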
Why we chose to move to out-of-band authentication
Even though it was crystal clear we needed to migrate our healthcare app to an out-of-band solution, we were hesitant at first. After all of the workarounds, the API wasn’t necessarily broken. Why fix it?
A big change like this would also require migrating customers and working with vulnerable legacy code (which is always scary).
The opportunity finally came when we needed to implement a password-less authentication feature.
To accomplish this, we chose an out-of-band third-party authentication provider, Auth0. This used JSON Web Tokens (JWTs), which our system was already built to support.
Removing in-band authentication from our GraphQL API was no easy task. It took more than a year. Other projects regularly take precedence over nebulous risks that aren’t a problem right now.
In the end, our migration strategy involved calling every user and asking them to update the app. It was either that or spend two months rewriting the authentication code to support a smooth transition (which would be necessary for any larger non-prototype application). Our users weren’t technically savvy. For some, it was the first app they had ever downloaded to their phones.
Avoid the pain of this migration and do it the right way from the start.
A quick note on using a third-party provider vs. doing it yourself
A few readers may wonder why we chose to use an authentication provider versus writing our own solution. As in any development discussion, the following is just my opinion and I’m always open to considering a different approach.
In my experience, it’s easier to use a third party and then move on to other features, knowing your app is secure. If you eventually want to own your authentication system down the line, you can always just mirror the interface and add it to your final product.
If you take the time to write your own solution now, but your requirements change, you’ll be in a position where your app might not be secure. If you eventually want to set up a third-party authentication system, it’s much harder to make the switch from a custom-written solution.
There are valid downsides to working with a third party: namely cost, but also reduced flexibility, external dependencies, and reduced control. Most services don’t get expensive until you have tens of thousands of users and the one we used would even be free at first. Once you reach the point where your thousands of users are costing too much, you’re hopefully making enough money to afford to invest in building a custom solution. You’ll also be in a better position to analyze the trade-offs and make an informed decision now that you have a working product.
A step-by-step guide to implementing out-of-band authentication
Here’s an overview of how we handled the new implementation:
- Use a separate authentication endpoint (ideally a separate provider like Auth0 that has already made and fixed all the mistakes I would inevitably make by writing it from scratch).
- Require authentication for the entire GraphQL API with middleware higher up the request stack.
- Immediately return a 401 status code when authentication fails so that clients can easily interrupt the request, refresh authentication, and retry.
- Load the user into context before resolving the GraphQL query. Don’t change context during a query.
Now, we’ll go into each step in detail.
I made an example node app for this article that tries to exemplify these behaviors. The README should help you run the app locally to try it out. I included cURL examples, but API clients like GraphiQL or Insomnia are easier ways to try it out. If you find any problems in the app, please open an issue on GitHub. The required Auth0 setup is free, but it might be the trickiest part. Please reach out to their support team if you want to run the full proof of concept.
Step 1. Use a separate authentication endpoint
There are many product choices to be made regarding user login experience. I’ll stay agnostic since they shouldn’t affect the resulting behavior. I picked Auth0 for my example app because it’s free and I’ve used it recently. Any authentication service that provides these basic behaviors will work.
Ultimately, we want an authentication server that allows the client to trigger an authentication flow for any new or returning user. The result should be a JWT on success or a message on failure. On success, the client includes the JWT in the authorization header for any requests to the /graphql endpoint, like so:
Authorization: Bearer eylotsofcharactersjson.web.tokenstuff
In practice, your client will simply need to request authentication any time it determines that it doesn’t have a valid token, either by receiving a 401 or by checking for expiration directly.
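Checking for expiration directly can be as simple as reading the exp claim out of the token’s payload. The sketch below assumes a standard three-part JWT with a numeric exp claim; the function name and the leeway value are my own choices. It does not verify the signature, so it is only a client-side hint about when to re-authenticate, never a security check.

```python
import base64
import json
import time


def jwt_is_expired(token: str, leeway_seconds: int = 30) -> bool:
    """Read the exp claim without verifying the signature (client-side hint only)."""
    try:
        payload_segment = token.split(".")[1]
        # JWT segments are base64url without padding, so restore it first.
        padded = payload_segment + "=" * (-len(payload_segment) % 4)
        claims = json.loads(base64.urlsafe_b64decode(padded))
        expires_at = float(claims["exp"])
    except (IndexError, KeyError, TypeError, ValueError):
        return True  # unreadable token: treat it as expired and re-authenticate
    return time.time() >= expires_at - leeway_seconds
```

The server still has the final say. If the clock on the device is wrong, the 401 path catches what this check misses.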
To create my example app, I followed the Auth0 Node.js (Express) tutorial.
Step 2. Require authentication for the entire GraphQL endpoint
Use middleware or filters to check for authentication before processing anything GraphQL-related.
View this behavior in the example app I made for this article.
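The example app is written for Node, but the shape of the idea is the same in any stack. Here is a minimal sketch as Python WSGI middleware. The verify_token callable, the auth.claims key, and the stand-in graphql_app are assumptions for illustration, not the example app’s code.

```python
from typing import Callable, Optional


def require_authentication(app, verify_token: Callable[[str], Optional[dict]]):
    """Wrap a WSGI app so unauthenticated requests never reach GraphQL."""

    def middleware(environ, start_response):
        header = environ.get("HTTP_AUTHORIZATION", "")
        scheme, _, token = header.partition(" ")
        claims = None
        if scheme.lower() == "bearer" and token:
            claims = verify_token(token)  # hypothetical: returns claims or None
        if claims is None:
            start_response(
                "401 Unauthorized",
                [("WWW-Authenticate", 'Bearer realm="GraphQL"'),
                 ("Content-Length", "0")],
            )
            return [b""]
        # Hand the verified claims to the GraphQL layer for its context.
        environ["auth.claims"] = claims
        return app(environ, start_response)

    return middleware


def graphql_app(environ, start_response):
    """Stand-in for the real GraphQL handler; it only runs for verified users."""
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b'{"data": {}}']
```

Because the check wraps the whole endpoint, there is no list of exceptions to maintain and no query has to be inspected to decide whether it is allowed.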
Step 3. Return 401 on failed or expired authentication
When authentication fails, return 401 Unauthorized and a WWW-Authenticate header. This helps client developers know that authentication is expected.
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer realm="GraphQL"
View this behavior in the example app I made for this article.
Note: In this case, and with GraphQL in general, don’t use 401 Unauthorized as a status to indicate that an API resource exists but the user doesn’t have sufficient permissions to see it.
I have run into confusion with this before. Trying to overburden the 401 status with this user permissions logic muddies client behavior. In GraphQL, return null for resources that are denied because of permissions, as if they aren’t found.
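As a sketch of that convention, a resolver can give the same answer for “not found” and “not permitted.” The RECORDS data, the context key, and the resolver name below are all hypothetical.

```python
from typing import Optional

# Hypothetical in-memory data: record id -> (owner id, contents).
RECORDS = {
    "r1": ("user-1", "blood panel"),
    "r2": ("user-2", "x-ray"),
}


def resolve_record(context: dict, record_id: str) -> Optional[str]:
    """Return the record, or None when it is missing or belongs to someone else."""
    # The user is always present here: authentication ran before GraphQL did.
    current_user_id = context["current_user_id"]
    record = RECORDS.get(record_id)
    if record is None or record[0] != current_user_id:
        return None  # same answer for "not found" and "not permitted"
    return record[1]


context = {"current_user_id": "user-1"}
print(resolve_record(context, "r1"))       # blood panel
print(resolve_record(context, "r2"))       # None
print(resolve_record(context, "missing"))  # None
```

The resolver never has to ask whether anyone is logged in, which is the payoff of rejecting unauthenticated requests before execution.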
Step 4. Load user into context before execution
When authentication succeeds, load the claims from the JWT and save the relevant ones to the GraphQL context. More often than not, it makes sense to load the user model from the database and put it in the context. Use your discretion about whether this makes sense for your API.
In the example, I chose to only load the JWT claims. I didn’t add a database to the app, but I did give an example of how I would use one to perform this behavior.
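To make the claims-to-context step concrete, here is a simplified sketch using only the Python standard library. It assumes an HS256 token signed with a shared secret, which is a simplification: hosted providers commonly sign with asymmetric keys, and in that case you should use a maintained JWT library. The function names and the context keys are my own.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment that has had its padding stripped."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None
        signing_input = f"{header_b64}.{payload_b64}".encode()
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
        if float(claims["exp"]) <= time.time():
            return None
    except (ValueError, KeyError, TypeError, AttributeError):
        return None
    return claims


def build_context(claims: dict) -> dict:
    """Copy only the claims resolvers need into the GraphQL context."""
    return {
        "user_id": claims["sub"],
        "scopes": tuple(claims.get("scope", "").split()),
    }
```

The context is built once from verified claims and then passed to the query executor. If you load a user model from the database instead, this is the place to do it, before execution begins.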
Why do so many tutorials suggest in-band authentication?
At first glance, in-band authentication seems perfectly logical. The authentication system is tightly bounded and comprehensible, exactly the sort of project for an eager engineer implementing a new GraphQL API.
Many of us have written REST API authentication. Why not write it in GraphQL too?
I suspect that the reason these tutorials go into so much detail about in-band authentication is that the authors want to show off the power of GraphQL. The query language is shiny and new. Like many folks, when I’m learning a new tool, I’ve been prone to thinking that it will solve more problems than it really can.
Unfortunately, the in-band authentication pattern, once copied from these tutorials, quickly becomes detrimental for an engineer bringing a prototype to life. As the app grows and the cracks in the solution become apparent, it can be expensive to fix. Some of the trade-offs we accept by using GraphQL (like query complexity scoring, caching difficulties, etc.) make it ill-suited for a production authentication system.
GraphQL tries to be a good answer to almost everything about API design, but it’s not a free ride. GraphQL APIs still require good API design practices. Because of the difficulty of implementing brute-force protection, and the complexity of “locking every cabinet in the house,” I don’t recommend in-band authentication for almost any GraphQL API.
Is there ever a time for in-band authentication?
There are certainly particular projects that would work with in-band authentication. As I said earlier, it can be great for building a prototype or toy project where the login doesn’t need to be secure.
One example might be the application PurpleAir, which allows you to access public data about your local air quality. Once you log in, the app remembers your settings, none of which contain highly sensitive data. I still believe that a separate authentication system is a better choice, but I’d be willing to trade absolute security for simplicity when the security needs are minimal.
However, if you’re in any situation where you need to prevent brute-force login attempts, then I don’t suggest building the authentication in GraphQL. Instead, I strongly suggest using a separate authentication container.
That said, if you have an application where you are using in-band authentication and avoided most of the pitfalls listed above, then I’d love to hear how it goes (or how it went) so I can adjust my opinion accordingly.