Pagination Schemes

Data Liberation and TUBS Syncables

For over 15 years now, with Unhosted, Solid, PDS Interop and other open source projects, I have been working on data liberation: allowing people to get a meaningful copy of their data (not just a zip file) out of online platforms and into a personal data store. In my current NLNet-funded project, "The Ultimate Bookkeeping System (TUBS)", I'm even generalising the concept of data liberation to multi-homed data, where we no longer think of data as belonging in a specific system of record, but in multiple ones at the same time. The core component of TUBS will be Devonian, but the simpler one I'm starting with is (for now) called Syncables.

The goal of Syncables is to provide a tool that can create a local client-side copy of data that lives server-side behind an API. OpenAPI specs are a great help with that, because they specify which API endpoints are available, and which schema the downloaded data will be in. Even the security schemes such as OAuth flows and their endpoints can be specified, so for APIs that use OAuth, all you really have to add to get a client that works out of the box is a client ID and a client secret. But there are a few things that OpenAPI does not provide.

One important piece of information that is needed to understand the data that lives behind an API is how endpoints relate to each other. For instance, if some endpoint has a {projectId} parameter in its URL, you may first need to fetch the list of the current user's projects, and then substitute the id field value of each project there.
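
To make that dependency concrete, here is a minimal sketch that expands a URL template once per item of a parent list. The path "/projects/{projectId}/tasks", the sample projects and the helper name expand_path are made up for illustration; they are not taken from any particular API or from the Syncables code.

```python
def expand_path(template: str, param: str, parents: list[dict], id_field: str = "id") -> list[str]:
    """Return one concrete path per parent item by filling in {param}."""
    placeholder = "{" + param + "}"
    return [template.replace(placeholder, str(item[id_field])) for item in parents]


# Hypothetical result of first listing the current user's projects.
projects = [{"id": 7, "name": "alpha"}, {"id": 9, "name": "beta"}]

# One child endpoint URL per project.
paths = expand_path("/projects/{projectId}/tasks", "projectId", projects)
```

What OpenAPI does not say is that "projectId" should be filled from the "id" field of the projects list; that link is the extra information a sync tool needs.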

If you want to use an SQL database for the local data copy, then nested objects in API response bodies need to be dealt with in some way, possibly with linked tables. I haven't fully worked that part out yet. Maybe I will prefer to have some sort of "schema store" that can deal with relational tables but also with nested objects.

There are a number of pieces of information that are needed to relate read and write operations of an API, as well as side effects. For instance, a POST may lead to an item being added to a collection, and it may also have side effects in terms of billing, in terms of messages being sent out to other systems, or even in terms of physical items being shipped or manufactured.

Also, there are a number of aspects of APIs that are outside the scope of OpenAPI specs, but that will be useful for the Syncables project, such as webhooks, websockets, GraphQL, gRPC, and others. So I still have a lot of aspects of the Syncables project to cover, and I find it an exciting project to work on. But I identified a small part of the work that I think can neatly be solved as a stand-alone problem, and that can also be useful in other contexts.

Four Common Pagination Schemes

When reading a list of items from an API, the result is often paginated. For instance, if a collection has 42 items in it, a request to list these items may first return only the first 10 results (the first "page" of the data), and then allow additional HTTP requests to retrieve the other pages. From experience I knew that many APIs offer a page parameter where the API client can specify which page number it wants to retrieve. Others offer a slightly more flexible offset parameter where the API client can specify the index of the first row to return, a pattern we also know from SQL.
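
As a rough sketch of these first two schemes, the snippet below uses an in-memory stand-in for the server, holding the 42 items from the example. The parameter names "page", "offset" and "limit", and the choice to start page numbers at 1, are assumptions; real APIs differ on all of them.

```python
ITEMS = list(range(1, 43))  # a collection of 42 items


def fake_server(params: dict) -> list[int]:
    """Stand-in for an HTTP GET; understands `page` or `offset`, plus `limit`."""
    limit = params.get("limit", 10)
    if "page" in params:
        start = (params["page"] - 1) * limit  # page numbers assumed to start at 1
    else:
        start = params.get("offset", 0)
    return ITEMS[start:start + limit]


def fetch_by_page(fetch, limit: int = 10) -> list[int]:
    """Request page 1, 2, 3, ... until a short or empty page comes back."""
    results, page = [], 1
    while True:
        batch = fetch({"page": page, "limit": limit})
        results.extend(batch)
        if len(batch) < limit:
            return results
        page += 1


everything = fetch_by_page(fake_server)
```

With a page size of 10, page 3 addresses the same rows as offset 20. The extra flexibility of offset is that it can also start in the middle of a page, for instance at offset 15.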

Another popular (and potentially more robust) pagination scheme is to provide an opaque token in the response of each request except for the last one, which the client can then add as a parameter in the next request. This has the advantage that such a token can encode a snapshot identifier, so that if an item is inserted or removed during pagination, the results don't skew. For instance, suppose a client retrieves items 1-10 as a first page, then item 3 is removed from the collection, and the client then requests the second page of the data. Item 11 will have moved from page 2 to page 1 to fill up the space left behind by the removal of item 3, and the second page would consist of items 12-21, so the client never gets to see item 11. But if the token encodes the fact that this client started pagination before the removal of item 3, then the server can consistently serve up items 11-20 in response to the second request.
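
The scenario above can be replayed with a small in-memory model. This is only a sketch of one way a server might do it: the Collection class, the "snapshot:offset" token layout and the decision to keep full frozen copies are all assumptions for illustration, and a real server would probably encode or sign the token so that it stays opaque to the client.

```python
class Collection:
    def __init__(self, items):
        self.items = list(items)
        self.snapshots = {}  # snapshot id -> frozen copy of the items

    def page_by_number(self, page, size=10):
        """Plain page-number pagination over the live data."""
        start = (page - 1) * size
        return self.items[start:start + size]

    def page_by_token(self, token=None, size=10):
        """Return (items, next_token); next_token is None on the last page."""
        if token is None:
            # First request: freeze the collection as it is right now.
            snap_id, start = len(self.snapshots), 0
            self.snapshots[snap_id] = list(self.items)
        else:
            snap_id, start = (int(part) for part in token.split(":"))
        frozen = self.snapshots[snap_id]
        end = start + size
        next_token = f"{snap_id}:{end}" if end < len(frozen) else None
        return frozen[start:end], next_token


# Page numbers over live data: item 11 slips through the gap.
live = Collection(range(1, 43))
first = live.page_by_number(1)   # items 1-10
live.items.remove(3)
second = live.page_by_number(2)  # items 12-21

# Token with a snapshot: the second page is still items 11-20.
snap = Collection(range(1, 43))
first_t, token = snap.page_by_token()
snap.items.remove(3)
second_t, token = snap.page_by_token(token)
```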

A fourth way to allow pagination, which is both versatile and simple, is for the server to include a link in a response header or inside the response body of all requests except the last one, for the client to use to retrieve the next page. This means the client does not have to do any URL construction, and the server can even make use of page number, offset, and token pagination in that URL, since it is opaque to the client. As with token-based pagination, if no token or link is present in the response, the client knows that pagination is complete and all data has been received.
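
A client for this scheme can be very small. The sketch below assumes the link sits in a body field called "next" and the items in a field called "items"; both names, and the three canned responses, are made up. A link in a response header would be followed in the same way.

```python
# Canned responses keyed by URL. The server is free to mix cursor and
# page-number styles in its links, since the client never inspects them.
PAGES = {
    "/items": {"items": [1, 2, 3], "next": "/items?cursor=abc"},
    "/items?cursor=abc": {"items": [4, 5, 6], "next": "/items?page=3"},
    "/items?page=3": {"items": [7]},  # no "next" field: this is the last page
}


def fetch_all(fetch, first_url, items_field="items", link_field="next"):
    """Follow next-page links until a response comes back without one."""
    results, url = [], first_url
    while url is not None:
        body = fetch(url)
        results.extend(body[items_field])
        url = body.get(link_field)  # opaque to the client; absent on the last page
    return results


all_items = fetch_all(PAGES.__getitem__, "/items")
```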

Some Uncommon Pagination Schemes

I first researched different APIs for Let's Peppol. To get some more general real-world data, I then used APIs.guru as an information source. There, with some heuristic and manual search, I found 361 APIs that offer token-based pagination, 147 that understand a page number parameter for at least some of their endpoints, 135 that understand offset, and 116 that offer a next page link in some of their API responses. There are also a few APIs that offer a pagination scheme that does not fit into any of these four types.

The Heroku API accepts a Range header on some API endpoints, which allows the client to specify the id of the first item to include and the number of items to return.

The Billingo API accepts a parameter where the client can specify the last item id of the previous page.

Some endpoints of the Microsoft Cognitive Services (Training) API accept a continuation parameter that is combined with a session parameter to make up the function of the token.

The Scrada API implements a sort of message queue of which only the first few items are visible, and then allows the client to mark individual items from a collection as confirmed, so that the next items start showing up. This is, however, not repeatable: once an item has been confirmed, it has been popped from the queue and will never appear again.

Some APIs offer filters, for instance on date range, that can be used instead of pagination. And then there are API endpoints that return only the first page of search results, with no way to retrieve subsequent pages, so we don't count that as a pagination scheme either.

These observations led me to define a paginationSchemes extension for OpenAPI, similar to securitySchemes. The format I propose is as follows.

A paginationScheme should have a string-valued paginate property that is an absolute or relative JSON path to the array in the response body that is being paginated. Sometimes this is the root of the response body ("$"), sometimes it is a value called "results", "items" or "rows", and I guess in rare cases it might be nested, such as "result.rows".
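
A client could resolve the paginate value against a parsed response body roughly as follows. This sketch only understands "$" and plain dotted keys, with or without a leading "$."; that restriction is my simplification here, not part of the proposal, and the helper name resolve_paginate is hypothetical.

```python
def resolve_paginate(body, path):
    """Follow a path such as "$", "items" or "result.rows" into a parsed JSON body."""
    if path.startswith("$"):
        path = path[1:].lstrip(".")  # "$" -> "", "$.result.rows" -> "result.rows"
    node = body
    for key in filter(None, path.split(".")):
        node = node[key]
    if not isinstance(node, list):
        raise ValueError(f"{path or '$'} does not point at an array")
    return node


root_rows = resolve_paginate([1, 2], "$")
named_rows = resolve_paginate({"items": [3]}, "items")
nested_rows = resolve_paginate({"result": {"rows": [4, 5]}}, "result.rows")
```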

Apart from the "paginate" attribute it needs a "pageNumber", "offset", "token" and/or a "nextPageLink" object. In there, a string-valued "parameter" or "requestBody" attribute can indicate which query parameter (indicated by its name) or request body field (indicated with JSON path) can be used to specify a page number, offset or token in a page request.

A string-valued "responseBody" or "responseHeader" attribute can indicate where the next page link or the next token can be found in a page response (absence means the last page has been reached), and where applicable it can also specify where the current page number or the current offset can be found in the response, for information and sanity checking.

If the page size can be controlled with a parameter, and/or is reported in the response, this can be specified in the paginationScheme with a "pageSize" object, which can again have parameter or requestBody, and/or responseBody or responseHeader attributes.

Additional information for knowing when the last page has been reached can be described where available, in "totalCount", "pageCount", "lastPageLink", and "hasNext". Additional information for sanity checking and maybe for simplifying the client code can be described where available, in "previousPageLink", "currentPageLink", "firstPageLink", and "hasPrevious".
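
To illustrate how a client might use these hints, here is a sketch of a last-page check based on "hasNext" and "totalCount". It assumes that these objects carry a "responseBody" attribute naming a top-level field, in the same way as "pageSize" does; the field names "has_more" and "total" are invented for the example.

```python
def is_last_page(scheme, body, fetched_count):
    """Decide whether pagination is complete, using whichever hints the scheme describes."""
    has_next = scheme.get("hasNext")
    if has_next and has_next["responseBody"] in body:
        return not body[has_next["responseBody"]]
    total = scheme.get("totalCount")
    if total and total["responseBody"] in body:
        return fetched_count >= body[total["responseBody"]]
    return False  # no hint available; the caller needs another stopping rule


# Hypothetical scheme fragment and response bodies.
scheme = {
    "hasNext": {"responseBody": "has_more"},
    "totalCount": {"responseBody": "total"},
}
done_by_flag = is_last_page(scheme, {"has_more": False, "total": 42}, 10)
done_by_count = is_last_page(scheme, {"total": 42}, 42)
```

In this sketch an explicit hasNext flag wins over totalCount when both are present, because the count may change while the client is still paginating.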

Here is an example of a pagination scheme:

paginate: $
offset:
  parameter: offset
pageSize:
  parameter: limit
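
A client could interpret this example roughly as in the sketch below. The scheme is written out as a Python dict, and fake_fetch is an in-memory stand-in for the HTTP request, serving 25 made-up rows. The stopping rule (a page shorter than the requested size) is my assumption, since this scheme does not describe any last-page hint.

```python
SCHEME = {
    "paginate": "$",
    "offset": {"parameter": "offset"},
    "pageSize": {"parameter": "limit"},
}

ROWS = [f"row-{n}" for n in range(1, 26)]  # 25 made-up rows


def fake_fetch(params):
    """Stand-in for an HTTP GET; the body root is the array, hence paginate: $."""
    start, size = params["offset"], params["limit"]
    return ROWS[start:start + size]


def fetch_all(fetch, scheme, page_size=10):
    """Read every row, taking the parameter names from the scheme."""
    if scheme["paginate"] != "$":
        raise ValueError("this sketch only handles a root-level array")
    offset_param = scheme["offset"]["parameter"]
    size_param = scheme["pageSize"]["parameter"]
    results = []
    while True:
        page = fetch({offset_param: len(results), size_param: page_size})
        results.extend(page)
        if len(page) < page_size:  # short or empty page: nothing left
            return results


rows = fetch_all(fake_fetch, SCHEME)
```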

When an API offers pagination in some endpoints but not in others, it seems doable to write auto-detect code. For instance, if a pageNumber paginationScheme is described with a parameter name of "page", a client can conclude that it should be used on all endpoints that accept a "page" parameter, and on none of the endpoints that don't.
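
Such auto-detection could look roughly like this. The sketch walks the paths object of an OpenAPI document that has already been parsed into a dict, and it does not resolve $ref parameters. The tiny spec and the function name paginated_operations are made up for the example.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def paginated_operations(spec, param_name):
    """List (method, path) pairs whose operation accepts a query parameter called param_name."""
    matches = []
    for path, path_item in spec.get("paths", {}).items():
        shared = path_item.get("parameters", [])  # parameters declared for the whole path
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip "parameters", "summary" and similar keys
            params = shared + operation.get("parameters", [])
            if any(p.get("in") == "query" and p.get("name") == param_name for p in params):
                matches.append((method.upper(), path))
    return matches


# Minimal hypothetical spec: only GET /projects takes a "page" parameter.
spec = {
    "paths": {
        "/projects": {
            "get": {"parameters": [{"name": "page", "in": "query"}]},
            "post": {},
        },
        "/me": {"get": {}},
    }
}
detected = paginated_operations(spec, "page")
```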

I will be using this extension myself and will also submit it to the OpenAPI community in case it's useful for other people. I will also generate a collection of overlays for APIs from APIs.guru that add paginationSchemes objects where I find them applicable.