Liquid Data and Personal Consent Stores

October 2023

For twelve years now I have researched the "data portability by default" benefits of personal data stores. I have now found two caveats with that approach, which I would like to address.

Using a personal data store as a pivot for data portability between applications works very well: it allows the user to export data from the first app into a memorable place, rather than into a zip file in the downloads folder on one of their devices, and it allows data to be imported into the second app without requiring any interaction with the first app.

It would follow that all apps connected to the personal data store should use it as the primary storage for all data pertaining to this user. But apps will not always want to do that; they may want to keep their own copy as authoritative. A good example is patient records at a hospital. The hospital will often trust its own copy of the data before trusting the copy in the personal data store, which it doesn't control.

From the point of view of the hospital, everything outside its walls is out of its control, and it will have a defense layer that filters both outgoing and incoming information, to prevent outgoing data leaks and incoming data corruption. The same will be true for bookkeeping and ERP systems: letting go of the primary store of business-critical information is a paradigm shift in the best cases, and a hard no in many other contexts.

I therefore started thinking about "Liquid Data" as a paradigm for multi-homed data.

The other caveat is in how decentralisation of data turns all our ideas about access control inside out. The idea of setting up a personal data store with which the user can bring their own data to an application is at the same time powerful and simplistic. It assumes that this personal data store becomes the new server around which a wall can be built to prevent unauthorized access. But when we think of data as multi-homed, and on top of that think of identity (both for the data controller and the relying party) as something federated, the whole notion of access control at the server gate breaks down. "Read access" used to mean the ability to take data out of its home. For multi-homed data, there can still be a notion of read access when relying parties take data out of one of its homes, but by doing so the relying party itself becomes an additional home of the data, changing the location of the wall rather than just passing through a gate in that wall. As for "write access" to multi-homed data, there is no way to stop a relying party from "writing" a new version of the data in its local copy. The sync messages that are sent to copy this new version back to its other home may be rejected, but for multi-homed data this is then better described as a conflict between two servers and no longer as a server denying access to a client.
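To make the "conflict rather than denial" point concrete, here is a minimal sketch. The `Home` class, its fields and the trust rule are all hypothetical, invented for illustration; they are not part of Solid or of any existing library.

```python
from dataclasses import dataclass, field


@dataclass
class Home:
    """One home of a multi-homed record (hypothetical model)."""
    name: str
    version: int = 0
    value: str = ""
    trusted_peers: set = field(default_factory=set)
    conflicts: list = field(default_factory=list)

    def write_locally(self, value: str) -> None:
        # No other home can prevent this step: the copy lives here.
        self.version += 1
        self.value = value

    def receive_sync(self, sender: str, version: int, value: str) -> bool:
        # The receiver may refuse the update, but the sender keeps its
        # own version, so the refusal is recorded as a conflict.
        if sender not in self.trusted_peers:
            self.conflicts.append((sender, version, value))
            return False
        self.version, self.value = version, value
        return True


hospital = Home("hospital", trusted_peers={"lab"})
store = Home("store", trusted_peers={"hospital"})

store.write_locally("blood type: A+")
accepted = hospital.receive_sync(store.name, store.version, store.value)
```

After this runs, `accepted` is `False`, yet `store.value` still holds the new version: the two homes now disagree, which is a different situation from a client being turned away at a gate.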

The Solid protocol uses access control lists (either WAC or ACP) which bind access decisions to the folder structure of the data. But sometimes an access grant will have to affect access to different resources, in different places in the folder tree. And indeed, if you look at the PoC implementation of the SAI spec, it stores registries on the user's storage, but it does not interoperate with WAC. Looking at [Inrupt's answer to SAI](https://docs.inrupt.com/developer-tools/javascript/client-libraries/tutorial/manage-access-requests-grants/), it also doesn't edit the ACP rules when access is granted or revoked.

Why not CRDT?

Long, frequent and enjoyable conversations with George Svarovski and others led me to the conclusion that CRDTs, while solving part of the puzzle, do not offer the same "data portability by default" that personal data stores aim to provide, because they do not allow for translations of the data. A CRDT only works if all nodes in the domain apply exactly the same algorithm to translate update messages to new local versions that can be proven to eventually converge.

I therefore think of Liquid Data as something slightly more high-level than a CRDT. The messages sent between servers can be any type of message which both the sender and the receiver understand, but other than that they do not need to be formatted in any way relevant to the sender's or receiver's internal data format.
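As a rough illustration of that idea, the sketch below has two nodes with different internal formats exchanging one shared message type. The class names, the `contact-renamed` message and its fields are assumptions made up for this example, not a proposed wire format.

```python
class AddressBookNode:
    """Stores contacts internally as {id: (first, last)}."""

    def __init__(self):
        self.entries = {}

    def rename(self, contact_id, first, last):
        # Update the local copy, then emit a message in the shared format.
        self.entries[contact_id] = (first, last)
        return {"type": "contact-renamed", "id": contact_id,
                "name": f"{first} {last}"}

    def apply(self, message):
        # Translate the shared message into this node's own format.
        if message["type"] == "contact-renamed":
            first, _, last = message["name"].partition(" ")
            self.entries[message["id"]] = (first, last)


class BookkeepingNode:
    """Stores customers internally as {id: {"full_name": str}}."""

    def __init__(self):
        self.customers = {}

    def apply(self, message):
        # Same message, different translation into local state.
        if message["type"] == "contact-renamed":
            customer = self.customers.setdefault(message["id"], {})
            customer["full_name"] = message["name"]


book = AddressBookNode()
ledger = BookkeepingNode()
ledger.apply(book.rename("c1", "Ada", "Lovelace"))
```

Neither node needs to know how the other stores the contact; they only need to agree on what the message means. That is the translation step a CRDT does not allow for.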

Personal Consent Stores

So what is left if we take the "data store" out of personal data stores? It's true that once a user sets up a synchronisation between two services, the data can stay in sync between them for as long as both servers keep access to each other, and the user's identity on both services can stay linked until one of the two services severs the connection. But the user may want to keep track of the liquid data domains they control, and be able to add and remove connected services from a single place, even if that place is not the single source of truth for that data domain. What is left in the center, then, after removing the storage server, is only the auth server: the personal consent store.
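A minimal sketch of what such a consent store might keep track of, assuming it holds no user data at all, only the list of services connected to each data domain. The `ConsentStore` class, its method names and the example identifiers are hypothetical.

```python
class ConsentStore:
    """Tracks which services participate in which liquid data domain.

    Holds consent records only; the data itself lives at the services.
    """

    def __init__(self):
        self.domains = {}  # domain name -> set of service identifiers

    def connect(self, domain, service):
        self.domains.setdefault(domain, set()).add(service)

    def disconnect(self, domain, service):
        # Removing a service that was never connected is a no-op.
        self.domains.get(domain, set()).discard(service)

    def is_connected(self, domain, service):
        return service in self.domains.get(domain, set())

    def services(self, domain):
        return sorted(self.domains.get(domain, set()))


consents = ConsentStore()
consents.connect("health-records", "hospital.example")
consents.connect("health-records", "gp-practice.example")
consents.disconnect("health-records", "hospital.example")
```

Each service would consult `is_connected` (or receive a notification) before accepting sync messages for a domain; how that check travels over the wire is left open here.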

Of course, each service that holds a copy of the data should apply the access control on behalf of the data controller, so it might make sense to consider bundling access control with the data in the way WAC does. On the other hand, since access grants will usually span different resources, that would mean a single access grant could lead to a thousand WAC entries. Furthermore, the idea that the data synchronisation messages that travel between servers are GET, PUT and PATCH requests on specific resources is also naive; two nodes in a Liquid Data domain might be syncing using some efficient Merkle tree updates across a whole bunch of data sets, and including copies of ACLs with each bit of data would then be a serious performance hit.
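The fan-out can be sketched as follows. This is a deliberately simplified model: the resource paths, the grant shape and the one-entry-per-resource expansion are assumptions for illustration and do not reproduce the real WAC vocabulary, which can also inherit rules down a folder tree.

```python
# Hypothetical storage layout: visits are spread over several folders.
resources = [f"/records/{year}/visit-{n}"
             for year in (2021, 2022) for n in range(500)]
resources.append("/records/2022/invoice-1")

# One grant, expressed by data type instead of by location.
grant = {"grantee": "https://clinic.example/#id",
         "data_type": "visit", "mode": "read"}


def matches(grant, resource):
    # A resource is covered when its last path segment has the granted type.
    return resource.rsplit("/", 1)[-1].startswith(grant["data_type"] + "-")


def per_resource_entries(grant, resources):
    """Expand the grant into one ACL-style entry per matching resource."""
    return [{"resource": r, "agent": grant["grantee"], "mode": grant["mode"]}
            for r in resources if matches(grant, r)]


def allowed(grant, agent, mode, resource):
    """Evaluate the grant directly, without materialising any entries."""
    return (agent == grant["grantee"] and mode == grant["mode"]
            and matches(grant, resource))


entries = per_resource_entries(grant, resources)
```

Here `len(entries)` is 1000 for a single grant, and every one of those entries would have to be updated again on revocation, whereas `allowed` answers the same question from the one grant record.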

That's why I think the consent stores should stay user-centric, but the data storage should be allowed to move beyond the restrictive hub-and-spoke model.