Multi-Homed Personal Data for Education

This blogpost is part of a 20-day project I'm doing for the SURF/NPuls EduWallet Incubator this year, exploring what Personal Data Stores (PDSs) like the Solid Project can add to Wallets.

EduWallet Incubator

The motivation of the EduWallet Incubator project is to get ready to use wallets as part of educational computer infrastructure, keeping public values and the interests of the "Life-Long Learner" (the "learner" from here on in this blogpost) at the centre. For instance, giving the learner more control over which courses they enroll in, and making sure they have sovereign access to their diplomas. The current situation is that a learner's personal data and diplomas are scattered behind various web portals that require various cumbersome authentication procedures, and some data might even just be lost over time, against the learner's will. Personal data in education is "homeless", as Peter Eikelboom puts it in this interview (in Dutch).

In this blogpost I'll take a step back and consider how we can nudge the systems that manage personal data in education in a favourable direction, given what we see around us in practice now and what we can try to predict about the future. For instance, the market for educational offerings may become more liquid (it's now easier to travel or study online than it was 40 or 400 years ago). Offerings may come in smaller chunks (e.g. a student might mix and match 100 one-week courses on different topics and from different suppliers instead of following one monolithic 100-week education from a single institution). And the teachers as well as the managers of an educational institute may be replaced by AI in ways we cannot yet imagine.

At the same time, computer infrastructure is evolving, from the days when the internet was only interconnecting the universities, to the current situation where companies like Microsoft, Google and Apple have strong political power over the way computers aid in education, and AI is likely to make the power of proprietary software providers over the education process even more ubiquitous and far-reaching than it is already.

I'll question whether a wallet is actually a solution here, whether adding a PDS actually helps, and what a good long-term roadmap would be for educational personal data management. Rather than starting from the use cases defined by the EduWallet Incubator project (slides, in Dutch), which are basically the user journeys for registration, authentication, enrollment, exams, applying for jobs, and shopping, I want to turn these requirements inside-out, and try to define in data-centric requirements how the computer system should support real-world processes.

The Long View

My first observation is that real-world processes should be supported over potentially long periods of time. For instance, a learner might learn a skill when they are 20 years old, and require that skill at a job they start when they are 60 years old, so the data would need to be preserved for a 40-year period. Computer devices, data formats, technologies, even the very ways of interacting with data are likely to change a lot during these 40 years. To some extent the learner's memory can be used for persistence, but it would be desirable if all data were persisted reliably without relying on human memory.

Computer systems will control physical processes, in particular the learner's access to physical locations and to participation in learning processes, production processes and consumption processes. The learner has agency to decide whether they apply for a certain job or enroll in a certain course, but the computer system should support that agency by providing discovery and matching of offers and needs, coming up with proposals for the learner to choose from. This can be both proactive (alerting the learner to an opportunity) and reactive (answering a query from the learner).

Authentication

An important part of the system will be authentication: in a session of interaction between human and computer, make sure the personal data of the right learner is surfaced as the basis for both computer and human decision making. Authentication factors we currently use are generally combinations of biometrics, human memory, physical device ownership ("something you are/know/have"), and in password recovery procedures, different types of proof may be used, such as third-party attestations, papertrail proof of past interaction, etcetera.

I think it's important to note at this point that a wallet app on a smartphone should never be the only authentication factor by which a learner can access their personal data. It should also never be the only store of any piece of data.

Just as multi-factor authentication is both more secure and more resilient than single-factor authentication, multi-homed data is both more convenient and more resilient than single-system data storage.
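The resilience claim can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes that each copy of the data is lost independently with the same probability over the period of interest, which is a simplification (real failures are often correlated), and the numbers are purely illustrative.

```python
def probability_all_copies_lost(per_copy_loss: float, copies: int) -> float:
    """Chance that every copy is lost, ASSUMING independent failures.

    per_copy_loss is an illustrative probability that one given system
    (a phone, a portal, a PDS) loses the data over the whole period.
    """
    if not 0.0 <= per_copy_loss <= 1.0:
        raise ValueError("per_copy_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("there must be at least one copy")
    return per_copy_loss ** copies


# With a (made-up) 50% chance of losing any single copy over 40 years,
# one home gives 0.5, while three homes give 0.5 ** 3 = 0.125.
single_home = probability_all_copies_lost(0.5, 1)
three_homes = probability_all_copies_lost(0.5, 3)
```

In practice the copies also need to be re-homed as old systems disappear, so the number of live copies is not constant over time.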

System-Centric Requirements

As mentioned earlier, the physical processes that the system should support probably include learning processes, production processes and consumption processes. For instance, a class might take place online or in a physical location, people might get together in a commercial company to work, and people may buy things with a student discount. The computer systems may need to ensure things like:

only people who have completed a certain training can work a certain job
people can only participate in course B after completing course A
people can only participate in a certain course if they registered for it (indicating and recording both intent and qualification, and possibly completing a payment)
before, during and after job interviews, information about the applicant's application is available to both the applicant and the interviewer
only people who are currently students get the student discount
people may publish recommendations about other people, or about specific work of theirs, publicly or privately shared
for some courses, some proof of legal status may be required, e.g. a student visa for a certain country and time period
data processing should be transactional wherever possible, so apply the principle of least authority to authentication requirements instead of always sending a learner's full identifiable information record
the learner may want to curate and present a sort of "verifiable CV" - something like a profile page or document in which verifiable signatures from educational institutes can be embedded as CV items
apart from institutes issuing diplomas, colleagues and co-students may also personally give peer recommendations
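As an illustration of how two of these requirements (prerequisites and registration) could be checked, here is a minimal sketch. The course names, the `PREREQUISITES` table and the `LearnerRecord` shape are hypothetical, and the function deliberately returns only a yes/no answer rather than exposing the learner's full record.

```python
from dataclasses import dataclass

# Hypothetical prerequisite table: course-b requires course-a.
PREREQUISITES: dict[str, frozenset[str]] = {
    "course-b": frozenset({"course-a"}),
}


@dataclass(frozen=True)
class LearnerRecord:
    completed: frozenset[str]   # courses the learner has completed
    registered: frozenset[str]  # courses the learner has registered for


def may_participate(record: LearnerRecord, course: str) -> bool:
    """True only if the learner registered and met every prerequisite."""
    required = PREREQUISITES.get(course, frozenset())
    return course in record.registered and required <= record.completed
```

A verifier that only receives the boolean result learns nothing else about the learner, which is in line with the least-authority item in the list above.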

Data that might need to be codified to support all these processes could include:

proof of legal status
proof of payment
study results from the past
current enrollment
social recommendations

Study results would naturally be signed for authenticity by the teacher or the examiner, or the institute they represent, with a binding to the holder that allows them to be identified correctly in future interactions. For instance, it may make sense to put a person's first and last name on a diploma, so that a different person with a different first and last name cannot use that diploma and pretend that it was them who achieved the result. And here first and last name are of course a common but pretty ill-chosen identifier. It might make more sense if a diploma were tied to an identifier with more entropy (to avoid the risk of accidental identifier collisions), from a stable and politically neutral identity provider such as (in some cases) a national government, and if that identifier were recoverable by the learner through a combination of authentication factors.
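To get a feel for what "more entropy" buys, the chance of an accidental collision among randomly chosen identifiers can be estimated with the standard birthday approximation. This is a rough sketch: it assumes identifiers are drawn uniformly at random, and the population size is an arbitrary example.

```python
import math


def collision_probability(population: int, identifier_bits: int) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2N)), with N = 2**bits.

    ASSUMPTION: every identifier is drawn uniformly at random.
    """
    space = 2 ** identifier_bits
    exponent = -population * (population - 1) / (2 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Example population of ten million learners (an arbitrary figure):
short_ids = collision_probability(10_000_000, 32)   # practically certain
long_ids = collision_probability(10_000_000, 128)   # negligible
```

Names are of course not drawn uniformly at random, so for first and last names the real collision risk is higher than such an estimate would suggest.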

Where, for instance, proof of passing an exam or proof of payment is concerned, the record needs to be preserved over time, but for the learner, discoverability of and access to this data also need to be preserved throughout their lifetime. Computer devices such as smartphones don't usually last for 40 years. So storing a credential on a smartphone is not a solution. I think it's clear we need to embrace the view of "multi-homed data" to address this challenge. Then we need to see how wallets and PDSs fit into that.

Multi-Homed Data

In data portability, some people have argued for "data at the source" or "linked data", but I think this is an oversimplified architecture for storing data in networked computer systems. Digital signatures (which were not yet widely in use when Linked Data was invented) can now be used to produce location-independent Verifiable Credentials (VCs). Also, distributed versioning and synchronisation is a topic that has recently come under more study in the OT, CRDT, CTM and Local-First paradigms. Although the data requirements here are not heavy on collaborative editing, it's good to keep multiple copies in multiple systems for data portability. That way, we reduce the chances of the data getting lost because one particular computer system is no longer interested in it, or because a device is physically damaged or discarded by the learner.
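Location independence means that any surviving copy is as good as any other, as long as it is bit-for-bit intact. The sketch below assumes the expected SHA-256 digest has already been obtained from a credential whose signature was verified (signature verification itself is out of scope here), and the home names are hypothetical.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def first_intact_copy(
    homes: dict[str, Optional[bytes]], expected_digest: str
) -> Optional[str]:
    """Return the name of the first home holding an unmodified copy.

    homes maps a (hypothetical) system name to the bytes it returned,
    or None if that system is gone or no longer has the data.
    """
    for name, data in homes.items():
        if data is not None and sha256_hex(data) == expected_digest:
            return name
    return None
```

For example, if the old phone was discarded and one portal returns a corrupted file, a third home can still serve the credential, and the digest check shows that this copy is the one to trust.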

Whereas personal data stores (PDSs) are proposed as a user-centric location for data storage, and I did a lot of work over the years promoting them, I now think this paradigm is flawed and incomplete. I think it's better if the user's personal data is synced across multiple systems, and new systems get added over time, before the old systems get taken offline. A PDS can still play a vital role in securing the persistence of the personal data of a learner, but thanks to recent advances in sync protocols, it is no longer necessary to restrict it to being the only authoritative storage of that data. It can be a node in a data sync network.

Wallets hold VCs for at most the time the device they are installed on survives. Some wallet applications have proprietary backup systems - others may use the backup functionality of the mobile operating system, but both of those options lead to lock-in. Wallet-specific backups lock the user into a specific wallet application. Mobile OS backups lock the user into a particular mobile OS. It's better to think of these synchronisation functionalities as additive - the data can exist in any number of wallets and personal data stores at the same time, and be synchronised using any number of sync protocols, as long as the result is that the data, and the corresponding signatures over that data, are copied over intact.
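One way to picture "additive" synchronisation is as a set union over content-addressed blobs: syncing only ever adds copies, and the order in which stores sync does not matter. The following is a minimal sketch under that assumption. Real wallets and PDSs would each speak their own protocol, and the `Store` type here is just an in-memory stand-in.

```python
import hashlib

# In-memory stand-in for a wallet or PDS: content hash -> signed bytes.
Store = dict[str, bytes]


def put(store: Store, credential: bytes) -> str:
    """Store the credential under its SHA-256 hash and return that key."""
    key = hashlib.sha256(credential).hexdigest()
    store[key] = credential
    return key


def sync(a: Store, b: Store) -> None:
    """Two-way additive sync: afterwards both stores hold the union.

    Bytes are copied unchanged, so signatures over them stay valid.
    """
    for key, blob in list(a.items()):
        b.setdefault(key, blob)
    for key, blob in list(b.items()):
        a.setdefault(key, blob)
```

Because the merge is a plain union, running it twice, or through a different route between the same stores, ends in the same state.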

Multi-Identified Data

Just like a learner is not unlikely to have 10 different personal mobile devices between the time a credential is issued and the time a credential is verified, identities change over time. For privacy reasons, we don't always want to link data to a government-issued stable identifier that links together all the data about a learner from their birth to their death.

When a learner participates in a transaction, we often want to identify them in a less precise way, applying the principle of least authority, for instance if you pass the ticket check at the entrance of a conference, you get a wrist band. The restaurant will then be able to identify you as "somebody who is entitled to eat at the lunch buffet", but they will not know anything about your place of birth, etcetera.
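The wrist band example can be sketched as a small data transformation: the ticket check sees the full record, and hands on only the one fact the restaurant needs. The `Ticket` and `Wristband` types and their fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    holder_name: str
    place_of_birth: str
    includes_lunch: bool


@dataclass(frozen=True)
class Wristband:
    # The only fact the restaurant gets to see.
    lunch_buffet: bool


def issue_wristband(ticket: Ticket) -> Wristband:
    """Derive the least-authority token from the full ticket record."""
    return Wristband(lunch_buffet=ticket.includes_lunch)
```

The restaurant only ever handles `Wristband` objects, so there is no name or place of birth for it to leak or correlate.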

There may come a time where restaurants will use face recognition to still know the place of birth and other details of all their customers, and we will have to "voluntarily" opt in to this if we want food, but until that time comes, let's not engineer it on purpose.

Whether the identifier used by an educational institution only identifies you as "a person who is enrolled in this course", or it actually undeniably identifies you as a specific physical human being (for instance because it uses biometrics for authentication), there will be times that identifiers will need to be translated.

Maybe you have a self-issued identifier on a device you own, and a different one on your next device. Or maybe you have a student number at university that will somehow need to be linked to your employee number at work later. Just like I think multi-homed data and co-existence of the same data in multiple systems of record is a good paradigm for data portability (I'll start blogging more about this as part of my new project The Ultimate Bookkeeping System soon), I think multi-identified data can help make identifier migrations easier.

In Linked Data, multi-identified data is partially supported through the owl:sameAs predicate, which (somewhat comically) appears to be the same as the schema:sameAs predicate. In lens projects like Devonian and Lens VM, entries in a system of record can have one local identifier and multiple foreign identifiers.
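A simple way to model multi-identified data is to keep groups of identifiers that are claimed to refer to the same learner. The sketch below uses a union-find structure for that. It is not how Devonian or Lens VM work internally, the identifier strings are made up, and it records equality claims without verifying them.

```python
class IdentifierAliases:
    """Tracks groups of identifiers that are claimed to be the same."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def _find(self, identifier: str) -> str:
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[identifier] != root:
            next_id = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_id
        return root

    def same_as(self, a: str, b: str) -> None:
        """Record an (UNVERIFIED) claim that a and b identify one learner."""
        self._parent[self._find(a)] = self._find(b)

    def are_same(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)
```

With this, a student number linked to an old device identifier, which is in turn linked to an employee number, resolves to one group without any of the systems having to adopt the others' identifiers.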

Binding a Credential to a Holder Identity

Great care should of course be taken when following claims of identifier equality, not to fall for impersonation attacks. Identities may be stolen, and in an extreme case, an identity may even be sold for money, with the original account owner's consent, to an impersonator. So we need a balance between allowing the learner to have access to their personal data during their entire lifetime, without needing to tie their identifiers unequivocally to themselves as physical human beings, and also not making it too easy for an attacker to steal or buy a person's identities.

A credential might be handed out to a person who completes an exam, linked to their self-issued identity. Before switching to a new phone, the learner may have endorsed a new self-issued identity on the new phone using the identity on the old phone. So this would imply that the holder of the new phone has completed the exam. But what if they endorsed two new phones, one of their own and one belonging to a friend?

A way to avoid this "double spend" of the exam credential would be if both the new and the old self-issued identity were linked to a passport number. But this requires the learner to give up some privacy. This is one of those moments where somebody is going to mention blockchain. I don't think I currently know of any solution for secure identity migration in a decentralised system that would be better than referring to a central trusted authority such as a government, a bank, or a specific database or blockchain network.
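The "double spend" problem can be stated precisely as a fork in the chain of endorsements: one old identity endorsing more than one successor. The sketch below detects such forks, but note the assumption it relies on: some party can see all endorsements, which is exactly the role of the central trusted authority discussed above. The identity names are made up.

```python
from collections import defaultdict


def find_double_endorsements(
    endorsements: list[tuple[str, str]],
) -> dict[str, set[str]]:
    """Return old identities that endorsed more than one new identity.

    endorsements is a list of (old_identity, new_identity) pairs, as
    seen by a party that is ASSUMED to observe every endorsement.
    """
    successors: dict[str, set[str]] = defaultdict(set)
    for old_identity, new_identity in endorsements:
        successors[old_identity].add(new_identity)
    return {old: new for old, new in successors.items() if len(new) > 1}
```

Without such a global view, each verifier only sees one branch of the fork, and both the learner's new phone and the friend's phone would look legitimate.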

Linking a PDS to a Wallet

The Wallet-Attached Storage spec describes an interface that personal data stores can expose to wallets, including a way to handle access control based on DIDs and ZCAPS. Similarly, the Linked Web Storage Working Group are writing a version 1.0 of the Solid storage server spec, which could be combined with work the Solid Share project is doing, exploring a way for a wallet application to mint access credentials that match an ACP policy on the storage. In a third, independent initiative, Inrupt have published their own Solid Data Wallet API based on WebIDs and Data Grants. A fourth one is SISSI from KIT.

Apart from these four, there are probably other protocols for wallet-PDS interaction with a similar goal that I'm not listing here. And then there are the proprietary backup solutions at the wallet level and at the device level that I mentioned earlier. I think it's clear from the number of concurrently existing options now, that if we look over the next 40 years, there will not be a single dominant protocol by which data will travel between a wallet and a PDS. That's why I think we should not try to create a single underlying layer of data transport and access control to enable the 40-year persistence of personal data in education.

VC Attachments

There is a particular mechanism that we explore in the EduWallet project that solves a piece of the puzzle, namely VC attachments. Suppose a student has completed a class at the Amsterdam Hip Hop Academy, and they received their badge. Now they want to apply as a youth mentor at a dance school, and they want to show off their skills in their application. The badge by itself is important, but the picture of the applicant comes to life if it includes detailed microcredentials of individual skills and courses, and if these credentials can link to video footage and other media. We are therefore building a prototype that uses the evidence feature of the W3C's Verifiable Credentials Data Model v2.0.

In a VC's evidence field, a URL and a digest of, for instance, a video can be provided, to be covered under the issuer's signature. It's not possible to provide multiple alternative URLs (as I think would have been useful in a multi-homed data architecture), and there is also no mechanism for access control - the assumption is apparently that the video is either public, or the link is not to be leaked.
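To make the URL-plus-digest idea concrete, here is a rough sketch of building such an evidence entry and checking a downloaded file against it. This is not the prototype's code. The key names (`id`, `type`, `digestSRI`), the `Evidence` type value and the example URL are assumptions for illustration and should be checked against the Verifiable Credentials Data Model v2.0 before use. The digest string follows the Subresource Integrity style of an algorithm prefix followed by a base64-encoded hash.

```python
import base64
import hashlib


def sri_digest(data: bytes) -> str:
    """SRI-style digest string: 'sha384-' + base64 of the SHA-384 hash."""
    raw = hashlib.sha384(data).digest()
    return "sha384-" + base64.b64encode(raw).decode("ascii")


def evidence_entry(url: str, media: bytes) -> dict:
    """Build an evidence object (key names are ASSUMED, see lead-in)."""
    return {
        "id": url,
        "type": ["Evidence"],  # hypothetical type value
        "digestSRI": sri_digest(media),
    }


def matches_evidence(entry: dict, downloaded: bytes) -> bool:
    """Check that downloaded bytes are the ones the issuer signed over."""
    return entry["digestSRI"] == sri_digest(downloaded)
```

Since the check depends only on the bytes and not on where they came from, a copy of the video fetched from a different home than the signed URL can still be validated against the same digest.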

Conclusion

In conclusion, we should design for a multi-homed persistence of the personal data of a learner, while they might easily own ten different smartphones (one every four years), or whatever devices will come after smartphones, from the time a credential is issued to the time a credential is presented. It should always be possible to use a combination of authentication factors to regain access to the personal data, and to its holder binding which proves that the learner is actually their data subject.

Access control is not a layer that can be organised per data location or with one specific way for a sync client to request access or to provide proof of group membership. There need to be multiple systems working in parallel in a polyglot system, each data link playing its part, in its own language, to make the data more multi-homed.

Combining a wallet with a personal data store will be very useful for attachments that are too big to fit into a VC, as we show in the prototype we are currently building, and hope to complete during the summer of 2025. Apart from that, wallets and PDSs are both just nodes in a polyglot data storage network, where multiple data transport protocols, data formats, access control mechanisms, backup strategies and authentication factors work together to keep a learner's personal data safe and available throughout their life-long learning experience.