Aseer

Posted on Sep 10, 2023

What Are Persistent Identifiers and Why We Need Them

#beginners #web3 #webdev #stem

Overview

The concept of Persistent Identifiers is important to understand because it dictates how we store and find intel throughout the internet. It can be anything from the source code of programs to scientific research.

If we don't set up a fair and efficient system in place to track this information, it may greatly affect what we consume as the truth. Yet, most devs are unaware of this topic. This article will briefly cover all the essentials about PID and where we stand with it as of today.

What is a PID

A persistent identifier is a long-lasting reference to a digital resource. It is a unique string that provides the information to reliably identify, verify and locate data.

Without an industry-wide standard identifier, we would not be able to manage transactions and support systems between publishers and clients, which would introduce copyright issues.

To make your data easy to find, you must provide your data and metadata with a persistent identifier. They are critical for building and maintaining robust links between objects, communities and infrastructures. This is especially critical for having reliable citations in scientific literature.

A Brief History of PID

Many of the problems with the internet were due to the function of the URL, which was never meant to be an identifier but only to designate the location of objects. To overcome that difficulty, our primary task has been to design and implement a system based on a persistent identifier that would be assigned to an object when it was created and stay with it throughout its life.

Below are some implementations of PID.

People and organizations
- Open Researcher and Contributor ID (ORCID)
- Research Organization Registry (ROR)
Publications
- Virtual International Authority File (VIAF)
- International Standard Name Identifier (ISNI)
- International Standard Book Number (ISBN)
Uniform Resource Identifiers
- Archival Resource Key (ARK)
- Digital Object Identifier (DOI)
- Magnet link
- Uniform Resource Names (URNs)
- Extensible Resource Identifiers (XRIs)
- Persistent Uniform Resource Locators (PURLs)
- Software Heritage Identifiers (SWHIDs)

All of the above PID systems were designed to fill specific roles and solve certain problems, but today the most standardized and widely used system is Digital Object Identifier also known as DOI.

The DOI System is an application of the Handle System Resolver, originally developed by the Corporation for National Research Initiatives. As of today, it is run by an organization called The DOI Foundation.

Closer Look on DOI

A DOI consists of two parts:

A prefix containing the directory designation and the registrant number, both assigned by the Directory Manager.
A suffix that uniquely identifies the particular digital object.

The first two characters of the prefix represent the Directory Manager that issued it, and for the time being, all DOIs begin with 10. The second part of the prefix (.1000) defines the organization, publisher, or any rights-owner or controller registrant that purchased the DOI prefix. Everything after the slash is the suffix, a unique character string created by the publisher that defines the specific digital object.

How DOI System Operates

The DOI system attaches a persistent identifier to a digital object, thereby providing a label that will define that object for its entire life, independent of where it is located. This identifier is created along with or even before that object itself was created.

A record of that DOI, along with information on the location of its object, is then sent to a central server where the DOI data are stored. Once a DOI is registered in the DOI System server, or repository, it will be stored there virtually forever. That centrally stored data forms a resolver database that can link or resolve a DOI to its object's location.

When a user asks for a digital object (say a journal), a DOI query goes to the DOI server. That server finds the record of the DOI and the address of its object, links the two together and sends the location (most likely the URL) back to the user's browser. Now the browser can retrieve the object and show it to the user. This increases the chance of retrieval significantly in environments where URLs change a lot.

The objects reside in databases controlled and maintained by the publishers. When an object's location changes for any reason, information about that is sent by the object's owner to the central server, which automatically updates the record.

As long as the Foundation gives authority for issuing and updating DOIs only to those who will be conscientious about keeping their data up to date, the central database should retain a high degree of integrity.

DOI as a Persistent Identifier

We have seen that the DOI System provides a viable means of identifying online objects with infinite persistence across a variety of publishers. However, if too many small or mid-sized publishers are allowed to participate in the DOI System, such as libraries, the system may ultimately collapse through the accumulation of DOIs for objects that have been destroyed.

On the other hand, should such publishers not be allowed to participate in the DOI System, they will likely be forced to create other similar systems with many of the same features, and manage these systems themselves to guarantee their quality. This is not a trivial task, as illustrated by the money, complexity, and amount of work that has gone into making the DOI System.

Problems With DOI

The importance of the work done on the design of the DOI System is difficult to overrate. Solving the problem of identifying specific objects on the internet is extremely important, and the DOI System has so far helped with that solution. Still, there are some issues concerning this system.

At present, only established commercial and society publishers are purchasing publisher prefixes and so are allowed to issue DOIs. This means that most individual and small publishers are not participating directly in the DOI System, but are merely acting as end users.
Those who participate in the DOI System will need to include the overhead of detailed housekeeping of the DOIs and each item's metadata. In addition, there are the maintenance fees for the resolver database server that the Foundation will need to levy to track the traded, retired, erased, or simply forgotten identifiers.

A Modern Solution

It's clear that we should look into upgrading this system and solve these problems, or adopt a different PID System altogether.

In the modern world, when you say unique IDs, the first thing to cross the mind of a dev is Web3 since it's all about creating long unique IDs for digital stuff. The idea is to apply this concept to create PIDs and use them the way we have been using DOIs.

Below are some of the advantages of using a globally distributed content-based addressing PID system.

We would eliminate the involvement of central organizations/agencies that issue PIDs and have authority over the intelligence that they maintain. As all the digital information would never be stored in a single server, it will be truly open.
We would drastically reduce the cost of maintaining such a system. Since there are no designated servers or data centers, it will be far easier to regulate.
Every publisher will always be able to participate in this system and guarantee the quality of their work, no matter how small. There would be no need to purchase licenses for issuing PIDs and adhering to strict rules set by anyone.
This system will not have to worry about problems like the regulation of IDs for objects that have been destroyed or abandoned. Content-based addresses are extremely versatile.

The PID system known as MagnetLink already implements this to an extent, but it comes with its own limitations.

What is Content-Based Addressing?

I have explained this in detail in my Beginner's Guide to IPFS blog but in a nutshell, it's like creating a name for your digital object by using the digital object itself.

The URL of a digital object represents where on the internet it is stored. But the content-based ID is a string of characters that has been created by using the object itself as a key. If the data changes even a little, the ID also changes. It is created using hashing algorithms like SHA. This video briefly explains it.

Drawbacks Of This System

It may look like the most ideal PID system at first glance, but it's far from it. Some of the disadvantages of this system are listed below.

Mutability

As the ID directly depends on the content itself, it makes the content essentially immutable. Suppose you publish an article with an ID from a hashing algorithm that your users can then use to cite this article and find it easily on the internet. If you were to correct so much as a typo in this article, the entire ID will change and the older ID will be rendered useless.

To overcome this issue of immutability, there are other algorithms put in place that will allow users to continue using the older ID as if it was never changed. There are several designs to make content-based addressed documents mutable which we will not be discussing here, but this is a good starting point.

Security

The security of a hashing algorithm essentially depends on how advanced the technology is at a given time. For instance, the hashing algorithm MD5 is no longer useful because of how easy it is to deliberately create hash collisions, thanks to the speed of modern computers. This video explains it in detail.

Knowing this, it's worth noting that the hash algorithms that are secure today will soon be useless. In fact, it's already happening as the count of quantum computers in the world grows. They are capable of breaking many modern systems that we have relied on for decades, like RSA and SSH, forcing us to design better systems that are harder to break.

In any case, if computers advance in technology and speed, they also make it easier to implement harder algorithms which would normally have been too slow. If they make it easier to break existing systems, they also make it easier to use more complex systems that even they can't break. We would just need a renewed hashing algorithm to create our PIDs.

The End

Welp, that's the end of it.

Let me know your thoughts on these PID systems in the comments and if I left out any important details. Check out this Wikipedia page for further reading.

If you like this article then consider following me on Hashnode.

DEV Community