Presenter: Diomidis Spinellis Date: 22 August 2022
The world-wide web has allowed scientific publications to include links to resources available in it via URLs. Previous research has show that the availability of resources identified by URLs is ephemeral, decaying with the passage of time. As a response to the problem of URL decay more permanent identifiers, such as DOIs, have been developed and are often used.
I describe the methods and first findings of an ongoing study on the evolution of web and DOI linking in scientific publications. The study is based on processing a collection of n-grams collected from 71 million documents on a small cluster of computers. Preliminary results indicate that the density of published links per document has increased 10 × over the past quarter century, but the percentage of DOIs has also been increasing to reach 20% in 2018. Looking at failures, I find the expected increase of URL failures as years go by. The DOI failures are surprising and warrant further closer investigation.
As an aside, I also looked at the feasibility of recreating documents from the n-grams via the Eulerian path approach used for stitching together DNA fragments. However, I reached the conclusion that there are too many candidate paths to do this deterministically.