Hyperlinks are a powerful tool for journalists and their readers. Diving deep into the context of an article is just a click away. But hyperlinks are a double-edged sword; for all of the internet’s boundlessness, what’s found on the web can also be modified, moved, or entirely disappeared. This often-irreversible decay of web content is commonly known as linkrot. It comes with a similar problem of content drift, or the often-unannounced changes––retractions, additions, replacement––to the content at a particular URL.
Our team of researchers at Harvard Law School has undertaken a project to gain insight into the extent and characteristics of journalistic linkrot and content drift. We examined hyperlinks in New York Times articles starting with the launch of the Times website in 1996 up through mid-2019, developed on the basis of a dataset provided to us by the Times. We focus on the Times not because it is an influential publication whose archives are often used to help form a historical record. Rather, the substantial linkrot and content drift we find here across the New York Times corpus accurately reflects the inherent difficulties of long-term linking to pieces of a volatile web.
Results show a near linear increase of linkrot over time, with interesting patterns emerging within certain sections of the paper or across top level domains. Over half of articles containing at least one URL also contained a dead link. Additionally, of the ostensibly “healthy” links existing in articles, a hand review revealed additional erosion to citations via content drift.