Justice transparency: A Computer Weekly Downtime Upload podcast

The National Archives has deployed a semantic data platform, powered by MarkLogic, to support the UK government’s drive to improve transparency in the justice system.

John Sheridan, digital director at The National Archives, defines semantic data as a formal conceptual model for describing data. The organisation uses its semantic data platform to run the Find Case Law service. “We are preserving court judgments in the digital archive for the future,” he said.

Describing the challenge in making court judgments searchable, Sheridan said: “You can imagine that if you are just publishing court documents such as PDF documents, this doesn’t tell you very much about what the judgment is about. But if you add clearly defined pieces of information, such as the way the court identifies judgments, then you have a document that tells you that ‘I’m not just any old document – I’m a judgment and I was given by this court and I was given on this date’.” In effect, the document becomes self-describing.
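The idea of a self-describing document can be illustrated with a minimal sketch in Python. The element names (`judgment`, `meta`, `court`, `cite`, `date`, `body`), the court name and the citation are all invented for illustration and are not the schema Find Case Law uses. The point is only that the court, identifier and date travel inside the document instead of being held elsewhere.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_judgment(court: str, cite: str, given_on: date, body_text: str) -> ET.Element:
    """Wrap judgment text in an element that carries its own metadata."""
    # Element names are illustrative only, not a real legal XML schema.
    judgment = ET.Element("judgment")
    meta = ET.SubElement(judgment, "meta")
    ET.SubElement(meta, "court").text = court
    ET.SubElement(meta, "cite").text = cite
    ET.SubElement(meta, "date").text = given_on.isoformat()
    ET.SubElement(judgment, "body").text = body_text
    return judgment


def describe(judgment: ET.Element) -> str:
    """Answer 'what is this document?' using only the document itself."""
    meta = judgment.find("meta")
    return (
        f"{judgment.tag} {meta.findtext('cite')} "
        f"given by {meta.findtext('court')} on {meta.findtext('date')}"
    )


doc = build_judgment(
    court="Example Court",      # hypothetical court name
    cite="[2023] EXMP 1",       # made-up identifier for illustration
    given_on=date(2023, 1, 16),
    body_text="The appeal is dismissed.",
)
print(describe(doc))
```

A bare PDF would give `describe` nothing to work with. Here the same three facts Sheridan lists (it is a judgment, which court gave it, and when) can be read straight from the file.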

Looking at the process for populating the Find Case Law archive, Nicki Welch, service owner for access to digital records at The National Archives, said: “The information that you need about the judgment is in the judgment.”

The judgment starts as a Microsoft Word document. A microservice is then used to convert it into Legal Document Markup Language, an open standard for legal documents. This XML document is then published and also transformed into HTML.
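The last step of that pipeline, turning the XML into HTML, might look something like the sketch below. It assumes the Word-to-XML conversion has already happened and uses a deliberately simplified, hypothetical XML shape (`judgment`, `title`, `paragraph` with a `num` attribute) in place of real Legal Document Markup Language. The article does not say how The National Archives implements its transform, so this is a stand-in using Python's standard library.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified stand-in for the converter's XML output (hypothetical schema).
SAMPLE_XML = """
<judgment>
  <title>Example v Example</title>
  <paragraph num="1">This is an appeal.</paragraph>
  <paragraph num="2">Costs &amp; fees are reserved.</paragraph>
</judgment>
""".strip()


def xml_to_html(xml_text: str) -> str:
    """Render the simplified judgment XML as an HTML fragment."""
    root = ET.fromstring(xml_text)
    parts = [f"<h1>{escape(root.findtext('title', default=''))}</h1>"]
    for para in root.iter("paragraph"):
        # Keep the judge's paragraph number as explicit markup so it
        # survives the change of format.
        num = escape(para.get("num", ""))
        text = escape(para.text or "")
        parts.append(f'<p><span class="num">{num}.</span> {text}</p>')
    return "\n".join(parts)


print(xml_to_html(SAMPLE_XML))
```

Keeping the XML as the published source and deriving the HTML from it means the web presentation can be regenerated later without touching the preserved record.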

The National Archives has been running the government’s legislation.gov.uk service for several years. Sheridan said this meant it had a pretty good idea of the model that could be used for representing the judgments. “We wanted to align the data model with what we were doing with legislation.gov.uk and so it made sense to align the technology,” he said.

As a result, MarkLogic was chosen as the database on which to base the Find Case Law service. Sheridan said that by using MarkLogic: “We can store the documents. We can store the semantic data that we extract from the documents, and we can use it to run our search. It’s technology that we were familiar with as an organisation.”

The next phase for The National Archives is to integrate the Find Case Law service with another microservice called the Judgment Enrichment Pipeline, to mark up other cases and pieces of legislation cited in the judgment. Sheridan said: “This will help users navigate between related judgments and related pieces of legislation which are published on our sister service, legislation.gov.uk.”
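As a rough illustration of what marking up cited cases involves, the sketch below finds neutral citations in plain text and wraps each one in a link. It is not the Judgment Enrichment Pipeline. The regular expression covers only the simple pattern of bracketed year, court code, optional division and number, the citations in the usage example are there only to show the format, and the `caselaw.example` URL scheme is a placeholder assumption. Legislation references and bracketed division suffixes such as "(Ch)" are out of scope here.

```python
import re
from html import escape

# Simple neutral-citation pattern, e.g. "[2022] EWCA Civ 123" or "[2021] UKSC 5".
NEUTRAL_CITATION = re.compile(
    r"\[(?P<year>\d{4})\]\s+"
    r"(?P<court>[A-Z]{2,6})"
    r"(?:\s+(?P<division>[A-Z][a-z]+))?"
    r"\s+(?P<number>\d+)"
)


def link_for(match: re.Match) -> str:
    """Build a link target for one citation (placeholder URL scheme)."""
    parts = [match["court"], match["division"], match["year"], match["number"]]
    path = "/".join(p.lower() for p in parts if p)
    return f"https://caselaw.example/{path}"


def mark_up_citations(text: str) -> str:
    """Return HTML in which every recognised citation is a hyperlink."""
    out = []
    pos = 0
    for m in NEUTRAL_CITATION.finditer(text):
        out.append(escape(text[pos:m.start()]))  # text before the citation
        out.append(f'<a href="{link_for(m)}">{escape(m.group(0))}</a>')
        pos = m.end()
    out.append(escape(text[pos:]))  # trailing text after the last citation
    return "".join(out)


print(mark_up_citations("See [2022] EWCA Civ 123 and [2021] UKSC 5."))
```

Because the links are derived from identifiers already present in the text, enrichment of this kind can be rerun over the whole archive as the matching rules improve.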

Viability of file formats

Because The National Archives is an archive, it has to consider how long data in a particular format will remain readable. Asked how the organisation assesses the long-term viability of storing digital information in a particular file format, Sheridan said: “The records need to survive any technology that we might use for managing them or to produce them.”

The technologies being used today may no longer exist in the future. “We need to view that through a risk lens,” he added.

The challenge is to ensure the record survives and to mitigate the risks to it. One approach is to make sure the document can be migrated to a different file format. Another is to rely on how pervasive a format is: because the Microsoft Word document format is so widely used, a judgment stored in it is highly unlikely to become unreadable over the next few decades.

But what is stored in the document and how it is rendered are two very different things, said Welch. “There is a different way of presenting and rendering information within a file,” she said. “So, for example, when we are building digital services for the web and HTML, CSS is a common standard format that you would expect to see if you were accessing information on a web browser.”

Applying this to long-term document archives, The National Archives has the concept of a surrogate that offers a different way to render the information stored in a file. “Surrogate creation allows us to pick the best format,” she added.

But this conversion process can be difficult, said Welch: “One of the issues we found with the judgments is that they are highly styled documents.”

Judges may use indentation, bold, italics and paragraph numbering when drafting judgments, in order to add meaning and structure, she said, adding: “When you take all that nuanced styling that Word encodes in a certain way and you try converting it to XML and transform it into HTML, little gremlins can creep in.”

This means the published document may not look the way the judge originally intended. Welch said The National Archives is currently trying to balance the difference between the original document and how it is presented once transformed.