The Technologies For Open Data

It could be a database of cases being dealt with, it could be a calendar of meetings, it might be a collection of PDF documents of the minutes of those meetings, or perhaps it’s even a filing cabinet containing manilla folders full of paper.

Even if we assume that we can get the data in a digital form, there will still be a wide range of different types of data. We can place them on a Web server so that people can download them, but it is useful to categorise them in a way that helps people understand what type of data they are getting and how easily they will be able to use it once they’ve downloaded it.

Tim Berners-Lee came up with a simple five-star rating system that describes the nature of published open data. The rating system can be summarised as follows:

One star data:

The data is available in any format, often one that is easily readable by a person but harder for a computer to process. This might be a PDF document, for example. A PDF describing the expenditure of a local council would allow people to read what has been spent, but would not make it easy to write a computer script to check whether any expenditure was over a certain amount.

Two star data:

Here, the data is in a more machine-readable form, but the format is proprietary. An example might be a Microsoft Excel spreadsheet. It is easy to read, and a script could be written to examine it automatically, but the format is tied to a particular application or operating system that may not be free to use.

Three star data:

Now, the data is in a non-proprietary format such as CSV (comma-separated values). This means that it can be opened by a range of applications and across a number of different computer platforms and operating systems. It is also relatively easy to process automatically using scripts, but the script will need to understand the layout of the file, for example what each of the columns means.
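As a rough illustration of the expenditure check mentioned under one star data, here is a minimal Python sketch of how a script might process a three star CSV file. The column names (date, supplier, amount_gbp) and all of the figures are invented for this example; a real council dataset would have its own layout, which the script would need to be told about.

```python
import csv
import io

# Hypothetical three star data: a CSV of council expenditure.
# The column names and figures are invented for illustration.
SAMPLE_CSV = """date,supplier,amount_gbp
2014-01-06,Acme Stationery,120.50
2014-01-09,Roadworks Ltd,15250.00
2014-01-15,City Catering,499.99
"""


def payments_over(csv_text, threshold):
    """Return the rows whose amount is greater than the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The script has to know that 'amount_gbp' holds the payment value;
    # nothing in the CSV file itself says what the column means.
    return [row for row in reader if float(row["amount_gbp"]) > threshold]


large = payments_over(SAMPLE_CSV, 500)
for row in large:
    print(row["date"], row["supplier"], row["amount_gbp"])
```

Note that the meaning of each column lives in the script rather than in the data. That gap is what the four star formats try to close.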

Four star data:

Data in this form uses specific Web technologies that allow us to describe the semantics of the data. We don’t have scope in this MOOC to discuss Semantic Web technologies in great detail, although we’d encourage you to explore the area if you find it interesting. In simple terms, the data is written in a Web format such as RDF (Resource Description Framework), which describes the data in a way that allows machines to interpret its meaning more easily.

RDF promotes interoperability by allowing the construction of data models (ontologies), so that similar data can be described using the same vocabularies. This helps when building systems that need to access a range of similar datasets on similar systems. It should be noted that data in this format is generally harder for people to read directly. Special browsers have been developed to make the data easier for people to read, or alternative versions of the data can also be provided in one to three star formats.
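To give a feel for the idea without any special tooling, the sketch below models RDF-style statements as plain Python tuples of (subject, predicate, object). It is not real RDF and does not use an RDF library. The example.org addresses and the "spending" vocabulary are hypothetical, chosen only to show how two publishers that share a vocabulary can be processed by the same code.

```python
# Hypothetical vocabulary and resource addresses, invented for illustration.
VOCAB = "http://example.org/vocab/spending#"
COUNCIL_A = "http://example.org/council-a/payment/"
COUNCIL_B = "http://example.org/council-b/payment/"

# Each statement is a (subject, predicate, object) triple, as in RDF.
# Both councils describe their payments with the same predicates.
triples = [
    (COUNCIL_A + "1", VOCAB + "supplier", "Roadworks Ltd"),
    (COUNCIL_A + "1", VOCAB + "amount", 15250.00),
    (COUNCIL_B + "7", VOCAB + "supplier", "City Catering"),
    (COUNCIL_B + "7", VOCAB + "amount", 499.99),
]


def values_of(statements, predicate):
    """Map each subject to its value for the given predicate."""
    return {s: o for s, p, o in statements if p == predicate}


# Because the vocabulary is shared, one lookup works across both councils.
amounts = values_of(triples, VOCAB + "amount")
total = sum(amounts.values())
print(len(amounts), "payments found")
```

Unlike the CSV case, the code does not need to know which column holds the amount for each publisher; it only needs the shared predicate name.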

Five star data:

This is the gold standard of open data. The data is written in a semantic format such as RDF but, importantly, it also refers to data in other datasets using references or links. In the same way that web pages refer to other web pages, datasets can also link to other datasets. This helps avoid large-scale duplication of data and helps turn discrete datasets into a Web of data.
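The linking idea can be sketched in the same simplified style, again using plain Python tuples rather than real RDF. Here a payments dataset stores only a link to a supplier, and the supplier's details live in a separate dataset. All addresses, predicate names and values are hypothetical and exist only for this example.

```python
# Hypothetical addresses, invented for illustration.
PAYMENT = "http://example.org/council-a/payment/1"
SUPPLIER = "http://example.org/companies/roadworks-ltd"

# Dataset 1: council payments. The supplier is a link, not a copied name.
payments = [
    (PAYMENT, "supplier", SUPPLIER),
    (PAYMENT, "amount", 15250.00),
]

# Dataset 2: a separate company register, maintained by someone else.
companies = [
    (SUPPLIER, "name", "Roadworks Ltd"),
    (SUPPLIER, "registeredIn", "Southampton"),
]


def describe(subject, dataset):
    """Collect everything a dataset says about one subject."""
    return {p: o for s, p, o in dataset if s == subject}


def follow(subject, predicate, source, target):
    """Follow a link found in one dataset and describe it using another."""
    link = describe(subject, source)[predicate]
    return describe(link, target)


supplier_info = follow(PAYMENT, "supplier", payments, companies)
print(supplier_info["name"])
```

The payments dataset never duplicates the supplier's name or location. If the company register is corrected, anything that follows the link picks up the change.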

The Semantic Web is a rich area of Computer Science research and these technologies are gradually beginning to link up large datasets of information around the globe, providing unique opportunities for both ‘Big Data’ research, and more powerful commercial information systems.

