Thursday, June 11, 2009

Publishing Government Data to the Linked Open Data (LOD) Cloud

In a previous post, I outlined a roadmap for migrating news content to the Semantic Web and the Linked Open Data (LOD) cloud. The BBC has been doing some interesting work in that space by using Linked Data principles to connect BBC Programmes and BBC Music to MusicBrainz and DBpedia. SPARQL endpoints are now available for querying the BBC datasets.

It is clear that Europe is ahead of the US and Canada in terms of Semantic Web research and adoption. The Europeans are likely to further extend their lead with the announcement this week that Tim Berners-Lee (the visionary behind the World Wide Web, the Semantic Web, and the Linked Open Data movement) will be advising the UK Government on making government data more open and accessible.



In the US, data.gov is part of the Open Government Initiative of the Obama Administration. The following is an excerpt from data.gov:

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government. The openness derived from Data.gov will strengthen our Nation's democracy and promote efficiency and effectiveness in Government.

Governments around the world have taken notice and are now considering similar initiatives. It is clear that these initiatives are important for the proper functioning of democracy since they allow citizens to make informed decisions based on facts as opposed to the politicized views of special interests, lobbyists, and their spin doctors. These facts are related to important subjects such as health care, the environment, the criminal justice system, and education.

There is an ongoing debate in the tech community about the best approach for publishing these datasets. Several government data standards are available, such as the National Information Exchange Model (NIEM). In the Web 2.0 world, RESTful APIs with Atom, XML, and JSON representation formats have become the norm.

I believe, however, that Semantic Web technologies and Linked Data principles offer unique capabilities for bridging data silos and for querying, reasoning over, and visualizing the data. Again, the methodology for adopting Semantic Web technologies is the same:

  1. Create an OWL ontology that is flexible enough to support the different types of data in the dataset including statistical data. This is certainly the most important and challenging part of the effort.
  2. Convert the data from its source format to RDF. For example, XSLT 2.0 can be used to convert from CSV or TSV to RDF/XML and XHTML+RDFa. There are also RDFizers such as D2R for relational data sources.
  3. Link the data to other data sources such as Geonames, Federal Election Commission (FEC), and US Census datasets.
  4. Provide an RDF dump for Semantic Web Crawlers and a SPARQL endpoint for querying the datasets.
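Steps 2 and 3 can be sketched in a few lines of Python, as an alternative to the XSLT route mentioned above. This is only a minimal illustration under stated assumptions: the `http://example.org/birdstrikes/` vocabulary, the CSV column names (`code`, `name`, `strikes`, `geonames_id`), and the sample rows are all invented here, and a real dataset would use the terms defined by the ontology from step 1. The output is N-Triples, a simple line-based RDF serialization that is convenient for dumps.

```python
import csv
import io
from urllib.parse import quote

# Everything under example.org is a hypothetical vocabulary for this sketch.
BASE = "http://example.org/birdstrikes/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
GEONAMES = "http://sws.geonames.org/"

# Made-up sample rows; the Geonames identifier is a placeholder, not a real one.
SAMPLE_CSV = (
    "code,name,strikes,geonames_id\n"
    "AAA,Example Field,12,1234567\n"
    "BBB,Sample International,7,\n"
)


def escape_literal(text):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text):
    """Turn each CSV row into N-Triples statements (step 2) and add an
    owl:sameAs link to Geonames when an identifier is present (step 3)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<" + BASE + "airport/" + quote(row["code"], safe="") + ">"
        name = escape_literal(row["name"])
        strikes = int(row["strikes"])  # raises ValueError on non-numeric input
        lines.append(f"{subject} <{RDF_TYPE}> <{BASE}Airport> .")
        lines.append(f'{subject} <{BASE}name> "{name}" .')
        lines.append(f'{subject} <{BASE}strikeCount> "{strikes}"^^<{XSD_INTEGER}> .')
        geonames_id = (row.get("geonames_id") or "").strip()
        if geonames_id:
            lines.append(f"{subject} <{OWL_SAME_AS}> <{GEONAMES}{geonames_id}/> .")
    return lines


if __name__ == "__main__":
    for line in csv_to_ntriples(SAMPLE_CSV):
        print(line)
```

The resulting lines can be concatenated into a file and offered as the RDF dump of step 4, or loaded into a triple store that exposes a SPARQL endpoint.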

The following are some of the benefits of this approach:

  • It allows users to ask sophisticated questions against the datasets using the SPARQL query language. These are the kinds of questions that a journalist, a researcher, or a concerned citizen will have in mind. For example, which airport has the highest number of reported aircraft bird strikes? (read more here about why Transportation Secretary Ray LaHood rejected a proposal by the FAA to keep bird strikes data secret). Currently, data.gov provides only full-text and category-based search.
  • It bridges data silos by allowing users to make queries and connect data in meaningful ways across datasets. For example, a query that correlates health care, environment, and census data.
  • It provides powerful visualizations of the data through Semantic Web mashups.
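To make the bird strike question concrete, here is a rough sketch of what such a query could look like, together with a plain-Python equivalent that runs over a handful of in-memory triples. The `ex:` vocabulary, the `ex:occurredAt` property, and the sample triples are assumptions made up for this illustration; they do not come from data.gov. The query uses the aggregate syntax standardized in SPARQL 1.1, so an endpoint limited to SPARQL 1.0 would need a vendor extension or client-side counting like the function below.

```python
from collections import defaultdict

# Hypothetical SPARQL query; the ex: vocabulary is invented for this sketch.
QUERY = """
PREFIX ex: <http://example.org/birdstrikes/>
SELECT ?airport (COUNT(?strike) AS ?total)
WHERE { ?strike ex:occurredAt ?airport . }
GROUP BY ?airport
ORDER BY DESC(?total)
LIMIT 1
"""

# Made-up (subject, predicate, object) triples standing in for a dataset.
TRIPLES = [
    ("ex:strike1", "ex:occurredAt", "ex:airportA"),
    ("ex:strike2", "ex:occurredAt", "ex:airportB"),
    ("ex:strike3", "ex:occurredAt", "ex:airportA"),
    ("ex:strike4", "ex:occurredAt", "ex:airportA"),
    ("ex:strike1", "ex:species", "ex:CanadaGoose"),
]


def airport_with_most_strikes(triples):
    """Mirror the query above: count strikes per airport, return the top one.

    Returns an (airport, total) pair, or None when no strike is recorded.
    """
    counts = defaultdict(int)
    for _subject, predicate, obj in triples:
        if predicate == "ex:occurredAt":
            counts[obj] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(airport_with_most_strikes(TRIPLES))
```

With a SPARQL endpoint in place, the Python function becomes unnecessary: the query string alone expresses the question, which is the point of the first benefit above.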

6 comments:

Kingsley Idehen said...

Joel,

Great post! I would just like to make a vital addition. D2R is one of several technologies for generating RDF-based Linked Data from RDBMS data sources.

Here are links re. RDBMS to RDF middleware realm:

1. http://esw.w3.org/topic/Rdb2RdfXG/StateOfTheArt - Ground zero for RDBMS to RDF Mapping Technology
2. http://tr.im/kiiD - Virtuoso RDF Views for RDBMS Data Sources Page
3. http://tr.im/og81 - Generating Virtuoso RDF Views over RDBMS Data & Linked Data Deployment Example

Joel Amoussou said...

Some interesting Semantic Web benchmarks:

The Berlin SPARQL Benchmark (BSBM)

The Lehigh University Benchmark (LUBM)

Joel Amoussou said...

Tim Berners-Lee on "Putting Government Data Online"

Roy Roebuck's One World said...

This is an excellent article for both the business reader and techie.

I fully agree with your steps 1, 2, and 4, having advocated the same approach for several years.

I submit that step 3 would lead to the common consequence of all many-to-many integration efforts - infinite interface requirements.

I submit instead that once the specific OWL ontology for a given viewpoint is built in step 2 (and its likely multiple variants), it should be mapped to an intermediate ontology designed as a generalized model capable of categorizing the parts, relations, and attributes of all specific ontologies. I describe my public domain approach at http://gem-ema.one-world-is.org (registration is required). The intermediate ontology then serves as a "hub" by which views across diverse ontologies can be integrated, leading to a potential of virtual unification of those ontologies, and translation between them.

Further, to keep the intermediate ontology and specific ontologies continually refreshed in both structure and content, apply an integrated terminology management approach that builds the complete spectrum of semantic products ending in a shared thesaurus for translating jargon and natural languages.

Roy

Joel Amoussou said...

Your idea of an intermediate ontology is very interesting. As I said in the post, ontology design will be the most challenging part of the effort. So proposals like yours are welcome. Publishing to the Linked Open Data (LOD) cloud does not require an ontology. But I do believe that foundational ontologies (such as FOAF, DC, geonames, and SIOC) are important for the success of the Semantic Web.

In step 3, what I am proposing is to create RDF links between different data sets using dereferenceable URIs as specified by Tim Berners-Lee in Linked Data Design Issues.

The advantage of RDF links is that they facilitate entity correlation. For example, linking campaign contributions from fec.gov to voting records from senate.gov and locations in geonames.org.

Joel Amoussou said...

FlowingData creates a nice visualization of aircraft bird strike data.