Thursday, June 11, 2009

Publishing Government Data to the Linked Open Data (LOD) Cloud

In a previous post, I outlined a roadmap for migrating news content to the Semantic Web and the Linked Open Data (LOD) cloud. The BBC has been doing some interesting work in that space by using Linked Data principles to connect BBC Programmes and BBC Music to MusicBrainz and DBpedia. SPARQL endpoints are now available for querying the BBC datasets.
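To give a flavor of what these endpoints enable, here is a minimal sketch in Python using the SPARQLWrapper library to query the public DBpedia endpoint. The endpoint URL and the DBpedia class and property names used here are assumptions for illustration only, not a description of the BBC datasets themselves.

    # A minimal sketch of querying a public SPARQL endpoint from Python.
    # Assumes the SPARQLWrapper package is installed; the endpoint URL and
    # the DBpedia class/property names are illustrative assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>

        # A handful of musical artists and their English labels
        SELECT ?artist ?name WHERE {
            ?artist a dbo:MusicalArtist ;
                    rdfs:label ?name .
            FILTER (lang(?name) = "en")
        }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["artist"]["value"], "-", row["name"]["value"])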

It is clear that Europe is ahead of the US and Canada in terms of Semantic Web research and adoption. The Europeans are likely to further extend their lead with the announcement this week that Tim Berners-Lee (the visionary behind the World Wide Web, the Semantic Web, and the Linked Open Data movement) will be advising the UK Government on making government data more open and accessible.

In the US, data.gov is part of the Open Government Initiative of the Obama Administration. The following is an excerpt from data.gov:

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government. The openness derived from Data.gov will strengthen our Nation's democracy and promote efficiency and effectiveness in Government.

Governments around the world have taken notice and are now considering similar initiatives. It is clear that these initiatives are important for the proper functioning of democracy since they allow citizens to make informed decisions based on facts as opposed to the politicized views of special interests, lobbyists, and their spin doctors. These facts are related to important subjects such as health care, the environment, the criminal justice system, and education. There is an ongoing debate in the tech community about the best approach for publishing these datasets. There are several government data standards available such as the National Information Exchange Model (NIEM). In the Web 2.0 world, RESTful APIs with ATOM, XML, and JSON representation formats have become the norm.

I believe, however, that Semantic Web technologies and Linked Data principles offer unique capabilities for bridging data silos and for querying, reasoning over, and visualizing the data. Again, the methodology for adopting Semantic Web technologies is the same:

  1. Create an OWL ontology that is flexible enough to support the different types of data in the dataset, including statistical data. This is certainly the most important and challenging part of the effort.
  2. Convert the data from its source format to RDF. For example, XSLT 2.0 can be used to convert from CSV or TSV to RDF/XML and XHTML+RDFa. There are also RDFizers such as D2R for relational data sources. A sketch of this conversion step follows the list.
  3. Link the data to other data sources such as GeoNames, Federal Election Commission (FEC), and US Census datasets.
  4. Provide an RDF dump for Semantic Web crawlers and a SPARQL endpoint for querying the datasets.
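The following sketch illustrates steps 2 through 4 in Python with the rdflib library. The ontology namespace, the CSV column layout, the GeoNames identifiers, and the file names are all hypothetical; it is only meant to show the shape of the conversion, not an actual data.gov pipeline.

    # A minimal sketch of steps 2-4: converting tabular data to RDF with rdflib
    # and linking it to an external dataset (GeoNames). The ontology namespace,
    # the CSV layout (airport_code, airport_name, geonames_id, bird_strikes),
    # and the input file name are hypothetical and serve only as an example.
    import csv
    from rdflib import Graph, Literal, Namespace, RDF, URIRef
    from rdflib.namespace import RDFS, XSD

    GOV = Namespace("http://example.gov/ontology/transport#")   # assumed ontology
    GEO = Namespace("http://sws.geonames.org/")                  # GeoNames resources

    g = Graph()
    g.bind("gov", GOV)

    with open("bird_strikes.csv") as f:                          # hypothetical file
        for row in csv.DictReader(f):
            airport = URIRef(f"http://example.gov/id/airport/{row['airport_code']}")
            g.add((airport, RDF.type, GOV.Airport))
            g.add((airport, RDFS.label, Literal(row["airport_name"])))
            # Link to the corresponding GeoNames resource (step 3)
            g.add((airport, GOV.location, GEO[row["geonames_id"] + "/"]))
            g.add((airport, GOV.reportedBirdStrikes,
                   Literal(row["bird_strikes"], datatype=XSD.integer)))

    # Step 4: publish the result as an RDF dump for crawlers and load it
    # into a triple store that exposes a SPARQL endpoint.
    g.serialize(destination="bird_strikes.rdf", format="xml")

The resulting RDF/XML file can be published as a dump for Semantic Web crawlers and loaded into any triple store that exposes a SPARQL endpoint.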

The following are some of the benefits of this approach:

  • It allows users to ask sophisticated questions against the datasets using the SPARQL query language. These are the kinds of questions that a journalist, a researcher, or a concerned citizen will have in mind. For example, which airport has the highest number of reported aircraft bird strikes? (Read more here about why Transportation Secretary Ray LaHood rejected a proposal by the FAA to keep bird strikes data secret.) Currently, data.gov provides only full-text and category-based search. A query sketch follows this list.
  • It bridges data silos by allowing users to make queries and connect data in meaningful ways across datasets. For example, a query that correlates health care, environmental, and census data.
  • It provides powerful visualizations of the data through Semantic Web mashups.
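As an illustration of the first point, here is a sketch of the bird-strikes question expressed as a SPARQL query and run with rdflib against the RDF produced by the conversion sketch above. The ontology terms are the same hypothetical ones, and no actual data.gov endpoint is assumed.

    # A sketch of "which airport has the most reported bird strikes?", run
    # against the RDF produced by the conversion sketch above. The ontology
    # terms are the same hypothetical ones; no real data.gov endpoint exists.
    from rdflib import Graph

    g = Graph()
    g.parse("bird_strikes.rdf", format="xml")

    query = """
        PREFIX gov:  <http://example.gov/ontology/transport#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

        SELECT ?name ?strikes WHERE {
            ?airport a gov:Airport ;
                     rdfs:label ?name ;
                     gov:reportedBirdStrikes ?strikes .
        }
        ORDER BY DESC(?strikes)
        LIMIT 1
    """

    for name, strikes in g.query(query):
        print(f"{name}: {strikes} reported bird strikes")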

Tuesday, June 2, 2009

S1000D and SCORM Integration

This is a presentation I gave at the DocTrain Boston 07 conference on how to reduce product lifecycle costs by integrating the S1000D and SCORM specifications.

S1000D is the International Specification for Technical Publications utilizing a Common Source Database (CSDB). Based on open XML standards, the latest issue (4.0) has been developed by the AeroSpace and Defence Industries Association of Europe (ASD), the Aerospace Industries Association of America (AIA), and the Air Transport Association of America (ATA).

Sharable Content Object Reference Model (SCORM) is a specification for online learning content developed by the Advanced Distributed Learning (ADL) Initiative.

The presentation has been updated to reflect the addition of SCORM support in S1000D 4.0.