Monday, December 29, 2008

From Web 2.0 to the Semantic Web: Bridging the Gap in the News and Media Industry

The news and media industry is going through fundamental changes. This transformation is driven by the current economic downturn and the emergence of the web as the new platform for creating and publishing content. The decline in ad revenues has forced some media companies to cancel their print publications (e.g. PC Magazine and the Christian Science Monitor). Others such as Forbes Media are consolidating their online and editorial groups into a single entity.

What are the opportunities and challenges of the Semantic Web (sometimes referred to as Web 3.0) for the industry and how can these companies embrace and extend the new web of Linked Data?

Widgets and APIs

Media companies are looking for ways to gain a competitive advantage from their web offerings. In addition to Web 2.0 features like blogs and RSS feeds, major news organizations such as the New York Times (NYT) and National Public Radio (NPR) are opening their content to external developers through APIs. These APIs allow developers to mash up content in new ways limited only by their own imagination. The ultimate goal is to drive ad revenues by pushing content to places like social networking sites and blogs where readers “hang out” online. Another interesting example is the Times Widget (from the NYT), which allows readers to insert NYT content such as news headlines, stock quotes, and recipes on their personal web pages or blogs.

Beyond Web 2.0: the Semantic Web

Today's end users obtain information from a variety of sources such as online newspapers, blogs, Wikipedia, YouTube videos, Flickr photos, cartoons, animations, and social networking sites such as Facebook and Twitter. I personally get some of the news I read from the people I follow on Twitter (including political cartoons). The challenge is to integrate all these sources of information into a seamless search and browsing experience for readers. Publishers are starting to realize the importance of this integration, as illustrated by the recently unveiled "Times Extra" feature from the NYT.

The key to this integration is metadata, and this is where Semantic Web technologies such as RDF, OWL, and SPARQL have an important role to play: they help present massive amounts of content to end users in a compelling way. Semantic search and browsing, as well as inference capabilities, can help publishers in their effort to attract and retain readers and boost ad revenues.

News and Media Industry Metadata Standards

Established in 1965, the International Press Telecommunications Council (IPTC) is a consortium of news agencies and publishers including the Associated Press, the NYT, Reuters, and the Dow Jones Company. The IPTC maintains and publishes a set of news exchange and metadata standards for media types such as text, photos, graphics, and streaming media like audio and video. This includes:

  • NITF which defines the content and structure of news articles
  • NewsML 1 for the packaging and exchange of multimedia news
  • NewsML-G2, the latest standard for the exchange of all kinds of media types
  • EventsML-G2 for news events
  • SportsML for sport data

These standards are all based on XML and use XML Schema Definitions (XSDs) to describe the structure and content of the media types. In addition, IPTC defines a taxonomy for news items called NewsCodes as well as a Photo Metadata Standard based on Adobe's XMP specification.

Another interesting standard in the news and media industry is the Publishing Requirements for Industry Standard Metadata (PRISM) specification. Also based on XML, PRISM is more applicable to magazines and journals and is compatible with RDF. There is a certain degree of overlap between PRISM and the IPTC news metadata standards. Media companies that have adopted these standards are well positioned to bridge the gap to the Semantic Web.


The W3C Semantic Web Activity and the Semantic Web community have been working on a number of specifications and tools to facilitate the transition to the Semantic Web. The W3C Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification defines a method for using XSLT to extract RDF statements from existing XML documents. For example, using GRDDL, a news organization would be able to generate RDF from content items marked up in NITF or NewsML.
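
As a rough illustration of the kind of mapping a GRDDL transformation performs, here is a minimal Python sketch that reads a simplified NITF-like document and emits RDF statements in N-Triples form. It is not GRDDL itself, which relies on XSLT; the Python standard library is used here only as a stand-in. The sample document, the article URI, and the choice of Dublin Core properties are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NITF-like article (real NITF is much richer).
NITF_SAMPLE = """
<nitf>
  <head>
    <title>Sample headline</title>
  </head>
  <body>
    <body.head>
      <byline>By A. Reporter</byline>
    </body.head>
  </body>
</nitf>
"""

DC = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from element paths to Dublin Core properties.
MAPPING = [
    ("head/title", DC + "title"),
    (".//byline", DC + "creator"),
]


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def nitf_to_ntriples(xml_text, article_uri):
    """Return one N-Triples line per mapped element found in the document."""
    root = ET.fromstring(xml_text)
    triples = []
    for path, predicate in MAPPING:
        node = root.find(path)
        if node is not None and node.text:
            value = escape_literal(node.text.strip())
            triples.append(f'<{article_uri}> <{predicate}> "{value}" .')
    return triples


if __name__ == "__main__":
    for line in nitf_to_ntriples(NITF_SAMPLE, "http://example.org/articles/1"):
        print(line)
```

In a real GRDDL setup, the same mapping would live in an XSLT stylesheet referenced from the XML document or its namespace, so that any GRDDL-aware agent could apply it.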

The recently approved W3C RDFa specification allows publishers to get their content ready for the Semantic Web by adding extension attributes to XHTML content. These attributes capture semantic information based on vocabularies such as Dublin Core or FOAF.
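
To make the idea of RDFa attributes more concrete, the sketch below shows a small XHTML fragment annotated with "about", "property", and "content" attributes, together with a simplified Python reader that turns them into (subject, property, value) statements. This is not a conformant RDFa processor: it handles only these three attributes, and the prefix table is supplied by hand because ElementTree does not expose namespace declarations. The fragment and its URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment carrying RDFa-style annotations.
XHTML_SAMPLE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://example.org/articles/1">
  <h1 property="dc:title">Sample headline</h1>
  <span property="dc:creator">A. Reporter</span>
  <span property="dc:date" content="2008-12-29">December 29, 2008</span>
</div>
"""

# Assumption: prefixes are listed here rather than read from the document.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand_curie(curie):
    """Expand a compact name such as 'dc:title' into a full URI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local


def extract_rdfa(xml_text):
    """Collect (subject, property, value) tuples in document order."""
    statements = []

    def walk(element, subject):
        # An 'about' attribute sets the subject for this element and below.
        subject = element.get("about", subject)
        prop = element.get("property")
        if prop and subject:
            # A 'content' attribute overrides the visible text.
            value = element.get("content")
            if value is None:
                value = "".join(element.itertext()).strip()
            statements.append((subject, expand_curie(prop), value))
        for child in element:
            walk(child, subject)

    walk(ET.fromstring(xml_text), None)
    return statements


if __name__ == "__main__":
    for statement in extract_rdfa(XHTML_SAMPLE):
        print(statement)
```

Note how the date is shown to readers as "December 29, 2008" while the machine-readable value comes from the "content" attribute.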

The W3C Simple Knowledge Organization System (SKOS) is an application of RDF that can be used to represent existing taxonomies and classification schemes (such as the IPTC NewsCodes) in a way that is compatible with the Semantic Web. SPARQL is a query language for RDF that is supported in Semantic Web toolkits such as Jena.
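
The following sketch suggests how a taxonomy expressed with SKOS might be queried. It stores a few skos:broader and skos:prefLabel statements as plain Python tuples and walks the hierarchy to find every concept narrower than a given one. The pattern-matching function is only a stand-in for what a SPARQL engine such as the one in Jena would do, and the topic URIs are hypothetical; they are not actual IPTC NewsCodes.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/topics/"  # hypothetical namespace, not IPTC NewsCodes

# A tiny taxonomy as (subject, predicate, object) statements.
TRIPLES = {
    (EX + "sport", SKOS + "prefLabel", "sport"),
    (EX + "soccer", SKOS + "prefLabel", "soccer"),
    (EX + "soccer", SKOS + "broader", EX + "sport"),
    (EX + "world-cup", SKOS + "prefLabel", "world cup"),
    (EX + "world-cup", SKOS + "broader", EX + "soccer"),
    (EX + "politics", SKOS + "prefLabel", "politics"),
}


def match(s=None, p=None, o=None):
    """Return statements matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def narrower_labels(concept):
    """Return sorted labels of all concepts below 'concept' in the hierarchy."""
    labels = []
    seen = set()
    pending = [concept]
    while pending:
        current = pending.pop()
        # Every concept whose skos:broader is 'current' is one level narrower.
        for child, _, _ in match(p=SKOS + "broader", o=current):
            if child not in seen:
                seen.add(child)
                pending.append(child)
                for _, _, label in match(s=child, p=SKOS + "prefLabel"):
                    labels.append(label)
    return sorted(labels)


if __name__ == "__main__":
    print(narrower_labels(EX + "sport"))
```

A search for articles about "sport" could use such a lookup to also return articles tagged "soccer" or "world cup", which is the kind of semantic search that flat keyword tags cannot easily provide.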

There are also tools that can be used to derive an OWL ontology from an existing XSD. This transformation can be straightforward if the XSD was designed to be compatible with RDF. As an example, an OWL ontology can be derived from the IPTC's NewsML-G2 XSDs. Once created, new ontologies can be linked to existing ontologies and datasets such as Dublin Core, FOAF, and DBpedia, as described by Tim Berners-Lee in his vision of Linked Data. Other relevant ontologies for the news and media industry include MPEG-7, the Core Ontology for Multimedia (COMM), and Geonames.

From Web 2.0 Tagging to Semantic Annotations

To facilitate the transition of the industry to the Semantic Web, it will be important to design content management system (CMS) interfaces that make it easy for content contributors to add semantic annotations to their content. These systems certainly have a lot to learn from Web 2.0 tagging interfaces such as Flickr clustering and machine tags. However, the complexity of content in the news and media industry demands more sophisticated annotation capabilities than those that are available to the masses on YouTube and Flickr. Therefore, full support for Semantic Web standards like RDF, OWL, SKOS, and SPARQL will be expected.

An interesting application in this space is the Thomson Reuters Calais application, which allows content publishers to automatically extract RDF-based metadata on entities such as people, places, facts, and events. Calais has also been used to power a semantic news search application.

Calais has been integrated with the open source content management system (CMS) Alfresco to enable auto-tagging of content as well as the automatic suggestion of tags to content contributors. The latest release of Calais adds the ability to generate Linked Data links to other sources such as DBpedia, GeoNames, and the CIA World Factbook.


With the rapid growth of news content online, relevance is going to become very important. This will require metadata and structured data in general. By exposing content in XML format, news content APIs are certainly a step in the right direction. However, these APIs have their own limitations and can create content silos in the future. Beyond APIs, Semantic Web technologies and Linked Data principles will ensure uniform and intelligent access to news content. This will be the key to reader retention and content monetization.


Rob Gonzalez said...

Honestly, I don't buy the current Semantic Web value pitch for media and publishing vs. technologies like XML. The value that I do see has to do with linking concepts, like authors to articles, or articles to industries, etc.

That said, from a data storage perspective, you can find that value in other publications. If you look at offerings in the media space like MarkLogic's content server, which allows you to assemble documents on the fly from document fragments, and allows you to use XQuery to perform joins on relationships between fragments on the fly, then what does RDF/OWL/SPARQL offer on top of that? They are merely different standards, not a technical panacea.

RDF, as a data model, is a graph, but so is the XQuery data model. And while A TON of the world's data is already in XML, and so is readily available to XQuery, very little data exists in RDF. Furthermore, I don't see advantages that SPARQL has over XQuery as far as expressiveness is concerned (though I'd love to see examples if you can cook them up). OWL I think is the most interesting of the main standards, but it really tries to boil the ocean, and I don't know that I buy that it's much more powerful than XML Schema when it gets right down to it.

I will give you RDFa. That's great stuff for crawlers...if they're made to take advantage of it. A bit of a chicken-and-egg problem there.

Joel Amoussou said...

XML/XSD/XQuery and RDF/OWL/SPARQL are not mutually exclusive. XML/XSD/XQuery is a great combination to create, process, and query XML documents. You can then RDFize those XML assets (in RDF/XML serialized form or RDFa) using XSLT or a GRDDL-based mechanism. You get the extra benefits of semantic searches, inference capabilities, and connecting to the Linked Data web.

Joel Amoussou said...

Found these interesting sources:

XQuery, SPARQL and DBPedia in the same app

Querying Wikipedia with XQuery: WikiXMLDB