
Sunday, July 26, 2009

From Web 2.0 to the Semantic Web: Bridging the Gap in Newsmedia

In this presentation, I explain the Semantic Web value proposition for the newsmedia industry and propose some concrete steps to bridge the gap.

Welcome to the world of news in the Web 3.0 era.

Thursday, March 26, 2009

News Content APIs: Uniform Access or Content Silos

The first generation of online news content applications was designed for consumption by humans. With the massive amounts of online content now available, machine-processable structured data will be the key to findability and relevance. Major news organizations like The New York Times (NYT), National Public Radio (NPR), and The Guardian have recently opened up their content repositories through APIs.

These APIs have generated a lot of excitement in the content developer community and are certainly a significant step forward in the evolution of how news content is processed and consumed on the web. The APIs allow developers to create interesting new mashup applications. An example of such a mashup is a map of the United States showing how the stimulus money is being spent by municipalities across the country, with hotspots to local newspaper articles about corruption investigations related to the spending. The stimulus spending data will be provided by the Stimulus Feed on the recovery.gov site, as specified by the "Initial Implementation Guidance for the American Recovery and Reinvestment Act" document. This is certainly an example of a mashup that US taxpayers will like.

For news organizations, these APIs represent an opportunity to grow their ad network by pushing their content to more sites on the web. That's the idea behind the recent release of The Guardian Open Platform API.

APIs and Content Silos

The emerging news content APIs typically offer a REST or SOAP web services interface and return content in XML, JSON, or ATOM feeds. However, despite the excitement that they generate, these APIs can quickly turn into content silos for the following reasons:

  • The structure of the content is often based on a proprietary schema. This introduces several potential interoperability issues for API users in terms of content structure, content types, and semantics.
  • It is not trivial to link content across APIs.
  • Each API provides its own query syntax. There is a need for universal data browsers and a query language to read, navigate, crawl, and query structured content from different sources.

XML, XSD, XSLT, and XQuery

Migrating content from HTML to XML (so called document-oriented XML) has many benefits. XSLT enables media-independent publishing (single sourcing) to multiple devices such as Amazon's Kindle e-reader and "smart phones". With XQuery, sophisticated contextualized database-like queries can be performed, turning the content itself into a database. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.
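
To make the dynamic assembly idea concrete, here is a minimal XQuery sketch; the collection name and element structure are hypothetical, not taken from any particular repository:

```xquery
(: Sketch: assemble a compound "digest" document on the fly from
   stored articles; "articles" and the element names are made up :)
<newsDigest date="{current-date()}">
{
  for $article in collection("articles")//article
  where $article/metadata/subject = "economy"
  order by $article/metadata/date descending
  return
    <item>
      <headline>{ $article/headline/text() }</headline>
      <summary>{ $article/abstract/text() }</summary>
    </item>
}
</newsDigest>
```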

However, XSD, XSLT, and XQuery operate at the syntax level. The next level up in the content technology stack is semantics and reasoning, and that's where RDF, OWL, and SPARQL come into play. To illustrate the issue, consider three news organizations, each with its own XML Schema for describing news articles. To describe the author of an article, the first news organization uses the <creator> element, the second the <byline> element, and the third the <author> element. These three distinct element names have exactly the same meaning. Using an OWL ontology, we can establish that these three terms are equivalent.
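
A minimal OWL sketch of this alignment, using made-up namespaces for the three organizations (only the owl: terms are standard):

```turtle
# Hypothetical namespaces for the three news organizations
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix orgA: <http://example.org/newsorgA/terms#> .
@prefix orgB: <http://example.org/newsorgB/terms#> .
@prefix orgC: <http://example.org/newsorgC/terms#> .

orgA:creator owl:equivalentProperty orgB:byline .
orgB:byline  owl:equivalentProperty orgC:author .
# equivalentProperty is transitive, so a reasoner can infer that
# orgA:creator is also equivalent to orgC:author .
```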

Semantic Web and Linked Data to the Rescue

Semantic web technologies such as RDF, OWL, and SPARQL can help us close the semantic gap and also open up new opportunities for publishers. Furthermore, with the decline in ad revenues, news organizations are now considering charging users for accessing content online. Semantic web technologies can enrich content by providing new ways to discover and explore content based on user context and interests. An interesting example is a mashup application built by Zemanta called Guardian topic researchr, which extracts entities (people, places, organizations, etc.) from The Guardian Open Platform API query results and allows readers to explore these entities further. In addition, the recently unveiled Newssift site by the Financial Times is an indication that the industry is starting to pay attention to the benefits of "semantic search" as opposed to keyword search.

The rest of this post outlines some practical steps for migrating news content to the Semantic Web. For existing news content APIs, an interim solution is to create Semantic Web wrappers around these APIs (more on that later). The long term objective however should be to fully embrace the Semantic Web and adopt Linked Data principles in publishing news content.

Adopt the International Press Telecommunication Council (IPTC) News Architecture (NAR)

The main reason for adopting the NAR is interoperability at the content structure, content types, and semantic levels. Imagine a mashup developer trying to integrate news content from three different news organizations. In addition to using three different element names (<creator>, <byline>, and <author>) to describe the same concept, these three organizations use completely different XML Schemas to describe the structure and types of their respective news content. That can lead to a data mapping nightmare for the mashup developer and the problem will only get worse as the number of news sources increases.

The NAR content model defines four high level elements: newsItem, packageItem, conceptItem, and knowledgeItem. You don't have to manage your content internally using the XML structure defined by the NAR. However, you should be able to map and export your content to the NAR as a delivery format. If you have fields in your content repository that do not map to the NAR structure, then you should extend the standard NAR XML Schema using the appropriate XML Schema extension mechanism that allows you to clearly identify your extension elements in your own XML namespace. Provide a mechanism such as dereferenceable URIs to allow users to obtain the meaning of these extension elements.
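
As a sketch, a publisher-specific extension element could be declared in its own namespace like this (the element name and namespace are hypothetical):

```xml
<!-- Hypothetical extension schema: adds a publisher-specific field in
     its own namespace so consumers can recognize non-standard elements -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/acme-news/ext"
           xmlns:ext="http://example.org/acme-news/ext"
           elementFormDefault="qualified">
  <xs:element name="printSectionCode" type="xs:string"/>
</xs:schema>
```

Instances delivered in the NAR structure can then carry ext:printSectionCode without colliding with standard NAR elements, and the extension namespace URI can dereference to documentation describing its meaning.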

The same logic applies to the news taxonomy that you use. Adopting the IPTC NewsCodes, which specify 1300 terms used for categorizing news content, will greatly facilitate interoperability as well.

Adopt or Create a News Ontology

Several news ontologies in RDFS or OWL format are now available. The IPTC is in the process of creating an IPTC news ontology in OWL format. To facilitate semantic interoperability, news organizations should use this ontology when it becomes available. In mapping XML Schemas into OWL, ontology best practices should be followed. For example, if mapped automatically, container elements in the XML Schema could generate blank nodes in the RDF graph. However, blank nodes cannot be used for external RDF links and are not recommended for Linked Data applications. Also, RDF reification, RDF containers, and RDF collections are not SPARQL-friendly and should be avoided.

While creating the news ontology, you should reuse or link to other existing ontologies such as FOAF and Dublin Core using OWL elements like owl:equivalentProperty, owl:equivalentClass, rdfs:subClassOf, or rdfs:subPropertyOf.

Similarly, existing taxonomies should be mapped to an RDF-compatible format using the SKOS specification. This makes it possible to use an owl:Restriction to constrain the value of a property in the OWL ontology to be a skos:Concept or skos:ConceptScheme.
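
A short SKOS/OWL sketch of this pattern, with a hypothetical subject property and concept scheme:

```turtle
# Sketch: a subject code modeled as a skos:Concept, plus an OWL
# restriction constraining a (made-up) ex:subject property to SKOS concepts
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/news#> .

ex:economy a skos:Concept ;
    skos:prefLabel "Economy, business and finance"@en ;
    skos:inScheme ex:subjectCodes .

ex:NewsItem rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ex:subject ;
    owl:allValuesFrom skos:Concept
] .
```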

Generate RDF Data

Assign a dereferenceable HTTP URI to each news item and use content negotiation to provide both an XHTML and an RDF/XML representation of the resource. When the resource is requested, an HTTP 303 See Other redirect is used to serve XHTML or RDF/XML depending on whether the client's Accept header is text/html or application/rdf+xml. The W3C Best Practice Recipes for Publishing RDF Vocabularies explains how dereferenceable URIs and content negotiation work in the Semantic Web.
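
The redirect rule can be sketched as a small Python function; the /id/, /doc/, and /data/ URI layout used here is a common convention, not a requirement:

```python
# Minimal sketch of the 303 content-negotiation rule described above.
# The URI layout (/id/, /doc/, /data/) is an assumed convention.

def negotiate(resource_uri: str, accept_header: str) -> tuple[int, str]:
    """Return (status, location) for a request to a news item's identifier URI."""
    slug = resource_uri.rsplit("/", 1)[-1]
    if "application/rdf+xml" in accept_header:
        # Semantic Web client: redirect to the RDF/XML representation
        return 303, f"/data/{slug}"
    # Default: redirect a human reader to the XHTML page
    return 303, f"/doc/{slug}"

status, location = negotiate("/id/articles/stimulus-map", "application/rdf+xml")
print(status, location)  # 303 /data/stimulus-map
```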

The RDF data can be generated using a variety of techniques. For example, you can use an XSLT-based RDFizer to extract RDF/XML from news items already marked up in XML. There are also RDFizers for relational databases. Entity extraction tools like Open Calais can also be useful, particularly for extracting RDF metadata from legacy news items available in HTML format.
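
For illustration, a minimal XSLT "RDFizer" might map a hypothetical headline/byline structure to Dublin Core RDF/XML like this:

```xml
<!-- Sketch: maps a made-up <article> structure to Dublin Core RDF/XML;
     only the xsl:, rdf:, and dc: namespaces are standard -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:template match="/article">
    <rdf:RDF>
      <rdf:Description rdf:about="{@uri}">
        <dc:title><xsl:value-of select="headline"/></dc:title>
        <dc:creator><xsl:value-of select="byline"/></dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>
</xsl:stylesheet>
```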

Link the RDF data to external data sources such as DBPedia and Geonames using RDF links from existing vocabularies such as FOAF. For example, an article about US Treasury Secretary Timothy Geithner can use foaf:based_near to link the news item to a resource describing Washington, DC on DBPedia. If there is an HTTP URI that describes the same resource in another data source, then use owl:sameAs links to link the two resources. For example, if a news item is about Timothy Geithner, then you can use owl:sameAs to link to Timothy Geithner's data page on DBPedia. An RDF browser like Tabulator can traverse those links and help the reader explore more information about topics of interest.
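
In Turtle, such links might look like this; the article and person URIs are made up, while the DBpedia targets and the FOAF/OWL/DC terms are real:

```turtle
# Sketch of external RDF links for a hypothetical Geithner article
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .

<http://example.org/id/articles/geithner-plan>
    dc:subject <http://dbpedia.org/resource/Timothy_Geithner> ;
    foaf:based_near <http://dbpedia.org/resource/Washington,_D.C.> .

# Our local resource for the person is declared identical to DBpedia's
<http://example.org/id/people/timothy-geithner>
    owl:sameAs <http://dbpedia.org/resource/Timothy_Geithner> .
```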


Expose a SPARQL Endpoint

Use a Semantic Sitemap (an extension to the Sitemap Protocol) to specify the location of the SPARQL endpoint or an RDF dump for Semantic Web clients and crawlers. OpenLink Virtuoso is an RDF store that also provides a SPARQL endpoint.
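
Once the endpoint is up, a client could, for example, ask for items about a given DBpedia resource; the choice of the dc: vocabulary here is an assumption:

```sparql
# Sketch: find news items whose subject is the DBpedia Geithner resource
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?item ?title
WHERE {
  ?item dc:subject <http://dbpedia.org/resource/Timothy_Geithner> ;
        dc:title ?title .
}
LIMIT 10
```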

Provide a user interface for performing semantic searches. Expose the RDF metadata as facets for browsing the news items.

Provide a Semantic Web Wrapper for Existing APIs

A wrapper provides a dereferenceable URI for every news item available through an existing news content API. When an RDF browser requests the news item, the Semantic Web wrapper translates the request into an API call, transforms the response from XML into RDF, and sends it back to the Semantic Web client. The RDF Book Mashup is an example of how a Semantic Web Wrapper can be used to integrate publicly available APIs from Amazon, Google, and Yahoo into the Semantic Web.
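
The URI-rewriting step of such a wrapper can be sketched in Python; the wrapper path layout and the API URL pattern are hypothetical:

```python
# Sketch of the first step in a Semantic Web wrapper: translating a
# dereferenced wrapper URI into an underlying content API call.
# Both URL patterns below are made up for illustration.

def wrapper_to_api_call(request_path: str, api_key: str) -> str:
    """Translate /data/article/{id} into the underlying content API URL."""
    prefix = "/data/article/"
    if not request_path.startswith(prefix):
        raise ValueError("not a wrapper URI: " + request_path)
    article_id = request_path[len(prefix):]
    # The XML response from this URL would then be transformed to RDF
    # (e.g. via XSLT) before being returned to the Semantic Web client.
    return (f"https://api.example.com/v1/articles/{article_id}.xml"
            f"?api-key={api_key}")

print(wrapper_to_api_call("/data/article/12345", "DEMO"))
# https://api.example.com/v1/articles/12345.xml?api-key=DEMO
```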

Conclusion

The Semantic Web is still an obscure topic in the mainstream developer community. I hope I've outlined a few practical steps you can take now to take advantage of the new Web of Linked Data.

Monday, December 29, 2008

From Web 2.0 to the Semantic Web: Bridging the Gap in the News and Media Industry

The news and media industry is going through fundamental changes. This transformation is driven by the current economic downturn and the emergence of the web as the new platform for creating and publishing content. The decline in ad revenues has forced some media companies to cancel their print publications (e.g. PC Magazine and the Christian Science Monitor). Others such as Forbes Media are consolidating their online and editorial groups into a single entity.

What are the opportunities and challenges of the Semantic Web (sometimes referred to as Web 3.0) for the industry and how can these companies embrace and extend the new web of Linked Data?

Widgets and APIs

Media companies are looking for ways to gain a competitive advantage from their web offerings. In addition to Web 2.0 features like blogs and RSS feeds, major news organizations such as the New York Times (NYT) and National Public Radio (NPR) are opening their content to external developers through APIs. These APIs allow developers to mash up content in new ways limited only by their own imagination. The ultimate goal is to drive ad revenues by pushing content to places like social networking sites and blogs where readers “hang out” online. Another interesting example is the Times Widget (from the NYT) which allows readers to insert NYT content such as news headlines, stock quotes, and recipes on their personal web pages or blogs.

Beyond Web 2.0: the Semantic Web

Today's end users obtain information from a variety of sources such as online newspapers, blogs, Wikipedia, YouTube videos, Flickr photos, cartoons, animations, and social networking sites such as Facebook and Twitter. I personally get some of the news I read from the people I follow on Twitter (including political cartoons at http://twitter.com/dcagle). The challenge is to integrate all these sources of information into a seamless search and browsing experience for readers. Publishers are starting to realize the importance of this integration as illustrated by the recently unveiled "Times Extra" feature from the NYT.

The key to this integration is metadata and this is where Semantic Web technologies such as RDF, OWL, and SPARQL have an important role to play in presenting massive amounts of content to end users in a way that is compelling. Semantic search and browsing as well as inference capabilities can help publishers in their effort to attract and retain readers and boost ad revenues.

News and Media Industry Metadata Standards

Established in 1965, the International Press Telecommunications Council (IPTC) is a consortium of news agencies and publishers including the Associated Press, the NYT, Reuters, and Dow Jones & Company. The IPTC maintains and publishes a set of news exchange and metadata standards for media types such as text, photos, graphics, and streaming media like audio and video. These include:

  • NITF which defines the content and structure of news articles
  • NewsML 1 for the packaging and exchange of multimedia news
  • NewsML-G2, the latest standard for the exchange of all kinds of media types
  • EventsML-G2 for news events
  • SportsML for sport data

These standards are all based on XML and use XML Schema Definitions (XSDs) to describe the structure and content of the media types. In addition, IPTC defines a taxonomy for news items called NewsCodes as well as a Photo Metadata Standard based on Adobe's XMP specification.

Another interesting standard in the news and media industry is the Publishing Requirements for Industry Standard Metadata (PRISM) specification. Also based on XML, PRISM is more applicable to magazines and journals and is compatible with RDF. There is a certain degree of overlap between PRISM and the IPTC news metadata standards. Media companies that have adopted these standards are well positioned to bridge the gap to the Semantic Web.

From XML/XSD to RDF/OWL

The W3C Semantic Web Activity and the Semantic Web community have been working on a number of specifications and tools to facilitate the transition to the Semantic Web. The W3C Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification defines a method for using XSLT to extract RDF statements from existing XML documents. For example, using GRDDL, a news organization would be able to generate RDF from content items marked up in NITF or NewsML.

The recently approved RDFa W3C specification allows publishers to get their content ready for the Semantic Web by adding extension attributes to XHTML content to capture semantic information based on Dublin Core or FOAF for example.
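
A small XHTML+RDFa sketch using Dublin Core properties (the article content and URI are made up):

```html
<!-- Sketch: RDFa attributes embedded in an article page; only the
     Dublin Core terms are standard, the content is hypothetical -->
<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="/articles/stimulus-map">
  <h1 property="dc:title">Tracking the Stimulus</h1>
  <span property="dc:creator">Jane Reporter</span>
  <span property="dc:date" content="2009-03-26">March 26, 2009</span>
</div>
```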

The W3C Simple Knowledge Organization System (SKOS) is an application of RDF that can be used to represent existing taxonomies and classification schemes (such as the IPTC NewsCodes) in a way that is compatible with the Semantic Web. SPARQL is a query language for RDF that is supported in Semantic Web toolkits such as Jena.

There are also tools that can be used to derive an OWL ontology from an existing XSD. This transformation can be straightforward if the XSD was designed to be compatible with RDF. As an example, an OWL ontology can be derived from the IPTC's NewsML-G2 XSDs. Once created, new ontologies can be linked to existing ontologies such as Dublin Core, FOAF, and DBPedia as described by Tim Berners-Lee in his vision of Linked Data. Other relevant ontologies for the news and media industry include MPEG-7, the Core Ontology for Multimedia (COMM), and Geonames.

From Web 2.0 Tagging to Semantic Annotations

To facilitate the transition of the industry to the Semantic Web, it will be important to design content management systems (CMS) interfaces that make it easy for content contributors to add semantic annotations to their content. These systems certainly have a lot to learn from Web 2.0 tagging interfaces such as Flickr clustering and machine tags. However the complexity of content in the news and media industry demands more sophisticated annotation capabilities than those that are available to the masses on YouTube and Flickr. Therefore full support for Semantic Web standards like RDF, OWL, SKOS, and SPARQL will be expected.

An interesting application in this space is the Thomson Reuters Calais service, which allows content publishers to automatically extract RDF-based metadata on entities such as people, places, facts, and events. LinkedFacts.com is an example of a semantic news search application powered by Calais.

Calais has been integrated with the open source content management system (CMS) Alfresco to enable auto-tagging of content as well as the automatic suggestion of tags to content contributors. The latest release of Calais adds the ability to generate Linked Data to other sources such as DBPedia, GeoNames, and the CIA World Fact Book.

Conclusion

With the rapid growth of news content online, relevance is going to become very important. This will require metadata and structured data in general. By exposing content in XML format, news content APIs are certainly a step in the right direction. However, these APIs have their own limitations and can turn into content silos over time. Beyond APIs, Semantic Web technologies and Linked Data principles will ensure uniform and intelligent access to news content. This will be the key to reader retention and content monetization.