Adventures in Computing: December 2008

The news and media industry is going through fundamental changes. This transformation is driven by the current economic downturn and the emergence of the web as the new platform for creating and publishing content. The decline in ad revenues has forced some media companies to cancel their print publications (e.g. PC Magazine and the Christian Science Monitor). Others such as Forbes Media are consolidating their online and editorial groups into a single entity.

What are the opportunities and challenges of the Semantic Web (sometimes referred to as Web 3.0) for the industry and how can these companies embrace and extend the new web of Linked Data?

Widgets and APIs

Media companies are looking for ways to gain a competitive advantage from their web offerings. In addition to Web 2.0 features like blogs and RSS feeds, major news organizations such as the New York Times (NYT) and National Public Radio (NPR) are opening their content to external developers through APIs. These APIs allow developers to mash up content in new ways limited only by their own imagination. The ultimate goal is to drive ad revenues by pushing content to places like social networking sites and blogs where readers “hang out” online. Another interesting example is the Times Widget (from the NYT), which allows readers to insert NYT content such as news headlines, stock quotes, and recipes on their personal web pages or blogs.

Beyond Web 2.0: the Semantic Web

Today's end users obtain information from a variety of sources such as online newspapers, blogs, Wikipedia, YouTube videos, Flickr photos, cartoons, animations, and social networking sites such as Facebook and Twitter. I personally get some of the news I read from the people I follow on Twitter (including political cartoons at http://twitter.com/dcagle). The challenge is to integrate all these sources of information into a seamless search and browsing experience for readers. Publishers are starting to realize the importance of this integration as illustrated by the recently unveiled "Times Extra" feature from the NYT.

The key to this integration is metadata and this is where Semantic Web technologies such as RDF, OWL, and SPARQL have an important role to play in presenting massive amounts of content to end users in a way that is compelling. Semantic search and browsing as well as inference capabilities can help publishers in their effort to attract and retain readers and boost ad revenues.

News and Media Industry Metadata Standards

Established in 1965, the International Press Telecommunications Council (IPTC) is a consortium of news agencies and publishers including the Associated Press, the NYT, Reuters, and the Dow Jones Company. The IPTC maintains and publishes a set of news exchange and metadata standards for media types such as text, photos, graphics, and streaming media like audio and video. This includes:

NITF which defines the content and structure of news articles
NewsML 1 for the packaging and exchange of multimedia news
NewsML-G2, the latest standard for the exchange of all kinds of media types
EventsML-G2 for news events
SportsML for sports data

These standards are all based on XML and use XML Schema Definitions (XSDs) to describe the structure and content of the media types. In addition, IPTC defines a taxonomy for news items called NewsCodes as well as a Photo Metadata Standard based on Adobe's XMP specification.

Another interesting standard in the news and media industry is the Publishing Requirements for Industry Standard Metadata (PRISM) specification. Also based on XML, PRISM is more applicable to magazines and journals and is compatible with RDF. There is a certain degree of overlap between PRISM and the IPTC news metadata standards. Media companies that have adopted these standards are well positioned to bridge the gap to the Semantic Web.

From XML/XSD to RDF/OWL

The W3C Semantic Web Activity and the Semantic Web community have been working on a number of specifications and tools to facilitate the transition to the Semantic Web. The W3C Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification defines a method for using XSLT to extract RDF statements from existing XML documents. For example, using GRDDL, a news organization would be able to generate RDF from content items marked up in NITF or NewsML.
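As a rough illustration of the idea, the sketch below pulls a few RDF statements out of a small NITF-like document and prints them as N-Triples. GRDDL proper does this with an XSLT transformation; Python's ElementTree stands in for it here. The sample markup is heavily simplified, and the article URI and the mapping of elements to Dublin Core properties are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NITF-like article (not a complete NITF document).
NITF_SAMPLE = """
<nitf>
  <head>
    <title>Markets rally on rate cut</title>
  </head>
  <body>
    <body.head>
      <byline>Jane Doe</byline>
      <dateline>2008-12-16</dateline>
    </body.head>
  </body>
</nitf>
"""

DC = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from element paths to Dublin Core properties.
PATH_TO_PROPERTY = {
    "head/title": DC + "title",
    "body/body.head/byline": DC + "creator",
    "body/body.head/dateline": DC + "date",
}


def extract_triples(xml_text, subject_uri):
    """Return N-Triples lines for the mapped elements found in xml_text."""
    root = ET.fromstring(xml_text)
    triples = []
    for path, prop in PATH_TO_PROPERTY.items():
        node = root.find(path)
        if node is not None and node.text:
            # Escape backslashes and quotes so the literal stays valid.
            literal = node.text.strip().replace("\\", "\\\\").replace('"', '\\"')
            triples.append('<%s> <%s> "%s" .' % (subject_uri, prop, literal))
    return triples


triples = extract_triples(NITF_SAMPLE, "http://example.org/articles/1")
for line in triples:
    print(line)
```

Keeping the element-to-property mapping in a table rather than in the code mirrors what the XSLT would do: the transformation is declared once per XML dialect and reused for every document in that dialect.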

The recently approved RDFa W3C specification allows publishers to get their content ready for the Semantic Web by adding extension attributes to XHTML content to capture semantic information based on Dublin Core or FOAF for example.

The W3C Simple Knowledge Organization System (SKOS) is an application of RDF that can be used to represent existing taxonomies and classification schemes (such as the IPTC NewsCodes) in a way that is compatible with the Semantic Web. SPARQL is a query language for RDF that is supported in Semantic Web toolkits such as Jena.
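To make the SKOS idea more concrete, here is a minimal sketch that turns a flat taxonomy table into skos:prefLabel and skos:broader statements and then walks the broader links upward. The subject codes and the example.org namespace are made up for illustration and are not actual NewsCodes; plain Python tuples stand in for an RDF store.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/subject/"  # hypothetical namespace, not the real NewsCodes URIs

# Made-up taxonomy: code -> (label, parent code or None)
TAXONOMY = {
    "sport": ("Sport", None),
    "soccer": ("Soccer", "sport"),
    "worldcup": ("World Cup", "soccer"),
}


def to_skos_triples(taxonomy):
    """Turn the flat taxonomy table into (subject, predicate, object) triples."""
    triples = []
    for code, (label, parent) in taxonomy.items():
        subject = BASE + code
        triples.append((subject, SKOS + "prefLabel", label))
        if parent is not None:
            triples.append((subject, SKOS + "broader", BASE + parent))
    return triples


def broader_chain(triples, concept):
    """Follow skos:broader links upward from concept, nearest ancestor first.

    Assumes the taxonomy has no cycles.
    """
    parents = {s: o for s, p, o in triples if p == SKOS + "broader"}
    chain = []
    while concept in parents:
        concept = parents[concept]
        chain.append(concept)
    return chain


triples = to_skos_triples(TAXONOMY)
ancestors = broader_chain(triples, BASE + "worldcup")
print(ancestors)
```

With the hierarchy expressed this way, an article tagged with a narrow concept can also be found by a search on any of its broader concepts.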

There are also tools that can be used to derive an OWL ontology from an existing XSD. This transformation can be straightforward if the XSD was designed to be compatible with RDF. As an example, an OWL ontology can be derived from the IPTC's NewsML-G2 XSDs. Once created, new ontologies can be linked to existing ontologies such as Dublin Core, FOAF, and DBpedia as described by Tim Berners-Lee in his vision of Linked Data. Other relevant ontologies for the news and media industry include MPEG-7, the Core Ontology for Multimedia (COMM), and Geonames.

From Web 2.0 Tagging to Semantic Annotations

To facilitate the transition of the industry to the Semantic Web, it will be important to design content management system (CMS) interfaces that make it easy for content contributors to add semantic annotations to their content. These systems certainly have a lot to learn from Web 2.0 tagging interfaces such as Flickr clustering and machine tags. However, the complexity of content in the news and media industry demands more sophisticated annotation capabilities than those that are available to the masses on YouTube and Flickr. Therefore, full support for Semantic Web standards like RDF, OWL, SKOS, and SPARQL will be expected.

An interesting application in this space is the Thomson Reuters Calais application, which allows content publishers to automatically extract RDF-based metadata on entities such as people, places, facts, and events. LinkedFacts.com is an example of a semantic news search application powered by Calais.

Calais has been integrated with the open source content management system (CMS) Alfresco to enable auto-tagging of content as well as the automatic suggestion of tags to content contributors. The latest release of Calais adds the ability to generate Linked Data links to other sources such as DBpedia, GeoNames, and the CIA World Factbook.

Conclusion

With the rapid growth of news content online, relevance is going to become very important. This will require metadata and structured data in general. By exposing content in XML format, news content APIs are certainly a step in the right direction. However, these APIs do have their own limitations and can create content silos in the future. Beyond APIs, Semantic Web technologies and Linked Data principles will ensure a uniform and intelligent access to news content. This will be the key to reader retention and content monetization.

In a typical SOA project, several artifacts are represented in XML format. This includes SOAP messages, XML Schema definitions (XSDs), WSDL, WS-Policy, BPMN, BPEL, and various configuration files. The following are some examples of how XSLT 2.0 and XQuery can be leveraged in an SOA project.

Data Model and Data Format Transformation

When the services don't share the same data model (XSD) or the same data format (e.g. EDI vs. XML), there is a need to transform the data. An Enterprise Service Bus (ESB) typically provides data transformation as part of its mediation services. Some developers will find XQuery easier to use than XSLT 2.0 for transforming XML data because XQuery has a SQL-like syntax. In Oracle AquaLogic ESB, XQuery is used for data transformation, while the Apache ServiceMix ESB provides support for Saxon-based XSLT 2.0 and XQuery transformations. An XSLT/XQuery engine can be deployed as a Service Component Architecture (SCA) implementation type or as a Java Business Integration (JBI) service engine.

One aspect of data model or data format transformation that can quickly get out of control and become difficult to manage is the mapping specification, which says that a field X in message A maps to field Y in message B. In addition, the mapping specification defines business rules. This mapping specification is often produced in Excel spreadsheet format by business analysts and handed over to programmers who then code the transformation script. Now you need to maintain and synchronize the Excel mapping document, the source XSD, the target XSD, and the XSLT 2.0 or XQuery transformation script. On top of that, if a user interface is involved, you need to ensure that it is also kept in sync with those changes.

One technique that can be useful is to use an xsd:appinfo element to capture and keep metadata close to the XSD declarations:

Data mapping specifications
Business rules using inline ISO Schematron rules for example
Labels, alerts, and appearances of UI components such as XForms controls.

This allows you to use XSLT 2.0 or XQuery to automatically generate data mapping reports in Excel or even generate UI components by transforming the XSD into XForms controls.
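The following sketch shows what such a report generator could look like. It reads mapping metadata from xsd:appinfo blocks and writes it out as CSV, which a spreadsheet can open. The post has XSLT 2.0 or XQuery in mind for this step; Python is used here only to keep the example self-contained. The map:target element, its field and rule attributes, and the example.org namespace are hypothetical annotation vocabulary invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
MAP = "http://example.org/mapping"  # hypothetical annotation namespace

# Made-up schema whose appinfo blocks carry the mapping specification.
XSD_SAMPLE = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:map="http://example.org/mapping">
  <xsd:element name="CustomerName" type="xsd:string">
    <xsd:annotation>
      <xsd:appinfo>
        <map:target field="cust_nm" rule="trim and upper-case"/>
      </xsd:appinfo>
    </xsd:annotation>
  </xsd:element>
  <xsd:element name="OrderDate" type="xsd:date">
    <xsd:annotation>
      <xsd:appinfo>
        <map:target field="ord_dt" rule="format as YYYYMMDD"/>
      </xsd:appinfo>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
"""


def mapping_rows(xsd_text):
    """Collect (source element, type, target field, rule) from appinfo blocks."""
    root = ET.fromstring(xsd_text)
    rows = []
    for element in root.iter("{%s}element" % XS):
        target = element.find(
            "{%s}annotation/{%s}appinfo/{%s}target" % (XS, XS, MAP)
        )
        if target is not None:
            rows.append((
                element.get("name"),
                element.get("type"),
                target.get("field"),
                target.get("rule"),
            ))
    return rows


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_element", "type", "target_field", "rule"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = mapping_rows(XSD_SAMPLE)
report = to_csv(rows)
print(report)
```

Because the report is derived from the schema, the XSD becomes the single place where a mapping is edited, and the spreadsheet is regenerated rather than maintained by hand.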

Managing Artifacts and Promoting Reuse with an SOA Repository

One of the key aspects of design-time SOA Governance is the management of the lifecycle of service artifacts and the dependencies between them. This is accomplished through a new breed of tools called SOA Repositories. Some of these repositories are being built on top of a JCR-compliant repository such as Apache Jackrabbit. JCR supports querying compliant repositories using XPath 2.0. Suppose that I need to add a new element to my XSD. To ensure that I am reusing existing schema constructs, I first query the SOA Repository with XPath 2.0 to find all schema components (types, elements, and attributes) that contain a certain keyword inside their xsd:documentation element. XPath 2.0 can also help in detecting dependencies between artifacts (e.g. WSDL and XSD definitions) for change impact analysis. Open source SOA repositories such as Mule Galaxy, JBoss DNA, and WSO2 Repository have adopted this approach.
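The keyword search described above can be sketched as follows. A real SOA Repository would run an XPath query against stored artifacts; here an in-memory dictionary of made-up schemas stands in for the repository, and Python's ElementTree does the matching. The file names, component names, and documentation text are all invented for the example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Stand-in for repository content: artifact name -> XSD source (both made up).
REPOSITORY = {
    "customer.xsd": """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="PostalAddress">
    <xsd:annotation>
      <xsd:documentation>Mailing address of a customer.</xsd:documentation>
    </xsd:annotation>
  </xsd:complexType>
  <xsd:element name="Phone" type="xsd:string">
    <xsd:annotation>
      <xsd:documentation>Primary phone number.</xsd:documentation>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
""",
    "order.xsd": """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="ShipTo" type="xsd:string">
    <xsd:annotation>
      <xsd:documentation>Delivery address for the order.</xsd:documentation>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
""",
}

# Schema components worth reusing: elements, attributes, and types.
COMPONENT_TAGS = {XS + "element", XS + "attribute",
                  XS + "complexType", XS + "simpleType"}


def find_components(repository, keyword):
    """Return (artifact, kind, name) for components documented with keyword."""
    keyword = keyword.lower()
    hits = []
    for artifact, source in sorted(repository.items()):
        root = ET.fromstring(source)
        for node in root.iter():
            if node.tag not in COMPONENT_TAGS:
                continue
            doc = node.find(XS + "annotation/" + XS + "documentation")
            if doc is not None and keyword in (doc.text or "").lower():
                hits.append((artifact, node.tag[len(XS):], node.get("name")))
    return hits


hits = find_components(REPOSITORY, "address")
print(hits)
```

The search is only as good as the documentation it reads, so this approach rewards teams that annotate their schema components consistently.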

Functional Testing of Web Services

Automated testing is a key principle in agile software development. SoapUI is an open source functional testing framework for web services. In addition to XSD validation of SOAP messages, it lets testers specify assertions on the structure and content of those messages using XPath 2.0 and XQuery. SoapUI can be easily integrated into a continuous integration process.
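The kind of assertion meant here can be approximated outside SoapUI as well. The sketch below checks the structure and content of a SOAP response using the XPath subset in Python's ElementTree, which is far smaller than XPath 2.0 and is not what SoapUI runs internally. The order service, its namespace, and the allowed status values are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "ord": "http://example.org/orders",  # hypothetical service namespace
}

# Made-up SOAP 1.1 response for an imaginary order service.
RESPONSE = """
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ord:GetOrderResponse xmlns:ord="http://example.org/orders">
      <ord:Order id="42">
        <ord:Status>SHIPPED</ord:Status>
        <ord:Item sku="A1" qty="2"/>
        <ord:Item sku="B7" qty="1"/>
      </ord:Order>
    </ord:GetOrderResponse>
  </soap:Body>
</soap:Envelope>
"""


def check_response(xml_text):
    """Return a list of failure messages; an empty list means all checks passed."""
    root = ET.fromstring(xml_text)
    failures = []

    order = root.find("soap:Body/ord:GetOrderResponse/ord:Order", NS)
    if order is None:
        return ["no Order element in the SOAP body"]

    # Content assertion: the status must be one of the allowed values.
    status = order.findtext("ord:Status", default="", namespaces=NS)
    if status not in {"PENDING", "SHIPPED", "DELIVERED"}:
        failures.append("unexpected status: %r" % status)

    # Structure assertion: at least one item, each with a positive quantity.
    items = order.findall("ord:Item", NS)
    if not items:
        failures.append("order has no items")
    for item in items:
        if int(item.get("qty", "0")) <= 0:
            failures.append("item %s has a non-positive quantity" % item.get("sku"))

    return failures


failures = check_response(RESPONSE)
print(failures)
```

Returning a list of failures instead of stopping at the first one makes the check easier to report on from a continuous integration job.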

Data Integration

XQuery can alleviate performance and scalability issues related to the marshalling/unmarshalling of Java objects to/from XML (databinding) and object-relational mapping (ORM) for persisting XML data in relational databases. XQuery is a natural solution for querying and aggregating data coming from heterogeneous sources such as relational databases, native XML databases, LDAP, file systems, and legacy data formats such as EDI and CSV.
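As a small illustration of aggregating data from heterogeneous sources, the sketch below joins a CSV export with an XML feed and sums order totals per customer. XQuery would express this as a single query over both sources; Python is used here only to keep the example runnable, and the customer and order data are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sources: a CSV export of customers and an XML feed of orders.
CUSTOMERS_CSV = "id,name\n1,Acme Corp\n2,Globex\n"
ORDERS_XML = """
<orders>
  <order customer="1" total="120.50"/>
  <order customer="2" total="80.00"/>
  <order customer="1" total="29.50"/>
</orders>
"""


def totals_by_customer(customers_csv, orders_xml):
    """Join the two sources on customer id and sum order totals per customer name."""
    names = {row["id"]: row["name"]
             for row in csv.DictReader(io.StringIO(customers_csv))}
    totals = {name: 0.0 for name in names.values()}
    for order in ET.fromstring(orders_xml).iter("order"):
        name = names.get(order.get("customer"))
        if name is not None:  # skip orders for unknown customers
            totals[name] += float(order.get("total"))
    return totals


totals = totals_by_customer(CUSTOMERS_CSV, ORDERS_XML)
print(totals)
```

Note that the data is read and combined directly, without first binding it to an intermediate object model, which is the overhead the paragraph above refers to.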

One promising specification in the data integration space is the W3C XQuery Scripting Extension (XQSE). By extending XQuery with imperative features such as state management, XQSE (pronounced "excuse") provides developers with additional XML processing power without the need to embed XQuery in a host language such as Java. At the time of this writing, XQSE is still a W3C working draft but is already supported by the Oracle AquaLogic Data Services Platform (ALDSP 3.0).

Monday, December 29, 2008

From Web 2.0 to the Semantic Web: Bridging the Gap in the News and Media Industry

Wednesday, December 3, 2008

Keeping Data Under Control in SOA with XSLT 2.0 and XQuery
