Monday, December 29, 2008

From Web 2.0 to the Semantic Web: Bridging the Gap in the News and Media Industry

The news and media industry is going through fundamental changes. This transformation is driven by the current economic downturn and the emergence of the web as the new platform for creating and publishing content. The decline in ad revenues has forced some media companies to cancel their print publications (e.g. PC Magazine and the Christian Science Monitor). Others such as Forbes Media are consolidating their online and editorial groups into a single entity.

What are the opportunities and challenges of the Semantic Web (sometimes referred to as Web 3.0) for the industry and how can these companies embrace and extend the new web of Linked Data?

Widgets and APIs

Media companies are looking for ways to gain a competitive advantage from their web offerings. In addition to Web 2.0 features like blogs and RSS feeds, major news organizations such as the New York Times (NYT) and National Public Radio (NPR) are opening their content to external developers through APIs. These APIs allow developers to mash up content in new ways limited only by their imagination. The ultimate goal is to drive ad revenues by pushing content to places like social networking sites and blogs where readers “hang out” online. Another interesting example is the Times Widget from the NYT, which allows readers to insert NYT content such as news headlines, stock quotes, and recipes into their personal web pages or blogs.

Beyond Web 2.0: the Semantic Web

Today's end users obtain information from a variety of sources such as online newspapers, blogs, Wikipedia, YouTube videos, Flickr photos, cartoons, animations, and social networking sites such as Facebook and Twitter. I personally get some of the news I read from the people I follow on Twitter (including political cartoons at http://twitter.com/dcagle). The challenge is to integrate all these sources of information into a seamless search and browsing experience for readers. Publishers are starting to realize the importance of this integration as illustrated by the recently unveiled "Times Extra" feature from the NYT.

The key to this integration is metadata, and this is where Semantic Web technologies such as RDF, OWL, and SPARQL have an important role to play in presenting massive amounts of content to end users in a compelling way. Semantic search and browsing as well as inference capabilities can help publishers in their effort to attract and retain readers and boost ad revenues.

News and Media Industry Metadata Standards

Established in 1965, the International Press Telecommunications Council (IPTC) is a consortium of news agencies and publishers including the Associated Press, the NYT, Reuters, and Dow Jones & Company. The IPTC maintains and publishes a set of news exchange and metadata standards for media types such as text, photos, graphics, and streaming media like audio and video. This includes:

  • NITF which defines the content and structure of news articles
  • NewsML 1 for the packaging and exchange of multimedia news
  • NewsML-G2, the latest standard for the exchange of all kinds of media types
  • EventsML-G2 for news events
  • SportsML for sports data

These standards are all based on XML and use XML Schema Definitions (XSDs) to describe the structure and content of the media types. In addition, IPTC defines a taxonomy for news items called NewsCodes as well as a Photo Metadata Standard based on Adobe's XMP specification.

Another interesting standard in the news and media industry is the Publishing Requirements for Industry Standard Metadata (PRISM) specification. Also based on XML, PRISM is more applicable to magazines and journals and is compatible with RDF. There is a certain degree of overlap between PRISM and the IPTC news metadata standards. Media companies that have adopted these standards are well positioned to bridge the gap to the Semantic Web.

From XML/XSD to RDF/OWL

The W3C Semantic Web Activity and the Semantic Web community have been working on a number of specifications and tools to facilitate the transition to the Semantic Web. The W3C Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification defines a method for using XSLT to extract RDF statements from existing XML documents. For example, using GRDDL, a news organization would be able to generate RDF from content items marked up in NITF or NewsML.
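
As a rough sketch, a GRDDL transformation is essentially an XSLT stylesheet that the source document points to and that emits RDF/XML. Assuming a simplified NITF structure, it could look like this:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- emit one RDF description per news item, using its headline as the dc:title -->
  <xsl:template match="/nitf">
    <rdf:RDF>
      <rdf:Description rdf:about="{head/docdata/doc-id/@id-string}">
        <dc:title><xsl:value-of select="body/body.head/hedline/hl1"/></dc:title>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>
</xsl:stylesheet>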

The recently approved RDFa W3C specification allows publishers to get their content ready for the Semantic Web by adding extension attributes to XHTML content to capture semantic information based on Dublin Core or FOAF for example.
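
A minimal sketch of what this looks like in an XHTML page (the article metadata is invented):

<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- the RDFa property attribute turns visible content into machine-readable statements -->
  <h2 property="dc:title">Bridging the Gap in the News and Media Industry</h2>
  <p>By <span property="dc:creator">Jane Reporter</span>,
     <span property="dc:date">2008-12-29</span></p>
</div>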

The W3C Simple Knowledge Organization System (SKOS) is an application of RDF that can be used to represent existing taxonomies and classification schemes (such as the IPTC NewsCodes) in a way that is compatible with the Semantic Web. SPARQL is a query language for RDF that is supported in Semantic Web toolkits such as Jena.
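
For instance, a subject from a news taxonomy could be expressed as a SKOS concept in RDF/XML along these lines (the URIs and labels are illustrative, not actual NewsCodes):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/newscodes/sport">
    <skos:prefLabel>sport</skos:prefLabel>
    <!-- broader/narrower links preserve the taxonomy's hierarchy -->
    <skos:broader rdf:resource="http://example.org/newscodes/subject"/>
  </skos:Concept>
</rdf:RDF>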

There are also tools that can be used to derive an OWL ontology from an existing XSD. This transformation can be straightforward if the XSD was designed to be compatible with RDF. As an example, an OWL ontology can be derived from the IPTC's NewsML-G2 XSDs. Once created, new ontologies can be linked to existing ontologies such as Dublin Core, FOAF, and DBpedia as described by Tim Berners-Lee in his vision of Linked Data. Other relevant ontologies for the news and media industry include MPEG-7, the Core Ontology for Multimedia (COMM), and GeoNames.

From Web 2.0 Tagging to Semantic Annotations

To facilitate the industry's transition to the Semantic Web, it will be important to design content management system (CMS) interfaces that make it easy for content contributors to add semantic annotations to their content. These systems certainly have a lot to learn from Web 2.0 tagging interfaces such as Flickr clustering and machine tags. However, the complexity of content in the news and media industry demands more sophisticated annotation capabilities than those available to the masses on YouTube and Flickr. Therefore, full support for Semantic Web standards like RDF, OWL, SKOS, and SPARQL will be expected.

An interesting application in this space is Thomson Reuters' Calais, which allows content publishers to automatically extract RDF-based metadata on entities such as people, places, facts, and events. LinkedFacts.com is an example of a semantic news search application powered by Calais.

Calais has been integrated with the open source content management system (CMS) Alfresco to enable auto-tagging of content as well as the automatic suggestion of tags to content contributors. The latest release of Calais adds the ability to generate Linked Data links to other sources such as DBpedia, GeoNames, and the CIA World Factbook.

Conclusion

With the rapid growth of news content online, relevance is going to become very important, and that will require metadata and structured data in general. By exposing content in XML format, news content APIs are certainly a step in the right direction. However, these APIs have their own limitations and can create content silos in the future. Beyond APIs, Semantic Web technologies and Linked Data principles will ensure uniform and intelligent access to news content. This will be the key to reader retention and content monetization.

Wednesday, December 3, 2008

Keeping Data Under Control in SOA with XSLT 2.0 and XQuery

In a typical SOA project, several artifacts are represented in XML format. This includes SOAP messages, XML Schema definitions (XSDs), WSDL, WS-Policy, BPMN, BPEL, and various configuration files. The following are some examples of how XSLT 2.0 and XQuery can be leveraged in an SOA project.

Data Model and Data Format Transformation

When the services don't share the same data model (XSD) or the same data format (e.g. EDI vs. XML), there is a need to transform the data. An Enterprise Service Bus (ESB) typically provides data transformation as part of its mediation services. Some developers will find XQuery easier to use than XSLT 2.0 for transforming XML data because XQuery has a SQL-like syntax. In Oracle AquaLogic ESB, XQuery is used for data transformation, while the Apache ServiceMix ESB provides support for Saxon-based XSLT 2.0 and XQuery transformations. An XSLT/XQuery engine can be deployed as a Service Component Architecture (SCA) implementation type or a Java Business Integration (JBI) service engine.
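
To illustrate XQuery's SQL-like feel, here is a minimal transformation sketch (the source and target vocabularies are invented for the example):

(: map each shipped order in the source message to a shipment element in the target model :)
for $o in doc("orders.xml")//order
where $o/status = "shipped"
return
  <shipment id="{$o/@id}">
    <customer>{data($o/customer/name)}</customer>
  </shipment>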

One aspect of data model or data format transformation that can quickly get out of control and become difficult to manage is the mapping specification, which says that field X in message A maps to field Y in message B. In addition, the mapping specification defines business rules. This mapping specification is often produced in Excel spreadsheet format by business analysts and handed over to programmers, who then code the transformation script. Now you need to maintain and synchronize the Excel mapping document, the source XSD, the target XSD, and the XSLT 2.0 or XQuery transformation script. On top of that, if a user interface is involved, you need to ensure that it is also kept in sync with those changes.

One technique that can be useful is to use an xsd:appinfo element to capture and keep metadata close to the XSD declarations:


  • Data mapping specifications
  • Business rules using inline ISO Schematron rules for example
  • Labels, alerts, and appearances of UI components such as XForms controls.


This allows you to use XSLT 2.0 or XQuery to automatically generate data mapping reports in Excel or even generate UI components by transforming the XSD into XForms controls.
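
For example, a field declaration's appinfo could carry both its mapping target and an inline business rule (the mapping vocabulary here is hypothetical, and a full Schematron rule would normally sit inside a pattern):

<xs:element name="unitPrice" type="xs:decimal" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <!-- hypothetical mapping vocabulary: where this field goes in the target message -->
      <map:target xmlns:map="http://example.org/mapping">Invoice/Line/Price</map:target>
      <!-- inline Schematron business rule -->
      <sch:assert xmlns:sch="http://purl.oclc.org/dsdl/schematron" test=". >= 0">The unit price must not be negative.</sch:assert>
    </xs:appinfo>
  </xs:annotation>
</xs:element>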


Managing Artifacts and Promoting Reuse with an SOA Repository


One of the key aspects of design-time SOA Governance is the management of the lifecycle of service artifacts and the dependencies between them. This is accomplished through a new breed of tools called SOA Repositories. Some of these repositories are being built on top of a JCR-compliant repository such as Apache Jackrabbit. JCR supports querying compliant repositories using XPath 2.0. Suppose that I need to add a new element to my XSD. To ensure that I am reusing existing schema constructs, I first query the SOA Repository with XPath 2.0 to find all schema components (types, elements, and attributes) that contain a certain keyword inside their xsd:documentation element. XPath 2.0 can also help in detecting dependencies between artifacts (e.g. WSDL and XSD definitions) for change impact analysis. Open source SOA repositories such as Mule Galaxy, JBoss DNA, and WSO2 Repository have adopted this approach.

Functional Testing of Web Services

Automated testing is a key principle of agile software development. SoapUI is an open source web services functional testing framework that allows testers not only to perform XSD validation of SOAP messages, but also to specify assertions on the structure and content of those messages using XPath 2.0 and XQuery. SoapUI can be easily integrated into a continuous integration process.

Data Integration

XQuery can alleviate performance and scalability issues related to the marshalling/unmarshalling of Java objects to/from XML (databinding) and object-relational mapping (ORM) for persisting XML data in relational databases. XQuery is a natural solution for querying and aggregating data coming from heterogeneous sources such as relational databases, native XML databases, LDAP, file systems, and legacy data formats such as EDI and CSV.

One promising specification in the data integration space is the W3C XQuery Scripting Extension (XQSE). By extending XQuery with imperative features such as state management, XQSE (pronounced "excuse") provides developers with additional XML processing power without the need to embed XQuery in a host language such as Java. At the time of this writing, XQSE is still a W3C working draft but is already supported by the Oracle AquaLogic Data Services Platform (ALDSP 3.0).

Tuesday, November 11, 2008

The Content Imperative: Unlearning the Relational Model

The relational data model and the SQL query language are an essential part of any computer science curriculum and are well understood by a large number of developers. On the other hand, the use of markup technologies such as SGML and XML for content management and publishing has remained a niche market for highly specialized vendors and consultants.

Today, the majority of developers use XML for application configuration files (e.g. Spring, Hibernate, JSF), syndication (RSS), and web services. When these developers are asked to design and develop XML content management and publishing applications, they often approach the problem from a relational data paradigm which is what they know and are used to. For example, when migrating content stored in a relational database into an XML format, they will simply dump the relational tables into a flat XML representation. The problem is that content is not relational data. The following are some fundamental differences between content and relational data:

  • Content is created to be consumed by eyeballs
  • Content can be rendered in multiple presentation formats such as print, web, and wireless devices. Therefore it is very important to cleanly separate content from presentation
  • Content can have an inherent deep hierarchical structure. For example, think about the book/part/chapter/section/subsection/paragraph hierarchy
  • The relationships between content items are expressed through hierarchical containment and hyperlinks
  • Content is often mixed (in the sense of mixed content in XML). For example, inside a paragraph, some words are italicized, in bold, or underlined to indicate special meaning
  • Content can have multi-valued properties such as the authors of a document. Multi-valued properties are not supported by SQL.

However, content and data do have one important thing in common. Data and content stay with us for a long time, sometimes forever. APIs, protocols, and programming languages come and go. Therefore, content modeling is by far the most important investment you can make during a content management migration project.

Unstructured Content Modeling

In a typical enterprise, there are two different types of content: unstructured and structured. Unstructured content represents the large majority of content in the enterprise; examples are Office documents such as Word, PowerPoint, and Excel files. Content modeling for unstructured content consists of describing document metadata as well as the relationships between documents. The metadata is usually stored in a relational database. In a typical CMS, the content model is used to customize the user interface, for example for querying documents based on their metadata and for capturing and displaying that metadata.

Most content management systems have their own proprietary meta-model. The Java Content Repository API (JCR) introduced a standardized hierarchical repository model. This article by David Nuescheler (JCR spec lead and CTO at Day Software) explains the peculiarities of JCR content modeling and the gotchas for people coming from a relational data modeling background.

Apache Jackrabbit (the JCR reference implementation) uses a textual DSL called Compact Namespace and Node Type Definition (CND) for specifying a JCR content model. There is no formal graphical notation like UML or ERDs for specifying JCR content models. Lars Trieloff of Day Software proposed a content modeling notation based on UML and Fundamental Modeling Concepts (FMC).
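
To give a flavor of the CND syntax, a node type for a simple article could be declared as follows (all names invented for the example):

<ns = 'http://example.org/ns'>
[ns:article] > nt:base
  - ns:title (string) mandatory
  - ns:keyword (string) multiple
  + ns:section (nt:unstructured)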

The Content Management Interoperability Services (CMIS) specification proposed a simplified meta-model based on documents, folders, policies, and relationships. The CMIS query language extends SQL 92 with text search, multi-valued properties, and folder-scoped queries. This choice was made because most existing CMS use a relational database and SQL is well understood by the majority of developers out there.

The Case Against Unstructured Content

The problem with unstructured content is that it cannot be processed and queried like the well-structured relational data stored by the RDBMS on which your ERP and CRM systems sit. XML goes beyond tags (in the web 2.0 sense), taxonomies, full-text search, and content categorization to provide fine-grained content discovery, query, and processing capabilities. With XML, the document becomes the database. If your business is content (you are a media company, a publisher, or the technical documentation department of a manufacturing company), then you should seriously consider the benefits of XML in terms of content longevity, reuse, repurposing, and cross-media publishing.

The question is: how to do it right?


From SGML to XML to the Infoset

Charles Goldfarb invented GML, the precursor of SGML, at IBM in 1969, the same year his colleague Edgar Codd formulated the relational model. The SGML (through its very popular subset, XML) and relational models are still rock solid today. To be precise, the abstract data model for XML documents was formally specified in the XML Infoset and subsequently in the XQuery 1.0 and XPath 2.0 Data Model (XDM) specifications.

SGML was originally designed at IBM for the editing, retrieval, and composition of legal documents. The second edition of the Oxford English Dictionary and the US Department of Defense (DoD) adopted SGML in the 80s. The goal of the US DoD CALS initiative was to replace the huge quantities of paper involved in the acquisition of weapon systems with digital data. Before the adoption of XML by the W3C in February 1998, SGML was primarily used in the publishing industry as well as for technical documentation applications in industries such as aerospace, defense, software, and telecommunications. Examples of SGML vocabularies include S1000D (aerospace and defense), DocBook (software), and TIM (telecommunications). The most popular SGML vocabulary, HTML, is the foundation of the web itself.

The XML Infoset describes the content of a well-formed XML document as an abstract tree of information items including document, namespace, element, and attribute information items. These information items have properties such as children and parent. Based on the XML Infoset, the XDM defines the abstract data model for the input to XSLT 2.0 and XQuery 1.0 processors. In addition, the XDM supports XML Schema types, atomic typed values, and ordered heterogeneous sequences.

The relational data model is based on set theory and predicate logic. It was originally designed for data processing applications such as accounting and banking systems. Data is represented as n-ary relations and manipulated with relational algebra. CMS vendors and even standards bodies have tried to fork SQL in order to support hierarchies and multi-valued properties. It is clear, however, that XQuery is a superior alternative, specifically designed to address those content-related concerns.

WikiXMLDB is an interesting application that uses XQuery to query Wikipedia content. WikiXMLDB allows you to not only perform database-like queries, but also enables dynamic content assembly or the ability to build compound documents from multiple Wikipedia pages. This opens up new opportunities in terms of content enrichment at a time when publishers are struggling to find new ways to monetize their content assets in the face of declining ad revenues.
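
For example, a compound document could be assembled with a query along these lines (a sketch only; the collection name and page structure are assumptions, not WikiXMLDB's actual API):

<article>
{
  (: pull the title and lead section of selected Wikipedia pages into one compound document :)
  for $page in collection("/wikipedia")/page[title = ("SGML", "XML")]
  return <section>{ $page/title, $page/section[1] }</section>
}
</article>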

Structured Content Modeling

There is a body of specialized knowledge in SGML/XML content analysis and modeling that has been applied successfully to projects such as the Oxford English Dictionary (OED), CALS, and Docbook.

In relational data modeling, the three phases of modeling typically include:

  1. The conceptual or domain model
  2. The logical data model (LDM)
  3. The physical data model (PDM)

The LDM describes entity types, data attributes, and the relationships between the entities. The LDM is normalized to eliminate redundancies. The PDM describes the schema of the database that will be used to store the data and may be denormalized to improve performance. Relational data modeling uses well known modeling notations like Entity-Relationship (ER) diagrams and UML.

The difference between the logical model and the physical model is even more important when modeling XML documents. The logical model can be expressed in XML Schema or Relax NG. However, the physical model depends on the underlying data persistence mechanism (e.g. relational vs. JCR vs. an XQuery-enabled native XML database). Many projects make the mistake of skipping the logical modeling phase. The problem is that the physical storage can and probably will change over time.

There is no formal notation for modeling XML documents. Back in the SGML days, document analysis and modeling was a well understood process. Eve Maler and Jeanne El Andaloussi describe a tree notation for modeling DTDs in their book entitled “Developing SGML DTDs: From Text To Model To Markup”. One of the peculiarities of modeling narrative text is the presence of mixed content, which is essential for intelligent processing of content (for example, for the automatic extraction of book indexes).

David Carlson proposed a UML Profile for XSD which can be used to auto-generate XML schemas from UML class diagrams. As with any model-driven development tool, care should be taken to ensure that the generated XML Schema complies with XML content modeling principles and satisfies the business and technical requirements such as content reuse and repurposing.

Many people prefer Relax NG to XML Schema for its simplicity. The important thing to remember, however, is that XML Schema is part of a full stack of XML specifications which also includes XForms, XPath 2.0, XQuery, and XSLT 2.0. The ability to load XML Schema types into your XForms, validate XForms submissions against an XML Schema, and create schema-aware XSLT 2.0 transforms and XQuery queries can be a deciding factor.

Instead of starting your content modeling effort from scratch, you can leverage any of the existing proven and well tested document-oriented XML vocabularies such as Docbook, DITA, S1000D, and NewsML.

JCR and XML Content Modeling

JCR supports the import of arbitrary XML documents into a compliant repository. The following is an excerpt from the specification:
On import...

Each XML element E becomes a content repository node of the same name, E.
...
Each child XML element C of XML element E becomes a content repository child node C of node E.
Each XML attribute A within an XML element E becomes a property A of content repository node E. The value of each XML attribute A becomes the value of the corresponding property A.
...
Text within an XML element E becomes a STRING property called jcr:xmlcharacters of a node called jcr:xmltext, which itself becomes a child node of the node E.

This is fine if you're only storing a small quantity of XML documents. Performance will probably degrade quickly if you're storing a large quantity of XML documents with a deep hierarchy. Same-name siblings are almost always present in document-oriented XML (e.g. paragraph siblings) and can cause performance to degrade or JCR paths to become brittle if you remove or reorder nodes. JCR allows round-tripping of imported XML. However, JCR adds repository metadata such as jcr:primaryType that must be stripped out at export time. It's possible to derive an optimized JCR content model from the XML document's logical model, although this could prevent you from fully exploiting the original hierarchy of the XML documents in your application. This shows the importance of separating the logical model from the physical model.

You should seriously consider a native XML database when dealing with large quantities of document-oriented XML documents. A simple benchmarking exercise (between Jackrabbit and the Exist database, for example) can help you settle on the right solution.

Business Rules and Content Quality

In addition to specifying the content items, types, and relationships in your schema, you should also specify business rules that are beyond the capabilities of the XML Schema language. ISO Schematron uses XPath 2.0 to declare assertions about arbitrary patterns in XML documents and then reports on the presence or absence of those patterns. In this article, Rick Jelliffe, the inventor of Schematron, explains how it works.
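
A minimal sketch of a Schematron rule (the element names are invented):

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <sch:rule context="article">
      <!-- business rule: every article must carry at least one author -->
      <sch:assert test="author">An article must have at least one author.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>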

Assertion and conditional type assignment capabilities have been added to XML Schema 1.1.

Content Capture

XML editors have been around for a long time. However, these XML editors remain complex specialist tools that are often used only by professional technical authors in documentation departments. Using XForms, you can provide a user-friendly interface for your end users to contribute XML content by presenting them with a simple XHTML form. The Alfresco web content management (WCM) platform uses XForms for content capture and XSLT/XSL FO for content rendition. The XForms controls can be generated automatically from an XML Schema. Alfresco's implementation is based on the open source Chiba XForms engine.
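
A bare-bones sketch of an XForms capture form embedded in XHTML (the instance structure is invented):

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
  <head>
    <xf:model>
      <!-- the XML instance that the end user edits through the form -->
      <xf:instance>
        <article xmlns=""><title/><summary/></article>
      </xf:instance>
    </xf:model>
  </head>
  <body>
    <!-- controls are bound to instance nodes via XPath -->
    <xf:input ref="title"><xf:label>Title</xf:label></xf:input>
    <xf:textarea ref="summary"><xf:label>Summary</xf:label></xf:textarea>
    <xf:submit><xf:label>Save</xf:label></xf:submit>
  </body>
</html>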

The Right Tools for the Job

When content is stored in a relational database, typically an object relational mapping (ORM) solution like Hibernate is used to map relational tables into Java objects. The business logic is handled with POJOs such as Spring beans which communicate with the front end through a UI framework or templating technology such as JSP or Freemarker.

When dealing with XML documents, however, there are domain-specific languages such as XForms, XInclude, XLink, XPointer, XPath 2.0, XSLT 2.0, XQuery, and XSL FO that greatly facilitate the processing of document-oriented XML and provide unmatched processing power compared to traditional approaches based on SQL, JSP, or Freemarker. These XML-related languages are declarative in nature and involve a learning curve. Many software architects and developers are not yet aware of the power of this new paradigm. However, I believe that rather than using only familiar development frameworks, it’s important to always evaluate alternatives and select the best approach even if it involves a learning curve.

XML databases such as Exist can store data natively in XML and provide full XQuery support for sophisticated queries and manipulation of XML content. Exist also provides integrated support for XInclude, XSLT 2.0, and AtomPub. The XRX (XForms, REST, XQuery) architecture with Exist and the Orbeon XForms engine allows you to connect an XForms front end to an Exist data store through a REST API, thereby bypassing the ORM layer altogether.
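
In an XRX application, the form typically submits its XML instance straight to the database over HTTP. A sketch of the submission side, modeled on Exist's REST interface (the URL and collection are illustrative):

<!-- the instance XML is PUT to the database as-is: no object mapping, no SQL -->
<xf:submission xmlns:xf="http://www.w3.org/2002/xforms"
    id="save" method="put" replace="none"
    action="http://localhost:8080/exist/rest/db/articles/article-001.xml"/>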

Content Modeling for the Semantic Web

The Web Ontology Language (OWL) is used for modeling on the Semantic Web. In the RDF data model, statements are expressed as Subject-Predicate-Object triples. Specialized RDF stores exist for storing these triples. In this blog post, Kingsley Idehen, CEO of OpenLink Software (maker of the Virtuoso RDF triple store), explains why The Time for RDBMS Primacy Downgrade is Nigh!
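
For example, the statement "this article was created by Jane Reporter" (names invented) is a single triple, serialized here in RDF/XML:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- subject: the article's URI; predicate: dc:creator; object: a literal -->
  <rdf:Description rdf:about="http://example.org/articles/001">
    <dc:creator>Jane Reporter</dc:creator>
  </rdf:Description>
</rdf:RDF>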

Conclusion

In summary, content modeling in general and XML content modeling in particular are intrinsically different from relational data modeling. In modeling XML content, the logical model should be separated from the physical model because the latter depends on the persistent storage mechanism, which can change during the content's life cycle. XQuery-enabled native XML databases provide a better alternative for storing, querying, and processing large quantities of document-oriented XML. The relational data model is still rock solid today and will be around for many years to come because of its strong foundations in mathematics. However, it is not a panacea for all information management problems. While SQL can be forked to support content characteristics such as hierarchy and multi-valued properties, XQuery was natively designed to address those concerns. Companies can help their developers by providing training on declarative XML processing languages like XForms, XQuery, and XSLT 2.0, which are better suited for handling XML content.

Thursday, October 23, 2008

AtomPub Use Cases in the Aviation Industry

At the XML 2007 Conference in Boston, I introduced my concept of an Integrated Documentation Environment for Aircraft Support (IDEAS) based on AtomPub and OpenSearch. The following are some use cases that illustrate how such an approach could facilitate integration and technical information publishing in the aviation industry:

  • Notification. Nancy is an aircraft mechanic. When she gets to work in the morning, she opens her feed aggregator to get new and updated content from all of the following sources: the airframer, the engine manufacturer, component manufacturers, the FAA, and airline policies. Essentially, Nancy doesn't want to log in to the support sites of all those content providers to find out what is new and updated; she wants content pushed to her instead. This use case is implemented with the Atom syndication format (a sample entry appears after this list).

  • Federated search. While Nancy is repairing the hydraulic tank, she wants to perform a single search against all those content repositories. She wants the results aggregated and returned to her as Atom entries, so that she can subscribe to those items that she is interested in and receive updates via web feeds. This use case is implemented with the OpenSearch specification.

  • Airline Originated Changes. Judy is an engineer working on a new engineering order (EO) to be performed on the hydraulic tank. The airline's technical documents are hosted by the aircraft manufacturer. Judy uses an XML editor which is also an AtomPub client to post the EO to the remote content repository (an AtomPub server).

  • Distributed Aircraft Manufacturing. Future Composites Inc. is a supplier of composite aircraft structures to X-Aero, a major airframer (systems integrator). Future Composites is also responsible for providing technical content in S1000D to X-Aero on those composite structures. After a failed attempt to connect to X-Aero's repository using their SOAP and WS-* interface, Future Composites and X-Aero mutually agree to go back to basics and use AtomPub and its simple, generic RESTful HTTP interface to CRUD (create, retrieve, update, and delete) documents in X-Aero's content repository.
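
A minimal sketch of an Atom entry that such a feed might carry (identifiers and URLs invented):

<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Hydraulic tank inspection procedure updated</title>
  <id>urn:uuid:11111111-2222-3333-4444-555555555555</id>
  <updated>2008-10-20T09:00:00Z</updated>
  <!-- points to the revised data module in the manufacturer's repository -->
  <link rel="alternate" href="http://example.org/csdb/dm/hydraulic-tank.xml"/>
</entry>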


The main arguments in favor of this approach are simplicity and scalability. I am glad to see that the software industry is moving in that direction. Having been involved in complex WS-*-based integration projects in the airline industry, I believe this new approach is a breath of fresh air. The RESTful approach is also more amenable to agile software development, as opposed to the waterfall approach that is typical when the big up-front purchase of a proprietary ESB is involved.

Integration projects are becoming critical to the success of new aircraft projects. Speaking about the repeated postponement of the 787 maiden flight in an internal memo sent to Boeing employees on April 21, 2008 (and obtained by the Seattle Times), Boeing CEO Jim McNerney wrote:
I expect we’ll modify our approach somewhat on future programs—possibly drawing the lines in different places with regard to what we ask our partners to do, but also sharpening our tools for overseeing overall supply chain activities.

Why AtomPub specifically? Because too many people have been putting the "REST" label on their unRESTful chef-d'oeuvre (HTTP APIs) lately. AtomPub is a good embodiment of the principles of the REST architectural style and a good place to start.

So, what are the key principles of RESTful design?

  • Everything is a URI addressable resource
  • Representations (media types such as XHTML, JSON, and Atom) describe resources and use links to describe the relationships between those resources.
  • These links drive changes in application state (hence Representational State Transfer or REST).
  • The only type that is significant for clients is the representation media type, not any other resource type
  • URI templates (a la OpenSearch) as opposed to fixed or hard coded resource names
  • Generic HTTP methods (no RPC-style overloaded POST)
  • Statelessness (the server keeps no client session state)
  • Cacheability

Adherence to these principles is what drives massive scalability. Security in a RESTful application can be achieved with any of the following existing solutions:

  • XML Signature and Encryption
  • OpenID
  • HTTP Authentication
  • SSL

How does the aviation industry get started with this new approach? This will require leadership from aviation IT specialists, particularly from original equipment manufacturers (OEMs). I don't think that another Air Transport Association (ATA) standards committee is needed. Such standards committees are plagued by vendor politics. By the time they finish their work, someone may have invented a better solution than AtomPub.

In the Java space, the recently approved Java API for RESTful Web Services (JAX-RS) specification greatly simplifies REST development with simple annotated POJOs. Jersey is the open source reference implementation of JAX-RS. Apache Abdera is an AtomPub implementation with Spring Framework integration. The latest release of Abdera features a collection of pre-bundled Atom Publishing Protocol adapters for JDBC, JCR, and filesystems.

The following is an excellent technical article from InfoQ that explains how REST and AtomPub facilitate integration: "How to Get a Cup of Coffee".

Monday, August 11, 2008

Recession-Proof Computing

The past three years have been very exciting at Efasoft. We've delivered value to a number of customers in a variety of industries including automotive, pharmaceutical, homeland security, wireless internet, aerospace, defense, insurance, and customer loyalty management. We've also learned a lot in the process. These projects have allowed us to strengthen our expertise in XML, Java EE, and SOA. In the aerospace vertical, where we have strong expertise, we took the initiative to propose new ideas, such as using ISO Schematron to exchange and validate S1000D business rules, and using the AtomPub protocol and Atom syndication for the efficient exchange of up-to-date aircraft technical data between airlines and aerospace manufacturers.

Going forward, our objective will continue to be the success of our customers. We'll achieve that by researching and implementing best practices as always. We understand that technology must be aligned with strategic business goals, but should also take into consideration the context within which the business operates.

So, key questions that a lot of business leaders are now asking include:

  • How can we continue to invest in much needed strategic IT initiatives in the current economic downturn under tight budgets?
  • Given the high rate of IT project failures, how do we minimize risk?
  • Which software development methodology can help us deliver quality software on time and under budget?
  • What are the tools that we can use to help our developers in their work and keep them productive?
  • Are outsourcing and offshoring the right approach? And if we do outsource, how do we retain control, ensure quality, and protect our intellectual property?

Open source software (OSS) is the answer to some of these questions. OSS lowers the total cost of ownership (TCO). By providing full access to the source code (including unit and functional tests), OSS provides transparency into enterprise software. By supporting standards and open frameworks, OSS allows organizations to avoid vendor lock-in, protect their investments, and find talent in the open job market to maintain and support their software assets in the future (in case the software vendor goes out of business).

SOA and Web 2.0 technologies allow organizations to gain a competitive advantage by supporting business process efficiency and by facilitating collaboration and online communities.

On the SOA front, OSS tools such as Apache CXF (web services framework), Apache Tuscany (SCA implementation), Intalio BPMS (BPMN), Apache Axis2, Mule ESB, Apache ODE (BPEL), and Apache ServiceMix (ESB) have demonstrated their strength in supporting SOA projects in mission-critical applications in industries such as banking. Based on carefully researched SOA design principles and patterns, our SOA offering includes the following:

  • Business process analysis using BPMN
  • A model driven development (MDD) approach where appropriate
  • SOA implementation using emerging standards such as BPEL and SCA (Service Component Architecture)
  • SOA Governance using open source SOA Repositories.

On the Web 2.0 front, we really like the Liferay enterprise portal and the Alfresco document/web content management platforms, particularly their built-in social networking features which enable enterprise collaboration and online communities. Document management is one area where we can leverage our expertise in XML and related technologies (XInclude, XSLT, XQuery, XForms, ISO Schematron, and S1000D) to help our customers bring their knowledge assets under control. We'll continue to support the Exist XQuery-enabled native XML database to build dynamic XML content applications. The XRX (XForms, REST, XQuery) architecture with Exist and the Orbeon XForms engine enables what we call "Web 2.0 XML authoring and publishing". We've acquired strong expertise in document management for maintenance and operation documentation in the aerospace industry (our traditional forte) and drug-related documentation in the pharmaceutical industry.

JBoss Seam is a very compelling application development framework because it not only brings together Java EE frameworks such as Hibernate, JPA, EJB 3, Spring, JSF, Facelets, and Java portlets, but also integrates human workflow capabilities (jBPM), full-text search (Hibernate Search), a business rules engine (Drools), and an integration testing facility. We like the ability to leverage third party AJAX-enabled JSF component libraries such as ICEFaces and Apache MyFaces to quickly create rich internet applications (RIA). However, JBoss Seam is not limited to JSF and can also integrate Flex 3 front-ends.

Going forward, all of this open source software will be part of our toolkit as we craft innovative software solutions for our customers.

At Efasoft, we are proponents of agile development methodologies such as Extreme Programming and Scrum. These methodologies are based on practices such as user stories, iteration (sprint) planning, pair programming, unit test first, refactoring, continuous integration, and acceptance tests. Agile programming helps create better software that is also easier to maintain. We've witnessed the success of agile first hand and believe that it can help IT organizations achieve success. For more on Efasoft's approach to quality, see my previous post entitled Addressing Software Quality Head-On.

Saturday, July 26, 2008

Architecting SOA Solutions with a Model Driven Development (MDD) Approach

How do you architect an SOA solution to ensure that it is driven by the business and can respond rapidly and efficiently to ever changing business requirements? A Model Driven Development (MDD) approach can help provide that level of agility. The goal with the MDD approach to SOA is to auto-generate service artifacts such as WSDL, XSD, SCA composites, and BPEL code from the service model.

First, the business articulates its vision for the SOA project in requirements documents or in the form of use cases. Business analysts (BAs) then model business processes that realize the use cases by leveraging the Business Process Modeling Notation (BPMN). With the help of the right tools, the BAs can specify Key Performance Indicators (KPIs) such as those required by service level agreements (SLAs). They can also run simulations to validate the proposed business processes.

SOA is indeed all about reengineering and supporting organizational business processes. Back in 1993, Michael Hammer and James Champy made the case and outlined the management framework for reengineering in their book entitled "Reengineering the Corporation: A Manifesto for Business Revolution". Today, SOA is the software architecture that enables and facilitates the reengineering of business processes.

BPMN is an effective tool for BAs (as opposed to UML) because they should only focus on the business and operational aspects of the business process and shouldn’t have to worry about IT concerns such as service loose coupling, reusability, reliability, security, persistence, and transactions. While direct transformation from BPMN to executable BPEL code (so-called BPMN-BPEL round-tripping) may be effective for simple business processes, it cannot always satisfy those IT concerns. More complex business processes will require advanced modeling and coding by SOA architects and developers.

For example, the SOA architect will have to decompose the proposed business process into task, entity, and utility service layers in order to satisfy the SOA principles of loose coupling, reusability, and composability. That will also give the SOA architect the opportunity to apply SOA design principles and patterns and to check the enterprise SOA Repository or Registry in order to reuse existing services or legacy assets.

After decomposing the proposed business process to identify reuse opportunities and address other IT concerns, the SOA architect can then build an assembly of service components based on the Service Component Architecture (SCA). SCA implementation types include Spring beans, EJBs, C++, Cobol, WS-BPEL, PHP, XSLT, XQuery, and OSGi bundles. SCA supports different bindings such as SOAP/HTTP Web services, JMS, RSS, and Atom.
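
A rough sketch of an SCA composite descriptor (names invented) wiring a BPEL process to a Java component:

<composite xmlns="http://www.osoa.org/xmlns/sca/1.0"
           xmlns:op="http://example.org/orders" name="OrderProcessing">
  <!-- the BPEL process is exposed as a SOAP/HTTP web service -->
  <service name="OrderService" promote="OrderProcess">
    <binding.ws/>
  </service>
  <component name="OrderProcess">
    <implementation.bpel process="op:OrderProcess"/>
    <reference name="creditCheck" target="CreditCheck"/>
  </component>
  <component name="CreditCheck">
    <implementation.java class="com.example.CreditCheckImpl"/>
  </component>
</composite>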

The tool of choice for software architects is UML 2.0. In the case of SOA, UML can help abstract the service model from technology-specific implementation details. Basic UML artifacts such as activity and collaboration diagrams can be auto-generated from the BPMN diagrams produced by the BAs to bootstrap the SOA Architect's modeling effort.

To help SOA architects in crafting service-oriented solution logic, a UML Profile for service-oriented design should be adopted. The profile should define a number of stereotypes that can be applied to UML artifacts in order to refine the transformation from UML artifacts to service artifacts.

The automatic generation of the service artifacts from the UML model should be part of the build and continuous integration process, which should also include automated tests (for example, to ensure that the generated XSD and WSDL are syntactically correct, WS-I Basic Profile compliant, and backward compatible).

The benefits of an MDD approach to crafting SOA solutions include: increased development productivity, traceability to business requirements, responsiveness to changing requirements, quality, and overall agility.

Wednesday, July 9, 2008

SOA in the Java Space: State of the Union

In the Java space, an SOA Architect starting a new SOA project will have to make some strategic as well as tactical decisions regarding which approach and technologies are appropriate for their project. Of course, the SOA project should be aligned with the organization’s long term business goals.

Technically, there is a myriad of specifications to choose from:

  • The Business Process Modeling Notation (BPMN)
  • The Java API for XML Web Services (JAX-WS)
  • The WS-* specifications including the WS-I Basic Profile, WS-Addressing, WS-Policy, WS-Reliable Messaging, and WS-Security
  • The Java API for RESTful Web Services (JAX-RS)
  • The Java Business Integration (JBI)
  • The Web Services Business Process Execution Language (WS-BPEL)
  • The Service Component Architecture (SCA).

Each of these specifications has its raison d'être and should be part of the architect's toolkit. However, I find SCA quite intriguing. SCA defines a language-neutral programming model for the assembly and deployment of services. SCA implementation types include Java, C++, Cobol, WS-BPEL, PHP, Spring, XSLT, XQuery, and OSGi bundles. SCA supports different bindings such as SOAP/HTTP Web services, JMS, RSS, and Atom. SCA applications can be hosted in web containers, application servers, and OSGi runtimes. SCA is geared toward the developer and can apply policies such as reliability, security, and transactions to services in a declarative manner. SCA has the support of big players including Oracle, SAP, and IBM. At the time of this writing, Sun Microsystems' support for SCA is less than clear (to me anyway).

JBI is implemented by a number of Enterprise Service Bus (ESB) products. JBI defines a runtime architecture that allows plugins such as binding components and service engines to interoperate via a Normalized Message Router (NMR). Binding components (BCs) use communication protocols such as JMS, FTP, XMPP, and HTTP/S to connect external services to the JBI environment. Service engines (SEs) provide application logic in the JBI environment. Examples of SEs are XSLT/XQuery data transformation engines, rules engines, and WS-BPEL engines. BCs and SEs do not communicate directly; they communicate only through the NMR. IBM, BEA (now part of Oracle), and SAP did not vote in favor of the JBI Java Specification Request (JSR 208).

When starting a new SOA project, SOA architects will have to look beyond vendor politics and make some judgment calls about the best approach based on their business goals and functional requirements.

My personal take on this is to adopt an agile approach where new functionality is implemented in an iterative manner. For example, instead of starting with an ESB infrastructure, a project can start by service-enabling existing applications (a code-first approach) with JAX-WS annotation capabilities, or by creating new services with a contract-first approach where a JAX-WS annotated service and server stub are generated from a WSDL. Alternatively, the Java API for RESTful Web Services (JAX-RS) could be used when a RESTful approach seems more appropriate.

An ESB could come into the picture later, in a context where you are connecting to multiple services (often across organizational boundaries) and there is a need for mediation services such as business process orchestration, business rules processing, data model transformation, message routing, and protocol bridging. In that context, JBI provides plug-and-play functionality in ESBs for service engines such as business rules engines and BPEL engines. For that reason, JBI can help avoid ESB vendor lock-in (perhaps a reason why proprietary ESB vendors are not backing JBI).

However, SOA architects should carefully consider the benefits of the programming language and binding agnostic service assembly model proposed by SCA. This will be essentially a choice between centralized mediation and decentralized assembly. The Open Service Oriented Architecture (OSOA) group believes that JBI and SCA can actually work together for example to allow SCA components to call JBI components or to use JBI runtime containers to deploy SCA composites.

Decentralized assembly is agile and looks a lot more like the way the web itself works. So I believe that while the JBI model is fine for integrating legacy enterprise applications, new and future service-oriented applications will embrace the SCA approach.

So is the state of the union strong? There is certainly a risk of fragmentation. But choice will always drive innovation forward and that’s what attracts me to the Java platform in the first place.

Friday, July 4, 2008

XML Schema Design Strategies for SOA Projects

The schema modeling and design effort should be an integrated part of an agile approach which implements practices such as user stories, acceptance tests, unit test first, refactoring, short iterations, common code base, and continuous integration.

In the recommended contract-first approach to web services development, the XML Schema and WSDL artifacts are the foundation of an SOA project. For example, with Apache CXF, you can use the WSDL2Java tool to automatically generate a JAX-WS annotated service and server stub from your WSDL.

The first step is to adopt schema naming and design rules (NDRs). Industry and government standards efforts such as the Universal Business Language (UBL) and the National Information Exchange Model (NIEM) have published such NDRs.

If you’re building a new schema from scratch, then the schema should be designed in an iterative and collaborative manner. During each iteration, add just enough components to your schema to support the specific user stories that are being implemented. As the schema grows, refactor as required.

One option is to start the modeling effort with a domain model in the form of UML class diagrams to facilitate collaboration between non-technical subject matter experts (SMEs), the modeler, and the technical team. An XML schema can then be generated automatically from the UML class diagram with a tool such as Hypermodel. David Carlson, the creator of Hypermodel, proposed a UML Profile for XSD which defines a number of stereotypes that can be added to UML class diagrams to refine the mapping from class diagrams to XML schemas. Alternatively, you could export the UML model to the XML Metadata Interchange (XMI) format and use an XSLT transform to map the XMI into an XML Schema and even an XML instance. This Model Driven Development (MDD) approach to XSD provides agility in the face of constant changes in business requirements.

If you are reusing an industry or government schema such as UBL or NIEM, it is very important to use the right methodology for extending the schema, as recommended by the applicable NDR or, in the case of NIEM, the Information Exchange Package Documentation (IEPD) process. Extensions to the standard schema should be clearly defined in a new custom namespace and documented properly. The following are some strategies for extending an XML schema (a sketch of the wildcard approach follows the list):

  • Wildcards xs:any and xs:anyAttribute
  • Element substitution and abstract elements
  • Type substitution via xsi:type and abstract types
  • Concrete Extension (creating a new type by extending an existing type to include additional local elements).
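
For example, a type can expose an explicit extension point with xs:any (a sketch; element names invented):

<xs:complexType name="AddressType" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <xs:element name="street" type="xs:string"/>
    <xs:element name="city" type="xs:string"/>
    <!-- extension point: elements from other namespaces may be added here -->
    <xs:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>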

The schema should be tested for quality against the NDRs. The US National Institute of Standards and Technology (NIST) has published an XML Schema Quality of Design Tool (or QoD Tool) which combines Schematron rules and JESS (a Java-based rule engine) to validate schemas against NDRs.

For unit testing, the XMLUnit framework can be helpful in testing the schema as you refactor and implement new user stories. XMLUnit for Java allows you to make assertions about the validity of an XML document against an XML Schema. The execution of these tests should be part of your build and continuous integration process.

The automatic generation of the XSD code from the UML model should also be part of the build and continuous integration process.

Business rules or modeling requirements that are beyond the capabilities of XML Schema 1.0 should be implemented with an assertion-based language such as ISO Schematron or with the new assertion and conditional type assignment (co-constraints) capabilities in XML Schema 1.1.

Although ISO Schematron and XML Schema 1.1 can be quite powerful when used with XPath 2.0, some complex business rules will be easier to handle with rule engines such as JBoss Drools and JESS. The rule engine can be deployed as a dedicated and reusable utility web service to validate messages.

When modeling XML schemas for an SOA project, careful consideration should be given to the issue of data transformations. When the services don’t share the same data model and XML Schema, there is a need to transform the XML data using technologies such as XQuery and XSLT. This can introduce additional design complexity and runtime performance issues. Data transformations shall be avoided unless absolutely required.

The XML Schema's xsd:appinfo element can be used to capture and keep metadata close to the XSD declarations:

  • Metadata such as data transformation specifications
  • Business rules using inline ISO Schematron rules
  • Labels, alerts, and appearances of UI components such as XForms controls. This provides the opportunity to auto-generate UI components from your XSD using a transformation language like XSLT or XQuery. Keeping UI and XSD components in sync can be a challenge in SOA projects.

Finally, to maximize reuse, an enterprise-wide SOA Repository/Registry should be used to publish, centralize, and discover schema components (see my previous post on SOA Governance tools).

Tuesday, June 10, 2008

Addressing Software Quality Head-On

Adopting a test-driven development methodology (TDD) and using the right tools can help deliver quality software.

With TDD, you start with user stories and you write acceptance tests for those stories. To implement the functionality for the stories, you write unit tests first, then just enough code to make the unit tests pass, and then you refactor. You repeat the test-code-refactor cycle until the acceptance tests pass.

If your development team is not already using TDD, the first step is to provide adequate training on the concepts and patterns of TDD. One option is to hire an Agile coach or pair junior developers with experienced practitioners of TDD. Next, you need to use the frameworks and tools that will facilitate adoption, keep your developers productive, and provide transparency into the process. The following are tools and frameworks (all are free and open-source) that we find useful:


  • Build tool: Maven 2
  • Continuous integration: Hudson
  • A tool for configuring, starting, stopping Java containers and deploying applications for continuous integration and functional tests: Cargo
  • Unit testing frameworks: JUnit 4.4, EasyMock, and JSFUnit (for JSF applications)
  • Enforcing coding standards: Checkstyle
  • Detecting bugs, overcomplicated expressions, and suboptimal/dead/duplicate code: PMD and FindBugs
  • Code review: Jupiter
  • Test coverage: Cobertura
  • Web Services functional and load testing: SoapUI
  • User interface testing: Selenium and Umangite
  • Integration testing: Fit, DBUnit, DbFit, and ORMUnit
  • XML unit testing: XMLUnit and Tennison Tests
  • Load and performance testing: JMeter
  • Profiling and monitoring: JConsole
  • Analyzing code dependencies: JDepend
  • Source code documentation: Doxygen
  • Project tracking and planning: XPlanner


By enforcing Java best practices, tools like Checkstyle, PMD, and FindBugs are very helpful for developing high quality code.

Unit and integration tests should be fast, repeatable, and automated. That's why you need to deploy a build tool and a continuous integration server from the start. All the tools listed above should be executed as part of your continuous integration process. You should use a dedicated integration build machine.

DBUnit helps make your system tests repeatable by using XML to insert a specific data set into the database before each test run. Cargo can configure and start the web container or application server (AS), deploy your application's WAR or EAR file, and then shut down the container after each test run.

Keep in mind that test coverage reports such as those provided by Cobertura should be used mainly to isolate code that has not been appropriately tested in order to take corrective measures. These reports shall not be used solely to aim at a magical high coverage percentage. Applications that are built on an inversion of control (IoC) container such as Spring are more amenable to unit testing.

For web applications with a user interface (UI) layer such as Struts or JSF, make sure that you exercise the UI functionality on real web browsers (IE, Firefox, etc.) with a tool like Selenium. Fit and FitNesse (the wiki-based version of Fit) are effective tools for integration testing your application's business logic. By integrating Selenium, TestNG, Spring, and Cargo, the Umangite framework makes it easy to write web tests.

If you intend to service-enable the same application, then SoapUI will help with both integration and load testing of the web services over HTTP. The nice thing about SoapUI is that in addition to XML Schema validation, it allows you to specify response assertions by using XPath 2.0, XQuery, and Groovy scripts.

Sun JDK tools such as JConsole, jmap, and jhat can help with profiling and diagnosing memory leak issues. The JMeter Proxy makes load and performance tests easy by allowing you to record a test case.

Are agile practices applicable to XML development (XML Schema modeling, XSLT, XQuery, and XSL FO programming)? You bet. More on that in my previous post entitled Extreme XML Programming.

Saturday, May 24, 2008

S1000D Content Reuse for Aircraft Documentation

One of the justifications for moving to an XML-based S1000D content management system (CMS) is the ability to reduce cost and improve quality by reusing content. In the aerospace industry, hundreds of thousands of pages of maintenance and operation documentation are produced and maintained for every new aircraft project. Warnings and cautions are a good example of reuse in aerospace documentation. They describe hazards that may cause injury or death or damage to the aircraft. For product liability reasons, these warnings and cautions are carefully reviewed and approved by qualified personnel. Technical authors may be required to reuse these warnings and cautions verbatim across all documents. In this blog, I will discuss some principles and practices that facilitate S1000D content reuse.

From a technical perspective, the key to successful reuse in S1000D is the W3C XInclude specification. The S1000D specification does not make reference to XInclude; the reason is that earlier versions of S1000D were based on SGML. Some S1000D CMSs still rely on the SGML/XML 1.0 external parsed entity mechanism for implementing reuse. This approach has several limitations and should be avoided. The preferred approach in modern XML content applications is XInclude, which allows the transclusion not only of whole chunks of XML content, but also of individual elements (addressed using XPath/XPointer) within those chunks. The following are some examples:

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="dm.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="dm.xml" xpointer="warning-001"/>

In the first example, a whole data module file named dm.xml is included. In the second example, only the element with ID value "warning-001" within that data module is included.

Using XInclude in an S1000D content application requires some modifications to the XML Schema used for authoring data modules so that xi:include elements can be inserted. These modifications still yield valid S1000D documents: after XInclude processing, the resolved document conforms to the original structure, since you're not altering the structure of your documents but simply modularizing the content.
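
A minimal sketch of such a schema modification follows; it imports the XInclude schema (assumed to be available locally as XInclude.xsd) and adds xi:include as an alternative in a content model (the warnings and warning element names are hypothetical):

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- make the xi:include element available to content models -->
  <xsd:import namespace="http://www.w3.org/2001/XInclude"
              schemaLocation="XInclude.xsd"/>
  <xsd:element name="warnings">
    <xsd:complexType>
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <!-- a warning can be authored inline or pulled in by transclusion -->
        <xsd:element name="warning" type="xsd:string"/>
        <xsd:element ref="xi:include"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>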

While we are on the subject of inclusion, the XLink specification can be used as a simpler alternative to the XML 1.0 unparsed entity and notation mechanism (another concept inherited from SGML) for including illustrations into S1000D documents.

At the DocTrain 2007 conference in Boston, I gave a presentation on how to integrate training and documentation using S1000D and the Shareable Content Object Reference Model (SCORM) specification. One way to reuse S1000D content in SCORM is to assign a unique ID to all reusable elements in S1000D data modules (DMs), such as paragraphs, steps, warnings, cautions, notes, and tables. This can be done automatically using the XSLT generate-id() function. The instructional designer then searches the S1000D common source database (CSDB) to find and display relevant DMs. She can then use XInclude to include reusable elements from S1000D DMs into SCORM shareable content objects (SCOs). Once this is in place, the SCOs are automatically updated whenever the DMs are updated.
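
A minimal XSLT sketch of that ID assignment, written as an identity transform that stamps reusable elements lacking an ID (the list of element names is illustrative, and note that generate-id() is only guaranteed to be stable within a single transformation run):

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity transform: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- stamp reusable elements with a generated ID if they don't have one -->
  <xsl:template match="para|step|warning|caution|note|table">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id"><xsl:value-of select="generate-id()"/></xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>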

Successful S1000D reuse requires adherence to the principle of context-agnostic content. For example, to make it possible to reuse a warning across multiple documents in different contexts, one should avoid formulations such as "refer to the illustration in the next section" inside the warning.

Enforcing the principle of context-agnostic content can be semi-automated using an assertion-based schema language like ISO Schematron to report occurrences of keywords such as "previous", "next", and "below". The warning should then be routed through a comprehensive review and approval workflow provided by the CMS before final publication. The principle of business rule definition and enforcement ensures that reusable content is of the highest quality. Consider a dual-purpose data module that is written to be reused by both training and publications. A business rule could require the use of a certain language style (e.g. active as opposed to passive voice) for the dual-purpose data module.
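
A minimal ISO Schematron sketch of such a check (the warning element name and the keyword list are assumptions to be adapted to your schema):

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <sch:rule context="warning">
      <!-- flag context-dependent wording that would undermine reuse -->
      <sch:report test="contains(., 'previous') or contains(., 'next') or contains(., 'below')">
        This warning contains context-dependent wording and may not be
        reusable verbatim in other contexts.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>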

Another principle that can help when the content cannot be fully context-agnostic is the parameterization of reusable content. With parameterization, you include variable references in the reusable content that are resolved at run time. The Exist XML database has an elegant way of handling this using a combination of XInclude and XQuery, as in the following example:

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="warning.xq?var1=material&amp;var2=process"/>

Here warning.xq is a stored XQuery which is compiled and executed by Exist to return the root element of the warning. The content of the warning depends on the material and process used to carry out the maintenance procedure. var1 and var2 are passed to the XQuery as external global variables.

The issue of content granularity is directly related to the principle of context-agnostic content. Although the data module is the basic unit of information in S1000D, content can be managed at a lower level of granularity. An interesting feature of some XML editors is the ability to select an element inside an XML document and convert that element into an XIncluded file. So while a technical author is writing a warning inside a data module, she can pull that warning out into an XIncluded XML file if she determines that the warning could be reusable in other publications.

Another area where XQuery facilitates reuse is the dynamic assembly of content based on product attributes such as applicability, security, and skill level. S1000D has a comprehensive metadata facility called IDSTATUS that can be leveraged to filter content. A good example is applicability filtering. In the case of an aircraft, the applicability of an S1000D maintenance or operation procedure can depend on the following attributes and conditions (among others):

  1. Manufacturer serial number
  2. Aircraft registration number
  3. Service bulletin incorporation
  4. Location of maintenance
  5. Aviation regulations
  6. Temperature, wind speed, and sandy conditions

XInclude and XQuery can be used together to package content into S1000D publication modules by executing queries that filter content based on metadata in the IDSTATUS.

An important condition for content reuse is the principle of discoverability of reusable content. Obviously, you cannot reuse a piece of content if you don't know that it exists and where to find it. A technical author should be able to query or browse the S1000D CSDB to find relevant reusable content. To facilitate enterprise-wide content reuse, I highly recommend a CSDB based on a native, XQuery-compliant XML database and deployed as a web application. That will allow authors to perform both full-text and structured queries on the CSDB. The query should return a list of data modules or reusable chunks. The author should then be able to select a reusable chunk to automatically insert an XInclude targeting that chunk.

In support of the principle of reusable content discoverability, appropriate metadata should be added to the content. The DMs already have comprehensive metadata in the IDSTATUS section. Reusable content at a lower level of granularity (like a warning) should also have appropriate metadata specified.

An XQuery-enabled native XML database can help with the governance of your reuse initiative by providing powerful reporting capabilities. For example, you can easily run an XQuery to find all documents that contain an XInclude to a particular chunk. This is important for understanding the impact of updates to that chunk. Another potential issue that could require some attention is the versioning of reusable content. Some form of notification mechanism can be helpful to alert consumers to changes in reusable content. This can take the form of an Atom feed to which consumers can subscribe.

It is important to select an XML authoring tool that has good support for XInclude. Fortunately, some commercial XML editors now have decent support for XInclude. However, these XML editors remain complex specialist tools that are often used only by professional technical authors in documentation departments. At one of our aerospace customers, manufacturing assembly and functional test procedures were used to create installation and testing procedures for service publications. To allow their engineers to contribute S1000D content, we designed a lightweight XML authoring application based on an XForms front-end, with XML data persisted in a native XML database through a RESTful API.

Any data reuse strategy should look beyond training and publication to identify opportunities to reuse data and streamline processes across the entire aircraft lifecycle.

Thursday, May 15, 2008

Spring, SCA, and OSGi

French paleontologist Pierre Teilhard de Chardin once said: “Tout Ce Qui Monte Converge” or “Everything That Rises Must Converge”. I learned this quotation from my father who mentioned to me that it was the topic of his philosophy dissertation at his university entrance exam.

The quotation accurately describes what I see happening in the world of software development with the rise of Spring, OSGi, and SCA.

During the last 30 years, the software industry has evolved from structured design to object-oriented design, POJO programming, and lately service-oriented design. With the rising complexity and costs of software systems, the ultimate goal of this evolution has been the reuse of software assets through loose coupling and service orientation.

Spring is based on the principle of inversion of control (IoC), or dependency injection. With Spring, objects (simple POJOs) are provided with their dependencies as opposed to the objects managing or looking up those dependencies themselves. Spring relies on aspect-oriented programming (AOP) to declaratively manage cross-cutting concerns such as security, transactions, and logging. From a quality and agile development perspective, one big advantage of Spring-based applications is that they are amenable to unit testing with frameworks such as JUnit and EasyMock.
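
A minimal sketch of Spring's XML-based dependency injection (the bean names and classes are made up for illustration; OrderServiceImpl is assumed to expose a setOrderDao setter):

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">
  <!-- the DAO is injected into the service; neither object looks up the other -->
  <bean id="orderDao" class="com.example.dao.JdbcOrderDao"/>
  <bean id="orderService" class="com.example.service.OrderServiceImpl">
    <property name="orderDao" ref="orderDao"/>
  </bean>
</beans>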

Service-oriented architecture (SOA) exposes application business logic as a set of services that are remotely accessible and reusable across platforms and programming languages. The Service Component Architecture (SCA) has been designed to facilitate service composition. The SCA Assembly Model consists of one or more service components, which provide business functions to other components within or outside the composite. A composite contains one or more service components and specifies communication bindings and policies such as security and transactions. As in Spring, the artifacts and the dependencies between them are described in XML.
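
A minimal sketch of an SCA 1.0 composite (the component and class names are made up for illustration):

<composite xmlns="http://www.osoa.org/xmlns/sca/1.0"
           name="OrderComposite">
  <!-- expose the component's business function over a web service binding -->
  <service name="OrderService" promote="OrderComponent">
    <binding.ws/>
  </service>
  <component name="OrderComponent">
    <implementation.java class="com.example.OrderServiceImpl"/>
  </component>
</composite>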

The Open Services Gateway initiative (OSGi) defines a dynamic service model where components (packaged as bundles) and their dependencies are specified in a service registry. OSGi standardizes the life cycle management of these bundles including deployment, installation/uninstallation, and updates with full versioning capabilities. Bundles can be dynamically started, stopped, or updated without the need for a reboot. OSGi defines a model for publishing, discovering, and binding to services within the same virtual machine (VM).

Spring, SCA, and OSGi are converging to create an environment that facilitates the design and the lifecycle management of software assets that are exposed as reusable services. Software vendors are actively exploring different opportunities to combine these three technologies. The combination of Spring, SCA, and OSGi is already having a transformative impact not only on service-oriented design and application development in general, but also on the application servers and middleware market as well.

Sunday, May 4, 2008

SOA and ROA Design Principles and Patterns

I've been compiling a list of design patterns and anti-patterns for Service Oriented Architecture (SOA) and Resource Oriented Architecture (ROA). I find the following resources quite useful.

If you're looking for design patterns for building RESTful applications, the best place to start is the Atom Publishing Protocol (AtomPub), which is a good embodiment of the principles of the REST architectural style. The Google Data API (GData) is a real-world implementation of AtomPub. At the XML 2007 Conference, I proposed a RESTful approach to aviation technical data management called "Integrated Documentation Environment for Aircraft Support (IDEAS)" (more on that in my previous post, RESTful IDEAS).

Another good resource is the book "RESTful Web Services" by Leonard Richardson and Sam Ruby. Chapter 8 entitled "REST and ROA Best Practices" is a must read and also addresses potential REST implementation issues such as asynchronous operations and transactions. Chapter 10 entitled "The Resource-Oriented Architecture Versus Big Web Services" offers ROA alternatives to WS-* specifications.

There are also a number of useful resources on SOA design patterns and anti-patterns.

I don't believe that ROA is the answer to all SOA project failures out there. However, I do believe that certain requirements and use cases are more amenable to the REST architectural style (more on that in a future post).

Sunday, April 13, 2008

SOA Governance Tools

First, an important caveat: SOA governance is not a tool or a product. SOA governance is about people and leadership. No tool will deliver good governance out of the box if the human factor is not taken into consideration. However, the right tool can facilitate and provide transparency into SOA governance.

It’s important to make a distinction between design-time SOA governance and run-time SOA governance. The main objective of run-time SOA governance is the enforcement of QoS and SLAs. Design-time governance focuses on the enforcement of industry-recognized SOA design principles and patterns.

The goal of these patterns is not to kill the creativity of SOA developers or police their work, but to avoid SOA anti-patterns that are known to undermine the success of SOA projects. For example, SOA developers cannot reuse services if they are not aware that these services exist in the enterprise. Even if they know the services exist and where to find them, they cannot reuse them if they don't understand them. Therefore, the ability to easily discover well-specified service metadata is one good design principle that can help deliver on SOA's promise of service reuse across the enterprise.

One of the key aspects of design-time SOA Governance is the management of the lifecycle of service artifacts and the dependencies between them. This is accomplished through a new breed of tools called SOA Repositories/Registries. The following are what I consider important requirements for an SOA Repository/Registry.

Indexing of XML-formatted artifacts such as WSDL, XML Schemas, Schematron rules, XSLT transforms, Spring and Hibernate configuration files, data mapping specifications, WS-Policy documents, etc. Ideally, users should be able to use languages such as XSLT and XQuery to manipulate artifacts and query the Repository/Registry. Requirements and specification documents should be stored in XML (as opposed to MS Word or Excel) if possible so that they can be processed the same way. The SOA Repository/Registry should sit on a native, XQuery-compliant XML database. This would give the registry powerful visualization and reporting capabilities: I should be able to run an XQuery against the SOA Repository/Registry that returns all artifacts containing a reference to a certain XML element, so that I can visualize the impact that a change to that element would have. Automatic detection of certain dependencies (e.g. between WSDLs and XML Schemas) should be supported as well.

Policy enforcement is also important as artifacts are added to the registry. For example, Schematron can be used to enforce XML Schema best practices. WS-I Basic Profile compliance and XML Schema backward compatibility may also need to be enforced.
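
As a sketch, here is a Schematron rule that could be run when a schema is submitted to the registry (the choice of best practice to enforce is just an example):

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:ns prefix="xsd" uri="http://www.w3.org/2001/XMLSchema"/>
  <sch:pattern>
    <sch:rule context="xsd:schema">
      <!-- every schema registered must qualify its local elements -->
      <sch:assert test="@elementFormDefault = 'qualified'">
        XML Schemas submitted to the registry must declare
        elementFormDefault="qualified".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>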

Support for a RESTful API for all CRUD (create, read, update, delete) operations on artifacts. Ideally, I would prefer support for the AtomPub specification and the Atom syndication format for pushing updates to stakeholders. Competing SOA Registry protocols and APIs include UDDI v3, the Java API for XML Registries (JAXR), JSR 170/283 (the Java Content Repository API), and IBM WebSphere Registry and Repository (WSRR). Open source vendors such as MuleSource and WSO2 have adopted AtomPub for its simplicity, while Red Hat is building its upcoming SOA Repository/Registry (JBoss DNA) on JSR 170. Mule Galaxy sits on the Apache Jackrabbit JCR repository. The JSR 170 repository model could also be adopted in the Java space as a standardized repository model for SOA registries.
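
As a sketch, pushing a registry update to stakeholders could then be as simple as publishing an Atom entry to the artifact's feed (the IDs and URLs below are made up):

<entry xmlns="http://www.w3.org/2005/Atom">
  <title>PurchaseOrder.xsd updated to version 1.2</title>
  <id>urn:uuid:hypothetical-artifact-entry-id</id>
  <updated>2008-04-13T10:00:00Z</updated>
  <author><name>SOA Repository/Registry</name></author>
  <link rel="alternate"
        href="http://registry.example.com/artifacts/PurchaseOrder.xsd"/>
  <summary>Backward-compatible change: added an optional shipTo element.</summary>
</entry>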

Of course authentication, authorization, audit trails, workflow, and versioning should be expected in any SOA Repository/Registry.

Sunday, March 23, 2008

Passportgate: what would you do about it?

The current Passportgate scandal in the US involving unauthorized access to the passport files of the three presidential candidates got me thinking about information security in enterprise applications, particularly records and content management systems.

Ensuring information security requires a multidimensional approach based on technology, process, policy, and governance. Technology alone is not the answer. However, since this is a technology-oriented blog, I will focus only on the state of the art in securing Java EE applications particularly in the open source space.

From a technology standpoint, I see at least four potential issues: authentication, authorization/access control, audit trail, and business process.

Spring Security (formerly Acegi) has demonstrated its strength for both authentication and authorization in Spring-based portal and content/record management applications. Spring AOP (Aspect-Oriented Programming) provides an elegant and simple solution for audit trails in such systems.

JBoss jBPM is a robust BPM engine that meets the requirements for workflow and enterprise business process orchestration between applications, services, and people.

The eXtensible Access Control Markup Language (XACML) is an OASIS standard for specifying access control policies in XML. XACML is not yet widely used in content/record management systems. One explanation is that XACML was designed primarily to provide access control for newer services, such as web services in service-oriented architectures (SOA). XACML would be challenging to use for document-level security in content repositories that have a hierarchical structure (e.g. the JSR 170/283 repository model) and demand sophisticated caching for scalable and rapid access to massive amounts of content.
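
To make the discussion concrete, here is a heavily abbreviated sketch of an XACML 2.0 policy (the role attribute identifier and its values are hypothetical):

<Policy xmlns="urn:oasis:names:tc:xacml:2.0:policy:schema:os"
        PolicyId="passport-files-policy"
        RuleCombiningAlgId="urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:first-applicable">
  <Target/>
  <!-- permit read access only to subjects holding the consular-officer role -->
  <Rule RuleId="permit-consular-read" Effect="Permit">
    <Target>
      <Subjects>
        <Subject>
          <SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">consular-officer</AttributeValue>
            <SubjectAttributeDesignator AttributeId="urn:example:attributes:role"
                DataType="http://www.w3.org/2001/XMLSchema#string"/>
          </SubjectMatch>
        </Subject>
      </Subjects>
      <Actions>
        <Action>
          <ActionMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">read</AttributeValue>
            <ActionAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id"
                DataType="http://www.w3.org/2001/XMLSchema#string"/>
          </ActionMatch>
        </Action>
      </Actions>
    </Target>
  </Rule>
  <!-- rules are evaluated in order; anything not permitted above is denied -->
  <Rule RuleId="deny-all" Effect="Deny"/>
</Policy>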

However, my favorite XML database (eXist) has an elegant implementation of XACML for controlling access to resources such as XQuery modules and Java methods, proving once again that Open Source is ahead in terms of innovation in the software industry.

Tuesday, March 4, 2008

Boeing 787 Flight Control Software and Composite Fuselage Tested

Randy Tinseth, Boeing Commercial Airplanes vice-president for marketing announced on his blog that Boeing chief pilot Mike Carriker and 787 systems director Mike Sinnett successfully tested the flight-control Blockpoint 8 software code in the 787 engineering flight simulator. He wrote:
During the test, Mike and Mike demonstrated most of the operational procedures required by a flight crew - pushback and engine start at Sea-Tac airport near Seattle, taxi and takeoff, climb, cruise, simulated engine failure, descent, approach, single-engine go-around, landing, taxi and arrival at the gate at the Portland, Oregon airport.

Boeing also performed a series of tests on the composite fuselage of the B787, including "limit load", "ultimate load", and loads beyond 2.5 times the normal force of gravity (2.5 G). According to a Boeing press release dated 02/28/2008:
Testers observed audible indications of damage as the test progressed but the piece did not reach the level of destruction that had been anticipated.

This is a significant development because last September, Boeing announced the delay of the maiden flight of the first 787 due to flight control software issues, fastener shortages, and supply chain bottlenecks. Also last year, a former Boeing engineer went public with concerns about the survivability of the 787 composite structure in case of a crash.

The maiden flight has been postponed again to June this year. First delivery to launch customer All Nippon Airways is scheduled for early 2009.

UPDATE: On April 9, 2008, Boeing announced that the 787's maiden flight had been postponed to the 4th quarter of 2008 and the first delivery to the 3rd quarter of 2009.

Speaking about the 787's globally distributed aircraft manufacturing model in an internal memo sent to Boeing employees on April 21, 2008 (and obtained by the Seattle Times), Boeing CEO Jim McNerney noted:
I expect we’ll modify our approach somewhat on future programs—possibly drawing the lines in different places with regard to what we ask our partners to do, but also sharpening our tools for overseeing overall supply chain activities.

Friday, February 29, 2008

Declarative Programming Recipes

If you've been using an imperative programming language such as Java, C#, or JavaScript, then you should consider the benefits of declarative programming when designing your next application. In this blog, I will explore the benefits of the following technologies:

  • XML Schema
  • ISO Schematron
  • XForms/XBL (XML Binding Language)
  • XSLT 2.0
  • XQuery
  • Atom Syndication
  • AtomPub

The basic difference between declarative programming languages and imperative programming languages such as C# and Java is that the former specify the “what” (the intent) as opposed to the “how” (the algorithm). The following are some reasons to consider this new paradigm:

  • Declarative programming languages are accessible to many non-programmers
  • It’s possible to create a solution that is completely declarative (no Java, C#, JavaScript, or AJAX code)
  • It facilitates the Model-Driven Architecture (MDA) software design approach
  • When the data or content is managed in XML, a declarative programming model is a superior and simpler alternative to a conventional approach based on Java Server Faces (JSF), Spring, and Hibernate. For example, you get a more robust and simpler data and business rules validation mechanism and you can forgo ORM mapping altogether. However, declarative and imperative programming can co-exist in the same solution.

The following are some recipes for taking advantage of this new paradigm:


  • Define your data structures and types in the XML Schema.
  • Use XML Schema's xsd:appinfo element (see the sketch after this list) to:

    • Capture metadata such as XML-to-RDBMS mapping information
    • Capture business rules using inline ISO Schematron rules
    • Control the labels, alerts, and appearances of UI components such as XForms controls

  • Generate your UI by transforming the XML Schema into XForms with XSLT.
  • Alternatively, store your XML schema and other XML-formatted metadata artifacts in a native XML database and use XQuery to extract and manipulate the metadata during development and at run-time.
  • Use XBL for enabling custom UI controls for your XForms. As an example, you can integrate a rich text editor using XBL and Dojo.
  • Use ISO Schematron for business rules validation either post form-submission or by using XSLT to inject schematron rules into the XForms directly with the xforms:bind element.
  • Use the RESTful or AtomPub API of a native XQuery-enabled XML database (such as Exist) to CRUD (create, read, update, delete) the data.
  • If you're using SOAP-based web services, use XForms to send SOAP requests and display SOAP responses as well.
  • Leverage Atom syndication for pushing updates to data consumers. In general, when creating your application, think "Syndication Bus".
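
As a minimal sketch of the xsd:appinfo recipe above, the following schema fragment carries a mapping hint, an inline Schematron rule, and UI hints side by side (the mapping and UI vocabularies are made-up namespaces for illustration):

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:map="urn:example:mapping"
            xmlns:ui="urn:example:ui">
  <xsd:element name="unitPrice">
    <xsd:annotation>
      <xsd:appinfo>
        <!-- hypothetical XML-to-RDBMS mapping metadata -->
        <map:column table="ORDER_LINE" name="UNIT_PRICE"/>
        <!-- inline ISO Schematron business rule -->
        <sch:pattern>
          <sch:rule context="unitPrice">
            <sch:assert test=". &gt; 0">The unit price must be positive.</sch:assert>
          </sch:rule>
        </sch:pattern>
        <!-- hypothetical hints for generating the XForms control -->
        <ui:label>Unit price</ui:label>
        <ui:alert>Enter a positive amount</ui:alert>
      </xsd:appinfo>
    </xsd:annotation>
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal"/>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>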

Many software architects and developers are not yet aware of the power of this paradigm. However, I believe that rather than reaching only for familiar development frameworks, it's important to always evaluate emerging alternatives and select the best approach, even if it involves a learning curve.

Saturday, February 2, 2008

Open Source and the Democratization of Knowledge

I like Open Source Software (OSS) for a number of reasons. It lowers the cost of software and allows a larger number of organizations and individuals worldwide to make use of it. OSS is also where real innovation is currently happening in the software industry. OSS companies focus their resources on the engineering process and use the web itself as a marketing and distribution channel. This allows users to freely download and try the product and decide for themselves if it’s good for them. They can get free support from the community of users or purchase commercial support for mission-critical applications. The quality of the product can be enhanced quickly because of the feedback the OSS developers get from the potentially thousands of users who download the software.

But the most important reason why I like OSS is that it enables what I call the democratization of knowledge. Complex software like operating systems, databases, ERP, CMS, and web portals all exist today in OSS. OSS is a great equalizer because anyone can access the source code to learn or discover how such complex systems are designed. They can also contribute to the code if they wish. This is unprecedented in human history. Traditionally certain groups of people have kept their technological know-how to themselves as a competitive advantage used to dominate and control markets and/or other groups of people.

Wednesday, January 23, 2008

Learning From My Students

For the past 10 years, I've taught a variety of technical topics in classroom settings. The subjects I've covered include XML, XML Schema, XPath, XSLT, XSL FO, SAX, DOM, XQuery, and S1000D. The participants are usually professionals who are looking to upgrade their skills. It’s gratifying to read an e-mail from a student saying how a course I’ve taught has helped them be more productive in their work or pass a certification exam.

It’s always a pleasure to share my knowledge and experience with people, but also to learn from them. When you teach a class, you must master the topic because you can't afford to always respond to students' questions by saying that you will do some research and get back to them later with an answer. While preparing for the class, you need to find examples and real-life scenarios to explain complex concepts. During the class itself, by listening to students’ questions and comments and trying to answer them, you actually discover some aspects or applications of the topic that you had not thought about before.

As the saying goes "the best way to learn is to teach".

Saturday, January 19, 2008

Applicability in S1000D 3.0

S1000D has a new and improved applicability mechanism based on the concepts of Applicability Cross Reference Table (ACT), Condition Cross Reference Table (CCT), and Product Cross Reference Table (PCT).

The ACT data module declares attributes of the product that are not likely to change during its life cycle such as model, series, and serial number. Examples of product attributes for a commercial aircraft include the manufacturer serial number and aircraft registration number.

The CCT data module declares technical, operational, and environmental conditions that can affect the applicability of technical content. Examples of these conditions are: service bulletin incorporation, location of maintenance, aviation regulations, temperature, wind speed, and sandy conditions.

The PCT data module lists actual physical product instances. For each product instance, the PCT specifies the values of product attributes and conditions pertaining to the product instance.

Applicability can be specified at the data module level inside the IDSTATUS, or within the content of the data module at a more granular level such as a <step1> element. The ACT and the CCT are used as lookup tables for the relevant product attributes or conditions as well as their allowed values. The applicability element then specifies the relevant product attribute or condition identifier from the ACT or CCT and the values to test against.

The applicability information itself can be captured in a human-readable format for simple cases. For more complex cases, one or more assertions are used to specify the product attribute or condition to test and the values to test against. These values can be constrained with a pattern based on regular expressions as defined by the XML Schema specification.
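
As a hedged illustration only (the markup below follows the later XML Schema based issues of S1000D rather than Issue 3.0, and the property identifiers are made up), an assertion-based applicability annotation looks roughly like this:

<applic>
  <!-- applies to serial numbers 0001-0100 that have SB-123 incorporated -->
  <evaluate andOr="and">
    <assert applicPropertyIdent="serialno" applicPropertyType="prodattr"
            applicPropertyValues="0001~0100"/>
    <assert applicPropertyIdent="sb-123" applicPropertyType="condition"
            applicPropertyValues="incorporated"/>
  </evaluate>
</applic>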

The new S1000D applicability mechanism supports the "effectivity" requirements of civil aviation and provides capabilities beyond the ATA 2200 effectivity mechanism. It also facilitates the development of applicability filtering functionality in Interactive Electronic Technical Publications (IETPs). However, building an authoring front end that hides this complexity (regular expressions and logical operations) from the technical authors creating the content will be key. This is also an area where well-defined business rules should be specified and enforced using a tool such as ISO Schematron.

Friday, January 4, 2008

On our Radar in 2008

First, Happy New Year and Best Wishes for a Peaceful 2008!

Since this is my first blog this year, I will talk about what will be on our radar screen. The Java EE platform with JSF and the open source frameworks Spring and Hibernate will continue to be our preferred development platform for robust enterprise applications (ERP, portals and CMS). We like the ability to leverage third party AJAX-enabled JSF component libraries such as ICEFaces and Apache MyFaces to quickly create rich internet applications (RIA). JSF also allows us to target both the mobile and web delivery platforms simultaneously through the use of render kits. Spring brings sanity and simplicity into the development of complex Java EE applications and Hibernate allows us to develop database agnostic applications among other benefits.

Although we still believe Java EE is the way to go for complex enterprise applications, we will learn and embrace Ruby on Rails and Adobe Flex this year. Feel free to recommend your favorite books on these topics.

We'll continue to make innovative uses of Atom syndication, AtomPub, FeedSync, and OpenSearch to solve our customers' problems.

A new version of our favorite XML database (Exist) will be released soon and will improve performance significantly. Exist already has some support for AtomPub. Declarative programming using a combination of XForms, ISO Schematron, XSLT 2.0, XQuery, and Exist's RESTful API will deliver great value for those who are willing to experiment.

Finally on the Web 2.0 front, we'll be exploring the intersection of content management with social computing, mashups, user-generated content, and rich internet applications (RIA).

We look forward to another collaborative and productive year.