In the technology world, there is a lot to be said about perception, and how perception defines your reality. You can perceive a market one way, only to shift your point of view slightly and realize that the real target market for a class of technologies is quite different from the one in which those technologies are actually being sold. Unfortunately, that change of perception often occurs too late to make a difference. A perfect example of this phenomenon is Enterprise Information Integration, or EII. EII is a specific segment of the data integration market that was gaining real traction on its perceived value to enterprise IT systems but, for reasons that will become clear, has essentially disappeared during the past couple of years. The market for SOA infrastructure software, on the other hand, appears (finally) not only to be gaining traction, but to be on the verge of becoming the next big wave.
What’s interesting in juxtaposing the decline of EII with the rise of SOA is that data integration is a critical component of SOA, yet the vendors we associate with SOA are not necessarily specialists in data integration. At the same time, the EII vendors who were well positioned to address this real requirement were unable to break into the SOA market. They perceived a very different market need for their technologies. Instead of refocusing their efforts on the data services requirements of SOA, they spent their time trying to sell to people who were more interested in reporting applications and in augmenting data warehouses. By taking this route, they became mired in attempting to build model-based virtual databases, distributed query engines, and optimizers, and headed down a surefire losing path.
The bottom line is that there’s no way to optimize around a basic fact: if you need to run a join of large data sets from two different databases without an index, even the most efficient algorithms must still traverse both data sets at least once. Anyone accustomed to index-driven, optimized database response times for this sort of workload will naturally opt for an ETL-based warehouse or data mart solution and eschew the more dynamic form of data integration promoted by EII. You could easily argue that the EII vendors had a perception of the market for their products that diverged widely from the market reality.
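To make the cost argument concrete, here is a minimal sketch of a hash join in Python (the data and field names are invented for illustration). Even this efficient, index-free algorithm cannot avoid one full pass over each data set, which is exactly the overhead a warehouse user accustomed to indexed lookups will balk at:

```python
def hash_join(left, right, key):
    """Join two lists of dicts on `key`: one pass to build a hash table,
    one pass to probe it. Both full traversals are unavoidable."""
    index = {}
    for row in left:                          # pass 1: build
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in right:                         # pass 2: probe
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical example data: two "databases" with no shared index.
customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]
orders = [{"cust_id": 1, "total": 250}, {"cust_id": 1, "total": 75}]
print(hash_join(customers, orders, "cust_id"))
```

With large tables on two different servers, those two passes mean shipping and scanning every row once, which is why the dynamic approach loses to a pre-built warehouse for reporting workloads.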
Data integration for applications, and by extension for SOA, is different, though, and much more amenable to distributed, real-time data integration approaches. Why? For the most part, applications deal with manageable chunks of data at any given time, even when that data comes from multiple sources. All the account information for a customer, or even all the insurance policies for a given customer along with that customer’s entire history, is a far more manageable volume of data to integrate on the fly.
Interestingly, this is one key area where the database heritage of the EII vendors could actually have worked to their advantage. Most, if not all, of the major EII vendors were able to exploit relational algebra and the fact that, if you can represent data in relational form, a small number of composable operators is enough to build a service that accomplishes extremely sophisticated data manipulation. This, in turn, makes it possible to build some very powerful and intuitive tools that render the whole process of developing data services more efficient and less error prone. Think of it as the data manipulation equivalent of BPEL.
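A toy illustration of the point, not any vendor's actual engine: with just three composable relational operators over rows represented as dicts, fairly sophisticated manipulation reduces to a readable pipeline (the policy and customer data are hypothetical):

```python
def select(rows, pred):
    """Relational selection: keep rows matching a predicate."""
    return [r for r in rows if pred(r)]

def project(rows, cols):
    """Relational projection: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, key):
    """Naive equi-join on a shared column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

policies = [{"cust": "A", "type": "auto", "premium": 900},
            {"cust": "A", "type": "home", "premium": 1200},
            {"cust": "B", "type": "auto", "premium": 700}]
customers = [{"cust": "A", "region": "east"}, {"cust": "B", "region": "west"}]

# Compose the operators like a pipeline: join, filter, then shape the output.
result = project(select(join(policies, customers, "cust"),
                        lambda r: r["region"] == "east"),
                 ["cust", "type", "premium"])
```

Because each operator consumes and produces the same row shape, a tool can present them as draggable building blocks; that closure property is precisely what the relational algebra buys you.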
So EII vendors actually had most of the technology and tools required for application-based (or application-driven) data integration. What they didn’t have was a market and sales focus on SOA, and so in general they never addressed the remaining technical problems.
In the SOA world, data is represented in any number of formats, but more and more commonly in XML. The adoption of XML as a lingua franca is only accelerating as the use of industry-specific, or even enterprise-specific, XML grammars becomes a hard requirement in deployed services. This development has been a long time coming, but we still do not have the equivalent of the relational algebra for XML. The trouble is that building intuitive, declarative-looking tools to describe data integration logic (as opposed to the mapping of one schema to another) is very difficult, because the language operations (in XSLT or XQuery) that the tools ultimately produce, and that therefore must be specified in the UI, are too fine grained and can easily devolve into procedural code in the way they are used.
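The "fine-grained" problem can be felt in any language without a relational-style algebra over XML; here is a sketch in Python (standing in for hand-written XSLT/XQuery, with an invented schema) where a task that would be a one-line group-by in relational terms turns into explicit tree-walking with mutable state:

```python
import xml.etree.ElementTree as ET

src = ET.fromstring(
    '<orders>'
    '<order cust="A"><total>250</total></order>'
    '<order cust="A"><total>75</total></order>'
    '<order cust="B"><total>40</total></order>'
    '</orders>')

# Group order totals by customer: loops, conditionals, and a bookkeeping
# dict -- procedural logic a declarative tool cannot easily round-trip.
out = ET.Element("customers")
seen = {}
for order in src.findall("order"):
    cust = order.get("cust")
    if cust not in seen:
        seen[cust] = ET.SubElement(out, "customer", name=cust)
    seen[cust].append(order.find("total"))
print(ET.tostring(out, encoding="unicode"))
```

Every transformation written this way encodes its logic in control flow rather than in composable operators, which is exactly why graphical tooling over XSLT/XQuery struggles to stay declarative.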
To truly take advantage of data services in an SOA environment, what is needed is the best of both worlds: technology that provides a small set of relational or relational-like operators for manipulating tuple-based data streams, combined with the ability to map a single tuple-based stream into an arbitrary XML schema. Also crucial is the ability to specify data validation rules that can be deployed as policies attached to service endpoints and enforced far more efficiently than XSD schema validation.
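The "best of both worlds" split can be sketched as a thin mapping layer at the service edge: do all the heavy lifting on tuples (as in the relational operators above), then render the finished stream into whatever XML shape the endpoint contract demands. Element and field names here are illustrative only:

```python
import xml.etree.ElementTree as ET

def tuples_to_xml(rows, root_tag, row_tag):
    """Map a tuple-based stream (list of dicts) into a simple XML shape.
    A real mapper would target an arbitrary schema; this shows the seam."""
    root = ET.Element(root_tag)
    for row in rows:
        elem = ET.SubElement(root, row_tag)
        for field, value in row.items():
            ET.SubElement(elem, field).text = str(value)
    return root

# Hypothetical tuple stream, already shaped by relational-style operators.
accounts = [{"id": "1001", "balance": "250.00"},
            {"id": "1002", "balance": "75.50"}]
xml = ET.tostring(tuples_to_xml(accounts, "Accounts", "Account"),
                  encoding="unicode")
print(xml)
```

Keeping the XML mapping as a final, isolated step is what lets the integration logic itself stay in the small, composable operator vocabulary.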
Finally, and perhaps most importantly, what is needed is the realization that the true “on-the-fly” data integration customer is someone interested in building or integrating applications, not someone interested in producing reports or doing on-the-fly data mining or multidimensional analysis. The person interested in building out a data-rich SOA isn’t interested in heavy, cumbersome up-front data modeling exercises, but in straightforward tools that allow the declarative description of data integration and mapping operations.