XML on Linux
Practical XML with Linux, Part 2: A survey of tools
The standard remains uncorrupted by the influx of heavy hitters
Summary
XML's popularity as a document-exchange format has soared recently. Uche Ogbuji surveys the vast menagerie of sometimes remarkably polished tools available for creating and serving XML documents. (3,100 words)
By Uche Ogbuji
When rereading my first two XML articles, published just over a year ago in this journal and its sister, SunWorld, I'm struck by how much they justified XML as the great tin opener of closed data formats. It looks now as if all the light bulbs went on in less than a year. XML is in the computer news every day, every company seems to be scrambling to include XML in its brochures, and XML standards organizations such as the World Wide Web Consortium (W3C) practically have to turn people away. At this point, I hardly think any more justification of XML for open data exchange is required; it's basically a fact of information technology. The remaining questions concern the use of XML to better solve real problems.
As I mentioned in my last article, the W3C and other standards organizations are working very quickly to complete specifications for technologies complementary to XML. I mentioned namespaces, which are a key facility for managing global names and are now used in almost every XML technology. I also mentioned DOM and XSLT. Since then, XLink, XPointer, XML Schemas, SVG, and other specs have neared completion. I will discuss these later in the series, as well as RDF, Schematron, and other beasts in the XML menagerie. The XML community has also matured greatly -- there are many new high-quality information and news sites, some of which I list in the Resources section. If you are an XML enthusiast, I particularly recommend regular visits to xmlhack.
The triumph of open standards
The most amazing thing about XML's incredible rise, which I think has been quicker than that of the PC, Java, or even the Web, is the fact that it is still as open as ever. Even though XML was originally intended to encourage data interchange by providing both human and machine readability, the odds were that a powerful company or group of companies would foul the waters. Many vertical industries, such as the automobile industry (which recently surprised analysts by announcing a huge XML-driven online exchange), the health care industry, and the chemical industry, have adopted XML as their preferred data-exchange format. If the likes of Microsoft (the early and ongoing XML champion) and Oracle could co-opt standards for XML processing, they could increase their domination in such industries under the guise of openness: the perfect monopolistic Trojan horse.
This was never an idle menace. Last year, Microsoft nearly derailed XSLT by bundling a mutation of XSLT, different from the emerging standards and laden with Microsoft extensions, into its Internet Explorer 5 browser. Many Linux advocates cried loudly about Microsoft's "embrace-extend-extinguish" move on Kerberos, but that was a weak jab compared to the MS XSL ploy. Since Internet Explorer is by far the most popular browser, Microsoft ensured that most of the world's XSLT experience would come through their proprietary version, and nearly made that version the de facto standard. There were many flame wars on the XSL-List mailing list (see Resources) when Explorer users arrived in droves asking what the proper XSLT was.
But then something surprising happened. Microsoft's customers said loudly and clearly that they didn't want an MS flavor of XSLT -- they wanted the standard. The first sign that Microsoft understood this was a slow migration to the standard in Internet Explorer updates. Then MS developers announced publicly that their new design goal was full compliance with the XSLT standard. Finally, after some prodding on XSL-List, several of Microsoft's developers admitted they had been receiving numerous email messages asking them to get in line.
Now, I know Linux users don't expect such sophistication and independent thought from large numbers of MS users; I'm no exception to that possibly bigoted attitude. I credit this remarkable episode to the power of the promise of openness in XML. Of course, this didn't prevent Microsoft from committing amusing gaffes such as claiming to have invented XML (as reported by the Washington Post in May), but such things are far less dangerous than standards pollution.
Similar stories recur frequently throughout XML's brief history. In fact, Microsoft apparently didn't learn its lesson from the XSLT fiasco; it is currently being bludgeoned into abandoning its proprietary XML schema format, XML-Data, in favor of XML Schemas, which has almost traversed the W3C standards track. The battle hit fever pitch with Microsoft's loud announcement of BizTalk, an ambitious repository and toolkit for XML schemas. But it increasingly looks like the open standard will win out.
But enough about the wide, wild world. Let's have a look at what's happening at home. In my first XML article, I had to virtually apologize for the lack of XML tools for Linux. This problem has been corrected to an astonishing degree.
This article briefly introduces some XML tools for Linux in a few basic categories: parsers, Web servers, application servers, GUIs, and bare-bones tools. Most users' introduction to XML will be geared toward better Webpage management. They may then choose to migrate to complete, all-inclusive application servers or to construct custom systems from the various XML toolkits available and the usual Unix duct tape and wire ties. There is usually some content to manage, and you may see no reason to leave the world of Emacs or vi to churn out documents. However, content managers are often non-technical, so it's helpful that there is a good selection of GUI XML editors.
Just the parsers, ma'am
XML processing starts with the parser, and Linux has many to choose from. First, you have to pick a language: C, C++, Java, Python, PHP, Perl, TCL, or even JavaScript. (This is hardly an exhaustive list.) Next, you must decide whether and how to validate your XML documents. Validation ensures that all of an XML document's elements and attributes conform to a schema. The traditional XML validation method is the document type definition (DTD). The W3C, as I mentioned, has almost completed XML Schemas, which has the advantages of XML format (DTDs are in a different format) and "modern" data-typing. Its main disadvantages are complexity and immaturity.
C users are well served by James Clark's old standby, Expat, a bare-bones parser which is arguably the fastest in existence, but provides no validation. It is significant that almost every language under the sun, from Python to Eiffel, provides a front-end to Expat. But even Expat is facing some tough "competition" from entries such as the capable libxml project, led by Daniel Veillard of the W3C. This library, most prominently used in GNOME, offers many fine-tuning parsing options and supports DTD validation. There is also Richard Tobin's RXP, which supports DTD validation. C++ users have Xerces-C++, which is based on XML4C code that IBM donated to the Apache XML Project. Xerces-C++ supports both DTD and XML Schemas validation. In fact, if you want to use XML Schemas in Linux, Xerces is probably your best bet. Individual efforts include Michael Fink's xmlpp, which is quite new and doesn't support validation.
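To give a feel for what an Expat front-end looks like, here is a minimal sketch using the `xml.parsers.expat` module from Python's standard library. The `count_elements` helper and the sample document are made up for illustration. As with Expat itself, the parser only checks well-formedness; it does not validate against a DTD.

```python
import xml.parsers.expat


def count_elements(xml_text):
    """Return a dict mapping element names to how often they occur."""
    counts = {}

    def start_element(name, attrs):
        # Expat calls this handler once for every start tag it encounters.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    # The second argument tells Expat this is the final chunk of input.
    parser.Parse(xml_text, True)
    return counts


# Hypothetical sample document, used only for this illustration.
SAMPLE = "<tools><parser name='Expat'/><parser name='RXP'/></tools>"

if __name__ == "__main__":
    print(count_elements(SAMPLE))
```

Malformed input raises `xml.parsers.expat.ExpatError`, which is all the checking a non-validating parser performs.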
There is a Java version of Xerces with a similar pedigree. (Java users are almost drowning in choices.) The media has called it the "marriage" of Java and XML, but the most likely explanation for the huge number of XML tools for Java is that XML emerged right as Java was peaking as a programming language. Besides Xerces-J, there are Java parsers from Oracle, Sun, DataChannel, and others. Individual efforts include Thomas Weidenfeller's XMLtp (tiny XML parser), designed for embedding into other Java apps (as was the pioneering but now moribund AElfred from Microstar). Weidenfeller also provides one of the neatest summaries of the OSS license I've ever seen: "Do what you like with the software, but don't sue me." Then there is the Wilson Partnership's MinML, designed to be even smaller, for use in embedded systems.
Python still has the growing and evolving PyXML package, as well as 4Suite, from my company, FourThought. XML considerations are helping to shape many grand trends of Python, such as Unicode support and improved memory management. The Perl community has definitely taken to XML. The main parser is, appropriately, XML::Parser, but you can take most XML buzzwords, prefix "XML::", and find a corresponding Perl package.
Serving up XML pages
XML's early promise to the media was a way to tame the Web -- structured documents, separation of content and presentation, more manageable searching, and autonomous Web agents. Some of that was drowned out by the recent interest in XML for database integration and message-based middleware, but XML is still an excellent way to manage structured content on the Web. And Linux is a good operating system on which to host the content.
The big dog among XML Web servers is the well-known big dog of Web servers, period. Apache is absolutely swarming with XML activity lately. I've already mentioned Xerces, the XML parser from the Apache XML Project. Apache also has an XSLT processor, Xalan, with roots in IBM/Lotus's LotusXSL. There is also Formatting-Object Processor (FOP), a tool for converting XML documents into PDF documents by way of XSL formatting objects, a special XML vocabulary for presentation. Apache has added support for Simple Object Access Protocol (SOAP), an XML messaging protocol used to make HTTP-based queries to a server in an XML format. As a side note, SOAP is heavily contributed to and championed by Microsoft, one of the many positive contributions the company has made to XML while trying not to embrace and extend.
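For readers who have not seen a SOAP message, the sketch below builds one with Python's standard `xml.etree.ElementTree` module and reads it back. The envelope namespace is the one defined by SOAP 1.1; the `urn:example:quotes` namespace, the `GetQuote` operation, and the `symbol` parameter are hypothetical and stand in for whatever a real service would define. The sketch only constructs the XML payload and does not send it over HTTP.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
EXAMPLE_NS = "urn:example:quotes"  # hypothetical application namespace


def build_request(symbol):
    """Serialize a SOAP envelope that calls a hypothetical GetQuote operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{EXAMPLE_NS}}}GetQuote")  # hypothetical operation
    ET.SubElement(call, f"{{{EXAMPLE_NS}}}symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")


def read_symbol(message):
    """Extract the symbol parameter from a message built by build_request."""
    root = ET.fromstring(message)
    node = root.find(f"{{{SOAP_ENV}}}Body/{{{EXAMPLE_NS}}}GetQuote/{{{EXAMPLE_NS}}}symbol")
    return None if node is None else node.text


if __name__ == "__main__":
    request = build_request("RHAT")
    print(request)
    print(read_symbol(request))
```

In practice the serialized envelope would travel as the body of an HTTP POST, and the server's reply would be another envelope of the same shape.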
These bits and pieces are combined into an Apache-based XML Web publishing solution called Cocoon. Cocoon allows XML documents to be developed, then published on the Web for wireless applications through Wireless Application Protocol (WAP), and to print-ready PDF format through FOP.
Perl hackers already have the aforementioned proliferation of "XML::*" packages, but Matt Sergeant has also put together AxKit, a comprehensive toolkit for XML processing. AxKit, specialized for use with Apache and mod_perl, provides XSLT transformation and other nonstandard transform approaches, like XPathScript.
Full-blown application servers
Enterprises that want an end-to-end solution for developing and deploying applications using XML data have several options under Linux. Application servers build on basic Web servers like those described above by adding database integration, version control, distributed transactions, and other facilities.
Hewlett-Packard found an open source, Web-hip side with e-speak, a tool for distributed XML applications with Java, C, and Python APIs for development and extension.
A smaller company that uses the advantages of open source to promote its XML services is Lutris, developer of Enhydra, a Java application server for XML processing. It has some neat innovations such as XMLC, a way to "compile" XML into an intermediate binary form for efficient processing. It is also one of the first open source implementations of Java 2's Enterprise Edition services, including Enterprise JavaBeans.
XML Blaster is a messaging-oriented middleware (MOM) suite for XML applications. It uses an XML transport in a publish/subscribe system for exchanging text and binary data. It uses CORBA for network and interprocess communication and supports components written in Java, Perl, Python, and TCL.
Conglomerate, developed by Styx, is a somewhat less ambitious, but still interesting, project for an XML application server more attuned for document management. It includes a nifty GUI editor and provides a proprietary transformation language that can generate HTML, TeX, and other formats.
Oo-wee! GUI!
Last year, I lamented Linux's lag in the area of GUI browsers and editors. While I personally use a 4XSLT script and XEmacs, respectively, for these tasks, I frequently work with clients who want friendlier GUIs for XML editing and ask whether my preferred Linux platform has anything available. Fortunately, there are more choices than ever on our favorite OS; much of the succor comes in the form of Java's cross-platform GUI support.
GEXml is a Java XML editor that allows programmers to use pluggable Java modules to edit their own special tag sets. It uses a standard layout for XML editors: a multi-pane window with a section for the tree view, sections for attributes, and a section for CDATA.
ChannelPoint's Merlot is a Java-based XML editor that emphasizes modeling XML documents around their DTDs, and abstracting the actual XML file from the user. Merlot supports pluggable extension modules for custom DTDs.
Lunatech's Morphon is a DTD-based XML editor and modeling tool. Morphon is similar to the other editors described here, with a couple of nice twists: it allows you to specify cascading stylesheets for the editor appearance and document preview, and it mixes the ubiquitous tree view with a friendly view of the XML document being edited. Morphon is available for Linux, but is not open source.
Hopefully, all of these DTD-based tools will expand to accommodate XML schemas and other validation methods, making life easier for XML-namespace users.
IBM's alphaWorks keeps churning out free (beer) XML tools. One of those, XML Viewer, allows users to view XML documents using (once again) the tree view and specialized panes for element and attribute data. XML Viewer is written in Java. It also lets you link the XML source and DTD to view such items as element and attribute definitions. alphaWorks also offers XSL Editor, a specialized Java-based XML editor for XSLT stylesheets. It incorporates advanced features such as syntax highlighting and an XSLT debugger.
TreeNotes is an XML text editor that uses a series of widgets to allow editing of XML's tree structure, elements and attributes, and character data.
DocZilla is an interesting project: an extension of the Mozilla project for Web-based XML document applications. It promises XML browsing support on par with Internet Explorer's, including an extension and plug-in framework. DocZilla started out very strongly, but seems to have lagged a bit. This may be due in part to Mozilla's increase in XML focus. Mozilla has always supported XML and Cascading Style Sheets (CSS), but now, with projects like Transformiix (an XSLT processor for Mozilla), it is bidding to replace Explorer as king of XML browsers.
There is also KXMLviewer, a KDE XML viewer written in Python -- I'll cover it in more detail when I discuss GNOME and KDE XML support later in this series.
In the hacker spirit
We've looked at lumbering app servers and pretty GUI tools. These are very nice for easing into XML, but we all know that Linux (and Unix) users typically prefer sterner stuff: small, manageable, versatile, no-nonsense packages that can be strung together to do a larger job. Luckily for the desperate hacker, the nuts-and-bolts toolkit is growing just as quickly as the rest of XML space.
A mature entry is LT XML, developed by the University of Edinburgh's Language Technology Group. LT XML is a set of standalone tools and libraries for C programmers using XML. It supports tree-based and stream-oriented processing for a wide variety of application types. The LT XML repertoire would please those who love nothing more than stringing together GNU textutils to produce some neat text transformation. LT XML has the mandatory XML-aware grep, sggrep (the "sg" for SGML), as well as sgsort, sgcount, sgtoken, etc. Python bindings for LT XML should be available by the time you read this.
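The idea behind an XML-aware grep can be sketched in a few lines of Python. This is not sggrep's syntax or implementation; it is an illustration using the standard `xml.etree.ElementTree` module, and the `xml_grep` function and sample catalog are invented for the purpose. Where plain grep matches lines, this matches whole elements selected by a path and filtered by their text content.

```python
import xml.etree.ElementTree as ET


def xml_grep(xml_text, path, needle):
    """Return serialized elements selected by `path` whose text contains `needle`."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iterfind(path):
        # itertext() walks the element and all of its descendants.
        if needle in "".join(element.itertext()):
            # strip() drops the whitespace that trails the element in the source.
            hits.append(ET.tostring(element, encoding="unicode").strip())
    return hits


# Hypothetical sample document, used only for this illustration.
DOC = """<catalog>
  <tool lang="C"><name>Expat</name><note>fast, non-validating</note></tool>
  <tool lang="C"><name>RXP</name><note>validating</note></tool>
</catalog>"""

if __name__ == "__main__":
    for hit in xml_grep(DOC, ".//tool", "non-validating"):
        print(hit)
```

Because the match unit is an element rather than a line, the result is always a well-formed fragment that can be piped into the next tool.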
Speaking of grep, there is also fxgrep, a powerful XML querying tool written in Standard ML, a well-regarded functional programming language whose widely used SML/NJ implementation came out of Bell Labs and Princeton (XML provides a rather fitting problem space for functional languages). fxgrep uses the XML parser fxp, also written in SML. fxgrep supports specialized XML searching and query using its own pattern syntax.
Paul Tchistopolskii makes clear his target user base for Ux: "Ux is Unix, revisited with XML." Ux is a set of small XML components written in Java. (OK, we're leaking some Unix heritage.) The components are designed to be piped together for database storage and extraction, XSLT transformation, query, etc.
Pyxie, by Sean McGrath, is an XML parsing and processing toolkit in Python. Pyxie is highlighted in McGrath's book, XML Processing with Python. Pyxie builds on James Clark's earlier work by focusing on a line-based view of XML instead of the "natural" tokens that emerge from the spec. This can provide a useful optimization, and also occasional complications.
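A line-based view of XML can be illustrated roughly as follows. The sketch assumes a PYX-style notation in which each line starts with a one-character marker: `(` for a start tag, `A` for an attribute, `-` for character data, and `)` for an end tag. It uses the standard `xml.parsers.expat` module rather than Pyxie's own code, and the `to_lines` function is a made-up name.

```python
import xml.parsers.expat


def to_lines(xml_text):
    """Flatten an XML document into a list of PYX-style lines."""
    lines = []

    def start(name, attrs):
        lines.append("(" + name)
        for key, value in attrs.items():
            lines.append("A" + key + " " + value)

    def end(name):
        lines.append(")" + name)

    def chars(data):
        # Escape backslashes and newlines so each event stays on one line.
        lines.append("-" + data.replace("\\", "\\\\").replace("\n", "\\n"))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous character data in one call
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return lines


# Hypothetical sample document, used only for this illustration.
SAMPLE = '<doc><title lang="en">XML on Linux</title></doc>'

if __name__ == "__main__":
    for line in to_lines(SAMPLE):
        print(line)
    # Once flattened, ordinary line filtering is enough to pull out the text.
    print([line[1:] for line in to_lines(SAMPLE) if line.startswith("-")])
```

The appeal is that line-oriented tools can process the result directly; the complication is that document structure now has to be tracked by counting `(` and `)` lines.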
For those looking askance at XML in a TeX environment, alphaWorks might provide a useful introduction. TeXML is a tool that allows you to define an XSLT transform for converting XML files to a specialized vocabulary, the results of which are converted to TeX. Also, thanks to alphaWorks, there is an XML diff as well as a grep. XML Tree Diff differentiates documents based on their DOM tree representation. It's more of a collection of JavaBeans for performing diffs than a standalone application, but it's relatively straightforward to use.
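To show why a tree-based diff differs from a textual one, here is a small sketch that compares two documents node by node. It is not how XML Tree Diff works internally, and it uses Python's `xml.etree.ElementTree` rather than a DOM API; the `diff_trees` function and its report format are assumptions made for illustration. Children are paired by position, so an inserted element shows up as a child-count difference plus mismatches after it.

```python
import xml.etree.ElementTree as ET


def diff_trees(a, b, path=""):
    """Yield descriptions of the differences between two elements, recursively."""
    here = f"{path}/{a.tag}"
    if a.tag != b.tag:
        yield f"{here}: tag differs ({a.tag} vs {b.tag})"
        return
    if a.attrib != b.attrib:
        yield f"{here}: attributes differ"
    if (a.text or "").strip() != (b.text or "").strip():
        yield f"{here}: text differs"
    children_a, children_b = list(a), list(b)
    if len(children_a) != len(children_b):
        yield f"{here}: child count differs ({len(children_a)} vs {len(children_b)})"
    # Pair children by position; extra children on either side are not descended into.
    for child_a, child_b in zip(children_a, children_b):
        yield from diff_trees(child_a, child_b, here)


def diff_documents(old_text, new_text):
    """Parse two XML strings and return the list of differences."""
    return list(diff_trees(ET.fromstring(old_text), ET.fromstring(new_text)))


# Hypothetical sample documents, used only for this illustration.
OLD = "<doc><title>XML</title><p class='a'>hi</p></doc>"
NEW = "<doc><title>XML on Linux</title><p class='b'>hi</p></doc>"

if __name__ == "__main__":
    for line in diff_documents(OLD, NEW):
        print(line)
```

Differences that matter only to a text diff, such as attribute quoting or an empty element written as `<a/>` versus `<a></a>`, disappear because both forms parse to the same tree.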
And there is the aforementioned 4Suite, a set of libraries for Python users to construct their own XML applications using DOM, XPath, XSLT, and other tools. I covered 4XSLT in my last XML article (though the spec and 4XSLT have changed since then), and 4Suite libraries are now standard components in the Python XML distribution.
Conclusion
Hopefully, this tour will help Linux users of all levels find XML resources. In upcoming articles (hopefully not as delayed as this one), I will cover XML and Databases, XML and KDE/GNOME, and more on how to put XML to work in a Linux environment.
By the way, this very article is available in XML form (using the DocBook standard). I've also put up a simplified DocBook XSLT stylesheet that can be used to render this article in HTML. (See Resources for both.) I used the "doc" file extension for DocBook files. I encourage you to use DocBook (O'Reilly & Associates publishes an excellent book on the topic, by Norman Walsh) and the ".doc" extension, chipping at the hegemony of the proprietary Microsoft Word format. Just another small way XML can open up data to the world.
About the author
Uche Ogbuji is a consultant for and cofounder of FourThought, a consulting firm that specializes in custom software development for enterprise applications, particularly using XML to provide Web-based integration platforms for small and medium-size businesses.
Resources