Managing your content with XML

Content management systems come in all shapes and sizes. Rick Grehan looks at two extremes

Content is the lifeblood of any organisation that relies on information. If documents are lost in filing cabinets or hidden away on hard drives, the knowledge they carry is buried. But when content is organised and searchable that information lives on. It does useful work over and over again as it is referenced, consulted and combined with other information.

The two CMSes (content management systems) in this review create organised and searchable repositories of digital documents. At first glance, both products appear similar and, fundamentally, they are. Both, for example, make extensive use of XML. Closer inspection, however, reveals that each is designed for somewhat different uses of content.

Daisy is an open source CMS whose strength is its flexible organisation and navigation capabilities. Ixiasoft’s TeXtML is a commercial CMS that takes a more straightforward approach to content organisation, but excels at text search.

Daisy 1.3

Daisy is an exceptionally modular system; its designers purposely decoupled its internal organs for greater flexibility. Among those pieces is the back-end database, MySQL. There’s the repository server, which manages the storage and retrieval of documents. The OpenJMS Java messaging service informs applications of updates to the repository. Finally, the Daisy wiki front-end provides a dynamic, web-based view of the repository and access to its documents.

The database back-end and the repository server are Daisy’s core components. The OpenJMS service is more ancillary: it passes status events to apps that request notification of changes in the repository’s content or structure.

Strictly speaking, Daisy’s wiki component is merely an example of a front-end for the repository. Daisy’s creators describe Daisy as a “content management framework,” precisely because it could be used to support other front-ends.

Mind you, Daisywiki is not simply a sample application; it’s a fully functioning wiki, complete with a built-in editor, versioning, search pages, PDF publishing and more.

I installed Daisy on my test system and, with the exception of a problem with Internet Explorer 5.0, I had it running within half an hour. The installation constructs a small wiki-based website populated with an initial “welcome” page. The installation includes all the tools for adding new documents, editing existing ones, adding and managing users and so on. Because all the site’s pages are built from documents in a Daisy repository, Daisywiki is an excellent mechanism for exploring how Daisy works.

The internal structure of Daisy’s repository is unusual in that there is none. There are no folders or sub-folders, no collections — just a container in which documents float about like the meat and potatoes in a digital stew. All is not anarchy, though.

First, documents themselves are structured, being composed of parts and fields. A part carries binary data of a specific mime type (RTF information or image data, for example), and a field carries simple data (such as a numeric value, a date or a string). The structure and allowed content of a document’s parts and fields is defined by the document’s type (which is specified in yet another document). So all documents within a repository must adhere to one of the defined document types. You can define as many document types as your imagination permits.
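The parts-and-fields model can be pictured with a short Python sketch. This is not Daisy’s own schema format or API: the class names, the “Picture” type and the validate helper are all hypothetical, made up only to illustrate how a document type might constrain a document’s parts and fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DocumentType:
    """Hypothetical stand-in for a document type definition."""
    name: str
    part_mime_types: Dict[str, str]  # part name -> required MIME type
    field_types: Dict[str, type]     # field name -> required Python type


@dataclass
class Part:
    """A part carries binary data of a specific MIME type."""
    mime_type: str
    data: bytes


@dataclass
class Document:
    """A document is composed of parts (binary) and fields (simple values)."""
    type_name: str
    parts: Dict[str, Part] = field(default_factory=dict)
    fields: Dict[str, object] = field(default_factory=dict)


def validate(doc: Document, doc_type: DocumentType) -> List[str]:
    """Return a list of ways in which doc departs from doc_type."""
    errors = []
    if doc.type_name != doc_type.name:
        errors.append(f"wrong type: {doc.type_name}")
    for name, part in doc.parts.items():
        expected = doc_type.part_mime_types.get(name)
        if expected is None:
            errors.append(f"unknown part: {name}")
        elif part.mime_type != expected:
            errors.append(f"part {name}: expected {expected}, got {part.mime_type}")
    for name, value in doc.fields.items():
        expected_type = doc_type.field_types.get(name)
        if expected_type is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, expected_type):
            errors.append(
                f"field {name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


# Hypothetical document type: one image part and one string field.
PICTURE_TYPE = DocumentType(
    name="Picture",
    part_mime_types={"ImageData": "image/jpeg"},
    field_types={"PictureContent": str},
)

photo = Document(
    type_name="Picture",
    parts={"ImageData": Part("image/jpeg", b"\xff\xd8")},
    fields={"PictureContent": "boat"},
)
bad_photo = Document(type_name="Picture", fields={"PictureContent": 42})
```

Here validate(photo, PICTURE_TYPE) reports no problems, while bad_photo is rejected because its PictureContent field holds a number rather than a string.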

Second, a repository includes one or more “navigation documents,” an XML-based specification that defines how users navigate through the repository. There can be more than one navigation document in a repository, effectively allowing you to define multiple repository views. Behind the scenes, navigation documents work their magic by performing a query on the repository. So, for example, one navigation document might arrange the contents by modification date, another might do so by title.

The Daisy API is a combination of HTTP and XML. In other words, you send commands to the Daisy repository via HTTP and those commands are in the form of XML embedded in the HTTP request. Hence, you can control Daisy through just about any scripting language that can “talk” HTTP; you can even handcraft commands by typing in the proper URL. If, however, you’d rather work with a more robust API to the repository, Daisy provides a Java wrapper around the HTTP/XML interface.
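To give a feel for the XML-over-HTTP style, here is a minimal Python sketch that builds (but does not send) such a request using only the standard library. The host, the path, and the XML element and attribute names are placeholders invented for illustration; the article does not document Daisy’s actual URLs or payload schema.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_request(base_url, path, payload_tag, attributes):
    """Wrap an XML command in an HTTP POST request object.

    Nothing is sent over the network; the caller decides whether to
    pass the result to urllib.request.urlopen().
    """
    root = ET.Element(payload_tag, attributes)
    body = ET.tostring(root, encoding="utf-8")  # serialise the XML to bytes
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# All of these values are hypothetical placeholders, not Daisy's real API.
request = build_request(
    base_url="http://localhost:8080/",
    path="/hypothetical-repository/document",
    payload_tag="document",
    attributes={"name": "Harbour at dawn", "type": "Picture"},
)
```

Because the command is nothing more than an HTTP request with an XML body, the same exchange could be produced from any language with an HTTP client, which is the point the design makes.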

The DQL (Daisy Query Language) is obviously derived from SQL. A query is a “select” clause, adorned with modifiers for filtering and ordering the results. Whereas in SQL those filters amount to comparisons on column values, in DQL the comparisons are performed on document fields. So, for example, to search for documents in the repository with a PictureContent field equal to “boat”, you would enter the following Daisy query: “select id, name where #PictureContent = ‘boat’”. This returns the ID number and name of each matching document.
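The query above can be assembled programmatically, and its meaning mimicked against a handful of in-memory documents. The following Python sketch does both. The helper names are hypothetical, the use of straight single quotes is an assumption (the printed query uses typographic ones), and the in-memory filter only imitates what the query means; it is not how Daisy evaluates DQL.

```python
def build_query(columns, field_name, value):
    """Compose a DQL string in the shape of the article's example."""
    if "'" in value:
        # Quote escaping in DQL is not covered here, so refuse rather than guess.
        raise ValueError("values containing quotes are not handled")
    return f"select {', '.join(columns)} where #{field_name} = '{value}'"


def run_in_memory(documents, columns, field_name, value):
    """Imitate the query: filter on a document field, project the columns."""
    return [
        {column: doc[column] for column in columns}
        for doc in documents
        if doc["fields"].get(field_name) == value
    ]


# Hypothetical repository contents.
documents = [
    {"id": 1, "name": "Harbour at dawn", "fields": {"PictureContent": "boat"}},
    {"id": 2, "name": "Office party", "fields": {"PictureContent": "people"}},
    {"id": 3, "name": "Regatta", "fields": {"PictureContent": "boat"}},
]

query = build_query(["id", "name"], "PictureContent", "boat")
matches = run_in_memory(documents, ["id", "name"], "PictureContent", "boat")
```

The filter compares a field value rather than a table column, which is the one real departure from SQL that the example shows.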

Daisy’s eschewing of a repository structure appears, at first glance, to be a severe omission. Further reflection, however, reveals this weakness as a strength. In a typical CMS, a document is placed into a specific collection within the repository, but that implies a redundancy. Someone has used the document’s content to determine which collection to put the document in. If you’ve properly tagged the document, however, and if your repository server can create a view of the repository derived from those tags, then the equivalent of a collection structure can be rendered at display time. And, unlike collection-based repository servers, such a view-based server renders multiple, different views of the same repository. This is exactly what Daisy does and the result is quite impressive.
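The idea of rendering a collection structure at display time can be sketched in a few lines of Python. The documents, metadata keys and render_view function below are hypothetical; they show the general technique of deriving several views from one flat pool of tagged documents, not Daisy’s navigation-document format.

```python
from collections import defaultdict


def render_view(documents, group_by, order_by):
    """Group a flat list of documents into 'folders' computed on the fly."""
    view = defaultdict(list)
    for doc in sorted(documents, key=lambda d: d[order_by]):
        view[doc[group_by]].append(doc["name"])
    return dict(view)


# One flat repository: no folders, only metadata on each document.
repository = [
    {"name": "Harbour at dawn", "topic": "boats", "year": 2004, "modified": "2005-03-01"},
    {"name": "Release notes", "topic": "software", "year": 2005, "modified": "2005-01-15"},
    {"name": "Sailing primer", "topic": "boats", "year": 2005, "modified": "2005-02-10"},
]

# Two different views of the same documents, neither stored anywhere.
by_topic = render_view(repository, group_by="topic", order_by="name")
by_year = render_view(repository, group_by="year", order_by="modified")
```

Nobody had to file “Sailing primer” under either boats or 2005; both placements fall out of its metadata, and a third view costs only another call.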

TeXtML

Ixiasoft’s TeXtML applies the bulk of its energies to the storage, retrieval and management of text, and does so by creating an environment awash in XML.

It’s not much of a stretch to say that TeXtML takes text documents from our universe, maps them into their equivalents in an XML universe and uses the capabilities of that universe to provide search and management functions that would not be available otherwise. This is not to suggest that TeXtML can handle only text documents: it can easily store and retrieve documents with embedded binary data.

TeXtML uses a collections paradigm for organising documents. Collections appear as named folders on TeXtML’s administration console and are navigated using the standard path constructs that anyone familiar with a file system would recognise.

How documents are stored in the repository, though, is a bit complicated. Documents are mapped to XML equivalents — but that is only partly true. On the one hand, documents are stored wholesale in their native format. On the other hand, when a document is placed in the repository it is parsed into a kind of XML doppelganger document that TeXtML uses to build indexes for the document. The TeXtML repository keeps track of the relationship between the original document and its XML shadow. This technique of creating XML shadow documents, while keeping the original available, helps TeXtML significantly with its indexing chores, thus speeding queries.

The parsing is performed by TeXtML’s Universal Converter, which reads some 220-plus document formats. It is an optional component, but without it the only querying you can do is on document metadata such as title, creation date, document type and so on.

TeXtML knows which parts of a given document are to be indexed via an index definition document. There is only one index definition document in the repository and its content is entirely XML. When a new document enters the repository it is dissected by the Universal Converter and the index definition document is consulted to determine which elements are to be indexed. TeXtML creates indexes for full-text content, strings, numeric data, dates and times.
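The interplay between an XML shadow document and an index definition can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the element names, the dictionary standing in for the index definition (the real one is an XML document) and the index_document function. It shows the general technique of indexing only the elements a definition names, not TeXtML’s implementation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical index definition: index name -> (element to index, kind of index).
INDEX_DEFINITION = {
    "body_text": ("body", "fulltext"),
    "title": ("title", "string"),
}


def index_document(doc_id, shadow_xml, definition, indexes):
    """Add one shadow document to the indexes named in the definition."""
    root = ET.fromstring(shadow_xml)
    for index_name, (element_name, kind) in definition.items():
        for element in root.iter(element_name):
            text = "".join(element.itertext()).strip()
            if kind == "fulltext":
                # Full-text index: one entry per word, case-folded.
                for word in text.lower().split():
                    indexes[index_name][word].add(doc_id)
            else:
                # String index: the whole value is the key.
                indexes[index_name][text].add(doc_id)


indexes = defaultdict(lambda: defaultdict(set))

index_document(
    "doc1",
    "<document><title>Boat care</title><author>Ann</author>"
    "<body>Rinse the boat after sailing</body></document>",
    INDEX_DEFINITION,
    indexes,
)
index_document(
    "doc2",
    "<document><title>Engine notes</title><author>Bob</author>"
    "<body>Check the engine before sailing</body></document>",
    INDEX_DEFINITION,
    indexes,
)
```

The author element appears in both shadow documents but in neither index, because the definition never mentions it. A search then becomes a dictionary lookup against the index rather than a scan of the original files.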

TeXtML’s query language is yet another XML variant and is entirely unlike XQuery. The dissimilarity is understandable. TeXtML is primarily intent on performing rapid document content search; less important is the capability to navigate an XML document’s structure using XPath-style expressions, as can happen in XQuery.

TeXtML’s demonstration download comes with a preloaded repository as well as an application that allows the user to experiment with the system’s querying capabilities. The application lets the user enter queries by filling in text boxes, generates the query invisibly, then executes it.

The installation also includes sample apps and queries, and the included programmer’s manual provides a line-by-line explanation of the VBScript programs. This is not to suggest that VBScript is the only programming avenue into TeXtML, which also supports APIs for Java, native .Net, COM and OLEDB. There is also a WebDAV extension but at the time of writing the API did not support some of TeXtML’s advanced features.

Daisy could certainly benefit from a smoother installation. Hopefully, a turnkey version, expected as part of the next release, will eliminate that complaint. Beyond that, the Daisywiki is a joy to play with and is an excellent test-drive of Daisy’s novel stuff-it-all-in-one-bag approach to document storage.

TeXtML is the product for scuba-diving through oceans of text content. It provides safeguards that Daisy doesn’t have, such as the Fault Tolerant Server, which replicates documents and transactions on multiple TeXtML servers.

If hard-core text searching is what you need in your CMS, then by all means give TeXtML a look. Daisy, however, has that powerful attribute that we are seeing more and more in high-quality software: open source. If you want to set up a wiki site in an evening or two, Daisy is very hard to beat.

Daisy 1.3

Outerthought

Cost: Free

Platforms: Requires only JVM 1.4.2 or higher and MySQL version 4.0.20 or version 4.1.7 (or higher).

Bottom line: Daisy’s novel approach to stuffing all documents into one bag and leaving it to metadata and navigation documents to sort out may sound like anarchy but this provides more flexibility than the collections approach. Daisy allows multi-user editing of repository content. The installation takes a bit of work and the documentation is still in progress. However, Daisy is proving itself in live on-the-web use, so the extra effort is worth it.

Ixiasoft TeXtML Server

Ixiasoft

Cost: Starts at around US$10,000

Platforms: Requires Windows 2000/2003 or Windows XP Professional

Bottom line: TeXtML provides a wide array of APIs. Setup is easy and it supports fault tolerance with multiserver fail-over repositories (an optional component).

TeXtML strikes the right balance between turning everything into XML and using XML to enable powerful queries. The CMS excels at text searching. Although the price is steep, TeXtML may well be worth considering for companies that want quick search access to documents in a secure repository.
