Contents
DownloadDownload the source (*.java), class and jar files packaged as one zip file here. IntroductionThe XML Content Model Validator (CMV) is a standalone module I wrote several years as part of a larger Java based XML framework developed by a company I used to work for. This company having gone bust in the meantime, I thought it would be a waste for this code to disappear too, so I decided to make it public hoping somebody, somewhere, may find a use for it. For those new to some of these concepts, here are the very basic facts - DTDs (Document Type Definition) and XML Schemas (XSD) define a number of constraints that an XML or XSD document must follow to be considered valid relative to that DTD or Schema. An XML or XSD document has a hierarchical structure formed of elements, and each element may contain sub-elements, as well as have a number of attributes. The content model is a description of the structure of these elements - what and how many sub-elements or attributes they can have, and in which order these sub-elements must appear (the order is not relevant for attributes), . For a detailed description of XML Schema content models see http://www.w3.org/TR/xmlschema-0/ and http://www.w3.org/TR/xmlschema-1/ . What's relevant in this context though is that content models are similar in concept and can be described in a regular expression like language. In fact "content model" and "regular expression" will be used interchangeably throughout this document. The content model regular expressions have specific characteristics compared to ordinary regular expressions. For example, they must be deterministic (a short explanation of what that means is at the end of this document). Also, they do not need all the features that a general purpose regular expression usually provides. This module was developed with these factors in mind, but if necessary, it can be extended to support more features. The CMV was written in Java but could be easily ported to other languages, due to its modular design. StructureThe module contains the following main logical components:
Here is how it works at a conceptual level We have 2 entities: the content model (regular expression) and the actual instance structure that needs to be matched against the content model. The first step is to generate the state machine (which is in fact a DFA - a deterministic finite state automaton) corresponding to the regular expression .
This state machine thus generated is then "run" taking a string representing the instance structure. This step will return a matches/doesn't match result. In the case of a "no match" result, more information can be extracted from the state machine to determine what exactly caused the failure. If the regular expression is complex, the state machine generation can take longer, but since a state machine must be created only once and then run repeatedly, once constructed it can be cached or even serialized (feature not implemented) for faster processing later on. The validation part is very efficient, as generally are all state machine based algorithms. Packages and source codeHere is a complete list of packages and files with links to the source code: com.tradery.collections - various collection classes, mostly encapsulation of more primitive types for better type checking.
com.tradery.contentmodel
com.tradery.cmdline (see note below)
com.tradery.contract (see note below)
Note - the CmdLine* and Contrct classes were written by another developer and since I haven't been able to contact him to ask for permission to publish the source code, they are only provided as class files. UsageUsing the classes is relatively simple - see see the ContentModel.java file, the "main' function for a detailed example of how to use them. To use the module, the content model must be reduced to an expression using several operators:
Note: general purpose regular expression usually don't have an explicit operator AND - simple concatenation achieves the same result. In our case concatenation is used to represent whole words, and the "," (comma) operator is used as AND. Words in the instance string (the one that must be validated) must be separated by "," (comma). Let's start with a few examples of valid expressions that can be used to describe unambiguous content models and instances that match or don't match these expressions. See the command line usage below on how to actually run these examples.
Command line usageThe ContentModel.main is the entry point when using the module from the command line. Be sure to set the correct environment (classpath for instance) before running it. The command line accepts the following arguments: java com.tradery.contentmodel.ContentModel [-?] [-help] [-s] [-e] [-t] [-r] "regular_expression" "string_validate_1" "string_to_validate_2"...
Here are a few typical basic commands basic command line - one regular expresson and one string to validate java com.tradery.contentmodel.ContentModel "(aa|bb),cc" "aa,cc" java com.tradery.contentmodel.ContentModel -s "(aa|bb),cc" "aa,cc" java com.tradery.contentmodel.ContentModel -a "(aa|bb),cc" "aa,bb,cc" "aa,cc" aa,dd" Ambiguous vs. unambiguous regular expressions.The implementation of this module is based on a paper titled "Unambiguous Regular Expressions and SGML Document Grammars" written by Anne Brüggemann-Klein and Derick Wood. That is a very dry read and probably not very interesting for most, but I could provide a link to those interested. By design XML only accepts unambiguous content models, and this implementation will also accept only unambiguous regular expressions as input. An exception will be thrown if a content model is ambiguous. An ambiguous regular expression will generate a state machine with at lest one state with 2 ore more exit transitions triggered by the same input symbol (or word). In that case it is obviously not possible to determine the valid transition (without looking ahead in the string). General purpose regular expressions accept ambiguous expressions as they usually look ahead in case of ambiguity to determine the transition that matches. The component is available for download as a zip file for the source code and Visual Slick Edit project, as well as a jar file containing the generated class files. FeedbackPlease let me know if you have found this article and associated code useful, if you are using it in a project or have modified it for other purposes, or if you have any feedback for me - info@amichel.com . Copyright noticeThis is the copyright notice that should appear in all copies of the software and its documentation. Other than that, feel free to use it as you like. * Copyright (c) 2007
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||