AMichel.com Tools for Traders Home | Forums | About

Tradery.com - free advanced trading system development
Now with Multi-System capability - no coding required

Yahoo & Google Quotes Downloader 2.6 - 100% FREE

XML content model validator and regular expression matcher

XML Content Model Validator

Contents

 

Credits

The implementation of this module is based on a couple of papers Unambiguity of Extended Regular Expressions in SGML Grammars and especially Regular Expressions into Finite Automata written by Anne Brüggemann-Klein

Download

Download the source (*.java), class and jar files packaged as one zip file here.

Introduction

The XML Content Model Validator (CMV) is a standalone module I wrote several years ago as part of a larger Java based XML framework developed by a company I used to work for. This company having gone bust in the meantime, I thought it would be a waste for this code to disappear too, so I decided to make it public hoping somebody, somewhere, might find a use for it.

For those new to some of these concepts, here are the very basic facts - DTDs (Document Type Definition) and XML Schemas (XSD) define a number of constraints that an XML or XSD document must follow to be considered valid relative to that DTD or Schema.

An XML or XSD document has a hierarchical structure formed of elements, and each element may contain sub-elements, as well as have a number of attributes. The content model is a description of the structure of these elements - what and how many sub-elements or attributes they can have, and in which order these sub-elements must appear (the order is not relevant for attributes), . For a detailed description of XML Schema content models see http://www.w3.org/TR/xmlschema-0/ and http://www.w3.org/TR/xmlschema-1/ . What's relevant in this context though is that content models are similar in concept and can be described in a regular expression like language. In fact "content model" and "regular expression" will be used interchangeably throughout this document.

The content model regular expressions have specific characteristics compared to ordinary regular expressions. For example, they must be deterministic (a short explanation of what that means is at the end of this document). Also, they do not need all the features that a general purpose regular expression usually provide. This module was developed with these factors in mind, but it could be extended to support more features if needed.

The CMV was written in Java but could be easily ported to other languages, due to its modular design.

Structure

The module contains the following main logical components:

  • a regular expression parser
  • syntax tree
  • compiler
  • state machine

Here is how it works from a conceptual standpoint

We have 2 entities: the content model (regular expression) and the actual instance structure that needs to be matched against the content model.

The first step is to generate the state machine (which is in fact a DFA - a deterministic finite state automaton) corresponding to the regular expression .

  1. the regular expression is passed to the CMV
  2. the CMV parses it and generates a syntax tree which is hierarchical representation of the regular expression
  3. the syntax tree is compiled into the state machine

This state machine thus generated is then "run" taking a string representing the instance structure. This step will return a matches/doesn't match result. In the case of a "no match" result, more information can be extracted from the state machine to determine what exactly caused the failure.

If the regular expression is complex, the state machine generation can take longer, but since a state machine has to be created just once and can then be reused, once constructed it can be cached or even serialized (feature not implemented) for faster processing later on.

The validation part is very efficient, as generally all state machine based algorithms are.

Packages and source code

Here is a complete list of packages and files with links to the source code:

com.tradery.collections - various collection classes, mostly encapsulation of more primitive types for better type checking.

com.tradery.contentmodel

com.tradery.cmdline (see note below)

  • CmdLineApp.java
  • CmdLineAppBundle.java
  • CmdLineAppConstants.java

com.tradery.contract (see note below)

  • Contract.java

Note - the CmdLine* and Contrct classes were written by another developer and since I haven't been able to contact him to ask for permission to publish the source code, they are only provided as class files.

Usage

Using the classes is relatively simple - see the ContentModel.java file, the "main' function for a detailed example of how to use them.

To use the module, the content model must be reduced to an expression using several operators:

Operator Description
| or
, and
? optional (0 or 1 occurrences)
+ one or more occurrences
* zero or more occurrences
[m,n] between m and n occurrences
<empty> not an operator, used to represent the empty element
() grouping of sub-expressions

Note: general purpose regular expression usually don't have an explicit operator AND - simple concatenation achieves the same result. In our case concatenation is used to represent whole words, and the "," (comma) operator is used as AND.

Words in the instance string (the one that must be validated) must be separated by "," (comma).

Let's start with a few examples of valid expressions that can be used to describe unambiguous content models and instances that match or don't match these expressions. See the command line usage below on how to actually run these examples.

Expression Match No match
adrian? adrian bdrian
(adrian|bdrian)+ adrian,adrian,bdrian,adrian cdrian
adrian?,bdrian bdrian
adrian,bdrian
adrian
(bdrian),(bdrian*,adrian)* bdrian
bdrian,bdrian,adrian
adrian
bdrian,bdrian
(aa,bb,ee,ff)*|(cc,dd)|(gg,hh) aa,bb,ee,ff
aa,bb,ee,ff,aa,bb,ee,ff
cc,dd
gg,hh
aa,bb,ee,ff,cc,dd
cc,dd,gg,hh
(aa[2,3],bb[2,3])[5,6]    
a?,b?,c?    
(a+,b+,d+)?,c    
((a?,b?,h,i)|(c*,d?,k)+|e|(f,g))    
(a*,b?,c)    
c,d,e*    
(a|b|c)[2,4]    
a[2,4],(a|b)[1,1],(a,b,c?)*    
(a|b)[2,3]    


 Here are a few examples of ambiguous regular expressions which will trigger runtime exceptions:

a*|(a,b)
(a,b)|(a,c)
(a,b)*|(a,c)
a,(b,a)*,(b|<empty>)
a|(a,b)
(a|b)*,a,b,b

Command line usage

The ContentModel.main is the entry point when using the module from the command line. Be sure to set the correct environment (classpath for instance) before running it.

The command line accepts the following arguments:

java com.tradery.contentmodel.ContentModel [-?] [-help] [-s] [-e] [-t] [-r] "regular_expression" "string_validate_1" "string_to_validate_2"...

argument description  
-help, -? displays the help message  
-s dumps the state machine (nodes and transitions)  
-e dumps the content model expression  
-t dumps the syntax tree  
-a dump all (equivalent to -s -e -t)  
-r uses the reverse polish notation for the regular expression, where character "^" represents the "pop" operation  

Here are a few typical basic commands

basic command line - one regular expresson and one string to validate

java com.tradery.contentmodel.ContentModel "(aa|bb),cc" "aa,cc"

java com.tradery.contentmodel.ContentModel -s "(aa|bb),cc" "aa,cc"

java com.tradery.contentmodel.ContentModel -a "(aa|bb),cc" "aa,bb,cc" "aa,cc" aa,dd"

Ambiguous vs. unambiguous regular expressions.

By design XML only accepts unambiguous content models, and this implementation will also accept only unambiguous regular expressions as input. An exception will be thrown if a content model is ambiguous. An ambiguous regular expression will generate a state machine with at lest one state with 2 ore more exit transitions triggered by the same input symbol (or word). In that case it is obviously not possible to determine the valid transition (without looking ahead in the string).

General purpose regular expressions accept ambiguous expressions as they usually look ahead in case of ambiguity to determine the transition that matches.

The component is available for download as a zip file for the source code and Visual Slick Edit project, as well as a jar file containing the generated class files.

Feedback

Please let me know if you have found this article and associated code useful, if you are using it in a project or have modified it for other purposes, or if you have any feedback for me - info@amichel.com .

Copyright notice

This is the copyright notice that should appear in all copies of the software and its documentation. Other than that, feel free to use it as you like.

* Copyright (c) 2007
* Adrian Michel
* http://www.tradery.com
*
* Permission to use, copy, modify, distribute and sell this software
* and its documentation for any purpose is hereby granted without fee,
* provided that the above copyright notice appear in all copies and
* that both that copyright notice and this permission notice appear
* in supporting documentation. Adrian Michel makes no
* representations about the suitability of this software for any
* purpose. It is provided "as is" without express or implied warranty.
*/
 

Policies | Guidelines | Contact
Copyright © 2003, 2007 Tradery.com