Chapter 2. The RegexXMLReader Stylesheet

2.1. Introduction

To parse an incoming text file, RegexXMLReader needs instructions describing how to process it. These instructions are supplied as an XML document that I call a Regular Expression Stylesheet, or simply "the Stylesheet", although there are major differences between this type of stylesheet and an XSLT stylesheet, which it is somewhat patterned after.

The stylesheet is made up of directives, which are essentially commands in the required RegexXMLReader namespace, "http://regexxmlreader.sourceforge.net/1.0". Using these commands, the incoming text is transformed into a series of SAX events, thus turning the arbitrary text file into an XML document.
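To make the idea of "text in, SAX events out" concrete, here is a minimal sketch in Python. It is not RegexXMLReader code and does not use a stylesheet at all; it only illustrates how plain text can be turned into XML by firing SAX events. The function name text_to_xml and the element names "words" and "word" are made up for this illustration.

```python
import io
import re
from xml.sax.saxutils import XMLGenerator


def text_to_xml(text: str) -> str:
    """Split text on whitespace and report each piece as SAX events.

    Hypothetical illustration only: the element names "words" and "word"
    are invented here and are not part of RegexXMLReader.
    """
    out = io.StringIO()
    handler = XMLGenerator(out, encoding="utf-8")  # serializes SAX events as XML

    handler.startDocument()
    handler.startElement("words", {})
    for piece in re.split(r"\s+", text.strip()):
        if not piece:
            continue  # skip the empty string produced by empty input
        handler.startElement("word", {})
        handler.characters(piece)
        handler.endElement("word")
    handler.endElement("words")
    handler.endDocument()
    return out.getvalue()


if __name__ == "__main__":
    print(text_to_xml("abcdefg hijklmno pqrstuvwxyz"))
```

Any SAX consumer could stand in for the XMLGenerator here; serializing to a string is simply the easiest way to see the events.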

One concept that must be made clear is what I refer to in the documentation as contextual text. At the beginning of processing, the contextual text is the entire incoming text file: just one large stream of data. As processing proceeds, this stream of text usually narrows to the more relevant parts; that is to say, the context changes for each inner child in the RegexStylesheet. The contextual text normally becomes smaller and smaller, depending on the directives in effect on its parent.

For example, consider the replace directive. This directive modifies the contextual text, and its children are then processed against the modified version. At the same level as replace, there is no change in the contextual text; the change is apparent only to the directive's children:


 <!--
     assume that the contextual text at this
     level is: "abcdefg hijklmno pqrstuvwxyz"
  -->
 <re:replace regex="^." with="X" xmlns:re="http://regexxmlreader.sourceforge.net/1.0">
   <!-- 
        the contextual text is now:
        "Xbcdefg hijklmno pqrstuvwxyz"
    -->
   <re:for-each split=" ">
      <!--
           In this case the contextual text will change
           for each iteration through the split parts,
           namely:
                 1) Xbcdefg
                 2) hijklmno
                 3) pqrstuvwxyz
       -->
   </re:for-each>
   <!--
        The contextual text is what it was prior to
        the previous directive:
        "Xbcdefg hijklmno pqrstuvwxyz"
    -->
 </re:replace>
 <!--
      And now it is the same way it was at the
      beginning: "abcdefg hijklmno pqrstuvwxyz"
  -->
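
The scoping rule shown in the example above can also be modelled outside of RegexXMLReader. The following Python sketch is only a toy model of that rule, not the actual implementation: the helpers record, replace, for_each and run are hypothetical names chosen for this illustration, and Python's re module may not behave identically to the regular expression engine that RegexXMLReader uses.

```python
import re


def record(trace, label):
    """Hypothetical leaf directive: note the contextual text it was given."""
    def directive(context):
        trace.append((label, context))
    return directive


def replace(regex, replacement, children):
    """Toy model of replace: only the children see the modified text."""
    def directive(context):
        inner = re.sub(regex, replacement, context)
        for child in children:
            child(inner)
        # 'context' itself is never reassigned, so siblings are unaffected.
    return directive


def for_each(split, children):
    """Toy model of for-each: children run once per split part."""
    def directive(context):
        for part in context.split(split):
            for child in children:
                child(part)
    return directive


def run(directives, context):
    """Process sibling directives against the same contextual text."""
    for directive in directives:
        directive(context)


def demo():
    trace = []
    stylesheet = [
        replace("^.", "X", [
            for_each(" ", [record(trace, "part")]),
            record(trace, "after for-each"),
        ]),
        record(trace, "after replace"),
    ]
    run(stylesheet, "abcdefg hijklmno pqrstuvwxyz")
    return trace


if __name__ == "__main__":
    for label, text in demo():
        print(f"{label}: {text}")
```

Because each directive passes a new string down to its children rather than changing the one it received, the contextual text "restores itself" as soon as processing leaves the directive, which is the behaviour described in the comments of the stylesheet fragment.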