Chapter 5. Tutorial

5.1. A Simple CSV Example

CSV stands for "comma-separated values". It is a common way to group fields into records: the fields are separated by commas and the records are separated by newlines.

For this example, we shall use the following text file:


one,two,three,four,five
six,seven,eight,nine,ten
eleven,twelve,thirteen,fourteen,fifteen
sixteen,seventeen,eighteen,nineteen,twenty

To turn this into XML, we need to iterate over each line and then split each line on the commas.

The iteration is rather simple, so let's flesh out a Regular Expression Stylesheet that uses "csv-data" as the root element, iterates over the lines, and prints out each line wrapped in a "line" tag:


<?xml version="1.0"?>
<csv-data xmlns:re="http://regexxmlreader.sourceforge.net/1.0">
  <re:for-each regex="(?m)^.*$">
    <line>
      <re:match-string />
    </line>
  </re:for-each>
</csv-data>

One of the first things to note is the (?m) at the beginning of the regex in the for-each directive. This tells the regular-expression compiler within Java to use multiline mode, in which ^ and $ match at the beginning and end of each line rather than only at the beginning and end of the whole input. Thus, with this regular expression, we match from the beginning of each line to the end of that same line, and for each match found the children of the for-each directive are applied.
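To see the effect of (?m) outside the stylesheet, the same pattern can be tried directly against java.util.regex. The following is a minimal sketch and not part of the processor itself; the class name MultilineDemo and the helper matchLines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineDemo {
    // Hypothetical helper: collects every match of the regex in the input.
    static List<String> matchLines(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "one,two,three,four,five\nsix,seven,eight,nine,ten";

        // With (?m), ^ and $ anchor at every line, so each line is one match.
        for (String line : matchLines("(?m)^.*$", input)) {
            System.out.println("<line>" + line + "</line>");
        }

        // Without (?m), ^ and $ only anchor at the ends of the whole input,
        // and "." does not cross a newline, so this two-line input has no match.
        System.out.println(matchLines("^.*$", input).size());
    }
}
```

Dropping the (?m) flag is an easy mistake to make: on input that spans several lines the pattern then matches nothing at all, and the for-each directive would have nothing to iterate over.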

Let's take a look at the output by invoking the command line processor like: java net.sourceforge.regexxmlreader.Process -in tutorial-example-01.txt -rxl tutorial-example-01.rxl


<?xml version="1.0" encoding="us-ascii"?>
<csv-data>
  <line>one,two,three,four,five</line>
  <line>six,seven,eight,nine,ten</line>
  <line>eleven,twelve,thirteen,fourteen,fifteen</line>
  <line>sixteen,seventeen,eighteen,nineteen,twenty</line>
</csv-data>

Now we are close to what we want, but not quite there: we also need to break up each line on the commas. There are a few ways to do this:

  1. We can split the text up using the split directive and access each part with the group directive like so:

    
      <re:split split=",">
        <re:group>
          <!-- process the first matching part -->
          <first>
            <re:match-string/>
          </first>
        </re:group>
        <re:group>
          <!-- process the second matching part -->
          <second>
            <re:match-string/>
          </second>
        </re:group>
      </re:split>
    

  2. We can split the text up iterating over each part with the for-each directive:

    
      <re:for-each split=",">
        <part>
          <re:match-string />
        </part>
      </re:for-each>
    

  3. We can use grouping within the regular expression of a match attribute within an external element:

    
      <line re:match="^([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$">
        <re:group>
          <!-- process the first matching part -->
          <first>
            <re:match-string/>
          </first>
        </re:group>
        <re:group>
          <!-- process the second matching part -->
          <second>
            <re:match-string/>
          </second>
        </re:group>
      </line>
    

  4. We can do essentially the same thing using the match directive:

    
      <re:match regex="^([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$">
        <line>
          <re:group>
            <!-- process the first matching part -->
            <first>
              <re:match-string/>
            </first>
          </re:group>
          <re:group>
            <!-- process the second matching part -->
            <second>
              <re:match-string/>
            </second>
          </re:group>
        </line>
      </re:match>
    

In fact, there may be numerous other ways to do this, but these are all we shall explore for now.
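The difference between splitting on the comma and capturing each field with a group can also be tried in plain Java. This is only a sketch of the two regex techniques, not of how the processor implements its directives; the class FieldDemo and the methods bySplit and byGroups are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldDemo {
    // Splitting: the regex describes the separator, not the fields.
    static List<String> bySplit(String line) {
        return Arrays.asList(Pattern.compile(",").split(line));
    }

    // Grouping: the regex describes the whole line, one group per field.
    static List<String> byGroups(String line) {
        Pattern p = Pattern.compile("^([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$");
        Matcher m = p.matcher(line);
        List<String> fields = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                fields.add(m.group(i));
            }
        }
        // An empty list means the line did not have exactly five non-empty fields.
        return fields;
    }

    public static void main(String[] args) {
        String line = "one,two,three,four,five";
        System.out.println(bySplit(line));
        System.out.println(byGroups(line));
        // A line with fewer fields still splits, but the grouped pattern rejects it.
        System.out.println(bySplit("one,two"));
        System.out.println(byGroups("one,two"));
    }
}
```

The grouped pattern is stricter: it only matches lines with exactly five non-empty fields. Splitting accepts any number of fields, which makes it the more forgiving choice when the column count varies.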

Now, for the final bit of this part of the tutorial, we shall change the CSV file into XML using the split directive and also place an attribute on the containing outgoing element. Only the first three parts of each line are handled here. Here is the stylesheet:


<?xml version="1.0"?>
<csv-data xmlns:re="http://regexxmlreader.sourceforge.net/1.0">
  <re:for-each regex="(?m)^.*$">
    <line>
      <re:split regex=",">
        <re:group>
          <element-one>
            <re:match-string />
          </element-one>
        </re:group>
        <re:group>
          <re:attribute name="element-two">
            <re:match-string />
          </re:attribute>
        </re:group>
        <re:group>
          <element-three>
            <re:match-string />
          </element-three>
        </re:group>
      </re:split>
    </line>
  </re:for-each>
</csv-data>

And here is the result:


<?xml version="1.0" encoding="us-ascii"?>
<csv-data>
  <line element-two="two">
    <element-one>one</element-one>
    <element-three>three</element-three>
  </line>
  <line element-two="seven">
    <element-one>six</element-one>
    <element-three>eight</element-three>
  </line>
  <line element-two="twelve">
    <element-one>eleven</element-one>
    <element-three>thirteen</element-three>
  </line>
  <line element-two="seventeen">
    <element-one>sixteen</element-one>
    <element-three>eighteen</element-three>
  </line>
</csv-data>
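
For comparison, the same transformation can be written by hand in plain Java. This is a rough sketch under stated assumptions, not the processor's implementation: the class CsvToXmlSketch and its convert method are hypothetical, the XML declaration is omitted, the input is assumed to contain no characters that need XML escaping, and lines with fewer than three fields are simply skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvToXmlSketch {
    // Hypothetical helper: builds the <csv-data> document as a string.
    static String convert(String input) {
        StringBuilder out = new StringBuilder("<csv-data>\n");
        // Same line-by-line iteration as the for-each directive.
        Matcher lines = Pattern.compile("(?m)^.*$").matcher(input);
        while (lines.find()) {
            // Same comma split as the split directive.
            String[] parts = lines.group().split(",");
            if (parts.length < 3) {
                continue; // assumption of this sketch: ignore short lines
            }
            // The second part has to be written first, because it becomes
            // an attribute of the containing <line> element.
            out.append("  <line element-two=\"").append(parts[1]).append("\">\n");
            out.append("    <element-one>").append(parts[0]).append("</element-one>\n");
            out.append("    <element-three>").append(parts[2]).append("</element-three>\n");
            out.append("  </line>\n");
        }
        return out.append("</csv-data>\n").toString();
    }

    public static void main(String[] args) {
        String input = "one,two,three,four,five\n"
                + "six,seven,eight,nine,ten\n"
                + "eleven,twelve,thirteen,fourteen,fifteen\n"
                + "sixteen,seventeen,eighteen,nineteen,twenty";
        System.out.print(convert(input));
    }
}
```

Note that the hand-written version has to reorder the output itself so that the attribute is emitted before the child elements, whereas the stylesheet simply declares the attribute in the second group.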