Sunday, February 14, 2010

Regular confusion

Original post date: Sun Nov 19 12:00:00 2006
I have spent the last couple of days adding syntax highlighting to my blog. For this I used the jEdit syntax package that I found on SourceForge. I had to change it a bit because it was written with use inside an editor in mind, but all in all it is not bad.
Only the XML highlighting didn't really convince me: it would just color everything blue except for the comments. So I thought I'd take a stab at parsing the attributes so keys and values would get their own colors.
But looking at the code, I realized it would take some work to do it the way they did, so I decided to just add a regular expression at the right point.
Thing is, regular expressions are great but when they finally result in something like this, I have to wonder "WTF am I doing???":

(\s*)(?:([^=\s]*)(?:(?:(\s*?=\s*?)([^\s"]+))|(?:(\s*?=\s*?)(".*?")))?)
  
And this is without the escaping that is necessary when putting this in a Java string! Luckily it's only a couple of backslashes, but it can soon get very messy. Oh, and what it does is figure out the attributes of an HTML or XML element. So if you have something like this:

<input type= checkbox name = "checkme" selected>

the regular expression will cut it up into the following tokens:
  • " "
  • "type"
  • "= "
  • "checkbox"
  • " "
  • "name"
  • " = "
  • "\"checkme\""
  • " "
  • "selected"
Of course not all of this is legal in XML (unquoted values and valueless attributes like "selected" are only allowed in HTML), but the expression supports the more lax of the two formats.
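
For the curious, here is a minimal sketch of how the expression could be used from Java, mostly to show what the escaping ends up looking like. The class and method names (AttributeTokenizer, tokenize) are made up for this example and are not part of the jEdit syntax package. It also assumes the input is only the text between the element name and the closing ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class AttributeTokenizer {

    // The expression from above, escaped for a Java string literal:
    // every backslash is doubled and every double quote gets a backslash.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "(\\s*)(?:([^=\\s]*)(?:(?:(\\s*?=\\s*?)([^\\s\"]+))|(?:(\\s*?=\\s*?)(\".*?\")))?)");

    // Assumes 'attributes' is the text between the element name and the '>'.
    public static List<String> tokenize(String attributes) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ATTRIBUTE.matcher(attributes);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String token = m.group(g);
                // Groups that did not take part in the match are null, and
                // the expression can also match the empty string (for
                // example at the end of the input), so skip both cases.
                if (token != null && !token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String attributes = " type= checkbox name = \"checkme\" selected";
        for (String token : tokenize(attributes)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Running main on the attribute text of the input element above should print the same ten tokens as in the list, one per line. In a real highlighter you would look at which group matched (2 for the key, 4 or 6 for the value) to pick the color, instead of flattening everything into one list.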
