Recently, I needed to write a script that would parse an XML file and extract various bits of information. I'm sure there are plenty of excellent XML modules for Perl, but I didn't want to go through the pain of having to find one and install it (along with its tree of dependencies). Besides, I was sure that I was dealing with well-formed XML, and all I was doing was extracting fields, so I didn't need error-checking, XSLT, XInput, and all that fancy stuff.

I just rolled my own XML parser in about 100 lines. It's not fancy. It makes all sorts of assumptions that will cause it to break in a production environment, but I thought I'd show how it's built.

Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

Outer Loop

"<" and ">" are reserved characters and, in the simple XML this parser assumes (no CDATA sections, no comments, no unescaped ">" in text), they appear only at the beginning and end of tags. If you're used to lex-style parsers, you're thinking "Ah, so I should read until I see a '<' or '>'." But in Perl, you can just read the input until you see a ">", and then you know that you've got, at the very least, a tag. So the outline of the parser is:
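A minimal sketch of what such an outer loop might look like. This is not the author's listing: the helper name `for_each_record` and its callback argument are hypothetical, chosen only so that the loop can be shown in isolation.

```perl
use strict;
use warnings;

# Hypothetical skeleton: call $handler once for every ">"-terminated record.
sub for_each_record {
    my ($fh, $handler) = @_;
    local $/ = ">";          # input record separator: read up to and including ">"
    local $_;                # keep the caller's $_ intact
    while (<$fh>) {
        $handler->($_);      # $_ is optional text followed by one complete tag
    }
    return;
}

# Usage: collect the records of a small in-memory document.
my $doc = "<a>hi</a>";
open my $in, '<', \$doc or die "cannot open in-memory file: $!";
my @seen;
for_each_record($in, sub { push @seen, $_[0] });
close $in;
print "$_\n" for @seen;
```

Setting `$/` with `local` keeps the change confined to the loop, so code elsewhere that reads line by line is unaffected.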
Note the use of $/ to set the input record separator. This means that if our input file is

    <?xml version="1.0" ?>
    <address>
      <friend />
      <name>John <nickname>Spike</nickname> Smith</name>
      <streetAddress>123 Maple Ave.</streetAddress>
    </address>

then successive values of $_ will be
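One way to see those values is to run the sample through an in-memory filehandle. The sketch below assumes the sample is laid out one element per line with two-space indentation; a different layout would change the whitespace inside the records but not where they end.

```perl
use strict;
use warnings;

# The sample document; the exact line breaks and indentation are an assumption.
my $xml = <<'XML';
<?xml version="1.0" ?>
<address>
  <friend />
  <name>John <nickname>Spike</nickname> Smith</name>
  <streetAddress>123 Maple Ave.</streetAddress>
</address>
XML

# Read the string through an in-memory filehandle, one ">"-terminated record at a time.
open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
my @records;
{
    local $/ = ">";
    @records = <$fh>;
}
close $fh;

# Print each record in brackets, with newlines made visible as \n.
for my $record (@records) {
    (my $shown = $record) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
```

With this layout the fifth record is `John <nickname>` and the seventh is ` Smith</name>`. The very last record is just the newline that follows `</address>`, so a parser built this way has to tolerate one final record that contains no tag.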
In other words, $_ always ends with a complete XML tag, which might be preceded by other text. So, the first order of business is to separate the tag from the text that precedes it:
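A sketch of one pattern that does this split. The helper name `split_record` is hypothetical, and capturing the tag without its angle brackets is an assumption about what the later steps expect.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into (text, tag-without-brackets).
# Returns an empty list if the record contains no tag at all.
sub split_record {
    my ($record) = @_;
    # [^<]* is the leading text; with /s the dot also matches newlines inside the tag.
    my ($text, $tag) = $record =~ m{\A([^<]*)<(.*)>\z}s
        or return;
    return ($text, $tag);
}

my ($text, $tag) = split_record("John <nickname>");
print "text=[$text] tag=[$tag]\n";
```

Anchoring with `\A` and `\z` relies on each record ending in exactly one `>`, which is what reading with `$/` set to `">"` provides.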
(For those who don't remember, m{pattern} is equivalent to /pattern/.) The "s" modifier is there so that a dot will match anything, including a newline.

Data Representation

Now that we have some content, it would be a good idea to think about how the data in the XML file will be represented. XML is a nested set of elements, each of which has a name, optional attributes, and the contents (the stuff between <foo> and </foo>). The contents can be either text data or other elements.

I chose a rather simple data structure to hold an element's data, an array of the form "(name, attributes, contents…)". Thus, the streetAddress element in the example above would be turned into:

    ( "streetAddress", "", "123 Maple Ave." )

The name element contains three items: the strings "John" and "Smith", and a <nickname> element. It would be represented as:
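A sketch of that structure in Perl. It assumes the nested element is stored as a reference to its own (name, attributes, contents…) array and that surrounding whitespace has been trimmed from the text items; neither detail is confirmed by the text above.

```perl
use strict;
use warnings;

# (name, attributes, contents...) for <streetAddress>
my @street = ( "streetAddress", "", "123 Maple Ave." );

# <name>John <nickname>Spike</nickname> Smith</name>
# Assumption: a child element is held as an array reference among the contents.
my @name = ( "name", "", "John", [ "nickname", "", "Spike" ], "Smith" );

# Walk the contents (everything after the name and attribute slots).
print "element: $name[0]\n";
for my $item (@name[2 .. $#name]) {
    if (ref $item eq 'ARRAY') {
        print "  child element: $item->[0]\n";
    }
    else {
        print "  text: $item\n";
    }
}
```

Because text is a plain string and a child is a reference, `ref` is enough to tell the two kinds of content apart.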
Eventually, we want to store the entire file in a treelike list like this.

Context

When we see an open tag like <foo>, we're going to start parsing it. Anything that we read will go inside that <foo> element until we see the closing tag, at which point we're done with the <foo> tag, and should go back to whichever element we were processing before (the element that contains the <foo> element).

Naturally, this suggests a context stack. In my code, I did a somewhat bad thing. I used two variables with the same name. @context is the context stack (stacks are naturally represented using arrays), and $context is a reference-to-array which refers to whichever element we're currently looking at. At any given moment, $context is a reference to an array in the (name, attributes, contents…) format described above, and @context is a stack of such references-to-arrays, describing the elements inside which the current element is embedded.

That is, at some point, $context will be

    [ "nickname", "" ]

and at that time, @context will be

    ( [ "address", "" ], [ "name", "", "John" ] )

In the section on the outer loop, we'd gotten as far as separating the tag from the text that precedes it. Now we see that the text part should be appended to the array that $context points to (never mind for now how $context was set up; we'll worry about that later):
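A sketch of that step. The helper name `append_text` is hypothetical, and skipping whitespace-only text (the indentation between tags) is an assumption rather than something the article specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: append the text that preceded a tag to the current element.
sub append_text {
    my ($element, $text) = @_;
    # Assumption: runs of pure whitespace between tags are not worth keeping.
    push @{$element}, $text if $text =~ /\S/;
    return $element;
}

my $context = [ "name", "" ];       # element currently being filled in
append_text($context, "John ");     # text read just before <nickname>
append_text($context, "\n  ");      # whitespace only, so it is skipped
print scalar(@{$context}), " items in \$context\n";
```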
Parsing Tags

Now we just need to deal with the text in $tag. There are three types that we need to be concerned with:

    Opening tags: <foo>
    Closing tags: </foo>
    Singleton tags: <foo />
Opening and singleton tags can also have attributes after the tag name:

    <name added="2007-07-19">
    <address format="us-postal" category="personal">

In my script, I didn't need to worry about attributes, so I chose just to store them as raw, unparsed strings. If they matter to you, I suggest representing them as hashes that map each attribute's name to its value.

We can parse a tag using (what else?) a regular expression. This one's complicated enough that it's a good idea to use the "x" flag, which allows us to embed whitespace (including newlines) and comments inside the regular expression:
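A sketch of such a pattern, not the author's original expression. It assumes `$tag` holds the tag without its angle brackets; the helper name `parse_tag` and the order of its return values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split the inside of a tag into its parts.
# Returns (name, attrs, closing, singleton), or an empty list if nothing matches.
sub parse_tag {
    my ($tag) = @_;
    my ($closing, $name, $attrs, $singleton) = $tag =~ m{
        \A
        (/?)         # leading slash marks a closing tag
        \s*
        ([^\s/]+)    # tag name
        \s*
        (.*?)        # raw attribute string, possibly empty
        \s*
        (/?)         # trailing slash marks a singleton tag
        \z
    }sx or return;
    return ($name, $attrs, $closing ne "", $singleton ne "");
}

my ($name, $attrs, $closing, $singleton) =
    parse_tag('address format="us-postal" category="personal"');
print "name=$name attrs=[$attrs]\n";
```

The attribute group is non-greedy so that a trailing slash ends up in the singleton group instead of being swallowed by the attribute string. Note that a declaration such as `<?xml version="1.0" ?>` would come out of this pattern looking like an opening tag named `?xml`, so a real parser would have to skip it first.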
Now $name is the name of the tag, $attrs is the (unparsed) attribute string, and $closing and $singleton are boolean flags that tell us whether this is a closing or singleton tag, respectively. Closing tags are easiest to deal with. We're done parsing the element, and the parsed data is in $context, so all we need to do is append it to its parent and pop the @context stack to return to that element:
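A sketch of that step, using the state from the earlier example at the moment `</nickname>` is read. The helper `close_element` is hypothetical; it takes the stack and the current element as arguments so the step can be shown on its own, whereas the article's script works on @context and $context directly.

```perl
use strict;
use warnings;

# Hypothetical helper: attach the finished element to its parent and
# return the parent, which becomes the new current element.
sub close_element {
    my ($stack, $current) = @_;
    my $parent = pop @{$stack};      # the element that encloses $current
    push @{$parent}, $current;       # the child becomes part of its contents
    return $parent;
}

# State just before </nickname> is processed.
my @context = ( [ "address", "" ], [ "name", "", "John" ] );
my $context = [ "nickname", "", "Spike" ];

$context = close_element(\@context, $context);
print "back in <$context->[0]>, stack depth ", scalar(@context), "\n";
```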
This leaves us with opening tags and singleton tags. Singleton elements like <foo /> are equivalent to <foo></foo>. That is, a singleton has no children. In either case, though, we need to start a new context:

    # Save the old context on the context stack
    push @context, $context;

    # Start a new context
    $context = [ $name, $attrs ];

Of course, if we're looking at a singleton tag, then we already know that it has no contents and should be closed immediately. And since we've already taken care of closing tags, we already know how to close tags:
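A sketch of how the three cases might fit together, with the singleton case reusing the closing logic. The names `close_current` and `handle_tag`, and the `#root` container that collects top-level elements, are hypothetical and not taken from the original script.

```perl
use strict;
use warnings;

my @context;                        # stack of enclosing elements
my $context = [ "#root", "" ];      # hypothetical container for top-level elements

# Finish the current element: attach it to its parent and return to the parent.
sub close_current {
    my $parent = pop @context;
    push @{$parent}, $context;
    $context = $parent;
    return;
}

# Hypothetical dispatcher for a tag that has already been split into its parts.
sub handle_tag {
    my ($name, $attrs, $closing, $singleton) = @_;
    if ($closing) {
        close_current();
        return;
    }
    push @context, $context;          # save the old context on the stack
    $context = [ $name, $attrs ];     # start a new context
    close_current() if $singleton;    # <foo /> has no contents: close it at once
    return;
}

handle_tag("address", "", 0, 0);      # <address>
handle_tag("friend",  "", 0, 1);      # <friend />
handle_tag("address", "", 1, 0);      # </address>
print "top-level element: $context->[2][0]\n";
```

Handling the closing case first means the remaining code can assume it is starting a new element, which is what lets opening and singleton tags share the same two lines.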
Finally, we can tackle the hardest of the three tag types: opening tags. For these, we need to save the old context to the stack and start a new context (which we've already done), and then… actually, that's all we need to do at this stage. We can't add the contents of the element, because we haven't read them from the input file yet.

Remember at the top, when we separated the tag from the text that preceded it, and appended the text to @{$context}? Now we see how $context was set up: We created a new anonymous array when we saw the opening tag, so that later iterations of the loop would have a place to put their text.

And that's pretty much it! You can read the full script, which includes a &dumptree function for printing the parsed tree and a &lookup function for looking up elements.

Lessons Learned

Input records don't have to end with a newline. If there's a more convenient record terminator or separator, use $/ to read the input in a way that makes your life easier.

Perl's regular expressions are powerful. Don't bother trying to read the "<", then the tag name, then the attributes, then the ">", as you would with lex. Just read the whole thing and extract the interesting bits with parenthesized expressions inside a regular expression.

If you're trying to solve a difficult problem, try to break it down into not-quite-so-difficult problems, and break those down into easier subproblems. Take care of the easy parts first. This allows you to make simplifying assumptions (if we know that we're not looking at a closing tag, we know that we need to save the old context and start a new one). Once you've taken care of enough easy bits, you might find that your hard problem has been simplified to the point that you don't need to do anything at all.

Author's bio: Andrew Arensburger has been hacking Unix and Perl for over fifteen years. These days, he splits his time between system administration by day, fun coding by night, and obeying the whims of his feline overlords.
Comments

vtd-xml
You should try vtd-xml

Why?
", fun coding by night, and obeying the whims of his feline overlords."

Not a great advert for Freshmeat!
> I'm sure there are plenty of excellent XML modules for Perl, but I

Use standard XML libraries
This really does undermine the reason for using XML, and I can't think how this could be recommended in any circumstance.

Re: Use standard XML libraries

Re: Use standard XML libraries

lex me
> Perl's regular expressions are powerful. Don't bother trying to read the "<", then the tag name, then the attributes, then the ">", as you would with lex.