Below is a short tutorial on using the pil VM, DB and OOP for basic data mining in XML-structured data. The article assumes some knowledge of programming and a POSIX or at least Unix-like environment. For information on installing PicoLisp, visit picolisp.com.
How to lisp some XML
This year we're having elections in Sweden. To prepare, it might be a good idea to read up on the last election cycle, back in 2014.
Luckily the Swedish government supplies us with a full set of voting results.
wget http://www.val.se/val/val2014/slutresultat/slutresultat.zip
unzip slutresultat.zip
This will unpack a lot of ISO-8859-1 encoded XML files in a new directory, as well as some metadata files describing the rest. It is easy to convert them into UTF-8 with a tool such as iconv.
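One way to do the conversion is sketched below with iconv. The function name `to_utf8`, the `.utf8` temporary suffix and the throwaway sample file are assumptions made for illustration, not something taken from the data set.

```shell
#!/bin/sh
# Sketch: convert every XML file in a directory from ISO-8859-1 to UTF-8.
# to_utf8 and the .utf8 suffix are hypothetical names chosen for this example.
to_utf8() {
    dir=$1
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue     # the glob matched nothing
        # write to a temporary file first, replace the original only on success
        iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
    done
}

# Demo on a made-up sample: \344 and \366 are the ISO-8859-1 bytes for "ä" and "ö"
demo=$(mktemp -d)
printf '<KOMMUN NAMN="V\344xj\366"/>\n' > "$demo/sample.xml"
to_utf8 "$demo"
cat "$demo/sample.xml"
```

Running the conversion twice on the same files would encode them a second time and garble the non-ASCII characters, so it is best done once, right after unpacking.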
Most of them contain voting results from a county, regarding either national parliament, regional or local assembly. The file names are constructed from a numeric county code and a character for each election, R for parliament ("riksdag"), L for region ("landsting"), K for local assembly ("kommun").
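The naming scheme can be put to use straight from the shell. The sketch below assumes that result files end in the election letter followed by `.xml`, as in `slutresultat_1407L.xml`, the file used further down. The helper name `election_of` is made up for this example.

```shell
#!/bin/sh
# Sketch: map a result file name to the election it covers.
# election_of is a hypothetical helper; the suffix rule is an assumption
# based on the naming scheme described above.
election_of() {
    case $1 in
        *R.xml) echo "parliament (riksdag)" ;;
        *L.xml) echo "region (landsting)" ;;
        *K.xml) echo "local assembly (kommun)" ;;
        *)      echo "other" ;;
    esac
}

election_of slutresultat_1407L.xml
```

Looping over `*.xml` in the data directory and calling the helper on each name gives a quick overview of which elections are covered before any file is opened.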
In the directory with the data we then start pil.
$ pil +
:
Not very exciting. Let's take a look at one of the files by attaching it to a symbol and doing some simple processing.
: (length (setq A (make (in "slutresultat_1407L.xml" (until (eof) (link (line T]
-> 1832
: (head 10 A)
As 'head shows, this actually is XML data. That means interesting information follows right after every "<", and we can see that "<" is usually the first or second character of each line stored in our 'A symbol.
To extract this we can use an expression like this.
: (length
   (setq *Subjects
      (uniq
         (make
            (for X A
               (and
                  (or
                     (match '("<" @Subject " " @B ">") (chop X))
                     (match '(@C "<" @Subject " " @B ">") (chop X)) )
                  (link (pack @Subject]
-> 17
: *Subjects
In a real mining session one would most likely use 'de to turn this into a function that takes a pattern list (for 'match to use) as its argument and, as a side effect, updates a global such as *Subjects (which would probably have another name) with the results from 'match. In interactive work globals act like post-it notes of arbitrary size, holding anything from millions of text fragments to object references or plain NIL, and because these sessions are temporary the usual horrors of globals in production code do not apply.
Here 'length is used to make sure the expression returns something both short and useful: the number of unique hits in our search. 'setq returns the value it assigns, so without 'length the result would fill the screen with junk unless our expression is sensible or the data set is very small (and who would want that?).
PicoLisp 'match is often useful where one would otherwise use grep or a regexp. Its first argument is a pattern list that is matched against the list in its second argument, and when they fit, each @Variable is bound to whatever the data holds at that place in the structure. This makes it well suited for sophisticated exfiltration techniques as well as the basic ones shown here.
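For comparison, the grep route mentioned above might look like the sketch below. It is only a rough cross-check, not an equivalent of the 'match expression: it lists the names of tags that are followed by a space, that is, tags carrying attributes. The function name `tag_names`, the sample file and the attribute names `ATTR1` and `ATTR2` are invented for the demo, and `grep -o` is a widespread extension (GNU and BSD grep) rather than part of POSIX.

```shell
#!/bin/sh
# Sketch: list the unique names of tags that carry attributes.
# tag_names is a hypothetical helper; grep -o is assumed to be available.
tag_names() {
    # keep each "<NAME " opening, strip "<" and the space, then deduplicate
    grep -o '<[A-Za-z_][A-Za-z0-9_]* ' "$1" | tr -d '< ' | sort -u
}

# Made-up sample input; ATTR1 and ATTR2 are placeholder attribute names
sample=$(mktemp)
cat > "$sample" <<'EOF'
<KOMMUN NAMN="Exempel">
 <PARTI ATTR1="a" ATTR2="b"/>
 <PARTI ATTR1="c" ATTR2="d"/>
</KOMMUN>
EOF

tag_names "$sample"
```

The pipeline yields plain text that has to be parsed again for every further step, whereas 'match leaves the pieces bound to variables inside the running session, which is the advantage the paragraph above points at.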
As you can see we now have a set of 17 unique XML tag names. These can be supplied to our earlier expression to extract more information. Let's have a look at what "PARTI" contains.
: (length
   (setq *PARTI
      (uniq
         (make
            (for X A
               (and
                  (or
                     (match '("<" ~(chop "PARTI") " " @B ">") (chop X))
                     (match '(@C "<" "P" "A" "R" "T" "I" " " @B ">") (chop X)) )
                  (link (pack @B]
-> 21
: *PARTI
The tilde (~) is a read macro: the expression it is attached to is evaluated as regular lisp when the code is read, and the resulting list is spliced into the surrounding one. Here 'chop turns a transient symbol like "PARTI" into the sequence "P" "A" "R" "T" "I" before the pattern is matched against the list in the next argument. Where it is easy to omit 'chop one should, or at least use it once to prepare a list for 'match that gets reused, since it brings some overhead that gets very noticeable with bigger data sets.
Apparently 21 parties ("PARTI" means 'party') ran in the regional election in this county (the name of which is in the "NAMN" attribute of the "KOMMUN" tag).
Let's write out their names.
: (mapcar
   '((X)
      (and
         (match '(@A " " @B "\"" @Party "\"" @C) (chop X))
         (printsp (pack @Party)) ) )
   *PARTI )
This kind of expression is especially useful, since the function passed as the first argument to 'mapcar can perform all sorts of side effects while 'mapcar itself still returns a list with one result per element (NIL where the match failed), regardless of those side effects.
One such potential use is saving the exfiltrated data or perhaps entire lines or documents in database objects, by simply putting a 'new! clause at the end and having set up some classes for holding and structuring the data of interest.
It could look something like this, assuming a database has already been opened with 'pool:
: (class +GeneralContainment +Entity)
-> +GeneralContainment
: (rel tag (+Ref +String))
-> +GeneralContainment
: (rel dta (+List +String))
-> +GeneralContainment
: (mapcar
   '((X)
      (and
         (match '(@A "=" "\"" @Short "\"" @B "=" "\"" @Full "\"" @C) (chop X))
         (new! '(+GeneralContainment)
            'tag "Party names, (abbreviation . name)"
            'dta (cons (pack @Short) (pack @Full)) ) ) )
   *PARTI )
Such class names aren't particularly idiomatic. The point here is that these classes ought to be very generic and kept in a small utility library. Usually the structure above, or one with two classes where one can hold many '+Joint relations to objects of the other class (in addition to a name/category field and one for data), is enough to structure most interactive mining.
Once one gets the hang of the basic list manipulations, text parsing techniques and pilog expressions, it can be quite a bit easier to go straight to exfiltration into generic objects instead of dabbling with globals, especially with bigger data sets. In that case one should also use some helper methods that do the exfiltration without blocking the REPL, since DB operations will probably take some time.
Thanks for reading, questions and comments are welcome.