Since I have trouble finding the time to finish that guestbook article, I'll keep you bored with some more interactive parsing. This time we'll wring out and save a copy of a subset of a remote database that we can access through a browser or similar web client.
Earlier I wrote a little about how one can use the PicoLisp function 'match for exploring XML or similarly structured data. It is a very powerful pattern matching tool that can be used to extract just about any data from any set, as long as you have an idea of what its surroundings look like.
Sometimes you want something else, though, and don't know enough for an efficient 'match expression. Perhaps you want to exfiltrate every occurrence of a pattern in a repetitive data set but aren't yet sure what the pattern ought to look like. In that case one usually applies the paired functions 'from and 'till instead, and in a slightly different manner.
I have used this technique for all sorts of interactive searching and pattern matching, including real work applications like chugging through gigabytes of web sites hunting for contact information.
The other day I was in a pickle: I had some files containing markup and information about Swedish metal bands from metal-archives.com and wanted to extract the information and make it searchable. Basically, I wanted to create a subset of their database locally and query it for the specifics I was interested in.
With 'from and 'till we get to figure out what we want in sequential pattern fragments rather than in one complete pattern.
How to save a copy of a JS-generated, dynamic DOM is a small science in itself, which I will conveniently assume you have already doctored in. As for my setup, I did it with a Chromium instance and quickly got nine HTML files to examine and write a small parser for.
What we want to do is have the PicoLisp reader look over the search result line in each HTML file and give us back a data structure that contains band name, location, status, genre and URL. So we start it, tell it what our data model should look like, and tell it in what file to keep the data persistent.
    $ pil +
    : (pool "mase.db")
    -> T
    : (class +Band +Entity)
    -> +Band
    : (rel nm (+Ref +String))
    -> +Band
    : (rel loc (+Ref +String))
    -> +Band
    : (rel status (+Ref +String))
    -> +Band
    : (rel genre (+Ref +String))
    -> +Band
    : (rel url (+Ref +String))
    -> +Band
    :
Or as a snippet in an er.l:
    (pool "mase.db")

    (class +Band +Entity)
    (rel nm (+Ref +String))
    (rel loc (+Ref +String))
    (rel status (+Ref +String))
    (rel genre (+Ref +String))
    (rel url (+Ref +String))
This can then be loaded with '(load "er.l").
Now that you have your model, HTML files and a pooled database it is time to start working on our parser. When doing this with a new set of data it is likely one does some examining from the REPL by reading a file into a variable as a list of transient symbols, or a list of strings, or a list of lines. One could also use a text editor or 'View source' in a browser, whatever works for you when you want to skim through and quickly get an idea about where the valuable information is and how the data is structured there.
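A minimal sketch of that first look from the REPL follows. The function name readLines is made up for illustration; the built-ins used are 'in, 'make, 'link, 'line, 'maxi, 'head, 'chop and 'pack. Picking the longest line is only a guess at where the data sits, and it happens to fit a file where all the bands are on one line.

```picolisp
# Read a file into a list of lines, one transient symbol per line.
# (line T) packs each line into a string instead of a list of characters.
(de readLines (File)
   (in File
      (make
         (until (eof)
            (link (line T)) ) ) ) )

# At the REPL, with 1.html in the current directory:
# (setq Lines (readLines "1.html"))
# (length Lines)                                  # number of lines
# (pack (head 200 (chop (maxi length Lines))))    # start of the longest line
```

Peeking at only the first couple of hundred characters keeps the terminal readable when a single line holds hundreds of table rows.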
Having looked at 1.html and found the line with ~500 bands and some data on them we can start figuring out how to get it out of there. 'from is like a hook while 'till is a release, so you tell 'from how it looks where the extraction should begin and 'till when to stop.
    : (make                              # we want this to return a list
       (in "1.html"                      # from this file
          (until (eof)                   # and keep going until the end
             (link                       # linking lists into a list of lists
                (make
                   (from "tr class")     # from this pattern, look for
                   (from "href=\"")      # this pattern, and then
                   (link (pack (till "\""))) ]   # grab it.
If you run this at your REPL and have the HTML files in order, you'll get a list of URLs, something like this but about 500 elements long: '(("https://www.metal-archives.com/bands/Name/ID") ("https://www.metal-archives.com/bands/Another_Name/AnotherID")).
This is built by the 'make-'link constructs, where the innermost one gets the URL as a transient symbol from 'pack. If we didn't use 'pack we would instead get the URLs as lists of characters.
Just getting these links isn't satisfactory, though. We also want the rest of the information, so we'll extend the above like this.
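To try the hook-and-release idea without the real files, here is a minimal sketch on a made-up two-row sample. The file name sample.html, the example.org URLs and the function name urls are hypothetical. It differs from the loop above in two ways. It loops with (while (from ...)) instead of (until (eof) ...): 'from returns NIL once nothing more is found, which, as far as I can tell, avoids the trailing element of NILs that the 'until loop collects when there is text left after the last row. It also passes T to 'till, which makes it return a string directly and so does the same job as 'pack.

```picolisp
# Hypothetical two-row sample so this runs without the real files
(out "sample.html"
   (prinl "<tr class=\"odd\"><td><a href=\"https://example.org/bands/A/1\">A</a></td></tr>")
   (prinl "<tr class=\"even\"><td><a href=\"https://example.org/bands/B/2\">B</a></td></tr>") )

(de urls (File)
   (in File
      (make
         (while (from "tr class")      # hook: NIL at end of file ends the loop
            (from "href=\"")           # move to just after href="
            (link (till "\"" T)) ) ) ) )   # release: read up to the next "

(urls "sample.html")
```

Note that this returns a flat list of the two URLs rather than a list of lists, since there is only one field per row here.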
    : (make
       (in "1.html"
          (until (eof)
             (link
                (make
                   (from "tr class")
                   (from "href=\"")
                   (link (pack (till "\"")))
                   (from ">")                  # name is between >
                   (link (pack (till "<")))    # and <
                   (from "><") (from "><")     # after two ><
                   (from ">")                  # then a >
                   (link (pack (till "<"))) ]
As you can see, we can be quite ignorant of the details in the data and extract what we want anyway; it is enough to know that there will be two '><' before there is '>Data<'. This is slightly different from working with 'match: here we are extending our exfiltration incrementally rather than expressing ourselves with a dense list as a pattern.
To get all the interesting information out of this data one would continue adding 'from expressions. The next one would look for a single '><' before extracting 'from '>' 'till '<', and the last extract would probably be 'from 'span', 'from '>' and 'till '<'.
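Spelled out, the steps just described could look like the sketch below. The sample row is hypothetical and shaped to fit those patterns; the real markup may well differ, so treat the fragments as a starting point to adjust interactively. The function name bandRows and the file name row.html are also made up.

```picolisp
# Hypothetical row, shaped to fit the patterns described in the text
(out "row.html"
   (prinl "<tr class=\"odd\"><td><a href=\"https://example.org/bands/A/1\">A</a></td>"
      "<td>Stockholm</td><td>Death Metal</td>"
      "<td><span class=\"active\">Active</span></td></tr>" ) )

(de bandRows (File)
   (in File
      (make
         (while (from "tr class")            # 'from returns NIL at end of file
            (link
               (make
                  (from "href=\"")
                  (link (pack (till "\"")))  # Url
                  (from ">")
                  (link (pack (till "<")))   # Name
                  (from "><") (from "><")    # two ><, then a >
                  (from ">")
                  (link (pack (till "<")))   # Location
                  (from "><")                # a single ><, then a >
                  (from ">")
                  (link (pack (till "<")))   # Genre
                  (from "span")              # inside the span tag
                  (from ">")
                  (link (pack (till "<"))) ) ) ) ) ) )   # Status

(bandRows "row.html")
```

On the sample row each element comes out in the order Url, Name, Location, Genre, Status.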
Putting it in objects
Once all of this is in place we get a list where each element is a list '(Url Name Location Genre Status), making it almost trivial to save it as database objects. Assuming you ran it and got the result in the REPL you would probably do something like this, where @ in the regular REPL context is short for 'last on the result stack' and @@ and @@@ dig deeper.
    : (for X @
       (new! '(+Band)
          'nm (cadr X)
          'loc (caddr X)
          'status (last X)
          'genre (cadddr X)
          'url (car X) ]
Once 'new! has run successfully your data is persistent, browseable and searchable. To be really, really sure it is in your database and will be available next time you open it, run '(commit 'upd) an extra time.
From here it is easy to put it all together with 'dir supplying a list of arguments for the resulting function. Simplest but somewhat ugly is to replace @ above with the full 'make-'from nest and take the file name from 'dir, so when it has gone through all of the files you end up with a lot of database objects containing a full set of Swedish metal bands known to Metal-Archives.
It is of course better to factor it into a few functions in a file instead, at least if one wants to reuse the basic 'make 'in 'from 'till structure. However, it is as an interactive tool that pil really shines, and sometimes this means sacrificing a bit of elegance for a more brutish approach to code style.
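As a rough idea of what such a factoring might look like, here is a sketch under a few assumptions: er.l has already been loaded so the +Band class and the pooled database exist, the directory passed in holds only the saved HTML files, and the markup fits the fragments described earlier. The names nextText, parseBands and importDir are hypothetical.

```picolisp
# Skip to the next ">" and return the text up to the following "<"
(de nextText ()
   (from ">")
   (till "<" T) )

# One file -> list of (Url Name Location Genre Status)
(de parseBands (File)
   (in File
      (make
         (while (from "tr class")
            (from "href=\"")
            (link
               (list                         # arguments are read left to right
                  (till "\"" T)                                # Url
                  (nextText)                                   # Name
                  (prog (from "><") (from "><") (nextText))    # Location
                  (prog (from "><") (nextText))                # Genre
                  (prog (from "span") (nextText)) ) ) ) ) ) )  # Status

# Every file in Dir -> +Band objects (needs the model from er.l)
(de importDir (Dir)
   (for F (dir Dir)
      (for X (parseBands (pack Dir "/" F))
         (new! '(+Band)
            'url (car X)
            'nm (cadr X)
            'loc (caddr X)
            'genre (cadddr X)
            'status (last X) ) ) ) )
```

Usage would be something like (load "er.l") followed by (importDir "html"), where "html" is a made-up directory name. Pulling the repeated 'from '>' 'till '<' pair into a helper is the main gain; the rest is the same nest as before.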
Primitive ad hoc queries
Querying the database can be done in several ways. We used a simple '+Ref index, so tolerant and substring searches aren't possible right away, but we can simulate this with our old friend 'match and the 'printsp function, which prints its argument followed by a space.
    : (for X (collect 'nm '+Band)
       (with X
          (and
             (match '("S" "t" "o" "c" "k" "h" "o" "l" @B) (chop (: loc)))
             (printsp (pack (: nm) " - " (: loc))) ]
Using a variable like '@B amounts to a wildcard. This pattern will be true for all location fields that begin with 'Stockhol' (i.e. the next line will fire and 'printsp name and location), including those that look like 'Stockholm/SomeOtherCity'. When doing this in practice it is likely one puts a '@A at the beginning as well, and in a file one would use 'use to declare these symbols and prevent them from leaking artifacts from other parts of the program.
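A small sketch of that last point, with '@A in front and 'use keeping both pattern symbols local. The function names locMatch and bandsIn are hypothetical, and bandsIn assumes the +Band model and database from earlier are loaded.

```picolisp
# T if Str occurs anywhere in Loc, NIL otherwise; both are plain strings
(de locMatch (Str Loc)
   (use (@A @B)                        # keep the wildcards local
      (match
         (append '(@A) (chop Str) '(@B))   # (@A "S" "t" ... @B)
         (chop Loc) ) ) )

# Print name and location of every band whose location contains Str
(de bandsIn (Str)
   (for X (collect 'nm '+Band)
      (with X
         (and
            (locMatch Str (: loc))
            (printsp (pack (: nm) " - " (: loc))) ) ) ) )

(locMatch "Stockholm" "Solna/Stockholm")   # -> T
(locMatch "Stockholm" "Uppsala")           # -> NIL
```

Building the pattern with 'append and 'chop saves typing out the characters one by one, at the cost of rebuilding the list on every call.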
If you would like to see this done quickly by the REPL I saved a ten minute session here that is basically a repetition of the above.
If you thought this was worthwhile reading, consider donating a coffee or two, Stagling@Patreon.
The data used above might be copyrighted, if so I consider this to be fair use and decent advertisement for an excellent web service, but will of course remove the post if the right people tell me to. While doing this I listened to Temple of Baal, Verses of Fire.
Addendum: Sorry about the formerly broken links, they are fixed now. Thanks aw for pointing it out.