In this part of the series, I describe how to create a corpus of German law texts from https://www.gesetze-im-internet.de.
In the previous parts of this series, we downloaded 6518 German laws, in XML format, stored in ZIP files.
Converting XML documents to plain text can be accomplished with many tools and technologies, but after thorough consideration of a couple of edge cases, I decided to use an XSLT stylesheet.
After studying the DTD file referenced in the XML files, as well as the XML files themselves, I found that the following points had to be addressed (the paths given use XPath notation):
- The XML files have root element
- The laws are either incredibly short and consist of a single paragraph, or rather long with a table of contents
- In the first case of the previous item, the law name is in metadaten/titel (if the first path is present) or in metadaten/enbez only; in the second case, the title is in
- The text body is always in
- The paragraphs are in P tags and end with a new line
- The definition lists are in DL tags and are rendered similarly to paragraphs, but without a new line after the last entry
- A line break in the text is marked with a BR tag, but it is not rendered inside a table or a list entry
- Tables of contents (TOC tags) are excluded, as they only repeat paragraph titles and are thus useless for language model training; they are also unusable in plain text, as there are no page numbers
- Titles (Title tags) are rendered with an appended new line
- Tables (table tags) are rendered with each row (row tags) ending with a new line and every cell (entry tags) except the last one in a row followed by a tab character
- The end of each law text is marked by 25 empty lines
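To make the rendering rules above concrete, here is a minimal Python sketch of the same logic. It is not the XSLT stylesheet from this post, only an illustration under a few assumptions: the sample XML is invented, DD as the entry element of a definition list and the nesting of DL inside P are guesses rather than something taken from the DTD, and "25 empty lines" is read as 25 newline characters. The names render, law_to_text and SAMPLE are hypothetical.

```python
import xml.etree.ElementTree as ET

END_MARKER = "\n" * 25  # assumption: 25 empty lines == 25 newline characters


def render(elem, in_cell=False):
    """Return the plain-text rendering of elem and its subtree."""
    tag = elem.tag
    if tag == "TOC":
        return ""  # tables of contents are excluded entirely
    if tag == "BR":
        # line breaks are dropped inside table cells and list entries
        return "" if in_cell else "\n"
    if tag == "row":
        # cells are separated by tabs, the row ends with a new line
        cells = [render(e, True) for e in elem if e.tag == "entry"]
        return "\t".join(cells) + "\n"
    if tag == "DL":
        # one entry per line, no new line after the last entry
        return "\n".join(render(e, True) for e in elem)
    # Generic case: own text, then each child followed by its tail text.
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, in_cell))
        parts.append(child.tail or "")
    text = "".join(parts)
    # paragraphs and titles end with a new line
    return text + "\n" if tag in ("P", "Title") else text


def law_to_text(xml_string):
    """Render one law document and append the end marker."""
    return render(ET.fromstring(xml_string)) + END_MARKER


# Invented sample; the wrapper element name "text" is a placeholder.
SAMPLE = (
    "<text>"
    "<TOC><P>Contents</P></TOC>"
    "<Title>Section 1</Title>"
    "<P>First line<BR/>second line</P>"
    "<P><DL><DD>item a</DD><DD>item b</DD></DL></P>"
    "<table><row><entry>x<BR/>y</entry><entry>z</entry></row></table>"
    "</text>"
)
```

Calling law_to_text(SAMPLE) yields the title, the two paragraph lines, the two list entries and one tab-separated table row, followed by the end marker. The sketch does not normalize whitespace, so pretty-printed XML would leak indentation into the output.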
And hence the short XSLT stylesheet of about 100 lines:
Run it on Windows using msxsl.exe as the XSLT processor like this:
msxsl BJNR001270871.xml giitotext.xsl > BJNR001270871.txt
Concatenating the text files creates a law text corpus.
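The concatenation step could be sketched in Python as follows. This is only one possible approach, assuming that all converted files sit in one directory with a .txt extension, that each file already ends with its end marker, and that the files are UTF-8 encoded (the actual encoding depends on the stylesheet output settings). The function name build_corpus and its parameters are hypothetical.

```python
from pathlib import Path


def build_corpus(text_dir, corpus_path, encoding="utf-8"):
    """Concatenate all .txt files in text_dir into corpus_path.

    Returns the number of files written. The encoding is an assumption;
    adjust it to whatever the XSLT processor actually produced.
    """
    files = sorted(Path(text_dir).glob("*.txt"))
    with open(corpus_path, "w", encoding=encoding, newline="") as corpus:
        for path in files:
            # Each file is expected to carry its own end marker already.
            corpus.write(path.read_text(encoding=encoding))
    return len(files)
```

Write the corpus file to a different directory than text_dir, otherwise a second run would pick up the corpus itself as an input file.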
In the next part of the series, we will see how to train a language model with the text corpus we just created.