Build law text corpus

#python #xslt

In this part of the series, I describe how to create a corpus of German law texts from https://www.gesetze-im-internet.de.

Previously in series

In the previous parts of this series, we downloaded 6518 German laws in XML format, stored in ZIP files.

Conversion to plain text

Converting XML documents to plain text can be accomplished with many tools and technologies, but after careful consideration of a couple of edge cases I decided to use an XSLT stylesheet.

After studying the DTD file referenced in the XML files, as well as the XML files themselves, I found that the following points had to be addressed (the paths use XPath notation):

1. The XML files have the root element /dokumente.
2. The laws are either very short and consist of a single paragraph, or rather long and have a table of contents.
3. In the first case from item 2, the law name is in metadaten/enbez and metadaten/titel (where both are present) or in metadaten/enbez only; in the second case, the title is in norm/metadaten/langue.
4. The text body is always in textdaten.
5. Paragraphs are in P tags and end with a new line.
6. Definition lists are in DL tags and are rendered like paragraphs, but without a new line after the last entry.
7. A line break in the text is a BR tag, but it is not rendered inside a table or a list entry.
8. Tables of contents (TOC tags) are excluded: they only repeat paragraph titles and are therefore useless for language model training, and they serve no purpose in plain text, where there are no page numbers.
9. Titles (Title tags) are rendered with a new line appended.
10. Tables (table tags) are rendered with each row (row tags) ending with a new line and every cell (entry tags) except the last one in a row followed by a tab character.
11. The end marker of each law text is 25 empty lines.
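
To make the rendering rules above more concrete, here is a minimal Python sketch of the same logic using only the standard library. It is not the XSLT stylesheet itself, just an illustration under simplifying assumptions: the function names render and law_to_text are hypothetical, every title element that exists (langue, enbez, titel) is simply emitted on its own line, each child of a DL is put on its own line, and whitespace normalization is left out.

```python
import xml.etree.ElementTree as ET

END_MARKER = "\n" * 25  # 25 empty lines mark the end of one law text


def render(elem, suppress_br=False):
    """Return the plain text of elem according to the rules listed above."""
    tag = elem.tag
    if tag == "TOC":
        return ""  # tables of contents are skipped entirely
    if tag == "BR":
        # line breaks are dropped inside table cells and list entries
        return "" if suppress_br else "\n"
    if tag == "row":
        # cells are separated by tabs, the row ends with a new line
        cells = [render(c, True) for c in elem if c.tag == "entry"]
        return "\t".join(cells) + "\n"
    if tag == "DL":
        # assumption: one line per child, no new line after the last one
        return "\n".join(render(c, True) for c in elem)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, suppress_br))
        parts.append(child.tail or "")
    text = "".join(parts)
    if tag in ("P", "Title"):
        text += "\n"  # paragraphs and titles end with a new line
    return text


def law_to_text(xml_source):
    """Convert one law document (root element dokumente) to plain text."""
    root = ET.fromstring(xml_source)
    out = []
    for norm in root.iter("norm"):
        meta = norm.find("metadaten")
        for name in ("langue", "enbez", "titel"):
            node = meta.find(name) if meta is not None else None
            if node is not None:
                out.append("".join(node.itertext()).strip() + "\n")
        body = norm.find("textdaten")
        if body is not None:
            out.append(render(body))
    return "".join(out) + END_MARKER
```

Calling law_to_text on the content of one XML file returns the plain text followed by the end marker; a real stylesheet would need to handle more element types than this sketch does.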

And hence the short XSLT stylesheet of about 100 lines:

Run it on Windows with msxsl.exe as the XSLT processor like this:

msxsl BJNR001270871.xml giitotext.xsl > BJNR001270871.txt
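
To convert all downloaded laws rather than a single file, the same command can be run in a loop. The following Python sketch assumes that msxsl is on the PATH and that the XML files have already been extracted from the ZIP files into one folder; the function names msxsl_command and convert_all are hypothetical.

```python
import subprocess
from pathlib import Path


def msxsl_command(xml_path, xsl_path):
    """Build the msxsl call shown above for one XML file."""
    return ["msxsl", str(xml_path), str(xsl_path)]


def convert_all(xml_dir, xsl_path, out_dir):
    """Run msxsl on every .xml file in xml_dir, one .txt file per law."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        target = out_dir / (xml_path.stem + ".txt")
        # redirect standard output to the target file, like "> file.txt"
        with open(target, "wb") as out:
            subprocess.run(msxsl_command(xml_path, xsl_path),
                           stdout=out, check=True)
```

For example, convert_all("xml", "giitotext.xsl", "txt") would write txt/BJNR001270871.txt for xml/BJNR001270871.xml. The output is written in binary mode so that the bytes arrive exactly as the processor emits them.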

Concatenating the text files creates a law text corpus.
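
The concatenation step could look roughly like this in Python. This is only a sketch: the function name build_corpus is hypothetical, and it assumes that the text files are UTF-8 encoded and that each one already ends with the end marker described above.

```python
from pathlib import Path


def build_corpus(text_dir, corpus_path):
    """Concatenate all .txt files in text_dir into one corpus file."""
    corpus_path = Path(corpus_path)
    # skip the corpus file itself in case it lives in the same folder
    files = [p for p in sorted(Path(text_dir).glob("*.txt"))
             if p.resolve() != corpus_path.resolve()]
    with open(corpus_path, "w", encoding="utf-8") as corpus:
        for path in files:
            corpus.write(path.read_text(encoding="utf-8"))
    return len(files)  # number of law texts in the corpus
```

Sorting the file names keeps the corpus reproducible between runs.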

Next step

In the next part of the series, we will see how to train a language model on the text corpus we just created.
