
Yitzi Ginzberg

Originally published at Medium

Fixing Democracy, Developer Style

Submitting forms and scraping data for the public good

Photo by kyryll ushakov on Unsplash

The US legal system is stacked against the poor.

The outcome of any court case depends largely on how much money the parties involved can spend on their representation. Wealthy criminals who have harmed others get let off with a slap on the wrist or less, while the less financially fortunate can end up with outrageous sentences for the pettiest violations.

One of the main drivers of this unjust situation is the cost of legal research. Two companies hold a duopoly on the U.S. market for legal information.

Law firms seeking access to precedent-setting legal opinions need to pay Westlaw or Lexis upwards of $100 per month per license! See “The Lexis/Westlaw Duopoly and the Proprietization of Legal Research” (Columbia).

To fight this travesty, AnyLaw (as seen on forbes.com) has built a robust, fully featured, and, most importantly, FREE legal search engine!

Cool, I know!

How we do it

Because this was to be a free solution, we obviously could not take on the overhead of licensing the published opinions annually.

Thankfully, US law requires every state and federal court to maintain a website and publish its legal opinions online.

The obvious solution, therefore, is to scrape and download the data from the individual court sites.

Unfortunately, there is no standard. Sites can be:

  • static
  • server-side rendered
  • dynamic JavaScript applications

Courts are allowed to do as they please when building their sites.

It pleases them to do some very interesting things.

For example, have a look at the website of Virginia’s Supreme Court:

Yikes!!

Yup, that is every Supreme Court decision since 1995 on a single page. 😕

TL;DR: Each court website needs its own bot 😧

Case study: Texas

Texas is a great example because of the complexity of its site. This is what our bot is confronted with when it navigates to the court system URL:

a non-exceptional form

The first thing our bot needs to do is to get a list of all Supreme Court opinions. Since this bot needs to interact with the site, we chose to use HtmlUnit by Gargoyle Software Inc.

Step 1. Install the HtmlUnit dependency. We place the following code snippet into our app's pom.xml. If you are not using Maven, visit the website for additional options.
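Something along these lines (the version number is only illustrative; use whatever release is current):

```xml
<!-- HtmlUnit headless browser -->
<dependency>
  <groupId>net.sourceforge.htmlunit</groupId>
  <artifactId>htmlunit</artifactId>
  <version>2.36.0</version>
</dependency>
```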

Step 2. Create a method that fills out and submits the form. This is what such a method looks like:
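Assembled from the walkthrough that follows, it comes out roughly like this (the method name and the bare throws Exception are just for illustration; getFromDateString() is shown a bit further down):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.*;

public HtmlPage submitSearchForm(WebClient webClient, String pageUrl) throws Exception {
    // Load the court's search page
    HtmlPage page = webClient.getPage(pageUrl);

    // The whole search UI lives inside one ASP.NET form
    final HtmlForm form = page.getFormByName("aspnetForm");

    // Limit the search to "opinions" only
    final HtmlCheckBoxInput checkBoxInput =
            form.getInputByName("ctl00$ContentPlaceHolder1$chkListDocTypes$0");
    checkBoxInput.setChecked(true);

    // Set the "from" date so we only fetch recent documents
    final HtmlTextInput textField =
            form.getInputByName("ctl00$ContentPlaceHolder1$dtDocumentFrom$dateInput");
    textField.type(getFromDateString());

    // Submit the form and return the results page
    final HtmlSubmitInput button =
            form.getInputByName("ctl00$ContentPlaceHolder1$btnSearchText");
    return button.click();
}
```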

What’s happening here?

Let’s go through this line by line.

Getting the page

HtmlPage page = webClient.getPage(pageUrl);

HtmlPage page represents the initial page. HtmlPage comes from the HtmlUnit library. I use an instance of WebClient, another HtmlUnit class, to get the HtmlPage by calling getPage() on it and providing the URL. The WebClient instance is initialized with a simple WebClient webClient = new WebClient().
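If you want a slightly more forgiving client for messy real-world pages, a typical setup looks something like this (these option tweaks are a common HtmlUnit pattern, not necessarily what our bot does):

```java
import com.gargoylesoftware.htmlunit.WebClient;

// A tolerant WebClient: don't blow up on the site's JavaScript or CSS problems
WebClient webClient = new WebClient();
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
```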

Getting the form

final HtmlForm form = page.getFormByName("aspnetForm");

The getFormByName() method provides us with another HtmlUnit object, this one representing the form. It is easy to identify the name of any element we need using the browser's dev tools.

Getting and interacting with the input fields

Since we only want court “opinions,” we get the checkbox labeled “opinions”:

final HtmlCheckBoxInput checkBoxInput = form.getInputByName("ctl00$ContentPlaceHolder1$chkListDocTypes$0");

We “check” it programmatically with the line:

checkBoxInput.setChecked(true);

We also need to set a “from” date. We get the “from” field with:

final HtmlTextInput textField = form.getInputByName("ctl00$ContentPlaceHolder1$dtDocumentFrom$dateInput");

and “type” in a properly formatted string with:

textField.type(getFromDateString());

If you are curious about how we get the date string, here is what that looks like:
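Roughly speaking (the MM/dd/yyyy pattern and the one-week look-back are assumptions here; the real format is whatever the Texas date field expects):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Build the "from" date: one week back, formatted the way the form's date input expects
private String getFromDateString() {
    return LocalDate.now()
            .minusWeeks(1)
            .format(DateTimeFormatter.ofPattern("MM/dd/yyyy"));
}
```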

This bot runs frequently, so it is sufficient to fetch the most recent opinions.

Lastly, we need the submit button:

final HtmlSubmitInput button = form.getInputByName("ctl00$ContentPlaceHolder1$btnSearchText");

and finally, button.click() will submit the form and return the page generated as a result of the submission.

Our bot is now seeing a neat list of the most recent Supreme Court opinions. The bot can efficiently run through this list, downloading and tagging each document.
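A sketch of that loop might look like this (the “SearchMedia” filter is an assumption about how the Texas results page links to documents, and the actual download-and-tag logic is omitted):

```java
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

HtmlPage resultsPage = submitSearchForm(webClient, pageUrl);
for (HtmlAnchor link : resultsPage.getAnchors()) {
    String href = link.getHrefAttribute();
    // Only follow links that point at actual opinion documents
    if (href.contains("SearchMedia")) {
        System.out.println("Would download: " + href);
    }
}
```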

The results:

After a single run the bot has:

  • Downloaded 541 files from Texas courts!

My other computer’s a Mac

  • Inserted the download info into our database. This info includes the source URL as well as the current location on our server.

Yes, I know this is NJ. Sue me.

  • Inserted metadata about the case with a foreign key to the download entry. This metadata includes the date of the decision, case title, and issuing court.

Grand Finale!!🎉🎉

Finally, we can make use of this data by sharing it with the world! All this messy, unorganized data is now beautifully organized, searchable, and tagged. Have a look and you’ll discover some other cool stuff that we’ve done with this data.

Thanks for reading!

I’m Yitzi 👋

Yitzi Ginzberg (@codegician) | Twitter

Top comments (7)

Alan Hylands

Great post and project. Remember speaking to some solicitors 10-15 years ago who complained about how bad LexisNexis was to use back then.

Excellent use of tech to free up access to this data.

Yitzi Ginzberg

Thanks! We are also doing a cool judicial analytics AI project that will help people self-represent!

Alan Hylands

I'll keep an eye out for your next blog post if you write about that one as well.

Anurag Kale

Wow! Great work Yitzi.

Yitzi Ginzberg

Thanks!! I'm a big fan of your work as well!

Kamran Ayub

I have been wondering about using something slick and easy like cypress.io to do scraping. Gotta take care to follow Terms of Use but as a fun little project, I bet it would work!

Yitzi Ginzberg

Never heard of it. I'll check it out.