
Yitzi Ginzberg

Originally published at Medium

Fixing Democracy, Developer Style

Submitting forms and scraping data for the public good

Photo by kyryll ushakov on Unsplash

The US legal system is stacked against the poor.

The outcome of any court case depends largely on how much money the parties involved can spend on their representation. Wealthy criminals who have harmed others get let off with a slap on the wrist or less, while the less financially fortunate can end up with outrageous sentences for the pettiest violations.

One of the main drivers of this unjust situation is the cost of legal research. Two companies hold a duopoly on the U.S. market for legal information.

Law firms seeking access to precedent-setting legal opinions need to pay Westlaw or Lexis upwards of $100 per month per license! See “The Lexis/Westlaw Duopoly and the Proprietization of Legal Research” (Columbia).

To fight this travesty, AnyLaw (as seen on forbes.com) has built a robust, fully featured, and, most importantly, FREE legal search engine!

Cool, I know!

How we do it

Because this was to be a free solution, we obviously could not take on the overhead of licensing the published opinions annually.

Thankfully, US law requires every state and federal court to maintain a website and publish its legal opinions online.

The obvious solution, therefore, is to scrape and download the data from the individual court sites.

Unfortunately, there is no standard. Sites can be:

  • static
  • server-side rendered
  • dynamic JavaScript applications

Courts are allowed to do as they please when building their sites.

It pleases them to do some very interesting things.

For example, have a look at the website of Virginia’s Supreme Court:

Yikes!!

Yup, that is every Supreme Court decision since 1995 on a single page. 😕

TL;DR: Each court website needs its own bot 😧

Case study: Texas

Texas is a great example because of the complexity of its site. This is what our bot is confronted with when it navigates to the court system URL:

a non-exceptional form

The first thing our bot needs to do is to get a list of all Supreme Court opinions. Since this bot needs to interact with the site, we chose to use HtmlUnit by Gargoyle Software Inc.

Step 1. Install the HtmlUnit dependency. We place the following code snippet into our app's pom.xml. If you are not using Maven, visit the website for additional options.
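Something along these lines (the version number is only illustrative; use whatever release is current):

```xml
<!-- HtmlUnit headless browser -->
<dependency>
  <groupId>net.sourceforge.htmlunit</groupId>
  <artifactId>htmlunit</artifactId>
  <version>2.36.0</version>
</dependency>
```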

Step 2. Create a method that fills out and submits the form. This is what such a method looks like:
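Assembled from the walkthrough that follows, it comes out roughly like this (the method name and the bare throws Exception are just for illustration; getFromDateString() is shown a bit further down):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.*;

public HtmlPage submitSearchForm(WebClient webClient, String pageUrl) throws Exception {
    // Load the court's search page
    HtmlPage page = webClient.getPage(pageUrl);

    // The whole search UI lives inside one ASP.NET form
    final HtmlForm form = page.getFormByName("aspnetForm");

    // Limit the search to "opinions" only
    final HtmlCheckBoxInput checkBoxInput =
            form.getInputByName("ctl00$ContentPlaceHolder1$chkListDocTypes$0");
    checkBoxInput.setChecked(true);

    // Set the "from" date so we only fetch recent documents
    final HtmlTextInput textField =
            form.getInputByName("ctl00$ContentPlaceHolder1$dtDocumentFrom$dateInput");
    textField.type(getFromDateString());

    // Submit the form and return the results page
    final HtmlSubmitInput button =
            form.getInputByName("ctl00$ContentPlaceHolder1$btnSearchText");
    return button.click();
}
```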

What’s happening here?

Let’s go through this line by line.

Getting the page

HtmlPage page = webClient.getPage(pageUrl);

HtmlPage page represents the initial page. HtmlPage comes from the HtmlUnit library. I use an instance of WebClient, another HtmlUnit class, to get the HtmlPage by calling getPage() on it and providing the URL. The WebClient instance is initialized with a simple WebClient webClient = new WebClient().
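If you want a slightly more forgiving client for messy real-world pages, a typical setup looks something like this (these option tweaks are a common HtmlUnit pattern, not necessarily what our bot does):

```java
import com.gargoylesoftware.htmlunit.WebClient;

// A tolerant WebClient: don't blow up on the site's JavaScript or CSS problems
WebClient webClient = new WebClient();
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
```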

Getting the form

final HtmlForm form = page.getFormByName("aspnetForm");

The getFormByName() method provides us with another HtmlUnit object, this one representing the form. It is easy to identify the name of any element we need using the browser's dev tools.

Getting and interacting with the input fields

Since we only want court “opinions,” we get the checkbox labeled “opinions”:

final HtmlCheckBoxInput checkBoxInput = form.getInputByName("ctl00$ContentPlaceHolder1$chkListDocTypes$0");

We “check” it programmatically with the line:

checkBoxInput.setChecked(true);

We also need to set a “from” date. We get the “from” field with:

final HtmlTextInput textField = form.getInputByName("ctl00$ContentPlaceHolder1$dtDocumentFrom$dateInput");

and “type” in a properly formatted string with:

textField.type(getFromDateString());

If you are curious about how we get the date string, here is what that looks like:
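Roughly speaking (the MM/dd/yyyy pattern and the one-week look-back are assumptions here; the real format is whatever the Texas date field expects):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Build the "from" date: one week back, formatted the way the form's date input expects
private String getFromDateString() {
    return LocalDate.now()
            .minusWeeks(1)
            .format(DateTimeFormatter.ofPattern("MM/dd/yyyy"));
}
```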

This bot runs frequently, so it is sufficient to fetch the most recent opinions.

Lastly, we need the submit button:

final HtmlSubmitInput button = form.getInputByName("ctl00$ContentPlaceHolder1$btnSearchText");

and finally, button.click() will submit the form and return the page generated as a result of the submission.

Our bot is now seeing a neat list of the most recent Supreme Court opinions. The bot can efficiently run through this list, downloading and tagging each document.
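A sketch of that loop might look like this (the “SearchMedia” filter is an assumption about how the Texas results page links to documents, and the actual download-and-tag logic is omitted):

```java
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

HtmlPage resultsPage = submitSearchForm(webClient, pageUrl);
for (HtmlAnchor link : resultsPage.getAnchors()) {
    String href = link.getHrefAttribute();
    // Only follow links that point at actual opinion documents
    if (href.contains("SearchMedia")) {
        System.out.println("Would download: " + href);
    }
}
```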

The results:

After a single run the bot has:

  • Downloaded 541 files from Texas courts!

My other computer’s a Mac

  • Inserted the download info into our database. This info includes the source URL as well as the current location on our server.

Yes, I know this is NJ. Sue me.

  • Inserted metadata about the case with a foreign key to the download entry. This metadata includes the date of the decision, case title, and issuing court.

Grand Finale!!🎉🎉

Finally, we can make use of this data by sharing it with the world! All this messy, unorganized data is now beautifully organized, searchable, and tagged. Have a look and you’ll discover some other cool stuff that we’ve done with this data.

Thanks for reading!

I’m Yitzi 👋

Yitzi Ginzberg (@codegician) | Twitter

Top comments (7)

Alan Hylands

Great post and project. Remember speaking to some solicitors 10-15 years ago who complained about how bad LexisNexis was to use back then.

Excellent use of tech to free up access to this data.

Yitzi Ginzberg

Thanks! We are also doing a cool judicial analytics AI project that will help people self-represent!

Alan Hylands

I'll keep an eye out for your next blog post if you write about that one as well.

Anurag Kale

Wow! Great work Yitzi.

Yitzi Ginzberg

Thanks!! I'm a big fan of your work as well!

Kamran Ayub

I have been wondering about using something slick and easy like cypress.io to do scraping. Gotta take care to follow Terms of Use but as a fun little project, I bet it would work!

Yitzi Ginzberg

Never heard of it. I'll check it out.