Submitting forms and scraping data for the public good
The US legal system is stacked against the poor.
The outcome of any court case depends largely on how much money the parties involved can spend on their representation. Wealthy defendants who have harmed others get let off with a slap on the wrist or less, while the less financially fortunate can end up with outrageous sentences for the pettiest violations.
One of the main drivers of this unjust situation is the cost of legal research. Two companies hold a duopoly on the U.S. market for legal information.
Law firms seeking access to precedent-setting legal opinions will need to pay Westlaw or Lexis upwards of $100 per month per license! See “The Lexis/Westlaw Duopoly and the Proprietization of Legal Research” — Columbia.
To fight this travesty AnyLaw (as seen on forbes.com) has built a robust, fully-featured, and most importantly, FREE legal search engine!
Cool, I know!
How we do it
Because this was to be a free solution, we obviously could not have the overhead of licensing the published opinions annually.
Thankfully, US law requires every state and federal court to maintain a website and publish their legal opinions online.
The obvious solution, therefore, is to scrape and download the data from the individual court sites.
Unfortunately, there is no standard. Sites can be:
- static
- server-side rendered
- dynamic JavaScript applications
Courts are allowed to do as they please in terms of building their site.
It pleases them to do some very interesting things.
For example, have a look at the website of Virginia’s Supreme Court:
Yup, that is every Supreme Court decision since 1995 on a single page. 😕
TL;DR: Each court website needs its own bot 😧
Case study: Texas
Texas is a great example because of the complexity of its site. This is what our bot is confronted with when it navigates to the court system URL:
The first thing our bot needs to do is to get a list of all Supreme Court opinions. Since this bot needs to interact with the site, we chose to use HtmlUnit by Gargoyle Software Inc.
Step 1. Install the HtmlUnit dependency. We place the following code snippet into our app's pom.xml. If you are not using Maven, visit the website for additional options.
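The coordinates look something like this (the version shown here is only an example from around the time of writing; grab the current release from the HtmlUnit site):

```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.36.0</version>
</dependency>
```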
Step 2. Create a method that fills out and submits the form. This is what such a method looks like:
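Here is a condensed sketch, assembled from the individual calls walked through below. The class wrapper, the page URL, and the exception handling are placeholders, and the date helper is only stubbed here:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.*;

public class TexasOpinionBot {

    // Placeholder URL for the Texas courts search page
    private static final String PAGE_URL = "https://search.txcourts.gov/CaseSearch.aspx";

    public HtmlPage submitSearchForm(WebClient webClient) throws Exception {
        // Load the search page
        HtmlPage page = webClient.getPage(PAGE_URL);

        // Grab the ASP.NET form by its name
        final HtmlForm form = page.getFormByName("aspnetForm");

        // Check the "opinions" document-type checkbox
        final HtmlCheckBoxInput checkBoxInput =
                form.getInputByName("ctl00$ContentPlaceHolder1$chkListDocTypes$0");
        checkBoxInput.setChecked(true);

        // Type a formatted "from" date into the date field
        final HtmlTextInput textField =
                form.getInputByName("ctl00$ContentPlaceHolder1$dtDocumentFrom$dateInput");
        textField.type(getFromDateString());

        // Click the search button; the returned page holds the results
        final HtmlSubmitInput button =
                form.getInputByName("ctl00$ContentPlaceHolder1$btnSearchText");
        return button.click();
    }

    private String getFromDateString() {
        // Covered in a later section of this post
        return "";
    }
}
```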
What’s happening here?
Let’s go through this line by line.
Getting the page
HtmlPage page = webClient.getPage(pageUrl);
`HtmlPage page` represents the initial page. `HtmlPage` comes from the HtmlUnit library. We use an instance of `WebClient`, another HtmlUnit class, to get the `HtmlPage` by calling `getPage()` on it and providing the URL. The `WebClient` instance is initialized with a simple `WebClient webClient = new WebClient()`.
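When the target site leans heavily on JavaScript, it is also common to loosen a few of the client's defaults so stray script errors don't abort the run. The options below illustrate that pattern; they are our example, not a required configuration:

```java
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);            // let the page's scripts run
webClient.getOptions().setCssEnabled(false);                   // skip CSS processing we don't need
webClient.getOptions().setThrowExceptionOnScriptError(false);  // don't abort on noisy script errors
```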
Getting the form
final HtmlForm form = page.getFormByName("aspnetForm");
The `getFormByName()` method provides us with another HtmlUnit object, this one representing the form. It is easy to identify the names of any elements we need using the browser's dev tools, as seen below.
Getting and interacting with the input fields
Since we only want court “opinions”, we get the checkbox labeled “opinions”:
final HtmlCheckBoxInput checkBoxInput = form.getInputByName("ctl00$ContentPlaceHolder1$chkListDocTypes$0");
We “check” it programmatically with the line:
checkBoxInput.setChecked(true);
We also need to set a “from” date. We get the “from” field with:
final HtmlTextInput textField = form.getInputByName("ctl00$ContentPlaceHolder1$dtDocumentFrom$dateInput");
and “type” in a properly formatted string with:
textField.type(getFromDateString());
If you are curious about how we get the date string, here is what that looks like:
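A minimal sketch of that helper is below; the seven-day look-back window and the date pattern the Texas form expects are assumptions here, not the production values:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Formats the earliest decision date we want the search to return.
private String getFromDateString() {
    LocalDate from = LocalDate.now().minusDays(7);                // look-back window is an assumption
    return from.format(DateTimeFormatter.ofPattern("M/d/yyyy"));  // date pattern is an assumption
}
```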
This bot runs frequently, so it is sufficient to fetch the most recent opinions.
Lastly, we need the submit button:
final HtmlSubmitInput button = form.getInputByName("ctl00$ContentPlaceHolder1$btnSearchText");
and finally, `button.click()` will submit the form and return the page generated as a result of the submission.
Our bot now sees a neat list of the most recent Supreme Court opinions. It can efficiently run through this list, downloading and tagging each document.
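To give a feel for that step, here is a rough sketch of the loop. The XPath selector, the file naming, and the target directory are made-up placeholders, not what the production bot actually uses:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public void downloadOpinions(HtmlPage resultsPage) throws Exception {
    // Collect the result links; this XPath is a placeholder.
    List<HtmlAnchor> links = resultsPage.getByXPath("//a[contains(@href, 'SearchMedia')]");

    int i = 0;
    for (HtmlAnchor link : links) {
        // Following the link returns the opinion document itself (often a PDF).
        Page document = link.click();
        try (InputStream in = document.getWebResponse().getContentAsStream()) {
            Path target = Paths.get("/data/opinions/tx", "opinion-" + (i++) + ".pdf");
            Files.copy(in, target);  // save the file for tagging and indexing
        }
    }
}
```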
The results:
After a single run the bot has:
- Downloaded 541 files from Texas courts!
- Inserted the download info into our database. This info includes the source URL as well as the current location on our server.
- Inserted metadata about the case with a foreign key to the download entry. This metadata includes the date of the decision, case title, and issuing court. (Both inserts are sketched below.)
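In case it helps to picture the two tables, here is a rough JDBC sketch of those two inserts. The table and column names are hypothetical, and the real bot's persistence code may look quite different:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public void recordOpinion(Connection conn, String sourceUrl, String localPath,
                          java.sql.Date decisionDate, String caseTitle, String court)
        throws SQLException {
    long downloadId;

    // 1. Insert the download record and capture its generated primary key.
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO downloads (source_url, local_path) VALUES (?, ?)",
            Statement.RETURN_GENERATED_KEYS)) {
        ps.setString(1, sourceUrl);
        ps.setString(2, localPath);
        ps.executeUpdate();
        try (ResultSet keys = ps.getGeneratedKeys()) {
            keys.next();
            downloadId = keys.getLong(1);
        }
    }

    // 2. Insert the case metadata with a foreign key back to the download row.
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO case_metadata (download_id, decision_date, case_title, issuing_court) "
                    + "VALUES (?, ?, ?, ?)")) {
        ps.setLong(1, downloadId);
        ps.setDate(2, decisionDate);
        ps.setString(3, caseTitle);
        ps.setString(4, court);
        ps.executeUpdate();
    }
}
```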
Grand Finale!!🎉🎉
Finally, we can make use of this data by sharing it with the world! All that messy, unorganized data is now beautifully organized, searchable, and tagged. Have a look and you’ll discover some of the other cool stuff we’ve done with it.
Thanks for reading!
I’m Yitzi 👋
Top comments (7)
Great post and project. I remember speaking to some solicitors 10-15 years ago who complained about how bad LexisNexis was to use back then.
Excellent use of tech to free up access to this data.
Thanks! We are also doing a cool judicial analytics AI project that will help people self-represent!
I'll keep an eye out for your next blog post if you write about that one as well.
Wow! Great work Yitzi.
Thanks!! I'm a big fan of your work as well!
I have been wondering about using something slick and easy like cypress.io to do scraping. Gotta take care to follow Terms of Use but as a fun little project, I bet it would work!
Never heard of it. I'll check it out.