Reading in large .JSONL files from the Harvard Caselaw Access Project (CAP) in Python
Reading in large amounts of data can be time-consuming. Social scientists are not always well trained in best practices for file I/O and for manipulating data with minimal run time and memory usage. This post was inspired by a real-world example I ran into while working with data from the Caselaw Access Project (“CAP”), newly released by The President and Fellows of Harvard University. CAP provides an API and bulk data downloads of "all official, book-published United States case law" and is a treasure trove of legal text. This post is intended to help others efficiently read in large data files, whether working with CAP data or another source.
Accessing data from all 50 states (and territories) requires authorization from CAP, but data for all Arkansas and Illinois case law are publicly available. For this tutorial, I use their publicly available data, which you can download here. However, the programming principles used are even more valuable when working with CAP's full dataset.
To get started, download "Arkansas-20180829-text.zip" and "Illinois-20180829-text.zip" from the page linked in the preceding paragraph. Extracting the files from each zip folder produces .xz files. If using Windows, I recommend downloading WinZip to extract the .JSONL file from each .xz file. Place both .JSONL files in a directory, and rename them "arkansas_data.jsonl" and "illinois_data.jsonl" respectively. Now we can get into the code! I have written this code in Python 3.
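If you would rather not extract the .xz files by hand, Python's standard library can read them directly. The following is a minimal sketch, not the approach used in the rest of this post; the function name iter_cases_xz is hypothetical, and it assumes each line of the compressed file is one JSON record.

```python
import json
import lzma


def iter_cases_xz(path):
    """Yield one parsed case (a dict) per line of an .xz-compressed JSONL file."""
    # lzma.open in text mode decompresses on the fly, so the file never has
    # to be fully extracted to disk or loaded into memory at once.
    with lzma.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because this is a generator, looping over iter_cases_xz("some_file.jsonl.xz") holds only one case in memory at a time.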
First, let's import the relevant libraries, set our working directory, and store the file names (and corresponding state names). Additionally, I include a function to calculate the decade in which a case was decided, to be called later.
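A setup along these lines might look as follows. This is a sketch under assumptions: DATA_DIR, FILES, and get_decade are illustrative names, and I build full paths with os.path.join instead of changing the working directory.

```python
import json
import os

# Hypothetical location of the two renamed files; adjust to your own setup.
DATA_DIR = "."

# State names paired with the file names chosen in the previous step.
FILES = {
    "Arkansas": "arkansas_data.jsonl",
    "Illinois": "illinois_data.jsonl",
}

# Full path to each file, keyed by state.
PATHS = {state: os.path.join(DATA_DIR, name) for state, name in FILES.items()}


def get_decade(year):
    """Return the first year of the decade containing `year` (1987 -> 1980)."""
    return (int(year) // 10) * 10
```

Integer division keeps the decade calculation free of string slicing, and it works whether the year arrives as an int or as a string such as "1987".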
I am only interested in extracting opinions from state courts of last resort. So, I am going to create a list of state high courts to match against later.
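For the two public states, such a list could be as short as the one below. The exact court-name strings are an assumption about how CAP labels each court, so check them against the court names in your own files; HIGH_COURTS and is_high_court are hypothetical names.

```python
# Court names to keep. These strings are assumptions about CAP's labels;
# verify them against the court names that appear in your files.
HIGH_COURTS = ["Arkansas Supreme Court", "Illinois Supreme Court"]

# A set gives constant-time membership tests inside the main loop,
# which matters once the list grows to 50+ courts.
HIGH_COURT_SET = set(HIGH_COURTS)


def is_high_court(court_name):
    """Return True if the court name matches a court of last resort."""
    return court_name in HIGH_COURT_SET
```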
The overall approach is to build a dictionary of pandas dataframes (one dataframe per file). For each case, we extract the desired data from the .jsonl file, store it in a dictionary, and append that dictionary to a list (one list per file/state). Once all cases in a state have been extracted, the list of dictionaries is converted to a single pandas dataframe. This is efficient because each dataframe is built once per file rather than grown row by row, which would copy the data on every append. But let's break it down into steps.
I create an empty dictionary, and store empty pandas dataframes (one for each file). The dataframes have columns for "court", "date", and so on. I will extract this information from the .jsonl files.
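As a sketch of that step: only "court" and "date" are column names taken from the description above; the remaining names in COLUMNS are my own placeholders for the other fields extracted later.

```python
import pandas as pd

# "court" and "date" come from the text; the other column names are
# hypothetical labels for the fields extracted later on.
COLUMNS = [
    "court", "date", "cite", "parties", "year",
    "decade", "n_opinions", "state", "n_words",
]

states = ["Arkansas", "Illinois"]

# One empty dataframe per state, keyed by state name.
state_court_d = {state: pd.DataFrame(columns=COLUMNS) for state in states}
```

These empty dataframes act as placeholders; each one is replaced with a filled dataframe once its file has been processed.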
Now, I write a big loop. While some advise against loops, when I am working with all 50 state files I prefer to trade some time for space (I do not want to run out of memory and have the script stop!). For each file, I create an empty list. For each line in a file (each line is a case), I read in the data from the .jsonl file. Since I am interested in state high courts, I have an if statement that extracts data only if the case was heard by a court of last resort. Additionally, since I am interested in the text of opinions, I only extract data if at least one opinion is included.
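The reading and filtering part of the loop can be sketched as a generator. The record layout used here (a "court" object with a "name", and opinions under "casebody", "data", "opinions") is my assumption about the CAP text format, and iter_high_court_cases is a hypothetical name.

```python
import json

# Assumed court labels; verify against your own files.
HIGH_COURTS = {"Arkansas Supreme Court", "Illinois Supreme Court"}


def iter_high_court_cases(path, high_courts=HIGH_COURTS):
    """Yield high-court cases that contain at least one opinion."""
    with open(path, encoding="utf-8") as f:
        # Iterating over the file object reads one line (one case) at a
        # time, so memory use stays flat regardless of file size.
        for line in f:
            if not line.strip():
                continue
            case = json.loads(line)
            # Field names below are assumptions about the CAP record layout.
            court = (case.get("court") or {}).get("name")
            data = (case.get("casebody") or {}).get("data") or {}
            opinions = data.get("opinions") or []
            if court in high_courts and len(opinions) > 0:
                yield case
```

Reading line by line is what keeps memory use low: calling json.load on the whole file would not work for JSONL anyway, and reading every line into a list first would hold the full file in memory.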
For each case, I create a dictionary and store the court name, date of the decision, case citation, parties to the case, year, decade (with the function included earlier), number of opinions, state, and number of words in the majority opinion. I append each dictionary to the list "rows_list". Once the script has run through a file, the list is converted to a dataframe and stored in the dictionary "state_court_d". Since there are two files (Arkansas and Illinois), this loop iterates twice, producing a dictionary composed of two dataframes.
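The row-building step might look roughly like this. It is a sketch, not the original code: the field names ("decision_date", "citations", "name", and so on) are assumptions about the CAP record layout, the column names are placeholders, and falling back to the first opinion when none is labeled "majority" is my own choice.

```python
import pandas as pd

COLUMNS = ["court", "date", "cite", "parties", "year",
           "decade", "n_opinions", "state", "n_words"]


def get_decade(year):
    """Return the first year of the decade containing `year`."""
    return (int(year) // 10) * 10


def case_to_row(case, state):
    """Flatten one case record into a plain dictionary (one future row)."""
    opinions = case["casebody"]["data"]["opinions"]
    # Prefer the opinion labeled "majority"; otherwise use the first one.
    majority = next((o for o in opinions if o.get("type") == "majority"),
                    opinions[0])
    year = int(case["decision_date"][:4])  # dates are assumed to start YYYY
    citations = case.get("citations") or []
    return {
        "court": case["court"]["name"],
        "date": case["decision_date"],
        "cite": citations[0]["cite"] if citations else None,
        "parties": case.get("name"),
        "year": year,
        "decade": get_decade(year),
        "n_opinions": len(opinions),
        "state": state,
        "n_words": len((majority.get("text") or "").split()),
    }


def build_state_frames(cases_by_state):
    """Build one dataframe per state from an iterable of cases per state."""
    state_court_d = {}
    for state, cases in cases_by_state.items():
        rows_list = []
        for case in cases:
            rows_list.append(case_to_row(case, state))
        # Convert once per state, after all rows have been collected.
        state_court_d[state] = pd.DataFrame(rows_list, columns=COLUMNS)
    return state_court_d
```

The values in cases_by_state can be any iterable, including a generator that reads a file line by line, so the raw records never need to be held in memory all at once.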
Since I want to analyze state high court cases across multiple states, I combine the dictionary of dataframes into a single dataframe.
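One way to do this is with pd.concat. The toy dataframes below stand in for the real per-state results, and all_cases is a hypothetical name.

```python
import pandas as pd

# Toy stand-ins for the per-state dataframes built in the loop.
state_court_d = {
    "Arkansas": pd.DataFrame([{"court": "Arkansas Supreme Court", "year": 1950}]),
    "Illinois": pd.DataFrame([
        {"court": "Illinois Supreme Court", "year": 1960},
        {"court": "Illinois Supreme Court", "year": 1975},
    ]),
}

# Stack the per-state dataframes vertically; ignore_index=True replaces the
# repeated per-state row labels with one fresh 0..n-1 index.
all_cases = pd.concat(list(state_court_d.values()), ignore_index=True)
```

Concatenating once at the end, rather than after every file, avoids copying the accumulated data repeatedly.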
I now have a dataframe of all cases with an opinion in the history of the Arkansas Supreme Court and Illinois Supreme Court. You can time the script, and it performs faster than approaches that do not collect rows as dictionaries before converting to a dataframe. Let's save the dataframe with the pickle library at the end of the script.
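To check the timing claim on your own machine, and to save the result, something like the following could work. It is a sketch with synthetic rows: the function names are hypothetical, and the row-by-row version is just one example of a slower alternative, not a method described in this post.

```python
import pickle
import time

import pandas as pd


def frame_from_dicts(n):
    """Collect rows as dictionaries, then build the dataframe once."""
    rows_list = [{"id": i, "n_words": i * 2} for i in range(n)]
    return pd.DataFrame(rows_list)


def frame_row_by_row(n):
    """Grow a dataframe one row at a time (copies the data on every step)."""
    df = pd.DataFrame([{"id": 0, "n_words": 0}])
    for i in range(1, n):
        new_row = pd.DataFrame([{"id": i, "n_words": i * 2}])
        df = pd.concat([df, new_row], ignore_index=True)
    return df


def time_call(func, n):
    """Return the wall-clock seconds taken by func(n)."""
    start = time.perf_counter()
    func(n)
    return time.perf_counter() - start


def save_pickle(df, path):
    """Write a dataframe to disk with the pickle library."""
    with open(path, "wb") as f:
        pickle.dump(df, f)


def load_pickle(path):
    """Read back an object saved by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, comparing time_call(frame_from_dicts, 2000) with time_call(frame_row_by_row, 2000) should show the gap, though the exact numbers depend on your machine and pandas version. Only load pickle files you created yourself, since unpickling untrusted files can execute arbitrary code.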
And that's it! The increased speed this approach lends is especially useful when working with all 50 state files and when conducting analysis in addition to storing the metadata (such as calculating readability scores or extracting citations to other cases). Stay tuned for future posts analyzing this dataset!