Extracting Goodreads metadata

This week I posted the list of books and reviews for 2019 and, in addition, published the decades in which the books were published. That information wasn't easy to extract from Goodreads, so I'm writing it up for when I need it again in 12 months.

Overview

I'm using a Mac, but all the tools I use are on the command line. For this process to work you'll need (or I'll need):

jq installed (or use jqterm.com)
xml-to-json-fast - there's nothing specific about this XML to JSON tool, just that's what worked for me
A Goodreads account and an API key - if only to get metadata about a book title

Beyond the account, you'll need the titles of your books. I go a step further and use the reviews I've posted on Goodreads to source the book titles from the previous year, but that isn't required.

The process will do the following:

Hit the Goodreads search API for the title of your book
Convert the search result from XML to JSON
Slurp the JSON and transform to HTML/Markdown reorganised by decade and year

1. Getting the year from Goodreads

Using the source title list, we'll generate a list of curl requests that hit the search.books API and convert from XML to JSON. This is the first jq command I'll use:

split("\n")[] | select(length > 0) | @uri"curl -s \"https://www.goodreads.com/search/index.xml?key=${YOUR_API_KEY}&q=\(.)\" | xml-to-json-fast | jq -r -f book-year.jq"

This takes a list of titles as a source and generates a long list of curl requests. Note that I'm assuming YOUR_API_KEY has been inserted, that I'm using xml-to-json-fast, and that the converted JSON is piped to a jq script called book-year.jq, which I'll show you next.

Note that the input is both slurped and raw (jq's --slurp and --raw-input flags) and the output is raw (--raw-output). I've saved this output to a single file, get-years.sh, and on my command line I'll run sh get-years.sh so the whole thing runs at once.

Before I run this code, I need to create the book-year.jq script.
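
As an aside, the same command generation can be sketched in Python for anyone who would rather not do this step in jq. This is only an illustration: the function name curl_commands is made up for this example, and urllib.parse.quote is standing in for jq's @uri, so the exact set of escaped characters may differ slightly between the two.

```python
from urllib.parse import quote


# Hypothetical helper (not part of the original workflow): builds one
# curl pipeline per title, mirroring the jq one-liner above.
def curl_commands(titles_text):
    commands = []
    for title in titles_text.split("\n"):
        if not title.strip():
            continue  # skip blank lines, e.g. the one left by a trailing newline
        query = quote(title, safe="")  # percent-encode the title for the query string
        # ${YOUR_API_KEY} is left literal so the shell (or you) can fill it in
        url = f"https://www.goodreads.com/search/index.xml?key=${{YOUR_API_KEY}}&q={query}"
        commands.append(f'curl -s "{url}" | xml-to-json-fast | jq -r -f book-year.jq')
    return commands


# Titles taken from the sample output later in this post
print("\n".join(curl_commands("Skipping Christmas\nI Am Legend\n")))
```

Redirecting the printed lines to get-years.sh would give the same kind of file as the jq approach.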

2. Extracting the year

A file called book-year.jq in the current working directory will contain:

def mapper:
  if type == "object" and .items then
    { (.name): .items[-1] | mapper }
  elif type == "array" then
      reduce .[] as $el ({}; . + { ($el.name): $el.items[-1] | mapper } )
  else
    .
  end
;

.items[1].items |
last.items[0].items |
mapper + (.[-1].items | mapper) |
{ title, year: .original_publication_year, author: .author.name }

This script will take the JSON output from xml-to-json-fast and transform it into a single object that contains the title, year and author.
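
To make the mapper function a little easier to follow, here is a rough Python translation of it. The input below is a hypothetical, heavily reduced tree that only assumes the name and items keys the jq script relies on; it is not real Goodreads or xml-to-json-fast output.

```python
def mapper(node):
    """Rough Python equivalent of the jq `mapper` function above."""
    if isinstance(node, dict) and node.get("items") is not None:
        # An element: key it by name and keep only its last child
        items = node["items"]
        return {node["name"]: mapper(items[-1] if items else None)}
    if isinstance(node, list):
        # A list of elements: merge them into one object keyed by name
        result = {}
        for el in node:
            items = el.get("items") or []
            result[el["name"]] = mapper(items[-1] if items else None)
        return result
    # Plain values (strings, None) pass through untouched
    return node


# Hypothetical, simplified input: NOT actual API output
elements = [
    {"name": "original_publication_year", "items": ["2001"]},
    {"name": "title", "items": ["Skipping Christmas"]},
    {"name": "author", "items": [
        {"name": "id", "items": ["1"]},
        {"name": "name", "items": ["John Grisham"]},
    ]},
]

book = mapper(elements)
print({
    "title": book["title"],
    "year": book["original_publication_year"],
    "author": book["author"]["name"],
})
```

Note how only the last child of each element survives, which is why author ends up as an object holding just name. The jq version in book-year.jq is still what the pipeline actually runs.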

As this script is run multiple times, I'll put all the curl commands into a single bash script - that way I can either abort the process or more importantly, I can capture the output in a single file.

sh get-years.sh > book-year.json

Once the process finishes you should have a file that looks a bit like this:

{
  "title": "Skipping Christmas",
  "year": "2001",
  "author": "John Grisham"
}
{
  "title": "Death's End (Remembrance of Earth’s Past #3)",
  "year": "2010",
  "author": "Liu Cixin"
}
{
  "title": "The Afterlife of Walter Augustus",
  "year": "2018",
  "author": "Hannah M. Lynn"
}
{
  "title": "Miss Pettigrew Lives for a Day",
  "year": "1938",
  "author": "Winifred Watson"
}
{
  // and so on
}

The important thing to note is that this is not valid JSON - it is actually a stream of JSON objects, but jq is fine with consuming a stream.

One final touch: the Goodreads API is not great and can be missing some data at random. In my case, 3 of the 27 titles I sent to their API were missing a year, so I had to add those manually. To find any missing years, you can run the following on the command line:

cat book-year.json | jq 'select(.year == null)'

Then edit book-year.json directly, adding in the missing data.
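
If you want to do the same check outside jq, note that json.loads would choke on a stream of objects, so the file has to be decoded one value at a time. A minimal Python sketch of that, assuming the stream format shown above: the helper names iter_json_stream and missing_years are made up for this example, and the second record is a placeholder rather than a real book.

```python
import json


def iter_json_stream(text):
    """Yield each value from a string of concatenated JSON values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # raw_decode does not skip leading whitespace itself
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def missing_years(books):
    """Same idea as jq's select(.year == null): null or absent year."""
    return [book for book in books if book.get("year") is None]


# The second object is a placeholder standing in for a title with no year
sample = """
{ "title": "Skipping Christmas", "year": "2001", "author": "John Grisham" }
{ "title": "Placeholder Title", "year": null, "author": "Placeholder Author" }
"""

books = list(iter_json_stream(sample))
for book in missing_years(books):
    print(book["title"])
```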

3. Transforming to readable content

The aim is to restructure the data so that the books are grouped into their decades and then sorted by year. To do this I need to add the decade to each title and then reduce the dataset into a decade-keyed object.

The jq command needs to slurp the source JSON using the --slurp flag:

cat book-year.json | jq --slurp '…'

The following can transform the JSON into the "right" structure:

map({
  # construct a title prop that is: "[year]: [title] by [author]"
  title: "\(.year): \(.title) by \(.author)",
  # to get the decade I slice first 3 year chars + "0"
  key: (.year | .[:3] + "0")
}) |

# sort by the decade
sort_by(.key) |

# reduce to { [decade]: ["[year]: [title] by [author]", …], … }
reduce .[] as $e (
  {};
  . + { "\($e.key)": (.["\($e.key)"] + [$e.title] | sort ) }
)

You can run the code above in jqterm. The result is now structured the way I want for posting:

{
  "1920": [
    "1921: We by Yevgeny Zamyatin"
  ],
  "1930": [
    "1938: Miss Pettigrew Lives for a Day by Winifred Watson"
  ],
  "1950": [
    "1953: Fahrenheit 451 by Ray Bradbury",
    "1954: I Am Legend by Richard Matheson",
    "1954: Lord of the Flies by William Golding",
    "1955: The Chrysalids by John Wyndham"
  ]
}

The final part (for me) is to transform this into markdown so that I can paste it into my blog post. I do that by piping the result into the following line and selecting the output as "raw":

to_entries | map("## \(.key)\n- \(.value | join("\n- "))\n")[]

The result is now ready for my blog post:

## 1920
- 1921: We by Yevgeny Zamyatin

## 1930
- 1938: Miss Pettigrew Lives for a Day by Winifred Watson

## 1950
- 1953: Fahrenheit 451 by Ray Bradbury
- 1954: I Am Legend by Richard Matheson
- 1954: Lord of the Flies by William Golding
- 1955: The Chrysalids by John Wyndham
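
For comparison, the grouping and markdown steps of this section could be sketched in Python roughly like this. It assumes the records have already been loaded into a list of dicts with title, year and author keys; the function names group_by_decade and to_markdown are invented for this example.

```python
def group_by_decade(books):
    """Group "[year]: [title] by [author]" lines under their decade."""
    grouped = {}
    for book in books:
        decade = str(book["year"])[:3] + "0"  # first 3 year chars + "0"
        line = f'{book["year"]}: {book["title"]} by {book["author"]}'
        grouped.setdefault(decade, []).append(line)
    # Sort decades, and sort the lines (and so the years) within each decade
    return {decade: sorted(lines) for decade, lines in sorted(grouped.items())}


def to_markdown(grouped):
    """Render one "## decade" heading followed by a bullet per book."""
    sections = []
    for decade, lines in grouped.items():
        sections.append(f"## {decade}\n- " + "\n- ".join(lines) + "\n")
    return "\n".join(sections)


# Records taken from the sample output in this post
books = [
    {"title": "Miss Pettigrew Lives for a Day", "year": "1938", "author": "Winifred Watson"},
    {"title": "Lord of the Flies", "year": "1954", "author": "William Golding"},
    {"title": "Fahrenheit 451", "year": "1953", "author": "Ray Bradbury"},
]

print(to_markdown(group_by_decade(books)))
```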

You can tinker with the final result. I hope this was useful, albeit in parts!

Originally published on Remy Sharp's b:log