DEV Community

Maurice Borgmeier

Posted on • Originally published at mauricebrg.com

Analyzing my Amazon Purchases with Pandas and Dash

Thanks to some EU regulation or out of the goodness of their hearts, Amazon now allows you to download the data they've stored about you. Getting that done isn't super complicated. I'm going to show you how to do it and how to analyze the resulting data.

Requesting your data

Let's start by requesting our data. You have to log in to your Amazon account; in my case, that's for German Amazon, so the domain may be different for you (e.g., .com instead of .de). Next, we navigate to https://www.amazon.de/-/en/hz/privacy-central/data-requests/preview.html, which allows us to request our data. I was curious about what kind of data they store about me since I had opted out of all the tracking things Amazon lets one opt out of.

Data request form

Scrolling through the drop-down menu already indicates what you can expect; they're very thorough in storing your data. Next, the site confirms that the request has been accepted, and you'll receive an e-mail asking you to confirm that you want to download all the data.

Data request confirmation

Once you confirm that you'd like a copy of your data by clicking on the link in the E-Mail, it's time to wait. It says it can take up to a calendar month for the data to be gathered and sent to you. My first data request took about 6 hours to complete. I started another request three days ago, and that has yet to be completed, so I am prepared to wait.

When they've consulted each of their tens of thousands of databases, they send you an E-Mail with a download link to a ZIP archive. They require you to log into your account again before downloading the data. From a security perspective, that makes sense because, spoiler alert, there's a lot of potentially sensitive data in that archive.


Analyzing the data

Let's see what we've got. Looking into the archive, we can see that the data is in JSON format, which makes working with it reasonably easy. We seem to get data + metadata in the form of a file per category and another one with the JSON Schema that describes the content of the files. What kind of files you get depends on which Amazon Services you're using.

$ tree
.
├── Advertising.InterestBasedOptOut.2.json
├── Advertising.InterestBasedOptOut.2.schema.json
├── Alexa.ShoppingList.2.json
├── Alexa.ShoppingList.2.schema.json
├── CustomerReviews.ReviewsVersions.json
├── CustomerReviews.ReviewsVersions.schema.json
├── DigitalOrders.DigitalItems.2.json
├── DigitalOrders.DigitalItems.2.schema.json
...
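To get an overview of which data files come with a schema, something like the following could help. It is only a sketch: it assumes the extracted archive is a single flat directory as in the listing above, and `pair_with_schemas` is a name I made up for illustration, not part of any Amazon tooling.

```python
from pathlib import Path


def pair_with_schemas(export_dir):
    """Map each data file name to its schema file name (None if there is none).

    Assumption: data files end in ".json" and the matching schema has the
    same name with ".schema.json" instead, as in the tree listing.
    """
    root = Path(export_dir)
    pairs = {}
    for path in sorted(root.glob("*.json")):
        if path.name.endswith(".schema.json"):
            continue  # schemas are looked up below, not treated as data
        schema = path.with_name(path.name[: -len(".json")] + ".schema.json")
        pairs[path.name] = schema.name if schema.exists() else None
    return pairs


if __name__ == "__main__":
    # Hypothetical location of the extracted archive.
    for data_file, schema_file in pair_with_schemas("Amazon").items():
        print(f"{data_file} -> {schema_file}")
```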

In my case, I used to use Amazon Retail a lot, and these days, it's, for the most part, Kindle books that I buy from them. I also paid for Amazon Prime for a few years and used Prime Video.

I'm primarily interested in aggregate statistics of how much money I've been throwing at Amazon. A brief look at the data tells me that I'll have to do some data engineering first. In the data, a few things are tracked separately:

  • Physical goods
  • Returns/refunds for physical goods
  • Digital goods (Kindle Books, Movies and Software)
  • Returns/refunds for digital goods

To make it more complicated, the prices and taxes for digital goods are broken down into individual line items, among other quirks.

Given that this is probably only a simplified export of the real underlying systems, I can only imagine what kind of complexities they have to handle to make things happen. Anyway, I don't plan to recreate Amazon; I just want to know what I gave them money for, so let's create a simplified order item and then write a few scripts to parse the information from the export into a common format.

Purchased Item

Attribute    | Data Type | Description
-------------|-----------|------------------------------------------------------------------
id           | string    | Unique identifier of this order item
description  | string    | Description
category     | string    | One of: physical, digital, membership
price        | decimal   | Price that I paid for this order item
refunded     | bool      | True if the item has been refunded, otherwise False
timestamp    | string    | ISO8601 formatted datetime string in UTC when the item was bought
year         | int       | Year of purchase
month        | int       | Month of purchase
day_of_month | int       | Day of Month of purchase
day_of_week  | int       | Day of the Week of the purchase
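The attributes in the table above could be modelled roughly like this. This is a minimal sketch and not necessarily how the scripts in the repository do it: the class name `PurchasedItem` is taken from the heading, while the derived properties, the treatment of naive timestamps as UTC, and the Monday=0 numbering for `day_of_week` are my assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class PurchasedItem:
    id: str
    description: str
    category: str  # one of: physical, digital, membership
    price: Decimal
    refunded: bool
    timestamp: str  # ISO 8601 datetime string in UTC

    def purchased_at(self) -> datetime:
        """Parse the timestamp into a timezone-aware UTC datetime."""
        # Older Python versions don't accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(self.timestamp.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            # Assumption: a timestamp without offset is already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)

    @property
    def year(self) -> int:
        return self.purchased_at().year

    @property
    def month(self) -> int:
        return self.purchased_at().month

    @property
    def day_of_month(self) -> int:
        return self.purchased_at().day

    @property
    def day_of_week(self) -> int:
        # Assumption: Monday is 0 and Sunday is 6 (datetime.weekday()).
        return self.purchased_at().weekday()
```

Deriving the date parts from the timestamp instead of storing them separately means they can never disagree with it; for a flat summary file, you would write them out as plain columns.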

I wrote some scripts to help me analyze this, which I made available on GitHub. The first step is parsing the purchase history and storing that in a summary file:

$ parse-export ~/Downloads/Amazon data.json

This should only take a moment. Next, I wrote a small Dash web app to visualize my spending in a few graphs. Since I care mostly about the financial side, I focused on that; feel free to extend it to whatever you need.

$ visualize-parsed data.json

The command starts a web server on http://127.0.0.1:8050/, and once you open that site in your browser, you'll see the aggregated data. (All of this stays local; I have no interest in your data, and I suggest you take a look at the code to confirm this.)

In the beginning, the website summarizes some key information about your purchases. In my case, I was surprised that I'd been an Amazon customer for more than half my life. During that time, I ordered many things, in many cases not for me but for my relatives or work, but still, I was surprised at the 43k€ figure.

Screenshot: Summary

I also added a bar chart of spending per year and category that distinguishes mainly between physical and digital items. In recent years, the digital share has been rising steadily while spending on physical goods seems to be declining, which can be explained by my move to a place with no Amazon. I think this chart also makes it relatively easy to see when I started to earn my own money.

Screenshot: Spending by year and category
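The aggregation behind a chart like this is short in Pandas. The following is a sketch under assumptions, not the code from the repository: it expects records with the `year`, `category`, `price` and `refunded` attributes from the Purchased Item table, it leaves refunded items out of the totals (my choice), and the function name is made up.

```python
import pandas as pd


def spending_by_year_and_category(items):
    """Return a table with one row per year and one column per category."""
    df = pd.DataFrame(items)
    # Floats are good enough for a chart; keep Decimal for exact accounting.
    df["price"] = df["price"].astype(float)
    kept = df[~df["refunded"]]  # assumption: refunded items don't count
    totals = kept.groupby(["year", "category"])["price"].sum()
    # Move the categories into columns; missing combinations become 0.
    return totals.unstack(fill_value=0.0)


if __name__ == "__main__":
    # Made-up sample records, only to show the shape of the input.
    sample = [
        {"year": 2022, "category": "physical", "price": "20.00", "refunded": False},
        {"year": 2022, "category": "digital", "price": "9.99", "refunded": False},
        {"year": 2023, "category": "digital", "price": "5.00", "refunded": False},
    ]
    print(spending_by_year_and_category(sample))
```

The resulting table has the shape a stacked or grouped bar chart needs: years on one axis, one series per category.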

Filtering this to digital spending tells me something about the cost of my reading hobby. These are almost exclusively E-Books, and in earlier years, the physical items also included a significant number of books. In total, I've spent more than 2500€ on E-Books, and I imagine that Amazon and the publishers have a very healthy profit margin here. Anyway, I think that money was very well spent.

Screenshot: Digital Spending by Year

I also looked at my spending by day of the week, but that wasn't very interesting - nevertheless, the graph is included. A histogram of the product prices also didn't yield many insights, but I'm also not really sure what I expected.

If you want to do this yourself, check out the GitHub repository. I briefly considered building a webapp where you just upload your export file and get the visualization, but I decided that I don't want to be responsible for or have access to that kind of data. I don't mind sharing these aggregate metrics with you here, but the details should stay private.

I'd appreciate it if Amazon provided these kinds of statistics to its customers directly. They're trivial to compute, and I can almost guarantee that they're using these metrics in their fraud detection systems and others. I suspect that looking at this data may make some people question their spending habits, which is not necessarily in Amazon's interest.

Let's step away from the financial information and explore the other data in the export. In my case, I was surprised that I have two Alexa Lists even though I've never touched the service as far back as I can remember. The lists appear to be empty, so they are probably some default lists created for every customer.

There is also something about personalized advertising in the form of a file starting with Advertising.InterestBasedOptOut, which looks like this:

Personalized Ads

It seems that I've opted out of personalized ads for the German Amazon Marketplace (DE) and have been opted out for more than five years, but this is apparently tracked per country. I've yet to find the option in the interface to opt out for all countries.

The CustomerReviews.ReviewsVersions.json contains all product reviews that you've ever created. I usually only rate E-Books, but apparently, I also wrote some reviews in 2017 that I had completely forgotten about. While the data structure is pretty intuitive, it would be nice if it included some information about the reviewed product.

Reviews

Fortunately, it's not too tricky to get to the actual product; the asin is the product identifier, and you can just create a URL like this: https://amazon.de/dp/<asin> to get to the item in question. This particular item is the Kindle version of Dune by Frank Herbert.
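If you want to do that for many reviews at once, a tiny helper is enough. This is just a sketch: the function name and the default domain are my choices, and the ASIN in the example is a made-up placeholder.

```python
def product_url(asin, domain="amazon.de"):
    """Build the product page URL for an ASIN on the given marketplace domain."""
    asin = asin.strip()
    if not asin:
        raise ValueError("asin must not be empty")
    return f"https://{domain}/dp/{asin}"


if __name__ == "__main__":
    # "B000000000" is a placeholder, not a real product identifier.
    print(product_url("B000000000"))
```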

It gets a bit more creepy with the SearchHistory files. You can see all the searches you've ever done including detailed information about the device, such as the OS-Version. The oldest records in my sample were from October 2022, so they may delete this data after around 18 months, but I'm not sure about that.

Search History

The files SearchHistory.Products and SearchHistory.Queries don't share a common identifier, but with the event data, it's probably possible to figure out which products you clicked on after a given search - potentially interesting if you want to retrace your decision-making process.

Conclusion

In this post, I've looked at the export of my Amazon data and analyzed my spending patterns. I try to disable as many tracking options as possible, so I expected a fairly limited dataset. I can see the need to retain the information that they have stored about me, but I'd like a bit more control over the storage of my search history. Additionally, I don't understand why the interest-based ad opt-out should be per marketplace and not global; this seems shady.

— Maurice
