DEV Community

Megan

Videogame Text Datasets Release

Background

In 2016 I released LibraryofCodexes, a website that aimed to gather videogame text in one uniform place (think in-game notes, books, letters, audio recordings, etc.). I built it because I found I was too engaged in finishing a quest or killing a monster to take the time to read that text, and while wikis exist, they can be tedious to navigate. The website has gone through a few iterations since 2016. Most of the original design was stripped away in favor of an eBook repository, and the database that holds each individual entry is now private.

However, I've recently started reading a few academic papers on Natural Language Processing applied to videogames (it's a rather small domain). I realized that there is a lack of easily accessible text, and that I've been sitting on a dataset for the past few years that just needed to be formatted and released.

Datasets

I've gone ahead and released the full dataset in JSON format on GitHub. At the time of release, the repository contains a slew of different game series (see the full list below). Each videogame has its own README, which details what data has been collected, what quirks it has, and the degree of sanitization.
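For anyone who wants to poke at the files right away, here is a minimal sketch of loading everything from a local clone. It assumes only that the data lives in `.json` files somewhere under the clone's root; the function name `load_json_files` is made up for this example, and since the structure of each file varies by game, check that game's README before relying on any particular fields.

```python
import json
from pathlib import Path


def load_json_files(root):
    """Return {relative path: parsed JSON} for every .json file under root.

    Nothing is assumed about the folder layout or the schema inside each
    file; the parsed content is returned exactly as json.load produces it.
    """
    root = Path(root)
    loaded = {}
    for path in sorted(root.rglob("*.json")):
        # Read as UTF-8 explicitly so results do not depend on the OS default.
        with path.open(encoding="utf-8") as handle:
            loaded[path.relative_to(root).as_posix()] = json.load(handle)
    return loaded
```

Calling `load_json_files` with the path to your clone gives you a dictionary keyed by file path, which is a reasonable starting point for counting entries or feeding text into a tokenizer.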

Videogame Series List

  • Assassin's Creed
  • Baldur's Gate
  • Battlefield
  • Crysis
  • Dead Space
  • Destiny
  • Deus Ex
  • Diablo
  • Doom
  • Dragon Age
  • Dying Light
  • Fable
  • Fallout
  • Gears of War
  • Horizon Zero Dawn
  • Kingdoms of Amalur
  • Mass Effect
  • Metroid Prime
  • Middle-Earth
  • Nier
  • Red Dead Redemption
  • Resident Evil
  • Star Wars: The Old Republic
  • System Shock
  • The Division
  • The Elder Scrolls
  • The Last of Us
  • The Witcher
  • Tomb Raider
  • Watch Dogs
  • World of Warcraft

Hopefully this can help make someone's research just a little bit easier. I will continue to update the repository in the future as I update LibraryofCodexes with new games.
