I’ve completed a lot of play-testing lately for my game Carnival. It’s a fascinating experience to watch people work through the puzzles. And it’s a humbling experience as people stumble and flounder on failings in my designs. Without a doubt, focused user testing has made my game significantly better. Here I’d like to write about the principal things I’m watching for.
A friend of mine spurred this article, as she discovered I keep extensive notes during the play-test. She also tested the game and was curious to know if I wrote anything bad about her. I assured her it was all good stuff, while silently discarding the evidence behind my back.
Carnival is played online in the browser with teams that meet in an audio/video chat. While the faces can be helpful for testing, it’s mainly the audio I’m listening to. For the video, I have one participant share their screen. It’d be great if I could have all participants share their screens, but I’d probably need more monitors to make sense of it.
Hearing the people play is super useful, in addition to watching what they do on screen. Where they move their mouse can be telling, but their “hmms” and “huhs”, exasperated laments, and cries of foul tell even more. All of this helps me understand their thought process in solving puzzles and ease the play for future players.
In the early rounds of play-testing, I instruct people about what to expect and how the game works. This allows me to test prior to the game being finished, avoiding some redundant work. I try to reduce the verbal preamble as quickly as possible for subsequent teams, replacing it with the in-game systems. By the late stages of testing, I start with only a few hellos, then send the teams off on their own.
I don’t record these sessions. First off, I don’t care to deal with the privacy aspect of archiving such data. But second, I’d never watch them again. It’s almost always better to test with a new group of people than to labour endlessly over a single session. More people equals better testing, especially with a puzzle-oriented game.
As mentioned, I take a lot of notes for these sessions, where lots means 2-3 pages of cryptic scrawling for the roughly 90-minute test sessions. A symbol showing the type of comment prefixes each line. At least in theory; in practice, some lines have nothing before them. If something occurs to me during the test, I note it down, even if I’m unsure it’ll be helpful.
For clarity here, I’ll replace my symbols with emojis. That way we can debate their semantic importance, rather than dissect my post-modern scribbling approach to art.
This became the most important symbol in testing. Red herrings are a problem in puzzle games. These are things (an item, a graphic, some dialog, a pattern, virtually anything) that misdirect the player. The player already has to resolve many important pieces of information in their head.
I’m talking about unintentional red herrings. These are colour patterns, or objects, that reasonably look like clues to a puzzle. They arise naturally out of the graphic design by accident. I have nothing but ire for designers who intentionally add red herrings in their game. It’s undebatably poor design that frustrates players. There are enough unintended red herrings to deal with already.
For example, I have a puzzle in Carnival where you need to set a row of lights to the correct colours. I intend them to match another pattern on the screen, as hinted in some dialog. Lo and behold, some testers found another sequence of colours, in some flags, that seemed to match the lights as well. They then proceeded to match that pattern, which of course failed. This gets a big “🐟” mark in the notes.
🐟 sequence of flags matches light pattern
Many small, sometimes large, obvious ideas crop up during the tests. These may be ways to improve the puzzles, graphics improvements, overall UX ideas, or basically anything. The ⭘ means I had a concrete thought and I should go back and improve it later — at which point I add a ✔ to it.
Each circle is a clear opportunity to improve the game. Rather than theorizing about things to improve, these all come from actual players. Fixing them would have a direct impact on some future player. This doesn’t mean they all get resolved, as priorities still play a role, and some of them are hard to fix.
One example I have is with a ticket in the game. The player acquires the ticket and must present it to get into the carnival. I thought it’d be obvious that you show the ticket to the man in the booth, but one player tried to use the ticket on the entry sign. It’s not ridiculous, since the sign, with a “No Entry” label hanging on top, is where you enter the park. What happens then is frustration. In their head the player has resolved the situation: they found a ticket and are using it to get inside. It not working is like throwing a wrench into their thought process.
The obvious resolution was to make the sign accept the ticket, and that’s the note I made:
⭘ tried show ticket to sign, accept ticket
Though, I ended up using a more humorous fix. The booth agent cracks a joke about an inanimate sign. This affirms that the player’s action made sense, but was slightly off. Additionally, it makes it clear where the ticket should be used instead. Oh, I could talk at length about this leading…
Whenever a player solves a puzzle, I note it with a little flag, and the time.
🏁 rabbit hand puppet, 23:03
This is the wall time, since it’s what I have most available. I record the start of the session as well, letting me later calculate the time offsets between puzzles.
My goal here is to establish a good pacing for the game. Droughts, long periods between solutions, demotivate the player and increase the likelihood they dislike the game. Given my experience with the previous game, and numerous escape rooms I’ve played, the pacing mostly worked as is.
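The offset calculation could be sketched roughly like this in Python. This is only an illustration under assumptions: it treats the noted times as 24-hour HH:MM clock readings, the function name `solve_offsets` is hypothetical, and the session start and every entry other than the rabbit hand puppet time are made-up sample values.

```python
from datetime import datetime, timedelta

def solve_offsets(session_start, solves):
    """Return (puzzle, minutes since start, minutes since previous solve)."""
    fmt = "%H:%M"  # assumption: wall times are noted as 24-hour HH:MM
    start = datetime.strptime(session_start, fmt)
    previous = start
    rows = []
    for puzzle, wall_time in solves:
        solved = datetime.strptime(wall_time, fmt)
        # A clock reading earlier than the previous one means the
        # session crossed midnight, so move it to the next day.
        if solved < previous:
            solved += timedelta(days=1)
        since_start = int((solved - start).total_seconds() // 60)
        since_previous = int((solved - previous).total_seconds() // 60)
        rows.append((puzzle, since_start, since_previous))
        previous = solved
    return rows

# Hypothetical session; only "rabbit hand puppet, 23:03" comes from the notes.
sample = [
    ("ticket booth", "22:41"),
    ("rabbit hand puppet", "23:03"),
    ("dart game", "23:10"),
]
for puzzle, total, gap in solve_offsets("22:30", sample):
    print(f"{puzzle}: {total} min in, {gap} min since last solve")
```

The gap since the previous solve is the number that matters for pacing, since a long gap marks a possible drought.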
However, there were some clear problems on some puzzles — they took too long. In most cases, I didn’t need to know the time to see that people were frustrated. But sometimes the time helped decide the frustration was okay. Perhaps the player was being impatient, rather than having an actual problem.
Timing can also deceive. In one game the players took nearly 15 minutes for one puzzle, well beyond the typical 5 I consider a maximum — these aren’t like fundamentally fixed numbers, so don’t quote me on them. But after the puzzle they both said “wow, that was great” — or something to that effect, I can’t always read my writing. This makes evaluation tricky, but at least the timed notes give some help.
Everybody gets stuck, but an experienced team should not need hints. It’s why I treat hint requests during play-testing seriously. The game has a built-in progressive hint system, so all I do is note where they requested the hint, and how many they needed.
❓❓❓ dart game
You might think requesting a lot of hints is bad, but for experienced teams, if they need one, they typically need many. This results from the hints being progressive; the players often already know the information in the first couple of hints, and only the later one reveals something new. Short of deciphering the players’ thoughts with electrodes plugged into their brain, there’s no real way to avoid this.
Well, there is a way to avoid some hints, and I do that. Some hints can be tied to game actions, in particular with the inventory. But the general purpose hints can’t be tied to in-game events. I can never really be certain the player already knows the hint.
Requesting hints is part of puzzle games. It’s totally fine if players ask for them, but it shouldn’t be the standard approach to solving a puzzle. And since people who play-test are mainly interested in this genre of game, it biases the results — I assume the average player will require more. It’s a good sign when some play tests have no ❓'s on them. Past that point, I can consider each tension point more carefully.
To give context to my notes, and the players’ thought process, I take several notes about what they say, or what they do. The latter also uses quotes, because I couldn’t think quickly what other symbol made sense, and in practice the notes are mixed.
‟Is this random? click on light
These notes primarily serve as anchors to the other ones.
I’ll also make notes of solutions they’ve tried that have failed. These can often give ideas of how they are thinking, or, sometimes, I end up accepting alternate solutions if they seem equally valid. For some of these I’ll end up using a ⭘ ‟ combination in the notes.
⭘ ‟12345, hmm, doesn’t work
I put this last since it’s not the point of this level of play-testing. I have already resolved most of the functional defects, and the game is fully playable before I begin play-testing. Several minor defects still appear, and if I’ve recently changed something, engine defects are possible (the frightening and thankfully rare ❗❗).
❗ B-girl font not converted path
The ❗ is for things that are 100% definitely technical defects. This could be a graphic that is missing, the wrong font used somewhere, or a typo in the text. I suppose I could use them for defects in the puzzles, though oddly, I’ve not had that situation come up yet. As this is a multiplayer game played over flaky networks, I’ve tried hard, from the start, to make the logical game state consistent. That appears to work. While defects are possible in the puzzles still, I’ve likely worked them out prior to starting play-testing.
I think that’s an important point: I’ve played the game entirely many times before I do any play-testing. I want play-testing to focus on the things I can’t find myself. Even the initial play-testers get a game that is working, albeit potentially without the hint system, and not all graphics in their final form, but functionally playable from beginning to end.
These symbols are a general guide to the notes I take. I still have other things written, sometimes in combination, or sometimes with no markings.
I have found though that trying to itemize my thoughts produces firm notes. Rather than write everything, and anything, I focus on specific action items:
-🐟 Something is misleading
-❗ Something is broken
-⭘ Point for improvement
-❓ A puzzle may have a problem
The 🏁 and ‟ notes then anchor those points, giving context when I go back later.
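If the notes were ever typed up, the split between action items and anchors could be sketched like this in Python. It is a rough illustration only: the names `ACTION_SYMBOLS`, `classify`, and `tally` are hypothetical, the category labels are shortened from the list above, and the sample lines are the example notes quoted earlier in this article.

```python
from collections import Counter

# Action-item symbols and shortened labels; 🏁 and ‟ are anchors, not action items.
ACTION_SYMBOLS = {
    "🐟": "misleading",
    "❗": "broken",
    "⭘": "improvement",
    "❓": "puzzle problem",
}

def classify(line):
    """Return the category of a note line, or None for anchors and unmarked lines."""
    text = line.strip()
    for symbol, category in ACTION_SYMBOLS.items():
        if text.startswith(symbol):
            return category
    return None

def tally(lines):
    """Count action items per category, skipping anchors and unmarked lines."""
    return Counter(c for c in map(classify, lines) if c is not None)

notes = [
    "🐟 sequence of flags matches light pattern",
    "⭘ tried show ticket to sign, accept ticket",
    "🏁 rabbit hand puppet, 23:03",
    "❓❓❓ dart game",
    "‟Is this random? click on light",
    "❗ B-girl font not converted path",
]
print(tally(notes))
```

A repeated symbol such as ❓❓❓ counts as a single item here; counting the individual hints would need a small extension.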
While not a perfectly defined process, that’s about how I did play-testing of Carnival. Watching people play, think, and laugh is fascinating. I also love watching the streams of people playing my game, as it gives another insight. I take notes from those as well.