DEV Community

Will Blaschko

Posted on • Originally published at linkedin.com

Falling Blocks - "Tetris"​ On Alexa (APL)

Falling Blocks Game

TL;DR

I created a fully working Tetris clone (named Falling Blocks) for Alexa using Alexa Presentation Language (APL) that you can play in real time, controlled entirely by voice. Building on the techniques in my last post, the gameplay area is an 8x12 grid of Pagers controlled by SetPage commands for forced animation. The biggest challenge was syncing the frontend with the backend, which I solved by abusing ControlMedia (Play/Rewind) commands (in Parallel) and listening for onPlay events on each tick of gameplay. Big thanks to Mark S. (an APL genius) for suggesting this solution!

UK Skill - US Skill

Watch the gameplay video

See the Gists

Introduction

In this post, I will talk about the Alexa (response size) and device (performance) constraints for the Falling Blocks project, share the math I used to calculate viability of the project, and cover the key elements of the solution (including gists).

During my most recent trip to Italy, I had an informal brainstorming session over dinner with my Spanish and Italian colleagues who were amused by my last post. With a mutual interest in games (classic and modern) we started discussing what other games could be APL-ified. It is with only a little shame that I admit that I literally jumped up and down with excitement (multiple times) when we stumbled upon Tetris.

I have a couple other projects in the queue, but the ability to execute on real-time voice gameplay was too tempting to pass up, so I bumped Tetris to top priority. Fast forward two weeks and my finger is hovering over the button to submit for certification.

Of the ~20 hours I spent on this project, about 8 were fighting with the frontend/backend sync challenge. After a very quick chat with Mark, I was able to completely resolve the issue in about an hour, reusing most of the code I had already written (the advantages of working with talented people).

As always, everything here could break tomorrow. Caveat emptor.

Constraints

There are two primary constraints I had to keep in mind for this project: 1) the response to Alexa needs to be under 24kb (~24,000 characters), and 2) the more components, the longer an Echo device takes to render and then execute commands. I found that at around 300 components, my Echo Show starts to slow noticeably. Originally, my plan was to have a 12x20 grid, but I had to scale this back due to 10+ second render times (on device, and even longer in the emulator).

For this project, I was able to trade one constraint for the other and I experimented in a few different ways before returning to my initial design:

  1. Define grid groupings, so that one Pager controls two or more tiles (3 * 3 state combinations for a pair), halving the number of SetPage commands required. This would have allowed twice as many gameplay commands, but 864 components (9 * 8 * 12) was too many for the device (and completely froze the emulator).
  2. Pre-calculate coordinates and sizes and render the tiles at absolute positions, resulting in a much larger APL document. Despite requiring no dimension calculations (relative/absolute percentages) at runtime, I saw a huge performance hit. In the future, I may build some simpler benchmarks to figure out the root cause, because this result seems backwards to me.
  3. Reload the document. Even with the goal of zero-load gameplay, this crossed my mind while I was trying to figure out how to reset the board quickly. The updated document and data source (without optimization) were about the same total size as the commands required to reset all the Pagers, but the load time killed this option.
  4. Calculate the delta after a voice command and change only those tiles. With a limited response size, I wanted to minimize the number of Pagers I sent SetPage commands to. There were some syncing issues (even after delta calculation) leading to ghost tiles, so it proved more prudent to completely reset the board once after a voice command and then do delta calculations for each tick.
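The per-tick delta from option 4 can be sketched as a simple diff between two grid states. The page values and the two-character componentId scheme below are my assumptions, not the project's actual identifiers:

```javascript
// Sketch of the delta strategy: diff two grid states and emit SetPage
// commands only for tiles that changed. Pager pages: 0 = empty,
// 1 = inactive, 2 = active. The componentId scheme is hypothetical.

const COLS = 'abcdefgh';    // X = 8
const ROWS = 'abcdefghijkl'; // Y = 12

function tileId(x, y) {
  return COLS[x] + ROWS[y];
}

// Returns the SetPage commands needed to turn `prev` into `next`.
function diffCommands(prev, next) {
  const commands = [];
  for (let y = 0; y < next.length; y++) {
    for (let x = 0; x < next[y].length; x++) {
      if (prev[y][x] !== next[y][x]) {
        commands.push({ type: 'SetPage', componentId: tileId(x, y), value: next[y][x] });
      }
    }
  }
  return commands;
}
```

A shape moving one step typically touches only its old and new cells, which is what keeps each tick far cheaper than a full-board reset.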

Quick Math

(Please check my math!)

Assuming we follow the original design (which I did), one tile equals one Pager. Below, active means the tiles that comprise the falling shape; inactive means the settled structure tiles (formerly active); and empty means unused grid tiles.

Our grid has the dimensions X by Y. Each grid tile has 3 states (active, inactive, empty), displayed as Frames within a parent Pager component. Our number of grid-tile components is therefore 4XY (one Pager plus three Frames per tile). With the Container components used to wrap them, we have an additional Y + 1 (one Container to hold everything and one Container per grid row).

With this design, our minimum number of components is therefore: 4XY + Y + 1.

Our SetPage commands break down into one full reset (XY commands, one per Pager) plus one delta per tick. The delta is a minimum of 4 SetPage commands (the O-Shape moving) and a maximum of 4X (clearing 4 rows with the I-Shape). More about shapes.

The minimum number of characters I found per SetPage command is 47 (where the componentId is a two-character grid location):

{"type":"SetPage","componentId":"aa","value":0}
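The 47-character figure is easy to verify by serializing the command object directly:

```javascript
// Serializing a minimal SetPage command: with a two-character componentId,
// the JSON comes out at exactly 47 characters.
const cmd = { type: 'SetPage', componentId: 'aa', value: 0 };
const serialized = JSON.stringify(cmd);
console.log(serialized.length); // 47
```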

Ignoring all the other response data required (session attributes, speak/reprompt, other commands), that leaves us with the following equations for the number of ticks that fit in one response:

Minimum Number of Commands (worst case, I-Shape) = (24000/47 - XY)/4X

Maximum Number of Commands (best case, O-Shape) = (24000/47 - XY)/4

As stated above in Constraints, if you can decrease X or Y (by either reducing grid size or by spanning multiple tiles with a single Pager), you can greatly increase the number of SetPage commands you can send. Lengthier autonomous gameplay therefore comes at the cost of higher device performance requirements.

In the case of my 8x12 grid, we have:

Min Components: 397

Min Commands: 12

Max Commands: 103
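Those three numbers fall straight out of the formulas above; a quick sanity check for the 8x12 grid:

```javascript
// Verifying the budget math for an X-by-Y grid under the ~24,000-character
// response limit and 47 characters per SetPage command.
function budget(X, Y) {
  const minComponents = 4 * X * Y + Y + 1;  // Pagers + Frames + Containers
  const remaining = 24000 / 47 - X * Y;     // commands left after a full reset
  return {
    minComponents,
    minTicks: Math.floor(remaining / (4 * X)), // worst case: I-Shape clear, 4X per tick
    maxTicks: Math.floor(remaining / 4),       // best case: O-Shape move, 4 per tick
  };
}

console.log(budget(8, 12)); // { minComponents: 397, minTicks: 12, maxTicks: 103 }
```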

An intelligent solution could calculate the space left after each tick to see if there’s room for another, but this calculation would need to be done as the last step. For this project, I picked a fixed number with enough headroom (27 ticks minus user commands, about two full shape falls), since I have other content in the response payload.

Solution

Since I already covered the basics in my last post, I will only cover here what is unique to this project.

Gists here.

Audio Marker (Hack)

Although it seems like there should be, there is almost no way to track what has already happened to your components on the device when you send multiple commands with staggered delays. The request payload appears to include status information on component children, visibility, and state, but I did not see it being updated reliably. Pager and Sequence do not have an onChange event. I originally tried to estimate latency and calculate the current state mathematically, but the game was not playable (or at least not enjoyable).

In my conversation with Mark, he suggested I find a way to mark each batched set of commands (each tick is a set of Parallel commands). The only commands he knew of that send a callback to the Lambda were the Media Commands. Long story short, onEnd wasn’t reliable enough (due to timing, it didn’t always fire), but onPlay fired every time.

By listening for the onPlay UserEvent in the Lambda, I was able to shift my server-side state queue and keep track of the current gameplay position.
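The shape of the hack can be sketched as follows. The marker component name and the queue shape are my assumptions; see the gists for the real implementation:

```javascript
// Sketch of the audio-marker hack: each tick is a Parallel command bundling
// the tile updates with a ControlMedia "play" on a (silent) media component,
// whose onPlay handler sends a UserEvent back to the Lambda.
// The componentId 'tickMarker' and queue shape are hypothetical.

function tickCommand(setPageCommands, delayMs) {
  return {
    type: 'Parallel',
    delay: delayMs,
    commands: [
      // Firing play on the marker triggers its onPlay -> SendEvent,
      // telling the backend this tick has actually executed on device.
      { type: 'ControlMedia', componentId: 'tickMarker', command: 'play' },
      ...setPageCommands,
    ],
  };
}

// On each onPlay UserEvent, shift the server-side queue of pre-computed states.
function advance(stateQueue) {
  const current = stateQueue.shift();
  return { current, remaining: stateQueue.length };
}
```

The key property is that the backend never guesses where the frontend is: it only moves its cursor when the device reports a tick has fired.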

Event Handler Gist

Layout Gist

InitialPage Attribute of Pager

This was one of the “read the documentation” moments for me. I was trying to figure out the best way to display the current game state on initial load with minimal performance overhead (add a firstItem attribute, send a set of reset SetPage commands, etc.). Nope: what I wanted was initialPage.

By passing in a 2D array which conveyed both X by Y dimensions and the starting values, I was able to calculate/display the initialPage (starting tile state) with no additional overhead.
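A minimal sketch of that idea, assuming a hypothetical item shape (the real data source is in the gists):

```javascript
// Sketch: deriving each Pager's initialPage from a 2D starting-state array,
// so the board renders in the correct state on first load with no reset
// commands. 0 = empty, 1 = inactive, 2 = active; the id scheme is hypothetical.

function boardItems(grid) {
  return grid.map((row, y) =>
    row.map((state, x) => ({
      componentId: `${x},${y}`, // hypothetical id scheme
      initialPage: state,       // Pager starts on this page, no SetPage needed
    }))
  );
}
```

The 2D array does double duty: its dimensions define the grid, and its values define where each Pager starts.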

Values Gist

Layout Gist

Multiple Moves Per Interaction

More as a quality-of-life feature, I allow my users to send multiple commands at a time. With the most permutations, the utterances look like this (see gist):

"{DIRECTION_A} {DIRECTION_B} {DIRECTION_C} {DIRECTION_D} {DIRECTION_E} {DIRECTION_F} {DIRECTION_G} {DIRECTION_H} {DIRECTION_I} {DIRECTION_J}"

Instead of needing to say each “Right” individually (“Alexa, right. Alexa, right. Alexa, right.”), the user can say all three at once (“Alexa, right, right, right.”). In testing, this greatly reduced friction. If I wanted to make the game more difficult, I would figure out the right timing to interleave these between the normal fall-speed movements, but I think the game is difficult enough (and fun enough) as it is for a proof of concept.
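Collapsing those slots into an ordered move list is straightforward; this sketch assumes slot names following the DIRECTION_A..DIRECTION_J pattern from the utterance above:

```javascript
// Sketch: collecting the DIRECTION_A..DIRECTION_J slot values from a single
// utterance into an ordered list of moves. The slots object shape mirrors
// a typical Alexa intent request; names here are assumptions.

function collectMoves(slots) {
  const moves = [];
  for (const letter of 'ABCDEFGHIJ') {
    const slot = slots[`DIRECTION_${letter}`];
    if (slot && slot.value) moves.push(slot.value.toLowerCase());
  }
  return moves;
}
```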

Intent Gist

Saving to Persistent vs Session Attributes

Most of the time it makes sense to mirror everything in your Session attributes to Persistent storage and vice versa. If you do so at the start and end of a session instead of on each turn, you save yourself IO latency and cost.

Because I had to pre-calculate the future possible states, and because I was already constrained by my response size, I decided to split my storage logic. All game state queue information is in Persistent storage 100% of the time, which means I hit my DynamoDB for every conversation turn (and every gameplay tick). This is stored as a queue of states, one per gameplay tick. I then use variables in the Session attributes to extract what I need from the persisted information. This is the first legitimate need I’ve had in my own projects to split my user data in this way.
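The split can be sketched like this; the store shapes are my assumptions, not the project's actual schema:

```javascript
// Sketch of the split-storage pattern: the pre-computed state queue lives in
// persistent (DynamoDB-backed) storage, read and written every turn, while
// the session attributes hold only a lightweight cursor into it.
// Both shapes here are hypothetical.

function loadTurn(persistent, session) {
  // persistent: { stateQueue: [...] } -- one entry per gameplay tick.
  // session:    { tickIndex }        -- where the device currently is.
  return persistent.stateQueue[session.tickIndex];
}

function onTick(persistent, session) {
  // Each onPlay UserEvent advances the cursor; the heavy queue data never
  // round-trips through the size-constrained response.
  session.tickIndex += 1;
  return session.tickIndex < persistent.stateQueue.length;
}
```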

The latency cost of reading from and saving to DynamoDB isn’t high in the grand scheme of things (generally under 150ms). In gameplay it’s not explicitly noticeable, but everything adds up.

Conclusion

One of my colleagues asked me how I have time and energy for projects like this. It’s a great question. Pushing the boundaries of what technology can do excites me, the creation process relaxes me, and completing projects/milestones is satisfying. There were parts of this project that were painful and frustrating, but that’s part of learning and experimentation (worth it in the end).

Like 2048, I see this game as another compelling reason for voice-controlled devices with screens. The possibilities and experience will only continue to get better as technology improves and developers become more creative.

If you’re new to Alexa or APL, I would not suggest starting with a project like this. In addition to the complexities, APL is still limited in functionality: Pager/Sequence animation is a hack, and UI audio markers are a hack. I did a lot of math and design before I wrote my first line of code, and even then I wasn’t sure the project was practical.

That said, if you’re like me, go out there and see what you can break and then what you can create out of those broken pieces. Enjoy!

Disclaimer

This is me representing my personal night/weekend projects and has no association with my employer. All methods discussed here are based on publicly available features, just some used in unintended ways.

Everything here could break tomorrow. Caveat emptor.

https://youtu.be/XdgwUCVAOV4
