Last month, my partner-in-business-and-life Lauren and I were part of the Project Voice conference in Chattanooga, Tennessee. Project Voice is the first voice-focused technology conference I’ve attended, so it was a great chance to meet some of the folks in the #VoiceFirst community I’ve previously only interacted with online. Bradley Metrock and his crew put together a comprehensive schedule with a wide range of speakers from the largest voice platforms to the tiniest technology companies (thanks for having Voxable on the lineup, Bradley).
At Project Voice, I often found myself wishing I could attend three or four conference tracks simultaneously. I kept meeting new people passionate about the same technology I’ve dedicated myself to for the past half-decade. Hearing about their experiences trying to make machines that can talk was fantastic and enlightening. In short: I was busy...very busy.
I was so busy, in fact, it wasn’t until I returned home that I realized I somehow missed this announcement from Mozilla:
On January 15, 2020, Mozilla released Firefox Voice, an experiment with voice interaction in the Firefox browser. In its current iteration, Firefox Voice runs as a browser extension providing a set of voice interactions via a keyboard command:
Amid multiple days of themed content at Project Voice from the major smart speaker platforms (Amazon, Google, and Samsung), how did Mozilla make the most important announcement of the week? After all, aren’t the major platforms where most of the action in the voice space is happening? And Mozilla wasn’t even present at the conference!
Amazon Alexa continues to dominate mindshare amongst folks working in voice today. This is thanks in large part to a massive investment in developer evangelism on Amazon’s part. Amazon also clearly listened to early feedback on the lack of monetization options for Alexa skills. Currently, Alexa is far and away the best platform for releasing monetized voice experiences.
Even with an increase in monetization options, Project Voice made clear that major challenges remain for independent voice companies. During his excellent keynote, Brian Roemmele pointed out that smart speakers are the fastest-adopted consumer technology in history. Within a similar time frame after the release of the App Store for iOS, there were already multiple millionaires who made their money during the mobile app early-adopter windfall. Given that it’s now five years after the initial release of Alexa, Brian appropriately probed: “Where are the voice-first millionaires?”
According to reports, monetization on Alexa underperformed expectations. Some voice firms expressed notable and public displeasure with their ability to attract and retain paying customers for voice experiences. Co-founder of Soundcheck.ai Dr. Daniel Tyreus laid out the issue in his Medium post:
> It appears that most developers who have invested in building voice apps are not seeing a return. In the first 10 months of 2019 Alexa developers made about $1.4 million from in-skill purchases on 100,000 Alexa skills. On the other hand, in the first half of 2019, iOS developers made $25.5 billion in revenues from 2.2 million apps on the App Store. That means the average Alexa skill made $12 in the first half of 2019 while the average iOS app made closer to $12,000. That’s an enormous difference.
As entrepreneurs in the voice industry for the past four-and-a-half years, my partner Lauren and I firmly agree: making a living in voice is more of a struggle than it should be. The fact that we’re still having conversations at major conferences about monetization and discovery a half-decade into this journey is distressing, to say the least. And independent firms aren’t the only ones feeling skittish about voice.
Another key Project Voice takeaway: brands have been slower to adopt voice than initially anticipated. Over on the Jargon blog, Shaun Withers outlined one important reason for the lag in brand adoption related to discovery:
> Domains [...] are becoming the preferred solution for situations where efficiency is a priority. With a domain, the user can go straight to their intent and the platform will recommend the best option and facilitate the handoff to the brand. Google showed this off in a demo where the user bounced between 1st party features, like accessing the camera and flashlight, and 3rd party mobile apps, like IMDB, Instagram, and Walmart. All of this was done while using implicit language and no platform or app invocations between intents. It was a much more efficient interaction than invoking individual voice apps. This level of agency in the hands of companies like Amazon and Google, who often have competing solutions of their own, is unsettling to many brands.
Today, building a multimodal voice experience for a smart speaker platform means relinquishing quite a bit of control over the user experience to the platform provider. There’s also a lack of access to “first-party” integration features created via the voice platforms’ collaborations with major brands. At this time, Google won’t approve multimodal experiences built with the Interactive Canvas feature for anything other than “gaming experiences.”
The difference in affordances provided to third-party voice developers versus first-party integrations built by the voice platforms has long been a bone of contention for those in the voice space. Nowhere was that more eloquently stated than in the NPR Voice Platforms team’s Project Voice presentation description (and they actually won the Google Assistant Action of the Year award!):
> Content Challenges: Moving From Alexa To Google Assistant
>
> Rebecca Rolfe (Product Designer for Voice Platforms, NPR); Tommy O'Keefe (Senior Software Engineer, NPR)
>
> "It sounds easy" is the bane of voice developers building for novel and rapidly-changing platforms. Rebecca left Google not sure why third parties weren't building more ideal voice flows, but after a few months working at NPR she found out! Tommy and Rebecca propose key features needed to build the audio experiences users would love and expect, as a shortlist to platform creators on what to support next.
I particularly loved how the NPR team delivered their Project Voice presentation: Rebecca, the conversational designer, proposed cool ideas for NPR’s voice experiences to Tommy, the engineer, who then explained why each wasn’t possible due to restrictions imposed (sometimes intentionally, oftentimes not) by the major voice platforms.
My partner Lauren and I often have similar conversations. Have a cool idea for a music streaming experience on the Google Assistant? Prepare to get hit by any number of long-standing bugs with media responses that you can’t fix yourself on a closed platform. Want to monetize your work and capture some of the value it adds to the Alexa platform? That’s only okay when Amazon says it is.
Understandably, not all brands are willing to sacrifice control over user experience to leverage the platforms created by Google or Amazon. Independent developers continue to chafe under the constraints imposed regarding discovery and monetization as well as the challenges that come with competing directly with the platforms themselves to build brand integrations. It’s evident something needs to change.
Conversations in the halls of Project Voice confirmed this notion. The folks I spoke with agreed it’s time for an open voice platform, free from the restrictions imposed by the current providers. Some existing open voice platforms, most notably Mycroft, offer a potential solution. But as voice evangelists like Dr. Tyreus have noted, the most obvious platform for open voice experiences is the Web:
> There could be a better way for publishers to get content onto voice devices. Instead of building proprietary voice apps for each platform, let’s teach assistants to better understand structured web content. Organizations should be able to publish websites with specific structured data to map voice-based intents to web-based fulfillment. Then the virtual assistants do the conversational magic to glue it all together. We can dramatically lower the barrier for publishing to voice, increase the amount of quality content, and provide a more consistent experience for users.
I applaud the effort to bring standards like this to the Web, and Soundcheck.ai provides some great tools to give brands this ability. In a Twitter discussion on the subject, Richard Wilson remarked that schema.org already introduced a new PronounceableText type, which allows content publishers to specify how content on a web page should be pronounced using SSML. This is a nice option for brands that want to create a voice experience on a major voice platform, but it doesn’t relieve many of the pain points voice designers and developers currently endure.
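To give a sense of what that markup looks like, here’s a rough sketch of a standalone PronounceableText node based on the pending schema.org definition. The example values (the brand name and its IPA transcription) are my own illustrative assumptions, not official markup:

```json
{
  "@context": "https://schema.org",
  "@type": "PronounceableText",
  "textValue": "Voxable",
  "speechToTextMarkup": "IPA",
  "phoneticText": "/ˈvɑːksəbəl/",
  "inLanguage": "en-US"
}
```

A crawler or assistant that understands this type could read `textValue` on screen while using `phoneticText` for speech output.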
When I look at a device like Amazon’s Echo Show 5, I see a different path forward for voice on the Web. The Echo Show 5 is a fantastic device for Alexa developers; it is compact, offers the ability to develop screen-based multimodal interactions, and allows users to wear headphones so they don’t drive everyone around them up the wall after the 500th, “There was a problem with the selected skill’s response.”
What’s to stop some plucky hardware manufacturer from offering a similar device running Linux with a Web Speech API-capable browser like Firefox? For that matter, what’s to stop them from running that voice-enabled browser on a smart television? Rather than creating a new presentation language from scratch, why not use the presentation languages we already have on the Web? Now that the Web Speech API is (nearly) supported on both Firefox and Chrome, any device with a screen, microphone, and speakers running a modern web browser is capable of delivering a multimodal voice experience.
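To make that concrete, here’s a minimal sketch of what a voice interaction in the browser could look like. Only `SpeechRecognition` (prefixed as `webkitSpeechRecognition` in Chrome) and `speechSynthesis` come from the Web Speech API; the command map and function names are illustrative assumptions of mine, not part of any library:

```javascript
// Pure routing logic: match a recognized transcript against a map of
// command prefixes and return the spoken reply (or null if no match).
function handleVoiceCommand(transcript, commands) {
  const phrase = transcript.trim().toLowerCase();
  for (const [prefix, action] of Object.entries(commands)) {
    if (phrase.startsWith(prefix)) {
      return action(phrase.slice(prefix.length).trim());
    }
  }
  return null;
}

// Browser-only wiring: recognize speech, route it, and speak the reply.
function startListening(commands) {
  const Recognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  const recognizer = new Recognition();
  recognizer.onresult = (event) => {
    const transcript = event.results[0][0].transcript;
    const reply = handleVoiceCommand(transcript, commands);
    if (reply !== null) {
      speechSynthesis.speak(new SpeechSynthesisUtterance(reply));
    }
  };
  recognizer.start();
}
```

In a real page, `startListening` would typically be called from a click handler, since browsers generally require a user gesture before granting microphone access.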
All we lack at this point for creating great voice experiences on the Web are a few key ingredients:
- Wakeword-enabled voice interactions (There are already some solid efforts in this regard.)
- Open Source examples of exceptional multimodal voice experiences built for the Web
As is evident from the list above, some of these efforts are already underway. Now, we need a unified group working across the technology industry on a set of open standards and components to bring modern voice experiences to the Web. That brings us back to the importance of Mozilla’s announcement.
Mozilla has worked on making voice technology more open for some time. They partnered with Mycroft to allow users to (optionally!) share their speech data to improve Mozilla’s Open Source DeepSpeech speech recognition engine. Mozilla also built the Common Voice dataset for training speech recognition algorithms to be as inclusive as possible. They continue to work to fully support the Web Speech API on all platforms that run the Firefox browser. The Firefox Voice effort is but the latest proof Mozilla sees potential in a voice-enabled Web.
If you want to help Mozilla improve the Firefox Voice experience, head to the project page, sign up, and use it! Mozilla needs user data to analyze the performance of Firefox Voice and catch bugs. There’s an explanation of what makes a good bug report on their site.
If you’re interested in helping other like-minded individuals move the Web forward, then contact us and subscribe to the Voxable blog. We’ll share our plan for addressing some of these issues in the coming weeks. We’re already in touch with some folks in the voice ecosystem about how we can bring quality voice experiences to the Web, and we want to meet more people interested in joining us!
The Project Voice conference, accompanied by Mozilla’s announcement that same week, solidified my conviction regarding an Open Voice Web. It’s time to free ourselves from the restrictive voice platforms. It’s time for the first voice millionaires to emerge. Let’s band together and solve these problems for ourselves rather than waiting on the world’s largest technology companies to solve them for us. The Web belongs to all of us, and the Web deserves a voice.