Hold-to-Speak: A Better Model Than Wake Words
Wake words are a solution to a problem we do not have. Here is the interaction model we chose instead.
TL;DR
Holding a key to speak beats wake words like "Hey Siri": no always-on microphone, instant activation, and private voice commands on your Mac.
Voice assistants have trained us to accept a particular interaction model: say a wake word like "Hey Siri" or "Alexa," wait for acknowledgment, then speak your request. This model made sense for smart speakers sitting across the room, but it comes with serious drawbacks that we chose to avoid entirely when building Air.
The Problems with Wake Word Voice Activation
The first issue with wake words is the always-listening requirement. For a device to respond to "Hey Siri," it must constantly monitor ambient audio and analyze it for the trigger phrase. While companies claim that this monitoring happens locally and only the commands after the wake word are transmitted, the reality is that your microphone is always on and processing what it hears.
This creates genuine privacy concerns. There have been documented cases of voice assistants accidentally activating and recording private conversations. Even if you trust the company to handle your data responsibly, the technical architecture means that mistakes can happen.
The second problem is false activations. If you have ever had your iPhone respond to someone on television saying "Hey Siri," you know how disruptive this can be. Similar-sounding names or phrases can trigger the assistant when you did not intend to. This is not just annoying; it undermines your trust in the system.
The third issue is social awkwardness. Speaking a wake word out loud is fine when you are alone, but it becomes uncomfortable in meetings, libraries, coffee shops, or anywhere others can hear you. Many people avoid using voice assistants in public precisely because they do not want to announce to everyone nearby that they are about to talk to their computer.
Finally, wake words add latency to every interaction. When you say "Hey Siri," there is a delay while the system recognizes the wake word and prepares to listen. This might only be a few hundred milliseconds, but it makes the interaction feel less responsive than it could be.
Why Push-to-Talk Buttons Are Not the Answer
You might think the solution is a simple push-to-talk button, similar to a walkie-talkie. Press a button, speak, release. This avoids the privacy and false activation problems of wake words, but it introduces its own issues.
Traditional push-to-talk requires finding and pressing a physical button or clicking a specific interface element. This pulls you out of whatever you were doing and forces a context switch to a different input method. If you are in the middle of typing an email and want to add a quick voice note, you have to stop typing, move your hand to the button, press it, speak, release, and then return to typing.
Many push-to-talk implementations also require you to hold down a mouse button while speaking, which is awkward for longer dictation. Others require clicking to start and clicking again to stop, which means you need to remember an additional action at the end of your speech.
The fundamental problem is that these implementations treat voice input as a separate mode that you switch into, rather than a natural extension of your keyboard and mouse input.
The Air Approach: Hold the Right Option Key
Our solution is simpler and more elegant. Hold down the Right Option key on your Mac keyboard, speak your request, and release the key. That is the entire interaction.
We chose the Right Option key for several specific reasons. First, it is virtually unused by other applications. While many keyboard shortcuts use the Option key in combination with other keys, the Right Option key on its own does almost nothing in standard software. This means we can claim it without conflicting with your existing workflows.
Second, the Right Option key sits just to the right of the spacebar, within easy reach of your right thumb. You do not need to move your hand from the home row or reach for a special button. The key is already under your finger.
Third, using a modifier key feels natural to experienced computer users. You already hold Shift to type capitals, hold Command for shortcuts, and hold Option for special characters. Holding a key while speaking fits the same mental model.
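Air's internals are not public, but the interaction described above reduces to a small state machine: key down starts capture, key up finalizes the request. A minimal Python sketch of that model (the class and method names here are illustrative, not Air's actual API; on macOS the key events themselves would come from something like an NSEvent flagsChanged monitor):

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # not listening
    LISTENING = auto()  # Right Option held, audio being captured

class HoldToSpeak:
    """Models hold-to-speak: capture runs exactly while the key is held."""

    def __init__(self):
        self.state = State.IDLE
        self.buffer = []       # captured audio frames (words stand in here)
        self.utterances = []   # finalized requests

    def key_down(self):
        # Pressing the key is the only activation signal: no wake word
        # to detect, no acknowledgment tone -- capture starts immediately.
        if self.state is State.IDLE:
            self.state = State.LISTENING
            self.buffer = []

    def audio_frame(self, frame):
        # Frames are kept only while the key is held.
        if self.state is State.LISTENING:
            self.buffer.append(frame)

    def key_up(self):
        # Releasing the key is an unambiguous end-of-speech signal:
        # no voice-activity detection has to guess when you finished.
        if self.state is State.LISTENING:
            self.state = State.IDLE
            self.utterances.append(" ".join(self.buffer))

hts = HoldToSpeak()
hts.key_down()
for word in ["add", "milk", "to", "the", "grocery", "list"]:
    hts.audio_frame(word)
hts.key_up()
print(hts.utterances)  # -> ['add milk to the grocery list']
```

Because the modifier key carries the whole activation state, there is no separate "voice mode" to enter or leave; the assistant is listening precisely when, and only when, the key is physically down.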
Instant Activation Without Delays
One of the most important benefits of our hold-to-speak model is instant activation. The moment you press the Right Option key, Air starts listening. There is no wake word to detect, no delay while the system prepares, no acknowledgment sound that you have to wait for.
In our testing, we measured the time from keypress to active listening at under 70 milliseconds. This is fast enough that you can start speaking immediately after pressing the key without any perceptible delay. Voice capture begins before you finish saying the first syllable.
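The 70-millisecond figure is our own measurement, but the underlying technique is simple: timestamp the key-down event, timestamp the moment capture is live, and subtract. A hedged sketch, with a stand-in for the real microphone setup:

```python
import time

def start_capture():
    # Stand-in for opening the microphone stream; a real macOS
    # implementation would configure the audio engine here.
    return time.monotonic()

key_down_at = time.monotonic()   # timestamp of the key-down event
listening_at = start_capture()   # timestamp when capture is live

latency_ms = (listening_at - key_down_at) * 1000
print(f"keypress-to-listening latency: {latency_ms:.3f} ms")
```

In practice, the dominant cost is opening the audio stream, which is why keeping that path short matters more than anything the wake-word detector would add on top.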
This responsiveness changes how you use voice input. Instead of thinking of it as a separate mode that requires preparation and context switching, voice becomes something you can invoke instantly whenever you need it. Add a quick note, send a message, set a reminder, all without interrupting your flow.
Silent Activation for Any Environment
Because holding a key is completely silent, you can use Air in any environment without drawing attention. In a quiet office, you can hold the key and whisper your request. In a meeting, you can discreetly add notes or set reminders without announcing to everyone that you are using voice input.
This might seem like a small thing, but it dramatically expands the situations where voice input is practical. The social friction of wake words is one of the main reasons people do not use voice assistants as much as they could. By removing that friction, we make voice input a tool you can actually use throughout your day.
Natural End Detection When You Release
Another advantage of hold-to-speak is that releasing the key provides a clear signal that you have finished speaking. There is no need for voice activity detection to guess when you are done. There is no awkward pause while the system waits to see if you are going to say more.
This is particularly helpful for complex requests that require a moment of thought in the middle. With wake word systems, a long pause might cause the system to cut off your request early. With Air, you can pause to think while holding the key and then continue speaking, all without worrying about the system misinterpreting your silence.
It also makes dictation more comfortable. You know exactly when the system is listening because you are holding the key. You know exactly when it stops listening because you released the key. There is no ambiguity about whether your words were captured.
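The difference between the two end-of-speech signals can be made concrete. Below, a simulated voice-activity endpointer with an illustrative 800 ms silence timeout (an assumption for this sketch, not a measured value from any real assistant) truncates an utterance at a thinking pause, while a hold-to-speak endpointer keeps everything until release:

```python
# Each word arrives with the silence gap (in ms) that preceded it.
utterance = [
    ("remind", 0), ("me", 120), ("to", 100), ("email", 150),
    # a 1.5-second pause to think...
    ("the", 1500), ("quarterly", 110), ("report", 130),
]

def vad_endpoint(words, silence_timeout_ms=800):
    """Voice-activity detection: stop at the first gap over the timeout."""
    kept = []
    for word, gap_ms in words:
        if gap_ms > silence_timeout_ms:
            break  # VAD decides the speaker is done -- request cut short
        kept.append(word)
    return " ".join(kept)

def hold_endpoint(words):
    """Hold-to-speak: the key release ends capture, so every word is kept."""
    return " ".join(word for word, _ in words)

print(vad_endpoint(utterance))   # -> 'remind me to email'
print(hold_endpoint(utterance))  # -> 'remind me to email the quarterly report'
```

A timeout-based endpointer must trade responsiveness against patience; the held key sidesteps the trade-off entirely because the user, not a heuristic, decides when the utterance ends.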
Building Muscle Memory with Voice Input
After using Air for about a week, the interaction becomes completely automatic. Your brain learns the connection between wanting to say something and pressing the Right Option key. It becomes as natural as pressing Shift when you want a capital letter.
This muscle memory is crucial for voice input to become a natural part of your workflow. If you have to think about how to activate the voice assistant every time you want to use it, the cognitive overhead reduces the benefit. When activation is automatic and invisible, voice becomes just another input method alongside your keyboard and mouse.
We have heard from users who forgot that Air was even running because using it felt so natural. They would hold the key, speak, release, and continue working without consciously thinking about the voice assistant at all. That invisible integration is exactly what we were aiming for.