Friend says that their device is always listening. From their website:
When connected via bluetooth, your friend is always listening and forming their own internal thoughts.
I've worked with transcribing speech into text before, and know first hand that it isn't a cheap thing to do. My first instinct was "damn, must be a strong team to figure out a cheap way to do this at scale".
Then I saw the team and became suspicious. We'll be investigating what it means for Friend to always be listening from an engineering and financial perspective.
Assumptions
Assume Friend sells 10,000 devices in this initial batch, and that each device transcribes 4 hours of audio a day. I believe this is a fair assumption for the initial customer profile of a Friend buyer, who is unlikely to be engaging in conversation for more hours than that.
So, we'll assume 10,000 devices, each transcribing 4 hours a day, or 2.4 million minutes a day.
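Spelled out in a quick sanity check (the device count and daily hours are my assumptions above, not official Friend numbers):

```python
# Baseline assumptions from above (not official Friend numbers).
DEVICES = 10_000        # devices sold in the initial batch
HOURS_PER_DAY = 4       # hours of audio transcribed per device per day

minutes_per_day = DEVICES * HOURS_PER_DAY * 60
print(f"{minutes_per_day:,} minutes of audio per day")  # -> 2,400,000
```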
Approaches
Let's go through 5 ways to build always listening.
1. On-Device Transcription
Transcription can be done on-device by using a small model and minimal processing power. However, given the size of the device and its $99 price point, it's unlikely to have a powerful processor on board. Therefore, it's highly improbable that the device can perform always-on transcription without draining the battery, while still maintaining a low word error rate and processing the audio fast enough.
2. Cloud Services
Friend could be using a 3rd party cloud service for speech to text like AWS, GCP, Azure, OpenAI, Deepgram, or AssemblyAI. At the time of writing, the cheapest seems to be Deepgram, which is priced at $0.0036 per minute of audio.
Let's make this fun and also assume that Friend cuts a fantastic enterprise deal with Deepgram that brings that cost down by half to $0.0018 per minute. At that price, to transcribe 2.4 million minutes per day would cost $4,320, or close to $1.6M per year. Obviously, this cannot be the case.
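Here's that back-of-the-envelope in code, using Deepgram's listed price and my hypothetical 50%-off enterprise rate:

```python
# Cloud speech-to-text cost estimate (prices taken from the discussion above).
MINUTES_PER_DAY = 2_400_000          # from the assumptions section
LIST_PRICE = 0.0036                  # Deepgram's listed $/minute at time of writing
ENTERPRISE_PRICE = LIST_PRICE / 2    # hypothetical 50% enterprise discount

daily_cost = MINUTES_PER_DAY * ENTERPRISE_PRICE
yearly_cost = daily_cost * 365
print(f"${daily_cost:,.0f} per day, ${yearly_cost:,.0f} per year")
# -> $4,320 per day, $1,576,800 per year
```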
3. Self Hosted Whisper
Let's try running Whisper, a popular speech to text model, on our own infra. There are several open-source variations of Whisper, like Faster-Whisper, WhisperX, and Whisper Jax that are faster and more efficient than the original implementation.
Here is my benchmark of WhisperX transcribing a 20 minute audio file on a single A100 40GB:
| | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Model | large-v2 | tiny.en | tiny.en |
| Batch Size | 16 | 16 | 1 |
| Precision | float16 | int8 | int8 |
| Time | 15s | 5s | 10s |
| Peak GPU Utilization | 100% | 84% | 84% |
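For reference, a run like Run 1 is set up roughly like this. This is a sketch based on WhisperX's documented Python API, not my exact benchmark script, and the audio path is a placeholder:

```python
# Sketch of a WhisperX run matching the Run 1 configuration above.
import time
import whisperx

device = "cuda"  # single A100 40GB
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("sample_20min.wav")  # placeholder 20-minute clip

start = time.perf_counter()
result = model.transcribe(audio, batch_size=16)  # Runs 2/3 used tiny.en / int8 / batch 1
print(f"Transcribed in {time.perf_counter() - start:.1f}s, "
      f"{len(result['segments'])} segments")
```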
The peak GPU utilization is high, which means that each GPU can only handle a single user's transcription at a time. At the time of writing, there doesn't seem to be a way to allow a single GPU to handle multiple transcriptions from different audio files at the same time.
Smaller models might be less resource-intensive, but they also tend to be less accurate, especially in noisy environments. Given that Friend is designed to be used everywhere, both in loud public spaces and quiet private homes, relying on smaller models could lead to significant degradation in transcription quality. Larger models are better equipped to handle background noise, complex audio conditions, and conversations, making them the preferred model choice. You don't want "I'm having falafel" to turn into "I'm having a waffle".
The question then is: how many GPUs would be required to run the large-v2 model, given 10,000 users each speaking for 4 hours a day, with audio recorded in 20 minute chunks?
From Run 1, a single GPU can process (3,600 ÷ 15) = 240 20-minute chunks per hour.
10,000 users each transcribing 4 hours of audio a day is equal to (10,000 × 4) ÷ 24 ≈ 1,666 hours of audio needed to be transcribed per hour.
In terms of chunks, that's 1,666 × 3 ≈ 5,000 20-minute chunks that need to be transcribed per hour.
Therefore the number of GPUs needed is 5,000 ÷ 240 ≈ 21 running 24/7.
The cheapest A100 I could find is $1.69 per hour from RunPod. 21 GPUs at $1.69 per hour, running 24/7, come to roughly $850 per day, or about $310,000 per year. That's about $31 per user per year. Even with a 50% discount, that's still over $15 per user per year, which isn't sustainable.
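The same math in code (the device count, hours, and GPU pricing are my assumptions from earlier):

```python
# GPU sizing and cost estimate for self-hosted WhisperX (large-v2, Run 1 numbers).
import math

SECONDS_PER_CHUNK = 15                                # Run 1: 20-minute chunk in 15s
CHUNKS_PER_GPU_HOUR = 3600 // SECONDS_PER_CHUNK       # 240 chunks per GPU per hour

DEVICES = 10_000
HOURS_PER_DAY = 4
audio_hours_per_hour = DEVICES * HOURS_PER_DAY / 24   # ~1,666 hours of audio per hour
chunks_per_hour = audio_hours_per_hour * 3            # 3 twenty-minute chunks per hour -> ~5,000

gpus = math.ceil(chunks_per_hour / CHUNKS_PER_GPU_HOUR)   # GPUs running 24/7

A100_HOURLY = 1.69    # cheapest A100 I found (RunPod, at time of writing)
daily_cost = gpus * A100_HOURLY * 24
yearly_cost = daily_cost * 365
print(f"{gpus} GPUs, ${daily_cost:,.0f}/day, ${yearly_cost:,.0f}/year, "
      f"${yearly_cost / DEVICES:.2f}/user/year")
# -> 21 GPUs, $852/day, $310,892/year, $31.09/user/year
```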
4. Self Hosted CPUs
You could run Whisper.cpp or Faster-Whisper on regular servers without the need for a GPU. Although that sounds cheaper in theory, it takes much, much longer to transcribe the audio, and like GPUs, only one transcription can run at a time. That eliminates any cost savings: you'd need significantly more servers to handle the same workload, which drives the cost up to roughly the same level as using GPUs.
To be more specific, I ran the large model locally using Whisper.cpp on my MacBook M3 Pro, and it took almost 5 minutes (300 seconds) to transcribe 20 minutes of audio.
So a single instance of my computer can process (3,600 ÷ 300) = 12 20-minute transcriptions per hour.
With the same calculations as section 3, the number of servers needed is 5,000 ÷ 12 ≈ 417.
Whisper.cpp's docs say that the large model requires roughly 4 GB of memory. The cheapest EC2 compute-optimized instance with 8 GB of memory has an hourly rate of just over $0.10 on a 1-year compute savings plan.
Again, let's make this fun and assume that this instance performs at the same level as my MacBook (it absolutely doesn't). That would be about $365,000 per year, which is in the same ballpark as the GPU cost. Again, this cannot be the case.
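The sizing and cost, under those same generous assumptions:

```python
# CPU sizing and cost estimate, assuming each server matches my MacBook's
# Whisper.cpp speed (it won't) and the ~$0.10/hour EC2 rate mentioned above.
import math

SECONDS_PER_CHUNK = 300                               # 20 minutes of audio in ~5 minutes
CHUNKS_PER_SERVER_HOUR = 3600 // SECONDS_PER_CHUNK    # 12 chunks per server per hour
CHUNKS_PER_HOUR = 5_000                               # same workload as the GPU section

servers = math.ceil(CHUNKS_PER_HOUR / CHUNKS_PER_SERVER_HOUR)   # servers running 24/7

EC2_HOURLY = 0.10                                     # 1-year compute savings plan rate
yearly_cost = servers * EC2_HOURLY * 24 * 365
print(f"{servers} servers, ${yearly_cost:,.0f}/year")
# -> 417 servers, $365,292/year
```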
5. Transcribe on iPhone
While writing this essay, I noticed a detail that I had missed. The quote from Friend's website says something important:
When connected via bluetooth, friend is always listening and forming their own internal thoughts.
Does the always listening feature require a bluetooth connection to the phone? If so, that could mean that audio is recorded on device and then periodically sent to the phone for transcription. This seems plausible, as Apple provides a Speech API for iOS. [1] [2]
The Speech API allows you to run on-device transcription of pre-recorded audio files. According to Apple, however, it's not as accurate as a transcription that would run on a server, and it can be a drain on battery life. I could be wrong about this, but I believe it's designed for single-speaker scenarios in quiet environments, which is unlike how the Friend is expected to be used.
If this is the approach Friend ends up using, I predict that the first batch of customers to receive the device will complain about battery drain from the Friend app.
Another approach would be to transcribe on the iPhone using Whisper.cpp. The benchmarks show that it takes about 1 second to transcribe 4 seconds of audio on an iPhone 13. That's not fast enough to be used efficiently, as the phone would have to be constantly transcribing, which again is unsustainable.
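A rough duty-cycle estimate based on that benchmark number, again assuming 4 hours of audio per day:

```python
# How much of the day the phone would spend transcribing, given the
# whisper.cpp iPhone 13 benchmark of ~1s of compute per 4s of audio.
AUDIO_HOURS_PER_DAY = 4
REALTIME_FACTOR = 1 / 4          # 1 second of compute per 4 seconds of audio

compute_hours_per_day = AUDIO_HOURS_PER_DAY * REALTIME_FACTOR
print(f"~{compute_hours_per_day:.0f} hour(s) of sustained on-phone compute per day")
# -> ~1 hour(s) of sustained on-phone compute per day
```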
Closing Thoughts
Most of the problems I mentioned above would be solved if Friend only transcribed the past 5 minutes of audio from when you tap on it, or if it transcribed at random points throughout the day. But transcribing everything that's recorded is not just expensive, it's also inconvenient if it comes at the cost of draining the battery.
Despite finding holes in every approach, I still believe the Friend will always be listening as marketed. I don't doubt what they're saying, and wish them the best of luck.
If you're interested in a product that isn't overpromising, check out 🍑 Peach Pod.