Back to the Future of Audio

Your voice is my favorite sound.

“I can’t wait to hear from you” is the ultimate compliment.

That’s why I was intrigued when House Resident Casper ter Kuile, author of The Power of Ritual, recently reached out to me. Seeking some feedback on a new initiative of his, he sent me a voice memo and asked me to share my response as a voice memo as well. I recorded it, without much prep, just blurting out my thoughts, and sent it to him. He responded with another voice memo, and I was surprised by how easily and yet thoughtfully our asynchronous exchange transpired. While we never saw each other or spoke in real-time, our connection over those few days felt real, and it conveyed more texture than we had expected.

This inspired us, in last week’s Beauty Shot, to experiment with voice memos as well. We invited members of our community to share with us their desires (“what is it you really want?”), and we launched a Telegram group to that end. Among others, my colleague Monika posted a memo, and it was fascinating to hear how she started somewhat lightly, but how her account turned more and more vulnerable until her voice cracked a bit and you could hear her breathing more heavily. It was as if her heart took control of the script.

Such is the beauty of audio. It opens different doors to your soul.

“The human voice is one of the most powerful sounds on the planet. It’s the only sound that can say ‘I love you’ or even start a war,” writes Julian Treasure, the founder of the Sound Agency and author of the books Sound Business and How to Be Heard.

Anne Karpf, in The Human Voice: How This Extraordinary Instrument Reveals Essential Clues About Who We Are, observes that “Human empathy develops early, and it’s expressed vocally.” She refers to newborns’ ability to discriminate between their own and other babies’ cries: “They get upset when they hear other babies cry, which is probably why, when one baby cries in a maternity ward, the others inevitably follow. It’s always assumed that they’re simply copying each other, but they’re probably also pained by the sounds of distress.”

All this may explain why we’ve been witnessing a renaissance of audio during the pandemic. The minimalist allure of voice-only seems to have given us a sense of control amidst all the doom and Zoom. And audio can convey and create more intimacy than other forms of digital presence.

“From phone calls to messaging and back to audio—the way we use our phones may be coming full circle,” Tanya Basu writes in the MIT Technology Review. And tech analyst Jeremiah Owyang considers social audio the “Goldilocks Medium”: “Text social networks are not enough, video conferencing is too much, social audio is just right.” (Check his list of “20 ways businesses will engage social audio.”)

His assessment is borne out by the explosive growth of Clubhouse in the past 12 months (helping its valuation soar swiftly to $1 billion). More social audio businesses are emerging, from Twitter Spaces to podcasting with friends on Cappuccino, sharing multi-modal audio messages via Swell, embracing nostalgic call-in radio shows with Capiche, and talking sports with Locker Room.

What if social audio made social media beautiful again? What if the future of business were eavesdropping? (Sounds crazy? Hear us out further below!) What if our voices spoke louder than our actions? What if every business began (again) with a warm call?

I’m at +49 170 229 30 15.

—Tim Leberecht

The social audio wave

Clubhouse and the rest of the pack

Jeremiah Owyang has written a comprehensive outlook on the future of social audio, in which he proposes a feature roadmap, more than 14 possible business models, and six new product categories. He predicts that a post-lockdown mass rush to travel and take vacations will result in a slight temporary reduction in social audio adoption (he puts it at 30 percent), but that social audio is ultimately here to stay. The market, however, will consolidate, and only a few dominant players will remain, with social audio increasingly integrated into every other digital interaction.

Against this backdrop, Clubhouse veterans occasionally gather in rooms to reminisce about the early days and bemoan the loss of true community as the platform scales quickly and, even though still invite- and iOS-only, opens its doors to more people around the world. One could argue that the platform will either lose its soul or the battle for market share, but so far it has gotten a lot of things right, and its archaic, stripped-down functionality and UI, while frustrating at times for more demanding users, resemble the compelling simplicity of Google’s search engine.

But the copycats are lurking, and me-too services with slightly improved functionality represent a real threat, as audiences are fickle.

Twitter introduced audio tweets last summer, followed by tests of audio direct messages. In December, it launched Twitter Spaces in beta for live, host-moderated audio conversations, and it recently announced an Android version (beating Clubhouse to the punch).

As drop-in social audio, Twitter Spaces obviously looks very similar to Clubhouse. But among the notable differences are starkly different data policies: Twitter records and stores all conversations for at least 30 days, whereas Clubhouse deletes them immediately after a room ends unless a user reports a violation of its Trust and Safety measures. Twitter will also make recordings and transcripts of conversations available to hosts, whereas Clubhouse strictly prohibits any recording by participants unless they get consent from all the other speakers.

Moreover, the design of the user experience serves different purposes. While Clubhouse’s maxim appears to be feature minimalism—no emoji reactions, no direct contact among users except if they speak, no way to insert other media into a conversation—combined with some very distinct social cues (e.g. unmuting as a signal to chime in, or applauding by turning the microphone on and off in quick sequence), Twitter Spaces adds audio as a new feature to its existing ecosystem.

Both platforms also think differently about the role of a host: While Twitter Spaces only allows for one host—exemplifying perhaps the vanity and self-promotion of Twitter—Clubhouse is designed for a more collaborative way of running conversations, allowing for multiple hosts and even the opportunity to change host roles in-flight.

All these differences point to different intentions: for Twitter, audio is a feature (and content) to enhance the value of its service; Clubhouse, on the other hand, so far is a purist play, with its conversations deliberately living in the moment.

Twitter is not the only contender for Clubhouse’s “share of ear.” Mark Cuban’s eagerly anticipated Fireside app, said to be a hybrid between Clubhouse and Spotify’s Anchor, is reportedly in beta. Fireside aims to “professionalize” Clubhouse-style live audio conversations by adding more tools and features, such as enhanced moderator and speaker controls, emojis, and a chat function, as well as more mechanisms for creators to monetize their content. Basically, it seems as if Fireside is determined to eat Clubhouse’s lunch by beating it at its own game.

Can social audio make social media beautiful again?

Riding the wave of success of Clubhouse, there’s also an array of vertical audio networks popping up. One of them is Quilt, a social audio platform with a focus on wellness, that recently closed a $3.5 million seed round led by Mayfield Fund. “More personal, real, and revealing than a tweet or post, and less obtrusive and image-conscious than photo or video, audio’s atomic unit is the intimate conversation. Which is of course the perfect antidote to social media’s worst vices.”

“In the hands of the right founder, audio (and perhaps only audio) can be harnessed to create a new kind of social media experience: authentic, supportive, informative, hilarious, and essential,” Mayfield’s Rishi Garg wrote in a blog post about his decision to invest in Quilt. It’s a vision one would like to believe in: social audio making social media beautiful (again).

The reality is more complicated. House Resident and bestselling author Rahaf Harfoush was an early adopter of Clubhouse, and as a digital anthropologist is fascinated by the new social protocol and culture emerging on the platform. She volunteers her time as part of an “Anti-Grift-Squad” that onboards new users and helps them navigate the new environment that is, despite the apparent generosity and kindness, not free of tricksters and fraudsters. Harfoush believes the intimacy of Clubhouse’s format is prone to scams: “We’re naturally more persuadable by hearing somebody talk to us than reading something,” she told the Los Angeles Times.

Users of social audio platforms have come to witness that voice-only spaces can quickly become ugly and hard to police, and that abuse and harassment, while still the exception, are too common. Content moderation, the perennial challenge for all social media platforms, is critical but, due to the democratized nature of Clubhouse et al., also left mostly in the hands of hosts.

Yet Harfoush highlights the uplifting aspects of Clubhouse: When her mother, who passed away last November, would have turned 65 on a recent Saturday in February, a friend meant to console her with a song live on Clubhouse, and Harfoush decided to open up the room to any user: “We ended up having over a thousand people join and listen, and it turned into this really moving tribute and celebration of the life of my mom that I would have never anticipated,” she said.

Your voice is my favorite sound

Reduced to the max, beyond reductionism

The immediacy and intimacy that audio brings feel different from the overwhelming nature of existing social media networks, which have arguably taken too many wrong turns in past years to be able to return to their original purpose: to connect.

All we want is to feel more. By resting our oversaturated eyes and focusing on other people’s voice, intonation, breath, how and when they pause, their rhythm, and the moment when their voice cracks, we can.

Business communications trainer Sherie Griffiths calls this “visual listening”: “More and more, we’re in situations where we’re doing business with increasing numbers of people that we never meet. Now, we can learn skills that enable people to speak on the phone to someone and build up a picture.” Audio allows us to fantasize, a little, to break away from the reductionism of expressing ourselves with “thumbs up” and “LOL,” and to lean into what lies beneath.

Similar trends are beginning to emerge for dating, with Chekmate, a text-free app based on voice messages; for dealing with stress or better sleep with ASMR; and audio porn, with companies like Quinn and Dipsea that offer “recorded content that ranges from ‘appreciative boyfriend’ to every possible fantasy your brain can cook up.”

Even sonic branding can assume entirely new dimensions: Russian airline S7 has created virtual flights on Clubhouse, with a room of people hearing the sounds of airports and jet engines, crew announcements, and other epiphenomena of flying that we (or some of us) have missed dearly for the last year or so.

Leading by voice

In light of the audio boom, it has become even more important for us as professionals to find our voice and express ourselves in more informal, free-flowing conversation. Voice coaches are on the rise, and management-by-audio is no longer a nice touch but a standard requirement and skill for any leader. More than ever, it is paramount to know how to speak so that people actually want to hear you. It’s not just a matter of what you say, but HOW you say it.

Julian Treasure is one of the leading experts in this field, and he contends that to sound good, we first need to understand the four ways sounds affect us: physiologically, psychologically, cognitively, and behaviorally.

For starters, our body is 70 percent water. He observes that “sound travels well in water, so we’re very good conductors of sound.” Moreover, hearing is our primary warning sense: “We’ve been programmed over hundreds of thousands of years to assume that any sudden or unexplained sound is a threat and your body gets ready to fight or flee.” A sudden sound releases cortisol, the stress hormone. It increases our heart rate and changes our breathing. This is also why it’s not a good idea, he points out, to use an alarm clock with a traditional bell or a beeper—unless you want to start your day in a state of stress.

Furthermore, sound affects us emotionally and can alter our moods, whether it is music, the sounds of nature, or a human voice.

The human voice, by the way, is also the most distracting sound, Treasure argues. “If somebody’s speaking next to you, it’s very difficult to block out that sound. We have no earlids and distracting human conversation hugely impedes your productivity,” he writes. Indeed, studies show that up to 30 percent of employees are dissatisfied with open workspace noise, and 63 percent believe loud colleagues are a primary distraction. Noise affects us cognitively and behaviorally: “It makes us less sociable, less helpful and less approachable if we’re in a noisy setting,” Treasure contends.

On the bright side, research has found that voice is one of the most dominant factors in interpersonal attraction: “When people evaluated someone’s voice as attractive, they also tended to evaluate the person as attractive. Regardless of the sex of the target person, a bright, generous and low-toned voice, with a small range of pitch, was evaluated as being attractive.”

In his popular TED Talk, Treasure cites the key levers we can use to make our voice work for us. One is register (baritone or falsetto? Yes, we still associate lower voices with authority). Another is timbre (“the way your voice feels”): research shows that we prefer voices that are “rich, smooth, and warm.” We can also tweak our so-called “prosody” (the sing-song, the melody of our speaking), as well as our pitch, pace, and volume.

With his work, Treasure wants us to become more conscious producers and consumers of sound:

“What would the world be like if we were creating sound consciously and consuming sound consciously and designing all our environments consciously for sound? That would be a world that does sound beautiful.”

The human voice and AI

More conscious human listening and speaking is of course rivaled by AI, which already has “predictive listening” capabilities that outperform humans. Researchers at the University of Southern California have for instance developed AI to predict the future of a couple’s relationship based on how the two partners speak with one another. In the study, after being trained on recordings of couples’ conversations, the algorithm became marginally better (79.3 percent accurate) than the control group of professional therapists at predicting whether the couples would stay together. “Using lots of data, we can find patterns that may be elusive to human eyes and ears,” said Shri Narayanan, an engineer who led the study.

This ability comes in handy when listening to the human heart. Researchers at the University of Washington transformed Amazon’s Echo into a heart rate monitor that can detect irregular heartbeats.

By the way, the sound of Amazon’s Alexa itself—even though sounding somewhat human—is actually based on a process called Neural-Network Text-to-Speech (NTTS). Rather than stringing together pre-recorded human sounds, NTTS generates sounds from scratch, using patterns found in previous recordings, and research has shown that humans find it more natural-sounding.

Amazon says its technology “aligns the speech signal with the text at the level of phonemes, the smallest units of speech. Then, for each phoneme, the system extracts prosodic features—such as changes in pitch or volume—from the spectrograms. These features can be normalized, which makes them easy to apply to new voices.”

Listen to this example:

“But Germany thinks she can manage it…” (audio samples: Original, Transferred, Synthesized)

These are three different versions of the same text excerpts. “Original” denotes the original recording of the text by a live speaker. “Transferred” denotes a synthesized voice with prosody transferred from the original recording. And “Synthesized” denotes the synthesis of the same excerpt from scratch, using Amazon’s NTTS technology.

You are being heard, literally

We are natural sleuths

Aren’t we all guilty of enjoying eavesdropping on strangers around us? As little as a few words are enough to make up our own story, which is a trait we share with monkeys, recent research suggests. John L. Locke, a professor of language science and author of Eavesdropping: An Intimate History, points out that this trait dates back to our primal need to understand the intentions and motives of others. “We have mechanisms in the brain that are designed to draw inferences from partial information that we see and hear and smell. We’re natural sleuths.” In a similar way, we want to be seen and (over)heard, to be socially approved and thus to keep ourselves from misbehaving. As the pandemic has lingered on, our desire for “intimacy by theft,” as Locke calls it in his book, has become unbearable:

“We all have a persona, or a public personality, which is different from who we really are. We love it when we get something that’s truly genuinely true about others and so we still prefer taking it, or if not taking it, extracting clues on our own.”

These random moments of “being part of an audience” of the life of others make our own more liveable. They are now accessible through Clubhouse and other ambient eavesdropping apps such as High Fidelity, a spatial audio solution that lets you overhear and join conversations based on virtual proximity. It mimics the nature of sitting in a semiprivate, semipublic office, where information is carried via the hallway and through the energy we pick up from semi-listening to our colleagues. As Clive Thompson observes, “one benefit of the physical office is that it lets us low-key creep on each other. It turns out we might want some of that even in our software.”

Is the future of work an audio room?

Perhaps the future of work is one in which social collaboration is synonymous with eavesdropping: listening in on other organizations’ and people’s conversations, even those outside of your own organization, while also inviting others into your own.

Think of Esther Perel’s couples therapy sessions in Where Should We Begin?, published anonymously, which take us into the “antechamber of intimate moments,” and how a similar practice might emerge in business. Imagine a platform that offers an audio live-stream featuring team workshops, town halls, 1:1 meetings, lunch conversations, and yes, even job interviews, for the purpose of reciprocal access to the kind of knowledge that is widely considered the most valuable: tacit knowledge. Not the kind captured in textbooks and case studies, but the immediate, live, raw, and honest knowledge that is best shared live and informally.

What if an open audio room were the most consequential form of open innovation, the most radical version of radical transparency, and the most direct performance review, rendering institutional boundaries obsolete, making business truly social, and finally realizing The Cluetrain Manifesto’s vision: markets are conversations, and thus business is a conversation?

On social audio, we can already see businesses taking some steps in that direction.

In Russia, where Clubhouse’s growth has been particularly steep in the past few weeks (even though Vladimir Putin reportedly declined an invitation from Elon Musk to join him in a conversation on Clubhouse), some big companies have begun to run interesting experiments. Sber, the former banking giant turned tech firm, transformed a room that was initially set up as a conversation about PR into a real-time recruitment drive with public job interviews.

Inspired by this practice, it is conceivable that companies will design “learning journeys” that take place fully or at least partly on social audio networks, in the form of curated sets of rooms with private-room facilitation to help participants navigate and reflect on what they’ve heard. Already, some companies use Clubhouse rooms for team bonding, for example Taiwan-based electronics manufacturer Titoma. CEO Keesjan Engelen says: “When one of our team members is running a room, we send out an email to everyone in case people like to join and that has been very popular.”

Now, where do we draw the line between friendly creeping and alarming surveillance? With Oleg, the first Russian-speaking bot on Clubhouse, and fear over the Saudi Arabian government’s monitoring of conversations considered censored or taboo, we need to decide who’s listening when we want to be heard, and for what purpose.

Turn up the unheard

Click this link, close your eyes, and listen.

According to the World Health Organization, at least 2.2 billion people have a near or distance vision impairment. With audio comes more accessibility. Responding to the remote working environment, Microsoft’s “Seeing AI” app, for example, uses your smartphone’s camera to audio-describe the world around you, even to the extent of recognizing people you know, their appearance, and how they are feeling. Combining this ability with spatial sound could be key to making audio-based remote work even more effective and real, enabling us to hear a conversation off to the side while listening to a third person.

A few days ago, Chamber Music New Zealand, accompanied by Ballet Collective Aotearoa, hosted an audio-described concert to give blind or visually impaired patrons the experience of live dance and music. In a similar vein, Tennis Australia, Monash University, and digital agency AKQA have launched Action Audio, an online audio stream designed to make the Australian Open broadcasts accessible to more people.

Brands from P&G to Microsoft and eBay to Amazon, too, are waking up, not least since this year’s Super Bowl, where ads like this were shown. For real progress, says Sheri Lawrence, president of Tylie Ad Solutions, who has made it her priority to include blind and visually impaired people in the ad industry, much more is needed:

“Just like closed captioning, if it’s not mandated, there will always be reasons why people won’t do it,” she says. “To ultimately have, let’s say 100 percent or even 80 or 90 percent [with audio description], it has to be a mandate. However, we’re going to keep pushing and talking and making it a big splash and awareness.”

Playlists, at last!

100 seminal radio moments

(from We Are Broadcasters)

What House Residents listen to

What You Want: a collaborative playlist from A Living Room Named Desire

House Lounge

Want to chill and work alongside others? Join our occasional non-events with live sets played by our musical director Mark Aanderud and our Musicians-in-Residence on Mixlr. Follow us by downloading the app or simply tuning in via your browser:

