Why the time of voice as a user interface has finally arrived

Written by David Shapton | May 26, 2023 8:00:00 AM

It's good to talk. Voice could be the next big user interface. It won't be the first time, but it might be the first time it actually works. Imagine dispensing with those horrible menus on cameras.

Voice interfaces and I go back a long way. It's always been an intriguing idea, but one that has, so far, failed dismally.

I went to a press conference in the late 90s where a household-name tech company was showing its smart home concept. There was a computer you could talk to, apparently, so I gave it a go, with something I thought it would know about. I said:

"Speech recognition".

After an awkward pause, it offered its response:

"Beach wreck ignition".

I'm being harsh here. These were very early days for any kind of voice control. That didn't stop companies from trying to save money by automating their customer services. In 2006 I called the local cinema chain to see when the new James Bond film was showing. It asked me, "What town or city are you in?"

I said, "Nottingham".

It replied confidently, "I think you said you are in Basingstoke."

Speak and Spell

Talking computers are not new. Some of the first demonstrations were weird and science-fictiony. Then came the first commercial product in the form of the Texas Instruments Speak and Spell in 1978. It was undoubtedly a credible spelling tutor, if a little nasal-sounding.

Apart from when my father imported a car from Belgium with voice announcements in Flemish (and no prospect of changing the language), I don't remember much progress until Siri. Incidentally, we learned how to say "Temperature 22.5 degrees Celsius" and "Rear Left Door Open" in Flemish, which has always stood me in good stead when visiting the Low Countries.

Talking to machines has always seemed a good idea, for a few seconds, until you try it. An entire game show was built on the precariousness of this kind of user interface. Called "The Golden Shot", the show was produced by ATV for ITV between 1967 and 1975. The show's human star was UK comedian Bob Monkhouse, but the real star was (I'm not making this up) a crossbow mounted on a TV camera, controlled by a blindfolded camera operator. What could possibly go wrong? In the game, viewers would phone in and, watching the feed from the weaponised camera, give instructions to the blindfolded crossbow pointer. It sounded like "Right a bit; left a bit; up a bit, left a bit... FIRE!" At this point, a bolt would issue from the crossbow and hit an exploding target. Or something else. If voice control of this critical (and hazardous!) activity had been precise and efficient, there wouldn't have been a game show. The fact that it even made it to the air makes me think that the health and safety precautions must have been rigorous, in contrast to the "anything goes" attitude that was prevalent at the time.

Even Siri and Alexa aren't all that good. Once you get past the novelty, you realise it's more like using a command line than a graphical interface. There are thousands of "applications" for Siri, but I don't know anyone who uses them (apart from podcasts and radio stations) because you have to know the exact sequence of commands. So it literally feels like typing xcopy diry1 diry2 /s or something like that. (For anyone younger than about 40, that was an MS-DOS command.)

But that's all about to change, and dramatically!

The look and the sound of the voice

Unsurprisingly, that's thanks to generative AI, and large language models in particular. But how would it apply to cameras?

What's certain is that cameras need better user interfaces. There's the argument that a specialist device needs a specialist interface, but that doesn't mean they have to be unintuitive and sometimes positively hostile. A rattlesnake has a better user interface than some camera menus I have encountered.

So a voice interface must offer far more than simply a transliteration of the menu system. Unless a voice command system can "get" exactly what you're saying, then it's likely to be more, not less, cumbersome than the traditional menu structure.

So, what would it look like? There's only one way to find out. I asked an AI model:

"Imagine you're a voice-activated control system for a Canon R5. Tell me what voice prompts I need to use to make you take a long-exposure shot of a scene in the UK in June, ten minutes after sunset."

Within a few seconds came this reply:

"Here are some voice prompts that would make the Canon R5 camera take a long exposure shot ten minutes after sunset in June in the UK:

"Take a long exposure shot of the sunset."
"Take a long exposure shot of the sky."
"Take a long exposure shot of the landscape."
"Take a long exposure shot of the city."
"Take a long exposure shot of the ocean."

You can also specify the shutter speed, aperture, and ISO settings for your long exposure shot. For example, you could say:

"Take a long exposure shot of the sunset with a shutter speed of 30 seconds, an aperture of f/16, and an ISO of 100."
"Take a long exposure shot of the sky with a shutter speed of 1 minute, an aperture of f/8, and an ISO of 200."
"Take a long exposure shot of the landscape with a shutter speed of 2 minutes, an aperture of f/5.6, and an ISO of 400."
"Take a long exposure shot of the city with a shutter speed of 4 minutes, an aperture of f/4, and an ISO of 800."
"Take a long exposure shot of the ocean with a shutter speed of 8 minutes, an aperture of f/2.8, and an ISO of 1600."

The Canon R5 camera is a powerful tool that can be used to capture stunning long exposure shots. By using voice commands, you can easily control the camera's settings and take amazing photos without having to take your eyes off the scene."

Obviously, that's a bit generic, and I suspect it doesn't have specific knowledge of the R5's menu system.
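The explicit prompts in that reply do at least have a regular shape, which makes them easy to turn into structured settings. Here is a minimal sketch in Python of what that step might look like. Everything in it is an assumption made for illustration: ExposureSettings and parse_command are hypothetical names, the patterns only cover the phrasing used in the reply above, and nothing here talks to a real camera or to any Canon software.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExposureSettings:
    """Hypothetical container for whatever the spoken command specified."""
    subject: Optional[str] = None
    shutter_seconds: Optional[float] = None
    aperture: Optional[float] = None
    iso: Optional[int] = None


# Multipliers for the time units that appear in the example prompts.
_UNIT_SECONDS = {"second": 1.0, "minute": 60.0}


def parse_command(text: str) -> ExposureSettings:
    """Pull subject, shutter speed, aperture and ISO out of one sentence.

    Anything the sentence does not mention is left as None, so the
    camera (or the photographer) can decide it later.
    """
    settings = ExposureSettings()
    lowered = text.lower()

    # "shot of the sunset with ..." -> "sunset"
    subject = re.search(r"shot of (?:the |a )?([a-z ]+?)(?: with|[.,]|$)", lowered)
    if subject:
        settings.subject = subject.group(1).strip()

    # "shutter speed of 2 minutes" -> 120.0 seconds
    shutter = re.search(r"shutter speed of (\d+(?:\.\d+)?) (second|minute)s?", lowered)
    if shutter:
        settings.shutter_seconds = float(shutter.group(1)) * _UNIT_SECONDS[shutter.group(2)]

    # "f/5.6" -> 5.6
    aperture = re.search(r"f/(\d+(?:\.\d+)?)", lowered)
    if aperture:
        settings.aperture = float(aperture.group(1))

    # "ISO of 400" or "ISO 400" -> 400
    iso = re.search(r"iso (?:of )?(\d+)", lowered)
    if iso:
        settings.iso = int(iso.group(1))

    return settings


if __name__ == "__main__":
    print(parse_command(
        "Take a long exposure shot of the sunset with a shutter speed of "
        "30 seconds, an aperture of f/16, and an ISO of 100."
    ))
```

This is exactly the command-line style of voice control described earlier: it works only when the speaker sticks to the expected wording. Say "half a minute" instead of "30 seconds" and the shutter speed is silently missed, which is the gap a language model is supposed to close.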

I drilled down to see if I could get any further. I asked it:

"Now imagine that the same voice-driven camera can interpret vague artistic intentions. What would some typical prompts look like?"

And this is what it said:

"Here are some typical prompts that a voice-driven camera could interpret to create a vague artistic intention:

"Take a photo that captures the feeling of loneliness."
"Take a photo that is both beautiful and disturbing."
"Take a photo that makes me think about the future."
"Take a photo that makes me feel nostalgic."
"Take a photo that is both simple and complex."

The camera would then use its artificial intelligence to interpret these prompts and create a photo that matches the artist's intention. This could be done by using different settings, such as shutter speed, aperture, and ISO, or by using different composition techniques. The result could be a photo that is truly unique and expressive.

Here are some more specific examples of vague artistic intentions that could be interpreted by a voice-driven camera:

"Take a photo that captures the feeling of loneliness." The camera could use a long shutter speed to blur the background and focus on the subject, making them appear isolated and alone.
"Take a photo that is both beautiful and disturbing." The camera could use a combination of bright and dark colors, or could use a subject that is both beautiful and unsettling.
"Take a photo that makes me think about the future." The camera could use a futuristic setting, or could use a subject that is symbolic of the future.
"Take a photo that makes me feel nostalgic." The camera could use a setting that is familiar to the artist, or could use a subject that reminds them of their childhood.
"Take a photo that is both simple and complex." The camera could use a simple composition, but could use a complex subject or technique.

These are just a few examples of the many ways that a voice-driven camera could be used to create vague artistic intentions. The possibilities are endless, and the results could be truly unique and expressive."

That's a reasonable effort, although it's unclear how much artistic insight there is. With a bit more prompt iteration, it could come up with a potent tool. And, unbelievably, it could probably write the code to make it happen.

I think we are likely to see these kinds of interfaces on cameras very soon.
