Some of the themes that Martin Geddes and I cover in our Future of Voice workshops (sign up now for Oct 27th!) are that:
- Voice is much more than the basic 100-year-old product called "telephony", or even the 30-year-old product called "telephony while walking about"
- There is a need for more cleverness in acoustics, especially for Mobile VoIP & VoLTE
- In the network, there is a need for more “cure” as well as "prevention" when it comes to QoS and QoE, eg packet-loss concealment for when packets do (inevitably) get dropped
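To make the "cure" idea concrete, here is a minimal sketch of one common packet-loss concealment strategy, "repeat and fade": when a packet is lost, the receiver replays the last good frame at reduced gain instead of emitting silence. This is purely illustrative - real codec PLC uses far more sophisticated pitch-based waveform interpolation - and the function and parameter names are my own, not from any spec.

```python
def conceal_stream(frames, fade=0.5):
    """frames: a list of audio frames (lists of samples), with None
    standing in for a lost packet. Returns the concealed output."""
    last_good = None
    output = []
    for frame in frames:
        if frame is not None:
            last_good = frame          # remember the last good frame
            output.append(frame)
        elif last_good is not None:
            # Lost packet: repeat the previous frame, attenuated,
            # to mask the gap less audibly than silence would
            last_good = [s * fade for s in last_good]
            output.append(last_good)
        else:
            output.append([])          # loss before any good frame arrived
    return output
```

Even this crude version shows why concealment belongs in the endpoint: only the receiver knows, frame by frame, what actually arrived.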
I've also been quite vocal for some time that bodies like 3GPP and GSMA have so far largely ignored these issues. The very term VoLTE (voice on LTE) tells you all you need to know - it's basically just Telephony 1.1 intended for LTE, not a generalised Voice platform technology. It's ToLTE, not VoLTE. There is also not much in the specs about acoustic requirements on devices, nor about ways to manage the user experience when problems occur.
Certainly, I've not seen anything from organisations of this type asking more fundamental questions like "What is voice communications anyway?", or "Why exactly do people make phone calls, and how is that changing?". Asking these questions and digesting the answers would drive a better understanding of exactly what mobile operators should be doing with voice services to protect against inevitable revenue erosion in the coming years.
These are not unimportant questions. There are technical, commercial and human-behavioural reasons to consider - and it is too late now to start thinking about merely replicating "old telephony" on LTE (or more generally), when others are defining the new user experiences and sources of value in voice communications. The right time to have started work on "vanilla mobile VoIP telephony" was four years ago - the world has moved on rather a lot since then, and fixing yesterday's problems (badly) is an exceptionally risky way to deal with a critical market transition.
For example, an increasing proportion of users now feel that unsolicited phone calls - even in business - are rude. It's a source of irritation - why is this person phoning me when I'm busy?! They resent being made to interrupt their day - or their more important data applications - for intrusive and unwanted calls that force them to divert 100% of their cognitive load (and device activities) to something that is perhaps of no value. The notion that "voice always comes first" harks back 100 years, to the days when voice services were so expensive that all calls were de facto important, and you dropped everything when the phone rang, with a live operator connecting the call. Those days are gone - this is why the escalation method via IM or SMS ("OK for a call now?") is becoming more prevalent.
And then there's a whole plethora of non-telephony voice apps - I've seen a Skype slide with a VoIP-based baby monitor, for example. That's not a phone call (it's built into dedicated hardware for a start), and neither is in-game voice chat or a hundred other innovations with mashups and interesting hybrid forms of voice.
Think of it in human speech terms - only a certain proportion of vocal activity is in the form of two people having a protracted, two-way conversation. There is singing, arguing, presenting, mumbling, rapping, acting, announcing, mimicking, being part of multiple conversations, interrupting and numerous other "use cases" of speech besides a conversation session. The same is true of electronically-communicated speech.
Browsing through the new work items for 3GPP R11, I found some interesting ones, notably "Extensions of Acoustic Test Specifications" and "Codec for Enhanced Voice Services".
I'm not going to do a full analysis of these, but here are a couple of choice quotes from them:
- “Enhanced quality for mixed content and music in conversational applications (for example, in-call music), leading to improved user experience for cases when selection of dedicated 3GPP audio codecs is not possible”
- “Robustness to packet loss and delay jitter, leading to optimized behaviour in IP application environments like MTSI within the EPS”
- “Acoustics and speech processing in terminals have a strong impact on the perceived quality of voice services. The audio test specifications in TS 26.131 and TS 26.132 do not completely reflect all aspects influencing user experience”
The good news is that (a) the 3GPP recognises that what we have today isn't the last word in voice communications, and (b) that all of this is perhaps only 2 years late, rather than 3-4 years as we've seen in the past (maybe someone's been reading this blog or attending some of my conference speeches).
But the bad news is that the context in which all this is taking place is itself evolving. The goalposts are moving very quickly, with problems in multiple parts of the ecosystem - the core, the radio, the device, interconnection points, BSS/OSS and so on.
Not only that, but the very notion of "quality" is itself shifting quickly from the narrow network observable of QoS, to a much richer notion of QoE. If mobile operators are to retain or grow their hundreds of billions of dollars of telephony revenue in the IP era, they will need to demonstrate value at the experience level, which isn't necessarily measurable in terms of three nines or five nines of availability.
Technically, there will be much more variable loss and delay to deal with, driven by faster networks and high-definition audio, with complex interactions between flows at every layer. It's not the absolute loss and jitter that is going to be the problem, but their variability over time.
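The "variability over time" point can be sketched with the interarrival jitter estimator from RFC 3550 (the RTP spec), which smooths the change in per-packet transit time. Two streams can have exactly the same average delay yet report very different jitter - the input values below are made-up illustrations:

```python
def rtp_jitter(transit_times):
    """Estimate interarrival jitter per RFC 3550, section 6.4.1.
    transit_times: per-packet network transit times in milliseconds."""
    jitter = 0.0
    for prev, cur in zip(transit_times, transit_times[1:]):
        d = abs(cur - prev)             # change in transit time
        jitter += (d - jitter) / 16.0   # RFC 3550 exponential smoothing
    return jitter

steady   = [50, 50, 50, 50, 50, 50]   # constant 50ms delay: zero jitter
variable = [40, 60, 40, 60, 40, 60]   # same 50ms mean delay, high jitter
```

Both streams average 50ms of delay, but only the second one will force a jitter buffer to grow (adding latency) or drop packets - which is exactly why averages hide the real problem.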
We are also seeing the extension of voice to new acoustic environments, such as "ambient voice" in the home, devices with different form factors (e.g. smartphones with speakers on the back vs the front), in-vehicle systems, "social voice" based around integrating the TV experience with voice communications, and so forth. Consider what happens when you hit the 'speakerphone' button in some of these contexts - should this be a receiving-device problem, or should there be a signal to the sending device to also do microphone audio processing differently?
The example 3GPP cites of hold music is a good one. Things like pre-recorded announcements might also be better sent using TCP and then played, rather than as a single non-reliable audio stream. So future audio processing of, say, an interaction with a call centre might use a number of different audio processing strategies at different times in the interaction, depending on whether you are talking to the IVR, a live agent or a voice authentication system, or listening to hold music. [Credit to Martin Geddes for this point]
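One way to picture this idea is a per-phase strategy table on the client: conversational phases stay on unreliable real-time transport with aggressive concealment, while pre-recorded content is fetched reliably and pre-buffered. The phase names, strategy fields and values here are entirely hypothetical assumptions of my own, sketched for illustration:

```python
# Hypothetical mapping of call-centre interaction phase to an audio
# handling strategy. None of these names come from 3GPP specs.
PHASE_STRATEGY = {
    "ivr":          {"transport": "rtp", "plc": "basic",  "buffer_ms": 40},
    "agent":        {"transport": "rtp", "plc": "strong", "buffer_ms": 60},
    "hold_music":   {"transport": "tcp", "plc": None,     "buffer_ms": 2000},
    "announcement": {"transport": "tcp", "plc": None,     "buffer_ms": 2000},
}

def strategy_for(phase):
    # Unknown phases fall back to the conversational default,
    # since real-time behaviour is the safe assumption mid-call
    return PHASE_STRATEGY.get(phase, PHASE_STRATEGY["agent"])
```

The interesting design question is who signals the phase change - the call-centre platform knows it is about to play hold music long before the handset could detect it acoustically.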
The bottom line is that voice - and future telephony - is not a "solved" problem, especially with LTE.
It's good that 3GPP is aware of some of the issues and is addressing them, but Martin and I are unconvinced that the scale of the problem has yet been fully grasped, nor how compressed the timelines are before third parties solve them (better) first. We'd like to see much more open and clear discussion of future voice use-cases, and a clear roadmap to go from Telephony to The Future of Voice.
And with that - a final plug for the next Future of Voice workshop on October 27th in central London, where we cover all of these themes and many more. It will be an excellent introduction to all these areas, with plenty of scope for networking and collaborative discussion and problem-solving. Details are here.
Use the promo code VIP25 for a discount on the pricing if booked online, or email information AT disruptive-analysis DOT com for more information or invoice-based payment options.