Just musing a bit here.
I'm wading through some 3GPP documents about "Multimedia Telephony" that could run over a 3G network, and it's struck me how there's far too much emphasis on bundling different "media" streams, and not enough about bundling "context" streams.
Much of IMS' cumbersome nature seems to be driven by an unshakeable belief that the most important aspect of new telecom services will be hooking together voice, video, text, image etc. Add media streams, drop media streams, direct them to different people, and have the same QoS, control mechanisms & supplementary services apply equally across the board.
The problem is that 99% of calls are "just" voice. The whole of IMS - right down to its quaint 1990s-era name - seems to be geared around the notion that "the answer" to telcos' problems is "add videotelephony & make sure it works as well as voice". Rather than improving & extending the 99% of voice-only sessions, the vast bulk of effort seems to go into hoping that the 1% of multimedia sessions might become 2%, 5% or 10%. This is despite a lack of obvious demand, and despite the fact that a huge % of mobile conversations are made when people can't watch the screen for fear of walking into a lamp-post.
So, for example, in the literature I've been reading, VoIP on a future 3G network is being treated as a "special case" of multimedia telephony. Personally, I think that it would make far more sense for video/multimedia telephony to be thought of as a special-case "enhancement" on top of vanilla VoIP.
And it's not the only possible enhancement - I'd argue that "context" is much more important in VoIP than adding video. Presence information, mood information, timezone, location, device characteristics, reputation, calendar info, messaging/reachability preferences, numbering preferences, multi-device ringing and 100 other things that go into the whole "Voice 2.0" phenomenon that's evolving.
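To make the distinction a bit more concrete, here's a rough sketch in Python of what "voice + context first, other media as an optional extra" might look like as a data model. This is purely illustrative: the class names (`Context`, `VoiceSession`), the fields and the `add_media` method are all hypothetical, invented for this post, and not taken from any 3GPP or IMS specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Context:
    """Hypothetical per-call context; fields echo the list above."""
    presence: Optional[str] = None
    mood: Optional[str] = None
    timezone: Optional[str] = None
    location: Optional[str] = None
    reachability: Optional[str] = None


@dataclass
class VoiceSession:
    """A plain voice call is the base case; context travels with it."""
    caller: str
    callee: str
    context: Context = field(default_factory=Context)
    # Video, images etc. are optional add-ons, not the core of the model.
    extra_media: List[str] = field(default_factory=list)

    def add_media(self, kind: str) -> None:
        """Attach an extra media stream (e.g. "video") as an enhancement."""
        if kind not in self.extra_media:
            self.extra_media.append(kind)

    def is_plain_voice(self) -> bool:
        """True for the common case: voice only, no extra media."""
        return not self.extra_media


# The common case: a voice call enriched with context, no video at all.
call = VoiceSession(
    caller="alice",
    callee="bob",
    context=Context(presence="in a meeting", timezone="Europe/London"),
)

# The rarer case: the same call, with video bolted on afterwards.
call.add_media("video")
```

The point of the sketch is just the shape: context sits on every session by default, while video is something you add to a voice call, rather than voice being a degenerate multimedia session.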
As a result, I think that a lot of 3GPP IMS standards - and therefore much implementation by mainstream mobile operators - are hugely over-complicated and delayed by this dogged insistence on multimedia sessions. Meanwhile, nimbler startups are starting with voice+context at the core, and adding on other "media" as supplementary services on the side. Skype and most fixed operators are good examples. (Yes, fixed videotelephony is being pitched, but here it's being treated as a "special case" and, of course, the lamp-post problem doesn't arise.)
It's also worth noting that VoIP has been designed in to CDMA EV-DO Rev A & is being deployed now, while it's still very much an afterthought on (maybe) HSUPA or even LTE. The 3GSM community is wasting too much time and effort on "multimedia" when instead it should just focus on plain-old VoIP with "multicontext" capabilities.