You may have noticed that video conferencing with people seems subtly off, not as nice as being in person, in a way that's hard to pin down. You may have attributed this to being stuck on a screen, or to low quality video, which both matter, but there are other, possibly even larger contributing factors which you're probably unaware of.
The two big problems are A/V sync and round trip time. A/V sync is exactly what you think it is: The audio and video are offset from each other. Everybody who's worked in video tech is familiar with the fact that every bug causes the A/V sync to go off. The result is that it's usually a bit off, because chasing down every bug takes time and effort, and if you ignore problems below some threshold users don't complain. The trouble is that there's a range of sync error which is too small for users to clearly identify what's wrong but large enough to degrade their experience without their knowing why. This happens constantly.
Round trip time is a bit more subtle. Every step in the chain takes some amount of time before handing off to the next one: Recording, transmission, routing, and playback. Round trip time is the sum of how long it takes for a signal to get from one end to the other and back. Latency in general has gotten far worse over time. Decades ago if you sat in the bleachers of a baseball game listening to the commentary on the radio you would hear the bat smack the ball on the radio in sync with seeing it with your eyes, then a noticeable fraction of a second later hear the sound as it came through the air to your ears. Those days are long gone.
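To make the arithmetic concrete, here is a minimal sketch of how those per-stage delays stack up into a round trip figure. The stage names and numbers below are purely hypothetical illustrations, not measurements of any real system:

```python
# Hypothetical one-way delays, in milliseconds, for each stage of a call.
# These numbers are made up for illustration; real values vary enormously.
one_way_ms = {
    "capture and encode": 50,
    "sender's network": 15,
    "routing / server hops": 40,
    "receiver's network": 15,
    "jitter buffer, decode, playback": 80,
}

one_way = sum(one_way_ms.values())
round_trip = 2 * one_way  # the reply traverses the same kind of chain back

print(f"one-way: {one_way} ms, round trip: {round_trip} ms")  # 200 ms / 400 ms
```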
The practical sources of latency are varied and beyond the scope of this post. (And they're going to make me blow an aneurysm if I start ranting about them. Maybe I'll get into that another time.) But improving the situation is straightforward: Tools for people to audit their setup need to be widely available, and people need to start using them and investigating the problems which show up.
There are two apps which need to be written, both of which should be straightforward for someone with the requisite skills. The first one is for A/V sync. To measure that you need a phone on the broadcasting end and one on the receiving end. The broadcasting phone flashes a light on the screen in sync with making a distinctive click. The receiving phone records video and audio and measures how far off the two are. Ideally you should get the two phones physically next to each other up front to get them calibrated, and the phones should do that calibration automatically and be able to recognize each other's distinctive clicks so the appropriate calibration is used when measuring sync.
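For a sense of what the measurement step might look like, here is a minimal sketch in Python, assuming you have already extracted per-frame brightness values and the audio waveform from the receiving phone's recording. The function name, the naive peak detection, and the toy data are all my own assumptions; a real app would detect many flash/click pairs and average them:

```python
import numpy as np

def estimate_av_offset_ms(frame_times_s, frame_brightness,
                          audio_samples, sample_rate,
                          calibration_offset_ms=0.0):
    """Estimate A/V sync error from one flash/click pair.

    Positive result means the click was heard after the flash was seen.
    calibration_offset_ms is whatever offset was measured with the two
    phones sitting right next to each other.
    """
    # Treat the brightest frame as the moment the flash was seen.
    flash_time_s = frame_times_s[int(np.argmax(frame_brightness))]
    # Treat the loudest sample as the moment the click was heard.
    click_time_s = int(np.argmax(np.abs(audio_samples))) / sample_rate
    raw_offset_ms = (click_time_s - flash_time_s) * 1000.0
    return raw_offset_ms - calibration_offset_ms

# Toy data: flash at t=2.000 s in a 30 fps recording, click at t=2.080 s in
# 48 kHz audio -> reports roughly 80 ms of desync.
frame_times = np.arange(0, 4, 1 / 30)
brightness = np.zeros_like(frame_times)
brightness[60] = 1.0                      # frame 60 is at exactly 2.0 s
audio = np.zeros(4 * 48000)
audio[int(2.080 * 48000)] = 1.0
print(estimate_av_offset_ms(frame_times, brightness, audio, 48000))
```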
The second app is for measuring round trip time. Again there should be a mobile app, this one for making a ping sound and responding to a ping sound with a pong sound, again both with distinctive clicks. This does not require camera or screen support but should also ideally have the two phones calibrated while physically right next to each other. A funny thing about sound is that it’s slow enough that they should be able to do a decent job of calculating their distance even in the same room, which makes for a neat science demonstration.
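Here is a rough sketch of the arithmetic that second app would be doing, assuming the pinging phone can timestamp when it played its ping and when it detected the pong, and that a side-by-side calibration run already measured the responding phone's turnaround delay. The names and numbers are hypothetical:

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate, at room temperature

def round_trip_ms(ping_played_s, pong_heard_s, responder_turnaround_ms):
    """Round trip time through the system under test, in milliseconds."""
    total_ms = (pong_heard_s - ping_played_s) * 1000.0
    # Subtract the responder's own delay between hearing the ping and
    # playing its pong, as measured during side-by-side calibration.
    return total_ms - responder_turnaround_ms

def distance_m(ping_played_s, pong_heard_s, responder_turnaround_ms):
    """Same-room science demo: what's left over is acoustic travel time,
    and sound covers roughly 34 cm per millisecond."""
    acoustic_ms = round_trip_ms(ping_played_s, pong_heard_s,
                                responder_turnaround_ms)
    return (acoustic_ms / 1000.0) * SPEED_OF_SOUND_M_S / 2.0  # one-way distance

# Toy numbers: pong heard 350 ms after the ping, 300 ms responder turnaround
# -> 50 ms round trip through the conferencing system being tested.
print(round_trip_ms(0.0, 0.350, 300.0))
# In one room with no system in between: 320 ms measured, 300 ms turnaround
# -> 20 ms of acoustic round trip, so the phones are about 3.4 m apart.
print(distance_m(0.0, 0.320, 300.0))
```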
If you had access to these tools and used them you’d be horrified at how awful all the systems you’re using are. Once someone writes them there should be real pressure to fix everything and systems may get better very rapidly. So please, somebody build these things. The world needs them, and I for one will happily use them on everything and go around telling everyone else they should as well.
Is there any need to calibrate the second app? Seems like you could just measure the roundtrip time.
Don't know much about this stuff, but indeed find it funny that video conferencing is still off in 2023. I had not considered A/V sync and latency as responsible for the poor experience and am curious to look for supporting evidence, though the thesis sounds plausible. I'm definitely interested in hearing the aneurysm-triggering rant!