Basics of Echo Cancellation - I Speak Software

There is some cool math and lots of advanced stuff going on in echo cancellation, but for this post I’ll just talk about some basic concepts and what you can do to make it work better if you ever need it.

Back when I worked on a telepresence product I got to play at being an audio engineer. Not that I was really great at it, but I got to work with some engineers who were really, really great at it, which made it fun and educational. So I want to write down a few things I learned that may be useful to remote workers trying to get the best out of their video conference sessions.

The old phrase “garbage in, garbage out” applies to anything you do with audio processing. Echo cancellation is an effort to scrape out some of the garbage that your microphone is picking up from your speakers.

To illustrate, we can start with a simple example. We have two people, Bob and Alice, trying to start a conversation. Alice says “Hello, Bob”, and Bob replies before Alice finishes.

Alice: Hello, Bob
Bob  :        Hola, Alice

You can see that the words are overlapping. In a normal face-to-face conversation that would be a little rude, but both people could understand what was said. In a video conference, though, there is effectively a third participant: the conference software, which is speaking on behalf of participants and listening at the same time.

Now imagine if the speaker from the video conference is loud and being picked up by the microphone. What the mic would hear is a combination of what Alice said (played out on the speaker) and Bob’s reply. One way to picture that would be:

Mic (hears): Hello, BHoobla, Alice

In practice, one easy way to avoid this is to use a headset or directional mic, or otherwise isolate the sound played from the speaker from the sound picked up by the mic. But in a conference room or larger area, people tend to not want to wear a lapel mic or a headset for some reason. And sometimes the room itself is acoustically ‘hard’ and echoes the sound back, even if the speaker isn’t pointed directly at the mic.

An echo canceller tries to fix this by keeping a copy of the signal that was played out, which it then subtracts from what the mic picked up to try to isolate the new signal. We could picture it like:

Mic (hears): Hello, BHoobla, Alice
Canceller -: Hello, B o b
Result     :         H o la, Alice
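
In signal terms, the picture above is a plain subtraction. Here is a minimal Python sketch of that idea, assuming the canceller already knows exactly how loud the echo is. The `echo_gain` value, the sample rate, and the two test tones are made up for illustration; a real canceller has to estimate all of this.

```python
import math

RATE = 8000  # samples per second (assumed for this sketch)

def tone(freq_hz, n):
    """Return n samples of a sine tone, standing in for a voice."""
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n)]

n = 400
alice = tone(440, n)  # far end: played out of the speaker
bob = tone(310, n)    # near end: talking into the mic
echo_gain = 0.6       # hypothetical speaker-to-mic loudness

# The mic hears Bob plus a quieter copy of Alice.
mic = [b + echo_gain * a for a, b in zip(alice, bob)]

# The canceller subtracts its copy of what was played out.
result = [m - echo_gain * a for m, a in zip(mic, alice)]

worst_error = max(abs(r - b) for r, b in zip(result, bob))
print(f"largest difference from Bob alone: {worst_error:.2e}")
```

With a perfect copy and a perfect gain, what is left is Bob alone, give or take floating point rounding. The rest of this post is about why real rooms are never that kind.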

The canceller typically uses a mathematical algorithm called a Fast Fourier Transform (FFT) to convert the signals into frequency components that can be matched and subtracted quickly. It is some interesting math, but beyond a “basic” blog post. 🙂

While the idea of cancelling out the signal sounds good, in practice there are limitations, and some tuning is needed to make it work well. The copy the canceller holds is only an approximation of what actually reaches the mic; it doesn’t account for things like distortion in the speakers or different volumes. Because of this, good speaker placement is still important to avoid giving a really strong ‘echo’ signal to the mic. So is a room that doesn’t produce a lot of echo, unlike a gymnasium or the many conference rooms with large glass windows and hard whiteboards that bounce sound back to the mic.

Having multiple surfaces bounce a sound back to the mic at slightly offset times affects how well an echo canceller can work. What is played out as one sound may bounce off 3 or 4 surfaces slightly out of sync, and the larger the room, the more out of sync the bounces are, which can cause more artifacts in the resulting audio. Multiple echoes aren’t usually a big deal for a home conferencing setup in a small office, but even a single echo can become distracting.

One tuning that helps with all this is giving the echo canceller an impression of the room acoustics. One part of this is simply a time offset for when to try to subtract the signal. It’s a function of how long it takes the sound to play out through the speaker and be picked up by the mic. If this tuning is a little ahead of or behind the real echo, then the result will have some strange artifacts left where the signal didn’t line up right. In the example, notice that “Bob” and “Hola” result in two ‘o’s back to the mic, so imagine if the canceller can’t tell which ‘o’ is the echo and removes both.
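
One common way to find that time offset is to slide the played-out signal against the mic signal and see where they line up best (a cross-correlation). The sketch below is only an illustration under made-up conditions: random noise stands in for speech, and the delay of 37 samples and the echo gain of 0.5 are hypothetical room values, not measurements.

```python
import random

def estimate_delay(played, mic, max_delay):
    """Return the lag (in samples) where mic best lines up with played."""
    span = len(played) - max_delay
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        score = sum(played[i] * mic[i + lag] for i in range(span))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

rng = random.Random(0)
played = [rng.uniform(-1, 1) for _ in range(2000)]
true_delay, echo_gain = 37, 0.5  # hypothetical room values

# Mic signal: a little background noise plus a delayed, quieter copy
# of what was played out.
mic = [0.05 * rng.uniform(-1, 1) for _ in played]
for i in range(true_delay, len(played)):
    mic[i] += echo_gain * played[i - true_delay]

def leftover_energy(delay):
    """Energy left in the mic signal after subtracting the echo at this delay."""
    return sum(
        (mic[i] - echo_gain * played[i - delay]) ** 2
        for i in range(delay, len(mic))
    )

found = estimate_delay(played, mic, max_delay=100)
print("estimated delay:", found)
print("leftover, aligned:   ", round(leftover_energy(found), 2))
print("leftover, off by one:", round(leftover_energy(found + 1), 2))
```

Being off by even one sample leaves far more energy behind than the aligned case, which is the "strange artifacts" problem in numbers.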

Another part is a shaping of the signal to be subtracted, to account for distortions from speakers or echo in the room from multiple sources. Multiple echoes of the same signal overlapped could be thought of as stretching out the signal, so a good tuning can try to shape the echo signal to match better. (Hard to illustrate with just letters, this needs a real picture.)
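
Since letters don’t illustrate the shaping well, here is a small Python sketch instead. It uses a normalized least-mean-squares (NLMS) adaptive filter, a common textbook technique for this job; I am not claiming it is what any particular product uses. The `room` list is a hypothetical set of bounces, random noise stands in for the far-end audio, and nobody is talking at the near end, which keeps the learning clean.

```python
import random

def nlms_cancel(played, mic, taps=8, step=0.5, eps=1e-8):
    """Adapt a short filter so filtered `played` matches the echo in `mic`."""
    weights = [0.0] * taps
    recent = [0.0] * taps  # latest played samples, newest first
    cleaned = []
    for x, m in zip(played, mic):
        recent = [x] + recent[:-1]
        echo_guess = sum(w * r for w, r in zip(weights, recent))
        error = m - echo_guess  # what is left after cancelling
        power = sum(r * r for r in recent) + eps
        # Nudge each weight in the direction that shrinks the error.
        weights = [w + step * error * r / power for w, r in zip(weights, recent)]
        cleaned.append(error)
    return weights, cleaned

rng = random.Random(1)
played = [rng.uniform(-1, 1) for _ in range(4000)]

# Hypothetical room: direct path after 1 sample, weaker bounces after 3 and 4.
room = [0.0, 0.6, 0.0, 0.3, 0.1]
mic = [
    sum(g * played[i - k] for k, g in enumerate(room) if i - k >= 0)
    for i in range(len(played))
]

weights, cleaned = nlms_cancel(played, mic)
print("learned shape:", [round(w, 3) for w in weights])
print("echo left at the end:", max(abs(e) for e in cleaned[-100:]))
```

The learned weights end up close to the `room` values, so the filter has effectively measured the time offset and the stretched-out shape of the echo in one go. When someone at the near end is talking at the same time, the learning gets much harder, which is part of why real cancellers need care.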

One thing echo cancellation is not going to do is remove echo from your voice (or Bob’s voice in the example) in a big, hard room. It can’t, because the only signal it gets is what comes in the mic(s) or what it plays out the speakers. To the canceller, the echo that comes from your voice bouncing off a hard wall then to the mic is the same as your voice going straight to the mic and there is no separation for it to cancel out. There are other ways to deal with an echoing room, such as changing mic placement to isolate your voice or changing the room (and no, just hanging thick curtains is not going to help).

Just for fun, here is how much of a mess that could be:

Alice(spkr): Hello, Bob
Bob        :        Hola, Alice
Room Echo  :         Hola, Alice
Mic (hears): Hello, BHHooobllaa,, AAlliiccee
Canceller -: Hello, B  o  b
Result     :         HH oo llaa, AAlliiccee

Yes, garbage in, garbage out. 🙂 It would be possible to have a second mic set up behind Bob that was picking up just the room echo and cancelling that out of the result, but again it is much easier to just get a cleaner signal with better mic placement.
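
The same point can be made with numbers. This sketch gives the canceller the best possible case, an exact copy of the speaker echo, and checks what is left. All delays and gains are hypothetical, and random noise stands in for both voices.

```python
import random

rng = random.Random(2)
n = 1000
alice = [rng.uniform(-1, 1) for _ in range(n)]  # far end, played on speaker
bob = [rng.uniform(-1, 1) for _ in range(n)]    # near end, talking in room

def delayed(signal, delay, gain):
    """Return a quieter copy of signal, shifted later by `delay` samples."""
    return [0.0] * delay + [gain * s for s in signal[: len(signal) - delay]]

speaker_echo = delayed(alice, 5, 0.6)  # hypothetical speaker-to-mic path
room_echo = delayed(bob, 40, 0.5)      # Bob's voice off a hard wall

mic = [b + r + s for b, r, s in zip(bob, room_echo, speaker_echo)]

# Best case: the canceller knows the speaker path exactly.
result = [m - s for m, s in zip(mic, speaker_echo)]

# What remains is Bob plus his own room echo, not Bob alone.
leftover = [r - b for r, b in zip(result, bob)]
gap = max(abs(l - e) for l, e in zip(leftover, room_echo))
print("leftover matches Bob's room echo to within", gap)
```

Alice is gone from the result, but Bob’s room echo is untouched, because nothing the canceller was given describes it.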

Footnote: Another way to avoid needing an echo canceller is a ‘push to talk’ model. Everyone keeps their mic muted unless they want to speak, at which point they push a button that effectively cuts off their speaker while they are talking. Most “walkie talkies” and CB radios work this way, and many Mumble users I know use this mode to avoid the horrible echo they encounter when using it in openSUSE.