Noise Reduction Workflow

Status
Not open for further replies.

karma17

Well-known member
I am trying to reduce background noise in some video I shot and I'm using the iZotope RX7 plug-in. I've done some preliminary tests and it seems to work pretty well.

My question is:

When you are doing noise reduction with audio, do you normalize or increase the volume first, or do the noise reduction first, then apply any normalization? Any pros or cons either way?

Unfortunately, for the audio I have, it is low, and so I had to boost it just to get it up to an acceptable level and that's when I really noticed the background noises.

Any thoughts appreciated.
 
I’m not a specialist sound mixer / editor, but I’ve had a good deal of experience working with sound as an editor. I would recommend applying iZotope RX last, after making any gain/level adjustments, as in my experience that gives the cleanest results. Good luck! Jason
 
Personally I don't know of anyone who works professionally in sound post for features who "normalizes" any sound files; that is what mixing is for. That said, I have had really low production sound that needed an overall boost just to be usable. As a general rule, any destructive gain I do to a file I do before I use noise reduction. I have done it both ways, and more often than not boosting after NR brings out artifacts that you didn't hear before. NR after the boost gets a better noise print and seems to do a smoother job. In your case I would also go to 24 bit if the files are not already. Boosting and NR are taking a heavy toll on your sound quality, and the extra bits will give you a better chance of a smoother result. You are not "upping" the quality of the original material, but the gain and NR will have some space to work with in those extra bits and, as a rule, will sound better than keeping it at 16 bit.
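
For anyone wondering how much extra room 24 bit gives, here is a rough sketch in Python. It uses the textbook formula for an ideal quantizer driven by a full-scale sine (about 6.02 dB per bit plus 1.76 dB); the function name is just for illustration, and real converters and files will fall short of these figures.

```python
import math

def dynamic_range_db(bits):
    """Theoretical SNR of an ideal N-bit quantizer for a full-scale sine.

    Equivalent to the familiar 6.02 * N + 1.76 dB rule of thumb.
    """
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

for bits in (16, 24):
    print(bits, "bit:", round(dynamic_range_db(bits), 2), "dB")
```

This works out to roughly 98 dB for 16 bit and 146 dB for 24 bit. Since every 6 dB of boost pulls the quantization floor up by about one bit's worth, a heavily boosted 16-bit file has noticeably less margin than the same processing done at 24 bit.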

You can of course do more than one NR pass. So you could do NR then boost, then another NR to clean up what showed up after the boost. I have done this once or twice, but I find generally it is better to boost and then do however many NR passes on the boosted signal.

A number of lighter NR passes often sounds better than one where you try to "fix" everything.
 
Thanks! This is very helpful. Yes, it's definitely not a feature film, just a short little project I was helping someone with but I wanted to clean up the audio a bit.
 
Unless the volume change is uniform (for instance, raising the level of the entire file by 10 dB), I would apply noise reduction before changing levels, and definitely before mixing; otherwise the noise print will not remain accurate throughout. The adaptive mode may work OK for light or moderate noise issues, though.
As Scott stated, NR can be applied more than once, which is actually better if a significant amount is needed.
 
Absolutely normalize first before noise reduction and before mixing. This will give the NR software maximum dynamic range with which to lower the noise floor- boosting the signal to noise ratio. This is pure mathematics. After you've done NR, you can lower the level in case any subsequent filters/effects/mixing have issues with clipping.

iZotope RX7 is of course excellent for NR, and the recently updated NR filter in Premiere Pro works amazingly well compared to their prior NR solutions (for those not needing/wanting to spend $400 for a plugin (might be needed for more challenging material though)).
 
Over 40 years of professional sound work, and I'm with Scott on this. The only time I normalize a file in 2018 is if it's in the client-provided spec sheet. For example, the spec requires a file normalized to a set standard before being encoded as u-law for a phone system.

Switching to old guy voice... Back in my day the early Digidesign Sound Designer and Sound Designer II setup sometimes needed files to be normalized and DC offset removed.
A/D converters have come a long way since then.
 
There are also two types of normalization, but many DAWs and NLEs do not offer both options. 'Peak' scales the entire file based on the file's highest peak. 'RMS' scales to an average level, which can push the peaks past full scale, so the audio could either be digitally clipped or seriously compressed. Peak normalization cannot exceed 0 dBFS.
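
To make the peak vs. RMS difference concrete, here is a minimal sketch in plain Python. The function names, the test signal, and the 0.2 RMS target are my own illustrative assumptions, not how any particular DAW or NLE implements it.

```python
import math

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

def rms(samples):
    """Root-mean-square level of the samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_peak(samples, target=1.0):
    """Apply one uniform gain so the highest peak lands on target."""
    gain = target / peak(samples)
    return [s * gain for s in samples]

def normalize_rms(samples, target_rms):
    """Apply one uniform gain so the RMS lands on target_rms (peaks unchecked)."""
    gain = target_rms / rms(samples)
    return [s * gain for s in samples]

# Hypothetical signal: mostly quiet, with one loud transient.
signal = [0.01] * 99 + [0.5]

by_peak = normalize_peak(signal)
by_rms = normalize_rms(signal, target_rms=0.2)

print(peak(by_peak))  # 1.0, i.e. exactly full scale
print(peak(by_rms))   # greater than 1.0: would clip unless limited
```

With peak normalization the gain is set by the loudest sample, so nothing can go over. With an RMS target the gain is set by the average, so a file with big transients can end up over full scale, and the tool then has to clip or limit.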
 
With 32-bit or 64-bit floating point per sample, it's less of an issue regarding normalization (max sample(s) 1.0 or -1.0) because we have a lot of bits to work with.

What audio tool performs compression when normalization is selected? Mathematically, normalization is dividing all samples by the peak value, not any form of average/RMS value etc, so it's impossible to get clipped samples: the peak samples will have magnitude of 1.0.

If working with 16-bit integer, it becomes more important to work with normalized data, as we have limited bits to work with. If we convert 16-bit to float, normalize, perform NR, and convert back to 16 bit, very low level noise artifacts can be quantized and discarded. If we didn't normalize first, those noisy bits will still hang around. Again, we want to maximize the possible signal to noise ratio in the bits we're working with. Feel free to post an example where not normalizing first sounds better than normalizing first. At best they'll sound the same (where the audio tool is working with at least 32-bit float (not likely to be much if any audible difference with 64-bit float (doubles); might be able to see a difference via FFT/spectral analysis, however no one's going to listen to that :)).
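
Here is a rough Python sketch of the limited-bits point. It does not model NR at all; it only compares the rounding error from storing a quiet signal as 16-bit and boosting it afterwards against boosting in float first and then storing as 16-bit. The tone, its level, and the helper names are assumptions made up for the illustration.

```python
import math

FULL_SCALE = 32767  # largest positive 16-bit sample value

def to_int16(x):
    """Quantize a float in [-1.0, 1.0] to a 16-bit integer (round to nearest)."""
    return max(-32768, min(32767, round(x * FULL_SCALE)))

def to_float(n):
    """Convert a 16-bit integer sample back to float."""
    return n / FULL_SCALE

def rms_error(reference, test):
    """RMS difference between two equal-length signals."""
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference))

# Hypothetical quiet recording: 440 Hz tone at 48 kHz peaking near 0.05 (about -26 dBFS).
quiet = [0.05 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(4800)]
gain = 1.0 / max(abs(s) for s in quiet)  # peak-normalizing gain, about 20x
ideal = [s * gain for s in quiet]        # what we would get with unlimited precision

# Path 1: normalize in float, then store as 16-bit.
normalize_first = [to_float(to_int16(s * gain)) for s in quiet]
# Path 2: store as 16-bit at the low level, then boost later.
boost_later = [to_float(to_int16(s)) * gain for s in quiet]

err_first = rms_error(ideal, normalize_first)
err_later = rms_error(ideal, boost_later)
print("extra error from boosting later (dB):", 20 * math.log10(err_later / err_first))
```

The extra error in the second path should come out close to the boost itself in dB, because the rounding error baked in at the low level gets amplified along with the signal.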

In practical terms when editing dialog for video, when performing NR on low-level dialog which later must be raised for the mix and/or final output, I've found that it works better to normalize the track first (peak values 1/-1; no compression), then perform the NR and listen to the result. For quiet dialog with no music or Foley, checking the result with good closed-back headphones is helpful (though sometimes I'll also check with Stax electrostatics since the office is quiet and nothing can match the accuracy and detail of electrostatics). If it sounds good at peak level (normalized), we're good to go. If not normalizing first, it may sound OK, but when having to raise the level later, noise/artifacts can become noticeable for the final mix and we have to go back and deal with NR again: wasting time, and time is money... As for temporarily turning up the volume for critical listening- I've found it best to leave the level constant during the editing session, set to a safe/comfortable level for what the final mix will be. For internet delivery, I make sure to hit peaks around -3dB (even 0 dB is fine) and rarely use compression (way over used IMO).

Another trick is to use lighter levels of NR if artifacts limit the amount of NR that can be applied. Then use a dynamics filter to further lower or cut low level sound where the noise lives. This will make noise between words/speaking much less or completely gone, and if there's background music/Foley, this method can be used instead of NR in some cases. Noise gates can also work as long as the filter has an intelligent/smooth transition option to prevent abrupt audible effects.
 
lol, if you disagree why not post a rebuttal with facts/evidence/examples/math so other readers can see your point? Years of work and what one does for a living doesn't trump facts/math/evidence, does it? There's an echo chamber joke here :)

Another reason to normalize first is that it makes using dynamics processing easier when set up as an expander/noise-gate to cut low level sound/noise. The shape can look like an S (using splines etc.), where the 'good sound' part on the right is a straight line (no change to original signal), though linear everywhere can also work ok as with these noise gate examples: https://www.presonus.com/learn/tech...tting-Started-With-Compressors-Gates-and-More , and https://medium.com/@jowie/a-podcaster-s-guide-to-noise-reduction-e8cad9fc21f4
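
For readers who want to see the idea in code, here is a minimal downward-expander sketch in Python. It uses a simple hard-knee, piecewise-linear curve rather than the spline-shaped S curve described above, and the threshold, ratio, and attack/release values are illustrative assumptions, not settings taken from Premiere Pro's Dynamics Processing or any other tool.

```python
import math

def db_to_lin(db):
    return 10 ** (db / 20)

def lin_to_db(x):
    return 20 * math.log10(max(x, 1e-12))  # floor avoids log10(0)

def expander_gain_db(level_db, threshold_db=-40.0, ratio=3.0):
    """Static curve: 0 dB of gain above threshold (signal untouched);
    below it, every dB under threshold becomes `ratio` dB under."""
    if level_db >= threshold_db:
        return 0.0
    return (level_db - threshold_db) * (ratio - 1.0)

def expand(samples, sample_rate=48000, threshold_db=-40.0, ratio=3.0,
           attack_ms=5.0, release_ms=50.0):
    """Apply the curve to a smoothed envelope so gain changes are gradual."""
    attack = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    env = 0.0
    out = []
    for s in samples:
        mag = abs(s)
        coeff = attack if mag > env else release
        env = coeff * env + (1.0 - coeff) * mag
        out.append(s * db_to_lin(expander_gain_db(lin_to_db(env), threshold_db, ratio)))
    return out

# Steady levels for a quick check: "speech" at 0.5, "noise" at 0.001 (-60 dBFS).
loud_out = expand([0.5] * 4800)
quiet_out = expand([0.001] * 4800)
print(loud_out[-1], quiet_out[-1])
```

With normalized audio the "good sound" sits well above the threshold and passes unchanged, while the low-level material between words is pushed down further. The attack and release smoothing is what stands in for the "intelligent/smooth transition" mentioned earlier.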
 
And maybe you can take your own advice. Perception is not math and you hear with your brain, so practice and experience do come into play. It's a bad rabbit hole many inexperienced people coming into sound spend some time running down. What works on paper is great when you are dealing with machines and algorithms, but it often falls short when dealing with humans. There are a ton of subjects where raw science is pretty much it. I'm not talking about that. Your facts may be valid and still not work well for perception. Also, the biggest mistake in science is asking the wrong question. The right answer to the wrong question is often a much bigger waste of effort than the wrong answer to the right question.

As a couple of examples. A collection of mics that have the same noise level on paper will have different perceived noise levels in practice. The reason is in the specific "shape" of the noise. Some mics have done a much better job of masking the perceived noise even though they are no less noisy.

In your other post you say you will normalize to 0dBfs for the web, but in fact science will tell you that if you peak at 0dBfs you will be clipping part of the spectrum because of D/A converter overshoot. You don't perceive the issue so you ignore the science. Your experience has told you it doesn't matter for what you do.

Which is kind of my point. Science is important and those spec sheets and sound theory are important to know. But that is just the base. You won't interpret that science and theory effectively till you have had experience seeing how the hard math translates to perception. I'm only talking about the arts here.

On the general topic, I don't generally use normalization, partly out of habit and remembering how badly early normalizers worked. I do realize that the math is so good on today's computers that there is not likely to be a perceived problem. And for some SFX I have at times intentionally normalized over zero just to clip the peaks (adds some impact to things like punches).
The "science" is that on very loud sounds your ears can't handle it and they transmit that distortion to the brain. A loud sound with slightly clipped peaks "sounds" louder than it is because it sounds closer to what a human experiences with very loud sounds. I learned that from experience...

Also, I spent a lot of time at one point in sound theory and I never heard a decent explanation for a very common perception. And that is: one sounds like one and two sounds like two, but three sounds like a bunch. The closest I came was in discussions with Walter Murch and his theory that humans can only perceive three distinct things at any one time. Which actually fits with the next level, which is about 90% true: four sounds like a mess.

And finally if you work in the arts you are studying the science of perception pretty much constantly. So in a pretty real sense experience IS science.
 
OK cool- can you show evidence to support your position, specifically: "In your other post you say you will normalize to 0dBfs for the web, but in fact science will tell you that if you peak at 0dBfs you will be clipping part of the spectrum because of D/A converter overshoot.". An example of one way to prove your point would be to record a high quality speaker with a high quality mic on a D/A converter you suspect will clip and cause audible artifacts. Another way would be to use an oscilloscope and show the clipping graphically. And finally, show example audio with one normalized to 0dB, and another carefully compressed to sound as loud but not normalized to 0dB and see if people can tell the difference (without cheating by loading into an audio editor).

I admit that using a Sound Devices DAC + Stax electrostatic headphones, I can't hear any undesirable artifacts. Nor can I hear any issues on an iPhone speaker or headphones, nor on Tannoy monitors, Sony 7506, Focal, etc. No complaints from others re: sound artifacts. Everything I produce ends up on the web as MP3/AAC, thus I suggest to anyone delivering to these formats to maximize the dynamic range of their audio, including normalizing for those targets. This is final product delivery vs. recording sound elements and delivering to others for additional processing and mixing. Are you guys creating final deliverables to the general public or to others for additional processing?

Debating perception is impossible, as everyone perceives reality differently, right? Isn't that the point of science, to determine elements of reality that we can agree on are 'real', and the point of math and physics?

For noise perception and psychoacoustics, we can perform double-blind tests on human subjects, as was done when developing MP3/AAC to determine how the human ear+brain act as filters of sound and perceptions of frequencies.

To your question about one sounds like one, etc., this is something that can be analyzed using machine learning / AI. In previous discussions I linked to papers/videos where machine learning/AI were parsing voices from crowds, known as the 'cocktail party problem' in psychoacoustics, and folks here didn't seem to be interested (down voting the post etc.).

EDIT: while some DACs that clipped a 0dB signal were an issue years ago, the much bigger issue was zero crossing distortion: https://www.stereophile.com/content/pdm-pwm-delta-sigma-1-bit-dacs . While a 0dB signal is perfectly valid and any hardware producing erroneous output (clipping/distortion) is technically buggy, it makes sense to work around known bugs until they are fixed (the same is true for software). The work-around for the zero crossing bug would be to have no zero crossings, resulting in a massive loss in dynamic range (halve the signal and add 1/2MaxSample DC offset)! Sounds like these issues were solved around 20 years ago. 0dB and zero crossings won't be a problem today, especially for content consumed online (not likely going through vintage audio hardware).
 
No need, since as a sound professional I know that work has already been done, see science!
What you want to search for is "inter-sample peaks" and "true peak metering"
But a quickie link you may like is this one from the iZotope Tech Blog

There is also this AES paper
The paper I was searching for, and I think was one of the first to bring the issue up to the sound community was by Bob Katz but I couldn't find it.

The basics are actually pretty "duh" in that if you have two samples at 0dBfs and your signal is a sine wave then to complete the curve the converter would need to go past 0dBfs. 3dB is about the max overshoot so that is where the -3dBfs comes from and what Bob Katz was recommending in the paper I can't find.
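
A quick way to check the roughly 3 dB figure is the sketch below, in Python. It assumes a pure sine at exactly one quarter of the sample rate whose samples land 45 degrees away from the crests, and an ideal band-limited reconstruction; other signals will overshoot by different amounts.

```python
import math

def sample_sine(freq_ratio, phase, count):
    """Sample a unit-amplitude sine; freq_ratio is frequency / sample_rate."""
    return [math.sin(2 * math.pi * freq_ratio * n + phase) for n in range(count)]

# Every sample falls at +/-0.707 of the crest, never on the crest itself.
samples = sample_sine(0.25, math.pi / 4, 16)
sample_peak = max(abs(s) for s in samples)

# Peak-normalize so the *samples* read 0 dBFS.
gain = 1.0 / sample_peak

# The underlying continuous sine had amplitude 1.0, so after the gain
# its reconstructed (true) peak is simply `gain`.
true_peak_db = 20 * math.log10(gain)
print(round(true_peak_db, 2))  # 3.01
```

So a sample-peak meter reads 0 dBFS while the reconstructed waveform crests about 3 dB higher, which is the gap true-peak meters are designed to catch.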

The problem is actually more of an issue with cheap gear, ie internet playback etc.
 
Here's an example with normalized audio (Schoeps CMC641 into the C300 II), and no other audio edits (gap between speaking was removal of a 1 year old making noise):


The same video using normalized audio with the addition of an FFT EQ to reduce sibilance and an expander (Dynamics Processing in Premiere Pro CC) to remove background noise:


The second video is an example of normalizing audio and removing noise using an expander, which, when it works, tends to sound much better than FFT/spectral-based noise reduction processes. The challenge with FFT/spectral noise reduction methods is artifacts being generated (not present in the original sound), whereas the challenge with expanders is tuning the transition between speaking and silence to make it seamless.

I experimented in the editor between -3dB and 0dB and could not hear any issues at 0dB compared to -3dB using Focal or Stax headphones using a Sound Devices DAC.
 
Thanks for the links, I'm familiar with sampling and reconstruction (FFTs and Fourier synthesis)- I've written software to do this (in real-time). While it's true older hardware had issues with clipping, the point is that 0dB is a valid input signal. It's up to the playback hardware to play it correctly. If it needs to attenuate due to analog filter or sampling limitations, that's the responsibility of the hardware manufacturer. The good news is that modern hardware is spec compliant- i.e. can handle 0dB (and zero crossings!) with no issues. The difference between 0 and -3dB is minor, so if one is concerned about ancient audio hardware playback issues, normalizing to -3dB isn't a big deal.

Two samples at 0dB sounds like clipping and/or a square wave, which I'd expect to sound distorted/square-wave like.

I didn't hear any issues at 0dB in the above real-world examples provided.
 
Now I remember why I blocked your posts...

Do what you want. Though I will point out that on the first of these last two points you are ignoring the science for your personal perceptions. And on the second you are ignoring the engineering and science. Two samples at zero won't even show as a clip on most meters; the most conservative software requires three in a row, and most mastering engineers would put the perceivable point somewhere between four and six samples. Even that is not going to sound like a "square wave", it's going to sound like a click. A two sample square wave would be at 24 kHz and nobody but dogs would hear it.
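
As a sketch of the "three in a row" rule in Python: the code below counts consecutive samples at or beyond full scale and flags a clip only when the run is long enough. The function names, the default of three, and the 48 kHz rate are assumptions for illustration, not any specific meter's behaviour.

```python
def longest_full_scale_run(samples, full_scale=1.0):
    """Length of the longest run of consecutive samples at or beyond full scale."""
    longest = current = 0
    for s in samples:
        if abs(s) >= full_scale:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def is_flagged_as_clip(samples, min_run=3):
    """True when enough consecutive full-scale samples occur to count as a clip."""
    return longest_full_scale_run(samples) >= min_run

two_in_a_row = [0.2, 1.0, 1.0, 0.3]
four_in_a_row = [0.2, 1.0, 1.0, 1.0, 1.0, 0.3]
print(is_flagged_as_clip(two_in_a_row))   # False
print(is_flagged_as_clip(four_in_a_row))  # True

# A square wave with a period of two samples sits at half the sample rate.
sample_rate = 48000  # assumed
print(sample_rate / 2)  # 24000.0
```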

Anyway I won't see your next post so go for it.
 
In psychology this is called projection: "And in the second you are ignoring the engineering and science."

Repeating samples at max scale will create a square wave (provided a cycle is happening between +/- max scale): https://en.wikipedia.org/wiki/Square_wave
Distortion_Square-Wave.gif


Another reason a (max sample) square wave was discussed is the fact that it will overshoot when played back on analog hardware (I was agreeing with your point of two repeated samples causing overshoot), as seen with Fourier synthesis we cannot exactly represent a square wave with sums of sine waves (see link above and example below):
Fourier_series_for_square_wave.gif
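
The overshoot in a truncated Fourier series (the Gibbs phenomenon) can be reproduced with a few lines of Python. This is only a sketch of the math for a +/-1 square wave with period 1; the function names and the step count are my own choices, and it says nothing about how any particular DAC behaves.

```python
import math

def square_partial_sum(t, harmonics):
    """Fourier series of a +/-1 square wave (period 1), truncated to
    the first `harmonics` odd harmonics."""
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * (2 * k + 1) * t) / (2 * k + 1)
        for k in range(harmonics)
    )

def peak_of_partial_sum(harmonics, steps=5000):
    """Scan the first half period on a fine grid and return the maximum value."""
    return max(square_partial_sum(i / (2 * steps), harmonics) for i in range(steps + 1))

for n in (5, 25, 100):
    print(n, "harmonics -> peak", round(peak_of_partial_sum(n), 3))
```

The peak stays at roughly 1.18 no matter how many harmonics are added, i.e. about 9% of the full jump from -1 to +1, so a band-limited version of a full-scale square wave always rises above the nominal level near the edges.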


Sample rate doesn't matter and a dog's hearing is irrelevant: it's math and physics of Fourier synthesis and related to digital to analog converters in practice.

The examples posted in this thread illustrate the concept that normalizing before NR or expanding (Dynamics Processing) works fine even at 0dB: modern hardware won't overshoot/clip/distort; it's now handled correctly (same as the zero crossing bug).

Anyone can try these methods and see the truth for themselves.
 