Discussion:
[music-dsp] Algorithms for finding seamless loops in audio
Element Green
2010-11-25 03:26:09 UTC
Permalink
Hello music-dsp list,

I'm the author of a SoundFont instrument editing application called
Swami (http://swami.sourceforge.net). A while back an interested
developer added a loop finding algorithm which I integrated into the
application. This feature is supposed to generate a list of start/end
loop points which are optimal for "seamless loops".

The original algorithm was based on autocorrelation. There were many
bugs in the implementation and I was having trouble understanding how
it functioned, so I wrote a new algorithm which currently does not use
autocorrelation. The new algorithm seems to come up with good
candidates for "seamless loops", but runs about five times slower than
the old one. At least it does not suffer from the bugs that often made
the old algorithm's results unpredictable.

I have limited knowledge in the area of DSP, so I thought I would seek
advice on this list to determine if the new algorithm makes sense, if
anyone has any ideas on ways to optimize it or if there are better
ways of tackling this task.

First off, is autocorrelation even a solution for this? That is what
the old algorithm used and it seemed to me that sample points closer
to the loop start or end should be given higher priority in the
"quality" calculation, which was not done in that case.

Inputs to the algorithm:
float sample_data[]: Audio data array of floating point samples
normalized to -1.0 to 1.0
int analysis_size: Size of analysis window (number of points compared
around loop points; the loop point is in the center of the window).
int half_analysis_size = analysis_size / 2: the center point of the
window (only truly the center for odd analysis_size values).
int win1start: Offset in sample_data[] of 1st search window (for loop
start point).
int win1size: Size of 1st search window (for loop start point).
int win2start: Offset in sample_data[] of 2nd search window (for loop
end point).
int win2size: Size of 2nd search window (for loop end point).


Description of the new algorithm:
A multiplication "window" array of floats (let's call it
analysis_window[]) is created which is analysis_size in length and
has its peak value in the center of the window; each point away
from the center is half the value of its nearer neighbor, and all
values in the window sum to 0.5. The value 0.5 was chosen because the
maximum difference between two sample points is 2 (1 - -1 = 2), so
this results in a maximum "quality" value of 1.0 (worst quality).

The two search windows are exhaustively compared with two loops, one
embedded in the other. For each loop start/end candidate a quality
factor is calculated. The quality value is calculated from the sum of
the absolute differences of the sample points surrounding the loop
points (analysis window size) multiplied individually by values in the
analysis_window[] array.

C code:

/* Calculate fraction divisor (sum of all power-of-2 weights) */
for (i = 0, fract = 0, pow2 = 1; i <= half_analysis_size; i++, pow2 *= 2)
{
  fract += pow2;
  if (i < half_analysis_size) fract += pow2;  /* mirrored 2nd half */
}

/* Even windows are asymmetrical, subtract 1 */
if (!(analysis_size & 1)) fract--;

/* Calculate values for 1st half of window and center of window */
for (i = 0, pow2 = 1; i <= half_analysis_size; i++, pow2 *= 2)
  analysis_window[i] = pow2 * 0.5 / fract;

/* Copy values for 2nd half of window */
for (i = 0; half_analysis_size + i + 1 < analysis_size; i++)
  analysis_window[half_analysis_size + i + 1]
    = analysis_window[half_analysis_size - i - 1];


for (win1 = 0; win1 < win1size; win1++)
{
  startpos = win1start + win1;

  for (win2 = 0; win2 < win2size; win2++)
  {
    endpos = win2start + win2;

    for (i = 0, quality = 0.0; i < analysis_size; i++)
    {
      diff = sample_data[startpos + i - half_analysis_size]
        - sample_data[endpos + i - half_analysis_size];
      if (diff < 0) diff = -diff;
      quality += diff * analysis_window[i];
    }

    ...
  }
}


Optimization ideas:
If a single value (integer or float) could be calculated which
"describes" an analysis_size window of audio, then these values could
be calculated for each window individually and then used as a
pre-filter prior to performing the calculation using the
analysis_window. I tested this using a simple signal power
calculation, sum(sample_val * sample_val); it increased the speed
substantially but threw away some good loop candidates.
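For illustration, the power pre-filter idea could look something like this sketch (windows_may_match and its tolerance scheme are hypothetical, not the code actually tested):

```c
#include <stddef.h>
#include <math.h>

/* Sum of squared samples over an analysis-sized window: a cheap
 * single number describing the window's energy. */
static double window_power(const float *samples, size_t size)
{
    double sum = 0.0;
    for (size_t i = 0; i < size; i++)
        sum += (double)samples[i] * samples[i];
    return sum;
}

/* Pre-filter: only run the expensive weighted comparison when the two
 * windows' energies are close.  The tolerance is a tuning knob; too
 * tight and good candidates get discarded, as noted above. */
static int windows_may_match(const float *a, const float *b,
                             size_t size, double tolerance)
{
    double pa = window_power(a, size);
    double pb = window_power(b, size);
    return fabs(pa - pb) <= tolerance * (pa + pb + 1e-12);
}
```

The per-window powers can be precomputed once per sample offset (or kept as a running sum), so the pre-filter costs O(1) per candidate pair.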

Thanks in advance for any ideas on this subject. Seems like the
result of this algorithm could be something useful to add to the
music-dsp code archive.

Best regards,
Joshua Element Green
Didier Dambrin
2010-11-25 05:51:03 UTC
Permalink
IMHO "finding loop points" is the wrong problem to solve; it's better to
"make something loop" instead, as (ideally) you're only gonna find the least
bad loop points, and nothing guarantees that there's anything loopable.
I would use crossfading, and possibly autocorrelation to auto-select the
part to repeat & crossfade (to avoid a volume dip (& timbre change) due to
phasing).
I would also reject too-small looping sections, as a "click-free" loop is
one thing but a loop that doesn't sound repeating is another thing.
Even if you wanna find loop points, IMHO it's still better to find them
without caring about noticeable clicks, and then do a little crossfade.
Unless you really can't touch your source sample.




> --
> dupswapdrop -- the music-dsp mailing list and website:
> subscription info, FAQ, source code archive, list archive, book reviews,
> dsp links
> http://music.columbia.edu/cmc/music-dsp
> http://music.columbia.edu/mailman/listinfo/music-dsp


Alan Wolfe
2010-11-25 06:05:07 UTC
Permalink
Agreed here (:

in 2d graphics and skeletal animation, making tileable 2d art and
seamless blends are basically the same problems.

in both areas they MAKE the things seamless instead of trying to find
how they could be seamless.

in 2d graphics this comes up via texturing (probably obvious), and in
skeletal animation, it is literally in the form that Didier talks
about; they literally cross fade animation weights from an old
animation to a new animation to make a seamless transition.

Of course, even with these seamless techniques, you can still notice
issues like in 2d textures there might be specific features that
really stand out in a texture and you can easily see it repeating. In
3d animation, even though a blend may be seamless it can still look
wrong.

I only bring these parallels up because there is something in 2d
graphics called "wang tiling" which can make some really organic
looking tileable textures.

i think the same technique could apply to audio (and even skeletal
animation) but how you would apply the idea, i'm not sure 100% hehe.

my 2 monopoly dollars! :P

On Wed, Nov 24, 2010 at 9:51 PM, Didier Dambrin <***@skynet.be> wrote:
> IMHO "finding loop points" is the wrong problem to solve, it's better to
> "make something loop" instead, as (ideally) you're only gonna find the least
> bad loop points, nothing guarantees that there's anything loopable.
> I would use crossfading, and possibly autocorellation to auto-select the
> part to repeat & crossfade (to avoid a volume dip (& timbre change) due to
> phasing).
> I would also reject too small looping sections, as a "click-free" loop is
> one thing but a loop that doesn't sound repeating is another thing.
> Even if you wanna find loop points, IMHO it's still better to find them not
> caring about noticable clicks, and then do a little crossfade. Unless you
> really can't touch your source sample.
Johannes Kroll
2010-11-25 10:14:29 UTC
Permalink
On Thu, 25 Nov 2010 06:51:03 +0100
"Didier Dambrin" <***@skynet.be> wrote:

> IMHO "finding loop points" is the wrong problem to solve, it's better to
> "make something loop" instead, as (ideally) you're only gonna find the least
> bad loop points, nothing guarantees that there's anything loopable.
> I would use crossfading, and possibly autocorellation to auto-select the
> part to repeat & crossfade (to avoid a volume dip (& timbre change) due to
> phasing).
> I would also reject too small looping sections, as a "click-free" loop is
> one thing but a loop that doesn't sound repeating is another thing.
> Even if you wanna find loop points, IMHO it's still better to find them not
> caring about noticable clicks, and then do a little crossfade. Unless you
> really can't touch your source sample.

IMHO, if the goal were to make something loop automatically
(without user input), then you would be right. But if you have a user,
probably a musician with a trained ear, who says "I want it to loop
roughly at this spot, now find me some good loop points nearby", then
you shouldn't modify the sample. In this case the OP's algorithm
sounds fine (given that it finds good loop points, which I didn't
test). It would be even better to give the user the few best loop
points to select from. Crossfading can still be done afterwards if
they want.
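As a sketch, keeping the few best candidates during the search could look like this (the struct, names, and list size are hypothetical):

```c
#include <stddef.h>

#define MAX_CANDIDATES 8

typedef struct {
    int start, end;      /* candidate loop points */
    double quality;      /* lower is better, as in the OP's metric */
} LoopCandidate;

/* Insert a candidate into a list kept sorted by ascending quality,
 * dropping the worst entry when the list is full.  Returns the new
 * candidate count. */
static size_t add_candidate(LoopCandidate *list, size_t count,
                            int start, int end, double quality)
{
    size_t i = count < MAX_CANDIDATES ? count : MAX_CANDIDATES - 1;

    /* Reject if no better than the worst entry of a full list. */
    if (count == MAX_CANDIDATES
        && quality >= list[MAX_CANDIDATES - 1].quality)
        return count;

    /* Shift worse entries down to make room. */
    while (i > 0 && list[i - 1].quality > quality) {
        list[i] = list[i - 1];
        i--;
    }
    list[i].start = start;
    list[i].end = end;
    list[i].quality = quality;
    return count < MAX_CANDIDATES ? count + 1 : count;
}
```

The inner search loop would call add_candidate() for each start/end pair, and the user would be shown the surviving list at the end.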
Element Green
2010-11-26 04:49:01 UTC
Permalink
On Thu, Nov 25, 2010 at 2:14 AM, Johannes Kroll <***@lavabit.com> wrote:
> On Thu, 25 Nov 2010 06:51:03 +0100
> "Didier Dambrin" <***@skynet.be> wrote:
>
>
> IMHO, if the goal would be to make something loop automatically
> (without user input) then you would be right. But if you have a user,
> probably a musician with a trained ear, who says "I want it to loop
> roughly at this spot, now find me some good loop points nearby", then
> you shouldn't modify the sample. In this case the OP's algorhithm
> sounds fine (given that it finds good loop points, which I didn't
> test). It would be even better to give the user the few best loop
> points to select from. Crossfading can still be done afterwards if
> they want.
>

I also agree with this approach. It may be best to have two different
algorithms: one used when the user just wants to find a loop within a
rather large portion of an instrument sample, for which
autocorrelation seems to be a good option, and a second, something
like what is currently implemented, which, as you clarified, would be
good for making smaller adjustments and for which a cross fade may not
be necessary if a quality loop is already possible. A cross fade could
of course still be performed for either option.

Element
Element Green
2010-11-26 04:40:00 UTC
Permalink
Hello Didier,

On Wed, Nov 24, 2010 at 9:51 PM, Didier Dambrin <***@skynet.be> wrote:
> IMHO "finding loop points" is the wrong problem to solve, it's better to
> "make something loop" instead, as (ideally) you're only gonna find the least
> bad loop points, nothing guarantees that there's anything loopable.
> I would use crossfading, and possibly autocorellation to auto-select the
> part to repeat & crossfade (to avoid a volume dip (& timbre change) due to
> phasing).
> I would also reject too small looping sections, as a "click-free" loop is
> one thing but a loop that doesn't sound repeating is another thing.
> Even if you wanna find loop points, IMHO it's still better to find them not
> caring about noticable clicks, and then do a little crossfade. Unless you
> really can't touch your source sample.
>

I like this approach. Cross fading was on my list of features to add,
and could be performed after a good loop candidate is found. I
still think it's a good idea to have a loop candidate finding
algorithm, as you mentioned. I can see that autocorrelation would
probably work much better if a cross fade was performed afterwards to
remove the possible click.

While reading your reply it occurred to me that a cross fade
"audition" toggle button would be nice, which automatically cross
fades the played back audio sample with the currently defined loop,
but is temporary and does not actually change the audio sample until
the user applies the cross fade.

Element
robert bristow-johnson
2010-11-26 05:33:58 UTC
Permalink
On Nov 25, 2010, at 11:40 PM, Element Green wrote:

> Cross fading was on my list of features to add,
> which could be performed after a good loop candidate is found. I
> still think its a good idea to have a loop candidate finding
> algorithm, as you mentioned. I can see that autocorrelation would
> probably work much better if a cross fade was performed afterwards to
> remove the possible click.


depending on how big your "window" is, i think a better term for this
is *cross-correlation* not autocorrelation. it's a single stream of
audio so in a sense of the word, it *is* autocorrelation, but what i
normally think of, with that semantic is something where the lag is no
bigger or not much bigger than the analysis window of either loop-end
region of the audio and the loop-begin.

if the loop points are separated by a much longer time (number of
samples) than the size (in samples) of the two slices of audio being
correlated, it's really cross-correlation. and you might find poor
correlation given all lags that you're looking at. in fact, doing
cross-correlation from one part of the tone or sound to another part
that has a rapid change in amplitude envelope might fool your
correlation into thinking there is a good match when there really
isn't (because the amplitude is increasing, then the cross-correlation
increases, but not necessarily because of a good match).

so, instead of either cross or autocorrelation, you might want to
consider AMDF between the loop end and potential candidates to loop
back to. instead of looking for a maximum, you're looking for a
minimum and a very low minimum means a good match (or a bad match
during a very low signal level).
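for reference, a minimal AMDF sketch in C (a plain mean of absolute differences; the low-signal-level guard mentioned above is left out):

```c
#include <stddef.h>
#include <math.h>

/* Average Magnitude Difference Function between two equal-length
 * slices of audio: the mean of |a[i] - b[i]|.  Lower is a better
 * match; 0 means the slices are identical. */
static double amdf(const float *a, const float *b, size_t size)
{
    double sum = 0.0;
    for (size_t i = 0; i < size; i++)
        sum += fabs((double)a[i] - (double)b[i]);
    return sum / (double)size;
}
```

unlike correlation, a rising amplitude envelope inflates this measure rather than masquerading as a good match, which is the point being made above.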

find good loop points, then crossfade.

another thing about cross fading is that there is something you can do
to adapt a little to better or poor loop points. if the loop points
(and the window surrounding them) match well, then you're doing a
crossfade between coherent audio and a constant voltage crossfade is
indicated (when the crossfade is half done, both the fade out and fade
in envelopes are at 50%). if the loop points are not well matched
(but it's the best loop points your correlation function can find),
then you want to do a crossfade that is closer to a constant power
crossfade where both fade in and fade out envelopes are at 70.7% at
the midpoint of the crossfade. there is a way to define the optimal
crossfade function for any correlation between 0 (when it's like
crossfading white noise to white noise) to 100% (like crossfading a
perfectly periodic waveform to a similarly appearing portion of the
waveform at loop start).

does any of this make any sense?

can i ask what the application is? (i may have missed it, but i'll
look at earlier posts.) if it's looping for sound/instrument samples,
this is an analysis thing that is not real-time and we can consider
finding the best loop-begin points for a large variety of possible
loop-end points. then pick the pair that looks best, given whatever
your measure of good is. but in a (time-domain) real-time pitch
shifter, having so many choices may not be available to you. you
might find yourself in a situation where your loop-end is pretty well
defined, you have to find a place to splice to and take the best that
you can get from that.

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Element Green
2010-11-30 01:50:15 UTC
Permalink
On Thu, Nov 25, 2010 at 9:33 PM, robert bristow-johnson
<***@audioimagination.com> wrote:
>
> depending on how big your "window" is, i think a better term for this is
> *cross-correlation* not autocorrelation.  it's a single stream of audio so
> in a sense of the word, it *is* autocorrelation, but what i normally think
> of, with that semantic is something where the lag is no bigger or not much
> bigger than the analysis window of either loop-end region of the audio and
> the loop-begin.
>
> if the loop points are separated by a much longer time (number of samples)
> than the size (in samples) of the two slices of audio being correlated, it's
> really cross-correlation.  and you might find poor correlation given all
> lags that you're looking at.  in fact, doing cross-correlation from one part
> of the tone or sound to another part that has a rapid change in amplitude
> envelope might fool your correlation into thinking there is a good match
> when there really isn't (because the amplitude is increasing, then the
> cross-correlation increases, but not necessarily because of a good match).
>
> so, instead of either cross or autocorrelation, you might want to consider
> AMDF between the loop end and potential candidates to loop back to.  instead
> of looking for a maximum, you're looking for a minimum and a very low
> minimum means a good match (or a bad match during a very low signal level).

Looking at the equation here for AMDF:
http://mi.eng.cam.ac.uk/~ajr/SpeechAnalysis/node72.html

It seems like the algorithm I came up with independently is very
similar. The absolute value of the difference of the sample points is
taken as with AMDF. Prior to summing the values together, though, I'm
multiplying by the window I described before (with a peak in the
center where the loop point is), giving samples closer to the loop
point more weight.

In practice this seems to work quite well and I'm going to leave it as
is for now. It seems reasonably fast and straightforward.

>
> find good loop points, then crossfade.
>
> another thing about cross fading is that there is something you can do to
> adapt a little to better or poor loop points.  if the loop points (and the
> window surrounding them) match well, then you're doing a crossfade between
> coherent audio and a constant voltage crossfade is indicated (when the
> crossfade is half done, both the fade out and fade in envelopes are at 50%).
>  if the loop points are not well matched (but it's the best loop points your
> correlation function can find), then you want to do a crossfade that is
> closer to a constant power crossfade where both fade in and fade out
> envelopes are at 70.7% at the midpoint of the crossfade.  there is a way to
> define the optimal crossfade function for any correlation between 0 (when
> it's like crossfading white noise to white noise) to 100% (like crossfading
> a perfectly periodic waveform to a similarly appearing portion of the
> waveform at loop start).
>
> does any of this make any sense?
>

I'm not sure I'm following you.
Olli Niemitalo
2010-11-30 17:25:44 UTC
Permalink
On Tue, Nov 30, 2010 at 3:50 AM, Element Green
<***@users.sourceforge.net> wrote:
> I hadn't given this much thought and just assumed a linear cross fade
> (0-100%) would be the way to do it

I'll try to elucidate with two extreme examples. If the signal is
white noise, then a linear cross-fade will make a dip in the volume,
centered at the cross fade. This is because the phases of the two
signals (one being faded in and one being faded out) don't match and
there will be partial cancellation of components of the signal. On
average the phase difference will be 90 deg. I recall my frustration
using a sample editor which only offered linear cross-fade. It would
often create this kind of a dip. The other example: If the signal is
sinusoidal, then you can adjust the loop length to obtain a perfect
match. In this case the linear cross fade will work perfectly: It does
nothing, as the signals are already identical around the loop points.

For uncorrelated signals you'd like to use something like this instead
of a linear cross fade, to compensate for the dip:

x = 0..1 is the time position inside the cross-fade
f(x) is the gain function for the signal that is being faded in
f(1-x) is the gain function for the signal that is being faded out

These constraints should be satisfied:

f(0) = 0
f(1) = 1
f(x)^2 + f(1-x)^2 = 1

The last constraint above normalizes power or amplitude for the
mixture of uncorrelated signals.

Some functions that are applicable:

f(x) = sqrt(x)
f(x) = sin(pi/2*x)
f(x) = x/sqrt(x^2+(1-x)^2)

The last one is probably my favorite, as it has the same gain ratio
f(x)/f(1-x) as a linear cross fade. It doesn't start the fade in as
abruptly as the others.

This is quite a similar problem as finding a nice pan law.

-olli
Element Green
2010-12-01 17:10:24 UTC
Permalink
On Tue, Nov 30, 2010 at 9:25 AM, Olli Niemitalo <***@iki.fi> wrote:
>
> I'll try to elucidate with two extreme examples. If the signal is
> white noise, then a linear cross-fade will make a dip in the volume,
> centered at the cross fade. This is because the phases of the two
> signals (one being faded in and one being faded out) don't match and
> there will be partial cancellation of components of the signal. On
> average the phase difference will be 90 deg. I recall my frustration
> using a sample editor which only offered linear cross-fade. It would
> often create this kind of a dip. The other example: If the signal is
> sinusoidal, then you can adjust the loop length to obtain a perfect
> match. In this case the linear cross fade will work perfectly: It does
> nothing, as the signals are already identical around the loop points.
>
> For uncorrelated signals you'd like to use something like this instead
> of a linear cross fade, to compensate for the dip:
>
> x = 0..1 is the time position inside the cross-fade
> f(x) is the gain function for the signal that is being faded in
> f(1-x) is the gain function for the signal that is being faded out
>
> These constraints should be satisfied:
>
> f(0) = 0
> f(1) = 1
> f(x)^2 + f(1-x)^2 = 1
>
> The last constraint above normalizes power or amplitude for the
> mixture of uncorrelated signals.
>
> Some functions that are applicable:
>
> f(x) = sqrt(x)
> f(x) = sin(pi/2*x)
> f(x) = x/sqrt(x^2+(1-x)^2)
>
> The last one is probably my favorite, as it has the same gain ratio
> f(x)/f(1-x) as a linear cross fade. It doesn't start the fade in as
> abruptly as the others.
>
> This is quite a similar problem as finding a nice pan law.
>

Thanks for your detailed explanation. It is much clearer to me now.
It seems like, ideally, the user would be able to adjust the cross-fade
curve dynamically, with the ability to observe the resulting amplitude
through the cross fade, before applying any destructive changes to the
sample data. That certainly gives me some ideas to chew on.

> -olli
> --
> dupswapdrop -- the music-dsp mailing list and website:
> subscription info, FAQ, source code archive, list archive, book reviews, dsp links
> http://music.columbia.edu/cmc/music-dsp
> http://music.columbia.edu/mailman/listinfo/music-dsp
>

Best regards,
Element Green
robert bristow-johnson
2010-11-30 20:58:54 UTC
Permalink
On Nov 29, 2010, at 8:50 PM, Element Green wrote:

> On Thu, Nov 25, 2010 at 9:33 PM, robert bristow-johnson
> <***@audioimagination.com> wrote:
>>
>> depending on how big your "window" is, i think a better term for
>> this is
>> *cross-correlation* not autocorrelation. it's a single stream of
>> audio so
>> in a sense of the word, it *is* autocorrelation, but what i
>> normally think
>> of, with that semantic is something where the lag is no bigger or
>> not much
>> bigger than the analysis window of either loop-end region of the
>> audio and
>> the loop-begin.
>>
>> if the loop points are separated by a much longer time (number of
>> samples)
>> than the size (in samples) of the two slices of audio being
>> correlated, it's
>> really cross-correlation. and you might find poor correlation
>> given all
>> lags that you're looking at. in fact, doing cross-correlation from
>> one part
>> of the tone or sound to another part that has a rapid change in
>> amplitude
>> envelope might fool your correlation into thinking there is a good
>> match
>> when there really isn't (because the amplitude is increasing, then
>> the
>> cross-correlation increases, but not necessarily because of a good
>> match).
>>
>> so, instead of either cross or autocorrelation, you might want to
>> consider
>> AMDF between the loop end and potential candidates to loop back
>> to. instead
>> of looking for a maximum, you're looking for a minimum and a very low
>> minimum means a good match (or a bad match during a very low signal
>> level).
>
> Looking at the equation here for AMDF:
> http://mi.eng.cam.ac.uk/~ajr/SpeechAnalysis/node72.html
>
> It seems like the algorithm I came up with independently is very
> similar. The absolute value of the difference of the sample points is
> taken as with AMDF. Prior to summing the values together though, I'm
> multiplying by the window I described before (with a peak in the
> center where the loop point is), giving samples closer to the loop
> point more weight.

it's AMDF (with a better window). that link above shows the AMDF with a
rectangular window. usually "no window" means "rectangular window".
and usually we think that rectangular ain't the best kind of window.
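As a sketch, the windowed AMDF being discussed might look like this (the triangular weighting and all names are my assumptions; the thread only specifies a window peaking at the loop point):

```python
def loop_quality(sample_data, start, end, analysis_size):
    """Windowed AMDF between the regions around the loop start and end points.

    Lower is better. The window peaks at the loop point (the center of the
    analysis window), giving nearby samples more weight. The triangular
    shape here is an assumption; the thread only specifies a centered peak.
    """
    half = analysis_size // 2
    total = 0.0
    for i in range(analysis_size):
        offset = i - half
        # Triangular weight: 1.0 at the loop point, falling off linearly.
        weight = 1.0 - abs(offset) / (half + 1)
        diff = abs(sample_data[start + offset] - sample_data[end + offset])
        total += weight * diff
    return total
```

With a sine of period 100 samples, loop points 300 samples apart line up exactly and score near zero, while misaligned points score much worse.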

you can also square the difference (which is the same thing as
squaring the abs value). i might call this the ASDF. a continuous-
time representation (used for pitch detection) is depicted as Eq. (1)
in the Wavetable-101.pdf paper you can find at the music-dsp site. in
fact, you can raise the abs value of the difference to any power
that's a positive number. this is really the Lp norm, where AMDF would
be the L1 norm and ASDF would be the L2 norm. the higher the power,
the more it emphasizes the bigger errors (de-emphasizing the little
errors).

one more thing about the L2 norm and ASDF is that it can be related
directly to a form of autocorrelation. the ASDF is really an upside-
down autocorrelation (with an offset). so the ASDF will have nulls or
valleys precisely where the autocorrelation has peaks.
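The Lp-norm family described above could be sketched as follows (names are mine; the final 1/p root is omitted since it does not move the minima):

```python
def lp_difference(x, y, p):
    """Mean |x[i] - y[i]|^p over two equal-length slices.

    p = 1 gives the AMDF, p = 2 the ASDF; larger p emphasizes the bigger
    errors and de-emphasizes the little ones, as described above.
    """
    assert len(x) == len(y) and p > 0
    return sum(abs(a - b) ** p for a, b in zip(x, y)) / len(x)
```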

> In practice this seems to work quite well and I'm going to leave it as
> is for now. It seems reasonably fast and straightforward.
>
>>
>> find good loop points, then crossfade.
>>
>> another thing about cross fading is that there is something you can
>> do to
>> adapt a little to better or poor loop points. if the loop points
>> (and the
>> window surrounding them) match well, then you're doing a crossfade
>> between
>> coherent audio and a constant voltage crossfade is indicated (when
>> the
>> crossfade is half done, both the fade out and fade in envelopes are
>> at 50%).
>> if the loop points are not well matched (but it's the best loop
>> points your
>> correlation function can find), then you want to do a crossfade
>> that is
>> closer to a constant power crossfade where both fade in and fade out
>> envelopes are at 70.7% at the midpoint of the crossfade. there is
>> a way to
>> define the optimal crossfade function for any correlation between 0
>> (when
>> it's like crossfading white noise to white noise) to 100% (like
>> crossfading
>> a perfectly periodic waveform to a similarly appearing portion of the
>> waveform at loop start).
>>
>> does any of this make any sense?
>>
>
> I'm not sure I'm following you.
Element Green
2010-12-01 17:30:45 UTC
Permalink
On Tue, Nov 30, 2010 at 12:58 PM, robert bristow-johnson
<***@audioimagination.com> wrote:
>
> On Nov 29, 2010, at 8:50 PM, Element Green wrote:
>>
>> It seems like the algorithm I came up with independently is very
>> similar.  The absolute value of the difference of the sample points is
>> taken as with AMDF.  Prior to summing the values together though, I'm
>> multiplying by the window I described before (with a peak in the
>> center where the loop point is), giving samples closer to the loop
>> point more weight.
>
> it's AMDF (with a better window).  that link above shows the AMDF with a
> rectangular window.  usually "no window" means "rectangular window".  and
> usually we think that rectangular ain't the best kind of window.
>
> you can also square the difference (which is the same thing as squaring the
> abs value).  i might call this the ASDF.  a continuous-time representation
> (used for pitch detection) is depicted as Eq. (1) in the Wavetable-101.pdf
> paper you can find at the music-dsp site.  in fact, you can raise the abs
> value of the difference to any power that's a positive number.  this is
> really the Lp norm where AMDF would be the L1 norm and ASDF would be the L2
> norm.  the higher the power, the more it emphasizes the bigger errors
> (de-emphasizing the little errors).
>
> one more thing about the L2 norm and ASDF is that it can be related directly
> to a form of autocorrelation.  the ASDF is really an upside-down
> autocorrelation (with an offset).  so the ASDF will have nulls or valleys
> precisely where the autocorrelation has peaks.
>


Now I know what I'm working with. Thanks for that.


>
> Olli has responded with the two particular end-points.  if the two loop
> points are well correlated, you want to use the linear crossfading you
> planned (where the fade-in and fade-out functions always add to 1, that's
> what i call a "constant voltage" crossfade).  but if the two loop points are
> completely uncorrelated (like you would get for white noise), then you want
> crossfade envelopes that are the sqrt() of your constant voltage crossfade
> (you want the square of your envelopes to add to 1, i call that a "constant
> power" crossfade).
>
> you should always be able to find loop points that have correlation of at
> least 0 (completely uncorrelated).  even if it's real crap that you are
> splicing to other real crap, assuming both signals have the DC removed, the
> correlation function will have no DC component and both negative values and
> positive values (with different lags).  you will always want to choose a lag
> with the largest correlation (a normalized correlation as close to 1 as
> possible).  so your correlation will be better than what you would get if it
> was completely uncorrelated white noise.
>

Fortunately this algorithm is usually going to be used on Instruments
which already have a periodic nature. Seldom will there be a case
where looping white noise is desirable and then the user probably
doesn't need assistance figuring out the optimal location (manually
setting a loop and listening to the result should suffice).

> a few years ago i was investigating this and was considering writing a paper
> for the AES about it.  there is a theory that you can design (on the fly)
> optimal crossfade envelopes for any normalized correlation between 0 and 1.
>  if you want, i can dig up the notes and equations about it.
>

That certainly sounds interesting. So ultimately a cross fade
function curve could be picked based on the output of the AMDF
algorithm. I would definitely be interested in the relevant equation
:-)

>> It's a sample/instrument editor, so it's all non-realtime.
>
> then what you should do is have your editor test a variety of loop-end times
> and for each of those, find the best loop-begin point (according to your
> AMDF or ASDF measure).  so you would have a list of loop-end candidates,
> each with their optimal loop-begin point (where the begin point and end
> point match the best) and you would choose the loop-end candidate that has
> the best match to its associated loop-begin.
>
> if this is for a simple sampler, and if your sampler can more simply "jump"
> from the loop-end to loop-begin point without the crossfade in real-time,
> then you can *pre*-crossfade the audio data so that the jump is as seamless
> as it would be if it were crossfaded.  that's an old sampling keyboard trick
> (from the 1980s).  it works very well for good matches (well correlated).
>  not sure it would be very good for poor matches, but splices around poor
> matches sound like crap anyway.

Yes, my application (Swami) currently exhaustively searches two
user-set ranges for start and end loop points. A
list of results is generated, which are sorted by default by relative
"quality" which is the number obtained from the AMDF algorithm. There
are also some other parameters for filtering what sort of results end
up in the final list. This includes the idea of loop groups, where
loops with similar lengths and start positions are considered to be a
part of the same group and only the best result from that group is
used.

The underlying synthesizer is FluidSynth, which does interpolation
around the loop, so it is at least making an attempt to smooth out any
major inconsistencies. I like the idea of doing on-the-fly cross
fading though, at least for auditioning the effect prior to applying
it.
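The pre-crossfade trick quoted above (baking the fade into the sample data so a plain jump sounds seamless) could be sketched like this, using a linear, constant-voltage fade; all names are my assumptions:

```python
def pre_crossfade(data, loop_start, loop_end, fade_len):
    """Old sampler trick: bake the crossfade into the samples before loop_end,
    so a plain jump from loop_end back to loop_start sounds like a crossfade.

    Mixes the fade_len samples leading into loop_end with the corresponding
    samples leading into loop_start, using a linear (constant-voltage) fade.
    The fade reaches 1.0 on the last sample, so the jump back to loop_start
    is exactly continuous (at the cost of a tiny step where the fade begins).
    """
    out = list(data)
    for i in range(fade_len):
        t = (i + 1) / fade_len               # fade amount, ending at 1.0
        a = data[loop_end - fade_len + i]    # audio approaching the loop end
        b = data[loop_start - fade_len + i]  # audio approaching the loop start
        out[loop_end - fade_len + i] = (1 - t) * a + t * b
    return out
```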

>
> --
>
> r b-j                  ***@audioimagination.com
>
> "Imagination is more important than knowledge."
>

Seems like this topic is getting close to being exhausted for now.
Thank you all for your very detailed and helpful responses.

Best regards,
Element Green
Olli Niemitalo
2010-12-01 23:42:12 UTC
Permalink
On Tue, Nov 30, 2010 at 10:58 PM, robert bristow-johnson
<***@audioimagination.com> wrote:

> there is a theory that you can design (on the fly)
> optimal crossfade envelopes for any normalized correlation between 0 and 1.
> if you want, i can dig up the notes and equations about it.

I found it tempting enough to try and see if I would come up with the
same equations. :) So here goes, from my part...

For a time position t=0..1 inside the cross-fade, we would like to mix
the two signals, x faded in, y faded out, with gains that satisfy the
proportion gain_x/gain_y = t/(1-t). That is the same proportion as
with a linear fade, which we consider ideal in case the signals are
fully and positively correlated. We will consider the two cross-faded
signals as noise. The amplitude of noise is described by its standard
deviation, stddev_x and stddev_y in this case. For two (normally
distributed?) random variables x and y, there is an equation for the
standard deviation of their sum, given the standard deviations of the
individual random variables and the correlation coefficient R, which
can take values in range -1 (full negative correlation) to 0 (no
correlation) to 1 (full correlation):

stddev_x_plus_y = sqrt(stddev_x ^ 2 + stddev_y ^ 2 + 2 * R * stddev_x
* stddev_y)

That came straight out of the Wikipedia page "Sum of normally
distributed random variables".

If we take into account the gains that we apply, we must modify the
formula a bit:

stddev_mixed = sqrt((gain_x * stddev_x)^2 + (gain_y * stddev_y)^2 + 2
* R * gain_x * stddev_x * gain_y * stddev_y)

Now, if we again consider the behavior of the linear fade between two
fully and positively correlated signals, it will give:

stddev_mixed_linear_correlated = sqrt((t * stddev_x)^2 + ((1-t) *
stddev_y)^2 + 2 * 1 * t * stddev_x * (1-t) * stddev_y) = stddev_x * t
+ stddev_y * (1-t)

That looks correct. It's a linear fade between the amplitudes of the
two signals. We'd like to see the same for any value of R. So, we
write a constraint for gain_x and gain_y:

sqrt((gain_x * stddev_x)^2 + (gain_y * stddev_y)^2 + 2 * R * gain_x *
stddev_x * gain_y * stddev_y) = stddev_x * t + stddev_y * (1-t)

We wanted gain_x and gain_y to be related by gain_x/gain_y = t/(1-t),
so we can plug in gain_y = gain_x * (1-t) / t:

sqrt((gain_x * stddev_x)^2 + (gain_x * (1-t) / t * stddev_y)^2 + 2 * R
* gain_x * stddev_x * gain_x * (1-t) / t * stddev_y) = stddev_x * t +
stddev_y * (1-t)

That's quite a mess, but I managed to beat this out of it using
Wolfram Alpha (available on-line):

gain_x = (t * (stddev_x * t - stddev_y * t + stddev_y)) /
sqrt(stddev_x^2 * t^2 - 2 * stddev_x * stddev_y * R * t * (t-1) +
stddev_y^2 * (t-1)^2)

Quite a monster still. For stddev_x and stddev_y you could use the
square root of the sum of squared differences to the mean, calculated
from the samplepoints within the fade. However, normally you would
deal with signals that have the same amplitude, so we set stddev_y =
stddev_x, resulting in:

gain_x = t / sqrt(2 * t * (R + t - 1 - R * t) + 1)

That should be good enough for practical purposes. To calculate
gain_y, you'd replace t with 1-t. The last formula (and probably also
the previous one) works OK for values of R other than -1. You don't
have to have R >= 0. For uncorrelated noise signals, R calculated from
the samplepoints would be somewhere around 0, so it actually wouldn't
be that uncommon for it to be slightly negative.
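As a sketch, the final formula is easy to implement and check against the two special cases: R = 1 reduces to the linear fade (gain = t), and R = 0 reduces to the constant-power curve t/sqrt(t^2 + (1-t)^2) from the earlier message. Function names are mine:

```python
import math

def fade_in_gain(t, R):
    """Fade-in gain for equal-amplitude signals, per the derivation above.

    t in [0, 1] is the position inside the cross-fade; R in (-1, 1] is the
    normalized correlation between the two spliced signals (R = -1 makes
    the denominator vanish at t = 0.5, as noted above).
    """
    return t / math.sqrt(2 * t * (R + t - 1 - R * t) + 1)

def fade_out_gain(t, R):
    # The fade-out gain is the same curve with t replaced by 1-t.
    return fade_in_gain(1 - t, R)
```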

-olli
Sebastian Stober
2010-11-25 12:05:40 UTC
Permalink
Hi Joshua,

you might be interested in this blog post:
http://runningwithdata.tumblr.com/post/597154309/earworm-capsule
about a graph-based approach.

Best regards,
Sebastian
Element Green
2010-11-26 04:54:30 UTC
Permalink
On Thu, Nov 25, 2010 at 4:05 AM, Sebastian Stober <***@ovgu.de> wrote:
> Hi Joshua,
>
> you might be interested in this blog post:
> http://runningwithdata.tumblr.com/post/597154309/earworm-capsule
> about a graph-based approach.
>
> Best regards,
> Sebastian
>

Does indeed sound like an interesting project.
Ross Bencina
2010-11-25 12:43:39 UTC
Permalink
Element Green wrote:
> I'm the author of a SoundFont instrument editing application called
> Swami (http://swami.sourceforge.net). A while back an interested
> developer added a loop finding algorithm which I integrated into the
> application. This feature is supposed to generate a list of start/end
> loop points which are optimal for "seamless loops".


What kind of loops are you talking about here? loops in pitched musical
instrument samples or drum/percussive rhythm loops?


> The original algorithm was based on autocorrelation. There were many
> bugs in the implementation and I was having trouble understanding how
> it functioned, so I wrote a new algorithm which currently does not use
> autocorrelation. The new algorithm seems to come up with good
> candidates for "seamless loops", but is slower than the old algorithm
> by a factor of 5, but at least it does not suffer from bugs, which
> often resulted in unpredictable results.


Assuming you're talking about musical instrument samples, then autocorrelation
(or AMDF) is one way to do it.

You need to get clear about whether your only goal is "seamless" or if you
want to preserve the pitch of the original accurately too -- that's pretty
important for musical instrument samples :-) and will be more relevant for
short loops. In that case you may be better off with a good spectral
fundamental frequency estimator like this one to choose the loop length:
http://ems.music.uiuc.edu/beaucham/papers/JASA.04.94.pdf

That could be combined with autocorrelation/AMDF and/or crossfading to
choose where to place the loop.

Note that there are a number of different ways you can perform the crossfade
if you choose to do that (linear, equal power, I remember my Ensoniq EPS
even had some weird ensemble "bow-tie" crossfade options).

For longish loops on musical/tonal/synth sounds that are spectrally dynamic
(ie they have lots of phasing or lfo stuff going on) you could also use
spectral descriptors to find good match points (you probably want to combine
this with autocorrelation/AMDF to get both the time domain and spectral
aspects correct).

A while ago James Chandler Jr gave a really good description of an off-line
time stretching algorithm, part of which did looping for pitched sounds --
you might want to have a look in the list archives for that -- it's probably of
interest to you.
Element Green
2010-11-26 05:10:49 UTC
Permalink
On Thu, Nov 25, 2010 at 4:43 AM, Ross Bencina
<rossb-***@audiomulch.com> wrote:
> Element Green wrote:
>>
>> I'm the author of a SoundFont instrument editing application called
>> Swami (http://swami.sourceforge.net).  A while back an interested
>> developer added a loop finding algorithm which I integrated into the
>> application.  This feature is supposed to generate a list of start/end
>> loop points which are optimal for "seamless loops".
>
>
> What kind of loops are you talking about here? loops in pitched musical
> instrument samples or drum/percussive rhythm loops?
>

Swami is a "wavetable" instrument editor, so I'm referring to looping
individual instrument audio samples.

>
> Assuming you're talking about musical instrument samples, then autocorrelation
> (or AMDF) is one way to do it.
>

I wasn't aware of AMDF, thanks for mentioning that.

> You need to get clear about whether your only goal is "seamless" or if you
> want to preserve the pitch of the original accurately too -- that's pretty
> important for musical instrument samples :-) and will be more relevant for
> short loops. In that case you may be better off with a good spectral
> fundamental frequency estimator like this one to choose the loop length:
> http://ems.music.uiuc.edu/beaucham/papers/JASA.04.94.pdf
>

Yes definitely want to preserve the pitch and for that matter don't
really want to modify the sample data at all. Cross fading would be a
separate operation. I didn't really delve too much into that PDF,
since reading it was taxing my brain with very little knowledge gain
;-)

> That could be combined with autocorrelation/AMDF and/or crossfading to
> choose where to place the loop.
>

I would think autocorrelation or something like that would provide the
location and size of the loop, so I don't see the need for a pitch
detection algorithm. Or am I overlooking something?

> Note that there are a number of different ways you can perform the crossfade
> if you choose to do that (linear, equal power, I remember my Ensoniq EPS
> even had some weird ensemble  "bow-tie" crossfade options).
>

Thanks for mentioning that. I hadn't really thought about what sort
of cross fades could be used or that there could even be multiple
types.

> For longish loops on musical/tonal/synth sounds that are spectrally dynamic
> (ie they have lots of phasing or lfo stuff going on) you could also use
> spectral descriptors to find good match points (you probably want to combine
> this with autocorrelation/AMDF to get both the time domain and spectral
> aspects correct).
>

I'm not sure what you mean by "spectral descriptors". Are you
referring to something like an FFT or another time to frequency domain
conversion? It definitely would be nice to try and take signal
volume, phasing and what not into account when the user is scanning a
large portion of the sample for loop candidates.

> A while ago James Chandler Jr gave a really good description of an off-line
> time stretching algorithm, part of which did looping for pitched sounds --
>  you might want to have a look in the list archives for that -- it's probably of
> interest to you.
Ross Bencina
2010-11-26 06:01:00 UTC
Permalink
Element Green wrote:
> I would think autocorrelation or something like that would provide the
> location and size of the loop, so I don't see the need for a pitch
> detection algorithm. Or am I overlooking something?

Peaks in the ACF might be good loop candidates but won't necessarily all be
multiples of the fundamental frequency (consider a sound with a missing
fundamental for example).

The best loops will almost certainly be multiples of the fundamental.. and
that algorithm I posted is more reliable than ACF peak picking. It really
depends how good your algorithm needs to be -- it's definitely a second
order concern -- especially if that PDF is melting your brain.

Also, if you knew the fundamental frequency you would only need to use the
ACF to test a small number of candidate loop points (eg near zero crossings
at multiples of the fundamental frequency) so that could be a good
performance optimisation too.



> I'm not sure what you mean by "spectral descriptors". Are you
> referring to something like an FFT or another time to frequency domain
> conversion? It definitely would be nice to try and take signal
> volume, phasing and what not into account when the user is scanning a
> large portion of the sample for loop candidates.

spectral descriptors are usually summary values computed from a short-term
FFT. there are hundreds in the literature but I was thinking of just using
some simple ones like the MPEG-7 Timbral Spectral descriptors mentioned
here:
3.3.1.7 Timbral Spectral
http://mpeg.chiariglione.org/standards/mpeg-7/mpeg-7.htm#E12E43

For example, spectral centroid:
http://en.wikipedia.org/wiki/Spectral_centroid



HTH

Ross.
robert bristow-johnson
2010-11-26 06:14:20 UTC
Permalink
On Nov 26, 2010, at 1:01 AM, Ross Bencina wrote:

> Element Green wrote:
>> I would think autocorrelation or something like that would provide
>> the
>> location and size of the loop, so I don't see the need for a pitch
>> detection algorithm. Or am I overlooking something?
>
> Peaks in the ACF might be good loop candidates but won't necessarily
> all be multiples of the fundamental frequency (consider a sound with
> a missing fundamental for example).

you can have a periodic (or quasi-periodic) signal with absolutely no
energy at harmonic #1 (what i would call the fundamental), and as long
as it has energy in most other odd harmonics, the autocorrelation
function will work just as well. there will still be peaks at lags
that are integer multiples of the apparent period.

> The best loops will almost certainly be multiples of the
> fundamental.. and that algorithm I posted is more reliable than ACF
> peak picking. It really depends how good your algorithm needs to be
> -- it's definitely a second order concern -- especially if that PDF
> is melting your brain.
>
> Also, if you knew the fundamental frequency you would only need to
> use the ACF to test a small number of candidate loop points (eg near
> zero crossings at multiples of the fundamental frequency) so that
> could be a good performance optimisation too.

whatever finds you a pitch period (the reciprocal of the fundamental
frequency) also finds you loop candidates. i dunno how you would know
the fundamental without also knowing other loop candidates. for me, a
pitch detection algorithm does so by choosing from a set of candidate
pitches. each of those candidate pitches is also a candidate loop
displacement; the displacement in samples is the reciprocal of the
candidate fundamental frequency (which of those candidates is best, for either
reason, is another issue).

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Ross Bencina
2010-11-26 07:21:33 UTC
Permalink
robert bristow-johnson wrote:
> you can have a periodic (or quasi-periodic) signal with absolutely no
> energy at harmonic #1 (what i would call the fundamental), and as long as
> it has energy in most other odd harmonics, the autocorrelation function
> will work just as well. there will still be peaks at lags that are
> integer multiples of the apparent period.

The way I see it, the ACF will be quasi-periodic at the rate of the lowest
present harmonic. This doesn't tell you anything about what the fundamental
is. The peaks may be there, but how do you know which one(s) are related to
the fundamental?

Spectral peaks in an FFT will be spaced by the fundamental (and the two way
mismatch algorithm is a way of getting a more robust detector from that). I
don't know that much about this stuff, but I haven't heard of a time domain
FFE algorithm that works as well -- can you suggest one?


>> Also, if you knew the fundamental frequency you would only need to use
>> the ACF to test a small number of candidate loop points (eg near zero
>> crossings at multiples of the fundamental frequency) so that could be a
>> good performance optimisation too.
>
> whatever finds you a pitch period (the reciprocal of the fundamental
> frequency) also finds you loop candidates. i dunno how you would know
> the fundamental without also knowing other loop candidates.

For this application I am suggesting that the only valid loop candidates are
integer multiples of the fundamental -- so you need to know the fundamental
before you can start talking about loop candidates. At least that's how I'm
seeing it. If you don't care about the fundamental (which is fair enough)
then sure, all peaks in the ACF are loop candidates.


> for me, a pitch detection algorithm does so by choosing from a set of
> candidate pitches. each of those candidate pitches are also candidate
> loop displacements, displacement in samples is the reciprocal of
> candidate fundamental frequency (which of those candidates is best, for
> either reason, is another issue).

For me, being a multiple of the fundamental frequency is one requirement to
qualify as a loop candidate. Other desiderata may include proximity to zero
crossings, spectral and/or waveform similarity at endpoints.

If it's a long sample with vibrato the user may want to track FFE over short
time scales and use this as another match criteria for candidate loop
points.

Ross.
robert bristow-johnson
2010-11-26 18:25:31 UTC
Permalink
On Nov 26, 2010, at 2:21 AM, Ross Bencina wrote:

> robert bristow-johnson wrote:
>> you can have a periodic (or quasi-periodic) signal with absolutely
>> no energy at harmonic #1 (what i would call the fundamental), and
>> as long as it has energy in most other odd harmonics, the
>> autocorrelation function will work just as well. there will still
>> be peaks at lags that are integer multiples of the apparent period.
>
> The way I see it, the ACF will be quasi-periodic at the rate of the
> lowest present harmonic.

but that's not the case. if x(t) is periodic, then Rx(tau) (as
calculated with an infinitely large window) is periodic with the same
period.

a pretty robust definition of the autocorrelation of x(t) might be:

Rx(tau) = lim{T->inf} 1/(2T) * integral{-T..+T} x(t)*x(t+tau) dt

now, we don't have infinitely large windows to calculate Rx() with, so
as the lag, tau, increases the overlap of the windows is smaller

let
v(t) = x(t)*w(t) (w(t) is a very wide window function.)

then
Rx(tau) ~= Rv(tau)

Rv(tau) = integral{-inf..+inf} v(t)*v(t+tau) dt

so, while x(t) is periodic, v(t) is not and there is an envelope on
the apparent autocorrelation Rx(tau) which is:

Rw(tau) = integral{-inf..+inf} w(t)*w(t+tau) dt

if P is the period of x(t), then

x(t+P) = x(t) for all t

and
Rv(tau) = integral{-inf..+inf} x(t)*w(t)*x(t+tau)*w(t+tau) dt

Rv(P) = integral{-inf..+inf} x(t)*w(t)*x(t+P)*w(t+P) dt

      = integral{-inf..+inf} (x(t)^2)*w(t)*w(t+P) dt

but

Rv(0) = integral{-inf..+inf} (x(t)^2)*(w(t)^2) dt

Rv(P) is reduced from Rv(0) because w(t)*w(t+P) is smaller than w(t)^2
because w(t) and w(t+P) do not overlap as much as w(t) overlaps on top
of itself. i think, for some windows (like the rectangular window),
you can show something like

Rv(n*P) = Rx(n*P)*Rw(n*P)

which is true only at the multiples of the period, P.

underneath that envelope, you will see peaks at multiples of the
period of x(t), whether there is 1st harmonic in there or not.

now there are other ways of computing the autocorrelation (or
something that looks like it) that do not have this envelope resulting
from windowing, so the autocorrelation peaks are as large as Rx(0) if
x(t) is perfectly periodic. the method shown with Rv(tau) is what you
would get if you were using the inverse FT of |V(f)|^2, the magnitude
squared of the Fourier transform of v(t).

> This doesn't tell you anything about what the fundamental is. The
> peaks may be there, but how do you know which one(s) are related to
> the fundamental?

there *is* the octave problem. you can have a tone of 360 Hz (with
all of the harmonics), but mathematically, it is also a tone of 180 Hz
(where all the odd harmonics have zero energy) or 120 Hz or 90 Hz
(with a lot of missing harmonics). the autocorrelation function will
peak at those periods also. so you look for the peak of sufficient
height (so, unfortunately, there is some kind of thresholding needed)
that has the smallest lag, which would be the one at 1/360 sec.
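A sketch of that peak-picking rule (the threshold ratio and all names are my assumptions; the text only says some kind of thresholding is needed):

```python
def pick_period(acf, threshold_ratio=0.5):
    """Pick the smallest lag whose ACF peak is of sufficient height.

    acf[0] is the zero-lag energy. Returns the lag of the first local
    maximum at or above threshold_ratio * acf[0], or None if no peak
    qualifies. This addresses the octave problem described above: a
    360 Hz tone also peaks at the 180/120/90 Hz lags, so we take the
    smallest qualifying lag.
    """
    threshold = threshold_ratio * acf[0]
    for lag in range(1, len(acf) - 1):
        if acf[lag] >= threshold and acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag
    return None
```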

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Stefan Stenzel
2010-12-02 12:52:20 UTC
Permalink
Now I wonder, am I the only one to calculate ACF using FFT?
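For reference, the FFT route computes the windowed-signal ACF as the inverse transform of the power spectrum. A sketch using NumPy (an assumption on my part; Stefan doesn't say what he uses), with zero-padding so the circular correlation becomes linear:

```python
import numpy as np

def acf_via_fft(v):
    """Windowed-signal ACF via FFT: Rv = IFFT(|FFT(v)|^2).

    Zero-padding to at least 2*len(v) - 1 points makes the circular
    correlation linear, so Rv[tau] matches the direct sum of v[n]*v[n+tau].
    """
    n = len(v)
    size = 1 << (2 * n - 1).bit_length()    # power of two >= 2n-1
    spectrum = np.fft.rfft(v, size)
    acf = np.fft.irfft(spectrum * np.conj(spectrum), size)
    return acf[:n]                          # lags 0 .. n-1
```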

Regarding seamless loops, I found that quantizing frequencies
to integer numbers of periods in the loop works extremely well.

Regards,
Stefan
robert bristow-johnson
2010-12-02 15:39:57 UTC
Permalink
On Dec 2, 2010, at 7:52 AM, Stefan Stenzel wrote:

> Now I wonder, am I the only one to calculate ACF using FFT?

the version of ACF that you calculate using the FFT is

Rv[tau] = SUM{ v[n]*v[n+tau] }

where

v[n] = x[n]*w[n] where w[n] is a window function,
supposedly much wider than any period of x[n].

this results in an ACF for x[n] that approximates as

Rx[tau] ~= Rv[tau] = SUM{ x[n]*w[n] * x[n+tau]*w[n+tau] }

if P is a period for your input x[n], then x[n+P] = x[n] for all n.
at lags that are integer multiples of that period

Rv[m*P] = SUM{ x[n]*x[n+m*P] * w[n]*w[n+m*P] }

= SUM{ x[n]*x[n] * w[n]*w[n+m*P] }

note that

Rv[0] = SUM{ x[n]*x[n] * w[n]*w[n] }

the partially overlapped windows w[n]*w[n+m*P] sum to less than the
fully overlapped window w[n]*w[n], so Rv[m*P] will sum to something
less than Rv[0]. that something less is like an envelope on the ACF
that you are looking for. even for a perfectly periodic x[n], the
peaks in your ACF do not reach the height of Rx[0]. to within a
scaling factor

Rv[m*P] = Rv[0] * Rw[m*P] where

Rw[tau] = SUM{ w[n]*w[n+tau] }
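[the envelope effect is easy to verify numerically; with a Hann window
and an exactly periodic x[n], the normalized ACF peaks of v[n] track
Rw. An editor's sketch:]

```python
import numpy as np

n, P = 1024, 64                          # P divides n: x is exactly periodic
x = np.sin(2*np.pi*np.arange(n)/P)
w = np.hanning(n)

def acf(s):
    return np.correlate(s, s, mode='full')[len(s)-1:]

Rv, Rw = acf(x*w), acf(w)
for m in (1, 2, 4):
    # peaks at lag m*P, reduced by the window's own autocorrelation
    print(m, Rv[m*P]/Rv[0], Rw[m*P]/Rw[0])   # the two ratios nearly agree
```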

so that leads me to two questions for you, Stefan:

i presume that you find peaks that are reduced by that envelope.

1. how do you discriminate between peaks that are unequally weighted
by that envelope? which peak do you pick?

2. for a real-time operation, what kind of delay are you dealing
with? you must fill your input buffer, window it, pass that to the FFT,
magnitude-square it, and iFFT before you see your ACF (with an
envelope)? even with an infinitely fast computer, you still have the
delay of filling the entire buffer.

for comparison, with a time-domain ACF, you can calculate R[tau] for
increasing lags on the fly as more audio data comes in. you can
always correlate the most current slice of audio against a slice of
audio fixed farther in the past. as time goes on, the lag difference
increases until you are done with this particular ACF and you start
over.
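[a sketch of that running, time-domain scheme (editor's illustration;
the slice length and buffering are assumptions): each incoming sample
lets you evaluate one more lag, since the current slice advances while
the reference slice stays fixed:]

```python
import numpy as np

def running_acf(x, slice_len, max_lag):
    ref = x[:slice_len]                    # slice fixed farther in the past
    R = np.empty(max_lag)
    for tau in range(max_lag):
        # R[tau] is computable as soon as sample x[tau + slice_len - 1] arrives
        R[tau] = np.dot(ref, x[tau:tau+slice_len])
    return R

x = np.random.default_rng(2).standard_normal(4096)
R = running_acf(x, 1024, 512)
print(np.allclose(R, np.correlate(x, x[:1024], mode='valid')[:512]))   # True
```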

> Regarding seamless loops, I found that quantizing frequencies
> to integer numbers of periods in the loop works extremely well.

that's the whole point to using the ACF (or AMDF) in splicing audio
for loops in samplers or loops in real-time pitch shifters. the only
way to get *all* frequencies present to displace an integer number of
periods is for the data to have some periodicity. if you choose 2
periods of the fundamental, the same displacement is 4 periods of the
2nd harmonic and 6 periods of the 3rd harmonic. if it's not quasi-
periodic, your ACF (or AMDF) will not give you good candidates, so you
pick the best candidate you can, take the glitch that results, and
consider whether frequency-domain shifting or time-scaling is better.

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
robert bristow-johnson
2010-12-06 07:59:01 UTC
This is a continuation of the thread started by Element Green titled:
Algorithms for finding seamless loops in audio

As far as I know, it is not published anywhere. A few years ago, I
was thinking of writing this up and publishing it (or submitting it
for publication, probably to JAES), and had let it fall by the
wayside. I'm "publishing" the main ideas here on music-dsp because of
some possible interest here (and the hope it might be helpful to
somebody), and so that "prior art" is established in case anyone
like IVL is thinking of claiming it as their own. I really do not
know how useful it will be in practice. It might not make any
difference. It's just a theory.

______________________________________________________________________

Section 0:

This is about the generalization of the different ways we can splice
and crossfade audio that has these two extremes:

(1) Splicing perfectly coherent and correlated signals
(2) Splicing completely uncorrelated signals

I sometimes call the first case the "constant-voltage crossfade"
because the crossfade envelopes of the two signals being spliced add
up to one. The two envelopes meet when both have a value of 1/2. In
the second case, we use a "constant-power crossfade", where the squares of
the two envelopes add to one and they meet when both have a value of
sqrt(1/2)=0.707.
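These two extremes can be checked numerically for the linear case
(editor's sketch): the constant-voltage pair sums to one and meets at
1/2, while the constant-power pair has squares summing to one and meets
at sqrt(1/2):

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 201)
fade_in = 0.5 + t/2                    # linear constant-voltage fade-in
fade_out = fade_in[::-1]               # its time-reversed partner
print(np.allclose(fade_in + fade_out, 1.0))          # True: sums to one
print(fade_in[100])                                  # 0.5 at the midpoint

pfade_in = np.sqrt(fade_in)            # constant-power version
pfade_out = pfade_in[::-1]
print(np.allclose(pfade_in**2 + pfade_out**2, 1.0))  # True: powers sum to one
print(pfade_in[100])                                 # ~0.707 at the midpoint
```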

The questions I wanted to answer are: What does one do for cases in
between, and how does one know from the audio, which crossfade
function to use? How does one quantify the answers to these
questions? How much can we generalize the answer?

______________________________________________________________________

Section 1: Set up the problem.

We have two continuous-time audio signals, x(t) and y(t), and we want
to splice from one to the other at time t=0. In pitch-shifting or
time-scaling or any other looping, y(t) can be some delayed or
advanced version of x(t).

e.g. y(t) = x(t+P)

where P is a period length or some other "good" splice
displacement. We get that value from an algorithm we call a "pitch
detector".

Also, it doesn't matter whether x(t) is getting spliced to y(t) or the
other way around, it should work just as well for the audio played in
reverse. And it should be no loss of generality that the splice
happens at t=0, we define our coordinate system any damn way we damn
well please.

The signal resulting from the splice is

v(t) = a(t)*x(t) + a(-t)*y(t)

By restricting our result to be equivalent if run either forward or
backward in time, we can conclude that the "fade-in" function (say that's
a(t)) is the time-reversed copy of the "fade-out" function, a(-t).

For the correlated case (1): a(t) + a(-t) = 1 for all t

For the uncorrelated case (2): (a(t))^2 + (a(-t))^2 = 1 for all t

This crossfade function, a(t), has well-defined even and odd symmetry
components:

a(t) = e(t) + o(t)
where

even part: e(t) = e(-t) = ( a(t) + a(-t) )/2
odd part: o(t) = -o(-t) = ( a(t) - a(-t) )/2

And it's clear that

a(-t) = e(t) - o(t) .


For example, if it's a simple linear crossfade (equivalent to splicing
analog tape with a diagonally-oriented razor blade):

{ 0 for t <= -1
{
a(t) = { 1/2 + t/2 for -1 < t < 1
{
{ 1 for t >= 1

This is represented simply, in the even and odd components, as:

e(t) = 1/2

{ t/2 for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1


where sgn(t) is the "sign function": sgn(t) = t/|t| .

This is a constant-voltage crossfade, appropriate for perfectly
correlated signals x(t) and y(t). There is no loss of generality by
defining the crossfade to take place around t=0 and have two time
units in length. Both are simply a matter of offset and scaling of
time.
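The decomposition is easy to check numerically (editor's sketch, using
the clipped linear envelope):

```python
import numpy as np

def a_lin(t):
    return np.clip(0.5 + t/2, 0.0, 1.0)      # the piecewise-linear a(t) above

t = np.linspace(-2.0, 2.0, 401)
e = (a_lin(t) + a_lin(-t)) / 2               # even part: 1/2 everywhere
o = (a_lin(t) - a_lin(-t)) / 2               # odd part: t/2, clipped to +/-1/2
print(np.allclose(e, 0.5))                   # True
print(np.allclose(e + o, a_lin(t)))          # True: a(t)  = e(t) + o(t)
print(np.allclose(e - o, a_lin(-t)))         # True: a(-t) = e(t) - o(t)
```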

Another constant-voltage crossfade would be what I might call a "Hann
crossfade" (after the Hann window):

e(t) = 1/2

{ (1/2)*sin(pi/2 * t) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1


Some might like that better because the derivative is continuous
everywhere. Extending this idea, one more constant-voltage crossfade
is what I might call a "Flattened Hann crossfade":

e(t) = 1/2

{ (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1

This splice is everywhere continuous in the zeroth, first, and second
derivative. A very smooth crossfade.

As another example, a constant-power crossfade would be the same as
any of the above, but where the above a(t) is square rooted:

{ 0 for t <= -1
{
a(t) = { sqrt(1/2 + t/2) for -1 < t < 1
{
{ 1 for t >= 1

This is what we might use to splice two completely uncorrelated signals
together. We can separate this into even and odd parts as:


{ (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2)) for |t| < 1
e(t) = {
{ 1/2 for |t| >= 1


{ (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2)) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1

______________________________________________________________________

Section 2: Which crossfade function to use?

Now we shall make a definition and an assumption. We shall define an
inner product of two general signals as:

+inf
<x,y> = <x(t), y(t)> = integral{ x(t)*y(t) * w(t) dt}
-inf

w(t) is a window function that is symmetrical about t=0 and is
probably wider than the crossfade. Strictly speaking, if you were
coming at this from out of a graduate course in metric spaces or
functional analysis, one of the components (probably y(t)) should be
complex conjugated, but since x(t) and y(t) are always real, in this
whole theory, I will not bother with that notation.

This inner product is a degenerate case of the more general cross-
correlation evaluated with a lag of zero:

+inf
Rxy(tau) = <x(t), y(t+tau)> = integral{ x(t)*y(t+tau) * w(t) dt}
-inf

If y(t) is a time-offset copy of x(t), then Rxy(tau) is the
autocorrelation of x(t), Rxx(tau), but also accounting for the time
offset in the lag, tau.

So <x,y> = Rxy(0)

A measure of signal energy or average power is:

+inf
Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
-inf

Now, the assumption that we are going to toss in here is that the mean
powers of the two signals that we are crossfading, x(t) and y(t), are
equal.

<x,x> = <y,y>

We are assuming that we're not crossfading this very quiet tone or
sound to a very loud sound that is 60 dB louder. Similarly, the
resulting spliced sound, v(t), has the same mean power as the two
signals being spliced:

<v,v> = <x,x> = <y,y>

So, assuming we lined up x(t) and y(t) so that we want to splice from
one to the other at t=0, and scaled x(t) and y(t) so that they have
the same mean power in the neighborhood of t=0, then the inner product
is a measure of how well they are correlated. We shall define this
normalized measure of correlation as:

r = <x,y>/<x,x> = <x,y>/<y,y>

If r = 1, they are perfectly correlated and if r = 0, they are
completely uncorrelated.
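In discrete time that measure might look like the following (editor's
sketch; the symmetric sqrt(<x,x><y,y>) normalization is a slight
generalization that reduces to <x,y>/<x,x> when the two mean powers
already match, as assumed above):

```python
import numpy as np

def corr_r(x, y, w):
    # windowed inner products; r = <x,y>/sqrt(<x,x><y,y>)
    return np.sum(x*y*w) / np.sqrt(np.sum(x*x*w) * np.sum(y*y*w))

n = 1024
w = np.hanning(n); w /= w.sum()          # window normalized to unit integral
x = np.sin(2*np.pi*np.arange(n)/64)
print(corr_r(x, x, w))                   # 1.0: perfectly correlated

rng = np.random.default_rng(1)
noise_r = corr_r(rng.standard_normal(n), rng.standard_normal(n), w)
print(abs(noise_r) < 0.2)                # True: independent noise, r near 0
```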

We will make the additional assumption that our pitch detection
algorithm will find *some* lag where the correlation is at least
zero. We should not have to deal with splicing *negatively*
correlated audio (that would be quite a "glitch" or a bad splice). If
the signals have no DC component, then their autocorrelations (and
their cross-correlations to each other) must have no DC component.
That means there will be values of tau such that Rxy(tau) are either
negative or positive. If it was theoretical white noise, Rxx(tau)
would be zero for |tau| > 0 and Rxx(0) would be the noise variance or
power. But Rxx(tau) cannot be negative for *all* values of tau, even
excluding tau=0.

We can find a value of tau so that Rxx(tau) is non-negative and we
want to choose tau so that it has the highest value of Rxx(tau). Then
define

y(t) = x(t + tau)

and then

<x,y> = Rxy(0) = Rxx(tau)

Now we shall also assume that the crossfade function, a(t), is
completely uncorrelated and even statistically independent from the
two signals being spliced. a(t) is a volume control that varies in
time, but is unaffected by anything in x(t) or y(t).

We shall also assume something called "ergodicity". This means that
*time* averages of x(t) and y(t) (or combinations of x(t) and y(t))
are equal to *statistical* averages. If this window, w(t) is scaled
(or normalized) so that its integral is 1,

+inf
integral{ w(t) dt} = 1
-inf

then all these inner products can be related to "expectation values":

<x,y> = E{ x(t) * y(t) }

If x(t) and y(t) are thought of as sorta "random" processes (rather
than well defined deterministic functions), the expectation value is
unmoved no matter what t is. But if the envelope a(t) is considered
deterministic, then it simply scales x(t) or y(t) and is treated as a
constant in the expectation. So at some particular time t0,

<a(t0)*x,y> = E{ (a(t0)*x(t)) * y(t) }

= a(t0) * E{ x(t) * y(t) }

= a(t0) * <x,y>

This is a little sloppy, mathematically, because I am "fixing" t for
a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time
averages" for x(t) and y(t) can be meaningful and equated to
statistical averages).

Recall that

v(t) = a(t)*x(t) + a(-t)*y(t)

Then:

<v,v> = <(a(t)*x(t) + a(-t)*y(t)), (a(t)*x(t) + a(-t)*y(t))>

Using identities that we can apply to expectation values

<v,v> = (a(t))^2*<x,x> + 2*a(t)*a(-t)*<x,y> + (a(-t))^2*<y,y>

Since <v,v> = <x,x> = <y,y>, we can divide by <v,v> and get to the key
equation of this whole theory:

1 = (a(t))^2 + 2*r*a(t)*a(-t) + (a(-t))^2

Given the normalized correlation measure, we want the above equation
to be true all of the time. If r=0 (completely uncorrelated), one can
see we get a constant-power crossfade:

(a(t))^2 + (a(-t))^2 = 1

If r=1 (completely correlated), one can see that we get a constant-
voltage crossfade:

(a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t) = ( a(t) + a(-t) )^2 = 1

or, assuming a(t) is non-negative,

a(t) + a(-t) = 1 .

______________________________________________________________________

Section 3: Generalizing the crossfade function

Recall that

a(t) = e(t) + o(t)

a(-t) = e(t) - o(t)

and substituting into

(a(t))^2 + (a(-t))^2 + 2*r*a(t)*a(-t) = 1

results in

(e(t) + o(t))^2 + (e(t) - o(t))^2
+ 2*r*(e(t) + o(t))*(e(t) - o(t)) = 1

Blasting through that gets:

(1+r)*(e(t))^2 + (1-r)*(o(t))^2 = 1/2


This means that, if r is measured and known (from the correlation
function) we have the freedom to define either one of e(t) or o(t)
arbitrarily (as long as the even or odd symmetry is kept) and solve
for the other. We can see that square rooting is involved in solving
for either e(t) or o(t) and there is an ambiguity for which sign to
pick. We shall resolve that ambiguity by adding the additional
assumption that the even-symmetry component, e(t), is non-negative.

e(t) = e(-t) >= 0

Given a general and bipolar odd-symmetry component function,

o(t) = -o(-t)

then we solve for the even component (picking the non-negative square
root):

e(t) = sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

The overall crossfade envelope would be

a(t) = e(t) + o(t)

= sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 ) + o(t)
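A quick numerical check that this family really satisfies the key
equation of Section 2 for any r in [0,1] (editor's sketch, using the
linear odd part o(t) = t/2):

```python
import numpy as np

def crossfade(o, r):
    # Section 3 result: pick o(t), solve for the non-negative even part
    e = np.sqrt(0.5/(1 + r) - (1 - r)/(1 + r)*o**2)
    return e + o

t = np.linspace(-1.0, 1.0, 201)
o = t/2                                   # odd part of the linear crossfade
for r in (0.0, 0.3, 0.7, 1.0):
    a = crossfade(o, r)                   # a(t)
    a_rev = a[::-1]                       # a(-t): o is odd, t-grid symmetric
    lhs = a**2 + 2*r*a*a_rev + a_rev**2
    print(r, np.allclose(lhs, 1.0))       # True for every r
```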

______________________________________________________________________

Section 4: Implementation:

Given a particular form for the odd part, o(t) (linear or Hann or
Flattened Hann or whatever is your heart's desire), and for a variety
of values of r, ranging from r=0 to r=1, a collection of envelope
functions, a(t), are pre-calculated and stored in memory. Then, when
pitch detection or loop matching is done, a splice displacement that
is optimal is determined, and if autocorrelation of some form is used
in determining a measure of goodness (or seamlessness, using Element's
language) of that loop splice, that autocorrelation is normalized (by
dividing by Rxx(0)) to get r and that value of r is used to choose
which pre-calculated a(t) from the above collection is used for the
crossfade in the splice.
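That table-lookup scheme might be sketched as follows (editor's
illustration; the grid sizes, the Hann odd part, and nearest-entry
rounding are assumptions, not prescriptions from the text):

```python
import numpy as np

N_R, N_T = 11, 513
t = np.linspace(-1.0, 1.0, N_T)
o = 0.5*np.sin(np.pi/2*t)                     # "Hann crossfade" odd part
r_grid = np.linspace(0.0, 1.0, N_R)
# precompute a(t) = e(t) + o(t) for each r on the grid
table = np.array([np.sqrt(0.5/(1+r) - (1-r)/(1+r)*o**2) + o for r in r_grid])

def envelope_for(r):
    # at splice time: normalize Rxx(P) by Rxx(0) to get r, then look up
    return table[int(round(float(np.clip(r, 0.0, 1.0)) * (N_R - 1)))]

a = envelope_for(1.0)
print(np.allclose(a + a[::-1], 1.0))          # True: constant-voltage limit
a = envelope_for(0.0)
print(np.allclose(a**2 + a[::-1]**2, 1.0))    # True: constant-power limit
```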

______________________________________________________________________


--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Stefan Stenzel
2010-12-06 18:23:47 UTC
On 06.12.2010 08:59, robert bristow-johnson wrote:
>
> This is a continuation of the thread started by Element Green titled: Algorithms for finding seamless loops in audio

I suspect it works better to *construct* a seamless loop instead of trying to find one where there is none.

Stefan
robert bristow-johnson
2010-12-06 18:49:15 UTC
On Dec 6, 2010, at 1:23 PM, Stefan Stenzel wrote:

> On 06.12.2010 08:59, robert bristow-johnson wrote:
>>
>> This is a continuation of the thread started by Element Green
>> titled: Algorithms for finding seamless loops in audio
>
> I suspect it works better to *construct* a seamless loop instead of
> trying to find one where there is none.

i can't speak for Greenie or any others, but i myself would be very
interested in what you might have to say about constructing seamless
loops. regarding that, i would like to know the context (e.g. looping
a non-realtime sample editor for a sampling synth vs. a realtime pitch
shifter) and the kinds of signals (quasi-periodic vs. aperiodic vs.
periodic but with detuned higher harmonics). processing in frequency
domain or time domain (or some in both)?

dunno if there is any PPG "secrets" or wisdom to confer, but i would
like to hear or read it.

bestest,

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Olli Niemitalo
2010-12-07 10:27:49 UTC
RBJ,

I had a look at your theory, and compared it to my approach (dare not
call it a theory, as it was not as rigorously derived). The following
is how I imagine we thought things out.

Both of us wanted to preserve some aspect(s) of the known-to-be-good
constant-voltage crossfade envelopes, and to generalize from those the
envelope functions for arbitrary values of the correlation
coefficient.

You saw that the odd component o(t) determined the shape of the
constant-voltage envelopes. For those, the even component had to be
e(t) = 1/2 to satisfy the symmetry a(t) + a(-t) = 1 required in
constant-voltage crossfades. So apparently o(t) was capturing the
essential aspects of the crossfade envelope. You showed how to
recalculate e(t) for different values of the correlation coefficient
in such a way that o(t) was preserved.

I, on the other hand, chose that the ratio a(t)/a(-t) (using your
notation) should be preserved for each value of t. To accomplish this,
one could first do the crossfade using constant-voltage envelopes and
then apply to the resulting signal a volume envelope to adjust for any
deviation from perfect positive correlation. Or equivalently, the
compensation could be incorporated into a(t), which I showed how to do
in the case of a linear constant-voltage crossfade. Other
constant-voltage crossfade envelopes than linear could be handled by a
time deformation function u(t) which gives the time at which the
linear constant-voltage envelope function reaches the value of the
desired constant-voltage envelope function at time t. u(t) would then be
used instead of t in the formula for a(t) derived for generalization
of the linear crossfade for arbitrary r.

I believe your requirement for r >= 0 could be relaxed. For example,
if one is creating a drum-loop, then it would probably make most sense
to put the loop points in the more quiet areas between the transients.
And there you might only have noise that is independent between the
two loop points, thus giving values of the correlation coefficient
slightly positive or slightly negative. Because the length of a drum
loop is fixed, there might not be so much choice in placement of the
loop points, and a spot giving a slightly negative r might actually be
the most natural choice. I do not think your formulas will fall apart
just as long as -1 < r <= 1.
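[One way to realize the ratio-preserving construction Olli describes
(editor's reading, not Olli's code): crossfade with constant-voltage
envelopes c(t) and c(-t), then scale the summed signal by a gain
g(t) = 1/sqrt(1 - 2*(1-r)*c(t)*c(-t)) that restores unit power. The
ratio a(t)/a(-t) = c(t)/c(-t) is untouched, and, as Olli notes, nothing
breaks for slightly negative r:]

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 201)
c = 0.5 + t/2                              # linear constant-voltage fade-in
c_rev = c[::-1]                            # fade-out partner: c + c_rev == 1
for r in (-0.2, 0.0, 0.5, 1.0):
    g = 1.0/np.sqrt(1.0 - 2.0*(1.0 - r)*c*c_rev)   # power-compensating gain
    a, a_rev = g*c, g*c_rev                # ratio a/a_rev stays c/c_rev
    lhs = a**2 + 2*r*a*a_rev + a_rev**2    # RBJ's key equation
    print(r, np.allclose(lhs, 1.0))        # True even at r = -0.2
```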

-olli
robert bristow-johnson
2011-07-09 19:53:30 UTC
hi Olli (and others)...

i was reviewing this thread because i wanted to read what Stefan
Stenzel had said and realized that you had posted this response, and i
don't think i or anyone had responded to it. i don't remember reading
it (it must be the cannabis). i hope you're listening Olli - i have a
lot of respect for what i have read from you (the pink elephant paper).

since this comes from last December, i reposted (with more
corrections) the original "theory" at the bottom.

On Dec 7, 2010, at 5:27 AM, Olli Niemitalo wrote:

> RBJ,
>
> I had a look at your theory, and compared it to my approach (dare not
> call it a theory, as it was not as rigorously derived). The following
> is how I imagine we thought things out.
>
> Both of us wanted to preserve some aspect(s) of the known-to-be-good
> constant-voltage crossfade envelopes, and to generalize from those the
> envelope functions for arbitrary values of the correlation
> coefficient.
>
> You saw that the odd component o(t) determined the shape of the
> constant-voltage envelopes. For those, the even component had to be
> e(t) = 1/2 to satisfy the symmetry a(t) + a(-t) = 1 required in
> constant-voltage crossfades.

it need not be the case that e(t) = 1/2 in the non-constant-voltage
crossfades.

> So apparently o(t) was capturing the
> essential aspects of the crossfade envelope. You showed how to
> recalculate e(t) for different values of the correlation coefficient
> in such a way that o(t) was preserved.

i wasn't trying to preserve o(t). it's just that it was easier to get
a handle on a(t) (and a(-t)) if i split it into e(t) and o(t). and
then in the final solution, a square root was involved in solving for
either o(t) or e(t). since o(t) *has* to be bipolar, solving for o(t)
in terms of e(t) is a little more problematic than vise versa because
you *know* that o(t) is necessarily bipolar and you have to deal with
the +/- sqrt() issue. but if you specify o(t) and solve for e(t),
there is no problem with defining e(t) to be always non-negative.

> I, on the other hand, chose that the ratio a(t)/a(-t) (using your
> notation) should be preserved for each value of t.

now, i do not understand why you would do that. by "preserved", do
you mean constant over all t? even for simple, linear crossfades,
you cannot satisfy that.

> To accomplish this,
> one could first do the crossfade using constant-voltage envelopes and
> then apply to the resulting signal a volume envelope to adjust for any
> deviation from perfect positive correlation. Or equivalently, the
> compensation could be incorporated into a(t), which I showed how to do
> in the case of a linear constant-voltage crossfade. Other
> constant-voltage crossfade envelopes than linear could be handled by a
> time deformation function u(t) which gives the time at which the
> linear constant-voltage envelope function reaches the value of the
> desired constant-voltage envelope function at time t. u(t) would then be
> used instead of t in the formula for a(t) derived for generalization
> of the linear crossfade for arbitrary r.

so if a(t)/a(-t) is not "preserved" over different values of t but is
preserved over different values of r, i am not sure you want to do that.

what is the fundamental reason for preserving a(t)/a(-t) ?

> I believe your requirement for r >= 0 could be relaxed. For example,
> if one is creating a drum-loop, then it would probably make most sense
> to put the loop points in the more quiet areas between the transients.
> And there you might only have noise that is independent between the
> two loop points, thus giving values of the correlation coefficient
> slightly positive or slightly negative. Because the length of a drum
> loop is fixed, there might not be so much choice in placement of the
> loop points, and a spot giving a slightly negative r might actually be
> the most natural choice. I do not think your formulas will fall apart
> just as long as -1 < r <= 1.

but i don't think it is necessary to deal with lags where Rxx(tau) <
0. why splice a waveform to another part of the same waveform that
has opposite polarity? that would create an even bigger glitch.
you want to find a value of the lag, tau, so that Rxx(tau) is maximum
(not including tau around 0) and then your splice is as seamless as it
can be. then, if the splice is real good (r=1), you use a constant-
voltage crossfade. when your splice is poor (r=0 and it need not be
poorer than that), you use a constant-power crossfade.

but i agree that the crossfade theory i presented does not require r >=
0. i just wanted to show that it degenerates to a constant-voltage
crossfade when r=1 and a constant-power crossfade when r=0.

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."




This is a continuation of the thread started by Element Green titled:
Algorithms for finding seamless loops in audio

As far as I know, it is not published anywhere. A few years ago, I
was thinking of writing this up and publishing it (or submitting it
for publication, probably to JAES), and had let it fall by the
wayside. I'm "publishing" the main ideas here on music-dsp because of
some possible interest here (and the hope it might be helpful to
somebody), and so that "prior art" is established in case anyone
like IVL is thinking of claiming it as their own. I really do not
know how useful it will be in practice. It might not make any
difference. It's just a theory.

______________________________________________________________________

Section 0:

This is about the generalization of the different ways we can splice
and crossfade audio that has these two extremes:

(1) Splicing perfectly coherent and correlated signals
(2) Splicing completely uncorrelated signals

I sometimes call the first case the "constant-voltage crossfade"
because the crossfade envelopes of the two signals being spliced add
up to one. The two envelopes meet when both have a value of 1/2. In
the second case, we use a "constant-power crossfade", where the squares of
the two envelopes add to one and they meet when both have a value of
sqrt(1/2) = 0.707 .

The questions I wanted to answer are: What does one do for cases in
between, and how does one know from the audio, which crossfade
function to use? How does one quantify the answers to these
questions? How much can we generalize the answer?

______________________________________________________________________

Section 1: Set up the problem.

We have two continuous-time audio signals, x(t) and y(t), and we want
to splice from one to the other at time t=0. In pitch-shifting or
time-scaling or any other looping, y(t) can be some delayed or
advanced version of x(t).

e.g. y(t) = x(t+P)

where P is a period length or some other "good" splice
displacement. We get that value, P, from an algorithm
we call a "pitch detector".

Also, it doesn't matter whether x(t) is getting spliced to y(t) or the
other way around, it should work just as well for the audio played in
reverse. And it should be no loss of generality that the splice
happens at t=0, we define our coordinate system any damn way we damn
well please.

The signal resulting from the splice is

v(t) = a(t)*y(t) + a(-t)*x(t)

By restricting our result to be equivalent if run either forward or
backward in time, we can conclude that the "fade-in" function (say that's
a(t)) is the time-reversed copy of the "fade-out" function, a(-t).

For the correlated case (1): a(t) + a(-t) = 1 for all t

For the uncorrelated case (2): (a(t))^2 + (a(-t))^2 = 1 for all t

This crossfade function, a(t), has well-defined even and odd symmetry
components:

a(t) = e(t) + o(t)
where

even part: e(t) = e(-t) = ( a(t) + a(-t) )/2
odd part: o(t) = -o(-t) = ( a(t) - a(-t) )/2

And it's clear that

a(-t) = e(t) - o(t) .


For example, if it's a simple linear crossfade (equivalent to splicing
analog tape with a diagonally-oriented razor blade):

{ 0 for t <= -1
{
a(t) = { 1/2 + t/2 for -1 < t < 1
{
{ 1 for t >= 1

This is represented simply, in the even and odd components, as:

e(t) = 1/2

{ t/2 for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1


where sgn(t) is the "sign function":


{ -1 for t < 0
{
sgn(t) = { 0 for t = 0
{
{ +1 for t > 0

a shorthand: sgn(t) = t/|t| .

This is a constant-voltage crossfade, appropriate for perfectly
correlated signals x(t) and y(t). There is no loss of generality by
defining the crossfade to take place around t=0 and have two time
units in length. Both are simply a matter of offset and scaling of
time.

Another constant-voltage crossfade would be what I might call a "Hann
crossfade" (after the Hann window):

e(t) = 1/2

{ (1/2)*sin(pi/2 * t) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1


Some might like that better because the derivative is continuous
everywhere. Extending this idea, one more constant-voltage crossfade
is what I might call a "Flattened Hann crossfade":

e(t) = 1/2

{ (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1

This splice is everywhere continuous in the zeroth, first, and second
derivative. A very smooth crossfade.

As another example, a constant-power crossfade would be the same as
any of the above, but where the above a(t) is square rooted:

{ 0 for t <= -1
{
a(t) = { sqrt(1/2 + t/2) for -1 < t < 1
{
{ 1 for t >= 1

This is what we might use to splice two completely uncorrelated signals
together. We can separate this into even and odd parts as:


{ (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2)) for |t| < 1
e(t) = {
{ 1/2 for |t| >= 1


{ (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2)) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1

______________________________________________________________________

Section 2: Which crossfade function to use?

Now we shall make a definition and an assumption. We shall define an
inner product of two general signals as:

+inf
<x,y> = <x(t), y(t)> = integral{ x(t)*y(t) * w(t) dt}
-inf

w(t) is a window function that is symmetrical about t=0 and is
probably wider than the crossfade. Strictly speaking, if you were
coming at this from out of a graduate course in metric spaces or
functional analysis, one of the components (probably y(t)) should be
complex conjugated, but since x(t) and y(t) are always real, in this
whole theory, I will not bother with that notation.

This inner product is a degenerate case of the more general cross-
correlation evaluated with a lag of zero:

+inf
Rxy(tau) = <x(t), y(t+tau)> = integral{ x(t)*y(t+tau) * w(t) dt}
-inf

If y(t) is a time-offset copy of x(t), then Rxy(tau) is the
autocorrelation of x(t), Rxx(tau), but also accounting for the time
offset in the lag, tau.

So <x,y> = Rxy(0)

A measure of signal energy or average power is:

+inf
Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
-inf

Now, the assumption that we are going to toss in here is that the mean
powers of the two signals that we are crossfading, x(t) and y(t), are
equal.

<x,x> = <y,y>

We are assuming that we're not crossfading this very quiet tone or
sound to a very loud sound that is 60 dB louder. Similarly, the
resulting spliced sound, v(t), has the same mean power as the two
signals being spliced:

<v,v> = <x,x> = <y,y>

So, assuming we lined up x(t) and y(t) so that we want to splice from
one to the other at t=0, and scaled x(t) and y(t) so that they have
the same mean power in the neighborhood of t=0, then the inner product
is a measure of how well they are correlated. We shall define this
normalized measure of correlation as:

r = <x,y>/<x,x> = <x,y>/<y,y>

If r = 1, they are perfectly correlated and if r = 0, they are
completely uncorrelated.

We will make the additional assumption that our pitch detection
algorithm will find *some* lag, P, where the correlation is at least
zero. We should not have to deal with splicing
*negatively* correlated audio (that would have quite a "glitch" or a
bad splice). If the two signals, x(t) and y(t), have no DC component,
then their autocorrelations and their cross-correlations to each other
must have no DC component. That means there will be values of tau
such that Rxy(tau) are either negative or positive. If it was
theoretical white noise, Rxx(tau) would be zero for |tau| > 0 and
Rxx(0) would be the noise variance or power. But Rxx(tau) cannot be
negative for *all* values of tau, even excluding tau=0.

For the splicing done in a time-domain pitch shifting or time scaling
algorithm, we can find a value of tau so that Rxx(tau) is non-negative
and we want to choose tau = P so that it has the highest value of
Rxx(tau). Then define

y(t) = x(t+P)

and then

<x,y> = Rxy(0) = Rxx(P)

Now we shall also assume that the crossfade function, a(t), is
completely uncorrelated and even statistically independent from the
two signals being spliced. a(t) is a volume control that varies in
time, but is unaffected by anything in x(t) or y(t).

We shall also assume something called "ergodicity". This means that
*time* averages of x(t) and y(t) (or combinations of x(t) and y(t))
are equal to *statistical* averages. If this window, w(t) is scaled
(or normalized) so that its integral is 1,

+inf
integral{ w(t) dt} = 1
-inf

then all these inner products (which are time averages) can be related
to "expectation values" (which are statistical averages):

<x,y> = E{ x(t)*y(t) }

If x(t) and y(t) are thought of as sorta "random" processes (rather
than well defined deterministic functions), the expectation value is
unmoved no matter what t is. But if the envelope a(t) is considered
deterministic, then it simply scales x(t) or y(t) and is treated as a
constant in the expectation. So at some particular time t0,

<a(t0)*x,y> = E{ (a(t0)*x(t)) * y(t) }

= a(t0) * E{ x(t) * y(t) }

= a(t0) * <x,y>

This is a little sloppy, mathematically, because I am "fixing" t for
a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time
averages" for x(t) and y(t) can be meaningful and equated to
statistical averages).

Recall that

v(t) = a(t)*y(t) + a(-t)*x(t)

Then:

<v,v> = <(a(t)*y(t) + a(-t)*x(t)), (a(t)*y(t) + a(-t)*x(t))>

Using identities that we can apply to expectation values

<v,v> = (a(t))^2*<y,y> + 2*a(t)*a(-t)*<x,y> + (a(-t))^2*<x,x>

Since <v,v> = <x,x> = <y,y>, we can divide by <v,v> and arrive at
the key equation of this whole theory:

1 = (a(t))^2 + 2*r*a(t)*a(-t) + (a(-t))^2

where r = <x,y>/<x,x> is the normalized correlation measure. We want
the above equation to hold for all t. If r=0 (completely
uncorrelated), one can see we get a constant-power crossfade:

(a(t))^2 + (a(-t))^2 = 1

If r=1 (completely correlated), one can see that we get a constant-
voltage crossfade:

(a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t) = ( a(t) + a(-t) )^2 = 1

or, assuming a(t) is non-negative,

a(t) + a(-t) = 1 .
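The two limiting cases are easy to check numerically. Here is a
minimal Python sketch (the function names are mine, not from the
thread): a linear ramp is the classic constant-voltage fade, and a
quarter-sine ramp gives a constant-power fade.

```python
import math

# Constant-voltage fade (r = 1): a(t) + a(-t) = 1 for a linear ramp.
def a_voltage(t):            # t runs over [-1, 1] across the crossfade
    return (1.0 + t) / 2.0

# Constant-power fade (r = 0): a(t)^2 + a(-t)^2 = 1 for a sine ramp.
def a_power(t):
    return math.sin(math.pi / 4.0 * (t + 1.0))

for t in [x / 10.0 for x in range(-10, 11)]:
    assert abs(a_voltage(t) + a_voltage(-t) - 1.0) < 1e-12
    assert abs(a_power(t) ** 2 + a_power(-t) ** 2 - 1.0) < 1e-12
```

The sine-ramp identity falls out because a_power(-t) equals
cos(pi/4*(t+1)), so the two squared terms are sin^2 + cos^2.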

______________________________________________________________________

Section 3: Generalizing the crossfade function

Recall that

a(t) = e(t) + o(t)

a(-t) = e(t) - o(t)

and substituting into

(a(t))^2 + (a(-t))^2 + 2*r*a(t)*a(-t) = 1

results in

(e(t) + o(t))^2 + (e(t) - o(t))^2
+ 2*r*(e(t) + o(t))*(e(t) - o(t)) = 1

Blasting through that gets:

(1+r)*(e(t))^2 + (1-r)*(o(t))^2 = 1/2


This means that, if r is measured and known (from the correlation
function) we have the freedom to define either one of e(t) or o(t)
arbitrarily (as long as the even or odd symmetry is kept) and solve
for the other. We can see that square rooting is involved in solving
for either e(t) or o(t) and there is an ambiguity for which sign to
pick. We shall resolve that ambiguity by adding the additional
assumption that the even-symmetry component, e(t), is non-negative.

e(t) = e(-t) >= 0

Given a general and bipolar odd-symmetry component function,

o(t) = -o(-t)

then we solve for the even component (picking the non-negative square
root):

e(t) = sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

The overall crossfade envelope would be

a(t) = e(t) + o(t)

= sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 ) + o(t)
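As a sanity check on that formula, here is a sketch (names are mine)
that builds a(t) from an arbitrary odd shape o(t) and a given r, then
verifies the key equation numerically. Note the sqrt argument stays
non-negative as long as |o(t)| <= 1/2 and 0 <= r <= 1.

```python
import math

def make_crossfade(o, r):
    """Build a(t) = e(t) + o(t), solving for the non-negative even part."""
    def a(t):
        e = math.sqrt(0.5 / (1.0 + r) - (1.0 - r) / (1.0 + r) * o(t) ** 2)
        return e + o(t)
    return a

o_linear = lambda t: t / 2.0        # a linear-ramp odd part, t in [-1, 1]

for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    a = make_crossfade(o_linear, r)
    for t in (x / 8.0 for x in range(-8, 9)):
        lhs = a(t) ** 2 + 2.0 * r * a(t) * a(-t) + a(-t) ** 2
        assert abs(lhs - 1.0) < 1e-12   # the key equation holds for all t
```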

______________________________________________________________________

Section 4: Implementation

Given a particular form for the odd part, o(t) (linear or Hann or
Flattened Hann or whatever is your heart's desire), and for a variety
of values of r ranging from r=0 to r=1, a collection of envelope
functions, a(t), are pre-calculated and stored in memory. Then, when
pitch detection or loop matching is done and an optimal splice
displacement is determined, if autocorrelation of some form is used
in determining a measure of goodness (or seamlessness, using Element's
language) of that loop splice, that autocorrelation is normalized (by
dividing by Rxx(0)) to get r, and that value of r is used to choose
which pre-calculated a(t) from the above collection is used for the
crossfade in the splice.
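That table-plus-lookup scheme might be sketched like this in Python
(all names and the grid spacing are my own illustration, not code from
the thread; the odd part is fixed to the linear ramp):

```python
import math

def measured_r(x, P, start, length):
    """Autocorrelation at the splice lag P over a window, normalized
    by Rxx(0), as described above."""
    rxx0 = sum(x[n] * x[n]     for n in range(start, start + length))
    rxxP = sum(x[n] * x[n + P] for n in range(start, start + length))
    return rxxP / rxx0

def envelope(r, npoints=65):
    """Pre-calculate a(t) = e(t) + o(t) for one r, linear o(t) = t/2."""
    out = []
    for i in range(npoints):
        t = 2.0 * i / (npoints - 1) - 1.0          # t in [-1, 1]
        o = t / 2.0
        e = math.sqrt(0.5 / (1 + r) - (1 - r) / (1 + r) * o * o)
        out.append(e + o)
    return out

R_GRID = [i / 10.0 for i in range(11)]             # r = 0.0, 0.1, ... 1.0
TABLE  = {r: envelope(r) for r in R_GRID}

def pick_envelope(r):
    """Nearest pre-calculated a(t) for the measured correlation r."""
    return TABLE[min(R_GRID, key=lambda g: abs(g - r))]

# demo: a sine at a lag of exactly one period correlates perfectly,
# so the constant-voltage envelope (0 -> 1 linear ramp) gets picked
x = [math.sin(2 * math.pi * n / 32) for n in range(256)]
r = measured_r(x, 32, 0, 128)
```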

______________________________________________________________________
Olli Niemitalo
2011-07-13 13:29:39 UTC
Permalink
On Sat, Jul 9, 2011 at 10:53 PM, robert bristow-johnson
<***@audioimagination.com> wrote:
> On Dec 7, 2010, at 5:27 AM, Olli Niemitalo wrote:
>
> > [I] chose that the ratio a(t)/a(-t) [...] should be preserved
>
> by "preserved", do you mean constant over all t?

Constant over all r.

> what is the fundamental reason for preserving a(t)/a(-t) ?

I'm thinking outside your application of automatic finding of splice
points. Think of crossfades between clips in a multi-track sample
editor. For a cross-fade in which one signal is faded in using a
volume envelope that is a time-reverse of the volume envelope using
which the other signal is faded out, a(t)/a(-t) describes by what
proportions the two signals are mixed at each t. The fundamental
reason then is that I think it is a rather good description of the
shape of the fade, to a user, as it will describe how the second
signal swallows the first by time. The user might choose one "shape"
for a particular crossfade. Then, depending on the correlation between
the superimposed signals, an appropriate symmetrical volume envelope
could be applied to the mixed signal to ensure that there is no peak
or dip in the contour of the mixed signal. Because the envelope is
symmetrical, applying it "preserves" a(t)/a(-t). It can also be
incorporated directly into a(t).

All that is not so far off from the application you describe.

> but i don't think it is necessary to deal with lags where Rxx(tau) < 0.  why
> splice a waveform to another part of the same waveform that has opposite
> polarity?  that would create an even bigger glitch.

Splicing at quiet regions with negative correlation can give a smaller
glitch than splicing at louder regions with positive correlation. This
applies particularly to rhythmic material like drum loops, where the
time lag between the splice points is constrained, and it may make
most sense to look for quiet spots. However, if it's already so quiet
in there, I don't know how much it matters what you use for a
cross-fade.

Apart from "it's so quiet it doesn't matter", I can think of one other
objection against using cross-fades tailored for r < 0: For example,
let's imagine that our signal is white noise generated from a Gaussian
distribution, and we are dealing with given splice points for which
Rxx(tau) < 0 (slightly). Now, while the samples of the signal were
generated independently, there is "by accident" a bit of negative
correlation in the instantiation of the noise, between those splice
points. Knowing all this, shouldn't we simply use a constant-power
fade, rather than a fade tailored for r < 0? Random deviations in
noise power are to be expected, and only a constant-power fade will
produce noise that is statistically identical to the original. I would
imagine that noise with long-time non-zero autocorrelation (all the
way across the splice points) is a very rare occurrence. Then again,
do we really know all this, or even that we are dealing with noise?

I should note that Rxx(tau) < 0 does not imply opposite polarity, in
the fullest sense of the adjective. Two equal sinusoids that have
phases 91 degrees apart have a correlation coefficient of about
-0.009.

RBJ, I'd like to return the favor and let you know that I have great
respect for you in these matters (and absolutely no disrespect in any
others :-) ). Hey, I wonder if you missed also my other post in the
parent thread? You can search for
AANLkTim=eM_kgPeibOqFGEr2FdKyL5uCCB_wJhz1Vne

-olli
robert bristow-johnson
2011-07-14 18:22:19 UTC
Permalink
On Jul 13, 2011, at 9:29 AM, Olli Niemitalo wrote:

> On Sat, Jul 9, 2011 at 10:53 PM, robert bristow-johnson
> <***@audioimagination.com> wrote:
>> On Dec 7, 2010, at 5:27 AM, Olli Niemitalo wrote:
>>
>>> [I] chose that the ratio a(t)/a(-t) [...] should be preserved
>>
>> by "preserved", do you mean constant over all t?
>
> Constant over all r.
>

i think i figgered that out after hitting the Send button.

>> what is the fundamental reason for preserving a(t)/a(-t) ?
>
> I'm thinking outside your application of automatic finding of splice
> points. Think of crossfades between clips in a multi-track sample
> editor. For a cross-fade in which one signal is faded in using a
> volume envelope that is a time-reverse of the volume envelope using
> which the other signal is faded out, a(t)/a(-t) describes by what
> proportions the two signals are mixed at each t. The fundamental
> reason then is that I think it is a rather good description of the
> shape of the fade, to a user, as it will describe how the second
> signal swallows the first by time.

okay, i get it.

so instead of expressing the crossfade envelope as

a(t) = e(t) + o(t)

i think we could describe it as a constant-voltage crossfade (those
used for splicing perfectly correlated snippets) bumped up a little by
an overall loudness function. an envelope acting on the envelope.
and, as you correctly observed, for constant-voltage crossfades, the
even component is always

e(t) = 1/2

so, pulling another couple of letters outa the alfabet, we can
represent the crossfade function as

a(t) = e(t) + o(t) = g(t)*( 1/2 + p(t) )

where

g(-t) = g(t) is even
and
p(-t) = -p(t) is odd


g(t) = 1 for constant-voltage crossfades, when r=1.
for constant-power crossfades, r=0, we know that g(0) = sqrt(2) > 1

the shape p(t) is preserved for different values of r and we want to
solve for g(t) given a specified correlation value r and a given
"shape" family p(t). indeed

a(t)/a(-t) = (1/2 + p(t))/(1/2 - p(t))

and remains preserved over r if p(t) remains unchanged.

p(t) can be spec'd initially exactly like o(t) (linear crossfade,
Hann, Flattened Hann, or whatever odd function your heart desires). i
think it should be easy to solve for g(t). we know that


e(t) = 1/2 * g(t)

o(t) = g(t) * p(t)

and recall the result

e(t) = sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

which comes from

(1+r)*( e(t) )^2 + (1-r)*( o(t) )^2 = 1/2

so
(1+r)*( 1/2*g(t) )^2 + (1-r)*( g(t)*p(t) )^2 = 1/2


( g(t) )^2 * ( (1+r)/4 + (1-r)*(p(t))^2 ) = 1/2

and picking the positive square root for g(t) yields

g(t) = 1/sqrt( (1+r)/2 + 2*(1-r)*(p(t))^2 )

might this result match what you have? (assemble a(t) from g(t) and
p(t) just as we had previously from e(t) and o(t).)

remember that p(t) is odd so p(0)=0 so when

r=1 ---> g(t) = 1 (constant-voltage crossfade)
and

r=0 ---> g(0) = sqrt(2) (constant-power crossfade)
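a quick numerical check of that g(t) result (the Python names are
mine): the endpoints behave as claimed, and a(t) = g(t)*(1/2 + p(t))
still satisfies the key equation for every r, while a(t)/a(-t)
depends only on p(t), so the mix shape really is preserved over r.

```python
import math

def g(t, r, p):
    """Loudness bump g(t) for correlation r and odd mix-shape p(t)."""
    return 1.0 / math.sqrt((1.0 + r) / 2.0 + 2.0 * (1.0 - r) * p(t) ** 2)

p_lin = lambda t: t / 2.0            # linear mix shape, t in [-1, 1]

# endpoint checks from the derivation above
assert abs(g(0.0, 1.0, p_lin) - 1.0) < 1e-12             # r=1: const voltage
assert abs(g(0.0, 0.0, p_lin) - math.sqrt(2.0)) < 1e-12  # r=0: g(0)=sqrt(2)

# a(t) = g(t)*(1/2 + p(t)) satisfies the key equation for every r and t
for r in (0.0, 0.3, 0.7, 1.0):
    for t in (x / 8.0 for x in range(-8, 9)):
        a  = g(t,  r, p_lin) * (0.5 + p_lin(t))
        am = g(-t, r, p_lin) * (0.5 + p_lin(-t))
        assert abs(a * a + 2.0 * r * a * am + am * am - 1.0) < 1e-12
```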


> The user might choose one "shape"
> for a particular crossfade. Then, depending on the correlation between
> the superimposed signals, an appropriate symmetrical volume envelope
> could be applied to the mixed signal to ensure that there is no peak
> or dip in the contour of the mixed signal. Because the envelope is
> symmetrical, applying it "preserves" a(t)/a(-t). It can also be
> incorporated directly into a(t).
>
> All that is not so far off from the application you describe.
>
>> but i don't think it is necessary to deal with lags where Rxx(tau)
>> < 0. why
>> splice a waveform to another part of the same waveform that has
>> opposite
>> polarity? that would create an even bigger glitch.
>
> Splicing at quiet regions with negative correlation can give a smaller
> glitch than splicing at louder regions with positive correlation.

okay. i would still like to "hunt" for a splice displacement around
that quiet region that would have correlation better than zero. and,
if both x(t) and y(t) have no DC, it should be possible to find
something.

> This
> applies particularly to rhythmic material like drum loops, where the
> time lag between the splice points is constrained, and it may make
> most sense to look for quiet spots. However, if it's already so quiet
> in there, I don't know how much it matters what you use for a
> cross-fade.
>
> Apart from "it's so quiet it doesn't matter", I can think of one other
> objection against using cross-fades tailored for r < 0: For example,
> let's imagine that our signal is white noise generated from a Gaussian
> distribution, and we are dealing with given splice points for which
> Rxx(tau) < 0 (slightly).

but you should also be able to find a tau where Rxx(tau) is slightly
greater than zero because Rxx(tau) should be DC free (if x(t) is DC
free). if it were true noise, it should not be far from zero so you
would likely use the r=0 crossfade function.

> Now, while the samples of the signal were
> generated independently, there is "by accident" a bit of negative
> correlation in the instantiation of the noise, between those splice
> points. Knowing all this, shouldn't we simply use a constant-power
> fade, rather than a fade tailored for r < 0, because random deviations
> in noise power are to be expected, and only a constant-power fade will
> produce noise that is statistically identical to the original. I would
> imagine that noise with long-time non-zero autocorrelation (all the
> way across the splice points) is a very rare occurrence. Then again,
> do we really know all this, or even that we are dealing with noise.

are you stuck with a particular displacement between x(t) and y(t)?
can you nudge one or the other over a little bit so you can find a
correlation that is at least as good as r=0?

> I should note that Rxx(tau) < 0 does not imply opposite polarity, in
> the fullest sense of the adjective. Two equal sinusoids that have
> phases 91 degrees apart have a correlation coefficient of about
> -0.009.

yes, but 91 degrees outa phase is a little more opposite polarity than
it is like polarity.

> Hey, I wonder if you missed also my other post in the
> parent thread? You can search for
> AANLkTim=eM_kgPeibOqFGEr2FdKyL5uCCB_wJhz1Vne

i think i had missed it. i will look for it.


thanks for your response, Olli. i think it's better to define p(t)
(with the same restrictions as o(t)) and find g(t) as a function of r
than it is to do it with o(t) and e(t). then your "mix-shape" is
preserved for different values of r and for r<1, we are just bumping
up the overall loudness a little to preserve constant power for all t.

L8r,

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Olli Niemitalo
2011-07-14 21:36:43 UTC
Permalink
On Thu, Jul 14, 2011 at 9:22 PM, robert bristow-johnson
<***@audioimagination.com> wrote:
>
>      g(t)  =  1/sqrt( (1+r)/2 + 2*(1-r)*(p(t))^2 )
>
> might this result match what you have?

Yes! I only derived the formula for the linear ramp, p(t) = t/2,
because one can get the other shapes by warping time and I didn't want
to bloat the cumbersome equations. With the linear ramp our results
match exactly.

> okay.  i would still like to "hunt" for a splice displacement around that
> quiet region that would have correlation better than zero

Sometimes you are stuck with a certain displacement. Think drum loops;
changing tau would change tempo.

> i think it's better to define p(t) (with the same restrictions as o(t)) and find g(t) as a
> function of r than it is to do it with o(t) and e(t).

I agree, even though the theory was quite elegant with o(t) and e(t)...

-olli
robert bristow-johnson
2011-07-15 01:05:42 UTC
Permalink
On Jul 14, 2011, at 5:36 PM, Olli Niemitalo wrote:

> On Thu, Jul 14, 2011 at 9:22 PM, robert bristow-johnson
> <***@audioimagination.com> wrote:
>>
>> g(t) = 1/sqrt( (1+r)/2 + 2*(1-r)*(p(t))^2 )
>>
>> might this result match what you have?
>
> Yes! I only derived the formula for the linear ramp, p(t) = t/2,
> because one can get the other shapes by warping time and I didn't want
> to bloat the cumbersome equations. With the linear ramp our results
> match exactly.
>
>> okay. i would still like to "hunt" for a splice displacement
>> around that
>> quiet region that would have correlation better than zero
>
> Sometimes you are stuck with a certain displacement. Think drum loops;
> changing tau would change tempo.
>
>> i think it's better to define p(t) (with the same restrictions as
>> o(t)) and find g(t) as a
>> function of r than it is to do it with o(t) and e(t).
>
> I agree, even though the theory was quite elegant with o(t) and
> e(t)...
>

do you have any of this in a document? i wonder if one of us should
put this down in a pdf and put it in the music-dsp "code" archive.


--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Sampo Syreeni
2011-07-15 04:46:34 UTC
Permalink
On 2011-07-15, Olli Niemitalo wrote:

What are you trying to accomplish here, really? Optimum splicing, sure,
but against which precise criterion?
--
Sampo Syreeni, aka decoy - ***@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Olli Niemitalo
2011-07-15 09:53:15 UTC
Permalink
On Fri, Jul 15, 2011 at 7:46 AM, Sampo Syreeni <***@iki.fi> wrote:
> On 2011-07-15, Olli Niemitalo wrote:
>
> What are you trying to accomplish here, really? Optimum splicing, sure, but
> against which precise criterion?

My objective has not been to find a method for automatic splicing, but
to do nice cross-fades at given splice points.

There were multiple objectives:
* Intuitive definition of the cross-fade shape. Mixing ratio as a
function of time is a good definition.
* For stationary signals, there should be no clicks or transients
produced. This is taken care of by the smoothness of the cross-fade
envelopes.
* For stationary signals, the resulting measurable transition from the
volume level of signal 1 to volume level of signal 2 should follow the
chosen cross-fade shape. This can be accomplished knowing the volume
levels of the two signals and the correlation coefficient between the
two signals.

-olli
Wen Xue
2011-07-15 11:24:07 UTC
Permalink
I have the following made-up scenarios -
1) If I twist the 2nd half of some x(t) by 180 degrees, then it
becomes orthogonal to the original x(t). How do we cross-fade it with
x(t)?
2) If I twist the 1st third of x(t) by 180 degrees and the 3rd third
by 90 degrees?
3) If I twist the 2nd and 4th quarters of x(t) by 180 degrees?
In all such cases the correlation is 0. Do we cross-fade them in the
same way?

Xue



-----Original Message-----
My objective has not been to find a method for automatic splicing, but
to do nice cross-fades at given splice points.

There were multiple objectives:
* Intuitive definition of the cross-fade shape. Mixing ratio as a
function of time is a good definition.
* For stationary signals, there should be no clicks or transients
produced. This is taken care of by the smoothness of the cross-fade
envelopes.
* For stationary signals, the resulting measurable transition from the
volume level of signal 1 to volume level of signal 2 should follow the
chosen cross-fade shape. This can be accomplished knowing the volume
levels of the two signals and the correlation coefficient between the
two signals.

-olli
Olli Niemitalo
2011-07-15 12:01:47 UTC
Permalink
That won't be a problem if you measure the correlation locally, but
how exactly? Certainly anything outside the cross-fade region should
be excluded from the measurement. And inside, it matters most wherever
the mixing ratio is close to 50-50, as in that case the phase
difference of the two signals gives the greatest contribution to the
resulting measurable volume envelope of the mixed signal. Probably the
data
should be windowed for measurement of correlation (and volume),
depending on the mixing function...

-olli

On Fri, Jul 15, 2011 at 2:24 PM, Wen Xue <***@eecs.qmul.ac.uk> wrote:
> I have the following made-up scenarios -
> 1) If I twist the 2nd half of some x(t) by 180 degrees then it becomes
> orthogonal to the original x(t). How do we cross-fade it with x(t)?
> 2) If I twist the 1st third of x(t) by 180 degrees and 3rd third by 90
> degrees?
> 3) If I twist the 2nd and 4th quarters of x(t) by 180 degrees?
> In all such cases the correlation is 0. Do we cross-fade them in the same
> way?
>
> Xue
>
>
>
> -----Original Message-----
> My objective has not been to find a method for automatic splicing, but
> to do nice cross-fades at given splice points.
>
> There were multiple objectives:
> * Intuitive definition of the cross-fade shape. Mixing ratio as a
> function of time is a good definition.
> * For stationary signals, there should be no clicks or transients
> produced. This is taken care of by the smoothness of the cross-fade
> envelopes.
> * For stationary signals, the resulting measurable transition from the
> volume level of signal 1 to volume level of signal 2 should follow the
> chosen cross-fade shape. This can be accomplished knowing the volume
> levels of the two signals and the correlation coefficient between the
> two signals.
>
> -olli
>
> --
> dupswapdrop -- the music-dsp mailing list and website:
> subscription info, FAQ, source code archive, list archive, book reviews, dsp links
> http://music.columbia.edu/cmc/music-dsp
> http://music.columbia.edu/mailman/listinfo/music-dsp
>
robert bristow-johnson
2011-07-15 16:01:56 UTC
Permalink
On Jul 15, 2011, at 12:46 AM, Sampo Syreeni wrote:

>
> What are you trying to accomplish here, really? Optimum splicing,
> sure, but against which precise criterion?

the precise criterion is how well the two signals being spliced
correlate to one another. i tried to set that up with the inner
product notation.

+inf
<x,y> = <x(t), y(t)> = integral{ x(t)*y(t) * w(t) dt}
-inf

where w(t) is a window function centered at t=0.

the normalized correlation measure is:

r = <x,y>/<x,x> = <x,y>/<y,y>

if r=1, they are perfectly correlated and a constant-voltage splice
should be used. if r=0 they are completely uncorrelated and a
constant-power splice should be used. if 0 < r < 1 then some kinda
splice in between a constant-voltage and constant-power splice should
be used. if r < 0, then there has to be a boost of even *more* than 3
dB (that sqrt(2) factor at g(0)) to keep the expected loudness
envelope constant. Olli and i see the need for such slightly
differently.

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Stefan Stenzel
2010-12-13 10:57:30 UTC
Permalink
Moin Robert & others,

On 06.12.2010 19:49, robert bristow-johnson wrote:
>
> On Dec 6, 2010, at 1:23 PM, Stefan Stenzel wrote:
>
>> On 06.12.2010 08:59, robert bristow-johnson wrote:
>>>
>>> This is a continuation of the thread started by Element Green titled: Algorithms for finding seamless loops in audio
>>
>> I suspect it works better to *construct* a seamless loop instead of trying to find one where there is none.
>
> i can't speak for Greenie or any others, but i myself would be very interested in what you might have to say about constructing seamless loops. regarding that, i would like to know the context (e.g. looping a non-realtime sample editor for a sampling synth vs. a realtime pitch shifter) and the kinds of signals (quasi-periodic vs. aperiodic vs. periodic but with detuned higher harmonics). processing in frequency domain or time domain (or some in both)?

I construct seamless loops in frequency domain in a non-realtime application, and I
am quite happy with the results. If you ask for a recipe, this is what I am doing:

- detect pitch of (whole) sample via AC (via FFT)
- decide on block to be looped (behind attack segment, usually more than 1 s long)
- detect frequency peaks in that block (frequency domain)
- shift to integer fractions of loop length but preserve amplitude and initial phase
- back to time domain
- fade to loop in original sample (only played once as no fade is inside loop)
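The essential step in that recipe is the snap-to-integer-cycles move.
Here is a minimal sketch of just that step (everything here, including
the example partial list, is my own illustration, not Stefan's code;
it resynthesizes known partials directly rather than detecting FFT
peaks in a real recording):

```python
import math

def build_loop(partials, loop_len, sr):
    """Snap each partial to an integer number of cycles over the loop,
    preserving its amplitude and initial phase, then resynthesize."""
    n = int(loop_len * sr)
    snapped = [(round(f * loop_len) / loop_len, a, ph)
               for (f, a, ph) in partials]
    loop = [sum(a * math.sin(2 * math.pi * f * t / sr + ph)
                for (f, a, ph) in snapped)
            for t in range(n)]
    return loop, snapped

# hypothetical, slightly inharmonic piano-like partials: (freq, amp, phase)
partials = [(220.3, 1.0, 0.1), (441.1, 0.5, 1.3), (663.9, 0.25, -0.7)]
loop, snapped = build_loop(partials, loop_len=1.0, sr=8000)
```

Because every snapped frequency now completes a whole number of cycles
in the loop, sample n wraps exactly onto sample 0; the inharmonicity
survives as a small detuning (220.3 Hz becomes 220.0 Hz here, roughly
2 cents) rather than a splice glitch.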

> dunno if there is any PPG "secrets" or wisdom to confer, but i would like to hear or read it.

None of this in any PPG or Waldorf. Currently I use it for automatically looping huge piano
sample sets, not for the memory but in order to fight noise. Tried it with other material
like chords with surprisingly good results though.

Stefan
robert bristow-johnson
2010-12-14 05:15:56 UTC
Permalink
thanks, Stefan, for getting back on this.

On Dec 13, 2010, at 5:57 AM, Stefan Stenzel wrote:

> I construct seamless loops in frequency domain in a non-realtime
> application, and I
> am quite happy with the results. If you ask for a recipe, this is
> what I am doing:
>
> - detect pitch of (whole) sample via AC (via FFT)
> - decide on block to be looped (behind attack segment, usually more
> than 1 s long)
> - detect frequency peaks in that block (frequency domain)
> - shift to integer fractions of loop length but preserve amplitude
> and initial phase
> - back to time domain
> - fade to loop in original sample (only played once as no fade is
> inside loop)

from just reading this, it appears to be about the same thing that a
certain unnamed keyboard synth manufacturer does. they detuned (very
slightly) some of the partials so that each partial or overtone had an
integer number of cycles over the length of the loop. even if they
were slightly inharmonic, they were nudged slightly to be some other
inharmonic ratio to the fundamental. i doubt that either the original
peak or the moved peak sat exactly on integer bin indices in the FFT.
then interpolation in the frequency domain is necessary (besides
having to delimit each peak from the adjacent peaks) to move those
peaks slightly.

this isn't a problem with piano, but what if the sample is of some
acoustic instrument with vibrato in the recording of a single note.
then there isn't an exact pitch for the whole sample of the note,
because it varies in time.

>> dunno if there is any PPG "secrets" or wisdom to confer, but i
>> would like to hear or read it.
>
> None of this in any PPG or Waldorf.

i can see that. it's about sample loops, not the sequential single-
cycle loops one would construct for wavetable synthesis.

> Currently I use it for automatically looping huge piano
> sample sets, not for the memory but in order to fight noise.

well, for sure you want the splice to be seamless for all harmonics,
or better yet "partials", of any appreciable magnitude. being that
there are non-harmonic partials in a lot of acoustic instruments, most
certainly piano, i know why you would want to adjust them a little so
that the phases of all partials are aligned and the jump in the loop
is seamless.

> Tried it with other material
> like chords with surprisingly good results though.

sure, if the loop is long enough and if you can adjust the frequencies
slightly. and, of course, it will work better on simple major chords
than it would with a fully diminished chord or something with
dissonant intervals.

this certain unnamed keyboard synth manufacturer didn't think so
either (specifically certain non-experts in their engineering
management), but, for this piano or some other pitched instrument, a
wavetable analysis would do as well or even better for the cases where
there is vibrato to track and deal with.

1. first pitch-detection (using AC or AMDF or whatever) is performed
very often (say once or twice per millisecond) throughout the note,
from beginning to end. octave errors are dealt with and tight pitch
tracking is done.

2. then single-cycle wavetables are computed for each of those
milliseconds with each new period estimate. (but the changing pitch
is recorded and used for resynthesis.)

3. FFT of each wavetable is performed. X[0] is DC, X[1] and X[N-1] is
the first harmonic, etc. the harmonics that are actually a little
detuned and non-harmonic will have phase slipping a little each
adjacent wavetable. for the length of the loop, you would want the
phase of each harmonic to be the same at the end as it was at the
beginning of the loop.

4. the loop length is chosen to accomplish that for the lower
harmonics (there would be an integer number of cycles for each of
these lower harmonics in the loop length). then the higher harmonics
that do not quite get back to the same phase at the loop end that they
were at the loop start, that phase difference is then split evenly for
all wavetables in between. this would cause an integer number of
cycles for every harmonic, but they wouldn't necessarily be integer
multiples of the fundamental. it is true that if one were to consider
the loop length as a "period", then all partials would be integer
harmonics after this adjustment, but what was previously considered
the fundamental would not be the fundamental if the loop length is
called the period.
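the "split the phase difference evenly" part of step 4 can be sketched
for one harmonic like this (the function names and the 10-wavetable
example are hypothetical, not from any actual product):

```python
import math

def wrap(phi):
    """Wrap a phase difference into (-pi, pi]."""
    return math.atan2(math.sin(phi), math.cos(phi))

def align_harmonic(phases):
    """phases[m] = measured phase of one partial in wavetable m.
    Spread the end-to-start phase mismatch evenly across the
    wavetables so the partial returns to its starting phase at the
    loop end."""
    M = len(phases)
    drift = wrap(phases[-1] - phases[0])       # leftover phase at loop end
    return [wrap(ph - drift * m / (M - 1)) for m, ph in enumerate(phases)]

# hypothetical detuned partial: phase creeps 0.3 rad over a 10-wavetable loop
raw   = [0.1 + 0.3 * m / 9 for m in range(10)]
fixed = align_harmonic(raw)
```

after the adjustment the partial has an integer number of cycles over
the loop, at the cost of nudging its frequency by drift/(2*pi*loop
length) -- exactly the slight re-detuning described above.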

i suppose i could illustrate what i mean here with a bogus example, if
i haven't made it sufficiently clear. i just think that wavetable
synthesis has application that is broader than just playing single-
cycle loops.


--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Stefan Stenzel
2010-12-15 16:20:38 UTC
Permalink
Moin Robert & others,

On 14.12.2010 06:15, robert bristow-johnson wrote:
> this isn't a problem with piano, but what if the sample is of some acoustic instrument with vibrato in the recording of a single note. then there isn't an exact pitch for the whole sample of the note, because it varies in time.

Right, but if you consider 1/loop length the fundamental frequency, vibrato becomes simple FM.
This might sound stoopid, as we certainly perceive it in our own time domain, but that does not
mean we cannot take advantage of frequency domain processing. The problem here lies not so much
in the frequency alignment itself but the pitch detection, which ideally finds a multiple of
both the fundamental and the modulation frequency.

In reality, if you choose your loop to be long enough, you can almost get away with any length,
even if this is completely unrelated to the original pitch. Consider a 4 sec loop, all frequencies
are multiples of 0.25 Hz. At 440 Hz, this difference is just 1 cent and hardly audible. Works for
major as well as for minor chords, as for some 10CC not-in-love vocal cluster.
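Stefan's "just 1 cent" figure checks out with a quick computation (the
function name is mine):

```python
import math

def cents(f1, f2):
    """Interval between two frequencies, in cents."""
    return 1200.0 * math.log2(f2 / f1)

loop_len  = 4.0                    # seconds
grid      = 1.0 / loop_len         # loop frequencies: multiples of 0.25 Hz
step_err  = cents(440.0, 440.0 + grid)        # a full 0.25 Hz step at 440 Hz
worst_err = cents(440.0, 440.0 + grid / 2.0)  # worst-case snap: half a step
```

step_err comes out around 0.98 cents and worst_err around 0.49 cents,
consistent with "just 1 cent and hardly audible".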

> well, for sure you want the splice to be seamless for all harmonics, or better yet "partials", of any appreciable magnitude. being that there are non-harmonic partials in a lot of acoustic instruments, most certainly piano, i know why you would want to adjust them a little so that phases of all partials are aligned the jump in the loop is seamless.

Yes, very seamless, I think this is what a loop should be. I cannot see how any frequency *not*
being a multiple of the loop frequency could be represented in that loop.

[...]
> i suppose i could illustrate what i mean here with a bogus example, if i haven't made it sufficiently clear. i just think that wavetable synthesis has application that is broader than just playing single-cycle loops.
To be honest I didn't quite get that. It could help if the unnamed
manufacturer could be named; I cannot yet see why it should remain
anonymous.

Regards,
Stefan
robert bristow-johnson
2010-12-18 06:56:52 UTC
Permalink
okay, i don't seem to get any time to deal with this except late at
night.

so this is continuing that thread that was named "A theory of optimal
splicing of audio in the time domain."

On Dec 15, 2010, at 11:20 AM, Stefan Stenzel wrote:

> On 14.12.2010 06:15, robert bristow-johnson wrote:
>> this isn't a problem with piano, but what if the sample is of some
>> acoustic instrument with vibrato in the recording of a single
>> note. then there isn't an exact pitch for the whole sample of the
>> note, because it varies in time.
>
> Right, but if you consider 1/loop length the fundamental frequncy,
> vibrato becomes simple FM.

well, you will have a sparse line spectrum for your "single cycle".
the "real" first harmonic becomes something like the 50th harmonic of
your 1/(loop_length) fundamental if the loop had 50 cycles of the tone
between endpoints. then you will have a spike at around the 100th
harmonic, 150th, 200th and so on. you can DFT the entire loop length
(with no windowing), and the DFT will have the Fourier coefficients of
your big, long "single cycle" (which looks like 50 cycles). if there
was no vibrato, the energy would be nearly all in the X[50], X[100],
X[150], X[200] ... bins. because this is a DFT of an integer number
of cycles, the adjacent bins would be nearly zero, relatively (if
there is no vibrato).

like with a piano, the higher harmonics would start to get a little
sharp and, say, the "real" 12th harmonic would lie perhaps at X[601]
instead of X[600] if that harmonic was 2.88 cents sharp. but the 11th
harmonic and the 13th harmonic would also be sharp and not by exactly
one (or some other small integer) bin. then there *will* be neighboring
bins with significant energy, because it would be like a sinc()
function sampled off of the integer values. you would have to
interpolate around these adjacent bins to get the "true" peak location
(at a fractional in-between bin location) and peak height so you would
know that there is not a precise integer number of cycles of that
harmonic in your loop.

now i presume that you would want to move those slightly detuned
harmonics to squarely an integer bin location and you would compute
the distance from the interpolated peak to the nearest integer bin.
higher or lower, i'm not entirely sure - if they're, say, 0.4 bin
width sharp, you might want to bump it up to the next integer bin
rather than down to the nearest bin where it would not be slightly sharp
anymore. i dunno, we want to preserve these outa-tune harmonics to
keep the sample "live" sounding.

now one problem, i might guess, would be *if* there is also vibrato,
those harmonic peaks will get spread out among the adjacent bins, and
i am not sure that it will be symmetrical about the "true" peak, and
if it is not, i am not sure how you determine exactly where the peak
is before moving it. not necessarily a big issue. so then you move
each peak (and adjacent bins) to an integer bin location, inverse DFT,
and all of the partials should have an integer number of cycles
between the loop endpoints.

> This might sound stoopid, as we certainly perceive it in our own time
> domain, but that does not
> mean we cannot take advantage of frequency domain processing. The
> problem here lies not so much
> in the frequency alignment itself but the pitch detection, which
> ideally finds a multiple of
> both the fundamental and the modulation frequency.

i know, selecting the correct number of cycles for the loop so that
there is an integer number of vibrato cycles would be the main
criterion of choosing a loop length and endpoints. you would do that
with little regard of what those sharpened harmonics are doing and fix
them later with this frequency-domain method. (and there is a
wavetable way to do it, that tracks the varying fundamental.)

> In reality, if you choose your loop to be long enough, you can
> almost get away with any length,
> even if this is completely unrelated to the original pitch. Consider
> a 4 sec loop, all frequencies
> are multiples of 0.25 Hz. At 440 Hz, this difference is just 1 cent
> and hardly audible.

well if you get 1760.5 cycles in the loop (because it's not exactly
440 Hz or not exactly 4 sec) then instead of 1760, you *could* get a
glitch in the splice, no matter how slow the crossfade is, because
when the crossfade is at 50-50 (%), then you will get destructive
interference for all odd harmonics. but, i know you would adjust it a
little to get an exact integer number of cycles. but, in my opinion,
you would have to track the cycle phase tightly to do it, which would
be equivalent to cross-correlating (or AMDF) the two loop endpoints
together to get the best loop length.
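
the half-cycle case is easy to demonstrate (a sketch; the 440.125 Hz
figure is hypothetical and chosen so a 4 second loop holds exactly
1760.5 cycles):

```python
import math

f = 440.125                  # Hz: gives 1760.5 cycles in 4 seconds
T = 4.0                      # loop length in seconds
# f*T = 1760.5, so the loop endpoints are a half cycle out of phase

t = 1.234                    # any instant inside the crossfade
x_now = math.sin(2*math.pi*f*t)         # signal entering the splice
x_loop = math.sin(2*math.pi*f*(t - T))  # same signal one loop earlier
mix = 0.5*x_now + 0.5*x_loop            # the 50-50 point of the crossfade
print(abs(mix))              # ~0: total destructive interference
```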

> Works for major as well as for minor chords, as for some 10CC not-in-
> love vocal cluster.

works for dissonance? if you were looping that, i might expect a
constant-power crossfade (that hits both envelopes at 70.7% when
halfway through) would be better than a constant-voltage crossfade.
there are sample editors that had options to do this and this optimal
splicing theory was meant to generalize the idea.
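
a small numeric illustration of why constant-power fits uncorrelated
material (a sketch with synthetic gaussian noise standing in for the
two signals; 0.5 and sqrt(1/2) are the two crossfades' midpoint gains):

```python
import math
import random
import statistics

rng = random.Random(1)
N = 200_000
x = [rng.gauss(0, 1) for _ in range(N)]   # two independent, equal-power
y = [rng.gauss(0, 1) for _ in range(N)]   # noise signals

g_cv = 0.5                 # constant-voltage midpoint gain
g_cp = math.sqrt(0.5)      # constant-power midpoint gain (70.7%)

cv = [g_cv*a + g_cv*b for a, b in zip(x, y)]
cp = [g_cp*a + g_cp*b for a, b in zip(x, y)]

print(statistics.pvariance(cv))   # ~0.5: a -3 dB power dip mid-splice
print(statistics.pvariance(cp))   # ~1.0: power preserved
```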

>> well, for sure you want the splice to be seamless for all
>> harmonics, or better yet "partials", of any appreciable magnitude.
>> being that there are non-harmonic partials in a lot of acoustic
>> instruments, most certainly piano, i know why you would want to
>> adjust them a little so that phases of all partials are aligned and
>> the jump in the loop is seamless.
>
> Yes, very seamless, I think this is what a loop should be. I cannot
> see how any frequency *not*
> being a multiple of the loop frequency could be represented in that
> loop.
>
> [...]
>> i suppose i could illustrate what i mean here with a bogus example,
>> if i haven't made it sufficiently clear. i just think that
>> wavetable synthesis has application that is broader than just
>> playing single-cycle loops.
> To be honest I didn't quite get that. It could help if the unnamed
> manufacturer could be named,
> I cannot yet see why it should remain anonymous.


well, i told you separately, but i'm not saying it out loud. it's
such a litigious society we Americans have (even, =ahem=, the non-
Americans). this company is known to have been involved in litigation
in its history.

but i'll try to explain how you would employ wavetable analysis,
modification, and resynthesis, to create the same loop with some
slightly detuned harmonics.

so let's say it's equivalent to above, you have a vibrato going and,
in very close to one or two vibrato cycles, you get 50 of the tone
cycles. but the 12th harmonic is 2.88 cents sharp (a frequency ratio
of 601/600). that's not so bad with the loop, because the sharpened
12th harmonic will still have precisely 601 cycles in the loop. but
the 11th harmonic will not quite have 551 cycles in the loop, but it
has more than 550 cycles. let's say that the 11th harmonic is 1.88
cents sharp and has 550.6 cycles in the loop and you want to bump it
up to 551 cycles. you want to sharpen that harmonic a further 1.26
cents to bump it to exactly 551 cycles in the loop so the splice is
nice.
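
the cents arithmetic above checks out (a sketch; cents of a frequency
ratio is 1200*log2(ratio)):

```python
import math

def cents(ratio):
    """interval of a frequency ratio, in cents"""
    return 1200.0 * math.log2(ratio)

print(cents(601/600))     # ~2.88: the 12th harmonic's sharpness
print(cents(550.6/550))   # ~1.89: the 11th harmonic's sharpness
print(cents(551/550.6))   # ~1.26: the extra push to reach 551 cycles
```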

so here is the wavetable way to do it: let's say you derive some
number (let's say 16 wavetables, for a nice number) of wavetables,
equally spaced throughout the about-to-be-looped segment of tone
(which has 50 cycles in it). now, without considering the vibrato for
the moment, the number of cycles between neighboring wavetables would
be 50/16 (or 25/8). or 3 1/8 cycles between centers of the frames you
plop down and derive a wavetable from each. this means, if no
wavetable alignment is done, the phase of the fundamental would
advance by 1/8 cycle or 45 degrees between one wavetable and the
next. so, to align the wavetables, you rotate or spin the second
wavetable back by 1/8 cycle (say, by 128 samples if we're allocating
1024 samples per wavetable) to line them up. but we do our
bookkeeping and retain the fact that this wavetable was rotated 1/8
cycle when resynthesizing.
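
the spin itself is just a circular rotation of the wavetable (a sketch
with a pure sine so the alignment is exact; 1024 samples and the 1/8
cycle advance are the numbers from the example above):

```python
import math

M = 1024                              # samples per wavetable
FRAC_ADVANCE = 1/8                    # fractional-cycle advance per frame
shift = int(round(FRAC_ADVANCE*M))    # 128 samples

wt1 = [math.sin(2*math.pi*n/M) for n in range(M)]
# the second frame starts 1/8 cycle later in the tone:
wt2 = [math.sin(2*math.pi*(n/M + FRAC_ADVANCE)) for n in range(M)]
# spin wt2 back by 128 samples to line it up with wt1:
wt2_aligned = wt2[-shift:] + wt2[:-shift]

err = max(abs(p - q) for p, q in zip(wt1, wt2_aligned))
print(err)   # ~0: the rotated frame matches the first
```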

now that will do nicely for the fundamental and lower harmonics that
are very harmonic. after doing this rotating, you can perform nice
DFTs on the wavetables (if there are 1024 samples per wavetable, then
N=1024 in the DFT). X[0] is the DC component and let's set it to zero
just so we don't have to think about it. X[1] and X[N-1] make the
Fourier series coefficient for the 1st harmonic, exactly. X[2] and
X[N-2] make the Fourier coefficient for the 2nd harmonic. now,
because of spinning the second wavetable (lining it up with the
first), the phase of the 1st and 2nd (and other lower harmonics) in
the second wavetable will be nearly identical to the corresponding
phases of first wavetable.

but the 11th harmonic is not exactly the 11th harmonic. if it *were*
exactly harmonic, its phase in the second wavetable would line up with
the phase of the first. it's really the 11.012th harmonic (11.012 =
11*550.6/550). so when the fundamental advanced precisely 25/8 cycle
to go from the first wavetable to the second, the 11th harmonic did not
advance by 11*25/8 cycles but that harmonic advanced 11.012*25/8.
when the wavetable is aligned (to make the lower harmonics line up) by
spinning it 1/8 cycle, the 11th harmonic gets off by 0.012*(25/8)
cycle or 13.5 degrees. now, for each successive wavetable, the 11th
harmonic will advance in phase by 13.5 degrees and in the time of 16
wavetables, the 11th harmonic will advance 16*13.5 = 216 degrees (or
0.6 cycle).

even though the 11th harmonic isn't at exactly 11 times the
fundamental, wavetable synthesis treats it as exactly the 11th
harmonic but with the phase advancing a little with each and every
successive wavetable that is created from the data.

now we *want* the phase of the 11th harmonic to be off a little bit,
because it *is* supposed to be sharp a little. but we want that
harmonic to complete an entire extra cycle in the time of the whole
loop, so we have to help that 11th harmonic on by adding 0.4 cycle (or
144 degrees) in the time of 16 wavetables. this means we have to
advance the phase (artificially, by souping-up the phase for X[11] and
X[N-11] by multiplying X[11] with exp(j*phi) and X[N-11] with exp(-
j*phi)) where phi = 144/16 degrees.
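
the phase souping-up is a complex rotation per frame (a sketch; the
bin value here is a placeholder, and the conjugate rotation would be
applied to X[N-11] to keep the wavetable real):

```python
import cmath
import math

K = 16                               # number of wavetable frames
EXTRA_CYCLES = 0.4                   # the missing fraction of a cycle
phi = 2*math.pi*EXTRA_CYCLES/K       # per-frame nudge: 9 degrees

nudge = cmath.exp(1j*phi)            # multiply X[11] by this each frame
X11 = complex(1.0, 0.0)              # hypothetical bin value, phase 0
for k in range(K):
    X11 *= nudge                     # accumulate the phase advance

print(math.degrees(cmath.phase(X11)))   # 144 degrees = 0.4 cycle total
```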

so here is the procedure:

1. decide on loop endpoints based on getting very nearly an integer
number of vibrato cycles in there and getting exactly an integer
number of tone cycles in the loop length.

2. divide that loop length into a decently large integer number of
equally-spaced frames. call that number of frames, K. (my example
above was K=16 frames.)

3. extract the period (as a possibly non-integer number of samples)
for each frame and derive a representative wavetable for that frame.

4. knowing what the period length is and knowing the time spacing
from one frame to the next, you know exactly how much to spin each
successive wavetable to best align it with the previous. (problem is
that the harmonics that are a little non-harmonic will not align as
well.)

5. FFT or DFT each wavetable. this is now the Fourier series data
for that waveform "snapshot" (using Andrew Horner's language) of each
frame.

6. for each harmonic observe how far out of phase the last wavetable
is from the first. the last wavetable is K-1 frame displacements
away from the first and the phase in the last frame should be off by
M*360*(K-1)/K degrees from the phase of the first where M is some
integer (M=0 for a "well-tuned" harmonic, M would be the number of
complete cycle "slips" for that harmonic in the whole loop length).
if that is the case, that means in K frame displacements, that this
harmonic advances by M cycles or M*360 degrees.

7. if the phase differential (from first to last wavetable) is off
from that M*360*(K-1)/K degrees, then that harmonic does *not* advance
by exactly M cycles; add (with the correct sign) to that
harmonic's phase k/K times that phase differential (where 0 <= k < K
is the sequential index of each of the K equally-spaced frames). what
you did was hurry up the phase (or slow it down) so that this harmonic
completes an entire extra cycle (or two or some bigger integer) in the
time of the loop length.

8. inverse DFT each Fourier series snapshot data back to the time-
domain wavetable.

9. recreate the time-varying tone using wavetable synthesis (and
interpolating between adjacent wavetables). every harmonic will line
up at the loop endpoints.
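
a toy end-to-end sketch of steps 2 through 8 on synthetic data
(hedged: the frame spacing is chosen as an exact integer number of
periods so the step-4 rotation is zero; the "detuned" partial is
planted at 120.5 cycles per loop so its phase slips 1/16 cycle per
frame, like the 11th-harmonic story above; P, K, and the harmonic
number are arbitrary, and step 9's wavetable interpolation is omitted):

```python
import cmath
import math

P = 64                        # period in samples; one wavetable per period
K = 8                         # number of frames (step 2)
SPACING = 3*P                 # frame spacing: exactly 3 periods
L = K*SPACING                 # loop length: 24 fundamental cycles
f_det = 120.5/L               # detuned "5th harmonic": 120.5 cycles/loop

x = [math.sin(2*math.pi*n/P) + 0.5*math.sin(2*math.pi*f_det*n)
     for n in range(L)]

def dft(frame):
    M = len(frame)
    return [sum(frame[n]*cmath.exp(-2j*math.pi*k*n/M) for n in range(M))
            for k in range(M)]

def idft(spec):
    M = len(spec)
    return [sum(spec[k]*cmath.exp(2j*math.pi*k*n/M) for k in range(M)).real/M
            for n in range(M)]

# steps 3 and 5: slice out K wavetables and DFT each one
specs = [dft(x[k*SPACING : k*SPACING + P]) for k in range(K)]

# step 6: measure the per-frame phase slip of the not-quite-5th harmonic
slip = cmath.phase(specs[1][5]/specs[0][5])   # ~22.5 degrees = 1/16 cycle

# step 7: counter-rotate frame k by k*slip so every frame's bin-5 phase
# agrees, forcing that partial to an integer cycle count over the loop
for k in range(K):
    rot = cmath.exp(-1j*slip*k)
    specs[k][5] *= rot
    specs[k][P-5] *= rot.conjugate()          # keep the wavetable real

# step 8: inverse DFT back to time-domain wavetables
tables = [idft(s) for s in specs]

residual = cmath.phase(specs[K-1][5]/specs[0][5])
print(math.degrees(slip), math.degrees(residual))   # ~22.5, then ~0
```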

does this make sense? i know this is long and wordy, but without
drawings i don't know how to better put it. lemme know if there are
questions i might be able to answer or to better explain.

--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Sampo Syreeni
2011-01-21 22:55:09 UTC
Permalink
On 2010-12-06, robert bristow-johnson wrote:

> i can't speak for Greenie or any others, but i myself would be very
> interested in what you might have to say about constructing seamless
> loops.

My best bet? Go into the cepstral domain to find the most likely loop
duration. Then translate back through the spectral down to the temporal
domain.
Pick the right starting point (by hit/amplitude if you can't translate
the cepstral domain outright/well), and apply a short term
psychoacoustical, hill-climbing algorithm to pick the exact, sub-sample
looping point.
--
Sampo Syreeni, aka decoy - ***@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Victor NK-X-TODEL918ani
2011-01-22 12:42:55 UTC
Permalink
OK, so explain a bit more.

On 21 Jan 2011, at 22:55, Sampo Syreeni wrote:

> My best bet? Go into the cepstral domain to find the most likely
> loop duration
robert bristow-johnson
2010-12-06 08:38:28 UTC
Permalink
< a few mistakes are spotted and corrected before i forget >


This is a continuation of the thread started by Element Green titled:
Algorithms for finding seamless loops in audio

As far as I know, this theory is not published anywhere. A few years
ago, I
was thinking of writing this up and publishing it (or submitting it
for publication, probably to JAES), and had let it fall by the
wayside. I'm "publishing" the main ideas here on music-dsp because of
some possible interest here (and the hope it might be helpful to
somebody), and so that "prior art" is established in case of anyone
like IVL is thinking of claiming it as their own. I really do not
know how useful it will be in practice. It might not make any
difference. It's just a theory.

______________________________________________________________________

Section 0:

This is about the generalization of the different ways we can splice
and crossfade audio that has these two extremes:

(1) Splicing perfectly coherent and correlated signals
(2) Splicing completely uncorrelated signals

I sometimes call the first case the "constant-voltage crossfade"
because the crossfade envelopes of the two signals being spliced add
up to one. The two envelopes meet when both have a value of 1/2. In
the second case, we use a "constant-power crossfade", the square of
the two envelopes add to one and they meet when both have a value of
sqrt(1/2)=0.707.

The questions I wanted to answer are: What does one do for cases in
between, and how does one know from the audio, which crossfade
function to use? How does one quantify the answers to these
questions? How much can we generalize the answer?

______________________________________________________________________

Section 1: Set up the problem.

We have two continuous-time audio signals, x(t) and y(t), and we want
to splice from one to the other at time t=0. In pitch-shifting or
time-scaling or any other looping, y(t) can be some delayed or
advanced version of x(t).

e.g. y(t) = x(t-P)

where P is a period length or some other "good" splice
displacement. We get that value from an algorithm we call a "pitch
detector".

Also, it doesn't matter whether x(t) is getting spliced to y(t) or the
other way around, it should work just as well for the audio played in
reverse. And it should be no loss of generality that the splice
happens at t=0, we define our coordinate system any damn way we damn
well please.

The signal resulting from the splice is

v(t) = a(t)*x(t) + a(-t)*y(t)

By restricting our result to be equivalent if run either forward or
backward in time, we can conclude that the "fade-out" function (say that's
a(t)) is the time-reversed copy of the "fade-in" function, a(-t).

For the correlated case (1): a(t) + a(-t) = 1 for all t

For the uncorrelated case (2): (a(t))^2 + (a(-t))^2 = 1 for all t

This crossfade function, a(t), has well-defined even and odd symmetry
components:

a(t) = e(t) + o(t)
where

even part: e(t) = e(-t) = ( a(t) + a(-t) )/2
odd part: o(t) = -o(-t) = ( a(t) - a(-t) )/2

And it's clear that

a(-t) = e(t) - o(t) .


For example, if it's a simple linear crossfade (equivalent to splicing
analog tape with a diagonally-oriented razor blade):

{ 0 for t <= -1
{
a(t) = { 1/2 + t/2 for -1 < t < 1
{
{ 1 for t >= 1

This is represented simply, in the even and odd components, as:

e(t) = 1/2

{ t/2 for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1


where sgn(t) is the "sign function": sgn(t) = t/|t| .

This is a constant-voltage crossfade, appropriate for perfectly
correlated signals x(t) and y(t). There is no loss of generality in
defining the crossfade to take place around t=0 and have two time
units in length. Both are simply a matter of offset and scaling of
time.
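
a sanity check of the linear crossfade and its even/odd split (a
sketch; the test points include the flat regions beyond |t| = 1):

```python
def a_lin(t):
    """linear (constant-voltage) crossfade envelope"""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return 0.5 + t/2

def even(t):
    return (a_lin(t) + a_lin(-t))/2

def odd(t):
    return (a_lin(t) - a_lin(-t))/2

for t in (-1.5, -0.3, 0.0, 0.7, 2.0):
    assert abs(even(t) + odd(t) - a_lin(t)) < 1e-12  # a = e + o
    assert abs(even(t) - 0.5) < 1e-12                # e(t) = 1/2
    assert abs(a_lin(t) + a_lin(-t) - 1.0) < 1e-12   # constant-voltage
print("ok")
```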

Another constant-voltage crossfade would be what I might call a "Hann
crossfade" (after the Hann window):

e(t) = 1/2

{ (1/2)*sin(pi/2 * t) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1


Some might like that better because the derivative is continuous
everywhere. Extending this idea, one more constant-voltage crossfade
is what I might call a "Flattened Hann crossfade":

e(t) = 1/2

{ (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1

This splice is everywhere continuous in the zeroth, first, and second
derivative. A very smooth crossfade.
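
the smoothness claim can be verified at the seam t = 1 (a sketch: the
in-band formula's value, slope, and curvature there should match the
constant region's 1, 0, 0):

```python
import math

pi = math.pi
# a(t) = 1/2 + (9/16)*sin(pi/2*t) + (1/16)*sin(3*pi/2*t), evaluated at t=1
val = 0.5 + (9/16)*math.sin(pi/2) + (1/16)*math.sin(3*pi/2)
# first and second derivatives of the in-band formula at t = 1
d1 = (9/16)*(pi/2)*math.cos(pi/2) + (1/16)*(3*pi/2)*math.cos(3*pi/2)
d2 = -(9/16)*(pi/2)**2*math.sin(pi/2) - (1/16)*(3*pi/2)**2*math.sin(3*pi/2)
print(val, d1, d2)   # 1.0, ~0, ~0: C2-continuous at the seam
```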

As another example, a constant-power crossfade would be the same as
any of the above, but where the above a(t) is square rooted:

{ 0 for t <= -1
{
a(t) = { sqrt(1/2 + t/2) for -1 < t < 1
{
{ 1 for t >= 1

This is what we might use to splice two completely uncorrelated signals
together. We can separate this into even and odd parts as:


{ (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2)) for |t| < 1
e(t) = {
{ 1/2 for |t| >= 1


{ (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2)) for |t| < 1
o(t) = {
{ sgn(t)/2 for |t| >= 1

______________________________________________________________________

Section 2: Which crossfade function to use?

Now we shall make a definition and an assumption. We shall define an
inner product of two general signals as:

+inf
<x,y> = <x(t), y(t)> = integral{ x(t)*y(t) * w(t) dt}
-inf

w(t) is a window function that is symmetrical about t=0 and is
probably wider than the crossfade. Strictly speaking, if you were
coming at this from out of a graduate course in metric spaces or
functional analysis, one of the components (probably y(t)) should be
complex conjugated, but since x(t) and y(t) are always real, in this
whole theory, I will not bother with that notation.

This inner product is a degenerate case of the more general
cross-correlation evaluated with a lag of zero:

+inf
Rxy(tau) = <x(t), y(t+tau)> = integral{ x(t)*y(t+tau) * w(t) dt}
-inf

If y(t) is a time-offset copy of x(t), then Rxy(tau) is the
autocorrelation of x(t), Rxx(tau), but also accounting for the time
offset in the lag, tau.

So <x,y> = Rxy(0)

A measure of signal energy or average power is:

+inf
Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
-inf

Now, the assumption that we are going to toss in here is that the mean
power of the two signals that we are crossfading, x(t) and y(t), are
equal.

<x,x> = <y,y>

We are assuming that we're not crossfading this very quiet tone or
sound to a very loud sound that is 60 dB louder. Similarly, the
resulting spliced sound, v(t), has the same mean power as the two
signals being spliced:

<v,v> = <x,x> = <y,y>

So, assuming we lined up x(t) and y(t) so that we want to splice from
one to the other at t=0, and scaled x(t) and y(t) so that they have
the same mean power in the neighborhood of t=0, then the inner product
is a measure of how well they are correlated. We shall define this
normalized measure of correlation as:

r = <x,y>/<x,x> = <x,y>/<y,y>

If r = 1, they are perfectly correlated and if r = 0, they are
completely uncorrelated.
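
a quick numeric check of the two extremes of r (a sketch with
synthetic noise; the window w(t) is taken rectangular and normalized
by 1/N):

```python
import random

rng = random.Random(7)

def inner(x, y):
    """<x,y> with a rectangular, normalized window"""
    return sum(a*b for a, b in zip(x, y)) / len(x)

N = 50_000
x = [rng.gauss(0, 1) for _ in range(N)]
y = list(x)                              # perfectly correlated copy
z = [rng.gauss(0, 1) for _ in range(N)]  # independent, same power

r_xy = inner(x, y) / inner(x, x)
r_xz = inner(x, z) / inner(x, x)
print(r_xy, r_xz)   # 1.0 and ~0.0
```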

We will make the additional assumption that our pitch detection
algorithm will find *some* lag where the correlation is at least
zero. We should not have to deal with splicing *negatively*
correlated audio (that would be quite a "glitch" or a bad splice). If
the signals have no DC component, then their autocorrelations and
their cross-correlations (to each other) must have no DC component.
That means there will be values of tau such that Rxy(tau) are either
negative or positive. If it was theoretical white noise, Rxx(tau)
would be zero for |tau| > 0 and Rxx(0) would be the noise variance or
power. But Rxx(tau) cannot be negative for *all* values of tau, even
excluding tau=0.

We can find a value of tau so that Rxx(tau) is non-negative and we
want to choose the tau that has the highest value of Rxx(tau). Then
define

y(t) = x(t + tau)

and then

<x,y> = Rxy(0) = Rxx(tau)

Now we shall also assume that the crossfade function, a(t), is
completely uncorrelated and even statistically independent from the
two signals being spliced. a(t) is a volume control that varies in
time, but is unaffected by anything in x(t) or y(t).

We shall also assume something called "ergodicity". This means that
*time* averages of x(t) and y(t) (or combinations of x(t) and y(t))
are equal to *statistical* averages. If this window, w(t) is scaled
(or normalized) so that its integral is 1,

+inf
integral{ w(t) dt} = 1
-inf

then all these inner products can be related to "expectation values":

<x,y> = E{ x(t) * y(t) }

If x(t) and y(t) are thought of as sorta "random" processes (rather
than well defined deterministic functions), the expectation value is
unmoved no matter what t is. But if the envelope a(t) is considered
deterministic, then it simply scales x(t) or y(t) and is treated as a
constant in the expectation. So at some particular time t0,

<a(t0)*x,y> = E{ (a(t0)*x(t)) * y(t) }

= a(t0) * E{ x(t) * y(t) }

= a(t0) * <x,y>

This is a little sloppy, mathematically, because I am "fixing" t for
a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time
averages" for x(t) and y(t) can be meaningful and equated to
statistical averages).

Recall that

v(t) = a(t)*x(t) + a(-t)*y(t)

Then:

<v,v> = <(a(t)*x(t) + a(-t)*y(t)), (a(t)*x(t) + a(-t)*y(t))>

Using identities that we can apply to expectation values

<v,v> = (a(t))^2*<x,x> + 2*a(t)*a(-t)*<x,y> + (a(-t))^2*<y,y>

Since <v,v> = <x,x = <y,y>, we can divide by <v,v> and get to the key
equation of this whole theory:

1 = (a(t))^2 + 2*r*a(t)*a(-t) + (a(-t))^2

Given the normalized correlation measure, we want the above equation
to be true all of the time. If r=0 (completely uncorrelated), one can
see we get a constant-power crossfade:

(a(t))^2 + (a(-t))^2 = 1

If r=1 (completely correlated), one can see that we get a constant-
voltage crossfade:

(a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t) = ( a(t) + a(-t) )^2 = 1

or, assuming a(t) is non-negative,

a(t) + a(-t) = 1 .

______________________________________________________________________

Section 3: Generalizing the crossfade function

Recall that

a(t) = e(t) + o(t)

a(-t) = e(t) - o(t)

and substituting into

(a(t))^2 + (a(-t))^2 + 2*r*a(t)*a(-t) = 1

results in

(e(t) + o(t))^2 + (e(t) - o(t))^2
+ 2*r*(e(t) + o(t))*(e(t) - o(t)) = 1

Blasting through that gets:

(1+r)*(e(t))^2 + (1-r)*(o(t))^2 = 1/2


This means that, if r is measured and known (from the correlation
function) we have the freedom to define either one of e(t) or o(t)
arbitrarily (as long as the even or odd symmetry is kept) and solve
for the other. We can see that square rooting is involved in solving
for either e(t) or o(t) and there is an ambiguity for which sign to
pick. We shall resolve that ambiguity by adding the additional
assumption that the even-symmetry component, e(t), is non-negative.

e(t) = e(-t) >= 0

Given a general and bipolar odd-symmetry component function,

o(t) = -o(-t)

then we solve for the even component (picking the non-negative square
root):

e(t) = sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

The overall crossfade envelope would be

a(t) = e(t) + o(t)

= sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 ) + o(t)
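
plugging a concrete o(t) into that formula verifies the key equation
from Section 2 for any r (a sketch with the linear o(t) = t/2; the
r = 0.37 value is an arbitrary in-between example):

```python
import math

def make_a(r):
    """build the crossfade a(t) for correlation r, with linear o(t)"""
    def a(t):
        t = max(-1.0, min(1.0, t))   # o(t) saturates outside |t| < 1
        o = t/2                      # linear odd part
        e = math.sqrt(0.5/(1 + r) - (1 - r)/(1 + r)*o*o)
        return e + o
    return a

r = 0.37
a = make_a(r)
for i in range(-15, 16):
    t = i/10
    lhs = a(t)**2 + 2*r*a(t)*a(-t) + a(-t)**2
    assert abs(lhs - 1.0) < 1e-12    # the key equation holds for all t
print("ok")
```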

______________________________________________________________________

Section 4: Implementation

Given a particular form for the odd part, o(t) (linear or Hann or
Flattened Hann or whatever is your heart's desire), and for a variety
of values of r, ranging from r=0 to r=1, a collection of envelope
functions, a(t), are pre-calculated and stored in memory. Then, when
pitch detection or loop matching is done, a splice displacement that
is optimal is determined, and if autocorrelation of some form is used
in determining a measure of goodness (or seamlessness, using Element's
language) of that loop splice, that autocorrelation is normalized (by
dividing by Rxx(0)) to get r and that value of r is used to choose
which pre-calculated a(t) from the above collection is used for the
crossfade in the splice.
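
a sketch of that table-driven scheme (hypothetical sizes; the linear
o(t) = t/2 is assumed, and the nearest-r table is picked by rounding):

```python
import math

R_STEPS = 11          # precompute for r = 0.0, 0.1, ..., 1.0
N = 257               # envelope samples across -1 <= t <= 1

def envelope(r, n_points=N):
    """pre-calculate a(t) for one value of r"""
    env = []
    for i in range(n_points):
        t = -1.0 + 2.0*i/(n_points - 1)
        o = t/2
        e = math.sqrt(0.5/(1 + r) - (1 - r)/(1 + r)*o*o)
        env.append(e + o)
    return env

tables = [envelope(i/(R_STEPS - 1)) for i in range(R_STEPS)]

def pick(r):
    """choose the precomputed envelope nearest the measured r"""
    return tables[min(R_STEPS - 1, max(0, round(r*(R_STEPS - 1))))]

print(pick(0.0)[N//2])   # uncorrelated: midpoint = sqrt(1/2) ~ 0.7071
print(pick(1.0)[N//2])   # fully correlated: midpoint = 0.5
```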

______________________________________________________________________


--

r b-j ***@audioimagination.com

"Imagination is more important than knowledge."
Andy Farnell
2010-12-06 16:22:31 UTC
Permalink
Thanks for sharing these thoughts Robert.

On Mon, 6 Dec 2010 03:38:28 -0500
robert bristow-johnson <***@audioimagination.com> wrote:

> This is a continuation of the thread started by Element Green titled:
> Algorithms for finding seamless loops in audio
>


--
Andy Farnell <***@obiwannabe.co.uk>
Theo Verelst
2010-12-07 14:55:02 UTC
Permalink
"...I'm "publishing" the main ideas here on music-dsp..."

Wish more people would do such things.

I couldn't resist thinking out loud about some of the main issues with
this subject, summed up as:

1. The looping idea will involve approximations, at least of the
kind which gives a short loop (as I understand the subject, for
instrument samples) an issue with pitch. Meaning: a loop is only as
pitch accurate as 1/#samples, so at 44.1 kHz a loop at the final
point in the signal path will, at e.g. 1 kHz, be for sure no more
accurate than about 2 percent divided by the number of waves in the
loop (more fundamental waves per loop in an original sample will
probably make for a "noisy" loop).

2. Even making use of FFTs for detection of loop points will, I'd
think, have the disadvantage of that transform's quite big transient
and edge-discontinuity errors, unless measures like detuning to
fundamental-equals-fft-length are taken.

3. transforming the signal to FFT form to interpolate loop ends or to
capture frequencies for repetition will probably cause quite some signal
degradation, so people expecting only sample detuning in a software
package might not be too happy with the idea. A very short
transform-based edit at the beginning or end of a loop might be
interesting.

I recall getting wavelet transform code snippets with my Analog Devices
DSP board years ago, THAT would probably be interesting to play with in
the context, but hey, what should a top EE earn to do that ?! :)

Theo.
http://www.theover.org/Linuxconf