Discussion:
[music-dsp] FFT for realtime synthesis?
gm
2018-10-23 21:51:28 UTC
Permalink
Does anybody know a real world product that uses FFT for sound synthesis?
Do you think it's feasible and makes sense?

Totally unrelated to the recent discussion here, I am considering replacing
(WS)OLA granular "clouds" with spectral synthesis, and was wondering if I
should use an FFT for that.

I want to keep all the musical artefacts of the granular approach when
desired,
and I was thinking that you can get the "grain cloud" sound by adding
noise to the phases/frequencies,
for instance, and doing similar things.
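As a minimal sketch of that phase-noise idea (the names and parameters are made up for illustration, not from any existing product): take one analysis frame, jitter the bin phases, and resynthesize.

import numpy as np

def resynth_with_phase_noise(frame, noise_amount=0.5, rng=None):
    # Resynthesize one (already windowed) frame with random phase jitter per bin.
    # noise_amount = 0 gives exact reconstruction, 1 allows up to +/-pi of jitter,
    # which smears transients into the kind of diffuse sound a grain cloud has.
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(frame)
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    jitter = noise_amount * np.pi * rng.uniform(-1.0, 1.0, size=phase.shape)
    return np.fft.irfft(mag * np.exp(1j * (phase + jitter)), n=len(frame))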


An advantage of using an FFT instead of sinusoids would be that you don't
have to worry
about partial trajectories, residual noise components and that sort of thing.

Whether or not it would use much less CPU I am not sure; it depends on how
much overlap of frames you have.

Disadvantages I see are latency, even more so if you want an even workload,
and that the implementation is somewhat fuzzy/messy when you do a
time stretch followed by resampling.

Another disadvantage would be that you can't have immediate parameter changes,
since everything is frame based, and even though some granularity is fine
for me,
the granularity of the FFT would be fixed to the overlap/frame size, which
is another disadvantage.

Another disadvantage I see is the temporal blur you get when you modify
the sound.

Any thoughts on this? Experiences?
gm
2018-10-23 22:14:34 UTC
Permalink
Post by gm
An advantage of using an FFT instead of sinusoids would be that you don't
have to worry
about partial trajectories, residual noise components and that sort of thing.
I think I should add that I want to use it on polyphonic material or any
source material,
so sine oscillators are probably not the way to go because you would need
too many of them.
David Olofson
2018-10-23 22:43:06 UTC
Permalink
[...]
Post by gm
I think I should add that I want to use it on polyphonic material or any
source material,
so sine oscillators are probably not the way to go because you would need
too many of them.
Sine oscillators can be implemented in various very efficient ways,
and, more importantly here, you only need to generate one "sample" per
FFT window, so you can actually have insane numbers of them. And, if
you implement more complex waveforms, and maybe various "transforms"
in the render-to-FFT stage, there are probably many more cycles to
save.
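A minimal numpy sketch of that kind of render-to-FFT stage (nearest-bin rounding, a single frame, no overlap-add; the names and scaling are assumptions for illustration, not from any particular synth):

import numpy as np

def render_partials_to_frame(freqs, amps, phases, n=4096, sr=44100.0):
    # Render sine partials by writing one complex value per partial into an
    # FFT half-spectrum, then doing a single inverse FFT for the whole frame.
    # Each partial is rounded to its nearest bin, so frequencies that fall
    # between bins are only approximated; a real implementation would spread
    # energy over neighbouring bins instead.
    spectrum = np.zeros(n // 2 + 1, dtype=complex)
    for f, a, p in zip(freqs, amps, phases):
        k = int(round(f * n / sr))
        if 0 < k < len(spectrum) - 1:
            # scale so that irfft yields amplitude a for a bin-centred sine
            spectrum[k] += 0.5 * n * a * np.exp(1j * p)
    return np.fft.irfft(spectrum, n=n)

# e.g. three partials of a rough tone
frame = render_partials_to_frame([220, 440, 660], [1.0, 0.5, 0.33], [0, 0, 0])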
--
//David Olofson - Consultant, Developer, Artist, Open Source Advocate

.--- Games, examples, libraries, scripting, sound, music, graphics ---.
| http://consulting.olofson.net http://olofsonarcade.com |
'---------------------------------------------------------------------'
David Olofson
2018-10-23 22:38:52 UTC
Permalink
Post by gm
Does anybody know a real world product that uses FFT for sound synthesis?
I believe AIR Loom does, and I suspect some wavetable synths use FFT
as well; Waves Codex and DS Audio Thorn come to mind. (Mentions of
"spectral synthesis," and there's also something about the sound
character of these, especially above 10 kHz...)
Post by gm
Do you think it's feasible and makes sense?
Yes, and... I'm not sure. The method is essentially a highly optimized
implementation of additive synthesis - but with that comes the same
problem that always plagued additive synthesis: It's not very user
friendly in its basic form. You need to construct more effective
and/or comprehensible synthesis algorithms on top of it. Wavetable
synthesis + various filters would seem like one sensible option.
Either way, you can do a whole lot of interesting stuff with relative
ease, without having to worry about aliasing distortion, undesired
transients and whatnot, but there are some drawbacks as well, some of
which you've already mentioned.

One issue that I'm hearing in most implementations of FFT based
synthesis, effects and the like, is the nature of the FFT; the
frequency bins. Unless you're doing very trivial processing, avoid
steep filters and the like, or keep your content nicely aligned to the
FFT bins, the bins will become clearly audible in the form of metallic
resonances, "wobbly" filter sweeps and whatnot. Many things are easy
and intuitive to do in the frequency domain, but unfortunately, the
FFT can't be viewed as a proper additive synthesis oscillator bank,
except in trivial cases.


[...]
Post by gm
Whether or not it would use much less CPU I am not sure, depends on how
much overlap of frames you have.
It can be extremely efficient while still producing very high quality
sound, especially if you don't need very fast transient response. I'd
say performance is THE reason to use this approach, rather than just
doing brute force additive synthesis.


[...]
Post by gm
Another disadvantage would be that you cant have immediate parameter changes
since everything is frame based, and even though some granularity is
fine for me
the granularity of FFT would be fixed to the overlap/frame size, which
is another disadvantage.
Actually, you can, and you can even render gradients through the FFT
windows, but of course, that complicates "rendering" of the FFT bins a
whole lot, compared to implementing straight sine oscillators. (Which
is not entirely straightforward either, as soon as your frequencies
land in between FFT bins.)


[...]
Post by gm
Any thoughs on this? Experiences?
I played around a bit with it a few years ago, and hacked a quick
prototype in my own odd scripting language, EEL. (Old rubbish that
I've been meaning to update or port to something any century now...)
https://github.com/olofson/eelsynth

Simple demo song + some comments here:
https://soundcloud.com/david-olofson/eelsynth-ifft-flutesong
--
//David Olofson - Consultant, Developer, Artist, Open Source Advocate

.--- Games, examples, libraries, scripting, sound, music, graphics ---.
| http://consulting.olofson.net http://olofsonarcade.com |
'---------------------------------------------------------------------'
gm
2018-10-24 00:48:15 UTC
Permalink
Post by David Olofson
https://soundcloud.com/david-olofson/eelsynth-ifft-flutesong
Sounds quite nice, actually.
robert bristow-johnson
2018-10-23 22:46:40 UTC
Permalink
---------------------------- Original Message ----------------------------
Subject: [music-dsp] FFT for realtime synthesis?
From: "gm" <***@voxangelica.net>
Date: Tue, October 23, 2018 5:51 pm
To: music-***@music.columbia.edu
--------------------------------------------------------------------------
Post by gm
Does anybody know a real world product that uses FFT for sound synthesis?
Do you think it's feasible and makes sense?
so this first question is about synthesis, not modification for effects, right?  and real-time, correct?  so a MIDI Note-On is received and you want a note coming out as quickly as possible?
i don't know of a hardware product that does inverse FFT for synthesis.  i do know of a couple of effects algorithms that go into products that use FFT.  i think it's mostly about doing "fast convolution" for purposes of reverb.
what are you intending to synthesize?  notes?  or something more wild than that?  just curious.

robert

--
r b-j                         ***@audioimagination.com

"Imagination is more important than knowledge."
gm
2018-10-24 00:48:29 UTC
Permalink
Post by robert bristow-johnson
Post by gm
Does anybody know a real world product that uses FFT for sound synthesis?
Do you think it's feasible and makes sense?
so this first question is about synthesis, not modification for
effects, right?  and real-time, correct?  so a MIDI Note-On is
received and you want a note coming out as quickly as possible?
yes exactly
Post by robert bristow-johnson
i don't know of a hardware product that does inverse FFT for
synthesis.  i do know of a couple of effects algorithms that go into
products that use FFT.  i think it's mostly about doing "fast
convolution" for purposes of reverb.
what are you intending to synthesize?  notes?  or something more wild
than that?  just curious.
basically a sample "mangler": you load an arbitrary sample, a loop of
music for instance, and play back parts of it in real time,
time stretched, pitch shifted, with formants corrected or altered,
backwards, forwards.
I don't need polyphonic playback, though that would be nice for some
things.
Right now I do this with a granular "cloud", that is, many overlapping
grains, which can play polyphonically,
or rather paraphonically, which means that the grains play back at
different pitches simultaneously,
depending on the chords you play,
but they all play back from the same sample position and have the same
kind of "treatment", like envelope or filtering.
I thought you could maybe do this and some other stuff in the spectral
domain.
The idea is to change snippets / loops of existing music into new
music; this idea is not new.
Two demo tracks:
https://soundcloud.com/transmortal/the-way-you-were-fake
https://soundcloud.com/traumlos-kalt/the-way-we-were-iii
they are mostly made from a snippet of Nancy Sinatra's Friday's Child
gm
2018-10-24 00:50:36 UTC
Permalink
It's quite a nuisance that the list's reply-to is set to the person who
wrote the mail
and not to the list address.
Vladimir Pantelic
2018-10-24 11:53:58 UTC
Permalink
1) http://www.unicom.com/pw/reply-to-harmful.html

2) http://marc.merlins.org/netrants/reply-to-useful.html

3) http://marc.merlins.org/netrants/reply-to-still-harmful.html

4) tbd :)

personally I'm in the 2) camp :)
Post by gm
It's quite a nuisance that the lists reply to is set to the person who
wrote the mail
and not to the list adress
Phil Burk
2018-10-25 14:22:48 UTC
Permalink
Hmm. For me the "reply-to" in the original email header is set to
"music-dsp@music.columbia.edu".
If I hit Reply it goes to the list.

I recall being annoyed by Reply going to the person and not the list. Why
am I seeing something different? I am using GMail web client.

And BTW, I set my default response in GMail to "Reply-All" instead of
"Reply" in preferences. That is what I usually want to do.

Phil Burk
Post by Vladimir Pantelic
1) http://www.unicom.com/pw/reply-to-harmful.html
2) http://marc.merlins.org/netrants/reply-to-useful.html
3) http://marc.merlins.org/netrants/reply-to-still-harmful.html
4) tbd :)
personally I'm in the 2) camp :)
Post by gm
It's quite a nuisance that the lists reply to is set to the person who
wrote the mail
and not to the list adress
gm
2018-10-24 01:37:41 UTC
Permalink
two demo tracks
https://soundcloud.com/transmortal/the-way-you-were-fake
https://soundcloud.com/traumlos-kalt/the-way-we-were-iii
they are mostly made from a snippet of Nancy Sinatra's Friday's Child
I just realized, in case someone is really interested, I have to be more precise:

bass and synth strings that come in later on the second track are
ordinary synths

the rest is granular; the sample snippets are from Friday's Child, Some
Velvet Morning
and Summer Wine by Nancy Sinatra, Robots by the Balanescu Quartet, and a
synth sample

I made so many demo tracks the past few days; most of them were made with
the Friday's Child sample,
which has the advantage of being old-school hardcore stereo, so you get
three
different sources from the same time ...
Ethan Fenn
2018-10-24 15:21:53 UTC
Permalink
I haven't thought through the details for any particular application, but
the chirp z-transform might be a useful trick to keep in mind for these
sorts of things. It lets you calculate an IFFT with an arbitrary spacing
between bins, or even an arbitrary fundamental in case you want to detune
the partials a bit.

-Ethan
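For reference, here is a deliberately slow direct evaluation of what such an inverse transform with arbitrary bin spacing computes; the chirp z-transform (for example scipy.signal.czt in recent SciPy) is the fast way to get the same result. Names and parameters are only illustrative:

import numpy as np

def inverse_with_arbitrary_spacing(bins, bin_spacing_hz, n, sr):
    # Synthesize n samples from complex 'bins' whose centre frequencies are
    # k * bin_spacing_hz (k = 0..len(bins)-1), i.e. a detunable "fundamental"
    # instead of the usual sr/N grid.  O(N*K) direct sum; a CZT does this fast.
    bins = np.asarray(bins, dtype=complex)
    t = np.arange(n) / sr
    k = np.arange(len(bins))
    # each row: exp(j*2*pi*f_k*t); sum the rows weighted by the bin values
    phases = np.exp(2j * np.pi * np.outer(k * bin_spacing_hz, t))
    return np.real(bins @ phases)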
two demo tracks
https://soundcloud.com/transmortal/the-way-you-were-fake
https://soundcloud.com/traumlos-kalt/the-way-we-were-iii
they are mostly made from a snippet of Nancy Sinatras Fridays Child
bass and synthstrings that come in later on the second track are ordinary
synths
the rest ist granular, samples snippets are from Fridays Child, Some
Velvet Morning
and Summer Vine by Nancy Sinatra, Robots by Balanscu Quartett and a synth
sample
I made so many demo tracks the past days, most of them were made with the
Fridays Child sample
which has the advantage of being old school hardcore stereo, so you get
three
different sources from the same time ...
Ariane stolfi
2018-10-25 02:39:52 UTC
Permalink
We did FFT noise synthesis for a participatory performance project, "Open
Band";
you can read about it in this paper:

https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/26169/13.pdf?sequence=1

The idea was to create blocks of noise based on pre-determined frequencies
to draw letters on the spectrum. I would also be really interested in
developing some tool for FFT synthesis for Web Audio based on drawings.

best,
Ariane
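A minimal sketch of that kind of spectral drawing (in Python/numpy rather than Web Audio, and purely illustrative): put energy only into chosen bins, give them random phases, and inverse-FFT a block of noise.

import numpy as np

def noise_block_from_bins(active_bins, n=2048, rng=None):
    # Make one block of band-limited noise whose energy sits only in the
    # given FFT bins (e.g. bins chosen so the spectrogram draws a shape).
    rng = np.random.default_rng() if rng is None else rng
    active = np.asarray(list(active_bins))
    spectrum = np.zeros(n // 2 + 1, dtype=complex)
    spectrum[active] = np.exp(1j * rng.uniform(-np.pi, np.pi, size=len(active)))
    block = np.fft.irfft(spectrum, n=n)
    return block / (np.max(np.abs(block)) + 1e-12)  # rough normalisation

# e.g. a vertical "bar" between bins 40 and 80
bar = noise_block_from_bins(range(40, 80))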
Post by Ethan Fenn
I haven't thought through the details for any particular application, but
the chirp z-transform might be a useful trick to keep in mind for these
sorts of things. It lets you calculate an IFFT with an arbitrary spacing
between bins, or even an arbitrary fundamental in case you want to detune
the partials a bit.
-Ethan
two demo tracks
https://soundcloud.com/transmortal/the-way-you-were-fake
https://soundcloud.com/traumlos-kalt/the-way-we-were-iii
they are mostly made from a snippet of Nancy Sinatras Fridays Child
bass and synthstrings that come in later on the second track are ordinary
synths
the rest ist granular, samples snippets are from Fridays Child, Some
Velvet Morning
and Summer Vine by Nancy Sinatra, Robots by Balanscu Quartett and a synth
sample
I made so many demo tracks the past days, most of them were made with the
Fridays Child sample
which has the advantage of being old school hardcore stereo, so you get
three
different sources from the same time ...
Ariane stolfi
2018-10-25 04:40:46 UTC
Permalink
Sorry,

the correct paper is this one

https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/26088/11.pdf?sequence=1
Post by Ariane stolfi
We did FFT noise synthesis for a participatory performance project "Open
Band",
https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/26169/13.pdf?sequence=1
the idea was to crate blocks of noise based on pre-determined frequencies
to draw letters on the spectrum. I would be really interested also in
develop some tool for FFT synthesis for web audio based on drawings also
best,
Ariane
Post by Ethan Fenn
I haven't thought through the details for any particular application, but
the chirp z-transform might be a useful trick to keep in mind for these
sorts of things. It lets you calculate an IFFT with an arbitrary spacing
between bins, or even an arbitrary fundamental in case you want to detune
the partials a bit.
-Ethan
Post by gm
two demo tracks
https://soundcloud.com/transmortal/the-way-you-were-fake
https://soundcloud.com/traumlos-kalt/the-way-we-were-iii
they are mostly made from a snippet of Nancy Sinatras Fridays Child
bass and synthstrings that come in later on the second track are
ordinary synths
the rest ist granular, samples snippets are from Fridays Child, Some
Velvet Morning
and Summer Vine by Nancy Sinatra, Robots by Balanscu Quartett and a
synth sample
I made so many demo tracks the past days, most of them were made with
the Fridays Child sample
which has the advantage of being old school hardcore stereo, so you get
three
different sources from the same time ...
gm
2018-10-25 10:17:16 UTC
Permalink
I made a quick test,
original first, then resynthesized with time stretch and pitch shift and
corrected formants:

https://soundcloud.com/traumlos_kalt/ft-resynth-test-1-01/s-7GCLk
https://soundcloud.com/traumlos_kalt/ft-resynth-test-2-01/s-2OJ2H

sounds quite phasey and gurgly
I am using a 1024 FFT size and a 5-bin moving average to extract a
spectral envelope for formant preservation, which is probably not the
best way to do this

I assume you would need to realign phases at transients;
the sound quality isn't what you would expect in 2018...
(also I am doing the pitch shift the wrong way at the moment:
first transpose in the time domain, then FFT time stretch, because that was
easier to do for now,
but this shouldn't cause an audible problem here)

About latency I don't know yet; I am using my FFT for NI Reaktor, which has
a latency of several times the FFT size
and is only good for proof-of-concept stuff.
gm
2018-10-25 15:58:00 UTC
Permalink
One thing I noticed is that it seems to sound better at a 22050 Hz sample rate,
so I assume a 1024 FFT size is too small and you should use 2048.

I don't know if that is because the DC band is too high or if the bins
are too broadband with 1024, or both?

I assume that with this, some phase realignment, and a better spectral envelope,
the quality would be somewhat improved.

Unfortunately Reaktor isn't the right tool at all to test these things,
you have to hack around everything;
just changing the FFT size will probably waste a whole day, so I
probably won't investigate this further.
Post by gm
I made a quick test,
original first, then resynthesized with time stretch and pitch shift
https://soundcloud.com/traumlos_kalt/ft-resynth-test-1-01/s-7GCLk
https://soundcloud.com/traumlos_kalt/ft-resynth-test-2-01/s-2OJ2H
sounds quite phasey and gurgely
I am using 1024 FFT size and 5 bins moving average to extract a
spectral envelope for formant preservation, which is probably not the
best way to do this
I assume you would need to realign phases at transients
sound quality isn't what you would expect in 2018...
(also I am doing the pitch shift the wrong way at the moment,
first transpose in time domain, then FFT time stretch, cause that was
easier to do for now
but this shouldn't cause an audible problem here)
about latency I dont now yet, I am using my FFT for NI Reaktor which
has a latency of several times FFT size
and is only good for proof of concept stuff
gm
2018-10-25 17:13:52 UTC
Permalink
Here is an example at 22050 Hz sample rate, FFT size 1024, smoothing for
the spectral envelope 10 bins,

and simple phase realignment: when a bin's amplitude is greater than its
amplitude in the last frame,

the phase is set to the original phase, otherwise to the accumulated phase
of the time stretch.

I didn't expect this to work, but it seems to work.

It seems to sound better to me, but still not as good as required:

https://soundcloud.com/traumlos_kalt/ft-resynth-test-3-phasealign-1-22k-01/s-KCHeV
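In code, that realignment rule could be sketched per frame roughly as follows (my own naming, not the actual Reaktor structure; restarting the accumulator from the original phase on "rising" bins is one way to read it):

import numpy as np

def realign_phases(orig_phase, orig_mag, prev_mag, acc_phase, phase_advance):
    # One frame of the simple phase-realignment rule:
    # where a bin's magnitude grows compared to the previous frame, take the
    # original analysis phase; otherwise keep accumulating the time-stretch phase.
    # All arguments are per-bin numpy arrays; phase_advance is the synthesis-hop
    # phase increment for each bin.
    acc_phase = acc_phase + phase_advance
    rising = orig_mag > prev_mag
    out_phase = np.where(rising, orig_phase, acc_phase)
    # restart the accumulator from the original phase on those bins
    acc_phase = np.where(rising, orig_phase, acc_phase)
    return out_phase, acc_phase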
Post by gm
One thing I noticed is that it seems to sound better at 22050 Hz sample rate
so I assume 1024 FFT size is too small and you should use 2048.
I dont know if that is because the DC band is too high or if the bins
are too broadband with 1024, or both?
I assume with this and some phase realingment and a better spectral envelope
quality would be somewhat improved.
Unfortunately Reaktor isn't the right tool at all to test these
things, you have to hack around everything,
just changing the FFT size will probably waste a whole day, so I
probably won't investigate this further.
gm
2018-10-25 19:18:50 UTC
Permalink
the same sample as before, rearranged and sequenced, transposed

sound quality and latency aside, I think the idea has some potential
https://soundcloud.com/traumlos_kalt/spectromat-test-4-01/s-7W2tR

the second part is from Nancy Sinatra's Summer Wine

I am sorry it's all drenched in a resonant modulated delay effect,
but I think you get the idea
Post by gm
here an example at 22050 hz sample rate, FFT size 1024, smoothing for
the spectral envelope 10 bins,
and simple phase realignment: when amplitude is greater than last
frames amplitude
phase is set to original phase, otherwise to the accumulated phase of
the time stretch
didn't expect this wo work but it seems to work
https://soundcloud.com/traumlos_kalt/ft-resynth-test-3-phasealign-1-22k-01/s-KCHeV
Post by gm
One thing I noticed is that it seems to sound better at 22050 Hz sample rate
so I assume 1024 FFT size is too small and you should use 2048.
I dont know if that is because the DC band is too high or if the bins
are too broadband with 1024, or both?
I assume with this and some phase realingment and a better spectral envelope
quality would be somewhat improved.
Unfortunately Reaktor isn't the right tool at all to test these
things, you have to hack around everything,
just changing the FFT size will probably waste a whole day, so I
probably won't investigate this further.
gm
2018-10-25 22:29:59 UTC
Permalink
Post by gm
(also I am doing the pitch shift the wrong way at the moment,
first transpose in time domain, then FFT time stretch, cause that was
easier to do for now
but this shouldn't cause an audible problem here)
Now I think that flaw is actually the way to go

Instead of doing it the standard way,

FFT time stretch & filtering -> time-domain pitch shift,

where you need an uneven workload (not a fixed number of FFTs/second)
and additional latency
to write the waveform before you can read and transpose it,

my proposal is:

Offline process:
FFT convert to a spectrum with amplitude, phase and phase derivative
-> create a multisample (multispectra), one spectrogram per half octave

Realtime process:
select multispectrum -> iFFT time stretch and pitch shift in the frequency domain
(without moving content from bin to bin, hence the multispectrum for
each 1/2 octave)

This way you have an even workload (a fixed number of FFTs/second), and
the latency
is just the time you allow for the iFFT, which can be as short as 1 sample.

8-)
Posting this here to prevent patents ;-) , but what do you think, do I
make sense?
gm
2018-10-26 17:50:23 UTC
Permalink
It seems that my artefacts have mostly to do with the spectral envelope.

What would be an efficient way to extract a spectral envelope when
you have a stream of bins, that is, one bin per sample, repeating

0,1,2,... 1023,0,1,2...

and the same stream backwards

1023,1022,...0,1023,1022...

?

I was using a recursive moving average on the stream of amplitudes,
forwards and backwards, but that doesn't work so well.
It turns out that the recursive filter can assume negative values even
though the input is all positive.
Replacing it with a FIR averaging filter worked, but there's still room for
improvement.

I don't want to use cepstral filtering for several reasons
(complexity, latency, CPU); it should be simple yet efficient.

any ideas?
gm
2018-10-26 20:15:42 UTC
Permalink
Here I am using a 5-point average on the lower bands and a 20-point average
on the higher bands.

It doesn't sound too bad now, but I am still looking for a better solution

https://soundcloud.com/traumlos_kalt/spectromat-4-test/s-3WxpJ
Post by gm
it seems that my artefacts have mostly to do with the spectral envelope.
What would be an efficient way to extract a spectral envelope when
you ha e stream of bins, that is one bin per sample, repeating
0,1,2,... 1023,0,1,2...
and the same stream backwards
1023,1022,...0,1023,1022...
?
I was using recursive moving average on the stream of amplitudes,
forwards and backwards, but that doesn't work so well
It turns out that the recursive filter can assume negative values even
though the input is all positive.
Replacing it with a FIR average fiter worked but theres still room for
improvement.
I dont want to use cepstral filtering for several reasons, it should
be simple yet efficient..
(complexitiy, latency, cpu)
any ideas?
gm
2018-10-27 14:47:18 UTC
Permalink
Now I do it like this: four moving-average FIRs of
5, 10, 20 and 40 taps,
and a linear blend between them based on log2 of the bin number.

I filter forwards and backwards, backwards after the shift of the bins
for formant shifting;
the shift is done by reading with linear interpolation from the
forward-prefiltered bins.

Not very scientific, but it works, though there is quite some room for
improvement,
sonically...
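A rough numpy sketch of that kind of octave-dependent smoothing (names and the exact blending are my own guesses at the idea, not the actual Reaktor structure):

import numpy as np

def smooth_spectral_envelope(mag, lengths=(5, 10, 20, 40)):
    # Smooth an amplitude spectrum with several moving averages and blend
    # between them so that higher bins get longer (roughly octave-scaled)
    # smoothing.  'mag' is the magnitude spectrum of one frame.
    smoothed = [np.convolve(mag, np.ones(L) / L, mode="same") for L in lengths]
    k = np.arange(len(mag))
    octave = np.log2(np.maximum(k, 1))
    # map octave position onto the list of filters and blend linearly
    pos = np.clip(octave / octave.max() * (len(lengths) - 1), 0, len(lengths) - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(lengths) - 1)
    frac = pos - lo
    stack = np.stack(smoothed)            # shape (num_filters, num_bins)
    return (1 - frac) * stack[lo, k] + frac * stack[hi, k]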

here is how it sounds with a 1024 FFT at 22050 Hz SR with four overlaps:

https://soundcloud.com/traumlos_kalt/spectronmat-4e-2b-01/s-DM4kQ

first transpositions with corrected formants, then extreme formant shifting

I am not sure about the sound quality, it's still not good enough for a
product;
I think you need 8 overlaps to reduce granularity, a better spectral
envelope,
and better transient detection
(I can't do this in Reaktor though, the structure will get too messy
and the latency way too much)

any comments or ideas for improvements are appreciated
gm
2018-10-28 01:34:05 UTC
Permalink
Now I tried pitch shifting in the frequency domain instead of time
domain to get rid of one transform step, but it sounds bad and phasey etc.

I do it like this:

multiply the phase difference by the frequency factor and add it to the
accumulated phase,
and shift the bins according to the frequency factor

again there is a formant correction, and the phase is reset to the
original phase
if the amplitude is larger than it was in the previous frame
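A sketch of that per-frame update as I read it (names and array layout are assumptions, not the actual structure; formant correction and transient handling are left out):

import numpy as np

def princarg(x):
    # wrap a phase (difference) into the principal range [-pi, pi)
    return (x + np.pi) % (2 * np.pi) - np.pi

def pitch_shift_frame(spectrum, prev_phase, acc_phase, hop, ratio):
    # One frame of frequency-domain pitch shifting: scale each bin's measured
    # phase deviation by 'ratio', re-anchor it to the target bin's nominal
    # advance, accumulate it, and move the magnitude to the target bin.
    n_bins = len(spectrum)                      # assumes an rfft half-spectrum
    n_fft = (n_bins - 1) * 2
    k = np.arange(n_bins)
    target = np.round(k * ratio).astype(int)    # bin shift by the pitch factor
    phase = np.angle(spectrum)
    omega_src = 2 * np.pi * k * hop / n_fft     # nominal advance of source bin
    omega_tgt = 2 * np.pi * target * hop / n_fft
    deviation = princarg(phase - prev_phase - omega_src)
    acc_phase = acc_phase + ratio * deviation + omega_tgt   # per source bin

    out = np.zeros_like(spectrum)
    keep = target < n_bins                      # drop bins shifted past Nyquist
    np.add.at(out, target[keep],
              np.abs(spectrum[keep]) * np.exp(1j * acc_phase[keep]))
    return out, phase, acc_phase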

with a 1024 FFT size it doesn't work at 44 kHz; it works at 22050 Hz but sounds
like there is a flanger going on, and especially the bass seems odd
https://soundcloud.com/traumlos_kalt/freq-domain-p-shift-test-1/s-QZBEr

first original then resynthesized

is this the quality that is to be expected with this approach?
am I doing it the right way?
Scott Cotton
2018-10-28 09:46:01 UTC
Permalink
I don't know if you're "doing it the right way", however, pitch shift by
bin shifting has the following problems:

- edge effects (using windowing can help)
- pitch shift up puts some frequencies above the Nyquist limit, they need
to be elided
- the quantised pitch shift is only an approximation of a continuous pitch
shift, because the sinc-shaped realisation of a pure sine wave in the
quantised frequency domain can occur at different distances from the bin
centers for different sine waves; shifting bins doesn't do this and thus
isn't 100% faithful.

From the sound clip, I'd guess that you might have some other problems
related to normalising the synthesis volume/power.

The best quality commonly used pitch shift comes from a phase vocoder TSM:
stretch the time and then resample (or vice versa) so that the duration of
the input equals that of the output. Phase vocoders however vary a lot in
the quality of sound they produce; some are even as bad as or worse than
the example you provided.
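Schematically, that TSM-then-resample pitch shift looks like this (a sketch only; pv_time_stretch is a placeholder for whatever phase-vocoder time stretch is used, not a real library call):

import numpy as np

def pitch_shift_via_tsm(x, ratio, pv_time_stretch):
    # Pitch shift by 'ratio' while keeping the duration:
    # 1) time-stretch the signal by 'ratio' with a phase vocoder,
    # 2) resample the stretched signal back to the original length,
    #    which transposes it by 'ratio'.
    stretched = pv_time_stretch(x, ratio)        # length ~ len(x) * ratio
    # linear-interpolation resampling back to len(x) samples
    src = np.linspace(0, len(stretched) - 1, num=len(x))
    return np.interp(src, np.arange(len(stretched)), stretched)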

Hope that helps
Scott
Post by gm
Now I tried pitch shifting in the frequency domain instead of time
domain to get rid of one transform step, but it sounds bad and phasey etc.
multiply phase difference with frequency factor and add to accumulated
phase,
and shift bins according to frequency factor
again there is a formant correction, and the phase is reset to the
original phase
if the amplitude is larger than it was in the previous frame
sounds
like there is a flanger going on, and especially the bass seems odd
https://soundcloud.com/traumlos_kalt/freq-domain-p-shift-test-1/s-QZBEr
first original then resynthesized
is this the quality that is to be expected with this approach?
am I doing it the right way?
--
Scott Cotton
http://www.iri-labs.com
Peter P.
2018-10-28 10:22:20 UTC
Permalink
Dear Scott,
Post by Scott Cotton
I don't know if you're "doing it the right way", however, pitch shift by
bin shifting has
-edge effects (using windowing can help)
- pitch shift up puts some frequencies above nyquist limit, they need to be
elided
- the quantised pitch shift is only an approximation of a continuous pitch
shift because
the sinc shaped realisation of a pure sine wave in the quantised frequency
domain can occur
at different distances from the bin centers for different sine waves,
shifting bins doesn't do this
and thus isn't 100% faithful.
From the sound clip, I'd guess that you might have some other problems
related to normalising the
synthesis volume/power
stretch the time
and then resample (or vice versa) so that the duration of input equals that
of output. Phase vocoders
however vary a lot in the quality of sound they produce, some are even as
bad or worse than the example
you provided.
Thank you for this nice explanation, I wonder if you could even add a
few more lines to it regarding the quality of phase vocoders. Your text
ended when it was getting even more exciting. :)

Thanks!
P
Scott Cotton
2018-10-28 11:03:19 UTC
Permalink
Dear Peter,

There are numerous (academic) sources which cite phase vocoding as a
"solved problem" when used
in conjunction with transient detection and phase locking. I don't
entirely agree with that assessment.

Phase vocoders often have limitations around the following:
1. integer vs real/float stretch factors
2. total amount of stretch (some phase vocoders deteriorate a lot as the
stretch factor approaches 2 or 1/2; given 1) above, this can also mean no
stretch factors are in fact useful)
3. synthesis power/volume often is a rough approximation
4. windowing conditions often don't readily scale to different overlap
factors, constraining the quality/cost tradeoffs available
5. real-time dynamic pv can introduce additional artefacts at points where
the stretch factor changes.

I'm sure existing pv authors have more to say. There are also related
sines + noise decompositions, and a
lot of academic reading material. Many pv authors report that getting it
to work is much harder than textbooks lead one to believe. Often there are
problems associated with computing/estimating the principal value of a
phase difference for
frequency estimation.

I've not yet made a good pv myself, except perhaps using too
computationally costly algorithms, so take this with a grain of salt. I'd
be interested in hearing others' ideas around current pv limitations and
quality too.

Scott
Post by Peter P.
Dear Scott,
Post by Scott Cotton
I don't know if you're "doing it the right way", however, pitch shift by
bin shifting has
-edge effects (using windowing can help)
- pitch shift up puts some frequencies above nyquist limit, they need to
be
Post by Scott Cotton
elided
- the quantised pitch shift is only an approximation of a continuous
pitch
Post by Scott Cotton
shift because
the sinc shaped realisation of a pure sine wave in the quantised
frequency
Post by Scott Cotton
domain can occur
at different distances from the bin centers for different sine waves,
shifting bins doesn't do this
and thus isn't 100% faithful.
From the sound clip, I'd guess that you might have some other problems
related to normalising the
synthesis volume/power
The best quality commonly used pitch shift comes from a phase vocoder
stretch the time
and then resample (or vice versa) so that the duration of input equals
that
Post by Scott Cotton
of output. Phase vocoders
however vary a lot in the quality of sound they produce, some are even as
bad or worse than the example
you provided.
Thank you for this nice explanation, I wonder if you could even add a
few more lines to it regarding the quality of phase vocoders. Your text
ended when it was getting even more exciting. :)
Thanks!
P
--
Scott Cotton
http://www.iri-labs.com
gm
2018-10-28 13:19:12 UTC
Permalink
Post by Scott Cotton
- the quantised pitch shift is only an approximation of a continuous
pitch shift because
the sinc shaped realisation of a pure sine wave in the quantised
frequency domain can occur
at different distances from the bin centers for different sine waves,
shifting bins doesn't do this
and thus isn't 100% faithful.
I think this is one of the problems; frequency-wise it seems to work
better at 11025 Hz sample rate with a 1024 FFT size,
so I assume you would really need 4096 and 8 overlaps minimum for 44 kHz.
It's hard to tell because I can't test more than 4 overlaps in Reaktor
right now, it will get too complicated,
and with that temporal spacing it's difficult to judge if a larger FFT is
all that's needed.

I am not sure if I calculate the principal value of the phase difference
correctly;
I just wrap it back into the -pi..pi range, which seems right to me, but
maybe I am missing something.
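For reference, the usual principal-value wrap (in numpy terms, purely illustrative) maps any phase difference into [-pi, pi), which is what wrapping back into the -pi..pi range amounts to:

import numpy as np

def princarg(phase):
    # wrap a phase (difference) into the principal range [-pi, pi)
    return (phase + np.pi) % (2.0 * np.pi) - np.pi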
Post by Scott Cotton
From the sound clip, I'd guess that you might have some other problems
related to normalising the
synthesis volume/power
that's possible, but either I don't understand this point or it wouldn't
matter so much?
Post by Scott Cotton
The best quality commonly used pitch shift comes from a phase vocoder
TSM: stretch the time
and then resample (or vice versa) so that the duration of input equals
that of output.
that's what I did before but I am hoping to get something that is more
suitable for real time,
with less latency, calculating the forward transform and spectral
envelopes offline
gm
2018-10-28 15:47:10 UTC
Permalink
to sum it up, assumptions:

- for the phase vocoder approach you need an FFT size of 4096 @ 44.1 kHz
SR, and
- 8 or rather 16 overlaps at this FFT size and SR for a decent quality

- you need two FIR filters of up to 200 taps for a spectral envelope
on an ERB scale (or similar) at this FFT size (you can precalculate this
offline though)

- if you calculate a 4096 iFFT just in time (one bin per sample) you
have a latency of
~93 ms (4096/44100) with these parameters; spread over 16 simultaneous FFT
overlaps that is ~6 ms,
which would be usable

not sure if all of these assumptions are correct, but
I assume these are the reasons why we don't see so many real-time
applications
with this technique

It's doable, but on the border of what is practically useful (in a VST
for instance), I think
Post by gm
Post by Scott Cotton
- the quantised pitch shift is only an approximation of a continuous
pitch shift because
the sinc shaped realisation of a pure sine wave in the quantised
frequency domain can occur
at different distances from the bin centers for different sine waves,
shifting bins doesn't do this
and thus isn't 100% faithful.
I think this is one of the problems, frequency wise it seems to work
better at 11025 Hz sample rate with 1024 FFT size
so I assume you would really need 4096 and 8 overlaps minimum for 44 kHz
Its hard to tell because I can't test more than 4 overlaps in Reaktor
right now, it will get too complictated
and with that temporal spacing it's diffcult to judge if a larger FFT
is all thats needed
I am not sure if I calculate the principal value of the phase
difference correctly
I just wrap it back into the -pi..pi range, which seems right to me
but maybe I am missing something
Post by Scott Cotton
From the sound clip, I'd guess that you might have some other
problems related to normalising the
synthesis volume/power
that's possible, but either I don't understand this point or it
wouldn't matter so much?
Post by Scott Cotton
The best quality commonly used pitch shift comes from a phase vocoder
TSM: stretch the time
and then resample (or vice versa) so that the duration of input
equals that of output.
that's what I did before but I am hoping to get something that is more
suitable for real time,
with less latency, calculating the forward transform and spectral
envelopes offline
Scott Cotton
2018-10-28 17:05:34 UTC
Permalink
Post by gm
SR and
- 8 or rather 16 overlaps at this FFT size and SR for a decent quality
That coincides with what I've played with. But the FFT size is also a
function of
the frequency range of the input. For audio a common lower bound on frequency is
20 Hz, which
gives an FFT of 1/20th of a second to be able to distinguish 1 sinusoid per
bin, as per pv
assumptions. This is already way higher than the latency acceptable for interactive apps.
Post by gm
- you need two up to 200 tap FIR filters for a spectral envelope
on an ERB scale (or similar) at this FFT size (you can precalculate this
offline though)
Could you explain more about this? What exactly are you doing with ERB and
offline calculation of
spectral envelopes?
Post by gm
- if you calculate a 4096 iFFT just in time (one bin per sample) you
have a latency of
~100 ms with these parameters, sped up with 16 simultanoues FFT overlaps
100/16 = ~5ms,
which would be usable
not sure if all of these assumptions are correct, but
I assume these are the reasons why we dont see so many real time
applications
with this technique
It's doable, but on the border of what is practically useful (in a VST
for instance) I think
I think that's why it requires so much work to tune everything to make it
work
reasonably for a wide variety of inputs. Once you calculate the
requirements, it's only
doable for a limited frequency range and type of input.

Perhaps sub-band decomposition and then FFT techniques on the sub-bands
would help.

Scott
Post by gm
Post by gm
Post by Scott Cotton
- the quantised pitch shift is only an approximation of a continuous
pitch shift because
the sinc shaped realisation of a pure sine wave in the quantised
frequency domain can occur
at different distances from the bin centers for different sine waves,
shifting bins doesn't do this
and thus isn't 100% faithful.
I think this is one of the problems, frequency wise it seems to work
better at 11025 Hz sample rate with 1024 FFT size
so I assume you would really need 4096 and 8 overlaps minimum for 44 kHz
Its hard to tell because I can't test more than 4 overlaps in Reaktor
right now, it will get too complictated
and with that temporal spacing it's diffcult to judge if a larger FFT
is all thats needed
I am not sure if I calculate the principal value of the phase
difference correctly
I just wrap it back into the -pi..pi range, which seems right to me
but maybe I am missing something
Post by Scott Cotton
From the sound clip, I'd guess that you might have some other
problems related to normalising the
synthesis volume/power
that's possible, but either I don't understand this point or it
wouldn't matter so much?
Post by Scott Cotton
The best quality commonly used pitch shift comes from a phase vocoder
TSM: stretch the time
and then resample (or vice versa) so that the duration of input
equals that of output.
that's what I did before but I am hoping to get something that is more
suitable for real time,
with less latency, calculating the forward transform and spectral
envelopes offline
--
Scott Cotton
http://www.iri-labs.com
gm
2018-10-28 17:33:41 UTC
Permalink
Post by gm
- you need two up to 200 tap FIR filters for a spectral envelope
on an ERB scale (or similar) at this FFT size (you can
precalculate this
offline though)
Could you explain more about this?  What exactly are you doing with
ERB and offline calculation of
spectral envelopes?
I am not using ERB at the moment; I was thinking ahead about how to do it
more properly.

What I do at the moment is filter the amplitude spectrum with a
moving-average FIR filter; I am using a 5-bin average on the lower bands
and a 40-bin average on the highest bands (for an FFT size of 1024 at a
22050 Hz sample rate),
blending between filter lengths depending on the log2 of the bin number.

In other words, I double the moving-average filter each octave.

I filter forwards and backwards through the bins.

This is my spectral envelope, which I use to whiten the original signal
(divide the original spectrum's amplitudes by the smoothed spectrum's
amplitudes)
and then use a shifted version of the averaged spectrum to imprint the
corrected formants
(multiplying, like vocoding) on the whitened spectrum.

To do this more properly, I assume that the averaging filters should be
based on an ERB scale,
though what I do is somewhat similar. Then you would need to average
about 200 bins
for the highest ERB bands.
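In rough numpy terms, that whiten-and-reapply step could be sketched like this (my own names; 'shift' is the formant shift as a frequency ratio):

import numpy as np

def correct_formants(mag, envelope, shift):
    # Whiten a magnitude spectrum by its smoothed envelope, then multiply by
    # a frequency-shifted copy of that envelope to move the formants by 'shift'.
    # mag, envelope: per-bin magnitude arrays for one frame.
    eps = 1e-12
    whitened = mag / (envelope + eps)
    k = np.arange(len(mag))
    # read the envelope at k/shift (linear interpolation), i.e. shift the formants
    shifted_env = np.interp(k / shift, k, envelope,
                            left=envelope[0], right=envelope[-1])
    return whitened * shifted_env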

My idea was to use the phase vocoder in a sample slicer, so that you can
stretch
and pitch shift sample slices, with reconstructed formants.
For this you could precalculate the forward FFT, the spectral
envelopes, and the phase
differences/frequencies, so you only have the latency and CPU of the
inverse transform.
gm
2018-10-28 21:28:30 UTC
Permalink
There had been a mistake in my structure which caused the phase to be
set to zero;
now it sounds more like the original when no pitch shift is applied
(which is a good indicator that something is wrong when it does not).

https://soundcloud.com/traumlos_kalt/freq-domain-pv-shift-test-3b-22050-2/s-5pAUL
however it sounds as bad as before when shifted (1024 @ 22050 Hz again)

I am thinking now that resetting the phase to the original when the
amplitude exceeds the previous value
is probably wrong too, because the phase should be different when
shifted to a different bin
if you want to preserve the waveshape
I am not sure about this, but when you think of a minimum phase
waveshape the phases are all different, depending on the frequency.
gm
2018-10-28 23:45:00 UTC
Permalink
Post by gm
I am thinking now that resetting the phase to the original when the
amplitude exceeds the previous value
is probably wrong too, because the phase should be different when
shifted to a different bin
if you want to preserve the waveshape
I am not sure about this, but when you think of a minimum phase
waveshape the phases are all different, depending on the frequency.
This whole phase resetting I had is nonsense:

consider speech, for instance: the partials meander from bin to bin
from one frame to the next, so you always have the case that the amplitude
is larger than it was in the previous frame, but that is not a transient
where you would want to reset the phase.

On the other hand it sounds tinny when the phases are always running freely
and transients don't have the waveshapes they should have when you
stretch time and shift pitch.

So you would need partial tracking or something to that effect I assume.

Also, the formant correction I described worked pretty well in the "TDM"
version but
not well in the phase vocoder version; I don't know why, because I am doing
it the same way.
Ethan Duni
2018-10-29 04:43:40 UTC
Permalink
You should have a search for papers by Jean Laroche and Mark Dolson, such
as "About This Phasiness Business" for some good information on phase
vocoder processing. They address time scale modification mostly in that
specific paper, but many of the insights apply in general, and you will
find references to other applications.

Ethan
Post by gm
Post by gm
I am thinking now that resetting the phase to the original when the
amplitude exceeds the previous value
is probably wrong too, because the phase should be different when
shifted to a different bin
if you want to preserve the waveshape
I am not sure about this, but when you think of a minimum phase
waveshape the phases are all different, depending on the frequency.
consider for instance speech, the partials meander from bin to bin
from one frame to the next, so you always have the case that the amplitude
is larger than it was in the previous frame, but that is not a transient
where you would want to reset phase.
On the other hand it sounds tinny when the phases are always running freely
and transients don't have the waveshapes they should have when you
stretch time and shift pitch.
So you would need partial tracking or something to that effect I assume.
Also the formant correction I described worked pretty well in the "TDM"
version but
not well in the phase vocoder version, I dont know why, cause I am doing
it the same way.
gm
2018-10-29 07:49:32 UTC
Permalink
Thanks for the tip, I had a brief look at this paper before.
I think the issue it addresses is not the problem I encounter now.
But it might be interesting again at a later stage, or if I return to the
time-domain pitch shift.

This is how I do it now; it seems simple & correct, but I am not 100% sure.
It still sounds bad except for zero pitch shifts; pure time stretches
without pitch shift sound ok-ish though.
Especially compared to the previous version with time-domain pitch
shifting it sounds much worse.

1. loop through bins -------------

    calculate input phase difference

    subtract the phase step of the hop size of the original bin  // = phase/frequency deviation

    multiply by the frequency factor

    add the phase step of the hop size of the target bin

    wrap into the -pi...pi range

    accumulate into the bin's phase accumulator

-------- end loop

2. loop through bins ------------
    calculate spectral envelopes / formant correction
-------- end loop

3. loop through bins ------------
    shift bins
-------- end loop

4. iFFT


for 2., the formant correction, you have to do this before the bin shift:

loop through bins -------------
    smooth the amplitude spectrum according to an ERB scale or similar
----- end loop

loop through bins -------------
    shift the bins of a copy of the smoothed spectrum in the opposite
direction (1/freq factor)
    // smooth again, or don't, or use MIP mapping
    calculate amplitudes * spectral envelope 2 / spectral envelope 1
----- end loop


this seems correct (?) and does both pitch shifting and time stretching.
However, it doesn't sound good; it actually sounds kind of resonant and
tinny
(noise elements seem to be converted to ringing sinusoids) and strange,
except when the shift is 0, and it also sounds better for downshifts than
for upshifts

also missing is a transient detection (in the time domain?) to reset phases
to the original phases, and a noise / sinusoid detection, which might
improve things

part of the bad sound may just be the formant correction; it sounds a
little bit like there were
two voices speaking in sync. Another part is the low number of overlaps.
example:
https://soundcloud.com/traumlos_kalt/freq-domain-pv-shift-test-4e-2f-4c-test-2/s-6SJ93
1024 @ 22050 Hz
gm
2018-10-29 18:12:01 UTC
Permalink
Post by Ethan Duni
You should have a search for papers by Jean Laroche and Mark Dolson,
such as "About This Phasiness Business" for some good information on
phase vocoder processing. They address time scale modification mostly
in that specific paper, but many of the insights apply in general, and
you will find references to other applications.
Ethan
I think the technique from the paper only applies to monophonic
harmonic input -?
It picks amplitude peaks and reconstructs the phase on bins around them,
depending on the synthetic phase and
the neighbouring input phase. I don't really see what it should do exactly,
tbh, but
the criterion for a peak is that it is larger than four neighbouring
bins, so this doesn't apply to arbitrary signals, I think.

I also tried Miller Puckette's phase locking mentioned in his book The
Theory and Technique of Electronic
Music and also mentioned in the paper,
but a) I don't hear any difference and b) I don't see how and why it
should work.

From the structure displayed in the book, he adds two neighbouring
complex-numbered bins,
multiplied. That is, he multiplies their real and imaginary parts respectively
and adds that to the values of the bin (Fig 9.18, p. 293).
Unfortunately this is not explained in detail.

I don't see what that would do other than adding a tiny perturbation to
isolated peaks
and a somewhat larger one to neighbouring bins of peaks?
I don't see how this should lock the phases of neighbouring bins?

And again this doesn't apply to arbitrary signals?
gm
2018-10-29 18:19:04 UTC
Permalink
Post by gm
From the structure displayed in the book, he adds two neighbouring
complex numbered bins,
multiplied. That is, he multiplies their real and imaginary part respectivly
and adds that to the values of the bin - (Fig 9.18 p. 293).
Unfortunately this is not explained in detail,
Sorry, I misinterpreted the figure; he doesn't do that but just seems to
add the bins.
The multiplication seems to be a flag to switch this on and off.
Still, I don't see what that would do other than adding an unwanted
perturbation,
and it sounds worse actually.
Probably I am missing something essential.
Scott Cotton
2018-10-29 18:50:01 UTC
Permalink
Post by gm
Post by Ethan Duni
You should have a search for papers by Jean Laroche and Mark Dolson,
such as "About This Phasiness Business" for some good information on
phase vocoder processing. They address time scale modification mostly
in that specific paper, but many of the insights apply in general, and
you will find references to other applications.
Ethan
I think the technique from the paper only applies for monophonic
harmonic input -?
It picks amplitude peaks and reconstructs the phase on bins around them
depending on the synthetic phase and
neighbouring input phase. I dont really see what it should do exactly
tbh, but
the criterion for a peak is that it is larger than four neighbouring
bins so this doesn't apply to arbitrary signals, I think.
I also tried Miller Puckets phase locking mentioned in his book The
Theory and Technique of Electronic
Music and also mentioned in the paper,
but a) I don't hear any difference and b) I don't see how and why it
should work.
From the structure displayed in the book, he adds two neighbouring
complex numbered bins,
multiplied. That is, he multiplies their real and imaginary part respectivly
and adds that to the values of the bin - (Fig 9.18 p. 293).
Unfortunately this is not explained in detail,
I don't see what that would do other than adding a tiny perturbation to
isolated peaks
and somehwat larger one to neighbouring bins of peaks?
I don't see how this should lock phases of neighbouring bins?
And again this doesn't apply to arbitrary signals?
No phase vocoder applies to arbitrary signals. PVs work for polyphonic
input where
- the change in frequency over time is slow; specifically, slow enough so
that the frequency calculation step over one hop can estimate "the"
frequency over the corresponding time slice.
- there is at most one sinusoidal "component" per bin (this is unrealistic
for many reasons), meaning the time slice needs to be large enough and the
FFT large enough to distinguish.

Note the above can't handle, for example, onsets for most musical
instruments.

Nonetheless, the errors when these conditions do not hold are such that
some are able to make
decent sounding TSM/pitch change phase vocoders for a widER variety of
input.

If you put a chirp into a PV and change the rate of frequency change of the
chirp, you'll hear the slow frequency
change problem. Before you hear it, you'll see it in the form of little
steps in the waveform

Scott
Post by gm
--
Scott Cotton
http://www.iri-labs.com
gm
2018-10-29 19:09:06 UTC
Permalink
That's understood.

What is not completely understood by me is the technique in the paper,
and the very much related technique from the book.
How can this apply to arbitrary signals when it relies on sinusoids
separated by several bins?

Also, it seems I don't understand where the artefacts in my pitch shift
come from.
They seem to have to do with phases, but it's not understood how exactly.

What is understood is that the neighbouring bins of a sinusoidal peak
have a phase -pi apart.

I don't see the effect of this though; they rotate in the same direction,
at the same speed.

But why is there no artefact of this kind when the signal is only stretched,
but not shifted?
Post by Scott Cotton
Post by gm
Post by Ethan Duni
You should have a search for papers by Jean Laroche and Mark Dolson,
such as "About This Phasiness Business" for some good information on
phase vocoder processing. They address time scale modification mostly
in that specific paper, but many of the insights apply in general, and
you will find references to other applications.
Ethan
I think the technique from the paper only applies for monophonic
harmonic input -?
It picks amplitude peaks and reconstructs the phase on bins around them
depending on the synthetic phase and
neighbouring input phase. I dont really see what it should do exactly
tbh, but
the criterion for a peak is that it is larger than four neighbouring
bins so this doesn't apply to arbitrary signals, I think.
I also tried Miller Puckets phase locking mentioned in his book The
Theory and Technique of Electronic
Music and also mentioned in the paper,
but a) I don't hear any difference and b) I don't see how and why it
should work.
 From the structure displayed in the book, he adds two neighbouring
complex numbered bins,
multiplied. That is, he multiplies their real and imaginary part respectivly
and adds that to the values of the bin - (Fig 9.18 p. 293).
Unfortunately this is not explained in detail,
I don't see what that would do other than adding a tiny
perturbation to
isolated peaks
and somehwat larger one to neighbouring bins of peaks?
I don't see how this should lock phases of neighbouring bins?
And again this doesn't apply to arbitrary signals?
No phase vocoder applies to arbitrary signals.  PVs work for
polyphonic input where
-  the change in frequency over time is slow; specifically slow enough
so that the
frequency calculation step over one hop can estimate "the" frequency
over the corresponding time slice.
-  there is at most one sinusoidal "component" per bin (this is
unrealistic for many reasons), meaning
the time slice needs to be large enough and FFT large enough to distinguish.
Note the above can't handle, for example, onsets for most musical
instruments.
Nonetheless, the errors when these conditions do not hold are such
that some are able to make
decent sounding TSM/pitch change phase vocoders for a widER variety of
input.
If you put a chirp into a PV and change the rate of frequency change
of the chirp, you'll hear the slow frequency
change problem. Before you hear it, you'll see it in the form of
little steps in the waveform
Scott
--
Scott Cotton
http://www.iri-labs.com
Scott Cotton
2018-10-29 20:34:53 UTC
Permalink
Post by gm
That's understood.
What is not completely understood by me is the technique in the paper, and
the very much related technique from the book.
How can this apply to arbitrary signals when it relies on sinusioids
seperated by several bins?
For music, it is a fairly close approximation during the "sustain" part of
notes.
Post by gm
Also it seems I dont understand where the artefacts in my pitch shift come
from.
They seem to have to do with phases but it's not understood how exactly.
What is understood is that the neighbouring bins of a sinusoidal peak
have a phase -pi apart.
I dont see the effect of this though, they rotate in the same direction,
at the same speed.
But why is there no artefact of this kind when the signal is only stretched,
but not shifted?
I think the big picture answer to this last question is: time stretching is
done (or at least can be done) by a continuous factor,
while bin shifting is inherently quantised. Moreover, the quantised nature
of the bins isn't really the same as, say, breaking
down a continuous interval into a set of contiguous equi-sized intervals,
because of the sinc-shaped appearance of frequencies in the quantised
domain and edge effects. I guess in theory one could find a set of sines
whose projection onto the bins results in the frequency-domain picture you
get, but this would be computationally quite expensive compared to bin
shifting (and is normally approximated by peak finding).

If, when you are shifting bins, each bin ends up with the value of some
single bin from the original, then the phase
could be close. But then you might be missing some input data in your
transform. On the other hand, if the target value of a bin combines
several input bins, then the phase will likely be messed up most of the
time, because you're encoding 2 or more distinct phase values into one.

Otherwise, for more precise treatment of phase than wrap/unwrap, see this
chapter <http://sethares.engr.wisc.edu/vocoders/Transforms.pdf> pages
118-120.

I also found it difficult myself to wrap my head around why TSM gives a
better pitch shift than shifting directly in the frequency domain; the
above is what I arrived at, and I
find it convincing. Frequency-domain pitch shift, I think, can only be as
good as TSM-based pitch shift if the frequency domain is treated
continuously rather than in bins. If you do that, then you no longer have
an FFT, and things get costly and complicated fast.

Hope that helps
Scott
Post by gm
Post by gm
Post by Ethan Duni
You should have a search for papers by Jean Laroche and Mark Dolson,
such as "About This Phasiness Business" for some good information on
phase vocoder processing. They address time scale modification mostly
in that specific paper, but many of the insights apply in general, and
you will find references to other applications.
Ethan
I think the technique from the paper only applies for monophonic
harmonic input -?
It picks amplitude peaks and reconstructs the phase on bins around them
depending on the synthetic phase and
neighbouring input phase. I dont really see what it should do exactly
tbh, but
the criterion for a peak is that it is larger than four neighbouring
bins so this doesn't apply to arbitrary signals, I think.
I also tried Miller Puckets phase locking mentioned in his book The
Theory and Technique of Electronic
Music and also mentioned in the paper,
but a) I don't hear any difference and b) I don't see how and why it
should work.
From the structure displayed in the book, he adds two neighbouring
complex numbered bins,
multiplied. That is, he multiplies their real and imaginary part respectivly
and adds that to the values of the bin - (Fig 9.18 p. 293).
Unfortunately this is not explained in detail,
I don't see what that would do other than adding a tiny perturbation to
isolated peaks
and somehwat larger one to neighbouring bins of peaks?
I don't see how this should lock phases of neighbouring bins?
And again this doesn't apply to arbitrary signals?
No phase vocoder applies to arbitrary signals. PVs work for polyphonic
input where
- the change in frequency over time is slow; specifically slow enough so
that the
frequency calculation step over one hop can estimate "the" frequency over
the corresponding time slice.
- there is at most one sinusoidal "component" per bin (this is
unrealistic for many reasons), meaning
the time slice needs to be large enough and FFT large enough to distinguish.
Note the above can't handle, for example, onsets for most musical
instruments.
Nonetheless, the errors when these conditions do not hold are such that
some are able to make
decent sounding TSM/pitch change phase vocoders for a widER variety of
input.
If you put a chirp into a PV and change the rate of frequency change of
the chirp, you'll hear the slow frequency
change problem. Before you hear it, you'll see it in the form of little
steps in the waveform
Scott
Post by gm
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
--
Scott Cotton
http://www.iri-labs.com
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
--
Scott Cotton
http://www.iri-labs.com
gm
2018-10-29 23:57:54 UTC
Permalink
Unfortunately I would have to stick with the "sliding" PD phase locking
structure from the book for now;
iterating through the spectrum to search for peaks and identify groups
will add too many frames of additional latency in Reaktor.

But for me this method unfortunately definitely gave worse results than
without phase locking for music.
It actually sounded broken in some cases, when pitch shifted in frequency.

Both at 22050 and 11025 Hz SR with a 1024 FFT, so it's not just a
question of frequency resolution.
Or rather, it probably is, but 4096 @ 44.1 kHz would not solve the problem.
On the other hand, on some speech samples I found an improvement.

(It might be, of course, that I am still doing something else wrong...
Also I might be fooled a little bit because it's louder without phase
locking, since summing the adjacent bins
will decrease the peak bin's amplitude, so it seems to sound better
without it when you switch it on and off.)

At least meanwhile I understand why and how the method is expected to
work and I like its approach.
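
For reference, my understanding in a minimal numpy sketch (this is only my
interpretation of the structure, not necessarily exactly what the book or
Pd do): the phase of each synthesis bin is taken from the complex sum of
the bin and its two neighbours, and the magnitude is kept as it is.

import numpy as np

def lock_phases(Y):
    # my reading of the Fig 9.18 structure (an assumption on my part):
    # sum each bin with its two neighbours and reuse only the angle of
    # that sum as the bin's synthesis phase
    locked = np.empty_like(Y)
    for k in range(len(Y)):
        lo, hi = max(k - 1, 0), min(k + 2, len(Y))
        s = np.sum(Y[lo:hi])
        locked[k] = np.abs(Y[k]) * np.exp(1j * np.angle(s))
    return locked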

If s.o. is interested I can post another audio example but I think I've
spammed the list enough so far...

About such sophisticated things as intra-frame corrections: besides that
they might well be above my head,
I first have to get a basic frequency-domain shift to work at a decent
quality...

Meanwhile I found what was broken, the shifted frequencies' phases were
wrong, but quality is still
disappointing compared to the time-stretch version.

I also made a quick test where adjacent amplitudes around peaks were
simply set to zero
(which doesn't require additional latency here since I can do it on the
output stream)
and that definitely sounds better on up shifts, so I assume I really
do need something to treat the phases around peaks,
but I am not sure yet what an efficient way is that I can implement.
Maybe it's just that, setting some bins to zero?
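
Roughly what I tried (a numpy sketch; "peaks" stands for whatever list of
peak bin indices my picker produces):

import numpy as np

def zero_neighbours(spectrum, peaks):
    # set the bins directly adjacent to each detected peak to zero,
    # leaving the peak bin itself untouched
    out = spectrum.copy()
    for k in peaks:
        if k - 1 >= 0:
            out[k - 1] = 0
        if k + 1 < len(out):
            out[k + 1] = 0
    return out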

Maybe I need to explain that the FFT I made for Reaktor provides a
constant stream of values like
an audio stream and you have to work on that, sample by sample, or else
you need a frame
of latency to store the values and work on them, again, sample by sample.
This makes prototyping a little bit odd and laborious, but in the end it
forces you to choose
efficient solutions; that's why I like Miller Puckette's version of the
phase locking, it just didn't work so well here.
robert bristow-johnson
2018-10-30 00:27:24 UTC
Permalink
---------------------------- Original Message ----------------------------

Subject: Re: [music-dsp] pitch shifting in frequency domain Re: FFT for realtime synthesis?

From: "gm" <***@voxangelica.net>

Date: Mon, October 29, 2018 7:57 pm

To: music-***@music.columbia.edu

--------------------------------------------------------------------------
Post by gm
About such sophisticated things as intra frame corrections, besides that
it might well be above my head
I first have to get a basic frequency domain shift to work in a decent
quality...
Meanwhile I found what was broken, the shifted frequencies phases were
wrong, but quality is still
dissapointing compared to the time stretch version.
it just seems to me, mr. g, that implementing this *first* as a time-scaling version (so, being not-real-time you would be processing a sound file into another sound file) would be the simplest thing.  then deal with pitch-shifting by resampling the time-stretched result so that it's
back to the original length.
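
something like this, conceptually (a sketch; it assumes you already have a
time_stretch() routine of your own that works offline, and scipy's resample
is just a stand-in for whatever resampler you use):

import numpy as np
from scipy.signal import resample

def pitch_shift(x, semitones, time_stretch):
    # time-stretch first, then resample the stretched result back to the
    # original length: duration is preserved, pitch moves by the ratio
    ratio = 2.0 ** (semitones / 12.0)
    stretched = time_stretch(x, ratio)    # placeholder: your own time-scaler
    return resample(stretched, len(x))
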
don't bother with the intra-frame corrections.   it's not worth it.  but the phase adjustment on each recognized sinusoid between frames should be done to keep it glitch-free.  but it could still mangle attacks and other
transients.
 
--


r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
gm
2018-10-30 15:30:42 UTC
Permalink
Ok, here's a final idea, can't test any of this so it's pure science fiction:


-Take a much larger FFT spectrogram offline, with really fine overlap
granularity.

-Take the cepstrum, identify regions/groups of transients by new peaks
in the cepstrum.

-Pick peaks in the spectrum, by amplitudes and the phase relationship
around the peaks.

-Separate sinusoids from noise this way.

-Compress the peaks (without the surrounding regions) and noise into
smaller spectra.
(but how? - can you simply add those that fall into the same bins?)

Now use these smaller spectra to synthesize in realtime.

Use the original phases on transient partial groups and events to
preserve waveshapes on transients.
Use random phases for the noise like components.
Use accumulated phases everywhere else.


Not sure if any of this makes sense.
I am curious about the spectrum compression part, would this work and if
not why not?
gm
2018-10-31 00:17:40 UTC
Permalink
Post by gm
-Compress the peaks (without the surrounding regions) and noise into
smaller spectra.
(but how? - can you simply add those that fall into the same bins?)
snip...
Post by gm
I am curious about the spectrum compression part, would this work and
if not why not?
That's a very fundamental DSP question and I am surprised no one wants to
answer.
Probably either no one read the post or it was too stupid.

My layman's answer is YES OF COURSE, WHAT ARE YOU THINKING but what do I
know, I am all self-taught
and even taught myself how to read and write (not kidding) so I am not sure.
The question, in other words, is: can you simply add the simultaneous
parameters (their "phasors," complex numbers)
of two close sinusoids together to get a meaningful description of that
instance in time?

And another question, to be filed under urban anecdote:
I've changed the resynth from FFT 1024 @ 22050 Hz to FFT 2048 @ 44.1 kHz
and it sounds so much better - why is that? I have no explanation,
because the difference between 22 and 44 kHz is so minuscule, it doesn't
explain that.

I am quite happy with the freq domain resynth now tbh.
It's not "really good" but ok-ish, not bad, good enough for demonstration
purposes.
Even though there is a fundamental flaw with my product sketch:
an FFT slice is too long, time-slice-wise, to be useful in a tight sample
slicer,
so it's kind of sloppy in its reaction.
robert bristow-johnson
2018-10-31 02:14:47 UTC
Permalink
---------------------------- Original Message ----------------------------

Subject: [music-dsp] two fundamental questions Re: FFT for realtime synthesis?

From: "gm" <***@voxangelica.net>

Date: Tue, October 30, 2018 8:17 pm

To: music-***@music.columbia.edu

--------------------------------------------------------------------------
Post by gm
Post by gm
-Compress the peaks (without the surrounding regions) and noise into
smaller spectra.
(but how? - can you simply add those that fall into the same bins?)
snip...
Post by gm
I am curious about the spectrum compression part, would this work and
if not why not?
That's a very fundamental DSP question and I am surprised no one wants to
answer.
Probably either no one did read the post or it was too stupid.
My layman's answer is YES OF COURSE, WHAT ARE YOU THINKING but what do I
know, I am all self taught
and even taught myself how to read and write (not kidding) so I am not sure.
mr. g,
i am not exactly "self-taught".  i learned my essential math and electrical engineering in a university setting.  but most of these audio techniques are things i have learned
since skool and are mostly self-taught.
perhaps a quarter century ago i saw a paper about doing something like real-time additive synthesis by use of the inverse FFT, and while i don't remember the details, i remember thinking to myself that it would be very difficult doing synthesis with an
iFFT for several reasons.  if the FFT is too small (like N=2048) the frequency resolution for each bin is crappy for notes below, say, middle C.   if you're not just plopping a "lobe" of template data down onto the iFFT buffer, then you have to find some way to interpolate
to move its position a fraction of a bin.  if your FFT is too large, you have a long time to wait between MIDI NoteOn going in and sound coming out.  but a big iFFT is what you need for decent frequency resolution at the lower pitches.


if you're synthesizing notes that have harmonic partials, then a periodic or quasi-periodic synthesis technique in the time-domain is likely your simplest bet.  i am a partisan of Wavetable Synthesis because, with the one restriction that the partials are nearly or perfectly harmonic, wavetable
synthesis is general.  nearly any existing acoustic instrument can be convincingly synthesized with wavetable synthesis.  as long as you have a good pitch detector, it's not hard to analyze a given note and break it into wavetables.  but you need to know the period (to a sub-sample
precision) and you need to be able to interpolate convincingly between samples.  BOTH in the analysis phase and in the synthesis phase.  wavetable synthesis works well for notes with an attack (what you might use a sampling and looping "synthesizer" to do), additive synthesis (as
long as the partials are nearly harmonic), classic waveforms (like saw, square, PWM, sync-saw, sync-square) and you can cross-fade (or "interpolate") between wavetables to get the ability to change the sound.  you need a good tracking pitch detector in the analysis phase.
but,
for fun, there are other synthesis techniques from the olden days.  Classic Subtractive synthesis, Karplus-Strong, Physical Modeling, Non-linear Waveshaping, Geometric Summation formula, IIR impulse response, comb-filter impulse response, "FOF" (something like overlapping wavelets or
grains).  but they're all time domain.
 
--

r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
gm
2018-10-31 02:35:46 UTC
Permalink
Thanks for your answer, it's much appreciated.

My goal is to resynthesize arbitrary noises. If you do that with
wavetables you end up with a pitch of 20 Hz, hence the FFT.
My product idea was rubbish though and your post confirms that.

For your interest, I recently invented "noisetable synthesis" - take any
sound, randomize the phases or
circularly convolve it with a long chunk of white noise, and you get a long
seamless loop of arbitrary spectra.
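
A minimal sketch of the phase-randomization variant (numpy; the whole file
goes through one big FFT, so this is the offline version, not the realtime
engine):

import numpy as np

def noisetable(x):
    # keep the magnitude spectrum of the whole sound, replace all phases
    # with random ones, transform back: a seamless loop with the same
    # long-term spectrum
    X = np.fft.rfft(x)
    mag = np.abs(X)
    phase = np.random.uniform(-np.pi, np.pi, len(X))
    Y = mag * np.exp(1j * phase)
    Y[0] = mag[0]                    # keep DC real
    Y[-1] = mag[-1]                  # keep Nyquist real (even lengths)
    y = np.fft.irfft(Y, n=len(x))
    return y / np.max(np.abs(y))     # normalise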

Good for synthetic cymbals and the like. Take any sound and stretch it
into an endless loop.
Nice for ambient music, too.

But back to my question, I am serious, could you compress a spectrum by
just adding the
bins that fall together? In my understanding, yes, but I am totally not
sure.
My "understanding" might be totally wrong.
Post by robert bristow-johnson
---------------------------- Original Message ----------------------------
Subject: [music-dsp] two fundamental questions Re: FFT for realtime synthesis?
Date: Tue, October 30, 2018 8:17 pm
--------------------------------------------------------------------------
Post by gm
Post by gm
-Compress the peaks (without the surrounding regions) and noise into
smaller spectra.
(but how? - can you simply add those that fall into the same bins?)
snip...
Post by gm
I am curious about the spectrum compression part, would this work and
if not why not?
That's a very fundamental DSP question and I am surprised no one
wants to
Post by gm
answer.
Probably either no one did read the post or it was too stupid.
My layman's answer is YES OF COURSE, WHAT ARE YOU THINKING but what do I
know, I am all self taught
and even taught myself how to read and write (not kidding) so I am
not sure.
mr. g,
i am not exactly "self-taught".  i learned my essential math and
electrical engineering in a university setting.  but most of these
audio techniques are things i have learned since skool and are mostly
self-taught.
perhaps a quarter century ago i saw a paper about doing something like
real-time additive synthesis by use of the inverse FFT, and while i
don't remember the details, i remember thinking to myself that it
would be very difficult doing synthesis with an iFFT for several
reasons.  if the FFT is too small (like N=2048) the frequency
resolution for each bin is crappy for notes below, say, middle C.   if
you're not just plopping a "lobe" of template data down onto the iFFT
buffer, then you have to find some way to interpolate to move it's
position a fraction of a bin.  if your FFT is too large, you have a
long time to wait between MIDI NoteOn going in and sound coming out. 
but a big iFFT is what you need for decent frequency resolution at the
lower pitches.
if you're synthesizing notes that have harmonic partials, then a
periodic or quasi-periodic synthesis technique in the time-domain is
likely your simplest bet.  i am a partisan of Wavetable Synthesis
because, with the one restriction that the partials are nearly or
perfectly harmonic, wavetable synthesis is general.  nearly any
existing acoustic instrument can be convincingly synthesized with
wavetable synthesis.  as long as you have a good pitch detector, it's
not hard to analyze a given note and break it into wavetables.  but
you need to know the period (to a sub-sample precision) and you need
to be able to interpolate convincingly between samples.  BOTH in the
analysis phase and in the synthesis phase.  wavetable synthesis works
well for notes with an attack (what you might use a sampling and
looping "synthesizer" to do), additive synthesis (as long as the
partials are nearly harmonic), classic waveforms (like saw, square,
PWM, sync-saw, sync-square) and you can cross-fade (or "interpolate")
between wavetables to get the ability to change the sound.  you need a
good tracking pitch detector in the analysis phase.
but, for fun, there are other synthesis techniques from the olden
days.  Classic Subtractive synthesis, Karplus-Strong, Physical
Modeling, Non-linear Waveshaping, Geometric Summation formula, IIR
impulse response, comb-filter impulse response, "FOF" (something like
overlapping wavelets or grains).  but they're all time domain.
--
"Imagination is more important than knowledge."
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
Ross Bencina
2018-10-31 10:27:14 UTC
Permalink
Hi,

Sorry, late to the party and unable to read the backlog, but:

The "FFT^-1" technique that Robert mentions is from a paper by Rodet and
Depalle that I can't find right now. It's widely cited in the literature
as "FFT^-1"

That paper only deals with steady-state sinusoids however. It won't
accurately deal with transients or glides.

There has been more recent work on spectral-domain synthesis and I'm
fairly sure that some techniques have found their way into some quite
famous commercial products.

Bonada, J.; Loscos, A.; Cano, P.; Serra, X.; Kenmochi, H. (2001).
"Spectral Approach to the Modeling of the Singing Voice". In Proc. of
the 111th AES Convention.
Post by gm
My goal is to resynthesize arbitary noises.
In that case you need to think about how an FFT represents "arbitrary
noises".

One approach is to split the signal into sinusoids + noise (a.k.a.
spectral modeling synthesis).
https://en.wikipedia.org/wiki/Spectral_modeling_synthesis

It is worth reviewing Xavier Serra's PhD thesis for the basics (what was
already established in the late 1980s.)

http://mtg.upf.edu/content/serra-PhD-thesis

Here's the PDF:
https://repositori.upf.edu/bitstream/handle/10230/34072/Serra_PhDthesis.pdf?sequence=1&isAllowed=y

There was a bunch of work in the early 90's on real-time additive synthesis
at CNMAT, e.g.

https://quod.lib.umich.edu/i/icmc/bbp2372.1995.091/1/--bring-your-own-control-to-additive-synthesis?page=root;size=150;view=text

Of course there is a ton of more recent work. You could do worse than
looking at the papers of Xavier Serra and Jordi Bonada:
http://mtg.upf.edu/research/publications
Post by gm
But back to my question, I am serious, could you compress a spectrum by
just adding the bins that fall together?
I'm not sure what "compress" means in this context, nor am I sure what
"fall together" means. But here's some points to note:

A steady state sine wave in the time domain will be transformed by a
short-time fourier transform into a spectral peak, convolved (in the
frequency domain) by the spectrum of the analysis envelope. If you know
that all of your inputs are sine waves, then you can perform "spectral
peak picking" (AKA MQ analysis) and reduce your signal to a list of sine
waves and their frequencies and phases -- this is the sinusoidal
component of Serra's SMS (explained in the pdf linked above).

Note that since a sinusoid ends up placing non-zero values in every FFT
bin, you'd need to account for that in your spectral estimation, which
basic MQ does not -- hence it does not perfectly estimate the sinusoids.
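
A minimal peak-picking sketch along those lines (numpy; the parabolic
interpolation of log magnitude is a common refinement for the sub-bin
frequency estimate, not something basic MQ requires):

import numpy as np

def pick_peaks(frame, sample_rate, threshold=1e-4):
    # return (frequency, magnitude, phase) for local maxima of the
    # magnitude spectrum of one analysis frame
    X = np.fft.rfft(frame * np.hanning(len(frame)))
    mag = np.abs(X)
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > threshold and mag[k] > mag[k - 1] and mag[k] > mag[k + 1]:
            a = np.log(mag[k - 1] + 1e-12)
            b = np.log(mag[k] + 1e-12)
            c = np.log(mag[k + 1] + 1e-12)
            p = 0.5 * (a - c) / (a - 2.0 * b + c)   # sub-bin offset, in [-0.5, 0.5]
            freq = (k + p) * sample_rate / len(frame)
            peaks.append((freq, mag[k], np.angle(X[k])))
    return peaks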

In any case, most signals are not sums of stationary sinusoids. And
since signals are typically buried in noise, or superimposed on top of
each other, the problem is not well posed. For two very simple
examples: consider two stable sine waves at 440Hz and 441Hz -- you will
need a very long FFT to distinguish this from a single
amplitude-modulated sine wave? or consider a sine wave plus white noise
-- the accuracy of frequency and phase recovery will depend on how much
input you have to work with.

I think by "compression" you mean "represent sparsely" (i.e. with some
reduced representation.) The spectral modeling approach is to "model"
the signal by assuming it has some particular structure (e.g.
sinusoids+noise, or sinusoids+transients+noise) and then work out how to
extract this structure from the signal (or to reassemble it for synthesis).

An alternative (more mathematical) approach is to simply assume that the
signal is sparse in some (unknown) domain. It turns out that if your
signal is sparse, you can apply a constrained random dimensionality
reduction to the signal and not lose any information. This is the field
of compressed sensing. Note that in this case, you haven't recovered any
structure.

Ross
gm
2018-10-31 18:00:08 UTC
Permalink
Thanks for your time

My question rephrased:
Let's assume a spectrum of size N, can you create a meaningful spectrum
of size N/2
by simply adding every other bin together?

Neglecting the artefacts of the forward transform, let's say an
artificial spectrum
(or a spectrum after peak picking that discards the region around the peaks).

Let's say two sinusoids in two adjacent bins, will summing them into a
single bin of a half-sized spectrum
make sense and represent them adequately?
In my limited understanding, yes, but I am not sure, and would like to
know why not
if that is not the case.
Post by Ross Bencina
I'm not sure what "compress" means in this context, nor am I sure what
A steady state sine wave in the time domain will be transformed by a
short-time fourier transform into a spectral peak, convolved (in the
frequency domain) by the spectrum of the analysis envelope. If you
know that all of your inputs are sine waves, then you can perform
"spectral peak picking" (AKA MQ analysis) and reduce your signal to a
list of sine waves and their frequencies and phases -- this is the
sinusoidal component of Serra's SMS (explained in the pdf linked above).
Note that since a sinusoid ends up placing non-zero values in every
FFT bin, you'd need to account for that in your spectral estimation,
which basic MQ does not -- hence it does not perfectly estimate the
sinusoids.
In any case, most signals are not sums of stationary sinusoids. And
since signals are typically buried in noise, or superimposed on top of
each other, so the problem is not well posed. For two very simple
examples: consider two stable sine waves at 440Hz and 441Hz -- you
will need a very long FFT to distinguish this from a single
amplitude-modulated sine wave? or consider a sine wave plus white
noise -- the accuracy of frequency and phase recovery will depend on
how much input you have to work with.
I think by "compression" you mean "represent sparsely" (i.e. with some
reduced representation.) The spectral modeling approach is to "model"
the signal by assuming it has some particular structure (e.g.
sinusoids+noise, or sinusoids+transients+noise) and then work out how
to extract this structure from the signal (or to reassemble it for
synthesis).
An alternative (more mathematical) approach is to simply assume that
the signal is sparse in some (unknown) domain. It turns out that if
your signal is sparse, you can apply a constrained random
dimensionality reduction to the signal and not lose any information.
This is the field of compressed sensing. Note that in this case, you
haven't recovered any structure.
Ross
Ethan Fenn
2018-11-02 16:41:42 UTC
Permalink
In any case, most signals are not sums of stationary sinusoids. And since
signals are typically buried in noise, or superimposed on top of each
consider two stable sine waves at 440Hz and 441Hz -- you will need a very
long FFT to distinguish this from a single amplitude-modulated sine wave?
or consider a sine wave plus white noise -- the accuracy of frequency and
phase recovery will depend on how much input you have to work with.
This is an interesting thing to think about in any sort of spectral
modeling.

No length of FFT will distinguish between a mixture of these sine waves and
a single amplitude-modulated one, because they're mathematically
identical! Specifically:

sin(440t) + sin(441t) = 2*cos(0.5t)*sin(440.5t)
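
A quick numpy check, for anyone who wants to see it numerically:

import numpy as np
t = np.linspace(0.0, 1.0, 48000)
lhs = np.sin(440 * t) + np.sin(441 * t)
rhs = 2 * np.cos(0.5 * t) * np.sin(440.5 * t)
print(np.max(np.abs(lhs - rhs)))   # on the order of 1e-13, identical up to rounding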

So the question isn't whether an algorithm can distinguish between them but
rather which one of these two interpretations it should pick. And I would
say in most audio applications the best answer is that it should pick the
same interpretation that the human hearing system would. In this example
it's clearly the right-hand side. In the case of a large separation (e.g.
440Hz and 550Hz, a major third) it's clearly the left-hand side. And
somewhere in between I guess it must be a toss-up.

-Ethan
Thanks for your time
Lets assume a spectrum of size N, can you create a meaningfull spectrum of
size N/2
by simply adding every other bin together?
Neglecting the artefacts of the forward transform, lets say an artificial
spectrum
(or a spectrum after peak picking that discards the region around the peaks)
Lets say two sinusoids in two adjacent bins, will summing them into a
single bin of a half sized spectrum
make sense and represent them adequately?
In my limited understanding, yes, but I am not sure, and would like to
know why not
if that is not the case.
I'm not sure what "compress" means in this context, nor am I sure what
Post by Ross Bencina
A steady state sine wave in the time domain will be transformed by a
short-time fourier transform into a spectral peak, convolved (in the
frequency domain) by the spectrum of the analysis envelope. If you know
that all of your inputs are sine waves, then you can perform "spectral peak
picking" (AKA MQ analysis) and reduce your signal to a list of sine waves
and their frequencies and phases -- this is the sinusoidal component of
Serra's SMS (explained in the pdf linked above).
Note that since a sinusoid ends up placing non-zero values in every FFT
bin, you'd need to account for that in your spectral estimation, which
basic MQ does not -- hence it does not perfectly estimate the sinusoids.
In any case, most signals are not sums of stationary sinusoids. And since
signals are typically buried in noise, or superimposed on top of each
consider two stable sine waves at 440Hz and 441Hz -- you will need a very
long FFT to distinguish this from a single amplitude-modulated sine wave?
or consider a sine wave plus white noise -- the accuracy of frequency and
phase recovery will depend on how much input you have to work with.
I think by "compression" you mean "represent sparsely" (i.e. with some
reduced representation.) The spectral modeling approach is to "model" the
signal by assuming it has some particular structure (e.g. sinusoids+noise,
or sinusoids+transients+noise) and then work out how to extract this
structure from the signal (or to reassemble it for synthesis).
An alternative (more mathematical) approach is to simply assume that the
signal is sparse in some (unknown) domain. It turns out that if your signal
is sparse, you can apply a constrained random dimensionality reduction to
the signal and not lose any information. This is the field of compressed
sensing. Note that in this case, you haven't recovered any structure.
Ross
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
gm
2018-11-02 20:40:45 UTC
Permalink
Now the synth works quite well with an FFT size of 4096; I had a severe bug
all the time which was messing every other frame's phase up.

I have simple peak picking now for sines+noise synthesis
which sounds much nicer when the sound is frozen.

It's a peak if it's larger than the two adjacent bins and if it was a peak
candidate or peak in either of
three bins in the frame before - since this would miss all the onsets of
peaks,
the analysis is done backwards, with the unwanted side effect that the
last bin of a fading
frequency is recognized as a random part, but you don't seem to notice that.

I use random amplitudes for the noise, but keep the phases
so they are not out of phase when a frequency changes from bin to bin.

What I want to do now is some analysis that detects whether or not
a phase should be reset to the original phase for a transient.

I am not sure if the idea is good or not, since the window is rather large
so it will have the wrong phase before the transient.

Any ideas for that? I thought of taking the cepstrum and looking for new peaks
to identify new tones. But I think this idea is flawed because it misses
the actual broadband transient and comes in too late.

A simple time transient detector on the other hand would reset all partials
which is not wanted either.

Any other ideas?
gm
2018-11-03 01:53:59 UTC
Permalink
Post by gm
Any other ideas?
ok the answer is already in my post: just analyze backwards.
It's possibly part of a transient when the backwards-tracked partial
stops existing.
Ross Bencina
2018-11-03 04:59:08 UTC
Permalink
Post by Ethan Fenn
No length of FFT will distinguish between a mixture of these sine waves
and a single amplitude-modulated one, because they're mathematically
sin(440t) + sin(441t) = 2*cos(0.5t)*sin(440.5t)
So the question isn't whether an algorithm can distinguish between them
but rather which one of these two interpretations it should pick. And I
would say in most audio applications the best answer is that it should
pick the same interpretation that the human hearing system would. In
this example it's clearly the right-hand side. In the case of a large
separation (e.g. 440Hz and 550Hz, a major third) it's clearly the
left-hand side. And somewhere in between I guess it must be a toss-up.
I guess you could model both simultaneously, with some kind of
probability weighting.

Ross.
gm
2018-11-03 06:12:39 UTC
Permalink
And I think you can model them simply by adding their phasors/bins/numbers...

for opposite angles they will cancel, for the same angle they will be
amplified

so the model is correct at the center of the window, but it models just
an instance in time and spreads this instance

in this way of thinking you basically consider the DFT a sine bank
oscillator
Post by Ross Bencina
Post by Ethan Fenn
No length of FFT will distinguish between a mixture of these sine
waves and a single amplitude-modulated one, because they're
sin(440t) + sin(441t) = 2*cos(0.5t)*sin(440.5t)
So the question isn't whether an algorithm can distinguish between
them but rather which one of these two interpretations it should
pick. And I would say in most audio applications the best answer is
that it should pick the same interpretation that the human hearing
system would. In this example it's clearly the right-hand side. In
the case of a large separation (e.g. 440Hz and 550Hz, a major third)
it's clearly the left-hand side. And somewhere in between I guess it
must be a toss-up.
I guess you could model both simultaneously, with some kind of
probability weighting.
Ross.
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
robert bristow-johnson
2018-11-03 06:18:40 UTC
Permalink
personally, i believe this issue is resolved with the choice of the window size (or "frame length", not to be conflated with "frame hop") that goes into the FFT.



for a sufficiently large window, the two sinusoids will appear as two separate peaks in the FFT result.
for a sufficiently small window, it will be a single sinusoid, but the amplitude will vary in adjacent windows.
that's my spin on the issue.
r b-j





---------------------------- Original Message ----------------------------

Subject: Re: [music-dsp] two fundamental questions Re: FFT for realtime synthesis?

From: "gm" <***@voxangelica.net>

Date: Sat, November 3, 2018 2:12 am

To: music-***@music.columbia.edu

--------------------------------------------------------------------------
And I think you can model them simply by adding their phasors/bins/numbers...
for opposite angles they will cancel, for the same angle they will be
amplified
so the model is correct at the center of the window, but it models just
an instance in time and spreads this instance
in this way of thinking you basically consider the dft a sine bank
oscillator
Post by Ross Bencina
Post by Ethan Fenn
No length of FFT will distinguish between a mixture of these sine
waves and a single amplitude-modulated one, because they're
sin(440t) + sin(441t) = 2*cos(0.5t)*sin(440.5t)
So the question isn't whether an algorithm can distinguish between
them but rather which one of these two interpretations it should
pick. And I would say in most audio applications the best answer is
that it should pick the same interpretation that the human hearing
system would. In this example it's clearly the right-hand side. In
the case of a large separation (e.g. 440Hz and 550Hz, a major third)
it's clearly the left-hand side. And somewhere in between I guess it
must be a toss-up.
I guess you could model both simultaneously, with some kind of
probability weighting.
Ross.
 
 
 


--



r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
Ross Bencina
2018-11-03 09:48:39 UTC
Permalink
[resending, I think I accidentally replied off-list]
Post by gm
Lets assume a spectrum of size N, can you create a meaningfull
spectrum of size N/2
Post by gm
by simply adding every other bin together?
Neglecting the artefacts of the forward transform, lets say an
artificial spectrum
Post by gm
(or a spectrum after peak picking that discards the region around the peaks)
Lets say two sinusoids in two adjacent bins, will summing them into a
single bin of a half sized spectrum
Post by gm
make sense and represent them adequately?
In my limited understanding, yes, but I am not sure, and would like
to know why not
Post by gm
if that is not the case.
You can analyze this by looking at the definition of the short-time
discrete Fourier transform. (Or the corresponding C code for a DFT).

Each spectral bin is the sum of samples in the windowed signal
multiplied by a complex exponential.

Off the top of my head, assuming a rectangular window, I think you'll
find that dropping every second bin in the length N spectrum gives you
the equivalent of the bin-wise sum of two length N/2 DFTs computed with
hop size N/2.

Summing adjacent bins would do something different. You could work it
out by taking the definition of the DFT and doing some algebra. I think
you'd get a spectrum with double the amplitude, frequency shifted by
half the bin-spacing. (i.e. the average of the two bin's center
frequencies).
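
A quick numerical check of the first of those statements (numpy,
rectangular window assumed):

import numpy as np

N = 16
x = np.random.randn(N)
full = np.fft.fft(x)                                      # one length-N DFT
halves = np.fft.fft(x[:N // 2]) + np.fft.fft(x[N // 2:])  # two length-N/2 DFTs, hop N/2, summed bin-wise
print(np.allclose(full[::2], halves))                     # True: the even bins of the big DFT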

Ross.
gm
2018-11-03 12:49:06 UTC
Permalink
as I wrote before, it depends on the phase angle whether or not the
amplitudes add up when you add the numbers

with opposite phase, and equal amplitude, the amplitudes cancel, with
same phase they add up,

with a phase difference of pi/2 it is sqrt(2) times the amplitude, etc.
(in general, |1 + e^(j*phi)| = 2*|cos(phi/2)|),

but I made a quick hack test and it doesn't sound as good as the full
spectrum, something in quality is missing

I am still investigating and thinking about this so I am not sure what I
am missing,

I mean there exists a half-sized spectrum that would represent the signal
perfectly (or rather, two of them would)
maybe it has to do with the fact that you need two successive spectra
to represent the same information
but I don't really see the effect of that other than it has a better time
resolution
Post by Ross Bencina
[resending, I think I accidentally replied off-list]
Post by gm
Lets assume a spectrum of size N, can you create a meaningfull
spectrum of size N/2
Post by gm
by simply adding every other bin together?
Neglecting the artefacts of the forward transform, lets say an
artificial spectrum
Post by gm
(or a spectrum after peak picking that discards the region around
the peaks)
Post by gm
Lets say two sinusoids in two adjacent bins, will summing them into
a single bin of a half sized spectrum
Post by gm
make sense and represent them adequately?
In my limited understanding, yes, but I am not sure, and would like
to know why not
Post by gm
if that is not the case.
You can analyze this by looking at the definition of the short-time
discrete Fourier transform. (Or the corresponding C code for a DFT).
Each spectral bin is the sum of samples in the windowed signal
multiplied by a complex exponential.
Off the top of my head, assuming a rectangular window, I think you'll
find that dropping every second bin in the length N spectrum gives you
the equivalent of the bin-wise sum of two length N/2 DFTs computed
with hop size N/2.
Summing adjacent bins would do something different. You could work it
out by taking the definition of the DFT and doing some algebra. I
think you'd get a spectrum with double the amplitude, frequency
shifted by half the bin-spacing. (i.e. the average of the two bin's
center frequencies).
Ross.
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
Theo Verelst
2018-11-04 02:03:26 UTC
Permalink
It's a complicated subject when you fill in all the boundary conditions properly, isn't it?

Lots of frequency considerations look a bit alike but aren't mathematically equivalent.

The human (and animal I suppose) hearing is very sensitive to a lot of these issues in all
kinds of convoluted combinations, so finding a heuristic for some problem is easily going
to be food for yet another theory getting disproven. Some theories (like Fourier) can only
be used, ultimately and mathematically, to close-to-infinite accuracy and success when applied
properly; this is usually where the problem lies.

It might help to understand why in this case you'd choose a computation according to
an IFFT scheme for synthesis. Is it for complementary processing steps, efficiency, because
you have data that fits the practical method in terms of granularity, theoretical
interest, example software, some sort of juxtaposition of alternatives, or maybe a well
known engineering example where this appears logical?

T.V.
gm
2018-11-04 11:56:55 UTC
Permalink
Post by Theo Verelst
It might help to understand why in this case you'd chose for the
computation according to a IFFT scheme for synthesis. Is it for
complimentary processing steps, efficiency, because you have data that
fits the practical method in terms of granularity, theoretical
interest, example software, some sort of juxtaposition of
alternatives, or maybe a well known engineering example where this
appears logical ?
Originally it was a UI study and proof-of-concept engine for a granular
cloud synth,
then I replaced the grain engine with FFT because I had some misconceptions:

- I thought you could get the granular cloud sound by randomizing the
FFT, but it sounds very different

- I thought you would get a better time resolution by using a smaller
hop size in the analysis
but that is not the case. You would need a smaller window, but then the
frequency analysis is bad.

- Also I thought you could use a larger FFT for analysis and compress
it into a smaller one for resynthesis
and benefit from both, but it's the other way around it seems, you get the
disadvantages of both;
at least it didn't work so well here and sounded like the smaller FFT
except that transients are blurred more.

- Finally I thought you could reconstruct the waveshape on transients by
using the original phase,
but it doesn't seem to make a big audible difference and my transient
detection
doesn't work so well because the transients are already spread in time by
the analysis,
so this reconstructed waveshape doesn't come in at the right moment.

Still I think that somehow it could be possible with the FFT, using a
time-domain pitch shift,
but doing it in that order, iFFT -> time-domain shift, seems to be
more complicated compared to the other way, td shift -> FFT, which was
my first implementation.
I think for the other way you always need a varying number of FFTs per
unit of time?
Also I think the resolution issue will be the same.
gm
2018-11-04 13:55:10 UTC
Permalink
Maybe you could make the analysis with a filterbank, and do the
resynthesis with FFT?

Years ago I made such a synth based on "analog" Fourier Transforms
(the signal is modulated and rotated down to 0 frequency and the
frequencies around DC are lowpass filtered
depending on the bandwidth you need);
it did the inverse transform with modulated sines, but sounded rather
reverberant though,
since the analysis has a rather long IR on the lowest bands.
I used 72 bands based on a Bark scale.

Maybe I should try something like that again...
However it's absolutely not clear to me how to go from 72 (or whatever)
broadband bands
to FFT frequencies and bands?

Maybe you could just use the amplitude spectrum from the bank and the
frequencies from an FFT.
Any objections?

The frequencies would be wrong before transients, but the benefit is
that you get a good spectral envelope
and you can have a really sharp transient detection.
gm
2018-11-04 16:00:47 UTC
Permalink
ok, now I tried a crude and quick multiresolution FFT analysis on a log-2
basis and it seems to work.
However my partial and transient tracking does not work anymore on the
higher bands since there is too much modulation from the windows now,
but it seems that in general this is the direction to go, with some
refinements.
Post by gm
Maybe you could make the analysis with a filterbank, and do the
resynthesis with FFT?
Years ago I made such a synth based on "analog" Fourier Transforms,
(the signal is modulated and rotated down to 0 Frequency and that
frequencies around DC are lowpass filtered
depending on the bandwitdh yoe need)
it it did the inverse transform by modulated sines, but sounded rather
reverberant though
since the analysis has a rather long IR on the lowest bands.
I used 72 Bands based on a Bark scale.
Maybe I should try something like that again...
However it's absolutely not clear to me how to go from 72 (or what
ever) broadband bands
to FFT frequencies and bands?
Maybe you could just use the amplitude spectrum from the bank and the
frequencies from an FFT.
Any objections?
The frequencies would be wrong before transients, but the benefit is
that you get a good spectral envelope
and you can have a really sharp transient detection.
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
gm
2018-11-04 18:54:54 UTC
Permalink
Post by gm
ok I now I tried a crude and quick multiresolution FFT analysis at log
2 basis
I halve the window size (Hann) for every FFT.

To compensate for the smaller window, I multiply by the factor by which it
is smaller, that is 2, 4, 8.

But it appears that there is noticeably more energy now in the higher
bands with the smaller windows.

How do I compensate for the window properly?
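
For what it's worth, the two window gains I know of (a numpy sketch; my
guess is that the noise gain is the part I am missing, but I am not sure):

import numpy as np

for N in (4096, 2048, 1024, 512):
    w = np.hanning(N)
    coherent = np.sum(w)              # peak bin of an on-bin sinusoid scales with this
    noise = np.sqrt(np.sum(w ** 2))   # bin magnitude of white noise scales with this
    print(N, coherent, noise, coherent / noise)

Dividing by the coherent gain (which is roughly what the factor 2, 4, 8
amounts to for a Hann window) keeps sinusoidal peaks at the same level
across window sizes, but a noise bin only shrinks with the square root of
the window length, so it comes out roughly sqrt(2) louder per halving -
which might be the extra energy I am seeing in the higher bands.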
Scott Cotton
2018-11-04 21:07:33 UTC
Permalink
The following may help

https://www.dsprelated.com/freebooks/sasp/STFT_COLA_Decomposition.html
http://lonce.org/Publications/publications/2007_RealtimeSignalReconstruction.pdf
https://hal.archives-ouvertes.fr/hal-00453178/document

librosa uses the below (If you have access)
[1]D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time
Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236–243, Apr. 1984.

The biggest thing to note however is that if you modify the spectra in an
STFT (or by taking in "grains") then the modified sequence of spectra no
longer necessarily corresponds to any signal, and so some sort of estimation
is used. If you're doing power-of-2 sizes and a corresponding subband
decomposition, you'd apply that to each band. The Gabor frame stuff looks
like it has multiple time-frequency tilings as well.

I believe reconstructing with windowing is one of the hardest parts of
frequency domain processing to do well. It is still being researched, and
it is one big reason why phase vocoders aren't, in my opinion, a solved
problem.

Scott
Post by gm
Post by gm
ok I now I tried a crude and quick multiresolution FFT analysis at log
2 basis
I half the window size (Hann) for every FFT.
To compensate for the smaller window, I multiply by the factor that it
is smaller, that is 2, 4, 8,
But it appears that there is noticably more energy now in the higher
bands with the smaller windows.
How do I compensate for the window properly?
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
--
Scott Cotton
http://www.iri-labs.com
gm
2018-11-04 21:49:39 UTC
Permalink
Thanks for the links.
At the moment it's still too reverberant even with multiresolution.

Probably Gabor windows are part of the solution, I still have to look at
these papers closely.
I will try those, but probably they only make sense with more multires
bands than I have now.

Also I want to get the synthesis stage at 2048 FFT size.

How I do it now:

analysis:
at 4096 FFT size, log-2 multiresolution (7 bands), 16 overlaps (Hann
windows),
but peak tracking and phase tracking at 4096 FFT size without
multiresolution for better tracking;
it's a peak if it was a peak candidate in the frame after it or
in adjacent bins;
it's a peak candidate if it's larger than the adjacent bins and larger than a
noise threshold;
it's a transient if a peak track starts (or actually stops, regarded
backwards).

synthesis
sines and noise synthesis (rough per-bin sketch below):
if it's not at a tracked peak, it's noise, and synthesised with random
amplitudes but updated phases

transients:
if it's a transient, use the original phase, accumulated phase otherwise;
do transients only once when they are crossed

freqshift:
updated phases for timestretch and freq. shift or transients and
frequency shift;
shift amplitude bins for frequency shift

formants:
filter original spectrum on log-2 scale with increasingly long filters
(should be done offline);
divide by filtered spectrum, multiply by shifted filtered spectrum
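
Very roughly, per frame and per bin, the sines-and-noise decision above
(a numpy-ish sketch; the transient branch and the frequency shift are left
out, and all the names are placeholders):

import numpy as np

def synth_frame(mag, is_peak, prev_phase, analysis_phase, bin_freq, hop, stretch):
    # tracked peaks: accumulate phase by the (stretched) expected advance;
    # everything else: treat as noise, randomise the amplitude, keep the
    # analysed phase
    out = np.zeros(len(mag), dtype=complex)
    phase = prev_phase.copy()
    for k in range(len(mag)):
        if is_peak[k]:
            phase[k] = prev_phase[k] + bin_freq[k] * hop * stretch
            out[k] = mag[k] * np.exp(1j * phase[k])
        else:
            phase[k] = analysis_phase[k]
            out[k] = mag[k] * np.random.uniform(0.0, 2.0) * np.exp(1j * phase[k])
    return out, phase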


Here is how it sounds, all examples are overly time-stretched (and/or
shifted) to better hear its limits:
https://soundcloud.com/traumlos_kalt/transcoder-096-2/s-3bqkl
Still reverberant, transients are better but still lacking, otherwise I
am halfway content.

Gabor windows, I will try these, further suggestions?

Especially for the transient detection, because at the moment it comes
in too early
for the lower bands, due to the temporal dilution.
Post by Scott Cotton
The following may help
https://www.dsprelated.com/freebooks/sasp/STFT_COLA_Decomposition.html
http://lonce.org/Publications/publications/2007_RealtimeSignalReconstruction.pdf
https://hal.archives-ouvertes.fr/hal-00453178/document
librosa uses the below (If you have access)
[1]D. W. Griffin and J. S. Lim, “Signal estimation from modified
short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2,
pp.236–243, Apr. 1984.
The biggest thing to note however is that if you modify the spectra in
an STFT (or by taking in "grains") then the modified sequence of
spectra no longer necessarily coincides to any signal, and so some
sort of estimation is used.  If you're doing power of 2 sizes and
corresponding subbed decomposition, you'd apply that to each band. 
The Gabor frame stuff looks like it has multiple time-frequency
tilings as well.
I believe reconstructing with windowing is one of the hardest parts of
frequency domain processing to do well.  It is still being researched,
and it is one big reason why phase vocoders aren't, in my opinion, a
solved problem.
Scott
Post by gm
ok I now I tried a crude and quick multiresolution FFT analysis
at log
Post by gm
2 basis
I half the window size (Hann) for every FFT.
To compensate for the smaller window, I multiply by the factor that it
is smaller, that is 2, 4, 8,
But it appears that there is noticably more energy now in the higher
bands with the smaller windows.
How do I compensate for the window properly?
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
--
Scott Cotton
http://www.iri-labs.com
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
Scott Cotton
2018-11-04 22:49:15 UTC
Permalink
Post by gm
Thanks for the links.
At the moment it's still to reverberant even with multiresolution.
Probably Gabor windows are part of the solution, I still have to look at
these papers closely.
I will try those, but probably they only make sense with more multires
bands then I have now.
Also I want to get the synthesis stage at 2048 FFT size.
analysis
at 4096 FFT size, log 2 multiresolution (7 bands), 16 overlaps (Hann
Windows)
but peak tracking and phase tracking at 4096 FFT size without
multiresolution for better tracking
it's a peak if it was a peak candidate in the frame after the frame or in
adjacent bins
it's a peak candidate if its larger than adjacent bins and larger than a
noise threshold
it's a transient if a peak track starts (or actually stops, regarded
backwards).
synthesis
if its not at tracked peak, its noise, and synthesised with random
amplitudes but updated phases
if it's transient, use original phase, accumlated phase otherwise
do transients only once when they are crossed
note that in polyphonic sources, transients may only apply to one of the
sources, so
if you define transient as a slice of time say of a percussive onset, like
guitar, maybe
other strings or instruments are sustaining at the same time, and then the
transient
part may sound weird w.r.t. continuity of the other parts.
Post by gm
updated phases for timestretch and freq. shift or transients and frequency
shift
shift amplitude bins for frequency shift
filter original spectrum on log2 scale with increasingly long filters
(should be done offline)
divide by filtered spectrum, multiply by shifted filtered spectrum
I don't understand what you're doing with formants, would be interested to
learn more.
Post by gm
Here is how it sounds, all examples are too time stretched (and or
shifted) to better hear it's limits
https://soundcloud.com/traumlos_kalt/transcoder-096-2/s-3bqkl
Still reverberant, transients are better but still lacking, otherwise I am
half way content
Gabor windows, I will try these, further suggestions?
phavorit
<https://hci.rwth-aachen.de/materials/publications/karrer2006a.pdf> gives a
nice overview. There is some other tool I looked into once which did quite
well
and had binary downloads for a standalone tool. It was part of a thesis at
a UK or US east coast music
school I think, but I've lost the reference. I think it was from about
10-15 years ago. Maybe that will ring a bell
to someone else?

Also rubber band library, last I looked, does some sort of interpolation of
the windows relating
the input window size to output size used in resynthesis, I think after
interpolation it might
just do the window inverse, but I'm not sure.
Post by gm
Especially for the transient detection, because at the moment it comes in
too early
for the lower bands, due to the temporal dilution.
Maybe you can do transient processing as pre-processing and recombine
after the harmonic
parts are treated, heuristically subtracting the transient part of the
signal (detected for example
by spectral flux), i.e. subtract out the noisy, flatter part of the spectrum.
Post by gm
The following may help
https://www.dsprelated.com/freebooks/sasp/STFT_COLA_Decomposition.html
http://lonce.org/Publications/publications/2007_RealtimeSignalReconstruction.pdf
https://hal.archives-ouvertes.fr/hal-00453178/document
librosa uses the below (If you have access)
[1]D. W. Griffin and J. S. Lim, “Signal estimation from modified
short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236–243,
Apr. 1984.
The biggest thing to note however is that if you modify the spectra in an
STFT (or by taking in "grains") then the modified sequence of spectra no
longer necessarily coincides to any signal, and so some sort of estimation
is used. If you're doing power of 2 sizes and corresponding subbed
decomposition, you'd apply that to each band. The Gabor frame stuff looks
like it has multiple time-frequency tilings as well.
I believe reconstructing with windowing is one of the hardest parts of
frequency domain processing to do well. It is still being researched, and
it is one big reason why phase vocoders aren't, in my opinion, a solved
problem.
Scott
Post by gm
Post by gm
ok I now I tried a crude and quick multiresolution FFT analysis at log
2 basis
I half the window size (Hann) for every FFT.
To compensate for the smaller window, I multiply by the factor that it
is smaller, that is 2, 4, 8,
But it appears that there is noticably more energy now in the higher
bands with the smaller windows.
How do I compensate for the window properly?
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
--
Scott Cotton
http://www.iri-labs.com
_______________________________________________
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
--
Scott Cotton
http://www.iri-labs.com
gm
2018-11-04 23:22:07 UTC
Permalink
Post by Scott Cotton
note that in polyphonic sources, transients may only apply to one of
the sources, so
if you define transient as a slice of time say of a percussive onset,
like guitar, maybe
other strings or instruments are sustaining at the same time, and then
the transient
part may sound weird w.r.t. continuity of the other parts.
I track partials, and when a new partial comes in, it is treated as a
part of a transient.
So only those partials that belong to the transient are set to the
original phase.

But the issue is the temporal spread of the FFT: the low frequencies may
come in softly before the high frequencies, so a new partial is tracked
before the actual transient.

So the whole thing seems to work for the higher frequencies (which are
important for transients)
but not the low frequencies. Or doesn't work at all, it's hard to hear a
difference tbh.
Also, when the frequency is shifted, the waveform is squeezed and only
the squeezed part
has the right wave shape - if it has the right wave shape at all, I am not
100% sure it has,
but my tests seem to suggest that, and it seems to be the right idea.
Post by Scott Cotton
filter original spectrum on log2 scale with increasingly long
filters (should be done offline)
divide by filtered spectrum, multiply by shifted filtered spectrum
I don't understand what you're doing with formants, would be
interested to learn more.
They are set to be fixed, so they don't shift with the frequency, for a
natural sound,
and you can also shift them if you want, as can be heard briefly on the
"male" voice in the example I posted.

The spectrum is filtered (smoothed) on a log-2 basis to obtain a
spectral envelope;
by that I mean the filter length doubles with each octave of the bin
index.
The filters are simple averaging FIR filters, I start with a 5-bin average.
I also filter forwards and backwards, so the filter is applied twice, for
symmetry reasons
and also for a better filter group response.
This is somewhat similar but not identical to filtering on an ERB scale
or Mel scale,
which might give better results, but would be more laborious to do, it
was a quick hack.
Also my way could be considered more physical.

Then the original spectrum is divided by the smoothed spectrum, which whitens
the signal, that is, this removes all formants, but keeps the difference
between the spikes of the partials
and the dips where there is nothing - so sideband artefacts or the noise
floor are not whitened.

Then the smoothed spectrum is shifted, and the whitened spectrum
multiplied with that,
vocoding with itself basically, so now it has the new formants.
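
In code terms, roughly (a numpy sketch of my reading of the above; the
growth law of the smoothing length is only approximated, the start bin of
the doubling is an arbitrary choice, and the forward/backward pass is left
out):

import numpy as np

def smooth_log2(mag, base=5, start_bin=16):
    # moving average whose length doubles per octave of the bin index,
    # i.e. grows roughly proportionally with the bin index: a crude
    # spectral envelope
    env = np.empty_like(mag)
    for k in range(len(mag)):
        length = max(base, int(base * (k + 1) / start_bin))
        lo, hi = max(0, k - length // 2), min(len(mag), k + length // 2 + 1)
        env[k] = np.mean(mag[lo:hi])
    return env

def impose_envelope(mag, wanted_env):
    # remove the spectrum's own envelope ("whiten"), then multiply by the
    # wanted one (the original envelope for fixed formants, a shifted one
    # for a formant shift)
    white = mag / (smooth_log2(mag) + 1e-12)
    return white * wanted_env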
gm
2018-11-05 00:14:02 UTC
Permalink
bear with me, I am a math illiterate.

I understand you can do a Discrete Fourier Transform in matrix form,
and for the 2-point case it is simply
[ 1, 1
  1,-1]
like the Haar transform, average and difference.



My idea is, to use two successive DFT frames, and to transform
resepctive bins of two successive frames like this.
To obtain a better frequency estimate (subbands) from two smaller DFTs
instead of an DFT double the size.

This should be possible? and the information obtained, time and
frequency resolution wise, identical.
Except that you can overlap the two DTFs.

I basically want to find the dominant frequency in the FFT bin, and
sepreate it and discard the rest.
And a subband resolution of 2 seems to be a sufficient increase in
resolution.

But how do I get that from this when there is no phase other then 0?
I can see which of the two bands has more energy, but how do I know "the
true frequency" of Nyquist and DC?
There is not enough information.

The problem persists for me if I resort to a 4-point transform, what to
do with the highest/lowest subband.
(and also to understand how to calculate the simple 4 point matrix,
cause I am uneducated..)

Or do I need the 4-point case and discard "DC" and Nyquist subbands?
Or is the idea totally nonsens?
Or is it justified to pick the subband that has more energy, and then, what?
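
(A minimal numpy sketch of the idea, in case that helps - sum and difference
of the same bin over two back-to-back frames; note this is not an exact
recovery of the double-length DFT's bins:)

import numpy as np

N = 512
t = np.arange(2 * N)
x = np.cos(2 * np.pi * (10.0 / N) * t)     # sinusoid exactly at bin 10

A = np.fft.rfft(x[:N])                     # frame 1 (no window, back-to-back frames)
B = np.fft.rfft(x[N:])                     # frame 2
k = 10
s, d = A[k] + B[k], A[k] - B[k]            # 2-point transform across the two frames
print(abs(s), abs(d))                      # sum >> difference: component at the bin centre

x2 = np.cos(2 * np.pi * (10.5 / N) * t)    # half a bin higher
A2, B2 = np.fft.rfft(x2[:N]), np.fft.rfft(x2[N:])
print(abs(A2[k] + B2[k]), abs(A2[k] - B2[k]))   # now the difference dominates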
Theo Verelst
2018-11-08 17:28:33 UTC
Permalink
No matter how you go about this, the Fast Fourier transform will in almost every case
act as some sort of ensemble measurement over its length, and maybe do some
filtering between consecutive transform steps. Maybe you even continuously
average in the frequency domain, using per-sample sliding FFT frames; even then
you're measuring which "bins" of the FFT respond to the samples in your signal,
not more and not less.

Usually, if the math/algebra doesn't prove anything conclusive, no high-spun
expectations about the mathematical relevance of the result are in order...


T.V.
robert bristow-johnson
2018-10-29 20:33:37 UTC
Permalink
my comments below are about time-scaling without pitch shifting.  a time-scaler coupled with resampling (like sample-rate conversion) will make you a pitch shifter.

---------------------------- Original Message ----------------------------

Subject: Re: [music-dsp] pitch shifting in frequency domain Re: FFT for realtime synthesis?

From: "gm" <***@voxangelica.net>

Date: Mon, October 29, 2018 2:12 pm

To: music-***@music.columbia.edu

--------------------------------------------------------------------------
Post by Ethan Duni
You should have a search for papers by Jean Laroche and Mark Dolson,
such as "About This Phasiness Business" for some good information on
phase vocoder processing. They address time scale modification mostly
in that specific paper, but many of the insights apply in general, and
you will find references to other applications.
...
I also tried Miller Pucketts phase locking mentioned in his book The
Theory and Technique of Electronic Music and also mentioned in the paper,
but a) I don't hear any difference and b) I don't see how and why it
should work.
 
wow!  my experience from a long time ago is that it makes a BIG FUCKING difference.  before Miller Puckette's observation, if we were doing the basic Portnoff PV, the phase adjustment made to one bin could be (and from numerical issues, often was) wildly different from the
phase adjustment made to the adjacent bin.  that will for sure fuck up the sinusoid.  all you have to worry about, to do Miller Puckette's correction, is to use a **single** phase adjustment for the whole lobe or bump or peak corresponding to a single sinusoid, and then your sinusoid will
splice nicely to the corresponding sinusoid of the previous frame.  but if, for a single lobe, the phase adjustment is wild and crazy for each bin, you will mangle that sinusoid.
but Miller's thing doesn't take care of the other phasiness problem that Jean and Mark were dealing with, and
that is waveshape variance due to different harmonics (of the same tonal instrument) being shifted by different amounts of time (a time-domain WSOLA time-scaler doesn't have that problem.)  that, and the transient problem (how to keep your PV from mangling an attack transient), are the main
problems with PV or even with sinusoidal modeling.  i dunno how Melodyne does it.


the last i worked on this seriously was 17 years ago.  i got a paper and old MATLAB code that demonstrates how to timescale each sinusoidal component *within* a single frame.  there are still the issues of phase adjustment to make that sinusoid splice nicely to the previous frame.
and a swept-frequency sine sounded just fine.  the bigger problem was not the slow sweep (the regular PV did fine with a slow sweep) but was about the fast sweep.  that's when this intraframe correction was useful.  i offered this code up before, but again, if someone wants either or
both this 2001 paper or the MATLAB code from that project, i can send it to you (just email me).  it turned out that the intraframe correction didn't help audibly, but we could show that it helped visually with a changing frequency.
 
--


r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
Ian Esten
2018-10-23 22:48:59 UTC
Permalink
The Kurzweil K150 is the first product I can think of that did it. To
create custom sounds for it required the use of software that modeled the
sound using partial amplitudes over time. It's a very powerful technique
for synthesising certain types of sound, such as a piano, where frequencies
of partials do not change. The more rapidly partial frequencies change, the
more complicated it becomes to model the resulting spectrum.

It's fun stuff.
robert bristow-johnson
2018-11-05 00:39:53 UTC
Permalink
mr. g, I think what you're describing is the Cooley-Tukey Radix-2 FFT algorithm.

--
r b-j                     ***@audioimagination.com

"Imagination is more important than knowledge."

-------- Original message --------
From: gm <***@voxangelica.net>
Date: 11/4/2018 4:14 PM (GMT-08:00)
To: music-***@music.columbia.edu
Subject: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?

gm
2018-11-05 00:56:39 UTC
Permalink
Post by robert bristow-johnson
mr. g,
I think what you're describing is the Cooley-Tukey Radix-2 FFT algorithm.
yes, that seems kind of right, though I am not describing something but
actually posting a question,
and the "other thing" was an answer to a question.

maybe my post was too long, boring and naive for you, so I am sorry for
that.
I am reusing the same subject line to help people ignore this, if
they want.

so you do the "radix 2 algorithm", if you will, on a subband, and now what?
the bandlimits are what? the neighbouring upper and lower bands?

how do I get a frequency estimate "in between" out of these two real
values that describe the upper and lower limit of the band but have no
further information?

thank you.
gm
2018-11-05 05:28:55 UTC
Permalink
Post by gm
so you do the "radix 2 algorithm" if you will on a subband, and now what?
the bandlimits are what? the neighbouring upper and lower bands?
how do I get a frequency estimate "in between" out of these two real
values that describe the upper and lower limit of the band but have no
further information?
thank you.
The way I see it:

If you do that 2-point transform on a band you get 2 data points instead
of one (or rather instead of two successive ones, of course),
representing the upper and lower bandwidth limit of the band, but not
very well separated.
But if you take the result of the previous frame also into account you
now get 4 points representing the corners of a bin
of the original spectrum, so to say, however in between spectra, and you
can now do bilinear interpolation between these 4 points.

But in the end this is just crude averaging between two successive
spectra, and I am not sure if it sounded better
or worse. It's hard to tell a difference; it works quite well on a sine
sweep though.

But there must be a better way to make use of these 2 extra data points.

In the end you now have the same amount of information as with a
spectrum of double size.
So you should be able to obtain the same quality from that.
That was my way of thinking, however flawed that is; I'd like to know.
Ethan Fenn
2018-11-05 15:17:52 UTC
Permalink
It's not exactly Cooley-Tukey. In Cooley-Tukey you take two _interleaved_
DFT's (that is, the DFT of the even-numbered samples and the DFT of the
odd-numbered samples) and combine them into one longer DFT. But here you're
talking about taking two _consecutive_ DFT's. I don't think there's any
cheap way to combine these to exactly recover an individual bin of the
longer DFT.

Of course it's possible you'll be able to come up with a clever frequency
estimator using this information. I'm just saying it won't be exact in the
way Cooley-Tukey is.
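
(For reference, the interleaved combine in throwaway numpy - the standard
radix-2 decimation-in-time step:)

import numpy as np

x = np.random.randn(8)
E = np.fft.fft(x[0::2])            # DFT of the even-numbered samples
O = np.fft.fft(x[1::2])            # DFT of the odd-numbered samples

k = np.arange(4)
W = np.exp(-2j * np.pi * k / 8)    # twiddle factors

X = np.concatenate([E + W * O, E - W * O])
print(np.allclose(X, np.fft.fft(x)))    # True: exact recovery of the long DFT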

-Ethan
gm
2018-11-05 17:43:50 UTC
Permalink
Post by Ethan Fenn
Of course it's possible you'll be able to come up with a clever
frequency estimator using this information. I'm just saying it won't
be exact in the way Cooley-Tukey is.
Maybe, but not the way I laid it out.

Also, it seems wiser to interpolate spectral peaks, as has been suggested
to me before.

But it doesn't sound much better than getting the frequency from the
phase step, so the bad sound for frequency shifts at an FFT size of 2048
has other reasons than just a bad phase estimate.
Maybe it's just stupid to look for a solution for this FFT size and a
frequency-domain shift.
robert bristow-johnson
2018-11-05 19:18:35 UTC
Permalink
 
Ethan, that's just the difference between Decimation-in-Frequency FFT and Decimation-in-Time FFT.
i guess i am not entirely certain of the history, but i credited both the DIT and DIF FFT to Cooley and Tukey.  that might be an incorrect historical impression.



--



r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
Ethan Fenn
2018-11-05 19:34:18 UTC
Permalink
I don't think that's correct -- DIF involves first doing a single stage of
butterfly operations over the input, and then doing two smaller DFTs on
that preprocessed data. I don't think there is any reasonable way to take
two "consecutive" DFTs of the raw input data and combine them into a longer
DFT.

(And I don't know anything about the historical question!)

-Ethan



Ethan Duni
2018-11-05 20:40:06 UTC
Permalink
You can combine consecutive DFTs. Intuitively, the basis functions are
periodic on the transform length. But it won't be as efficient as having
done the big FFT (as you say, the decimation in time approach interleaves
the inputs, so you gotta pay the piper to unwind that). Note that this is
for naked transforms of successive blocks of inputs, not a WOLA filter
bank.

There are Dolby codecs that do similar with a suitable flavor of DCT (type
II I think?) - you have your encoder going along at the usual frame rate,
but if it detects a string of stationary inputs it can fold them together
into one big high-res DCT and code that instead.
Stefan Sullivan
2018-11-06 06:46:09 UTC
Permalink
I'm definitely not the most mathy person on the list, but I think there's
something about the complex exponentials, real transforms and the 2-point
case. For all real DFTs you should get a real-valued sample at DC and
Nyquist, which indeed you do get with your matrix. However, there should be
some complex numbers in a matrix for a 4-point DFT, which you won't get no
matter how many matrices of that form you multiply together. My guess is
that yours is a special case of a DFT Matrix for 2 bins. I suspect if you
took a 4-point DFT Matrix and tried the same it might work out better?

https://en.wikipedia.org/wiki/DFT_matrix
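
(For comparison, the general DFT matrix is easy to generate - a throwaway
numpy helper, dft_matrix is just a made-up name:)

import numpy as np

def dft_matrix(n):
    # W[j, k] = exp(-2*pi*i*j*k/n): the n-point DFT as a matrix
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

print(np.real_if_close(dft_matrix(2)))   # [[1, 1], [1, -1]] - the 2-point case above
print(np.round(dft_matrix(4)))           # entries 1, -1, i, -i: complex, as expected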

Stefan
gm
2018-11-06 13:03:45 UTC
Permalink
The background of the idea was to get a better time resolution
with shorter FFTs and then to refine the frequency resolution.

You would think at first glance that you get the same time resolution
as you would with the longer FFT, but I am not sure; if you do overlaps
you get kind of a sliding FFT, but maybe it's still the same, regardless.

A similar idea would be to do some basic wavelet transform in octaves,
for instance, and then
do smaller FFTs on the bands to stretch and shift them, but I have no idea
if you can do that - if you shift them you exceed their bandlimit, I assume?
And if you stretch them I am not sure what happens; you shift their
frequency content down, I assume?
It's a little bit fuzzy to me what the waveform in such a band represents
and what happens when you manipulate it, or how you do that.

Probably these ideas are nonsense, but how else could you pitch and stretch a
waveform
and preserve transients? With a more or less quick real-time
inverse transform?
Ross Bencina
2018-11-06 13:20:12 UTC
Permalink
Post by gm
A similar idea would be to do some basic wavelet transfrom in octaves
for instance and then
do smaller FFTs on the bands to stretch and shift them but I have no idea
if you can do that - if you shift them you exceed their bandlimit I assume?
and if you stretch them I am not sure what happens, you shift their
frequency content down I assume?
Its a little bit fuzzy to me what the waveform in a such a band represents
and what happens when you manipulate it, or how you do that.
Look into constant-Q and bounded-Q transforms.

Ross.
gm
2018-11-06 15:13:20 UTC
Permalink
At the moment I am using decreasing window sizes on a log 2 scale.

It's still pretty blurred, and I don't know if I just don't have the
right window parameters,
or if a log-2 scale is too coarse and differs too much from an auditory
scale, or if I don't have
enough overlaps in resynthesis (I have four).
Or if it's all of these together.

The problem is the lowest octave or the lowest two octaves, where I need
a long
window for frequency estimation and partial tracking; it just sounded
bad when the window was smaller in this range,
because the frequencies are blurred too much, I assume.

Unfortunately I am not sure what quality can be achieved and where the
limits are with this approach.
gm
2018-11-06 16:45:04 UTC
Permalink
Further tests let me assume that you can do it on a log-2 scale, but that
appropriate window sizes are crucial.

But how to derive these optimal window sizes I am not sure.

I could calculate the bandwidth of the octave band (or an octave/N band)
in ERB,
for instance, but then what? How do I derive a window length from that
for that band?

I understand that bandwidth is inversely proportional to window length.

So it seems very easy actually, but I am stuck here...
gm
2018-11-06 18:35:13 UTC
Permalink
I think I figured it out.

I use 2^octave * SR/FFTsize -> toERBscale -> * log2(FFTsize)/42 as a
scaling factor for the windows.

That means the window of the top octave is about 367 samples at 44100 SR -
does that seem right?

Sounds better, but not so different; still pretty blurry and somewhat
reverberant.

I used the lower frequency limit of the octaves for the window sizes,
and Hann windows, because I don't want the windows to be too small.

Do you think using Gaussian windows and the center of the octave will
make a big difference?

Or do I just need more overlaps in resynthesis now?
gm
2018-11-07 11:32:35 UTC
Permalink
Post by gm
I use 2^octave * SR/FFTsize -> toERBscale -> * log2(FFTsize)/42 as a
scaling factor for the windows.
Means the window of the top octave is about 367 samples at 44100 SR -
does that seem right?
ok this was wrong...

it should be just windowsize = 1/bandwidthERB

but when I use windowsize = 1/bandwidthERB I get windows that are too
small for
the phase vocoder, for instance for the lowest band I get

bandwidthERB (~10Hz) ~= 26 Hz bandwidth, ~ 1705 samples window length

This gives too much modulation or too much bandwidth by the window, even
if I double this.

(I know that ERB does not go to 10 Hz but the same is true for all bands.)

So, what am I doing wrong here?
gm
2018-11-07 20:22:28 UTC
Permalink
Post by gm
but when I use windowsize = 1/bandwitdhERB I get windows that are too
small for
the phase vocoder, for instance for the lowest band I get
bandwidthERB (~10Hz) ~= 26 Hz bandwidth, ~ 1705 samples window length
This gives too much modulation or too much bandwidth by the window,
even if I double this.
according to a paper you have to scale it by the -3 dB width of the
window in bins,
which is given as 1.44 for Hann, but I found that it works
satisfactorily with a scaling of 2.88

the sources are called "On the Use of Windows for Harmonic Analysis with
the Discrete Fourier Transform"
and "A Flexible Multi-Resolution Time-Frequency Analysis Framework for
Audio Signals",
if anyone is interested
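
(In code terms, the rule of thumb looks roughly like this - a sketch only,
using the Glasberg & Moore ERB formula and the empirical 2.88 scaling from above:)

import numpy as np

def erb_bandwidth(f_hz):
    # Glasberg & Moore approximation of the ERB at centre frequency f
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def window_length(f_hz, sr=44100, scale=2.88):
    # window length in samples ~ scale / bandwidth; the scale relates to the
    # -3 dB main-lobe width of the window in bins (about 1.44 for Hann,
    # 2.88 worked better for me in practice)
    return int(round(scale * sr / erb_bandwidth(f_hz)))

print(window_length(100.0), window_length(1000.0), window_length(10000.0))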
Ethan Fenn
2018-11-08 16:11:44 UTC
Permalink
I'd really like to understand how combining consecutive DFT's can work.
Let's say our input is x0,x1,...x7 and the DFT we want to compute is
X0,X1,...X7

We start by doing two half-size DFT's:

Y0 = x0 + x1 + x2 + x3
Y1 = x0 - i*x1 - x2 + i*x3
Y2 = x0 - x1 + x2 - x3
Y3 = x0 + i*x1 - x2 - i*x3

Z0 = x4 + x5 + x6 + x7
Z1 = x4 - i*x5 - x6 + i*x7
Z2 = x4 - x5 + x6 - x7
Z3 = x4 + i*x5 - x6 - i*x7

Now I agree because of periodicity we can compute all the even-numbered
bins easily: X0=Y0+Z0, X2=Y1+Z1, and so on.

But I don't see how we can get the odd bins easily from the Y's and Z's.
For instance we should have:

X1 = x0 + (r - r*i)*x1 - i*x2 + (-r - r*i)*x3 - x4 + (-r + r*i)*x5 + i*x6 +
(r + r*i)*x7

where r=sqrt(1/2)

Is it actually possible? It seems like the phase of the coefficients in the
Y's and Z's advance too quickly to be of any use.
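
(A quick numerical check of that, in throwaway numpy:)

import numpy as np

x = np.random.randn(8) + 1j * np.random.randn(8)
X = np.fft.fft(x)            # the length-8 DFT we want
Y = np.fft.fft(x[:4])        # DFT of the first block
Z = np.fft.fft(x[4:])        # DFT of the second block

# even bins of X are just Y + Z (the basis functions are periodic in 4)
print(np.allclose(X[0::2], Y + Z))   # True

# the odd bins are not any such simple per-bin combination
print(np.allclose(X[1::2], Y - Z))   # False in general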

-Ethan
Ethan Duni
2018-11-08 20:45:20 UTC
Permalink
Not sure you can get the odd bins *easily*, but it is certainly possible.
Conceptually, you can take the (short) IFFT of each block, then do the
(long) FFT of the combined blocks. The even coefficients simplify out as
you observed, the odd ones will be messier. Not sure quite how messy - I've
only looked at the details for DCT cases.

Probably the clearest way to think about it is in the frequency domain.
Conceptually, the two consecutive short DFTs are the same as if we had
taken two zero-padded long DFTs, and then downsampled each by half. So the
way to combine them is to reverse that process: upsample them by 2, and
then add them together (with appropriate compensation for the
zero-padding/boxcar window).

Ethan D
gm
2018-11-09 14:39:32 UTC
Permalink
This is bringing up my previous question again: how do you decimate a
spectrum
by an integer factor properly, can you just add the bins?

The original spectrum represents a longer signal, so I assume folding
of the waveform occurs? But maybe this doesn't matter in practice for
some applications?

The background is still that I want to use a higher resolution for
analysis and
a lower resolution for synthesis in a phase vocoder.
Ethan Duni
2018-11-09 18:16:58 UTC
Permalink
Post by gm
This is brining up my previous question again, how do you decimate a
spectrum
Post by gm
by an integer factor properly, can you just add the bins?
To decimate by N, you just take every Nth bin.
Post by gm
the orginal spectrum represents a longer signal so I assume folding
of the waveform occurs?
Yeah, you will get time-domain aliasing unless your DFT is oversampled
(i.e., zero-padded in time domain) by a factor of (at least) N to begin
with. For critically sampled signals the result is severe distortion (i.e.,
SNR ~= 0dB).
Post by gm
but maybe this doesn't matter in practice for some applications?
The only applications I know of that tolerate time-domain aliasing in
transforms are WOLA filter banks - which are explicitly designed to cancel
these (severe!) artifacts in the surrounding time-domain processing.
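
(The folding is easy to see numerically - a rough numpy sketch: keeping every
Nth bin of a DFT is the same as wrapping the time signal onto a buffer 1/N as
long and adding:)

import numpy as np

N = 2
x = np.random.randn(1024)
X = np.fft.fft(x)

decimated = X[::N]                       # keep every Nth bin
y = np.fft.ifft(decimated)               # back to the (shorter) time domain

folded = x.reshape(N, -1).sum(axis=0)    # time-domain aliasing: wrap and add
print(np.allclose(y, folded))            # True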

Ethan D
gm
2018-11-09 21:19:17 UTC
Permalink
Hmm, my application also has WOLA...

All I find is about up- and downsampling of time sequences and spectra
of the same length.

Summing adjacent bins seemed to be in correspondence with lowpass
filtering and decimation of time sequences,
even though it's not the appropriate sinc filter...

If I just take every other bin, it misses information that a half-sized
spectrum derived from the
original time series would have; for instance, if only bin 1 had content
in the double-sized spectrum,
certainly the downsized spectrum would need to reflect this as DC or
something?

So do I have to apply a sinc filter first and then discard every other bin?

If so, can this be done with another FFT, like a cepstrum, on the bins?

If anyone knows of an easy explanation of down- and up-sampling spectra,
it would be much appreciated.
robert bristow-johnson
2018-11-09 21:31:19 UTC
Permalink
what you're discussing here appears to me to be about perfect reconstruction in the context of Wavelets and Filter Banks.



there is a theorem that's pretty easy to prove that if you have complementary high and low filterbanks with a common cutoff at 1/2 Nyquist, you can downsample both high and low-pass filterbank outputs by a factor of 1/2 and later combine the two down-sampled streams of samples to get perfect
reconstruction of the original.  this result is not guaranteed if you **do** anything to either filter output in the filterbank.
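
The simplest instance of that theorem is the Haar / sum-and-difference pair; a minimal numpy sketch (an illustration, not anything from the post) shows the exact round trip, which holds only as long as the band signals are left untouched:

import numpy as np

# Haar / sum-and-difference pair: the simplest complementary low/high
# split with a cutoff at half Nyquist.  Both bands are kept at half the
# rate; resynthesis is exact as long as the band signals are untouched.

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)               # any even-length test signal

lo = (x[0::2] + x[1::2]) / np.sqrt(2)       # lowpass band, downsampled by 2
hi = (x[0::2] - x[1::2]) / np.sqrt(2)       # highpass band, downsampled by 2

y = np.empty_like(x)
y[0::2] = (lo + hi) / np.sqrt(2)            # upsample and recombine
y[1::2] = (lo - hi) / np.sqrt(2)

print(np.max(np.abs(y - x)))                # ~1e-16: perfect reconstruction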


---------------------------- Original Message ----------------------------

Subject: Re: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?

From: "gm" <***@voxangelica.net>

Date: Fri, November 9, 2018 4:19 pm

To: music-***@music.columbia.edu

--------------------------------------------------------------------------
Post by gm
hm, my application has also WOLA ...
All I find is about up- and downsampling of time sequences and spectra
of the same length.
...
Post by gm
If anyone knows of an easy explanation of down- and up sampling spectra
it would be much appreciated.
..
Post by Ethan Duni
The only applications I know of that tolerate time-domain aliasing in
transforms are WOLA filter banks - which are explicitly designed to
cancel these (severe!) artifacts in the surrounding time-domain
processing.
--



r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
gm
2018-11-09 22:02:15 UTC
Permalink
You got me intrigued with this
 
I actually believe that wavelets are the way to go for such things,
but, besides that, anything beyond a Haar wavelet is too complicated for me
(and I just grasp the Haar very superficially of course),
 
I think one problem is the one you mentioned - don't do anything
with the bands,
only then do you have perfect reconstruction
 
And what do you do with the bands to make a pitch shift or to
preserve formants/do some vocoding?
 
It's not so obvious (to me), my naive idea I mentioned earlier in this
thread was to
do short FFTs on the bands and manipulate the FFTs only
 
But how? if you time stretch them, I believe the pitch goes down (that's
my intuition only, I am not sure)
and also, these bands alias, since the filters are not brickwall,
and the aliasing is only canceled on reconstruction I believe?
 
So, yes, very interesting topic, that could lead me astray for another
couple of weeks but without any results I guess
 
I think as long as I don't fully grasp all the properties of the FFT and
phase vocoder I shouldn't start anything new...
Post by robert bristow-johnson
what you're discussing here appears to me to be about perfect
reconstruction in the context of Wavelets and Filter Banks.
there is a theorem that's pretty easy to prove that if you have
complementary high and low filterbanks with a common cutoff at 1/2
Nyquist, you can downsample both high and low-pass filterbank outputs
by a factor of 1/2 and later combine the two down-sampled streams of
samples to get perfect reconstruction of the original.  this result is
not guaranteed if you **do** anything to either filter output in the
filterbank.
---------------------------- Original Message ----------------------------
Subject: Re: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?
Date: Fri, November 9, 2018 4:19 pm
--------------------------------------------------------------------------
Post by gm
hm, my application has also WOLA ...
All I find is about up- and downsampling of time sequences and spectra
of the same length.
...
Post by gm
If anyone knows of an easy explanation of down- and up sampling spectra
it would be much appreciated.
..
Post by Ethan Duni
The only applications I know of that tolerate time-domain aliasing in
transforms are WOLA filter banks - which are explicitly designed to
cancel these (severe!) artifacts in the surrounding time-domain
processing.
--
"Imagination is more important than knowledge."
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
robert bristow-johnson
2018-11-09 22:29:23 UTC
Permalink
 
i don't wanna lead you astray.  i would recommend staying with the phase vocoder as a framework for doing time-frequency manipulation.  it **can** be used real-time for pitch shift, but when i have used the phase vocoder, it was for time-scaling and then we would simply
resample the time-scaled output of the phase vocoder to bring the tempo back to the original and shift the pitch.  that was easier to get it right than it was to *move* frequency components around in the phase vocoder.  but i remember in the 90s, Jean Laroche doing that real time with a
single PC.  also a real-time phase vocoder (or any frequency-domain process, like sinusoidal modeling) is going to have delay in a real-time process.  even if your processor is infinitely fast, you still have to fill up your FFT buffer with samples before invoking the FFT.  if your
buffer is 4096 samples and your sample rate is 48 kHz, that's almost 1/10 second.  and that doesn't count processing time, just the buffering time.  and, in reality, you will have to double buffer this process (buffer both input and output) and that will make the delay twice as much. 
so with 1/5 second delay, that might be an issue.
i offered this before (and someone sent me a request and i believe i replied, but i don't remember who), but if you want my 2001 MATLAB code that demonstrates a simple phase vocoder doing time scaling, i am happy to send it to you or
anyone.  it's old.  you have to turn wavread() and wavwrite() into audioread() and audiowrite(), but otherwise, i think it will work.  it has an additional function that time-scales each sinusoid *within* every frame, but i think that can be turned off and you can even delete that
modification and what you have left is, in my opinion, the most basic phase vocoder implemented to do time scaling.  lemme know if that might be helpful.
L8r,
r b-j
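
For readers without the MATLAB at hand, the basic time-scaling phase vocoder described above looks roughly like the following numpy sketch (textbook structure only, illustrative parameter names, no gain normalization - not the MATLAB code offered here):

import numpy as np

# Minimal time-scaling phase vocoder.  stretch > 1 slows the audio down;
# resampling the result by 1/stretch afterwards turns it into a pitch
# shift, as described above.  Sketch only.

def pv_timestretch(x, stretch=1.5, n_fft=4096, hop_a=1024):
    hop_s = int(round(hop_a * stretch))             # synthesis hop
    win = np.hanning(n_fft)
    bin_freq = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft   # rad/sample

    n_frames = (len(x) - n_fft) // hop_a
    y = np.zeros(n_frames * hop_s + n_fft)
    prev_phase = np.zeros(n_fft // 2 + 1)
    acc_phase = np.zeros(n_fft // 2 + 1)

    for m in range(n_frames):
        spec = np.fft.rfft(x[m * hop_a : m * hop_a + n_fft] * win)
        mag, phase = np.abs(spec), np.angle(spec)

        # measured phase increment per analysis hop, minus the expected
        # advance of each bin, wrapped to [-pi, pi)
        dphi = phase - prev_phase - bin_freq * hop_a
        dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
        true_freq = bin_freq + dphi / hop_a         # instantaneous freq, rad/sample
        prev_phase = phase

        acc_phase += true_freq * hop_s              # advance by the synthesis hop
        y[m * hop_s : m * hop_s + n_fft] += np.fft.irfft(mag * np.exp(1j * acc_phase)) * win

    return y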
 
---------------------------- Original Message ----------------------------
Subject: Re: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?

From: "gm" <***@voxangelica.net>

Date: Fri, November 9, 2018 5:02 pm

To: music-***@music.columbia.edu

--------------------------------------------------------------------------
Post by gm
You get me intrigued with this
I actually believe that wavelets are the way to go for such things,
but, besides that anything beyond a Haar wavelet is too complicated for me
(and I just grasp that Haar very superficially of course),
I think one problem is the problem you mentioned - don't do anything
with the bands,
only then you have perfect reconstruction
And what to do you do with the bands to make a pitch shift or to
preserve formants/do some vocoding?
It's not so obvious (to me), my naive idea I mentioned earlier in this
thread was to
do short FFTs on the bands and manipulate the FFTs only
But how? if you time stretch them, I believe the pitch goes down (thats
my intuition only, I am not sure)
and also, these bands alias, since the filters are not brickwall,
and the aliasing is only canceled on reconstruction I believe?
So, yes, very interesting topic, that could lead me astray for another
couple of weeks but without any results I guess
I think as long as I don't fully grasp all the properties of the FFT and
phase vocoder I shouldn't start anything new...
--



r b-j                         ***@audioimagination.com



"Imagination is more important than knowledge."

 
 
 
 
gm
2018-11-09 23:19:14 UTC
Permalink
thanks for your offer, I can't really read MATLAB code and always
have a hard time
even figuring out the essentials of such code

My phase vocoder already works kind of satisfactorily now as a demo in
Native Instruments Reaktor,
I do the forward FFT offline and the iFFT "just in time", that means 12
"butterflies" per sample,
so you could bring down latency by speeding up the iFFT, though I am not
sure what a reasonable latency is.
I made a poll on some electronic musicians board and most people voted
for 10 ms being just tolerable.
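
Back-of-envelope for those figures (my own arithmetic, with assumptions: a radix-2 iFFT, its (N/2)*log2(N) butterflies spread evenly over one hop of output, and the hop taken as the extra buffering latency of computing one frame ahead):

import math

N = 4096
fs = 48000.0
stages = int(math.log2(N))                  # 12 butterfly stages
total_butterflies = (N // 2) * stages       # 24576 for N = 4096

for hop in (N // 2, N // 4, N // 8):
    per_sample = total_butterflies / hop
    latency_ms = 1000.0 * hop / fs
    print(f"hop {hop:4d}: {per_sample:4.0f} butterflies/sample, ~{latency_ms:4.1f} ms")

# hop = N/2 = 2048 reproduces the "12 butterflies per sample" figure at
# roughly 43 ms; getting near the ~10 ms from the poll needs a hop around
# 512, i.e. roughly four times the per-sample work.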

I am halfway content with the way it works now, for analysis I have
twelve FFTs in parallel, one for each octave
with window sizes based on ERB scale per octave, so it's not totally bad
on transients but not good either.
I assume there is still some room for improvements on the windows, but
not very much.

FFT size is 4096, and now I search for ways to improve it, mostly
regarding transients.
But I am not sure if that's possible with FFT cause I still have
pre-ringing, and I can't see
how to avoid that completely cause you can only shorten the windows on
the low octaves so much.
Maybe with an asymmetric window?
If you do the analysis with an IIR filter bank (or wavelets) you kind of
have asymmetric windows, that is the filters
integrate in a causal way with a decaying "window" they see, but I am
not sure if this can be adapted somehow
to an FFT.

Another way that would reduce reverberation and shorten transient times
somewhat would
be using shorter FFTs for the resynthesis, this would also bring down
CPU a bit and latency.

So this is where I am at the moment
i don't wanna lead you astray.  i would recommend staying with the
phase vocoder as a framework for doing time-frequency manipulation. 
it **can** be used real-time for pitch shift, but when i have used the
phase vocoder, it was for time-scaling and then we would simply
resample the time-scaled output of the phase vocoder to bring the
tempo back to the original and shift the pitch.  that was easier to
get it right than it was to *move* frequency components around in the
phase vocoder.  but i remember in the 90s, Jean Laroche doing that
real time with a single PC.  also a real-time phase vocoder (or any
frequency-domain process, like sinusoidal modeling) is going to have
delay in a real-time process.  even if your processor is infinitely
fast, you still have to fill up your FFT buffer with samples before
invoking the FFT.  if your buffer is 4096 samples and your sample rate
is 48 kHz, that's almost 1/10 second.  and that doesn't count
processing time, just the buffering time.  and, in reality, you will
have to double buffer this process (buffer both input and output) and
that will make the delay twice as much. so with 1/5 second delay,
that's might be an issue.
i offered this before (and someone sent me a request and i believe i
replied, but i don't remember who), but if you want my 2001 MATLAB
code that demonstrates a simple phase vocoder doing time scaling, i am
happy to send it to you or anyone.  it's old.  you have to turn
wavread() and wavwrite() into audioread() and audiowrite(), but
otherwise, i think it will work.  it has an additional function that
time-scales each sinusoid *within* every frame, but i think that can
be turned off and you can even delete that modification and what you
have left is, in my opinion, the most basic phase vocoder implemented
to do time scaling.  lemme know if that might be helpful.
L8r,
r b-j
---------------------------- Original Message ----------------------------
Subject: Re: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?
Date: Fri, November 9, 2018 5:02 pm
--------------------------------------------------------------------------
Post by gm
You get me intrigued with this
I actually believe that wavelets are the way to go for such things,
but, besides that anything beyond a Haar wavelet is too complicated
for me
Post by gm
(and I just grasp that Haar very superficially of course),
I think one problem is the problem you mentioned - don't do anything
with the bands,
only then you have perfect reconstruction
And what to do you do with the bands to make a pitch shift or to
preserve formants/do some vocoding?
It's not so obvious (to me), my naive idea I mentioned earlier in this
thread was to
do short FFTs on the bands and manipulate the FFTs only
But how? if you time stretch them, I believe the pitch goes down (thats
my intuition only, I am not sure)
and also, these bands alias, since the filters are not brickwall,
and the aliasing is only canceled on reconstruction I believe?
So, yes, very interesting topic, that could lead me astray for another
couple of weeks but without any results I guess
I think as long as I don't fully grasp all the properties of the FFT and
phase vocoder I shouldn't start anything new...
--
"Imagination is more important than knowledge."
_______________________________________________
dupswapdrop: music-dsp mailing list
https://lists.columbia.edu/mailman/listinfo/music-dsp
gm
2018-11-10 11:42:04 UTC
Permalink
Post by gm
FFT size is 4096, and now I search for ways to improve it, mostly
regarding transients.
But I am not sure if that's possible with FFT cause I still have
pre-ringing, and I cant see
how to avoid that completely cause you can only shorten the windows on
the low octaves so much.
Maybe with an assymetric window?
If you do the analysis with a IIR filter bank (or wavelets) you kind
of have assymmetric windows, that is the filters
integrate in a causal way with a decaying "window" they see, but I am
not sure if this can be adapted somehow
to an FFT.
I just tried this and it works well in the analysis, pre-ringing is
reduced, everything sounds more compact

Together with the ERB scaling now it even kind of works on bass drums.
They are not perfect, but much better.

The windows I am using for this:

if wn(t) is a Hann window and 0 <= t <= 1
then window is wn(sqrt(1-t))
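
Read literally, that window can be generated like this (a minimal numpy sketch; the function name is just illustrative):

import numpy as np

def asymmetric_hann(n):
    # wn(sqrt(1 - t)) with wn() a Hann window and t running over [0, 1)
    t = np.arange(n) / n
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.sqrt(1.0 - t))

w = asymmetric_hann(4096)
print(np.argmax(w))   # 3072: the peak sits at t = 0.75 instead of mid-frame,
                      # with a fast fall-off over the last quarter of the frame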

The properties of the window can be found in
"Asymmetric Windows and their Application in Frequency Estimation"

It's not particularly good compared to others, the sides fall off at -12
dB/octave, but its shape
is the most skewed of the windows given in the paper, with the
fastest rise
and the rightmost peak, so it really makes a difference.

A downside is that now I also have to use this window for peak tracking
and frequency estimation,
where a better rolloff and a narrower peak would be desirable, but
the partials need to be in sync
with the amplitudes, so I need the same window shape.
An upside is that it scales so that the peak stays in place for all sizes.

Another downside is that ERB scaling makes the high frequencies quite
hissy, they seem too
pronounced also because they ring longer than in the original when you
don't use a really
high overlap rate, like 78.


Still, I am looking for a window that might have better properties, so any
input on that is appreciated.
Theo Verelst
2018-11-19 12:24:53 UTC
Permalink
The interesting part of windowing the FFT averaging computation with
certain functions (which can turn the essentially FIR FFT into something IIR because of
the averaging filter) is that at some point the combination of your signal with
the windowed FFT might approximate certain ideals.
 
For instance, maybe you can make oscillators which, together with some form
of FFT + windowing/filtering, reconstruct a whole lot better through the
standard (FIR or IIR) reconstruction filters built into every contemporary
DAC chip.
 
It's hard to keep track of the exact computations and all the side effects regarding
harmonic distortion and transient accuracy, though. And the further you go, the more
it becomes a growing number of detail considerations with less and less
added value...

T.V.

robert bristow-johnson
2018-11-05 00:42:52 UTC
Permalink
and the other thing you're describing is what they usually call "sinusoidal modeling."

--
r b-j                         ***@audioimagination.com

"Imagination is more important than knowledge."

-------- Original message --------
From: gm <***@voxangelica.net>
Date: 11/4/2018 4:14 PM (GMT-08:00)
To: music-***@music.columbia.edu
Subject: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?

bear with me, I am a math illiterate.

I understand you can do a Discrete Fourier Transform in matrix form,
and for the 2-point case it is simply

[ 1, 1
  1,-1]

like the Haar transform, average and difference.

My idea is to use two successive DFT frames, and to transform respective bins of two successive frames like this,
to obtain a better frequency estimate (subbands) from two smaller DFTs instead of a DFT double the size.
This should be possible? and the information obtained, time and frequency resolution wise, identical.
Except that you can overlap the two DFTs.

I basically want to find the dominant frequency in the FFT bin, and separate it and discard the rest.
And a subband resolution of 2 seems to be a sufficient increase in resolution.

But how do I get that from this when there is no phase other than 0?
I can see which of the two bands has more energy, but how do I know "the true frequency" of Nyquist and DC?
There is not enough information.

The problem persists for me if I resort to a 4-point transform, what to do with the highest/lowest subband.
(and also to understand how to calculate the simple 4 point matrix, cause I am uneducated..)

Or do I need the 4-point case and discard "DC" and Nyquist subbands?
Or is the idea totally nonsense?
Or is it justified to pick the subband that has more energy, and then, what?
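
Read literally, the idea looks like this small numpy sketch (N, f and k are arbitrary illustration values):

import numpy as np

# The 2-point DFT / Haar matrix [[1, 1], [1, -1]] applied to the same
# bin k of two successive (consecutive, not interleaved) N-point frames.

N = 1024
f = 200.3                                   # test tone frequency, in bins of the N-point DFT
n = np.arange(2 * N)
x = np.cos(2.0 * np.pi * f * n / N)

X1 = np.fft.rfft(x[:N])                     # frame 1
X2 = np.fft.rfft(x[N:])                     # frame 2

k = 200
lo = X1[k] + X2[k]                          # "average" output
hi = X1[k] - X2[k]                          # "difference" output
print(abs(lo), abs(hi))

Here the difference output carries more energy, consistent with the tone sitting nearer the odd bin of a notional 2N-point DFT; as the replies note, this can at best feed a frequency estimator, it is not an exact decomposition.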
robert bristow-johnson
2018-11-05 20:14:12 UTC
Permalink
Well, I dunno shit about the history also.  I just ascribed all of the radix-2 FFT to Cooley and Tukey.

But I think you're mistaken about the technical claim.  If you have or can get Oppenheim and Schafer, go to the FFT chapter of whatever revision you have, and there are several different 8-point FFTs that they illustrate.

--
r b-j                         ***@audioimagination.com

"Imagination is more important than knowledge."

-------- Original message --------
From: Ethan Fenn <***@polyspectral.com>
Date: 11/5/2018 11:34 AM (GMT-08:00)
To: robert bristow-johnson <***@audioimagination.com>, music-***@music.columbia.edu
Subject: Re: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?

I don't think that's correct -- DIF involves first doing a single stage of butterfly operations over the input, and then doing two smaller DFTs on that preprocessed data. I don't think there is any reasonable way to take two "consecutive" DFTs of the raw input data and combine them into a longer DFT.

(And I don't know anything about the historical question!)

-Ethan

On Mon, Nov 5, 2018 at 2:18 PM, robert bristow-johnson <***@audioimagination.com> wrote:

Ethan, that's just the difference between Decimation-in-Frequency FFT and Decimation-in-Time FFT.

i guess i am not entirely certain of the history, but i credited both the DIT and DIF FFT to Cooley and Tukey.  that might be an incorrect historical impression.
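
A quick numeric check of that point (my own sketch in plain numpy): interleaved half-length DFTs recombine exactly via the decimation-in-time butterfly, while consecutive half-length DFTs hand you only the even bins of the long DFT:

import numpy as np

# Interleaved vs. consecutive half-length DFTs, checked numerically.

N = 8
x = np.random.randn(2 * N)
X_long = np.fft.fft(x)                             # the 2N-point reference

# Interleaved halves (decimation in time): every bin is recovered exactly.
Xe = np.fft.fft(x[0::2])                           # even-numbered samples
Xo = np.fft.fft(x[1::2])                           # odd-numbered samples
tw = np.exp(-2j * np.pi * np.arange(N) / (2 * N))  # twiddle factors
X_dit = np.concatenate([Xe + tw * Xo, Xe - tw * Xo])
print(np.max(np.abs(X_dit - X_long)))              # ~1e-15

# Consecutive halves: the sum does give the even bins of the long DFT...
A = np.fft.fft(x[:N])
B = np.fft.fft(x[N:])
print(np.max(np.abs((A + B) - X_long[0::2])))      # ~1e-15
# ...but the difference is not the odd bins (those need more than the
# single bin k from each frame).
print(np.max(np.abs((A - B) - X_long[1::2])))      # large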
---------------------------- Original Message ----------------------------
Subject: Re: [music-dsp] 2-point DFT Matrix for subbands Re: FFT for realtime synthesis?
From: "Ethan Fenn" <***@polyspectral.com>
Date: Mon, November 5, 2018 10:17 am
To: music-***@music.columbia.edu
--------------------------------------------------------------------------
Post by Ethan Fenn
It's not exactly Cooley-Tukey. In Cooley-Tukey you take two _interleaved_
DFT's (that is, the DFT of the even-numbered samples and the DFT of the
odd-numbered samples) and combine them into one longer DFT. But here you're
talking about taking two _consecutive_ DFT's. I don't think there's any
cheap way to combine these to exactly recover an individual bin of the
longer DFT.
Of course it's possible you'll be able to come up with a clever frequency
estimator using this information. I'm just saying it won't be exact in the
way Cooley-Tukey is.
-Ethan
 
--

r b-j                         ***@audioimagination.com

"Imagination is more important than knowledge."
    _______________________________________________
dupswapdrop: music-dsp mailing list
music-***@music.columbia.edu
https://lists.columbia.edu/mailman/listinfo/music-dsp