Sonographic Sound Processing

Introduction

The purpose of this article is to introduce the reader to the subject of real-time sonographic sound processing. Since sonographic sound processing is based on windowed analysis, it belongs to the digital world. Hence we will be drawing on digital concepts of audio, image and video throughout this article.

Sonographic Sound Processing is all about converting sound into an image, applying graphical effects to the image and then converting the image back into sound. When thinking about sound-image conversion we stumble across two inherent problems:

- The first problem is that images are static while sounds cannot be static. Sound is all about pressure fluctuations in some medium, usually air, and as soon as the pressure becomes static, the sound ceases to exist. In the digital world, audio signals that may later become audible sounds are all a consequence of constantly changing numbers.
- The second problem is that an image, in terms of data, exists in 2D space, while sound exists in only one dimension. A grayscale image, for instance, can be understood as a function of pixel intensity or brightness in relation to a two-dimensional space, that is, the width and height of the image. An audio signal, on the other hand, is a function of amplitude in relation to only one dimension, which is time (see fig. 1 and fig. 2).

Figure 1: Heavily downsampled image (10x10 pixels) of a black spot with a gradient on a white background. On the right-hand side of figure 1 is the numerical representation of the image, which is a two-dimensional array or matrix.

Figure 2: Heavily downsampled piece of a sine wave (10 samples, time domain). On the right-hand side is the numerical representation of the waveform, which is a one-dimensional array.

At this point we can see that we could easily transform one column or one row of an image into a sound. If our image had a width of 44100 pixels and a height of 22050 pixels, and our sampling rate were set to 44.1 kHz, one row would give us exactly one second of sound while one column would give us exactly half a second of sound. Instead of columns and rows we could also follow diagonals or any other arbitrary trajectories. As long as we read the numbers (i.e. switch between various pixels and read their values) at the speed of the sampling rate, we could precisely control the audio with the image. And that is the basic concept of wave terrain synthesis, which can also be understood as an advanced version of wavetable synthesis, where the reading trajectory “defines” its momentary wavetable.

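As a minimal illustration of this reading idea, here is a Python sketch using a random stand-in image; the image size, scaling and row/column indices are assumptions made for the example, not part of any patch described later:

    import numpy as np

    sr = 44100                                   # sampling rate in Hz
    height, width = 1000, 2000                   # assumed image size in pixels
    image = np.random.rand(height, width).astype(np.float32)   # grayscale, 0..1

    # reading one row at the sampling rate gives width / sr seconds of sound
    row_audio = image[100, :] * 2.0 - 1.0        # rescale 0..1 -> -1..1
    print(row_audio.size / sr)                   # ~0.045 s for this image

    # reading one column gives height / sr seconds of sound
    col_audio = image[:, 200] * 2.0 - 1.0
    print(col_audio.size / sr)                   # ~0.023 s

    # with a 44100 x 22050 pixel image, one row would last exactly 1 s
    # and one column exactly 0.5 s, as described above
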
So the answer to the first problem is that we have to gradually read an image in order to hear it.

So, for instance, if we wanted to hear “the sound of my gran drinking a cup of coffee”, we would need to read each successive column from bottom to top and in that manner progress through the whole width of the picture (as mentioned earlier, the reading trajectory is arbitrary as long as we read all the pixels of the image only once). That would translate an image, which is frozen in time, into a sound with some limited duration.

But since we would like to process some existing sound with graphical effects, we first need to generate the image out of the sound.

What we could try for a start is to chop the “horizontal” audio into slices and place them vertically so that they become columns. That would give us an image made out of sound, and as we read the columns successively from bottom to top we would gradually read the picture from left to right (fig. 3), which would give us back our initial sound.

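A minimal Python sketch of this chopping, assuming a mono float signal and a slice length chosen as an arbitrary example value:

    import numpy as np

    slice_len = 512                                    # assumed slice length in samples
    audio = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)   # 1 s test tone
    audio = audio[: (audio.size // slice_len) * slice_len]       # trim to a whole number of slices

    # place successive slices as columns of a matrix (the "image")
    image = audio.reshape(-1, slice_len).T             # shape: (slice_len, number_of_columns)

    # reading the columns one after another restores the original signal
    restored = image.T.reshape(-1)
    print(np.allclose(audio, restored))                # True
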
Figure 3: Chopping an audio signal into slices and placing them vertically as columns of a matrix, where each matrix cell represents a pixel and an audio sample at the same time. By reading successive columns vertically we progress through the image from left to right.

At that point we would actually have an image made out of audio data and we could easily play it back. But that kind of image would not give us any easily understandable information about the audio itself – it would look like some kind of graphical noise (fig. 4a). Hence we would not be able to control the audio output with graphical effects, since we would not understand what in the picture reflects which audio parameters.

Of course we could still modify the audio data with graphical effects, but again, the outcome would exist only within the frame of noise variations. One of the reasons for that lies in the fact that our X and Y axes would still represent only one dimension – time.

Spectrogram & FFT

In order to give some understandable meaning to the audio data we need to convert the audio signal from the time domain into the frequency domain. That can be achieved via the windowed Fast Fourier Transform (FFT) or, more accurately, the short-time Fourier transform (STFT).

If we apply the STFT to our vertical audio slices, the result is a mirrored spectrogram (fig. 4b). That kind of graphical audio representation can be interpreted easily, but at the same time half of it is unwanted data – the mirrored part.

In order to obtain only the information we need, we can simply remove the unneeded part (fig. 6), which in a way happens by default in Max’s pfft~ object. In other words, just as we have to avoid using frequencies above the Nyquist frequency in order to prevent aliasing in the digital world, it is pointless to analyze audio all the way up to the sampling-rate frequency, since everything above the Nyquist frequency will always give us wrong results in the form of a mirrored spectrum.

Figure 4a (left): Vertically plotted slices of audio (a few words of speech) as time-domain waveforms seen from a “bird’s eye view”.
Figure 4b (right): Spectrogram, i.e. the vertically plotted slices of audio (a few words of speech) transformed into the frequency domain via the STFT.

The first step of the STFT, which is the most common and useful form of the FFT, is to apply a volume envelope to a time slice. The volume envelope can be a triangle or any symmetrical curve such as a Gaussian, Hamming, etc. In the world of signal analysis this volume envelope is called a window function, and the result – the time slice multiplied by a window function – is called a window or a windowed segment.

In the last stage of the STFT, the FFT is calculated for the windowed segment (fig. 5). At this point the FFT should be considered only as a mathematical function that translates windowed fragments of time-domain signals into frequency-domain signals.

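A minimal Python sketch of these two steps (window, then FFT of the windowed segment); the window type, window size and test tone are assumptions made for the example:

    import numpy as np

    window_size = 1024                            # assumed 2^n samples
    sr = 44100
    t = np.arange(window_size) / sr
    time_slice = np.sin(2 * np.pi * 440 * t)      # hypothetical time slice

    window_function = np.hanning(window_size)     # symmetrical volume envelope
    windowed_segment = time_slice * window_function

    spectrum = np.fft.rfft(windowed_segment)      # FFT of the windowed segment
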
Figure 5: Time slice multiplied by a window function (Roads, 1996, p. 550).

In order to perform the STFT, windows have to consist of exactly 2^n samples. That results in familiar numbers such as 128, 256, 512, 1024, 2048, 4096, etc. These numbers represent the FFT size or, in our case, twice the height of the spectrogram image in pixels (fig. 6).

At this point we can also adopt some further terminology from the world of the FFT and call each column in the spectrogram matrix an FFT frame and each pixel inside the column an FFT bin. Therefore the useful half of each FFT frame consists of 2^(n-1) FFT bins, where 2^n is the number of samples in the windowed segment.

The number of FFT frames corresponds to the number of pixels in the horizontal direction of the image (spectrogram), while half the number of FFT bins corresponds to the number of pixels in the vertical direction. Hence: spectrogram image width = number of FFT frames; spectrogram image height = number of FFT bins divided by 2.

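Putting these relations into numbers (the FFT size and frame count below are assumed example values):

    fft_size = 1024                      # 2^n samples per window
    n_frames = 500                       # number of analyzed windows (FFT frames)

    spectrogram_width = n_frames         # one column per FFT frame
    spectrogram_height = fft_size // 2   # useful bins: half the FFT size

    print(spectrogram_width, spectrogram_height)   # 500 512
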
Figure 6: The height of the spectrogram (the useful half of each FFT frame) is two times smaller than the length of the extracted time segment.

If our goal were only to create a spectrogram, we would need to convert the linear frequency scale of our spectrogram into a logarithmic one, in order to have a better view of the most important part of the spectrum. We would also need to downsample or “squash” the height of our spectrogram image, since the images might be too tall (a 4096-sample window would give us 2048 vertical pixels in our spectrogram). Both actions would result in significant data loss that we cannot afford when preparing the ground for further spectral transformations.

Hence, if we want to preserve all the data obtained during the FFT analysis, our linear spectral matrix should remain intact. Of course it would be sensible to duplicate and remap the spectral data in order to create a user-friendly interactive sonogram, through which one could interact with the original spectral matrix, but that would complicate things even further, so it will not be covered in this article.

For now, the image of our spectrogram should consist of all the useful data obtained during the FFT analysis.

It is also important to emphasize that the window size defines the lowest frequency that the FFT can detect in a time-domain signal. That frequency is called the fundamental FFT frequency. As we know, frequency and wavelength are inversely related, therefore very low frequencies have very large wavelengths or periods.

And if the period is larger than our chosen window, so that it cannot fit inside the window, it cannot be detected by the FFT algorithm. Hence we need large windows in order to analyze low frequencies.

Large windows also give us a good frequency resolution of frequency-domain signals. The reason for that lies in the harmonic nature of the FFT. The FFT can be considered as a harmonically rich entity that consists of many harmonically related sine waves. The first sine wave is the FFT fundamental.

For instance, if we chose a 2^9 or 512-sample window at a sampling rate of 44.1 kHz, our FFT fundamental would be 86.13 Hz (44100 Hz / 512 = 86.13 Hz), and all the following harmonics would have a frequency of N × (FFT fundamental), where N is an integer. All these virtual harmonics are basically FFT probes: the FFT compares the time-domain signal with each probe and checks whether that frequency is also present in the tested signal. Hence smaller FFT fundamentals, as a consequence of larger windows, mean that we have many more FFT probes in the range from the fundamental all the way up to the Nyquist frequency. That is why large windows give us a good frequency resolution.

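As a quick check of this arithmetic, a short Python sketch computing the FFT fundamental and the probe frequencies up to the Nyquist frequency (window size and sampling rate as in the example above):

    sr = 44100
    window_size = 512                          # 2^9 samples

    fft_fundamental = sr / window_size         # 86.13 Hz
    # the FFT "probes" are harmonics of the fundamental, up to the Nyquist frequency
    probe_freqs = [n * fft_fundamental for n in range(1, window_size // 2 + 1)]

    print(round(fft_fundamental, 2), len(probe_freqs))   # 86.13 256
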
So if an FFT probe detects its test frequency in the time-domain signal, it tells us how strong it is (amplitude) and what its phase is. Since the number of probes always equals the number of samples in the chosen time window, we get two pieces of information for each sample of the time-domain audio signal – amplitude and phase. Hence the full spectrogram from figure 4b, without the unwanted mirrored part, actually looks like this (fig. 7):

Figure 7: Graphical representation of the amplitude (left) and phase (right) information of a frequency-domain signal.

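To tie the pieces together, here is a Python sketch that turns a mono signal into the two matrices of figure 7 (amplitude and phase); the window size and test tone are assumed values, and no overlap is used yet:

    import numpy as np

    sr = 44100
    fft_size = 512
    signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone
    win = np.hanning(fft_size)

    n_frames = signal.size // fft_size                       # no overlap here
    amp_img = np.zeros((fft_size // 2, n_frames))            # amplitude "image"
    phase_img = np.zeros((fft_size // 2, n_frames))          # phase "image"

    for f in range(n_frames):
        segment = signal[f * fft_size : (f + 1) * fft_size] * win
        spectrum = np.fft.rfft(segment)[: fft_size // 2]     # keep only the useful half
        amp_img[:, f] = np.abs(spectrum)                     # each column is one FFT frame
        phase_img[:, f] = np.angle(spectrum)
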
We can see that the phase part of a spectrogram does not give us any meaningful information when we look at it, and it is therefore usually not displayed. But as we will see later, the phase information is very important when processing spectral data and transforming it back into the time domain via the inverse STFT and overlap-add (OA) resynthesis.

As we said, the size of the window defines the FFT fundamental and hence the frequency resolution of the FFT. Therefore we need big windows for a good frequency resolution. But here comes the ultimate problem of the FFT – frequency and time resolution are in an inverse relationship.

FFT frames (analyzed windows) are actually snapshots frozen in time. There is no temporal information about the analyzed audio inside a single FFT frame. Therefore an audio signal in the frequency domain can be imagined as the successive frames of a film. Just as each frame of a film is a static picture, each FFT frame presents, in a way, a “static” sound.

And if we want to hear a single FFT frame we need to loop through it constantly, just as a digital oscillator loops through its wavetable. The sound of an FFT frame is therefore “static” in the same way that the sound of an oscillator with a constant frequency is “static”. For a good temporal resolution we therefore need to sacrifice frequency resolution, and vice versa.

In practice the STFT is always performed with overlapping windows. One function of the overlapping windows is to cancel out the unwanted artifacts of the amplitude modulation that occurs when applying a window to a time slice. As we can see from figure 8, an overlap factor of 2 is sufficient for that job. The sum of the overlapping window amplitudes is constantly 1, which cancels out the effect of amplitude modulation.

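A quick numerical check of this property in Python, assuming triangular windows and a hop of half the window size (overlap factor 2):

    import numpy as np

    window_size = 8
    hop = window_size // 2                            # overlap factor 2

    # periodic triangular window: w[n] = 1 - |2n/M - 1|
    n = np.arange(window_size)
    tri = 1.0 - np.abs(2.0 * n / window_size - 1.0)

    # overlap-add the window amplitudes themselves over several hops
    total = np.zeros(window_size + 4 * hop)
    for i in range(5):
        total[i * hop : i * hop + window_size] += tri

    # away from the first and last half-window the sum is constantly 1
    print(np.allclose(total[hop:-hop], 1.0))          # True
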
Hence we can conclude that our spectrogram from figure 7 is not very accurate, since we have not used any overlap.

Figure 8: Windowing with triangular window functions with overlap factor 2. The sum of the overlapping window amplitudes is constantly 1, which cancels out the effect of amplitude modulation.

Another role of overlapping is to increase the time resolution of frequency-domain signals. According to Curtis Roads, “an overlap factor of eight or more is recommended when the goal is transforming the input signal” (Roads, 1996, p. 555).

Using an overlap of 8, for instance, means that the FFT produces 8 times more data than when using no overlap – instead of one FFT frame we get 8 FFT frames.

Therefore, if we wanted to present the same amount of time-domain signal in a spectrogram as in figure 6, we would need an image (spectrogram) that is 8 times wider.

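In terms of the analysis parameters, the overlap factor sets the hop size between successive windows and therefore the number of FFT frames per second; a small sketch with assumed example values:

    sr = 44100
    fft_size = 1024
    overlap = 8

    hop_size = fft_size // overlap             # 128 samples between successive frames
    frames_per_second = sr / hop_size          # ~344.5 frames/s, vs ~43 with no overlap

    print(hop_size, round(frames_per_second, 1))   # 128 344.5
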
Also, the sampling rate of all the FFT operations has to be overlap-times higher than in the time-domain part of our patch or program (in case we want to process all the data in real time).

Now that we have introduced the concept of the spectrogram, it is time to take a look at the central tool of sonographic sound processing, the phase vocoder.

Phase Vocoder

The phase vocoder (PV) can be considered as an upgrade to the STFT and is consequently a very popular analysis tool. The added benefit is that it can measure the deviation of a frequency from its center bin frequency, as described by Dodge and Jerse (1997, p. 251). For example, if an STFT with a fundamental frequency of 100 Hz is analyzing a 980 Hz sine tone, the FFT algorithm shows the biggest energy in the bin with index 10 – in other words, at the frequency 1000 Hz. The PV, on the other hand, is able to determine that the greatest energy is concentrated 20 Hz below the 1000 Hz bin, giving us the correct result of 980 Hz.

The calculation of this frequency deviation is based on the phase differences between successive FFT frames for a given FFT bin. In other words, the phase differences are calculated between neighboring pixels in each row of the spectral matrix containing the phase information (fig. 7, right). In general, a phase vocoder does not store spectral data in the form of a spectrogram but in the form of a linear stereo buffer (one channel for phase and the other one for amplitude).

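A rough Python sketch of this deviation measurement (not the pfft~ implementation); the FFT size, hop and window are assumed example values, and the phase wrapping follows the usual phase-vocoder recipe:

    import numpy as np

    sr = 44100
    fft_size = 1024
    hop = 256                                    # overlap factor 4
    true_freq = 980.0                            # test tone

    t = np.arange(fft_size + hop) / sr
    signal = np.sin(2 * np.pi * true_freq * t)
    win = np.hanning(fft_size)

    frame0 = np.fft.rfft(win * signal[:fft_size])
    frame1 = np.fft.rfft(win * signal[hop:hop + fft_size])

    k = int(round(true_freq * fft_size / sr))    # nearest bin index
    bin_freq = k * sr / fft_size                 # center frequency of that bin

    # expected phase advance of the bin's center frequency over one hop
    expected = 2 * np.pi * bin_freq * hop / sr
    # measured phase advance minus the expected one, wrapped into [-pi, pi)
    delta = np.angle(frame1[k]) - np.angle(frame0[k]) - expected
    delta = (delta + np.pi) % (2 * np.pi) - np.pi

    deviation = delta * sr / (2 * np.pi * hop)   # deviation in Hz from the bin center
    print(round(bin_freq + deviation, 1))        # approximately 980.0
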
The phase hence contains structural information, or information about the temporal development of a sound. “The phase relationships between the different bins will reconstruct time-limited events when the time domain representation is resynthesized” (Sprenger, 1999). A bin’s true frequency therefore enables the reconstruction of the time-domain signal on a different time basis (Sprenger, 1999). In other words, the phase difference, and consequently the running phase, is what enables smooth time stretching or time compression in a phase vocoder. Time stretching or time compression, in the case of our phase vocoder with a spectrogram as an interface, is actually its ability to read the FFT data (the spectrogram) at various reading speeds while preserving the initial pitch.

Since the inverse FFT demands a phase rather than a phase difference for signal reconstruction, the phase differences have to be summed back together; the result is called a running phase. If no time manipulation is present (reading speed = the speed of recording), the running phase for each frequency bin equals the phase values obtained straight after the analysis. But for any other reading speed the running phase differs from the initial phases and is responsible for the smooth signal reconstruction. Taking the phase into consideration when resynthesizing time-stretched audio signals is the main difference between the phase vocoder and synchronous granular synthesis (SGS).

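A minimal sketch of accumulating a running phase while reading the analysis frames at an arbitrary speed; the function name, the per-frame amplitude/phase-difference arrays and the stand-in data are all hypothetical:

    import numpy as np

    def resynth_frame(amp_frames, dphi_frames, read_pos, running_phase):
        """Return one time-domain frame read from (possibly fractional) position read_pos."""
        idx = int(read_pos) % dphi_frames.shape[0]
        # advance the running phase by this frame's per-bin phase difference
        running_phase = running_phase + dphi_frames[idx]
        spectrum = amp_frames[idx] * np.exp(1j * running_phase)
        return np.fft.irfft(spectrum), running_phase

    # usage: stepping through stand-in analysis data at half speed (2x time stretch)
    n_frames, n_bins = 100, 513                     # e.g. FFT size 1024
    amps = np.random.rand(n_frames, n_bins)
    dphis = np.random.uniform(-np.pi, np.pi, (n_frames, n_bins))
    phase = np.zeros(n_bins)
    pos = 0.0
    for _ in range(10):
        frame, phase = resynth_frame(amps, dphis, pos, phase)
        pos += 0.5                                  # reading speed 0.5 = stretch by 2
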
As soon as we have a phase vocoder with a spectrogram as an interface, and when our spectrogram actually reflects the FFT data, so that its pixels represent all the useful FFT data (we can ignore the mirrored spectrum), we are ready to perform sonographic sound processing. We can record our spectral data in the form of two grayscale spectral images (amplitude and phase), import them into Photoshop or any similar image processing software, and play the modified images back as sounds. We just need to know which image sizes to use for our chosen FFT parameters. But when we are performing spectral sound processing in real-time programming environments such as Max/Jitter, we can modify our spectrograms “on the fly”. Jitter offers us endless possibilities for real-time graphical manipulation, so we can tweak our spectrogram with graphical effects just as we would tweak our synth parameters in real time. And the convenience of real-time sonographic interaction is very rare in the world of sonographic sound processing. In fact, I am not aware of any commercial sonographic processing product on the market that would offer the user real-time interaction.

Another important thing when modifying sound graphically is to preserve the bit resolution of the audio while processing the images. Audio signals have an accuracy of 32-bit floating point or more. A single channel of the ARGB color model, on the other hand, has a resolution of only 8 bits. And since we are using grayscale images, we only use one channel. Hence we need to process our spectrogram images only with objects or tools that are able to handle 32- or 64-bit floating-point numbers.

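A small sketch of why this matters, comparing a float32 spectral value with the same value round-tripped through an 8-bit image channel (the 0..1 scaling is an assumption made for the example):

    import numpy as np

    amp = np.float32(0.123456)                   # a spectral amplitude value

    # round trip through an 8-bit image channel (0..255)
    as_uint8 = np.uint8(round(float(amp) * 255))
    back = np.float32(as_uint8) / 255.0

    print(amp, back, abs(amp - back))            # error on the order of 1e-3
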
When doing spectral processing it is also sensible to consider processing the spectral data on the GPU, which, thanks to its parallel processing abilities, is much more powerful and faster than the CPU. One may think that all image processing takes place on the GPU, but in many cases that is not correct. Many graphical effects actually use the CPU to scale and reposition the image data. So the idea is to transfer the spectral data from the CPU to the GPU, perform as many operations as possible on the GPU by using various shaders, and then transfer the spectrogram back to the CPU, where it can be converted back into a time-domain signal.

The only problematic link in the chain, when transferring data from the CPU to the GPU and back, is the actual transfer to and from the graphics card. That in general slows the whole process down to a certain extent. Therefore we should have only one CPU-GPU-CPU transfer in the whole patch. Once on the GPU, all the desired OpenGL actions should be executed. We also have to be careful not to lose our 32-bit resolution in the process, which happens in Max by default when going from the GPU to the CPU, because Jitter assumes that we need something in ARGB format from the GPU.

At the end of this article I should also mention one very useful method, described by J. F. Charles (Charles, 2008): interpolation between successive FFT frames. As we said earlier, FFT frames are like the frames of a movie. Hence, when reading a spectrogram back, we progress through it by jumping from one FFT frame to another. And in the same manner as we notice the switching between successive still pictures when playing back a video very slowly, we notice the switching between successive FFT frames (“static” loops) when reading a spectrogram back very slowly. This is known as the frame effect of the phase vocoder. Hence, in order to achieve a high-quality read-back when doing extreme time stretching, we can constantly interpolate between two successive FFT frames and read only the interpolated one.

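A sketch of reading such an interpolated frame, assuming the amplitudes and per-bin phase differences are interpolated linearly before resynthesis (this follows the general idea, not Charles's exact Jitter implementation):

    import numpy as np

    def interpolated_frame(amps, dphis, read_pos):
        """Linearly interpolate amplitude and phase-difference data between
        the two FFT frames surrounding the (fractional) read position."""
        i = int(np.floor(read_pos))
        frac = read_pos - i
        j = min(i + 1, amps.shape[0] - 1)
        amp = (1.0 - frac) * amps[i] + frac * amps[j]
        dphi = (1.0 - frac) * dphis[i] + frac * dphis[j]
        return amp, dphi

    # usage with stand-in analysis data, reading a quarter of the way between frames 41 and 42
    amps = np.random.rand(100, 513)
    dphis = np.random.uniform(-np.pi, np.pi, (100, 513))
    amp, dphi = interpolated_frame(amps, dphis, 41.25)
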
References:

Charles, J. F. 2008. A Tutorial on Spectral Sound Processing Using Max/MSP and Jitter. Computer Music Journal 32(3), pp. 87–102.

Dodge, C. and Jerse, T. A. 1997. Computer Music: Synthesis, Composition and Performance. 2nd edition. New York: Thomson Learning.

Roads, C. 1996. The Computer Music Tutorial. Cambridge, Massachusetts: The MIT Press.

Sprenger, S. M. 1999. Pitch Scaling Using The Fourier Transform. Audio DSP pages. [Online]. Available: http://docs.happycoders.org/unsorted/computer_science/digital_signal_processing/dspdimension.pdf [Accessed 5 August 2010].

Tadej Droljc, spring 2013