|
The following systems have been developed in the AIT Smart Lab:
Based on these systems we have also developed the following
ambient context-aware services:
- The Memory Jog providing assistance (e.g., memory aids,
pertinent information) in the scope of small meetings,
lectures and seminars occurring in the smart room [Click to
see
video /
images 1,
2]
- The e-Director, an intelligent camera-man, that
automatically chooses the camera (among the possible
five in the room) that provides the optimal view of a
speaker [Click to see
video
/ image]
A major part of this work has been carried out in the scope
of the Computers in the Human Interaction Loop (CHIL)
project (http://chil.server.de).
A video about AIT activities in CHIL is also available [Click here].
• Face Detection
The face detection part of the AIT-FACER identifies those
image segments that most resemble faces, based on color and
geometry. From a systems perspective the face detector is
designed and trained to have low probability of miss, but
rather high probability of false detection. Thus the face
detector accepts any reasonably large, round, skin-like
colored segments as face candidates. Then the subsystems,
attempt to select the frontal faces out of all these
candidate face segments. A second design consideration has
been speed: this module needs to operate very fast by itself
and also help minimize the effort required by the subsequent
processing. Since the direct application of skin-color
segmentation is computationally expensive, detection begins
with foreground segmentation; skincolor segmentation follows
on the foreground segments.
Foreground in general is identified by using the
difference from a background image. The background image can
be either static or the result of adaptive estimation. In
the system that we present we have not used any adaptive
scheme for background estimation. This is not a limitation
of the system since for many in-doors applications the
background and illumination remain fairly constant. Thus an
empty room image is utilized as background and subsequently
the colors of the foreground are normalized. We now describe
the algorithm we use for foreground estimation in detail.
The first subtract the empty room image, sum the RGB
channels and binarize the result. We subsequently perform a
median filtering operation on 8x8 pixel blocks is performed,
in order to produce solid foreground segments. This is much
faster than equivalent morphological operations.
The next step is color normalization, which is used to
minimize the effects of shadows on a frame level. The
preferred and visibly better way is Gamma correction, but a
faster solution is needed for our real-time system. Thus we
set the brightness of the foreground segment at 95%. The
original brightness information is also kept and used in the
identified skin segments at later stages of the face
detector.
Skin color segmentation is applied on the
color-normalized foreground segments by using a human skin
color model to produce the skin likelihood map of the
foreground. The likelihood map is then segmented into skin
segments; some heuristics discard those segments that are
very unlikely to be human faces.
The color model we use is based on the skin color and
non-skin color histograms of modified with conditional skin
probability scaling. The ratio of the skin over non-skin
probabilities is scaled by skin probability if the latter is
small. In logarithmic scale, the used log-likelihood L(r,g,b)
is given by

The above proposed modification decreases the likelihood
that a pixel is characterized as skin when it is not
probable that it is, and as a result better segmentation is
achieved. The likelihood map L(r,g,b) is binarized: pixels
with L(r,g,b) > -.75 pixels take the value 1 (skin color);
the rest take the value 0. Then, we use a median filtering
on 11-by-11 pixel neighborhoods with similar advantages to
the foreground application (smooth skin map, much faster
than equivalent morphological operations).
Next we identify regular and connected skin color
segments and discard irregular or very small such segments.
First, the different segments become connected in the skin
map by using 8-way connectivity. Subsequently, the bounding
boxes of the segments are identified and boxes with small
area (<0.2% of the frame area) are discarded: their
resolution is too low for recognition.
Subsequently, we choose segments with face-like
elliptical aspect ratios. We assign more particles for
pixels that are more skin-like and apply Principal Component
Analysis (PCA) on the scatter matrix of the particle
coordinates: a) the principal axes of the particle cloud
covering the region are used as a rough estimate of the
rotation of the face and used along with the eye detector
the eigenvalues are used to estimate the elliptical aspect
ratio of the region. Regions that are too oblong to
correspond to faces are discarded.
Finally, we stress that by performing PCA on the particle
cloud(s) we are able to obtain not only the elliptical
aspect ratio, but also a rough estimation of the angle of
rotation of the segment. Although not very accurate, this
estimate proves particularly useful for robust eye
detection.

• Eye Detection
Detecting the location of the eyes in a facial image is of
paramount importance for an automated face recognizer. If we
can identify the eyes and their location reliably, we can
perform necessary normalizations in terms of shift, scale
and rotation. The eye detector we developed at AIT as part
of the AIT-FACER system operates in two stages: First the
eye zone (eyes and bridge of the nose area) is detected in
the face candidate segments. As a second stage, we detect
the eyes in the identified eye zone. The eye zone is
identified in the segment using a technique similar to.
First the vertical position of the zone is determined by
examining the mean brightness of every row in the segment.
Two peaks are detected in the upper part of the segment:
they correspond to the forehead and upper cheek regions of
the face. The center of the zone is set between the two
peaks as follows: If n b , n= 0,…p is the averaged
brightness between the two peaks, with 0 being the index of
the upper and p that of the lower peak and m b , m,(0,p) is
the minimum averaged brightness, then we find indices k such
that

Denoting the set of such indices by K, the
center c of the zone is the weighted average

The zone is further extended until the
averaged brightness exceeds 1.2 times bm.
This approach works well for approximately horizontally
located eyes. For rotated segments we use the rough
estimation of the rotation angle from the Face Detector
subsystem to de-rotate them sufficiently to locate the eyes.
If the face had been very accurately
segmented, the vertical position of the eye zone would have
been adequate. Unfortunately, the hair of the temples and
sometimes a bit of the background are present in the face
segments. These parts are usually fairly dark and can become
particularly troublesome for locating the eyes for this
reason; we introduce a new algorithm to also define the eye
zone horizontally.
The horizontal eye location algorithm starts
by reducing the effect of shadows at the local (segment)
level. We introduce a new method that subtracts the best
linear fit of brightness variation from the segment. Since
brightness is affected both by lighting and by the color
properties of the objects (e.g. hair over part of forehead),
its value in the whole segment cannot be used directly.
Instead, the horizontal component of brightness, averaged
across each column in the (vertically defined) eye zone of
each segment, is used to estimate the best linear fit and
subtract it from the face segment.
Subsequently, the horizontal extent of the
eye zone is identified in steps. First, the brightness in
the eye zone is averaged across its columns. All the
brightness peaks are then detected: the ones close to the
center correspond to the bridge of the nose. A zone with
decreasing or roughly constant brightness around the
leftmost and the rightmost of these peaks is found. The
center of this zone defines the center of the face, while
its extent defines the face width. The eye zone is extended
beyond the leftmost and rightmost peaks as long as
brightness remains roughly constant.

Having estimated the vertical and horizontal position of
the eye zone, we use particles to sample its brightness
distribution and identify the most probable position of the
eye pupils. More particles are assigned to pixels with less
brightness; the mean of the particle distributions is found
using 2-class k-means clustering. This approach works in
practice because the bridge of the nose is usually quite
bright and serves as a boundary between two distinct clouds
of particles.
The variance of the two clusters is then estimated and if
either of them is above a threshold, the cluster is
re-clustered. This spreading is usually the effect of
eyebrows, temples and dark corners of the eyes. Another
2-class k-means is used to identify two secondary clusters.
Depending on whether the eye is the left or the right, the
leftmost or the rightmost of the two secondary clusters is
kept. The position of the eye pupil is then obtained as a
weighted average between the original and the secondary
cluster center. The relative weight of the secondary cluster
and the re-clustering threshold are system parameters.

• Face Recognition
All normalized segments are finally processed by an LDA
classifier and an identity tag is attached to each one.
There are 5 images per person available for training. In
order to account for the inaccuracies of the eye-detector,
we train the LDA classifier using not only the
hand-annotated eye positions of the training images but also
perturbations around the ideal positions. In particular,
each eye position is allowed to vary within any of the eight
neighboring positions in a 3-by-3 rectangular grid centered
on the ideal position. The horizontal and vertical steps of
the perturbation are 1.68% of the template eye distance. The
RMS error introduced by all the 81 possible eye pair
positions is 1.94% of the template eye distance. Although
this is smaller that the expected RMS error of the eye
detector, we have performed extensive testing to validate
that this is a good choice for the perturbations. Every T
seconds, a majority voting is performed amongst the
identities. The votes are weighted using

Longer periods of T allow more frontal faces to be
detected and accurately normalized, hence leading to reduced
errors.

• Person Tracking
The block diagram of the tracking system is shown in the
figure below. It comprises three modules: adaptive
background, measurement and Kalman filtering. The adaptive
background module produces the foreground pixels of each
video frame and passes this evidence to the measurement
module. The measurement module associates the foreground
pixels to targets, initializes new ones if necessary and
manipulates existing targets by merging or splitting them
based on an analysis of the foreground evidence. The
existing or new target information is passed to the Kalman
filtering module to update the state of the tracker, i.e.
the position and size of the targets. The output of the
tracker is the state information which is also fed back to
the adaptive background module to guide the spacio-temporal
adaptation of the algorithm. In the rest of the section, we
present the three modules in detail.
Adaptive Background Module
The targets of the proposed system (vehicles and humans) are
mostly moving. The changes in the video frames due to the
movement are used to identify and segment the foreground
(pixels of the moving targets) from the background (pixels
without movement). If a background image were available,
this segmentation is simply the difference of the current
frame from the background image. The foreground pixels thus
obtained are readily grouped into target regions. A static
image of the empty scene viewed by the camera can be used
for background. Unfortunately this is not practical and
adaptive background approaches are adopted primarily for two
reasons: First, such an empty scene image might not be
available due to system setup. Secondly and most
importantly, background (outdoors and indoors) also changes:
Natural light conditions change slowly as time goes by; the
wind causes swaying movements of flexible background object
(e.g. foliage); fluorescent light flickers at the power
supply frequency; objects on tabletops and small pieces of
furniture are rearranged and projection areas display
different content. All these changes need to be learnt into
an adaptive background model.

Stauffer’s adaptive background algorithm is capable of
learning such changes with so different speeds of change by
learning into the background any pixel, whose color in the
current frame resembles the colors that this pixel often
has. So no changes, periodic changes or changes that
occurred in the distant past lead to pixels that are
considered background. To do so, a number of weighted
Gaussians model the appearance of different colors in each
pixel. The weights indicate the amount of time the modeled
color is active in that particular pixel. The mean is a
three dimensional vector indicating the color modeled for
that pixel, while the covariance matrix indicates the extend
around the mean that a color of that pixel is to be
considered as similar to the one modeled. Colors in any
given pixel similar to that modeled by any of the Gaussians
of that pixel lead to an update of that Gaussian, an
increase of its weight and a decrease of all the weights of
the other Gaussians of that pixel. Colors not matching any
of the Gaussians of that pixel lead to the introduction of a
new Gaussian with minimum weight. Hence the possible updates
of the weight of the i-th Gaussian of the pixel located at
(x, y) at time t are

where is the learning rate. a
Some variations of the Stauffer algorithm found in the
literature deal with the way covariance is represented
(single value, diagonal of full matrix) and the way the mean
and covariance of the Gaussians are updated. Some further
variations of the algorithm address the way the foreground
information is represented. The original algorithm and most
of the modifications lead to a binary decision for each
pixel: foreground or background. In the Pixel Persistence
Map (PPM) is used instead. This is a map of the same
dimension as the frames with a value at each location (x, y)
equal to the weight of the Gaussian matching the current
color of the pixel at (x, y). Small PPM values indicate
foreground objects, while large indicate background. The
foreground/background threshold is left unspecified though.
The drawback of all the existing variations of Stauffer’s
algorithm is that stationary foreground objects tend to fade
in the background with rate. Small rates fade foreground
objects slowly, but are also slow in adapting to the
background changes, like the motion of a chair. Large rates
favor background adaptation but tend to fade a target into
the background when it stops. This fading progressively
destroys the region of the tracked object, deforms its
perceived shape and finally leads to loosing track of the
object altogether. When the target resumes moving,
foreground pixels will be marked only at the locations not
previously occupied by the stationary target. When the
target has fairly uniform coloration, this can lead to track
loss even in the presence of movement. a
We propose a feedback tracking architecture in order to
addresses these problems. The threshold PPM serves as target
evidence to the Kalman tracker. The state of the Kalman
tracker contains the ellipse that describes every target.
The learning rate is modified in elliptical regions around
these targets. Thus instead of a constant value, a spacio-temporal
adaptation of the learning rate is used:

This delays fading of the targets and depending on the
selection of the small learning rate and the motion of the
targets can be sufficient. In some cases though where
targets stay put for very long periods, even the small
learning rate will gradually fade them into the background.
If this starts happening (the target becomes smaller while
its mobility is small), the normal weight update mechanism
of (1) is bypassed. The weight of the current Gaussian is
decreased and that of all the rest is increased with a rate
that is inversely proportional to the mobility of the
target, as this is estimated from the state of the Kalman
tracker for this particular target. This fading prevention
mechanism is not always in effect; it is only activated when
targets are small and rather immobile, since the tampering
of the weights is very forceful and affects the whole
elliptical disk around the target, regardless if the pixel
is actually foreground or not.
The second major proposed modification of Stauffer’s
algorithm addresses extreme flickering situations often
encountered in night vision cameras. In such scenes the PPM
needs to be bounded by a very low threshold in order not to
consider flickering pixels as foreground. The threshold on
the other hand tends to discard actual foreground pixels as
well. The proposed solution is to adapt the threshold T in a
spacio-temporal fashion similar to the learning rate.

This way flickering pixels are avoided far from the
targets, while the targets themselves are not affected. The
penalty of this strategy is the delayed detection of new
very small targets.
These proposed feedback mechanisms on the learning rate lead
to robust foreground regions regardless of the flickering in
the images or the lack of target mobility, while they do not
affect the adaptation of the background around the targets.
When such flickering and mobility conditions occur, the
resulting PPM is more suitable for target region forming
that the original version of. The forming of target regions
is the goal of the measurement module, detailed next.
Measurement Module
The measurement module finds foreground segments, assigns
them to known targets or initializes new ones and checks
targets for possible merging or splitting. The information
for new targets or targets to be updated is passed to the
Kalman module.
The measurement process begins by processing the
adaptively thresholded PPM to obtain foreground segments.
This involves shadow detection based on, dilation, filling
of any holes in the segments and erosion. The obtained
segments are checked for possible merging based on their
Mahalanobis distance and are further considered only if they
are large enough. These segments are associated to targets
based on their Mahalanobis distance from the targets.
Non-associated segments generate new target requests to the
Kalman module.
The targets are subsequently checked for possible merging
based on how similar they are. Since we are using a Kalman
tracker, the targets are described by two-dimensional
Gaussians. If two such Gaussians are too similar, the
targets are merged. Finally, very large targets are checked
for splitting. This is necessary as, for example, two
monitored people can be walking together and then separate
their tracks. Splitting is performed using the k-means
algorithm on the pixels of the foreground segment comprising
the target. Two parts are requested from the k-means
algorithm. These parts are subsequently checked to determine
if they are distinct. For this, the minimum Mahalanobis
distance of the one with respect to the other is used. If
the two parts are found distinct, then they form two
targets. The one part of the foreground evidence is used to
update the existing target, while the other part is used to
request a new target from the Kalman tracker.
Kalman Tracking Module
The Kalman module maintains the states of the targets. It
creates new targets should it receive a request from the
measurement module and performs measurement update based on
the foreground segments associated to the targets. The
states of the targets are fed back to the adaptive
background module to adapt the learning rate and the
threshold for the PPM binarization.
Every target is approximated by an elliptical disc, i.e.
can be described by a single Gaussian. This facilitates the
use of a Kalman tracker. The target states are
seven-dimensional; they comprise of the mean of the Gaussian
describing the target (horizontal and vertical components),
the velocity of the mean (horizontal and vertical
components) and the three independent terms of the
covariance matrix.
The prediction step uses a loose dynamic model of
constant velocity for the update of the mean position and
velocity. As for the update of the three covariance terms,
their exact model is non-linear, hence cannot be used with
the Kalman tracker; instead of using linearization and an
extended Kalman tracker, the covariance terms are modeled as
constant. The variations of the velocity and the covariance
terms are permitted by the state update variance term. This
loose dynamic model permits arbitrary movement of the
targets. It is very different to the more elaborate models
used for tracking aircraft. Aircraft can perform a limited
set of maneuvers that can be learned and be expected by the
tracking system. Further, flying aircraft can be modeled as
rigid bodies thus strict and multiple dynamic models are
appropriate and have been used extensively in Interacting
Multiple Model Kalman trackers. Unlike aircraft, street
vehicles and especially humans have more degrees of freedom
for their movement which includes apart from speed and
direction changes obstacles arbitrarily, rendering the
learning of a strict dynamic model impractical. A strict
dynamic model in this case can mislead a tracker to a
particular track even in the presence of contradicting
evidence.

• Acoustic Source Localization
Estimation of Direction Of Arrival
In AIT’s ASL system, collection of audio data is performed
using a total of 80 microphones located in different
locations inside the acoustic enclosure and organized in
different topologies. More analytically, there is a 64
channel linear microphone array and four smaller clusters of
microphones, each containing four microphones. Each of the
microphone clusters has the microphones organized in an
inverted T topology

As noted localization of speakers is generally dealt with
the estimation of the DOA of the acoustic source by means of
time delay estimation (TDE) algorithms. Estimation of DOA
essentially provides us with the direction from which sound
is arriving from. Typically, audio data is collected in
frames so that the current TDE estimate can be provided.
Combination of several DOAs can then provide us with the
actual source position.
The practical and, in many ways, severely restricting
disadvantage of traditional methods for TDE is that if the
system is used in reverberant environments, the returned
estimate could be a spurious delay created by the ensuing
reflections. For the purposes of our system, we have
proposed a new mathematical framework that resolves to great
amount the reverberation issues and generates robust
estimations. It is thus of interest to briefly investigate
the used model.
Consider two of the microphones with a distance d between
them. The sound source is assumed to be in the far field of
the array. For the case in which the environment is
non-reverberant, the assumption of a single source leads to
the following discrete-time signal being recorded at the mth
microphone (where m = 1, 2):

where _m denotes the time in samples that it takes for
the source signal to reach the mth microphone and nm(k) is
the respective additive noise (assumed to be zero mean and
uncorrelated with the source signal). The overall geometry
of the corresponding system can be seen in Fig. 2. Without
loss of generality this considers m1 to be the reference
microphone i.e. ô1 = 0. The delay at m2 is then the relative
delay between the two recorded signals and thus the
relationship between the microphone signals is reduced to
x1(k) = x2(k − ô2). The corresponding DOA angle è is defined
with respect to the broadside of the array and connected
with any delay ô as:

where fs is the sampling frequency, and c is the speed of
sound (typically defined as 343 m/s). Ôhus, DOA estimation
methods rely on successful estimation of the TDE ô .

However, in a real reverberant environment, each of the
microphone recordings are a result of a convolution operator
between the speech signal and a reverberant impulse response
of significant length (depending on the reverberation
level). In order to overcome the problems introduced by
reverberation we make use of the concept of mutual
information (MI) by tailoring it appropriately to the
tracking of an acoustic source. A review of the concept can
be found in the work of Bell et al.
Most of the DOA estimation techniques are required to
operate in real time. We must, therefore, assume that data
at each sensor m are collected over t frames xmt = [xm(tL),
xm(tL+1), . . . , xm(tL+L−1)] of L samples. Since the
analysis will be independent of the data frame, we can drop
t to express frames simply as xm for any t. In the context
of our model, and for any set of frames, we may then write:

where xm(ô) denotes a delayed version of xm by ô samples.
Thus, the problem is to estimate the correct value of and
the DOA by processing two frames x1 and x2(ô) only. If we
were to neglect reverberation, only a single delay is
present in the microphone signals. Thus, measurement of
information contained in a sample l of x1 is only dependent
on the information contained in sample l −ô of x2(ô). In the
case of the reverberant model though information contained
in a sample l of x1 is also contained in neighboring samples
of sample l−ô of x2(ô) due to the fact that the model is now
convolutive. The same logical argument applies to the
samples of x2(ô). In order to estimate the information
between the microphone signals, we use the marginal MI that
considers jointly N neighboring samples and can be
formulated as follows for the case where the recordings
exhibit Gaussian behavior:

with the joint covariance matrix:

If N is chosen to be greater than zero, the elements of
C(ô) are themselves matrices. In fact, for any value of N,
the size of C(ô) is always 2(N + 1) × 2(N + 1). For the
purposes of the present work, we call N the order of the
tracking system. When C(ô) reaches a maximum as a function
of at a specific time shift ô , then there is at this point
a joint process with a maximum transport of information
between x1 and x2(ô). According to the presented
information-theoretical criterion, this is the delay that
synchronizes the two recordings. In the context of DOA, this
delay returns the correct angle è, at which the signal
coincides with the microphone array. The last step in the
ASL process is the combination of several DOA estimates, in
order to get the actual 2D coordinates of the speaker. This
is performed by calculating the crossing points between the
lines defined by the estimated DOAs. In most cases these
lines cross in more than one point and thus, the speaker
lies within some area defined by these points. The speaker
position is found by, employment of a closed-form source
location estimator as presented in the following paragraph.
In general this type of estimators represents a tremendous
computational saving over other exhaustive search methods.
Source Localization
The estimation of a DOA from a pair of microphones and the
corresponding angle è cannot by itself determine the speaker
location in space. For this we need to combine information
from a set of DOAs from different pairs in the enclosure. In
the following, we describe the method used for fusing
information from a set of microphone pairs. The analysis is
restricted in the 2D case.
Suppose we have employed m microphones in the acoustic
enclosure, each of them placed at a geometric location rm =
[Xm, Ym]. Let’s also assume that we organize the receivers
in P pairs. We define the estimated DOA angle of the pair
containing the ith and jth microphones as èij . Thus, after
the DOA estimation is completed we obtain P angle values.
Given a pair of microphones ri and rj (assuming far-field
conditions), we can define the line crossing from the
mid-point of the microphone pair and the estimated source
location as a function of the derived angle èij as:

where:

In the above, brij = [Xij, Yij] is the geometric
mid-point of microphones at ri and rj . In real systems the
location estimate is most often different than the actual
position of the source due to noise, interfering acoustic
sources, reverberation and the fact that the sources do not
exhibit omnidirectional characteristics. Thus, even though
in ideal conditions the lines would cross at a single point,
in real environments we have a set of crossing points
defining an area within which there are a series of
candidate source locations. Most often the source location
is derived by operating on the line equations according to
some adaptive or closed-form criterion. An alternative
approach is the operation upon the crossing points of these
lines. In this case, localizing the acoustic source in 2D
first requires the derivation of the line equations for the
pairs of microphones for which DOA estimation is performed.
In the sequel, the set of all crossing points between all
lines is derived. The total number of crossing points is a
function of P.
The problem is then to choose an appropriate filtering
mechanism that accepts the remaining crossing points as an
input and returns a source location estimate. For the
purposes of the present work we apply a median filter upon
the crossing points. Thus, the acoustic source estimate bs
for each of the L frames is given as:

where u is the set of the derived crossing points. To
assist the localization process further, all crossing points
outside the enclosure dimensions can be neglected prior to
filtering.

• Blind Source Separation
System for Blind Source Separation
Consider a convolutive BSSD problem, where two source
signals are recorded by two microphones after being mixed by
a set of reverberant acoustic impulse responses

Using a frequency-domain approach, the convolutive
mixture is transformed to a set of instantaneous mixtures
where, at a given frequency, the microphone signals can be
written:

where x = [x1, x2]T is the vector of microphone signals,
u = [u1, u2]T is the vector of source signals, A is the 2 x
2 matrix of acoustic transfer functions (ATFs) and [•]T
denotes matrix transpose. Note that all signals and transfer
functions are functions of frequency, although to simplify
notation we suppress the explicit dependence on frequency.
Define A(s,m) as the acoustic transfer function (ATF) from a
source located at s to a microphone located at m. Let the
position of the first and second sources, respectively, be
s1 and s2. Similarly, let the position of the first and
second microphone bem1 andm2, respectively. Thus, the (n,
k)th element of A is given by [A]n,k = A(sk,mn). The
microphone signals are then filtered by the 2x2 unmixing
matrix H to give the output signals:

where y = [y1, y2]T is the vector of output signals. Let
[H]l,n = Hln denote the unmixing filter from the microphone
at mn to the lth output.
The aim of any BSSD algorithm is to design the unmixing
matrix H such that y1 and y2 are “separated”. Typical
criteria to achieve this are minimization of mutual
information, entropy maximization, maximum likelihood and
latent source models and central limit theorem.
Statistical Room Acoustics
Our aim in this paper is to provide some analysis of
expected performance of BSSD algorithms in reverberant
rooms, so we require a parsimonious model for the ATFs that
is amenable to theoretical analysis. As the behaviour of
reverberation in rooms is too complicated to model directly,
we instead draw on the classical tools of statistical room
acoustics (SRA) to provide this model.
SRA assumes that the energy density of the reverberant
sound waves inside an enclosure can be considered as the sum
of contributions from an infinite number of plane waves that
arrive from random directions of propagation and contribute
equally to the space-averaged energy density. In such a
reverberant room and at a microphone position m, the sound
pressure due to a source located at s can be written as

where pd(s,m) is the pressure due to the direct sound,
and pr(s,m) is the pressure due to the reverberant field
which is assumed diffuse.
For this model to be accurate, the following set of
conditions must be satisfied:
- The dimensions of the room have to be chosen
relatively large when compared to the size of the
wavelength.
- The average spacing of the resonance frequencies
must be smaller than one-third of their bandwidth. This
is ensured if all frequencies exceed the Schroeder
frequency for large rooms, given by:

where V is the room volume in m3 and T60 is the
reverberation time in seconds.
- The source and the receiver must be at least one
halfwavelength away from the walls.
Additionally, SRA assumes that the direct
pressure and reverberant pressure are uncorrelated so that,

where E{•} denotes ensemble averaging over all source and
receiver positions within a room. In the validity of this
assumption is verified by comparison of a series of
simulations. As with the sound pressure, SRA enables us to
also write the ATFs as the sum of direct-path and
reverberant-path terms, i.e.,

where the direct-path term is

and k = 2ðf/c is the wavenumber and k k is the vector 2-
norm.
The reverberant-path terms satisfy the property

where * denotes complex conjugation, sinc(x) , sin x/x, S
is the wall surface area of the room, and á is the average
absorption coefficient of the walls. The room parameters are
related by the Sabine equation


• SAD (Speech Activity Detection)
LDA Applied to MFCC Parameters
Mel Frequency Cepstral Coefficients (MFCC) are the dominant
features used for speech recognition. They are obtained by
taking the inverse Fourier transform of the log spectrum
after it is wrapped according to a nonlinear scale that is
motivated by properties of human hearing, the Mel scale. It
was shown in our experiments that the addition of the first
and second derivatives of the MFCC as well as of the energy
of each preprocessed frame enhances the performance of the
algorithm. LDA is a method that seeks directions that are
efficient for discrimination. In the case of SAD there are
two classes to be discriminated, speech and non speech. The
criterion function that we are seeking to maximize in LDA is
the following:

SB is the between scatter matrix and is a measure
of the separation of the means of the clusters. SW, which is
called the within cluster matrix, is a measure of the
spread of the clusters. The maximization problem reduces to
a general eigenvalue one:

from which the projecting vector is found.
Energy Based Adaptive Algorithm
The EBA algorithm relies on an estimation of the
instantaneous SNR for the distinction of speech and non
speech segments. It is based on “Murphy’s algorithm”. The
algorithm computes the SNR for the i-th frame as follows:

where fe[i] is the energy of the i-th frame and nf[i] its
noise floor. The noise floor is calculated with the aid of
the smoothed frame energy contour, se[i]. The frame energy
contour is computed by calculating the energy of a frame
that has been smoothed by a window function. The heuristics
for the estimation of the noise floor energy are the
following:
- Increase slightly the noise floor estimate: nf [i] =
a ⋅ nf [i −1] , where á is slightly greater than 1.
- Select the minimum value between the estimated noise
floor and the smoothed energy contour: nf [i] = min(nf [i],
se[i]) .
- If the noise floor becomes greater than twice the
smoothed frame energy contour it is assigned the value
of the floor energy divided by two.

This operation accounts for the speech to silence
transitions.
Various thresholds are employed in order for the system to
also take into account transitions from speech to silence
and vice versa. More specifically if SNR[i] is greater than
a pre-specified threshold, SNR_THRESH, then a counter is
incremented; otherwise it is decremented. When the counter
is incremented past a threshold, SC_THRESH, the current
frame as well as a number of previous frames (the number is
specified by another user specified parameter) are labeled
as speech. This labeling accounts for the silence to speech
transition. If the counter is greater than SC_THRESH then
the current frame is considered speech. Note also that
another threshold, SC_MAX, imposes an upper bound of the
value of the counter and consequently is a limit for
consecutive speech segments.
A backtrack operation is triggered when the current SNR is
less than SNR_THRESH and the counter equals SC_THRESH. This
operation accounts for the speech to silence transition. As
long as the counter is less than SC_THRESH the current frame
is labeled as non speech. Median filtering is performed as
the final step so that spurious transitions smooth out.
Proposed Fused Algorithm
The following figures illustrate the block diagrams of the
training and testing schemes of the algorithm. For the stand
alone LDA algorithm the training sequence is first compared
with a reference file and separated to speech and non speech
segments. In our case the training sequence is first passed
through the EBA algorithm which separates the frames to
speech and non speech ones. Both speech and non speech audio
streams are segmented into overlapping frames of a given
length. The same applies to the testing sequence.
Note that the separation with the EBA algorithm contains
errors (unlike the first case where the separation is based
on a reference file that contains the timestamps of speech
and non speech periods). However it is desirable to perform
this step in the training process since the resulting
sequence that will be then separated to speech and
non-speech segments for LDA training is more representative
of the data that are tested with the LDA classifier in the
corresponding testing step.
The EBA Algorithm is fine tuned in such a way so that it
is unbalanced, meaning that there is a considerable
difference between speech and non speech detection error
rate. After this preprocessing stage, the class with the
least misclassifications discards its frames. The frames of
the other class are compared with a reference file and are
separated to the true speech and non speech classes. The
MFCC are calculated from the two segments and finally LDA is
applied in order to obtain a projecting vector and a
threshold.

In the testing scheme the framed sequence is initially
passed through the EBA algorithm. Then the frames that were
categorized as the class that was not taken into account in
the training after preprocessing are labeled as that class
and are not considered until the median filtering operation.
From the remaining frames the MFCC are calculated. The
processed frames are then projected in the direction of the
eigenvector found in the training scheme. A decision is made
on these frames based on where the projected data lie
relative to a threshold. Finally all the decisions are
passed through a median filter.
|