Autonomic and Grid Computing
Prototype Systems & Services

The following systems have been developed in the AIT Smart Lab:

Based on these systems we have also developed the following ambient context-aware services:
  • The Memory Jog providing assistance (e.g., memory aids, pertinent information) in the scope of small meetings, lectures and seminars occurring in the smart room [Click to see video / images 1, 2]
  • The e-Director, an intelligent camera-man, that automatically chooses the camera (among the possible five in the room) that provides the optimal view of a speaker [Click to see video / image]

A major part of this work has been carried out in the scope of the Computers in the Human Interaction Loop (CHIL) project (http://chil.server.de). A video about AIT activities in CHIL is also available [Click here].

 

 

Face Detection

The face detection part of the AIT-FACER identifies those image segments that most resemble faces, based on color and geometry. From a systems perspective the face detector is designed and trained to have low probability of miss, but rather high probability of false detection. Thus the face detector accepts any reasonably large, round, skin-like colored segments as face candidates. Then the subsystems, attempt to select the frontal faces out of all these candidate face segments. A second design consideration has been speed: this module needs to operate very fast by itself and also help minimize the effort required by the subsequent processing. Since the direct application of skin-color segmentation is computationally expensive, detection begins with foreground segmentation; skincolor segmentation follows on the foreground segments.

Foreground in general is identified by using the difference from a background image. The background image can be either static or the result of adaptive estimation. In the system that we present we have not used any adaptive scheme for background estimation. This is not a limitation of the system since for many in-doors applications the background and illumination remain fairly constant. Thus an empty room image is utilized as background and subsequently the colors of the foreground are normalized. We now describe the algorithm we use for foreground estimation in detail. The first subtract the empty room image, sum the RGB channels and binarize the result. We subsequently perform a median filtering operation on 8x8 pixel blocks is performed, in order to produce solid foreground segments. This is much faster than equivalent morphological operations.

The next step is color normalization, which is used to minimize the effects of shadows on a frame level. The preferred and visibly better way is Gamma correction, but a faster solution is needed for our real-time system. Thus we set the brightness of the foreground segment at 95%. The original brightness information is also kept and used in the identified skin segments at later stages of the face detector.

Skin color segmentation is applied on the color-normalized foreground segments by using a human skin color model to produce the skin likelihood map of the foreground. The likelihood map is then segmented into skin segments; some heuristics discard those segments that are very unlikely to be human faces.
The color model we use is based on the skin color and non-skin color histograms of modified with conditional skin probability scaling. The ratio of the skin over non-skin probabilities is scaled by skin probability if the latter is small. In logarithmic scale, the used log-likelihood L(r,g,b) is given by


 

The above proposed modification decreases the likelihood that a pixel is characterized as skin when it is not probable that it is, and as a result better segmentation is achieved. The likelihood map L(r,g,b) is binarized: pixels with L(r,g,b) > -.75 pixels take the value 1 (skin color); the rest take the value 0. Then, we use a median filtering on 11-by-11 pixel neighborhoods with similar advantages to the foreground application (smooth skin map, much faster than equivalent morphological operations).

Next we identify regular and connected skin color segments and discard irregular or very small such segments. First, the different segments become connected in the skin map by using 8-way connectivity. Subsequently, the bounding boxes of the segments are identified and boxes with small area (<0.2% of the frame area) are discarded: their resolution is too low for recognition.

Subsequently, we choose segments with face-like elliptical aspect ratios. We assign more particles for pixels that are more skin-like and apply Principal Component Analysis (PCA) on the scatter matrix of the particle coordinates: a) the principal axes of the particle cloud covering the region are used as a rough estimate of the rotation of the face and used along with the eye detector the eigenvalues are used to estimate the elliptical aspect ratio of the region. Regions that are too oblong to correspond to faces are discarded.

Finally, we stress that by performing PCA on the particle cloud(s) we are able to obtain not only the elliptical aspect ratio, but also a rough estimation of the angle of rotation of the segment. Although not very accurate, this estimate proves particularly useful for robust eye detection.


 

 

Eye Detection

Detecting the location of the eyes in a facial image is of paramount importance for an automated face recognizer. If we can identify the eyes and their location reliably, we can perform necessary normalizations in terms of shift, scale and rotation. The eye detector we developed at AIT as part of the AIT-FACER system operates in two stages: First the eye zone (eyes and bridge of the nose area) is detected in the face candidate segments. As a second stage, we detect the eyes in the identified eye zone. The eye zone is identified in the segment using a technique similar to. First the vertical position of the zone is determined by examining the mean brightness of every row in the segment. Two peaks are detected in the upper part of the segment: they correspond to the forehead and upper cheek regions of the face. The center of the zone is set between the two peaks as follows: If n b , n= 0,…p is the averaged brightness between the two peaks, with 0 being the index of the upper and p that of the lower peak and m b , m,(0,p) is the minimum averaged brightness, then we find indices k such that

Denoting the set of such indices by K, the center c of the zone is the weighted average

The zone is further extended until the averaged brightness exceeds 1.2 times bm. This approach works well for approximately horizontally located eyes. For rotated segments we use the rough estimation of the rotation angle from the Face Detector subsystem to de-rotate them sufficiently to locate the eyes.

If the face had been very accurately segmented, the vertical position of the eye zone would have been adequate. Unfortunately, the hair of the temples and sometimes a bit of the background are present in the face segments. These parts are usually fairly dark and can become particularly troublesome for locating the eyes for this reason; we introduce a new algorithm to also define the eye zone horizontally.

The horizontal eye location algorithm starts by reducing the effect of shadows at the local (segment) level. We introduce a new method that subtracts the best linear fit of brightness variation from the segment. Since brightness is affected both by lighting and by the color properties of the objects (e.g. hair over part of forehead), its value in the whole segment cannot be used directly. Instead, the horizontal component of brightness, averaged across each column in the (vertically defined) eye zone of each segment, is used to estimate the best linear fit and subtract it from the face segment.

Subsequently, the horizontal extent of the eye zone is identified in steps. First, the brightness in the eye zone is averaged across its columns. All the brightness peaks are then detected: the ones close to the center correspond to the bridge of the nose. A zone with decreasing or roughly constant brightness around the leftmost and the rightmost of these peaks is found. The center of this zone defines the center of the face, while its extent defines the face width. The eye zone is extended beyond the leftmost and rightmost peaks as long as brightness remains roughly constant.

Having estimated the vertical and horizontal position of the eye zone, we use particles to sample its brightness distribution and identify the most probable position of the eye pupils. More particles are assigned to pixels with less brightness; the mean of the particle distributions is found using 2-class k-means clustering. This approach works in practice because the bridge of the nose is usually quite bright and serves as a boundary between two distinct clouds of particles.

The variance of the two clusters is then estimated and if either of them is above a threshold, the cluster is re-clustered. This spreading is usually the effect of eyebrows, temples and dark corners of the eyes. Another 2-class k-means is used to identify two secondary clusters. Depending on whether the eye is the left or the right, the leftmost or the rightmost of the two secondary clusters is kept. The position of the eye pupil is then obtained as a weighted average between the original and the secondary cluster center. The relative weight of the secondary cluster and the re-clustering threshold are system parameters.

 

Face Recognition

All normalized segments are finally processed by an LDA classifier and an identity tag is attached to each one. There are 5 images per person available for training. In order to account for the inaccuracies of the eye-detector, we train the LDA classifier using not only the hand-annotated eye positions of the training images but also perturbations around the ideal positions. In particular, each eye position is allowed to vary within any of the eight neighboring positions in a 3-by-3 rectangular grid centered on the ideal position. The horizontal and vertical steps of the perturbation are 1.68% of the template eye distance. The RMS error introduced by all the 81 possible eye pair positions is 1.94% of the template eye distance. Although this is smaller that the expected RMS error of the eye detector, we have performed extensive testing to validate that this is a good choice for the perturbations. Every T seconds, a majority voting is performed amongst the identities. The votes are weighted using

Longer periods of T allow more frontal faces to be detected and accurately normalized, hence leading to reduced errors.

 

 

Person Tracking
The block diagram of the tracking system is shown in the figure below. It comprises three modules: adaptive background, measurement and Kalman filtering. The adaptive background module produces the foreground pixels of each video frame and passes this evidence to the measurement module. The measurement module associates the foreground pixels to targets, initializes new ones if necessary and manipulates existing targets by merging or splitting them based on an analysis of the foreground evidence. The existing or new target information is passed to the Kalman filtering module to update the state of the tracker, i.e. the position and size of the targets. The output of the tracker is the state information which is also fed back to the adaptive background module to guide the spacio-temporal adaptation of the algorithm. In the rest of the section, we present the three modules in detail.

Adaptive Background Module
The targets of the proposed system (vehicles and humans) are mostly moving. The changes in the video frames due to the movement are used to identify and segment the foreground (pixels of the moving targets) from the background (pixels without movement). If a background image were available, this segmentation is simply the difference of the current frame from the background image. The foreground pixels thus obtained are readily grouped into target regions. A static image of the empty scene viewed by the camera can be used for background. Unfortunately this is not practical and adaptive background approaches are adopted primarily for two reasons: First, such an empty scene image might not be available due to system setup. Secondly and most importantly, background (outdoors and indoors) also changes: Natural light conditions change slowly as time goes by; the wind causes swaying movements of flexible background object (e.g. foliage); fluorescent light flickers at the power supply frequency; objects on tabletops and small pieces of furniture are rearranged and projection areas display different content. All these changes need to be learnt into an adaptive background model.

Stauffer’s adaptive background algorithm is capable of learning such changes with so different speeds of change by learning into the background any pixel, whose color in the current frame resembles the colors that this pixel often has. So no changes, periodic changes or changes that occurred in the distant past lead to pixels that are considered background. To do so, a number of weighted Gaussians model the appearance of different colors in each pixel. The weights indicate the amount of time the modeled color is active in that particular pixel. The mean is a three dimensional vector indicating the color modeled for that pixel, while the covariance matrix indicates the extend around the mean that a color of that pixel is to be considered as similar to the one modeled. Colors in any given pixel similar to that modeled by any of the Gaussians of that pixel lead to an update of that Gaussian, an increase of its weight and a decrease of all the weights of the other Gaussians of that pixel. Colors not matching any of the Gaussians of that pixel lead to the introduction of a new Gaussian with minimum weight. Hence the possible updates of the weight of the i-th Gaussian of the pixel located at (x, y) at time t are

where is the learning rate. a

Some variations of the Stauffer algorithm found in the literature deal with the way covariance is represented (single value, diagonal of full matrix) and the way the mean and covariance of the Gaussians are updated. Some further variations of the algorithm address the way the foreground information is represented. The original algorithm and most of the modifications lead to a binary decision for each pixel: foreground or background. In the Pixel Persistence Map (PPM) is used instead. This is a map of the same dimension as the frames with a value at each location (x, y) equal to the weight of the Gaussian matching the current color of the pixel at (x, y). Small PPM values indicate foreground objects, while large indicate background. The foreground/background threshold is left unspecified though.

The drawback of all the existing variations of Stauffer’s algorithm is that stationary foreground objects tend to fade in the background with rate. Small rates fade foreground objects slowly, but are also slow in adapting to the background changes, like the motion of a chair. Large rates favor background adaptation but tend to fade a target into the background when it stops. This fading progressively destroys the region of the tracked object, deforms its perceived shape and finally leads to loosing track of the object altogether. When the target resumes moving, foreground pixels will be marked only at the locations not previously occupied by the stationary target. When the target has fairly uniform coloration, this can lead to track loss even in the presence of movement. a

We propose a feedback tracking architecture in order to addresses these problems. The threshold PPM serves as target evidence to the Kalman tracker. The state of the Kalman tracker contains the ellipse that describes every target. The learning rate is modified in elliptical regions around these targets. Thus instead of a constant value, a spacio-temporal adaptation of the learning rate is used:

This delays fading of the targets and depending on the selection of the small learning rate and the motion of the targets can be sufficient. In some cases though where targets stay put for very long periods, even the small learning rate will gradually fade them into the background. If this starts happening (the target becomes smaller while its mobility is small), the normal weight update mechanism of (1) is bypassed. The weight of the current Gaussian is decreased and that of all the rest is increased with a rate that is inversely proportional to the mobility of the target, as this is estimated from the state of the Kalman tracker for this particular target. This fading prevention mechanism is not always in effect; it is only activated when targets are small and rather immobile, since the tampering of the weights is very forceful and affects the whole elliptical disk around the target, regardless if the pixel is actually foreground or not.

The second major proposed modification of Stauffer’s algorithm addresses extreme flickering situations often encountered in night vision cameras. In such scenes the PPM needs to be bounded by a very low threshold in order not to consider flickering pixels as foreground. The threshold on the other hand tends to discard actual foreground pixels as well. The proposed solution is to adapt the threshold T in a spacio-temporal fashion similar to the learning rate.

This way flickering pixels are avoided far from the targets, while the targets themselves are not affected. The penalty of this strategy is the delayed detection of new very small targets.
These proposed feedback mechanisms on the learning rate lead to robust foreground regions regardless of the flickering in the images or the lack of target mobility, while they do not affect the adaptation of the background around the targets. When such flickering and mobility conditions occur, the resulting PPM is more suitable for target region forming that the original version of. The forming of target regions is the goal of the measurement module, detailed next.

Measurement Module
The measurement module finds foreground segments, assigns them to known targets or initializes new ones and checks targets for possible merging or splitting. The information for new targets or targets to be updated is passed to the Kalman module.

The measurement process begins by processing the adaptively thresholded PPM to obtain foreground segments. This involves shadow detection based on, dilation, filling of any holes in the segments and erosion. The obtained segments are checked for possible merging based on their Mahalanobis distance and are further considered only if they are large enough. These segments are associated to targets based on their Mahalanobis distance from the targets. Non-associated segments generate new target requests to the Kalman module.

The targets are subsequently checked for possible merging based on how similar they are. Since we are using a Kalman tracker, the targets are described by two-dimensional Gaussians. If two such Gaussians are too similar, the targets are merged. Finally, very large targets are checked for splitting. This is necessary as, for example, two monitored people can be walking together and then separate their tracks. Splitting is performed using the k-means algorithm on the pixels of the foreground segment comprising the target. Two parts are requested from the k-means algorithm. These parts are subsequently checked to determine if they are distinct. For this, the minimum Mahalanobis distance of the one with respect to the other is used. If the two parts are found distinct, then they form two targets. The one part of the foreground evidence is used to update the existing target, while the other part is used to request a new target from the Kalman tracker.

Kalman Tracking Module
The Kalman module maintains the states of the targets. It creates new targets should it receive a request from the measurement module and performs measurement update based on the foreground segments associated to the targets. The states of the targets are fed back to the adaptive background module to adapt the learning rate and the threshold for the PPM binarization.

Every target is approximated by an elliptical disc, i.e. can be described by a single Gaussian. This facilitates the use of a Kalman tracker. The target states are seven-dimensional; they comprise of the mean of the Gaussian describing the target (horizontal and vertical components), the velocity of the mean (horizontal and vertical components) and the three independent terms of the covariance matrix.

The prediction step uses a loose dynamic model of constant velocity for the update of the mean position and velocity. As for the update of the three covariance terms, their exact model is non-linear, hence cannot be used with the Kalman tracker; instead of using linearization and an extended Kalman tracker, the covariance terms are modeled as constant. The variations of the velocity and the covariance terms are permitted by the state update variance term. This loose dynamic model permits arbitrary movement of the targets. It is very different to the more elaborate models used for tracking aircraft. Aircraft can perform a limited set of maneuvers that can be learned and be expected by the tracking system. Further, flying aircraft can be modeled as rigid bodies thus strict and multiple dynamic models are appropriate and have been used extensively in Interacting Multiple Model Kalman trackers. Unlike aircraft, street vehicles and especially humans have more degrees of freedom for their movement which includes apart from speed and direction changes obstacles arbitrarily, rendering the learning of a strict dynamic model impractical. A strict dynamic model in this case can mislead a tracker to a particular track even in the presence of contradicting evidence.


 

 

Acoustic Source Localization

Estimation of Direction Of Arrival
In AIT’s ASL system, collection of audio data is performed using a total of 80 microphones located in different locations inside the acoustic enclosure and organized in different topologies. More analytically, there is a 64 channel linear microphone array and four smaller clusters of microphones, each containing four microphones. Each of the microphone clusters has the microphones organized in an inverted T topology

As noted localization of speakers is generally dealt with the estimation of the DOA of the acoustic source by means of time delay estimation (TDE) algorithms. Estimation of DOA essentially provides us with the direction from which sound is arriving from. Typically, audio data is collected in frames so that the current TDE estimate can be provided. Combination of several DOAs can then provide us with the actual source position.

The practical and, in many ways, severely restricting disadvantage of traditional methods for TDE is that if the system is used in reverberant environments, the returned estimate could be a spurious delay created by the ensuing reflections. For the purposes of our system, we have proposed a new mathematical framework that resolves to great amount the reverberation issues and generates robust estimations. It is thus of interest to briefly investigate the used model.

Consider two of the microphones with a distance d between them. The sound source is assumed to be in the far field of the array. For the case in which the environment is non-reverberant, the assumption of a single source leads to the following discrete-time signal being recorded at the mth microphone (where m = 1, 2):

where _m denotes the time in samples that it takes for the source signal to reach the mth microphone and nm(k) is the respective additive noise (assumed to be zero mean and uncorrelated with the source signal). The overall geometry of the corresponding system can be seen in Fig. 2. Without loss of generality this considers m1 to be the reference microphone i.e. ô1 = 0. The delay at m2 is then the relative delay between the two recorded signals and thus the relationship between the microphone signals is reduced to x1(k) = x2(k − ô2). The corresponding DOA angle è is defined with respect to the broadside of the array and connected with any delay ô as:

where fs is the sampling frequency, and c is the speed of sound (typically defined as 343 m/s). Ôhus, DOA estimation methods rely on successful estimation of the TDE ô .

However, in a real reverberant environment, each of the microphone recordings are a result of a convolution operator between the speech signal and a reverberant impulse response of significant length (depending on the reverberation level). In order to overcome the problems introduced by reverberation we make use of the concept of mutual information (MI) by tailoring it appropriately to the tracking of an acoustic source. A review of the concept can be found in the work of Bell et al.

Most of the DOA estimation techniques are required to operate in real time. We must, therefore, assume that data at each sensor m are collected over t frames xmt = [xm(tL), xm(tL+1), . . . , xm(tL+L−1)] of L samples. Since the analysis will be independent of the data frame, we can drop t to express frames simply as xm for any t. In the context of our model, and for any set of frames, we may then write:
 

where xm(ô) denotes a delayed version of xm by ô samples. Thus, the problem is to estimate the correct value of and the DOA by processing two frames x1 and x2(ô) only. If we were to neglect reverberation, only a single delay is present in the microphone signals. Thus, measurement of information contained in a sample l of x1 is only dependent on the information contained in sample l −ô of x2(ô). In the case of the reverberant model though information contained in a sample l of x1 is also contained in neighboring samples of sample l−ô of x2(ô) due to the fact that the model is now convolutive. The same logical argument applies to the samples of x2(ô). In order to estimate the information between the microphone signals, we use the marginal MI that considers jointly N neighboring samples and can be formulated as follows for the case where the recordings exhibit Gaussian behavior:

with the joint covariance matrix:

If N is chosen to be greater than zero, the elements of C(ô) are themselves matrices. In fact, for any value of N, the size of C(ô) is always 2(N + 1) × 2(N + 1). For the purposes of the present work, we call N the order of the tracking system. When C(ô) reaches a maximum as a function of at a specific time shift ô , then there is at this point a joint process with a maximum transport of information between x1 and x2(ô). According to the presented information-theoretical criterion, this is the delay that synchronizes the two recordings. In the context of DOA, this delay returns the correct angle è, at which the signal coincides with the microphone array. The last step in the ASL process is the combination of several DOA estimates, in order to get the actual 2D coordinates of the speaker. This is performed by calculating the crossing points between the lines defined by the estimated DOAs. In most cases these lines cross in more than one point and thus, the speaker lies within some area defined by these points. The speaker position is found by, employment of a closed-form source location estimator as presented in the following paragraph. In general this type of estimators represents a tremendous computational saving over other exhaustive search methods.

Source Localization
The estimation of a DOA from a pair of microphones and the corresponding angle è cannot by itself determine the speaker location in space. For this we need to combine information from a set of DOAs from different pairs in the enclosure. In the following, we describe the method used for fusing information from a set of microphone pairs. The analysis is restricted in the 2D case.

Suppose we have employed m microphones in the acoustic enclosure, each of them placed at a geometric location rm = [Xm, Ym]. Let’s also assume that we organize the receivers in P pairs. We define the estimated DOA angle of the pair containing the ith and jth microphones as èij . Thus, after the DOA estimation is completed we obtain P angle values. Given a pair of microphones ri and rj (assuming far-field conditions), we can define the line crossing from the mid-point of the microphone pair and the estimated source location as a function of the derived angle èij as:

where:

In the above, brij = [Xij, Yij] is the geometric mid-point of microphones at ri and rj . In real systems the location estimate is most often different than the actual position of the source due to noise, interfering acoustic sources, reverberation and the fact that the sources do not exhibit omnidirectional characteristics. Thus, even though in ideal conditions the lines would cross at a single point, in real environments we have a set of crossing points defining an area within which there are a series of candidate source locations. Most often the source location is derived by operating on the line equations according to some adaptive or closed-form criterion. An alternative approach is the operation upon the crossing points of these lines. In this case, localizing the acoustic source in 2D first requires the derivation of the line equations for the pairs of microphones for which DOA estimation is performed. In the sequel, the set of all crossing points between all lines is derived. The total number of crossing points is a function of P.

The problem is then to choose an appropriate filtering mechanism that accepts the remaining crossing points as an input and returns a source location estimate. For the purposes of the present work we apply a median filter upon the crossing points. Thus, the acoustic source estimate bs for each of the L frames is given as:

where u is the set of the derived crossing points. To assist the localization process further, all crossing points outside the enclosure dimensions can be neglected prior to filtering.



 

Blind Source Separation

System for Blind Source Separation
Consider a convolutive BSSD problem, where two source signals are recorded by two microphones after being mixed by a set of reverberant acoustic impulse responses

Using a frequency-domain approach, the convolutive mixture is transformed to a set of instantaneous mixtures where, at a given frequency, the microphone signals can be written:

where x = [x1, x2]T is the vector of microphone signals, u = [u1, u2]T is the vector of source signals, A is the 2 x 2 matrix of acoustic transfer functions (ATFs) and [•]T denotes matrix transpose. Note that all signals and transfer functions are functions of frequency, although to simplify notation we suppress the explicit dependence on frequency. Define A(s,m) as the acoustic transfer function (ATF) from a source located at s to a microphone located at m. Let the position of the first and second sources, respectively, be s1 and s2. Similarly, let the position of the first and second microphone bem1 andm2, respectively. Thus, the (n, k)th element of A is given by [A]n,k = A(sk,mn). The microphone signals are then filtered by the 2x2 unmixing matrix H to give the output signals:

where y = [y1, y2]T is the vector of output signals. Let [H]l,n = Hln denote the unmixing filter from the microphone at mn to the lth output.

The aim of any BSSD algorithm is to design the unmixing matrix H such that y1 and y2 are “separated”. Typical criteria to achieve this are minimization of mutual information, entropy maximization, maximum likelihood and latent source models and central limit theorem.

Statistical Room Acoustics
Our aim in this paper is to provide some analysis of expected performance of BSSD algorithms in reverberant rooms, so we require a parsimonious model for the ATFs that is amenable to theoretical analysis. As the behaviour of reverberation in rooms is too complicated to model directly, we instead draw on the classical tools of statistical room acoustics (SRA) to provide this model.

SRA assumes that the energy density of the reverberant sound waves inside an enclosure can be considered as the sum of contributions from an infinite number of plane waves that arrive from random directions of propagation and contribute equally to the space-averaged energy density. In such a reverberant room and at a microphone position m, the sound pressure due to a source located at s can be written as

where pd(s,m) is the pressure due to the direct sound, and pr(s,m) is the pressure due to the reverberant field which is assumed diffuse.
For this model to be accurate, the following set of conditions must be satisfied:

  • The dimensions of the room have to be chosen relatively large when compared to the size of the wavelength.
  • The average spacing of the resonance frequencies must be smaller than one-third of their bandwidth. This is ensured if all frequencies exceed the Schroeder frequency for large rooms, given by:

where V is the room volume in m3 and T60 is the reverberation time in seconds.

  • The source and the receiver must be at least one halfwavelength away from the walls.

Additionally, SRA assumes that the direct pressure and reverberant pressure are uncorrelated so that,

where E{•} denotes ensemble averaging over all source and receiver positions within a room. In the validity of this assumption is verified by comparison of a series of simulations. As with the sound pressure, SRA enables us to also write the ATFs as the sum of direct-path and reverberant-path terms, i.e.,

where the direct-path term is

and k = 2ðf/c is the wavenumber and k k is the vector 2- norm.
The reverberant-path terms satisfy the property
 

where * denotes complex conjugation, sinc(x) , sin x/x, S is the wall surface area of the room, and á is the average absorption coefficient of the walls. The room parameters are related by the Sabine equation

 

SAD (Speech Activity Detection)

LDA Applied to MFCC Parameters
Mel Frequency Cepstral Coefficients (MFCC) are the dominant features used for speech recognition. They are obtained by taking the inverse Fourier transform of the log spectrum after it is wrapped according to a nonlinear scale that is motivated by properties of human hearing, the Mel scale. It was shown in our experiments that the addition of the first and second derivatives of the MFCC as well as of the energy of each preprocessed frame enhances the performance of the algorithm. LDA is a method that seeks directions that are efficient for discrimination. In the case of SAD there are two classes to be discriminated, speech and non speech. The criterion function that we are seeking to maximize in LDA is the following:

SB is the between scatter matrix and is a measure of the separation of the means of the clusters. SW, which is called the within cluster matrix, is a measure of the spread of the clusters. The maximization problem reduces to a general eigenvalue one:

from which the projecting vector is found.

Energy Based Adaptive Algorithm
The EBA algorithm relies on an estimation of the instantaneous SNR for the distinction of speech and non speech segments. It is based on “Murphy’s algorithm”. The algorithm computes the SNR for the i-th frame as follows:

where fe[i] is the energy of the i-th frame and nf[i] its noise floor. The noise floor is calculated with the aid of the smoothed frame energy contour, se[i]. The frame energy contour is computed by calculating the energy of a frame that has been smoothed by a window function. The heuristics for the estimation of the noise floor energy are the following:

  1. Increase slightly the noise floor estimate: nf [i] = a ⋅ nf [i −1] , where á is slightly greater than 1.
  2. Select the minimum value between the estimated noise floor and the smoothed energy contour: nf [i] = min(nf [i], se[i]) .
  3. If the noise floor becomes greater than twice the smoothed frame energy contour it is assigned the value of the floor energy divided by two.

This operation accounts for the speech to silence transitions.

Various thresholds are employed in order for the system to also take into account transitions from speech to silence and vice versa. More specifically if SNR[i] is greater than a pre-specified threshold, SNR_THRESH, then a counter is incremented; otherwise it is decremented. When the counter is incremented past a threshold, SC_THRESH, the current frame as well as a number of previous frames (the number is specified by another user specified parameter) are labeled as speech. This labeling accounts for the silence to speech transition. If the counter is greater than SC_THRESH then the current frame is considered speech. Note also that another threshold, SC_MAX, imposes an upper bound of the value of the counter and consequently is a limit for consecutive speech segments.

A backtrack operation is triggered when the current SNR is less than SNR_THRESH and the counter equals SC_THRESH. This operation accounts for the speech to silence transition. As long as the counter is less than SC_THRESH the current frame is labeled as non speech. Median filtering is performed as the final step so that spurious transitions smooth out.

Proposed Fused Algorithm
The following figures illustrate the block diagrams of the training and testing schemes of the algorithm. For the stand alone LDA algorithm the training sequence is first compared with a reference file and separated to speech and non speech segments. In our case the training sequence is first passed through the EBA algorithm which separates the frames to speech and non speech ones. Both speech and non speech audio streams are segmented into overlapping frames of a given length. The same applies to the testing sequence.

Note that the separation with the EBA algorithm contains errors (unlike the first case where the separation is based on a reference file that contains the timestamps of speech and non speech periods). However it is desirable to perform this step in the training process since the resulting sequence that will be then separated to speech and non-speech segments for LDA training is more representative of the data that are tested with the LDA classifier in the corresponding testing step.

The EBA Algorithm is fine tuned in such a way so that it is unbalanced, meaning that there is a considerable difference between speech and non speech detection error rate. After this preprocessing stage, the class with the least misclassifications discards its frames. The frames of the other class are compared with a reference file and are separated to the true speech and non speech classes. The MFCC are calculated from the two segments and finally LDA is applied in order to obtain a projecting vector and a threshold.

In the testing scheme the framed sequence is initially passed through the EBA algorithm. Then the frames that were categorized as the class that was not taken into account in the training after preprocessing are labeled as that class and are not considered until the median filtering operation. From the remaining frames the MFCC are calculated. The processed frames are then projected in the direction of the eigenvector found in the training scheme. A decision is made on these frames based on where the projected data lie relative to a threshold. Finally all the decisions are passed through a median filter.