Audio-Visual Person Tracking

A Practical Approach

By Fotios Talantzis, Aristodemos Pnevmatikakis & Antony G. Constantinides

(Latest update Sept. 12, 2012)

This book deals with the creation of the algorithmic backbone that enables a computer to perceive humans in a monitored space. This is performed using the same signals that humans process, i.e., audio and video. Computers reproduce the same type of perception using sensors and algorithms in order to detect and track multiple interacting humans, by way of multiple cues, like bodies, faces or speech. This application domain is challenging, because audio and visual signals are cluttered by both background and foreground objects. First, particle filtering is established as the framework for tracking. Then, audio, visual and also audio-visual tracking systems are separately explained. Each modality is analyzed, starting with sensor configuration, detection for tracker initialization and the trackers themselves. Techniques to fuse the modalities are then considered. Instead of offering a monolithic approach to the tracking problem, this book also focuses on implementation by providing MATLAB code for every presented component. This way, the reader can connect every concept with corresponding code. Finally, the applications of the various tracking systems in different domains are studied.

Visit publisher's page.


Computing systems that are aware of human presence in order to provide heterogeneous services are gaining importance in living and working spaces, in entertainment, security and retail. Central to such systems is the ability to sense humans and often to track them in space across time. Tracking has become a mature topic in radar applications, but it requires a different set of sensors and algorithms when it involves humans. Carrying tracking devices would greatly facilitate service provision, but people generally do not like carrying them. Instead, person tracking is discussed in this book in one of its unobtrusive flavours, i.e. with the use of visual and audio modalities.

This book is about tracking humans using cameras and microphones, with an emphasis on particle filtering algorithms. There are a few excellent texts on tracking, some of which focus on particle filters. All these texts, though, focus on radar or sonar tracking. Audio-visual tracking needs different types of measurements on different types of signals: image, video and audio signal processing elements need to be cast into the tracking frameworks. Two early works paved the way for visual tracking, but audio tracking still lacks a comprehensive text. A recent work covers audio-visual tracking, mostly from the sensors and applications point of view.

This book aspires to fill the gap between traditional tracking texts and signal processing texts. It is meant to be a solid introduction for the researcher starting in the field, but also a good reference for people already working in it. It equips the reader with all the tools to measure the presence of humans in audio and visual signals and to convert these measurements into likelihood functions. These likelihood functions are suitable for driving many types of tracking algorithms, but the emphasis is on particle filtering. This became an obvious choice after inspecting the evolution of the relevant literature over the past decade, which slowly moved away from deterministic and Kalman versions towards the more versatile particle filters.
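To make the idea of a likelihood function driving a particle filter concrete, here is a minimal sketch of a bootstrap particle filter in Python. It is not taken from the book, whose code is in MATLAB. The one-dimensional state, the random-walk motion model, the Gaussian likelihood and all function and parameter names (`gaussian_likelihood`, `particle_filter_step`, `track`, `process_sigma`, `meas_sigma`) are assumptions made purely for illustration.

```python
import math
import random


def gaussian_likelihood(measurement, state, sigma):
    """Unnormalised Gaussian likelihood of a measurement given a state (assumed model)."""
    return math.exp(-0.5 * ((measurement - state) / sigma) ** 2)


def particle_filter_step(particles, measurement, rng,
                         process_sigma=0.5, meas_sigma=1.0):
    """One predict, weight, estimate and resample cycle of a bootstrap particle filter."""
    # Predict: move each particle with a random-walk motion model (assumption).
    predicted = [p + rng.gauss(0.0, process_sigma) for p in particles]

    # Weight: evaluate the likelihood function at every predicted particle.
    weights = [gaussian_likelihood(measurement, p, meas_sigma) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # Every likelihood underflowed to zero; fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # Estimate: weighted mean of the particles approximates the posterior mean.
    estimate = sum(w * p for w, p in zip(weights, predicted))

    # Resample: multinomial resampling in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def track(measurements, num_particles=500, seed=0):
    """Run the filter over a measurement sequence and return one estimate per step."""
    rng = random.Random(seed)
    # Initialise particles uniformly over an assumed search range of [-10, 10].
    particles = [rng.uniform(-10.0, 10.0) for _ in range(num_particles)]
    estimates = []
    for y in measurements:
        particles, estimate = particle_filter_step(particles, y, rng)
        estimates.append(estimate)
    return estimates
```

Calling `track([3.0] * 20)` returns twenty estimates that settle near the measured position. In an audio-visual tracker, the Gaussian placeholder would be replaced by likelihoods computed from the audio and visual measurements, while the predict, weight and resample structure stays the same.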

We believe that the coverage of the material is end-to-end, in the sense that the theoretical foundation of particle filtering and the necessary image, video, audio and array signal processing elements are first established, followed by working examples and MATLAB implementations. The MATLAB implementations, however, aim only to serve as skeletons for the development of larger systems. We felt that the book would not be complete without a chapter discussing applications and real-world systems. This allowed us to give a more meaningful aspect to an otherwise abstract scientific problem.


  • Page 14, second paragraph of section 2.4.1 should be changed to ‘Similar to deterministic tracking, recursive Bayesian filtering involves two steps. First the previous posterior is mapped into the one-step prediction density \( p\left( {{{\bf{x}}_n}\left| {{{\bf{y}}_{1:n - 1}}} \right.} \right) \) utilising all the available information about the current state \( {\bf x}_n \), i.e. the previous state \( {\bf x}_{n-1} \) and the sequence of past measurements \( {\bf{y}}_{1:n - 1} \). This is expressed as the conditional PDF \( p\left( {{{\bf{x}}_n}\left| {{{\bf{x}}_{n - 1}},{{\bf{y}}_{1:n - 1}}} \right.} \right) \). Then:’
  • Page 29, first line, ‘is an arbitrary fuction’ should be ‘is an arbitrary function’
  • Page 29, line above eqn (2.41), the equation reference should be (2.11)
  • Page 29, line above eqn (2.42), the equation reference should be (2.10)
  • Page 92, rotation matrix symbol after eqn. (4.7): the exponent (c) should be replaced by a subscript c.
  • Page 169, caption of Fig. 5.14, ‘the two uni-modal particle filter face tracker’ should be ‘the two uni-modal particle filter face trackers’
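
For orientation, the two steps named in the first erratum are commonly written as below. This is the standard textbook form of recursive Bayesian filtering in the notation used above, not a reproduction of the book's own equations. The update step assumes that the measurement \( {\bf y}_n \) depends only on the current state \( {\bf x}_n \).

\[
p\left( {\bf x}_n \mid {\bf y}_{1:n-1} \right) = \int p\left( {\bf x}_n \mid {\bf x}_{n-1}, {\bf y}_{1:n-1} \right) \, p\left( {\bf x}_{n-1} \mid {\bf y}_{1:n-1} \right) \, d{\bf x}_{n-1}
\]

\[
p\left( {\bf x}_n \mid {\bf y}_{1:n} \right) = \frac{ p\left( {\bf y}_n \mid {\bf x}_n \right) \, p\left( {\bf x}_n \mid {\bf y}_{1:n-1} \right) }{ p\left( {\bf y}_n \mid {\bf y}_{1:n-1} \right) }
\]

The first line is the prediction step, which maps the previous posterior into the one-step prediction density. The second is the update step, which folds in the new measurement through its likelihood.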


Use the following links to download the MATLAB code and multimedia files used throughout the examples of this book:


Throughout the book, figures contain screenshots from videos demonstrating the tracking technology or applications. Here you can find YouTube links for those videos.

  • Fig. 4.19: Finger tracking in NIR video. See relevant section in our Human-Computer Interaction page.
  • Fig. 5.9: Audio-visual speaker tracking: Fuse an audio track (represented by the green bar) that carries no height information with the single-camera video tracks of the faces (represented by the red rectangles in the frame and the red T-shaped markers in the 3D representation) that carry very uncertain depth information into a reliable 3D speaker track (represented by the blue marker in the frame and the blue bar in the 3D representation).
  • Fig. 6.1: 3D tracker providing the locations and body postures (standing, walking, sitting, fallen) of all the people in the monitored space. The system combines motion evidence from several cameras into 3D human body abstractions.
  • Fig. 6.2: 3D tracker providing the locations of all participants in a meeting. The system fuses 2D face tracks from the corner cameras (smaller four frames on the left) and motion evidence from the panoramic camera (larger frame on the right) to track the projection of the head on the floor.
  • Fig. 6.4: The augmented reality application. See relevant section in our Human-Computer Interaction page.
  • Fig. 6.6: Outdoor surveillance systems.
  • Fig. 6.7: Targets tracked from a panoramic camera and resulting coverage on the floor. Brighter colours indicate higher occupancy of the particular 10-by-10 cm square.

Affiliated with Aalborg University-CTiF, Harvard-Kennedy School Of Government © ATHENS INFORMATION TECHNOLOGY designed by {Linakis+Associates}