Third International Workshop on Cooperative Distributed Vision

Presentations by Invited Speakers

Representation of Multi-Agent Action for Recognition

Aaron F. Bobick, Stephen Intille and Yuri Ivanov
(Georgia Institute of Technology, USA; M.I.T. Media Laboratory, USA)

Many distributed vision applications concern the recognition of multi-agent action. Unlike the activity of single agents over short time scales, multi-agent action is typically loosely defined by statistical tendencies, requires certain causal interactions, and evolves over extended periods of time. Recognition techniques designed to observe single-agent action (such as hidden Markov models) are unlikely to succeed in these situations. Here we present two approaches to the statistical recognition of multi-agent action. The first is based upon stochastic parsing of parallel event streams. This method is useful where there are a priori definitions of actions involving a small number of agents, but where the detection of individual elements is uncertain. The fundamental idea is to divide the recognition task into two levels: statistical detection of underlying features, and structural parsing of those detections. The second approach relies on the uncertain integration of confirming evidence of large-scale coordinated activity, such as a team executing a particular football play. In presenting and comparing these two techniques, we will attempt to characterize multi-agent action recognition problems in terms of being structural or statistical, and in terms of their spatial-temporal rigidity.
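The two-level idea can be illustrated with a minimal sketch. It is not the authors' parser: it substitutes a simple probabilistic CYK (inside) computation over a single event stream, and the grammar, event names, and probabilities are all hypothetical. Each detection is a set of likelihoods over event symbols (the statistical level), and the grammar scores how well the whole sequence fits a predefined action (the structural level).

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (an assumption for
# illustration, not taken from the abstract).
# Binary rules: parent -> list of ((left, right), probability).
BINARY_RULES = {
    "ACTION": [(("REACH", "REST"), 0.7), (("REACH", "RETRACT"), 0.3)],
    "REST": [(("GRASP", "RETRACT"), 1.0)],
}
# Lexical rules: nonterminal -> list of (event symbol, probability).
LEXICAL_RULES = {
    "REACH": [("reach", 1.0)],
    "GRASP": [("grasp", 1.0)],
    "RETRACT": [("retract", 1.0)],
}


def inside_probability(detections, start="ACTION"):
    """Score a sequence of uncertain detections against the grammar.

    detections: list of dicts mapping event symbol -> detector likelihood.
    Returns the summed probability of all parses rooted at `start`.
    """
    n = len(detections)
    # chart[(i, j, nonterminal)] = probability that the nonterminal
    # generates detections[i:j]
    chart = defaultdict(float)

    # Level 1: fold the detector likelihoods into the lexical rules.
    for i, likelihoods in enumerate(detections):
        for nonterminal, options in LEXICAL_RULES.items():
            for symbol, prob in options:
                chart[(i, i + 1, nonterminal)] += prob * likelihoods.get(symbol, 0.0)

    # Level 2: combine adjacent spans with the binary rules, shortest first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for parent, options in BINARY_RULES.items():
                for (left, right), prob in options:
                    for k in range(i + 1, j):
                        chart[(i, j, parent)] += (
                            prob * chart[(i, k, left)] * chart[(k, j, right)]
                        )
    return chart[(0, n, start)]


# Three time steps, each with an uncertain detection result.
detections = [
    {"reach": 0.9, "grasp": 0.1},
    {"grasp": 0.8, "retract": 0.2},
    {"retract": 0.7, "reach": 0.3},
]
likelihood = inside_probability(detections)
```

Keeping the detector likelihoods separate from the grammar means a weak detection lowers the score of a parse without ruling it out, which is the point of splitting detection from structural parsing.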

Multi-perspective Analysis of Human Action

Larry S. Davis, Eugene Borovikov, Ross Cutler, and Thanarat Horprasert (University of Maryland, USA)

We describe research being conducted in the University of Maryland Keck Laboratory for the Analysis of Visual Motion. The Keck Laboratory is a multi-perspective computer vision laboratory containing sixty-four digital, progressive-scan cameras (forty-eight monochromatic and sixteen single-CCD color) configured into sixteen groups of four cameras. Each group of four is a quadranocular stereo rig consisting of three monochromatic cameras and one color camera. The cameras are attached to a network of sixteen PCs used for both data collection and real-time video analysis.
We first describe the architecture of the system in detail, and then present two applications:
1. Real-time multi-perspective tracking of body parts for motion capture. We have developed a real-time 3D motion capture system that integrates images from a large number of color cameras to both detect and track human body parts in 3D. A preliminary version of this system (developed in collaboration with ATR Media Integration & Communications Research Laboratories and the M.I.T. Media Laboratory) was demonstrated at SIGGRAPH 98. That version, based on the W4 system for visual surveillance developed in our laboratory:
   - detected people by background modeling and background subtraction;
   - found body parts in each image via shape analysis of foreground regions;
   - triangulated those body parts using robust triangulation procedures;
   - smoothed the 3D body part trajectories and predicted locations of those parts in each view using a lightweight version of the dynamical models developed by Chris Wren and Sandy Pentland from the M.I.T. Media Laboratory; and
   - animated the captured motion using graphical models developed by ATR Media Integration & Communications Research Laboratories.
   We describe improved versions of the background modeling and tracking components of that system.
2. Real-time volume intersection. Models of human shape can also be constructed using volume intersection methods. Here, we use the same background modeling and subtraction methods as in our motion capture system, but then utilize parallel and distributed algorithms for constructing an oct-tree representation of the volume of the person being observed. Details of this algorithm will be described.
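As a rough illustration of how background subtraction feeds volume intersection, the sketch below carves a small voxel grid from three silhouettes. It is not the laboratory's algorithm. It assumes a static background image, a fixed intensity threshold, three orthographic views aligned with the coordinate axes, and a flat voxel set in place of the oct-tree. The function names and parameters are hypothetical.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels that differ from the background image by more than threshold.

    frame and background are 2D lists of grayscale intensities of equal size.
    """
    return [
        [abs(pixel - bg) > threshold for pixel, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def carve(size, mask_x, mask_y, mask_z):
    """Keep voxel (x, y, z) only if it projects inside all three silhouettes.

    Assumes orthographic views along the axes:
    mask_x is indexed [y][z], mask_y is indexed [x][z], mask_z is indexed [x][y].
    """
    return {
        (x, y, z)
        for x in range(size)
        for y in range(size)
        for z in range(size)
        if mask_x[y][z] and mask_y[x][z] and mask_z[x][y]
    }


# Toy scene: a 3x3 image per view, with one bright pixel in the centre.
size = 3
background = [[10] * size for _ in range(size)]
frame = [row[:] for row in background]
frame[1][1] = 200

mask = foreground_mask(frame, background)
voxels = carve(size, mask, mask, mask)  # the same silhouette seen from all three axes
```

Each voxel is tested independently of the others, so the loop can be split across machines by view or by region of space. An oct-tree would additionally let large empty or full regions be decided with a single test.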

Semantic Interactivity in Presence Systems

Simone Santini and Ramesh Jain (PRAJA Inc., USA)

Presence Technology (PT) is targeted to the needs of people who want to be part of a remote, live environment. Presence systems blend component technologies like computer vision, signal understanding, heterogeneous sensor fusion, live-media delivery, telepresence, databases, and multimedia information systems into a novel set of capabilities that enables the user to perceive, move around, enquire about, and interact with the remote, live environment through her reception and control devices. PT creates the opportunity to perform different tasks: watch an event, tour and explore a location, meet and communicate with others, monitor the environment for a potential situation, perform a query on the perceived objects and events, and recreate past observations. Technically, the framework offers computer-mediated access to multi-sensory information in an environment, integrates the sensory information into a situation model of the environment, and delivers, at the user's request, the relevant part of the assimilated information through a multimodal interface.
PT is an extension of the Multiple Perspective Interactive Video project at the Visual Computing Laboratory, University of California, San Diego. In this paper we will present results from the PRAJA Presence system, an early implementation of PT for several applications. We will also present a demo of this system to explain its technical components.

Image Understanding for Visual Surveillance Applications

Monique Thonnat and Nathanael Rota (INRIA Sophia Antipolis, France)

This paper presents recent work on behavior analysis. More precisely, the objective is to infer a high-level description of human behavior from image sequences. The main motivation of this research is to automatically generate alarms for human operators when scenarios of interest have been recognized by the system. First, after a presentation of the state of the art in this domain, we introduce the general scheme of our approach, which is based on the use of predefined scenarios and a priori contextual information. Second, we detail the current low-level image processing techniques used for mobile object detection. Third, we describe the role of a priori contextual information and different ways of representing this information. Then we address the problem of high-level description of mobile object behavior using generic observable events and application-dependent scenarios. Finally, results obtained on different visual surveillance applications in the European Esprit projects Passwords and AVS-PV are shown and discussed. The paper concludes with future work on enhancing the robustness of such image understanding systems and improving their reusability.