The Fourth International Workshop on Cooperative Distributed Vision

Presentations by Project Members


Cooperative Distributed Vision: Dynamic Integration of Visual Perception, Camera Action, and Network Communication

Takashi Matsuyama (Kyoto University)

This paper gives an overview of our five-year project on Cooperative Distributed Vision (CDV for short). From a practical point of view, the goal of CDV is summarized as follows:
Embed in the real world a group of network-connected Observation Stations (real-time image processors with active cameras) and mobile robots with vision, and realize

  1. dynamic real world scene understanding, and
  2. versatile dynamic scene visualization.
Applications of CDV include real-time wide-area surveillance, remote conference and distance learning systems, 3D Video and intelligent TV studios, navigation of mobile robots and disabled people, cooperative mobile robots, and so on.
In this paper, we first present our motivation, research goal, and the basic ideas of CDV. We discuss the functionalities of, and mutual dependencies among, perception, action, and communication to formally clarify the meaning of their integration. Then we introduce the Dynamic Memory Architecture for dynamically integrating multiple asynchronous parallel processes. With this architecture, we can implement an Active Vision Agent, in which the visual perception, camera action, and network communication modules are realized as asynchronous parallel processes that interact dynamically via the shared dynamic memory. Moreover, we show that this architecture enables us to design dynamic cooperation among multiple communicating active vision agents. Finally, we give a summary of our research achievements during these five years.
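To illustrate the flavor of such a shared dynamic memory, the following is a minimal sketch, assuming a time-stamped slot that a writer appends to at its own pace and a reader queries at an arbitrary time without blocking on the writer's cycle. The class name, interface, and interpolation scheme are illustrative assumptions, not the project's actual implementation.

```python
import bisect
import threading

class DynamicMemory:
    """Sketch of one slot of a shared 'dynamic memory': writers append
    time-stamped values, readers request the value at an arbitrary time
    and receive a linear interpolation (or extrapolation from the two
    most recent samples).  Names and interface are illustrative only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._times = []   # monotonically increasing timestamps
        self._values = []  # values observed at those timestamps

    def write(self, t, value):
        with self._lock:
            self._times.append(t)
            self._values.append(value)

    def read(self, t):
        with self._lock:
            if not self._times:
                return None
            i = bisect.bisect_left(self._times, t)
            if i == 0:
                return self._values[0]
            if i == len(self._times):
                if len(self._times) == 1:
                    return self._values[-1]
                t0, t1 = self._times[-2], self._times[-1]
                v0, v1 = self._values[-2], self._values[-1]
            else:
                t0, t1 = self._times[i - 1], self._times[i]
                v0, v1 = self._values[i - 1], self._values[i]
            a = (t - t0) / (t1 - t0) if t1 != t0 else 0.0
            return v0 + a * (v1 - v0)

# A perception process might write time-stamped measurements while an
# action process reads the estimate at the time it needs, at its own rate.
mem = DynamicMemory()
mem.write(0.0, 10.0)
mem.write(1.0, 20.0)
print(mem.read(0.25))   # -> 12.5
```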


Imaging a 3D Lecture Room by Interpreting Its Dynamic Situation

Michihiko Minoh and Yoshinari Kameda (Kyoto University)

We envision a new computer-supported environment named the information media environment. In this environment, users can watch, through raw or synthesized video images in real time, what happens in a fixed real space where people get together to do something.
When watching video images of the scene, people want to see and understand what is being done there. In this sense, we should know not only what is happening in the scene but also what viewers want to see and how they want to see it. We have defined imaging rules that are applied according to the dynamic situation of the scene. We also show how the situation should be translated into camera works, which describe the way the active shooting cameras are controlled.
We selected a lecture room used for distance-learning lectures for our experiment. The interests of remote students should be understood, and the active cameras should cooperate so as to image the focused object in the 3D lecture room. We have implemented a prototype system and used it in several regular courses between UCLA and Kyoto University, which demonstrates the soundness of our proposed method.
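As a toy illustration of the kind of situation-to-camera-work mapping described above, the sketch below looks up a camera work from an interpreted situation. The situation labels, camera names, and shot types are invented for illustration and do not come from the paper.

```python
# Hypothetical situation-driven imaging rules: each interpreted situation
# in the lecture room selects a camera and a framing (a "camera work").
IMAGING_RULES = {
    "lecturer_writes_on_board": ("board_camera",   "zoom_to_board"),
    "lecturer_explains_slide":  ("slide_camera",   "full_slide"),
    "student_asks_question":    ("student_camera", "close_up_on_speaker"),
    "general_discussion":       ("room_camera",    "wide_shot"),
}

def select_camera_work(situation):
    """Return (camera, shot) for the current situation, falling back to a
    wide shot of the whole room when the situation is unknown."""
    return IMAGING_RULES.get(situation, ("room_camera", "wide_shot"))

print(select_camera_work("student_asks_question"))
```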


Cooperative Behavior Acquisition by Learning and Evolution of Vision-Motor Mapping for Mobile Robots

Minoru Asada, Eiji Uchibe, and Koh Hosoda (Osaka University)

This paper consists of four studies contributing to the realization of cooperative behavior between mobile robots in a dynamically changing environment. First, we propose a method that acquires purposive behaviors based on the estimation of state vectors. In order to acquire cooperative behaviors in multi-agent environments, each learning robot estimates a Local Prediction Model (hereafter LPM) between the learner and each of the other objects separately. The LPM estimates the local interactions, while reinforcement learning copes with the global interaction between the multiple LPMs and the given tasks. Based on LPMs that satisfy the Markovian environment assumption as far as possible, the robots learn the desired behaviors using reinforcement learning. We also propose a learning schedule in order to make learning stable, especially in the early stage of multi-agent learning.
Next, a method for controlling complexity is proposed that allows a vision-based mobile robot to develop its behavior according to the complexity of its interactions with the environment. The agent estimates the full set of state vectors, ordered by their major vector components, based on the LPM. The environmental complexity is defined in terms of the speed of the agent, while the complexity of the state vector is the number of its dimensions. As the speed of the agent itself or of the others increases, the dimension of the state vector is increased, taking a trade-off between the size of the state space and the learning time.
Then, a vector-valued reward function is introduced in order to cope with multiple tasks. Unlike the traditional weighted sum of several reward functions, we introduce a discount matrix to integrate them when estimating the value function, which evaluates the current action strategy. Owing to this extension of the value function, the learning agent can appropriately estimate the multiple future rewards from the environment.
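As a rough illustration of the vector-valued value function, the following sketch replaces the usual scalar discount factor with a discount matrix in a Q-learning-style update. The learning rate, the scalarization used to pick greedy actions, and the problem dimensions are assumptions made only for this example, not the paper's method.

```python
import numpy as np

n_states, n_actions, n_tasks = 4, 2, 2
Q = np.zeros((n_states, n_actions, n_tasks))   # vector-valued Q function
Gamma = np.array([[0.9, 0.0],                  # discount matrix (assumed)
                  [0.1, 0.8]])
alpha = 0.1                                    # learning rate (assumed)

def greedy_action(state):
    # pick the action whose value-vector components sum to the largest total
    return int(np.argmax(Q[state].sum(axis=1)))

def update(state, action, reward_vec, next_state):
    # target is the vector reward plus the matrix-discounted next value vector
    a_next = greedy_action(next_state)
    target = reward_vec + Gamma @ Q[next_state, a_next]
    Q[state, action] += alpha * (target - Q[state, action])

update(0, 1, np.array([1.0, 0.0]), 2)
print(Q[0, 1])
```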
Finally, a genetic programming method is applied to individual populations, each corresponding to one robot, so as to obtain cooperative and competitive behaviors through co-evolutionary processes. The complexity of the problem is twofold: co-evolution for cooperative behaviors needs exact synchronization of the mutual evolutions, and co-evolution of three robots requires carefully designed environment setups that may gradually change from simpler to more complicated situations.
The proposed methods are applied to several simplified soccer games to show their validity, and a discussion is given.


Acquiring Human Task and Cooperative Behavior through Spatiotemporal Distributed Vision

Katsushi Ikeuchi, Hiroshi Kimura, Koichi Ogawara and Jun Takamatsu (University of Tokyo)

Acquisition of human task models and skills from observation is essential for preserving, translating, and sharing knowledge that is difficult to express verbally. Our goal, based mainly on vision techniques, is the acquisition, preservation, and enhancement of human hand-work tasks, ranging from everyday tasks to those requiring professional skills. To this end, we are currently developing a recognition technique, a reusable task-model representation, and a cooperation scheme between humans and robots.
We have developed a human-form robot with capabilities similar to those of humans. This robot serves as an experimental platform for handling the entire process of task acquisition: from observation as input, through internal computation, to robot behavior as output. By using the real robot, we can directly verify the recognition process and the validity of the acquired task model from the robot's behavior; moreover, repeating this loop enables the system to incorporate the effects of robot and human behavior into observation, so as to revise the task model or to realize cooperative behavior with multiple humans.
As the outcome of the five-year Cooperative Distributed Vision (CDV) project, we developed three different techniques for obtaining task knowledge. One is a technique for constructing a task model by integrating multiple observations via attention points. Another is a technique for the automatic acquisition of precise assembly skills by analyzing the contact-state transitions of manipulated objects. The third is a technique for producing a variety of cooperative behaviors, adapted to the progress of the current task and to human actions, from a single task model by analyzing mutual event dependencies. We consider such preservation, translation, and sharing of task knowledge between humans and robots through vision to be an important aspect of the framework of CDV. In this paper, we present each technique and describe the experimental results achieved by the use of real robots.


Construction of Active Vision System for Cooperative Distributed Vision

Koichiro Deguchi (Tohoku University)

This paper reports the construction of active visual sensors that track moving targets and simultaneously obtain their 3-D shapes. The main objective of the cooperative distributed vision (CDV) system is to interpret and recognize a dynamic 3-D scene. The key function of each vision sensor in CDV is therefore to acquire dynamic changes in the scene, that is, to detect motions and track targets. To achieve this function, it is not sufficient merely to match 2-D images; the sensor must identify the 3-D shape of the target and track it in 3-D space. Our system achieves real-time 3-D recognition of high-speed moving objects by fixation-point tracking.
We also report a visual servoing technique to guide the active visual sensors to their home positions.


Cooperative Distributed Face Registration and Recognition in Natural Environment

Takekazu Kato, Yasuhiro Mukaigawa and Takeshi Shakunaga (Okayama University)

We discuss a task-oriented cooperative distributed vision system that realizes both effective face registration and effective face recognition in a natural environment. This paper summarizes our framework and algorithms as well as experimental results.


Real-Time Multi-view Image Analysis on PC-cluster and Its Application

Rin-ichiro Taniguchi, Daisaku Arita and Satoshi Yonemoto (Kyushu University)

In the CDV project, one of the chief research issues is the cooperative observation of target objects or environments with multiple sensors. To realize such cooperative systems, we have developed, as a base architecture, a PC-cluster system for real-time distributed video/image processing, which consists of off-the-shelf personal computers connected via a high-speed network. In this paper, we outline our real-time multi-view image processing system on the PC cluster, emphasizing its key issue: the synchronization mechanism for distributed real-time image data. We also describe a prototypical application of the system, a real-time motion capture system.
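A minimal sketch of the kind of timestamp-based synchronization such a system needs is shown below: frames from each node are buffered, and a synchronized set is released only when every node has a frame close enough to the oldest pending one. The class interface, tolerance value, and buffering policy are assumptions for illustration, not the project's actual mechanism.

```python
from collections import deque

class FrameSynchronizer:
    """Toy synchronizer for image streams arriving from several cluster
    nodes.  Frames are buffered per node; a set is released only when
    every node's oldest frame lies within `tolerance` seconds of the
    oldest pending frame overall.  A real system would also drop stale
    frames instead of waiting indefinitely."""

    def __init__(self, node_ids, tolerance=0.01):
        self.tolerance = tolerance
        self.buffers = {nid: deque() for nid in node_ids}

    def push(self, node_id, timestamp, frame):
        self.buffers[node_id].append((timestamp, frame))

    def pop_synchronized(self):
        if any(not buf for buf in self.buffers.values()):
            return None
        heads = {nid: buf[0] for nid, buf in self.buffers.items()}
        t_ref = min(t for t, _ in heads.values())
        if any(t - t_ref > self.tolerance for t, _ in heads.values()):
            return None          # some node is still too far behind
        group = {nid: self.buffers[nid].popleft()[1] for nid in self.buffers}
        return t_ref, group

sync = FrameSynchronizer(["pc0", "pc1"], tolerance=0.02)
sync.push("pc0", 0.000, "frame-a")
sync.push("pc1", 0.005, "frame-b")
print(sync.pop_synchronized())
```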


Multi-target Tracking by Cooperation of Stationary and Mobile Agents

Norihiko Yoshida (Nagasaki University)

Multi-target tracking, or multi-target motion analysis, is known to be an NP-hard problem, and several methods have been proposed. We devised a distributed search method using multiple cooperative agents and proved that it is more efficient than other methods using a single centralized agent.
If the agents performing target cognition stay on their home processors, they must delegate their intermediate solutions to other processors as the targets move. We devised a more effective alternative in which the agents performing target cognition move over the processor network, following the targets.
Our system is composed of two classes (or layers) of agents. Lower-layer "detectors" are stationary and estimate the tracks of targets. Higher-layer "trackers" are mobile and perform target-cognition tasks. We designed this system by applying a tuple-space mobile-thread mechanism that we developed elsewhere.
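The coordination style can be illustrated with a toy tuple space, in which a stationary detector posts detection tuples and a mobile tracker takes the tuples concerning its target before migrating toward it. The tuple layout and matching predicate below are invented for this sketch; they are not the authors' actual tuple-space mobile-thread mechanism.

```python
import threading

class TupleSpace:
    """Toy tuple space: detectors 'put' detection tuples and trackers
    'take' tuples that match a template, blocking until one appears."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def put(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, match):
        with self._cond:
            while True:
                for i, tup in enumerate(self._tuples):
                    if match(tup):
                        return self._tuples.pop(i)
                self._cond.wait()

space = TupleSpace()
# a stationary detector posts a sighting of target 7 on processor "node3"
space.put(("detection", 7, "node3", (12.0, 4.5)))
# the mobile tracker responsible for target 7 consumes it and would then
# migrate to "node3" to continue target cognition near the data
tag, target_id, node, pos = space.take(lambda t: t[0] == "detection" and t[1] == 7)
print(f"tracker for target {target_id} migrates to {node}, target at {pos}")
```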


Multi-focus Range Sensor Using Coded Aperture

Shinsaku Hiura (Osaka University) and Takashi Matsuyama (Kyoto University)

Defocusing caused by a finite-size aperture is one of the useful image cues for depth measurement. Depth from defocus (DFD) is a practical method for depth measurement with several advantages: 1) it is passive, 2) it works in real time, 3) it uses single-lens optics, and 4) it has no correspondence problem. Although the DFD methods proposed so far utilize the blurring caused by a standard optical system, ordinary cameras are not sufficient, and new optical designs must be introduced to make the method practically usable in the real world. In this paper, we design a multi-focus camera with special optics (telecentric optics and a coded aperture) and propose a depth-measurement algorithm using this camera. The structured blur caused by the coded aperture remarkably improves the robustness and precision of the measurement, and it also enables blur-free image reconstruction. The telecentric optics and the multi-focus camera together make it possible to use simple signal-processing methods. Experimental results demonstrate the effectiveness of the approach.
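To convey the multi-focus idea in code, here is a highly simplified depth-from-focus style sketch that, per pixel, picks the focus setting whose image is locally sharpest. The paper's actual algorithm instead analyzes the structured blur produced by the coded aperture, so this is only a loose analogue under simplifying assumptions.

```python
import numpy as np

def depth_from_focus(stack, focus_depths):
    """For each pixel, choose the focus setting with the largest Laplacian
    energy (a simple sharpness measure) and report that setting's depth.
    A practical version would also pool the measure over a local window."""
    sharpness = []
    for img in stack:
        # discrete Laplacian as a crude focus measure
        lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
               np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
        sharpness.append(lap ** 2)
    best = np.argmax(np.stack(sharpness), axis=0)   # sharpest index per pixel
    return np.asarray(focus_depths)[best]           # depth map

stack = [np.random.rand(64, 64) for _ in range(3)]  # dummy multi-focus images
print(depth_from_focus(stack, focus_depths=[0.5, 1.0, 2.0]).shape)
```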


Fixed Viewpoint Cameras and Their Applications

Toshikazu Wada and Takashi Matsuyama (Kyoto University)

This paper presents the properties and applications of the Fixed Viewpoint Camera (FVC): a rotational camera whose optical center stays at the same position irrespective of the rotation. This sensor can be used for omni-directional sensing, because any ray entering the sensor from any view direction passes through a single viewpoint; hence, images taken at various view directions can be stitched into a seamless omni-directional image without warping. As an active tracking sensor, the FVC enables appearance-based image processing such as background subtraction and inter-frame subtraction. The sensor also simplifies the egomotion-analysis problem: discriminating between camera rotation and object motion from optical-flow vectors. We also discuss the properties and applications of other members of the FVC family: the FVC with zoom control, the Fixed Viewpoint Stereo Camera, and the Fixed Viewpoint Multi-Spectrum Camera.
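The reason no parallax correction is needed can be sketched as follows: because every pixel's ray passes through the fixed viewpoint, a pixel observed at a given pan and tilt maps to panorama angles independently of scene depth. The pixel and angle conventions in this sketch are assumptions made only for illustration.

```python
import numpy as np

def ray_direction(u, v, f):
    """Unit ray direction of pixel (u, v), measured from the image centre,
    in a camera with focal length f (in pixels)."""
    d = np.array([u, v, f], dtype=float)
    return d / np.linalg.norm(d)

def rotation(pan, tilt):
    """Camera rotation for a given pan (about y) and tilt (about x)."""
    cp, sp = np.cos(pan), np.sin(pan)
    ct, st = np.cos(tilt), np.sin(tilt)
    R_pan = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    R_tilt = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])
    return R_pan @ R_tilt

def panorama_coords(u, v, f, pan, tilt):
    """Map a pixel to spherical panorama angles (longitude, latitude).
    Since the optical centre never moves, the mapping depends only on the
    ray direction, so images can be pasted into the panorama seamlessly."""
    d = rotation(pan, tilt) @ ray_direction(u, v, f)
    lon = np.arctan2(d[0], d[2])
    lat = np.arcsin(np.clip(d[1], -1.0, 1.0))
    return lon, lat

print(panorama_coords(u=10, v=-5, f=800, pan=np.radians(30), tilt=np.radians(5)))
```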


Real-Time Active Moving Object Detection and Tracking

Akihiro Sugimoto, Toshikazu Wada and Takashi Matsuyama (Kyoto University)

Moving object detection and tracking is one of the most fundamental functions required to develop various real-world systems: security and traffic monitoring systems, remote conference and distance learning systems, intelligent TV studios, and mobile robots.
In this talk, I first show a real-time active moving object detection and tracking system we developed. It consists of a PC and an active video camera whose pan, tilt, and zoom parameters can be dynamically controlled. I also show the distinguishing characteristics of our active camera and the fundamental algorithm for object detection and tracking. Secondly, I present a parallel control system we devised to realize smooth camera motion, in which the perception (image processing) and action (camera control) modules run in parallel and share a specialized memory named the Dynamic Memory. Finally, I present our extension to multi-object tracking. To track multiple objects at the same time, the system has to maintain multiple hypotheses for the objects and examine spatial and temporal continuities among the hypotheses to realize robust object tracking. Our hypothesis maintenance enables the system to detect and track multiple objects stably even when occlusions occur between objects and the number of objects changes during tracking.
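A toy sketch of the hypothesis-maintenance idea is given below: detections close to an existing hypothesis extend it, unmatched detections spawn new hypotheses, and hypotheses that go unmatched for too long are dropped. The gating threshold, miss limit, and data layout are assumptions for this example, not the system's actual parameters.

```python
import numpy as np

def update_hypotheses(hypotheses, detections, gate=30.0, max_missed=5):
    """One update step of a simple hypothesis-maintenance scheme: each
    hypothesis stores a last position and a missed-frame counter."""
    unmatched = list(detections)
    for h in hypotheses:
        if unmatched:
            d = min(unmatched, key=lambda p: np.hypot(p[0] - h["pos"][0],
                                                      p[1] - h["pos"][1]))
            if np.hypot(d[0] - h["pos"][0], d[1] - h["pos"][1]) < gate:
                h["pos"], h["missed"] = d, 0    # detection extends hypothesis
                unmatched.remove(d)
                continue
        h["missed"] += 1                        # no supporting detection
    hypotheses = [h for h in hypotheses if h["missed"] <= max_missed]
    hypotheses += [{"pos": d, "missed": 0} for d in unmatched]
    return hypotheses

hyps = update_hypotheses([], [(100.0, 80.0)])
hyps = update_hypotheses(hyps, [(104.0, 82.0), (300.0, 40.0)])
print(hyps)
```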


Cooperative Tracking by Communicating Active Vision Agents

Norimichi Ukita and Takashi Matsuyama (Kyoto University)

We present a real-time cooperative object tracking system. In our system, we employ Active Vision Agents (AVAs) for tracking, where an AVA is a rational model of a network-connected computer with an active camera. All AVAs cooperatively track their own target objects by dynamically interacting with each other. As a result, the system as a whole tracks moving objects under the various and complex situations of the real world.
To cooperatively track objects, the system has to establish object identification among the objects detected by the AVAs. For this object identification, the AVAs dynamically exchange information about their detected objects with each other. Employing the dynamic memory architecture enables the AVAs to 1) estimate information observed at an arbitrary time, and 2) exchange information without synchronization. Consequently, the system acquires 1) stable object identification, and 2) reactive system behavior.
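As a toy illustration of the identification step, the following sketch groups the positions reported by different AVAs (assumed to be already interpolated to a common time from their dynamic memories) whenever they lie within a distance threshold. The greedy grouping and the threshold are assumptions made for this example.

```python
import numpy as np

def identify_objects(estimates, threshold=0.5):
    """Group AVA reports by proximity: `estimates` is a list of
    (ava_id, position) pairs at a common time; reports closer than
    `threshold` are treated as observations of the same object."""
    groups = []
    for ava_id, pos in estimates:
        pos = np.asarray(pos, dtype=float)
        for group in groups:
            if np.linalg.norm(pos - group["center"]) < threshold:
                group["members"].append(ava_id)
                n = len(group["members"])
                group["center"] = group["center"] + (pos - group["center"]) / n
                break
        else:
            groups.append({"center": pos, "members": [ava_id]})
    return groups

print(identify_objects([("AVA1", (2.0, 3.0)), ("AVA2", (2.1, 3.1)),
                        ("AVA3", (8.0, 1.0))]))
```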


Interactive Viewer for 3D Video

Takeshi Takai and Takashi Matsuyama (Kyoto University)

3D video is the ultimate image medium, recording a dynamic visual event in the real world as is. Observers can see the objects in a scene from any viewpoint, because 3D video records the time-varying full 3D object shape and surface properties (i.e., color and texture).
We first present a method for capturing 3D video: surface reconstruction of the object from a volumetric 3D shape obtained by the volume intersection method, and generation of textures from the images captured by the multiple cameras. We then present a compression method for 3D video using mesh optimization, which reduces unnecessary complexity of the object's surface. The method enables us not only to decrease the size of the 3D video but also to maintain a smooth and clear surface. We finally present an interactive viewer for 3D video. With this viewer, many parameters (the object's position and rotation, a virtual camera's position, pan, tilt, zoom, etc.) are specified intuitively with a graphical user interface, and therefore 3D video can be visualized efficiently. We also show stereo movies of 3D video on a 3D display system that requires no glasses.


Real-Time Active 3D Object Shape Reconstruction for 3D Video

Xiaojun Wu, Toshikazu Wada, and Takashi Matsuyama (Kyoto University)

3D video is the ultimate image medium, recording a dynamic visual event in the real world as is: the time-varying full 3D object shape and surface properties (i.e., color and texture). This paper first presents a novel parallel volume intersection method for real-time 3D shape reconstruction, in which a sophisticated plane-to-plane perspective projection is employed to recover the 3D object shape from a set of multi-viewpoint images. The method is implemented on a PC cluster system consisting of 10 PCs connected through an ultra-high-speed network (1.28 Gbps). Experiments demonstrate that it can capture dynamic 3D human shape in real time: at about 10 frames per second at a 2 cm x 2 cm x 2 cm spatial resolution. Then we augment the system by introducing a real-time active object tracking function with a group of pan-tilt-zoom cameras. While the frame rate decreases to 1 frame per second due to the physical camera motion, this new function enables us to capture 3D object motion over a wide area without degrading the spatial resolution. Finally, we demonstrate the attractiveness of 3D video with its interactive editor and visualizer.
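The underlying volume-intersection principle can be sketched as follows: a voxel belongs to the object only if it projects inside the silhouette in every view. This sequential, per-voxel version is only an illustration; the paper's method instead uses the plane-to-plane perspective projection and parallelizes the computation across the PC cluster. The data layout and dummy camera below are assumptions.

```python
import numpy as np

def volume_intersection(silhouettes, projections, grid):
    """Keep a voxel only if it projects inside the silhouette in all views.
    `projections[i]` maps a 3D point to (u, v) pixel coordinates of camera i."""
    occupied = []
    for voxel in grid:                           # voxel = (x, y, z) in metres
        inside_all = True
        for sil, project in zip(silhouettes, projections):
            u, v = project(voxel)
            if not (0 <= v < sil.shape[0] and 0 <= u < sil.shape[1]) or not sil[v, u]:
                inside_all = False
                break
        if inside_all:
            occupied.append(voxel)
    return occupied

# one dummy camera looking down the z-axis with a 100-pixel focal length
sil = np.zeros((200, 200), dtype=bool)
sil[90:110, 90:110] = True
project = lambda p: (int(100 + 100 * p[0] / p[2]), int(100 + 100 * p[1] / p[2]))
grid = [(x * 0.1, y * 0.1, 2.0) for x in range(-5, 6) for y in range(-5, 6)]
print(len(volume_intersection([sil], [project], grid)))
```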


Dynamic Scene Visualization using an Active Camera

Shogo Tokai (Fukui University) and Takashi Matsuyama (Kyoto University)

An active camera, used for detecting and tracking objects in a scene, changes its viewing direction automatically, and its field of view is narrower than that of human vision. Therefore, its live video sequence is not appropriate for human observers monitoring the scene situation. To visualize the situation as a video sequence understandable to humans, we integrate the live video with a wide panoramic image. In this integration, geometric and photometric consistency are needed for natural observation by humans. In this paper, we propose geometric and photometric methods for integrating the active camera image with the panoramic image. For the geometric integration, we use a fixed-viewpoint pan-tilt-zoom camera system, and for the photometric integration, we present a method based on eigen-image analysis.