The Fourth International Workshop on
Cooperative Distributed Vision
Presentations by Project Members
Takashi Matsuyama (Kyoto University)
This paper gives an overview of our five-year project on
Cooperative Distributed Vision (CDV for short).
From a practical point of view, the goal of CDV is summarized as
follows:
Embed in the real world a group of network-connected Observation
Stations (real-time image processors with active camera(s)) and
mobile robots with vision, and realize
- dynamic real world scene understanding, and
- versatile dynamic scene visualization.
Applications of CDV include real-time wide-area surveillance, remote
conference and distance learning systems, 3D video and
intelligent TV studios, navigation of mobile robots and of disabled
people, cooperative mobile robots, and so on.
In this paper, we first present our motivation, research goals, and
the basic ideas of CDV. We discuss the functionalities of, and mutual
dependencies among, perception, action, and communication to formally
clarify the meaning of their integration. Then we introduce the
Dynamic Memory Architecture for dynamically integrating multiple
asynchronous parallel processes. With this architecture, we can
implement an Active Vision Agent, in which visual perception,
camera action, and network communication modules are realized as
asynchronous parallel processes dynamically interacting via the shared
dynamic memory. Moreover, we show that this architecture enables us to
design dynamic cooperation among multiple communicating active vision
agents. Finally, we give a summary of our research achievements
during these five years.
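As an illustration of the Dynamic Memory idea, it can be read as a shared memory of time-stamped values: writer processes (perception, action, communication) append observations asynchronously, and any reader may ask for the value at an arbitrary time and receive an interpolated estimate, so no module ever blocks on another module's cycle. The Python sketch below is our own minimal rendering of that reading; the class and method names (DynamicMemory, write, read_at) and the linear interpolation are assumptions for illustration, not the project's actual API.

    import bisect
    import threading

    class DynamicMemory:
        """Minimal sketch of a time-stamped shared variable.

        Writers append (timestamp, value) pairs asynchronously; readers ask
        for the value at an arbitrary time and get a linearly interpolated
        (or boundary-held) estimate, so readers never wait for a writer.
        """

        def __init__(self):
            self._lock = threading.Lock()
            self._times = []   # sorted timestamps
            self._values = []  # values aligned with self._times

        def write(self, t, value):
            with self._lock:
                i = bisect.bisect(self._times, t)
                self._times.insert(i, t)
                self._values.insert(i, value)

        def read_at(self, t):
            with self._lock:
                if not self._times:
                    raise LookupError("no observations yet")
                i = bisect.bisect(self._times, t)
                if i == 0:                    # before the first sample
                    return self._values[0]
                if i == len(self._times):     # after the last sample: hold it
                    return self._values[-1]
                t0, t1 = self._times[i - 1], self._times[i]
                v0, v1 = self._values[i - 1], self._values[i]
                a = (t - t0) / (t1 - t0)
                return v0 + a * (v1 - v0)     # linear interpolation

    # A perception process writes target positions; a camera-control process
    # reads the estimated position at its own control time.
    memory = DynamicMemory()
    memory.write(0.00, 10.0)
    memory.write(0.10, 14.0)
    print(memory.read_at(0.05))  # -> 12.0
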
Michihiko Minoh and Yoshinari Kameda (Kyoto University)
We envision a new computer-supported environment, which we call the
information media environment. With this environment, users can
watch, in real time through raw or synthesized video images, what
happens in a fixed real space where people gather to do something.
When watching video images of the scene, people want to see and
understand what is being done there. In this sense, we should know not
only what is happening in the scene but also what viewers want to
see and how they want to see it. We have defined imaging rules
that are applied according to the dynamic situation of the scene.
We also show how the situation should be translated into camera
works, which describe how the active shooting cameras are
controlled.
We selected a lecture room used for distance learning lectures as our
experimental setting. The interests of remote students should be
understood, and the active cameras should cooperate so as to image the
object of focus in the 3D lecture room. We have implemented a
prototype system and operated it for several regular courses
between UCLA and Kyoto University, which demonstrates the
soundness of our proposed method.
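To make the notion of imaging rules concrete, the small Python sketch below maps a recognized lecture-room situation to a camera work (which camera to use, what to frame, and how tightly). The situation labels, camera names, and shot types are hypothetical examples of ours, not the rules actually used in the UCLA-Kyoto courses.

    # Hypothetical imaging rules: map a recognized lecture-room situation to
    # a camera work (which camera to use and how to frame the shot).
    IMAGING_RULES = {
        "lecturer_talking":    {"camera": "front", "target": "lecturer",   "shot": "bust"},
        "lecturer_writing":    {"camera": "board", "target": "blackboard", "shot": "full"},
        "student_questioning": {"camera": "rear",  "target": "student",    "shot": "bust"},
        "no_activity":         {"camera": "front", "target": "room",       "shot": "wide"},
    }

    def select_camera_work(situation):
        """Return the camera work for the current situation (default: wide shot)."""
        return IMAGING_RULES.get(situation, IMAGING_RULES["no_activity"])

    print(select_camera_work("student_questioning"))
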
Minoru Asada, Eiji Uchibe, and Koh Hosoda (Osaka University)
The paper describes four pieces of work contributing to the realization
of cooperative behavior among mobile robots in a dynamically changing
environment.
First, we propose a method that acquires purposive behaviors based
on the estimation of state vectors. In order to acquire the
cooperative behaviors in multiagent environments, each learning robot
separately estimates a Local Prediction Model (hereafter LPM)
between the learner and each of the other objects. The LPMs
estimate the local interactions, while reinforcement learning copes with
the global interaction between multiple LPMs and the given
tasks. Based on LPMs that satisfy the Markovian
environment assumption as far as possible, the robots learn the desired
behaviors using reinforcement learning. We also propose a learning
schedule in order to make learning stable, especially in the early stage
of multiagent learning.
Next, a method for controlling complexity is proposed that allows a
vision-based mobile robot to develop its behavior according to the
complexity of its interactions with the environment. The agent
estimates the full set of state vectors, ordered by their major
vector components, based on the LPM. The environmental complexity
is defined in terms of the speed of the agent, while the complexity of
the state vector is the number of its dimensions. As the speed of the
agent itself or of others increases, the dimension of the state vector
is increased, trading off the size of the state space against the
learning time.
Then, a vector-valued reward function is introduced in order to cope
with multiple tasks. Unlike the traditional weighted sum of
several reward functions, we introduce a discount matrix to
integrate them when estimating the value function, which
evaluates the current action strategy. Owing to this extension of the
value function, the learning agent can appropriately estimate the
multiple future rewards from the environment.
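A minimal sketch of how a vector-valued value function with a discount matrix might be updated is given below. The greedy next action is chosen here by summing the components of the value vector, which is a simplifying scalarization we assume for illustration; the update rule and names are ours, not necessarily those of the paper.

    import numpy as np

    def vector_q_update(Q, s, a, r, s_next, Gamma, alpha=0.1):
        """One Q-learning style update with a vector-valued reward.

        Q     : dict mapping (state, action) -> value vector (np.ndarray)
        r     : vector reward observed after taking action a in state s
        Gamma : discount matrix replacing the usual scalar discount factor
        The greedy next action is picked by the sum of its value components,
        a simplifying scalarization assumed here only for illustration.
        """
        next_actions = {act for (st, act) in Q if st == s_next}
        if next_actions:
            a_star = max(next_actions, key=lambda act: Q[(s_next, act)].sum())
            target = r + Gamma @ Q[(s_next, a_star)]
        else:
            target = r
        q_sa = Q.get((s, a), np.zeros_like(r, dtype=float))
        Q[(s, a)] = q_sa + alpha * (target - q_sa)
        return Q

    # Example: two tasks (say, "shoot" and "avoid"), so rewards are 2-vectors.
    Gamma = np.diag([0.9, 0.8])                  # per-task discounting
    Q = {("s1", "kick"): np.array([1.0, 0.0])}
    Q = vector_q_update(Q, "s0", "dribble", np.array([0.0, 0.5]), "s1", Gamma)
    print(Q[("s0", "dribble")])                  # -> [0.09 0.05]
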
Finally, a genetic programming method is applied to individual
populations, one corresponding to each robot, so as to obtain cooperative
and competitive behaviors through co-evolutionary processes. The
complexity of the problem is twofold: co-evolution for
cooperative behaviors needs exact synchronization of the mutual
evolutions, and three-robot co-evolution requires carefully designed
environment setups that gradually change from simpler to more
complicated situations.
The proposed methods are applied to several simplified soccer games to
show their validity, and the results are discussed.
Katsushi Ikeuchi, Hiroshi Kimura, Koichi Ogawara
and Jun Takamatsu (University of Tokyo)
Acquisition of human task models and skills from observation is
essential for preserving, translating and sharing knowledge which is
difficult to express verbally. Our goal, based mainly on vision
techniques, is the acquisition, preservation, and enhancement of human
hand-work tasks, ranging from everyday tasks to those requiring
professional skills. To this end, we are currently developing a
recognition technique, reusable task model representation, and a
cooperation scheme between humans and robots.
We have developed a human-form robot which has similar capabilities to
those of humans. This robot will serve as an experimental platform
for the purpose of handling the entire process of task acquisition
from observation as input, through internal calculation, to the robot
behavior as output. By using the real robot, we can directly verify
the recognition process and validity of the acquired task model from
the robot behavior; moreover, repeating this loop enables the system
to incorporate the effect of the robot and human behavior into
observation to revise the task model or to realize cooperative
behavior with multiple humans.
As the outcome of the five-year Cooperative Distributed Vision (CDV) project,
we developed three different techniques for obtaining task knowledge. One is a
technique for constructing a task model by integrating multiple
observations by attention points. Another is a technique for
automatic acquisition of precise assembly skills by analyzing contact
state transitions of manipulated objects. The third is a
technique for producing a variety of cooperative behaviors adapted to the
progress of the current task and human actions from a single task
model by analyzing mutual event dependencies. We consider such
preservation, translation, and sharing of task knowledge between humans and
robots through vision to be an important aspect of the framework
of CDV. In this paper, we present each technique and describe the
experimental results achieved by the use of real robots.
Koichiro Deguchi (Tohoku University)
This paper reports the construction of active visual sensors
which track moving targets and simultaneously obtain their 3-D
shapes. The main objective of the cooperative distributed vision (CDV)
system is to interpret and recognize dynamic 3-D scenes. The
key function of each vision sensor in CDV is therefore to acquire dynamic
changes in the scene, that is, to detect motions and track targets
in the scene. To achieve this function, it is not sufficient to
match 2-D images alone; the sensor must identify the 3-D shape of the
target and track it in 3-D space. Our system achieves real-time 3-D
recognition of high-speed moving objects by fixation point tracking.
We also report a visual servoing technique for guiding the active visual
sensors to their home positions.
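The guidance of an active sensor toward a desired configuration can be phrased as the classical image-based visual servoing law, in which the camera velocity is proportional to the pseudo-inverse of the image Jacobian times the feature error. The Python sketch below shows that generic law under our own assumptions (point features, a stand-in Jacobian); it is not the specific servoing technique reported here.

    import numpy as np

    def visual_servo_velocity(s, s_star, L, gain=0.5):
        """Classical image-based visual servoing law: v = -gain * pinv(L) @ (s - s*).

        s      : current image feature vector (e.g., stacked point coordinates)
        s_star : desired feature vector at the home position
        L      : interaction (image Jacobian) matrix relating feature velocities
                 to the camera velocity screw (vx, vy, vz, wx, wy, wz)
        """
        error = s - s_star
        return -gain * np.linalg.pinv(L) @ error

    # Example with two point features (4 feature coordinates, 6-DOF camera).
    s      = np.array([0.12, 0.05, -0.08, 0.10])
    s_star = np.array([0.10, 0.00, -0.10, 0.00])
    L      = np.random.default_rng(0).normal(size=(4, 6))  # stand-in Jacobian
    print(visual_servo_velocity(s, s_star, L))
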
Takekazu Kato, Yasuhiro Mukaigawa and Takeshi Shakunaga (Okayama University)
We discuss task-oriented cooperative distributed
vision for realizing both effective face registration and
effective face recognition in natural environments. This paper
summarizes our framework and algorithms, as well as experimental
results.
Rin-ichiro Taniguchi, Daisaku Arita and Satoshi Yonemoto (Kyushu University)
In the CDV project, one of the chief research issues is the cooperative
observation of target objects or environments with multiple sensors. To
realize such cooperative systems, we have developed, as a base
architecture, a PC-cluster system for real-time distributed video/image
processing, which consists of off-the-shelf personal computers connected
via a high-speed network. In this paper, we outline our real-time
multi-view image processing system on the PC cluster, emphasizing its key
issue: the synchronization mechanism for distributed real-time image data.
We also describe a prototypical application of the system, a real-time
motion capture system.
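One common way to synchronize distributed real-time image streams is to time-stamp frames at each node and, for every processing step, select from each camera the buffered frame closest to a common target time. The Python sketch below illustrates that idea with hypothetical buffers and tolerances; it is a simplified stand-in, not the synchronization mechanism implemented on the PC cluster.

    def select_synchronized_frames(buffers, target_time, tolerance=0.02):
        """Pick, for each camera, the buffered frame whose timestamp is closest
        to target_time; drop cameras whose best frame is outside the tolerance.

        buffers : {camera_id: [(timestamp, frame), ...]} with timestamps in seconds
        returns : {camera_id: frame} for cameras within tolerance
        """
        selected = {}
        for cam_id, frames in buffers.items():
            if not frames:
                continue
            t, frame = min(frames, key=lambda tf: abs(tf[0] - target_time))
            if abs(t - target_time) <= tolerance:
                selected[cam_id] = frame
        return selected

    # Example: three cameras running at slightly different phases.
    buffers = {
        "cam0": [(0.000, "f0"), (0.033, "f1"), (0.066, "f2")],
        "cam1": [(0.010, "g0"), (0.043, "g1")],
        "cam2": [(0.100, "h0")],
    }
    print(select_synchronized_frames(buffers, target_time=0.040))
    # -> {'cam0': 'f1', 'cam1': 'g1'}  (cam2 has no frame close enough)
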
Norihiko Yoshida (Nagasaki University)
Multi-target tracking, or multi-target motion analysis, is known
to be an NP-hard problem, and several methods have been proposed.
We invented a distributed search method using multiple
cooperative agents, and proved that it was more efficient than
other methods using a single centralized agent.
If agents performing target cognition stay on their home
processors, they must hand over their intermediate solutions to
other agents as the targets move. We invented a more effective
alternative, in which agents performing target cognition move over
the processor network, following the targets.
Our system is composed of two classes (or layers) of agents.
Lower-layer "detectors" are stationary and estimate the tracks of
targets. Higher-layer "trackers" are mobile and perform target
cognition tasks. We designed this system using a tuple-space
mobile-thread mechanism that we invented elsewhere.
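A tuple space lets the stationary detectors and the mobile trackers coordinate without knowing each other's location: detectors put track estimates into the space, and a tracker takes the tuples matching its target. The following single-process Python sketch shows only the put/take matching behavior, with names of our own choosing; it does not reproduce the mobile-thread mechanism mentioned above.

    import threading

    class TupleSpace:
        """Minimal tuple space: put tuples, and take (remove) the first tuple
        matching a pattern, where None in the pattern matches any field."""

        def __init__(self):
            self._tuples = []
            self._cond = threading.Condition()

        def put(self, tup):
            with self._cond:
                self._tuples.append(tup)
                self._cond.notify_all()

        def _match(self, pattern, tup):
            return len(pattern) == len(tup) and all(
                p is None or p == v for p, v in zip(pattern, tup))

        def take(self, pattern):
            with self._cond:
                while True:
                    for tup in self._tuples:
                        if self._match(pattern, tup):
                            self._tuples.remove(tup)
                            return tup
                    self._cond.wait()

    # A detector posts a track estimate; a tracker takes estimates for target 7.
    space = TupleSpace()
    space.put(("track", 7, (12.0, 3.5)))          # (kind, target_id, position)
    print(space.take(("track", 7, None)))
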
Shinsaku Hiura (Osaka University) and Takashi Matsuyama (Kyoto University)
Defocus caused by a finite-size aperture is one of the useful
image cues for depth measurement.
Depth from defocus (DFD) is a practical depth measurement method
with several advantages: 1) it is passive, 2) it works in real time,
3) it uses single-lens optics, and 4) it has no correspondence problem.
Although DFD methods proposed so far utilize the blur produced by
standard optical systems, ordinary cameras are not sufficient, and
new optical designs need to be introduced to make the method
practically usable in the real world.
In this paper, we design a multi-focus
camera with special optics (telecentric optics and a coded aperture) and
propose a depth measurement algorithm using the camera.
The structured blur caused by the coded aperture remarkably improves the
robustness and precision of the measurement; it also enables blur-free
image reconstruction.
Both the telecentric optics and the multi-focus camera make it possible
to use simple signal processing methods.
Experimental results demonstrate the method's effectiveness.
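As a rough picture of how a multi-focus camera yields depth, the sketch below selects, for each pixel, the focus setting with the highest local sharpness and maps it to that setting's focused distance. This is a simplified depth-from-focus style illustration under our own assumptions (a Laplacian-based focus measure and box averaging); it is not the coded-aperture DFD algorithm proposed in the paper.

    import numpy as np

    def depth_from_focus(stack, focus_depths, window=5):
        """Estimate a depth map from a multi-focus image stack.

        stack        : (n_focus, H, W) grayscale images at different focus settings
        focus_depths : length-n_focus sequence, focused distance of each setting
        For each pixel, pick the focus setting maximizing a local sharpness
        measure (locally averaged squared Laplacian).
        """
        n, h, w = stack.shape
        sharpness = np.zeros((n, h, w))
        pad = window // 2
        for i, img in enumerate(stack.astype(float)):
            lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
                   np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)
            sq = np.pad(lap ** 2, pad, mode="edge")
            local = np.zeros((h, w))
            for dy in range(window):              # box average over the window
                for dx in range(window):
                    local += sq[dy:dy + h, dx:dx + w]
            sharpness[i] = local / window ** 2
        best = np.argmax(sharpness, axis=0)       # sharpest setting per pixel
        return np.asarray(focus_depths)[best]     # per-pixel depth estimate
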
Toshikazu Wada and Takashi Matsuyama (Kyoto University)
This paper presents the properties and applications of the Fixed
Viewpoint Camera (FVC): a rotational camera whose optical center
stays at the same position regardless of the rotation. This sensor
can be used for omni-directional sensing, because any ray entering
the sensor from any view direction passes through a single viewpoint;
hence, images taken in various view directions can be stitched into
a seamless omni-directional image without warping. As an
active tracking sensor, the FVC enables appearance-based image processing,
such as background subtraction and inter-frame subtraction. The
sensor also simplifies the egomotion analysis problem of discriminating
between camera rotation and object motion from optical-flow
vectors. We also discuss the properties and applications of
other members of the FVC family: the FVC with zoom control, the Fixed
Viewpoint Stereo Camera, and the Fixed Viewpoint Multi-Spectrum Camera.
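Because every ray passes through the single fixed viewpoint, two images taken at different pan/tilt angles are related by the pure-rotation homography H = K R K^{-1}, which is what allows views to be aligned for stitching and for appearance-based subtraction. The Python sketch below computes that standard homography; the intrinsic parameters and the rotation convention are illustrative assumptions of ours.

    import numpy as np

    def rotation_homography(K, R):
        """Homography mapping pixels of the reference view to the rotated view
        for a camera rotating about its optical center: H = K R K^{-1}."""
        return K @ R @ np.linalg.inv(K)

    def pan_tilt_rotation(pan, tilt):
        """Rotation matrix for a pan (about the y axis) followed by a tilt
        (about the x axis), angles in radians."""
        cp, sp = np.cos(pan), np.sin(pan)
        ct, st = np.cos(tilt), np.sin(tilt)
        R_pan = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
        R_tilt = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])
        return R_tilt @ R_pan

    # Illustrative intrinsics: focal length 800 px, principal point (320, 240).
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    H = rotation_homography(K, pan_tilt_rotation(np.deg2rad(10), np.deg2rad(5)))

    # Map a pixel of the reference image into the rotated image.
    p = np.array([320.0, 240.0, 1.0])    # homogeneous pixel coordinates
    q = H @ p
    print(q[:2] / q[2])
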
Akihiro Sugimoto, Toshikazu Wada and Takashi Matsuyama (Kyoto University)
Moving object detection and tracking is one of the most fundamental
functions required to develop various real-world systems: security and
traffic monitoring systems, remote conference and distance learning
systems, intelligent TV studios, and mobile robots.
In this talk, I first show a real-time active moving object detection and
tracking system we developed. It consists of a PC and an active
video camera whose pan, tilt, and zoom parameters can be dynamically
controlled.
I also show the distinguishing characteristics of our active
camera and the fundamental algorithm for object detection and tracking.
Secondly, I present a parallel control system we devised to realize
smooth camera motion, where perception (image processing) and
action (camera control) modules run
in parallel and share a specialized memory named Dynamic
Memory.
Finally, I present our extension to multi-object tracking.
To track multiple objects at the same time, the system should maintain
multiple hypotheses for the objects and examine spatial and temporal
continuities among the hypotheses to realize robust object tracking.
Our hypothesis maintenance enables the system to detect and
track multiple objects stably even when occlusions between objects
occur and the number of objects changes during tracking.
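A simple way to maintain multiple object hypotheses is to associate each new detection with the nearest existing track, spawn a new track when no existing track is close enough, and retire tracks that remain unobserved for several frames. The sketch below shows this greedy nearest-neighbor bookkeeping as an illustration of the idea; the gating and deletion thresholds are assumed values, and this is not the hypothesis maintenance algorithm of the talk itself.

    import numpy as np

    def update_tracks(tracks, detections, gate=30.0, max_missed=5):
        """Greedy nearest-neighbor track maintenance.

        tracks     : list of dicts {"id", "pos", "missed"}
        detections : list of (x, y) positions detected in the current frame
        Returns the updated track list.
        """
        unmatched = list(detections)
        for track in tracks:
            if not unmatched:
                track["missed"] += 1
                continue
            dists = [np.hypot(d[0] - track["pos"][0], d[1] - track["pos"][1])
                     for d in unmatched]
            j = int(np.argmin(dists))
            if dists[j] <= gate:                   # close enough: update track
                track["pos"] = unmatched.pop(j)
                track["missed"] = 0
            else:                                  # occluded or left the view
                track["missed"] += 1
        next_id = max((t["id"] for t in tracks), default=-1) + 1
        for d in unmatched:                        # new objects entering the scene
            tracks.append({"id": next_id, "pos": d, "missed": 0})
            next_id += 1
        return [t for t in tracks if t["missed"] <= max_missed]

    tracks = []
    tracks = update_tracks(tracks, [(100, 100), (300, 200)])
    tracks = update_tracks(tracks, [(105, 102)])   # second object occluded
    print([(t["id"], t["pos"], t["missed"]) for t in tracks])
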
Norimichi Ukita and Takashi Matsuyama (Kyoto University)
We present a real-time cooperative object tracking system. In our
system, we employ Active Vision Agents (AVAs) for tracking, where an AVA
is a logical model of a network-connected computer with an active
camera. All AVAs cooperatively track their own target objects by
dynamically interacting with each other. As a result, the system as a
whole tracks moving objects under the various and complex situations
found in the real world.
To cooperatively track objects, the system has to establish object
identification among the objects detected by the AVAs. For this object
identification, the AVAs dynamically exchange information about their
detected objects with each other. Employing the dynamic memory
architecture enables the AVAs to 1) estimate information observed at an
arbitrary time, and 2) exchange information without
synchronization. Consequently, the system acquires 1) stability of
object identification, and 2) reactiveness of the system's
behavior.
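The identification step can be pictured as follows: each AVA keeps time-stamped observations of its target, and when two AVAs compare notes they each estimate their object's position at a common time and declare identity if the estimates nearly coincide. The sketch below illustrates this with linear interpolation and a distance threshold, both of which are simplifying assumptions of ours.

    import numpy as np

    def estimate_at(observations, t):
        """Linearly interpolate a time-stamped 2-D trajectory
        [(t, (x, y)), ...] at time t."""
        times = np.array([o[0] for o in observations])
        xy = np.array([o[1] for o in observations], dtype=float)
        return np.array([np.interp(t, times, xy[:, 0]),
                         np.interp(t, times, xy[:, 1])])

    def same_object(obs_a, obs_b, t, threshold=0.5):
        """Two AVAs' detections are identified as the same object if their
        position estimates at a common time t are within `threshold`."""
        return np.linalg.norm(estimate_at(obs_a, t) - estimate_at(obs_b, t)) < threshold

    # AVA1 and AVA2 observed the target asynchronously at different times.
    ava1 = [(0.00, (1.0, 2.0)), (0.20, (1.4, 2.4))]
    ava2 = [(0.05, (1.1, 2.1)), (0.15, (1.3, 2.3))]
    print(same_object(ava1, ava2, t=0.10))   # -> True
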
Takeshi Takai and Takashi Matsuyama (Kyoto University)
3D video is the ultimate image medium recording a dynamic visual event
in the real world as is. Observers can see objects in a scene
from any viewpoint, because 3D video records the time-varying
full 3D object shape and surface properties (i.e., color and texture).
We first present a method for capturing 3D video:
surface reconstruction of the object from a volumetric 3D shape
obtained by the volume intersection method, and
generation of textures from the images captured by the multiple cameras.
We then present a compression method for 3D video
using mesh optimization, which reduces unnecessary complexity of
the object's surface.
The method enables us not only to decrease the size of the 3D video data
but also to maintain a smooth and clear surface.
We finally present an interactive viewer for 3D video.
With this viewer, many parameters (the object's position and rotation,
a virtual camera's position, pan, tilt, zoom, and so on) are specified
intuitively with a graphical user interface, and therefore 3D video can
be visualized efficiently. We also show stereo movies of
3D video with a 3D display system that requires no glasses.
Xiaojun Wu, Toshikazu Wada, and Takashi Matsuyama
(Kyoto University)
3D video is the ultimate image medium recording a dynamic visual event
in the real world as is: time-varying full 3D object shape and surface
properties (i.e., color and texture). This paper first presents a
novel parallel volume intersection method for real-time 3D shape
reconstruction, where a sophisticated plane-to-plane perspective
projection is employed to recover 3D object shape from a set of
multi-viewpoint images. The method is implemented on a PC cluster
system consisting of 10 PCs connected through an ultra-high speed
network (1.28 Gbps). Experiments demonstrate that it can capture
dynamic 3D human shape in real time, at about 10 frames per second with
2 cm x 2 cm x 2 cm spatial resolution. Then we augment the
system by introducing a real-time active object tracking function with a
group of pan-tilt-zoom cameras. While the frame rate drops to
1 frame per second due to the physical camera motion, this new function
enables us to capture 3D object motion over a wide area without
degrading spatial resolution. Finally, we demonstrate the
attractiveness of 3D video with its interactive editor and visualizer.
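The principle of volume intersection (shape from silhouette) is that a voxel belongs to the visual hull only if it projects inside the object silhouette in every view. The sketch below shows the per-voxel test with ordinary 3x4 projection matrices; it is a plain serial illustration of the principle, not the parallel plane-to-plane projection method of the paper.

    import numpy as np

    def volume_intersection(voxels, silhouettes, projections):
        """Keep voxels that project inside the silhouette in every view.

        voxels      : (N, 3) array of voxel-center coordinates in world space
        silhouettes : list of (H, W) boolean masks, one per camera
        projections : list of (3, 4) projection matrices, one per camera
        Returns a boolean array of length N marking voxels in the visual hull.
        """
        homog = np.hstack([voxels, np.ones((len(voxels), 1))])     # (N, 4)
        inside = np.ones(len(voxels), dtype=bool)
        for mask, P in zip(silhouettes, projections):
            h, w = mask.shape
            proj = homog @ P.T                                     # (N, 3)
            u = proj[:, 0] / proj[:, 2]
            v = proj[:, 1] / proj[:, 2]
            in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[:, 2] > 0)
            in_sil = np.zeros(len(voxels), dtype=bool)
            ui = np.clip(u.astype(int), 0, w - 1)
            vi = np.clip(v.astype(int), 0, h - 1)
            in_sil[in_image] = mask[vi[in_image], ui[in_image]]
            inside &= in_sil                                       # carve the voxel
        return inside
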
Shogo Tokai (Fukui University) and Takashi Matsuyama (Kyoto University)
An active camera, used for detecting and tracking object(s) in a
scene, changes its viewing direction automatically, and its field of
view is narrower than that of human vision. Therefore, its live video
sequence alone is not appropriate for human observers monitoring the
scene. To visualize the situation as a video sequence understandable
to humans, we integrate the live video with a wide panoramic image.
In this integration, geometric and photometric consistency are needed
so that the result looks natural to human observers. In this paper, we
propose geometric and photometric methods for integrating the active
camera image with the panoramic image. For the geometric integration,
we use a fixed-viewpoint pan-tilt-zoom camera system, and for the
photometric integration, we present a method based on eigen-image
analysis.
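As a crude stand-in for the photometric side of the integration, one can fit a gain and offset between the live image and the panorama over their overlap by least squares, and apply the fit to the live image before compositing. The sketch below shows only this simplified correction so that the role of photometric consistency is visible; the eigen-image analysis proposed in the paper is a more general treatment.

    import numpy as np

    def fit_gain_offset(live_overlap, pano_overlap):
        """Least-squares fit of pano ~= gain * live + offset over the overlap.

        live_overlap, pano_overlap : 1-D arrays of corresponding pixel intensities
        Returns (gain, offset).
        """
        A = np.column_stack([live_overlap, np.ones_like(live_overlap)])
        (gain, offset), *_ = np.linalg.lstsq(A, pano_overlap, rcond=None)
        return gain, offset

    def correct_live_image(live, gain, offset):
        """Apply the fitted photometric correction to the whole live image."""
        return np.clip(gain * live + offset, 0, 255)

    # Example: the live camera is darker and slightly offset from the panorama.
    live_overlap = np.array([40.0, 80.0, 120.0, 160.0])
    pano_overlap = 1.2 * live_overlap + 10.0
    gain, offset = fit_gain_offset(live_overlap, pano_overlap)
    print(round(gain, 2), round(offset, 2))   # -> 1.2 10.0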