1. Overview and Status
Segmentation can be defined as the operation of partitioning a scene
into regions extracted according to a given criterion. Effective segmentation
of objects in static or dynamic scenes (images or video) is a topic that
has demanded much attention and, despite progress, still poses a
significant challenge. Typical schemes for image segmentation have included
extraction of features such as edges and curves, and integration of these
features into continuous shapes that are spatially coherent. Typical schemes
for video segmentation, in addition to those used for image segmentation,
have included detection of temporal changes due to the motion of individual objects
in a temporally coherent manner, as well as combinations of the two. In
the past, there have been many motivations for work on image and video
segmentation, such as scene analysis, pattern matching, character recognition,
industrial vision systems, target recognition, and biomedical applications. The work
has generally been very application-specific and has included scenes captured
with different types of sensors and noise levels. Not surprisingly, the
results have also been application-specific; for instance, some of the
techniques work well for detecting enemy vehicles in high-resolution
satellite imagery, others work well for detecting and identifying machine
parts, and yet others work well in traffic sensing and surveillance.
More recently, an evolving breed of multimedia applications requiring
advanced functionalities has provided a new focus for work on segmentation.
The main requirement of these applications is access to individual audio-visual
objects in the scene. The advanced functionalities that need to be
supported include the capability to move these objects freely
and rearrange the scene; the capability to add, drop, or modify objects
in a coded scene without re-encoding; the capability to improve the spatial
or temporal quality of objects; and the capability to combine natural coded
objects with synthetic objects (defined by model parameters). MPEG-4
is an object-based multimedia standard (in progress) designed to address
such needs. MPEG-4 video standardizes the syntax and semantics of the video
bitstream and specifies the decoding process; it does not mandate any specific
pre-processing or details of encoding. MPEG-4 encoding assumes the availability
of segmented video objects (VOs) in the form of a sequence of snapshots
in time of these objects, referred to as video object planes (VOPs). For
informational purposes, the current specification of MPEG-4 includes
a discussion of the segmentation of objects based on the work of the Multifunctional
ad hoc group of MPEG-4 video. Section 2 provides a brief overview of this
work.
2. The MPEG-4 Approach
Figure 1 shows the framework for segmentation of video being examined in MPEG-4; it consists of up to three major steps. In the first step, global motion compensation and scene cut detection are applied as preprocessing to compensate for overall camera movement. In the second step, either temporal segmentation alone or both temporal and spatial segmentation are performed. The third step is necessary only when both temporal and spatial segmentation are performed in the second step, and simply consists of merging the results of the second step.
Figure 1. The MPEG-4 framework for video segmentation
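The control flow of the framework can be sketched in a few lines of Python. This is only an illustrative outline, not part of the MPEG-4 specification: the function `segment_video` and its four callable parameters (`preprocess`, `temporal_segment`, `spatial_segment`, `merge`) are hypothetical names introduced here, and the stand-in lambdas in the usage lines do no real image processing.

```python
def segment_video(frames, preprocess, temporal_segment,
                  spatial_segment=None, merge=None):
    """Run the up-to-three-step framework on a sequence of frames.

    All callables are hypothetical placeholders supplied by the caller;
    the source text does not prescribe their implementation.
    """
    # Step 1: preprocessing (global motion compensation, scene cut detection).
    prepared = preprocess(frames)

    # Step 2: temporal segmentation is always performed.
    temporal_result = temporal_segment(prepared)
    if spatial_segment is None:
        # Temporal segmentation only: step 3 is not needed.
        return temporal_result

    spatial_result = spatial_segment(prepared)

    # Step 3: merging is required only when both segmentations were run.
    if merge is None:
        raise ValueError("merge is required when spatial_segment is given")
    return merge(temporal_result, spatial_result)


# Usage with trivial stand-ins for each stage.
temporal_only = segment_video(
    ["f0", "f1"],
    preprocess=lambda fs: fs,
    temporal_segment=lambda fs: "temporal",
)
both = segment_video(
    ["f0", "f1"],
    preprocess=lambda fs: fs,
    temporal_segment=lambda fs: "temporal",
    spatial_segment=lambda fs: "spatial",
    merge=lambda t, s: (t, s),
)
```

Passing the stages in as callables keeps the sketch neutral about how each step is actually carried out.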
2.2 Spatial Segmentation
The spatial segmentation algorithm employed consists of three steps. In the first step, morphological filters are used for image simplification. The second step involves approximating the spatial gradient by use of a morphological gradient operator, and using it as input to a watershed algorithm for identifying regions of homogeneous intensity. The third step involves merging the regions obtained from the watershed algorithm, which usually over-segments the image.
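As a rough illustration of the second step, a morphological gradient is commonly computed as the difference between the dilation and the erosion of an image. The sketch below assumes a grayscale image stored as a list of lists and a flat 3x3 structuring element clipped at the borders; both choices, and all function names, are assumptions made for this example rather than details taken from the MPEG-4 work. The watershed and region-merging stages are not shown.

```python
def _window(image, row, col):
    """Pixel values in the 3x3 window around (row, col), clipped at borders."""
    height, width = len(image), len(image[0])
    return [
        image[r][c]
        for r in range(max(0, row - 1), min(height, row + 2))
        for c in range(max(0, col - 1), min(width, col + 2))
    ]


def dilate(image):
    """Grayscale dilation: each pixel becomes the maximum of its window."""
    return [[max(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def erode(image):
    """Grayscale erosion: each pixel becomes the minimum of its window."""
    return [[min(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def morphological_gradient(image):
    """Dilation minus erosion; large values mark intensity transitions."""
    dilated, eroded = dilate(image), erode(image)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]


# Two flat regions (intensity 10 and 50) separated by a vertical edge.
image = [
    [10, 10, 10, 50, 50, 50],
    [10, 10, 10, 50, 50, 50],
    [10, 10, 10, 50, 50, 50],
]
gradient = morphological_gradient(image)
for row in gradient:
    print(row)  # each row is [0, 0, 40, 40, 0, 0]
```

The gradient is zero inside each homogeneous region and non-zero only along the boundary, which is the property a watershed algorithm relies on when it treats the gradient as a relief to be flooded.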
While we humans are able to identify meaningful semantic objects in scenes with relative ease, because of our experience and our vision and recognition processes, which employ feature, shape, color, and movement analysis, the same task is rather difficult for an automated algorithm and requires significant processing. Machine vision systems have been successful only by solving a subset of the bigger problem and by incorporating learning systems. In a general sense, robust automatic segmentation of video is an area in which significant advances are needed, and even the state of the art is far from satisfactory. The next best approach may be to devise semi-automatic segmentation algorithms that work well with minimal human intervention. For example, the segmentation could be specified manually at each scene change and then maintained by tracking the movement of the objects. A mechanism could determine when the objects become untrackable, at which point the segmentation map for that frame would be manually re-specified and motion tracking resumed. In specialized cases, such as when video scenes consist of video objects in front of a background with a chosen chromakey color (with some noise), automatic real-time segmentation should be possible. Such applications include advanced videoconferencing with background insertion, TV weather forecasting, and live video games.
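The chromakey case is simple enough to sketch directly. The example below assumes RGB pixels stored as tuples, a pure blue key color, and a Euclidean distance threshold to absorb noise; the function name `chroma_key_mask`, the key color, and the tolerance value are all illustrative assumptions, not values from the source.

```python
def chroma_key_mask(frame, key_color, tolerance):
    """Return a binary mask: 1 for object pixels, 0 for background pixels.

    A pixel counts as background when its Euclidean distance to key_color
    in RGB space is at most `tolerance` (an assumed noise margin).
    """
    tolerance_sq = tolerance * tolerance
    mask = []
    for row in frame:
        mask_row = []
        for pixel in row:
            dist_sq = sum((p - k) ** 2 for p, k in zip(pixel, key_color))
            mask_row.append(0 if dist_sq <= tolerance_sq else 1)
        mask.append(mask_row)
    return mask


KEY_BLUE = (0, 0, 255)  # assumed chromakey color
frame = [
    [(0, 0, 255), (3, 2, 250), (200, 150, 120)],
    [(5, 0, 248), (190, 140, 110), (0, 4, 255)],
]
mask = chroma_key_mask(frame, KEY_BLUE, tolerance=20)
print(mask)  # [[0, 0, 1], [0, 1, 0]]
```

Because each pixel is classified independently with a fixed amount of arithmetic, the cost is linear in the number of pixels, which is consistent with the expectation that this special case can run in real time.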