Tuesday, March 8, 2011

Human Body Recognition

Human Body Recognition using points generation


Abstract
The goal of this work is to detect a human figure in an image and localize its joints and limbs, along with their associated pixel masks. In this work we attempt to tackle this problem in a general setting. The dataset we use is a collection of sports news photographs of baseball players, varying dramatically in pose and clothing. Our approach uses segmentation to guide the recognition algorithm to salient regions of the image. We use this segmentation approach to build limb and torso detectors, whose outputs are assembled into human figures. We present quantitative results on torso localization, in addition to shortlisted full-body configurations.
This paper presents a system that can automatically recognize four different static human body postures in video sequences. The considered postures are standing, sitting, squatting, and lying. The recognition is based on data fusion using belief theory. The data come from the person's 2D segmentation and from their face localization, and consist of distance measurements relative to a reference posture (the "Da Vinci posture": standing, arms stretched horizontally). The segmentation is based on an adaptive background-removal algorithm. The face localization process uses skin detection based on color information with adaptive thresholding. The efficiency and the limits of the recognition system are highlighted through the analysis of a large number of results. The system allows real-time processing.
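To make the low-level cues concrete, the following is a minimal sketch, in Python with OpenCV, of the two components this abstract mentions: adaptive background removal for the 2D silhouette and chrominance-based skin detection for face localization. The MOG2 background model, the YCrCb thresholds, and all parameter values are our own assumptions for illustration, not the authors' implementation.

```python
import cv2
import numpy as np

# Hypothetical sketch of the two low-level cues the system combines:
# an adaptively updated background model for the person's silhouette,
# and a skin-colour mask (here in YCrCb space) for face localization.
bg_model = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def segment_person(frame_bgr):
    """Foreground silhouette from an adaptively updated background model."""
    fg = bg_model.apply(frame_bgr)  # 0/255 mask; the model updates online
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return fg

def skin_mask(frame_bgr, cr_range=(133, 173), cb_range=(77, 127)):
    """Skin pixels via chrominance thresholds; the original system would
    adapt these ranges per sequence rather than fix them."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    _, cr, cb = cv2.split(ycrcb)
    mask = ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1]))
    return mask.astype(np.uint8) * 255
```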
A new method for representing and recognizing human body movements is presented. The basic idea is to identify sets of constraints that are diagnostic of a movement: expressed in body-centered coordinates such as joint angles, and in force only during that particular movement. Assuming the availability of Cartesian tracking data, we develop techniques for representing movements as space curves in subspaces of a "phase space." The phase space has axes of joint angles and torso location and attitude, and the axes of the subspaces are subsets of the axes of the phase space. Using this representation, we develop a system that learns new movements from ground-truth data by searching for constraints. We then use the learned representation to recognize movements in unsegmented data. We train and test the system on nine fundamental steps from classical ballet performed by two dancers; the system accurately recognizes the movements in the unsegmented stream of motion.
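The constraint-based representation lends itself to a simple illustration: learn per-axis intervals on a chosen phase-space subspace from example trajectories, then scan an unsegmented stream for windows in which all constraints hold. The sketch below is a deliberately simplified stand-in for the paper's constraint search, with hypothetical names and a fixed slack parameter.

```python
import numpy as np

# Simplified stand-in for the constraint-based movement representation:
# a movement is summarized by [lo, hi] intervals on a subset of
# phase-space axes (joint angles, torso pose) that must hold throughout.

def learn_constraints(trajectories, axes, slack=0.1):
    """From example trajectories (each a T x D array of phase-space
    coordinates), learn per-axis intervals on the chosen subspace axes."""
    stacked = np.vstack([t[:, axes] for t in trajectories])
    return stacked.min(axis=0) - slack, stacked.max(axis=0) + slack

def detect(stream, axes, lo, hi, min_len=10):
    """Scan an unsegmented stream (T x D) for maximal windows in which
    every sampled point satisfies all constraints."""
    inside = np.all((stream[:, axes] >= lo) & (stream[:, axes] <= hi), axis=1)
    runs, start = [], None
    for t, ok in enumerate(inside):
        if ok and start is None:
            start = t
        elif not ok and start is not None:
            if t - start >= min_len:
                runs.append((start, t))
            start = None
    if start is not None and len(inside) - start >= min_len:
        runs.append((start, len(inside)))
    return runs
```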
Introduction
Approaches to recognizing 3D human body postures from a single image have recently become increasingly popular. While they do not suffer from many of the problems that affect more traditional recursive body-tracking techniques, most of them have only been demonstrated in cases where clean body silhouettes can be extracted, for example using background subtraction, which is very restrictive. A key exception is the template-based work discussed below.
Combining a hierarchy of templates with the chamfer distance has made that approach applicable to more challenging cases, such as a camera mounted on a moving car. However, even then, the algorithm tends to produce many false positives, especially when the background is cluttered. As a result, in practice, it is used in conjunction with a stereo rig, both to narrow the initial search area and to filter out false detections from the background. We improve upon this approach and achieve very low rates of both false positives and false negatives by incorporating motion information into our templates. This lets us differentiate between actual people and static objects whose outlines roughly resemble those of a human, which are surprisingly numerous. As we illustrate, this is key to avoiding misdetections. This is of course a well-known fact, and optical-flow methods have been proposed to detect moving humans. However, accurately computing the flow on human limbs is notoriously difficult, especially if the background is not static. Our approach avoids this problem by relying on sequences of moving silhouettes.
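Since chamfer matching is central to this discussion, here is a minimal sketch of the single-frame operation: a distance transform of the image's edge map is computed once, and a silhouette template is scored by the mean distance from its (shifted) edge points to the nearest image edge. The edge-detector thresholds and all names are illustrative assumptions, not the cited systems' exact choices.

```python
import cv2
import numpy as np

def distance_map(gray):
    """Distance transform: value at each pixel = distance to nearest edge."""
    edges = cv2.Canny(gray, 80, 160)              # uint8 edge map, edges = 255
    return cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)

def chamfer_score(dist, template_pts, offset):
    """Mean distance from shifted template edge points to the nearest
    image edge; lower means a better match. template_pts is N x 2 (x, y)."""
    pts = template_pts + np.asarray(offset, dtype=int)
    h, w = dist.shape
    ok = (pts[:, 0] >= 0) & (pts[:, 0] < w) & (pts[:, 1] >= 0) & (pts[:, 1] < h)
    if not ok.all():
        return np.inf                             # template leaves the image
    return float(dist[pts[:, 1], pts[:, 0]].mean())
```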
More specifically, we focus on the part of the walking cycle where both feet are on the ground and use motion capture data to create sequences of 2D silhouettes that we match against short image sequences. We chose this specific posture both because it is very characteristic and because it could easily be used to initialize a more traditional recursive tracking algorithm to recover the in-between body poses.
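Conceptually, extending this to silhouette sequences amounts to accumulating single-frame matching costs along a template sequence and a trajectory of image offsets, as in the hypothetical sketch below; a real detector would of course search offsets hierarchically rather than by brute force.

```python
import numpy as np

def sequence_cost(frame_cost, dist_maps, template_seq, offsets):
    """Sum of per-frame costs of template t against frame t at offset t.
    frame_cost is any single-frame scorer, e.g. the chamfer_score
    sketched earlier; dist_maps are the per-frame distance transforms."""
    return sum(frame_cost(d, tpl, off)
               for d, tpl, off in zip(dist_maps, template_seq, offsets))

def best_track(frame_cost, dist_maps, template_seq, candidate_offsets):
    """Brute-force search over candidate offset trajectories (each a
    sequence of per-frame offsets); returns the best one and its cost."""
    costs = [sequence_cost(frame_cost, dist_maps, template_seq, offs)
             for offs in candidate_offsets]
    best = int(np.argmin(costs))
    return candidate_offsets[best], costs[best]
```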
As our results show, we obtain good results even when the background is cluttered and background subtraction is impractical because the camera moves. Note that the subjects move closer or farther, so that their apparent scale changes, and turn, so that the angle from which they are seen also varies. In this example, no stereo data or information about the ground plane was required to eliminate false positives. Our method retains its effectiveness indoors, outdoors, and under difficult lighting conditions. Furthermore, because the detected templates are projections of 3D models, we can map them back to full 3D poses. Note that, even though we chose a specific motion to test it, our approach is generic and could be applied to any other action that all people perform in roughly similar ways but with substantial individual variations. For example, there are also characteristic postures for somebody sitting down on a chair or climbing stairs. In the area of sports, we could use a small number of templates to represent the consecutive postures of a tennis player hitting the ball with a forehand, a backhand, or a serve, as has been done in earlier work. We could similarly handle the transition between the upswing and the downswing of a golfer. In short, characteristic postures are common in human motion and, therefore, worth detecting. The only requirement for applying our method is that a representative motion database can be built. In the remainder of the paper, we first briefly discuss earlier approaches. We then introduce our approach to body pose detection and present a number of results obtained in challenging conditions. Finally, we discuss possible extensions.
Related Work

Until recently, most approaches to capturing human 3D motion from video relied on recursive frame-to-frame pose estimation. While effective in some cases, these techniques usually require manual initialization, and re-initialization if the tracking fails. As a result, there is now increasing interest in techniques that can detect a 3D body pose from individual frames of a monocular video sequence. One approach is to use classification to detect people in images, but it provides neither a pose nor a precise outline. Furthermore, such global approaches tend to be very sensitive to occlusions. Instead of detecting the body as a whole, a different tack is to look for individual body parts and then try to assemble them to retrieve the pose. This can be done by minimizing an appropriate criterion, for example using an A* algorithm. This has the potential to retrieve human bodies under arbitrary poses and in the presence of occlusions. Furthermore, it can be done in a computationally efficient way using pictorial structures. However, it can easily become confused because there are many limb-like objects in real-world images.
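For concreteness, the following sketch shows the kind of tree-structured dynamic programming that makes pictorial-structures inference efficient: unary appearance costs per part location plus pairwise deformation costs along the kinematic tree, minimized exactly by an upward and a downward pass. All names are hypothetical; this is not the cited authors' implementation.

```python
import numpy as np

def best_assembly(unary, children, pairwise, root=0):
    """unary[p]: (K,) appearance cost of part p at each of K candidate locations.
    children[p]: list of child parts of p in the kinematic tree.
    pairwise[(p, c)]: (K, K) deformation costs, indexed [loc_p, loc_c].
    Returns (minimum total cost, dict part -> chosen location index)."""
    msg, back = {}, {}

    def pass_up(p):
        cost = np.asarray(unary[p], dtype=float).copy()
        for c in children.get(p, []):
            pass_up(c)
            total = pairwise[(p, c)] + msg[c][None, :]  # (K, K)
            back[c] = total.argmin(axis=1)              # best child loc per parent loc
            cost += total.min(axis=1)
        msg[p] = cost

    pass_up(root)
    loc = {root: int(np.argmin(msg[root]))}

    def pass_down(p):
        for c in children.get(p, []):
            loc[c] = int(back[c][loc[p]])
            pass_down(c)

    pass_down(root)
    return float(msg[root][loc[root]]), loc

# Example: a 3-part chain (torso -> upper arm -> lower arm), K = 4 locations.
K = 4
rng = np.random.default_rng(0)
unary = {p: rng.random(K) for p in range(3)}
children = {0: [1], 1: [2], 2: []}
pairwise = {(0, 1): rng.random((K, K)), (1, 2): rng.random((K, K))}
cost, locations = best_assembly(unary, children, pairwise)
```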
Another class of approaches relies on techniques such as background subtraction to produce silhouettes that can then be analyzed. Several methods learn, during an offline stage, a mapping from the visual input space formed by the silhouettes to the 3D pose space, from examples collected manually or created using graphics software. For example, one method uses multilayer perceptrons to map the silhouette, represented by its moments, to the 3D pose. In another, the mapping is performed using robust locally weighted regression over nearest neighbors that are efficiently retrieved using hash tables. Elsewhere, it is done indirectly via manifolds embedded in low-dimensional spaces, where each manifold corresponds to the subset of silhouettes of a walking motion seen from a particular viewpoint; Local Linear Embedding is used to map the manifolds to both the silhouettes and the 3D poses. In yet another approach, the mapping from the pair formed by an extracted silhouette and a predicted pose to the corresponding 3D pose is established using a Relevance Vector Machine. While these works introduce powerful tools for associating 3D poses with detected silhouettes, they tend to be of limited practical use because they require relatively clean silhouettes that are not always easy to obtain. A more robust way to match global silhouettes against image contours is to use both a hierarchy of templates and the chamfer distance, an approach that was originally introduced and later extended. This produces excellent results when applied to difficult outdoor images. However, it seems to have a relatively high false-detection rate. Reducing this rate involves either introducing a priori assumptions about where people can be, or incorporating additional processing such as texture classification or stereo verification. In the context of hand tracking, a related method also relies on the chamfer distance and, for efficiency, on a tree structure quite similar to the hierarchy of templates. In that case, the false-positives-and-negatives problem is avoided by assuming that one and only one hand is present in the image, and Bayesian tracking is combined with detection to disambiguate the hand pose.
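As a toy illustration of such example-based mappings, the sketch below describes a silhouette by its Hu moments and regresses the 3D pose as a locally weighted average over nearest training examples. The descriptor and the weighting scheme are our assumptions; the works above use richer features, hashing, manifold embeddings, or RVM regression.

```python
import cv2
import numpy as np

def descriptor(silhouette_mask):
    """Log-scaled Hu moments of a binary silhouette (a cheap,
    scale/rotation-tolerant shape descriptor)."""
    m = cv2.moments(silhouette_mask, binaryImage=True)
    hu = cv2.HuMoments(m).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)

def predict_pose(query_mask, train_descs, train_poses, k=5):
    """Locally weighted average of the k nearest training examples' poses.
    train_descs: (N, 7) descriptors; train_poses: (N, D) pose vectors."""
    q = descriptor(query_mask)
    d = np.linalg.norm(train_descs - q, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-6)                     # inverse-distance weights
    return (w[:, None] * train_poses[idx]).sum(axis=0) / w.sum()
```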
In contrast to these earlier approaches, our method, which also relies on global silhouette matching, includes an original way of taking motion into account to avoid false positives. Such information was also exploited in [2] for human action recognition, but only under the assumption that preprocessed and centered sub-images of the people are available. In our case, we directly use the full images as input.