Recovering and tracking coherent human motion from monocular images in a consistent form is a challenging problem for several reasons. First is the flexibility of human motion; second, the variety of body types; and third, the variability and unevenness of movement styles. Finally, the three-dimensional (3D) nature of the human body, and how humans see and interpret real-life images, poses a challenge of its own (Guo and Qian, p. 1).
The importance of proper human pose estimation and recovery extends beyond the routine annotation of human activities in video databases or the design of 3D computer and video games. It matters for gait and posture analysis of athletes and the physically disabled, and for articulated and emotional human body animation (Kakadiaris and Metaxas, p. 1453). This essay aims to provide a brief yet comprehensive review of the learning-based method of human pose estimation and tracking.
There are two common basic approaches to human pose estimation and recovery: the model-based and the learning-based approach.
Model-based method of human pose estimation and tracking
This approach assumes a clearly defined characteristic body model in advance, and estimates body poses by one of the following methods:
- Appearance-based methods for people detection,
- part-based methods for people detection (face or limbs), and
- 3D pose estimation in images using either matching-based methods or an inverse kinematics approach (that is, first determine the body joints, then estimate the body pose).
Human pose tracking in the model-based method uses either gradient-based methods (tracking short successions of movements) or sampling-based methods (Lee and Cohen (a), pp. 906-907).
Learning-based method of human pose estimation
This method avoids the need for an assumed, definite characteristic body model. It exploits the hypothesis that the set of representative human body poses is smaller than the set of kinematically possible poses. This is achieved by learning (estimating) a model that directly recovers pose estimates from observable image measurements.
This occurs in one of two ways: either by explicitly storing and searching for comparable training examples, or by distilling a training database into a single compact model with good generalization (using Bayesian regression analysis) (Agarwal and Triggs (a), p. 1).
Image description and human body pose
To represent a recorded picture properly, two procedures are needed: first, a means of describing the subject image (an image shape descriptor), here the silhouette; second, a means of describing body pose, here the representation of pose by joint angles (Agarwal and Triggs (b), pp. 2-3).
Recovering the stance of an individual from a single image is a stimulating challenge; among the various image description methods (image descriptors), image silhouettes are the most common. Image silhouettes have advantages and disadvantages (Agarwal and Triggs (c), p. 6).
- Advantages of image silhouettes as image descriptors: they can be trusted to provide good results even with complex, cluttered backgrounds or with the segmentation of body movements that create shadows; they have low sensitivity to immaterial surface features such as the color and texture of clothes; and they require no labeling information to create a 3D adopted posture of a person in the picture.
- Disadvantages (limitations) of image silhouettes as image descriptors: image artifacts (such as poor background segmentation) alter local shape characteristics, and some details are difficult to determine, such as distinguishing a front view from a back view, or whether the left or the right lower limb is leading while walking.
Agarwal and Triggs (d) (2006) suggested a bottom-up approach, which uses the built-in characteristics of an image to approximate the pose of the upper part of the human body from a single image with a cluttered background. Common regression methods in current use need prior segmentation or rely on weak limb detectors.
The method they described depends on characteristic human gradients (such as shoulder contours or bent elbows). They stated that this method produces performance similar to example-based methods, with the advantage that it works when crude natural backgrounds are present in the image, without prior segmentation.
A robustly constructed silhouette representation of an image is needed to overcome segmentation errors and occlusions. To do that, Agarwal and Triggs (b) (p. 3) and (c) (p. 6) suggested that histograms of local edge information can be used to build a robust shape representation.
They called this method shape-context distributions and described the following steps. The process starts by computing local descriptors at regularly spaced points on the edge of the silhouette, creating shape contexts. These shape contexts encode the local silhouette shape over a range of scales (the local scale being approximately the diameter of a limb).
Thus, the silhouette shape is encoded as a distribution, and comparing silhouettes becomes a matter of comparing distributions of shape contexts. The result is further improved by reducing each silhouette's distribution of points to a 100-D histogram.
This occurs by vector quantization: the k-means algorithm computes cluster centers (non-parametric approximations) of the shape-context vectors pooled from all training silhouettes, and each silhouette's contexts are then assigned to these centers to form its histogram. Agarwal and Triggs (b) suggested, in both papers, that this method produces fairly robust silhouette representations.
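The quantization step can be sketched in a few lines of numpy. This is a minimal illustration only: the 60-D context dimension, the random toy data, and the plain k-means loop are my assumptions for demonstration, not the authors' implementation.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns k cluster centres for the given points."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centre
        d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return centres

def silhouette_histogram(contexts, centres):
    """Quantize one silhouette's shape-context vectors against the codebook
    and return a normalized occupancy histogram (a fixed-size descriptor)."""
    d = np.linalg.norm(contexts[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centres)).astype(float)
    return hist / hist.sum()

# toy data: 60-D shape-context vectors pooled from many training silhouettes
rng = np.random.default_rng(1)
training_contexts = rng.normal(size=(500, 60))
codebook = kmeans(training_contexts, k=100)   # 100 centres -> 100-D histograms
one_silhouette = rng.normal(size=(80, 60))    # contexts from a single silhouette
descriptor = silhouette_histogram(one_silhouette, codebook)
```

The key property is that every silhouette, regardless of how many contour points it has, ends up as a fixed-length 100-D histogram, so silhouettes can be compared directly.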
The second means of representing human body pose in an image is by joint angles. Joints are classified either by their structure (fibrous, cartilaginous, or synovial) or by their range of movement.
According to joints' range of movement, there are 18 major human body joints, and the azimuth angle (the angular distance along the horizontal between a reference direction, usually the observer's bearing, and the object) can cover up to 360 degrees. To avoid the discontinuity at 360 degrees, each joint angle in the 3D body pose representation is encoded through its cosine and sine (Agarwal and Triggs (b), p. 3).
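A minimal sketch of this angle encoding follows; the helper names and the toy angles are mine, chosen only to show why the (cos, sin) pair avoids the wrap-around at 360 degrees.

```python
import numpy as np

def encode_pose(joint_angles_deg):
    """Encode each joint angle as a (cos, sin) pair so that 359 degrees and
    1 degree map to nearby points instead of opposite ends of a 0-360 range."""
    theta = np.radians(np.asarray(joint_angles_deg, dtype=float))
    return np.stack([np.cos(theta), np.sin(theta)], axis=1).ravel()

def decode_pose(encoded):
    """Recover angles in [0, 360) from the (cos, sin) pairs."""
    pairs = encoded.reshape(-1, 2)
    return np.degrees(np.arctan2(pairs[:, 1], pairs[:, 0])) % 360.0

angles = [10.0, 359.0, 180.0]
enc = encode_pose(angles)   # 3 joints -> 6 numbers
rec = decode_pose(enc)      # round trip back to degrees
</imports```

With this encoding, the Euclidean distance between the encodings of 359 and 1 degrees is small, matching the physical closeness of the two joint configurations.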
Analysis of human body movements
The skill of humans, as social beings, in drawing conclusions about others' communicative actions and expressions depends on their perceptual skills. These include visual, motion-related, emotional-expression, and neural mechanisms, all of which affect the understanding of what a human sees.
Therefore, analysis of human motion helps retrieve a proper 3D human pose from a 2D image (Blake and Shiffrar, p. 47). Human motion carries information about identity, intentions, and emotions interpreted from body movements, and the human visual perception system encodes such information. Troje (p. 371), using linear statistical and pattern-analysis methods, developed a framework that converts human motion into a representation amenable to analysis.
In addition, the author inferred that this analysis exposes the dynamic part of human movement related to biological sex differences (such as male and female walking patterns). Therefore, this framework can be used both for the analysis of human movements and for the synthesis of new movement prototypes.
Zhang and Troje (p. 1) evaluated an approach to reconstructing 3D periodic human movements from 2D motion sequences. In the context of the learning-based method of human pose estimation, they used a training collection of 3D data to build a linear algorithm transforming 2D image data into a reconstructed 3D pose.
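A linear 2D-to-3D mapping of this general kind can be sketched as follows. The dimensions, the random toy data, and the plain least-squares fit are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy training set: n frames of 2D marker coordinates (X) and the matching
# 3D coordinates (Y); real data would come from motion capture
n, d2, d3 = 200, 20, 30
X = rng.normal(size=(n, d2))
true_W = rng.normal(size=(d2, d3))
Y = X @ true_W + 0.01 * rng.normal(size=(n, d3))

# learn the linear 2D -> 3D mapping by least squares
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# apply the learned mapping to a new 2D frame
x_new = rng.normal(size=(1, d2))
y3d = x_new @ W
```

Because the mapping is a single matrix, reconstruction of each new frame is a single matrix multiplication, which is what makes linear models attractive for this task.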
Agarwal and Triggs (b) (p. 6) explained a regression technique to retrieve a 3D human body pose from a 2D silhouette image descriptor. Vector graphics consist of paths or vectors with start and end points, while regression, in its general sense, means to revert to an earlier or common form <http://www.sharpened.net/glossary/definition.php?vectorgraphic>.
In their regression model, Agarwal and Triggs (b) (p. 6) represented the 3D human pose (the output) as a vector y and the input shape (the silhouette descriptor) as a vector x. Because pose retrieval is ambiguous, they had to assume that the association between x and y is relational rather than functional, and they approximated the mapping from x to y by a linear combination of a set of previously determined basis functions.
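The basis-function form y(x) ≈ Σ_k a_k φ_k(x) can be sketched with Gaussian basis functions and a ridge-regularized least-squares fit. The toy dimensions, kernel width, and regularizer here are assumptions for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 100-D silhouette descriptors X, 6-D pose vectors Y
n, dx, dy, K = 300, 100, 6, 50
X = rng.normal(size=(n, dx))
Y = np.tanh(X[:, :dy]) + 0.01 * rng.normal(size=(n, dy))

# basis functions: Gaussian kernels centred on K of the training examples
centres = X[:K]
def phi(Z):
    d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * dx))   # design matrix, one column per basis fn

# ridge-regularized least squares for the weight matrix A (K x dy),
# giving the pose prediction y(x) ~ sum_k a_k * phi_k(x)
P = phi(X)
lam = 1e-3
A = np.linalg.solve(P.T @ P + lam * np.eye(K), P.T @ Y)

Y_hat = P @ A   # predicted poses for the training descriptors
```

The regularizer `lam` keeps the weights well-conditioned; in the relevance vector machine used by Agarwal and Triggs the regularization additionally drives most weights to exactly zero, which this plain ridge sketch does not do.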
Non-linear and regression methods of recovering 3D human body pose from monocular images
Rosales and colleagues in 2001 (IEEE International Conference on Computer Vision, Vancouver) presented a system for 3D hand pose retrieval (a difficult problem) from 2D color images. The system uses a supervised non-linear learning framework they called the specialized mappings architecture (specialized, being for a particular part of the body). In other words, it is a system that maps image features to possible 3D hand poses using this supervised non-linear framework.
This system has two basic parts: a collection of specialized forward mappings and a separate feedback matching function. The forward mappings are approximated from training data, such as joint configurations and visual features, whereas joint angle data are obtained from a CyberGlove® <http://www.mindflux.com.au/products/vti/cyberglove.html>. In training, a computer graphics module produces the visual features by rendering the hand from arbitrary viewpoints, given the 22 hand joint angles.
In October 2003, Cohen and Hongxia (IEEE Workshop, Nice, France) presented a method to infer 3D body posture. They used a 3D visual hull built from a collection of silhouettes to represent 3D shapes in an appearance-based, view-independent manner. Categorizing and recognizing body posture is done using the resulting 3D shape description (with the help of a support vector machine). The advantage of this approach is its ability to generalize over various human shapes, allowing recognition of body postures across many people.
Another learning-based method of human pose estimation is the bi-directional (top-down and bottom-up) generative-recognition model for 3D human pose estimation from monocular images. The basic concept of this method is that the recognition model is fine-tuned using samples from the generative model, which in turn is optimized to produce inferences close to those expected by the recognition model.
At equilibrium, the two models (recognition and generative) become consistent. The advantage of this framework is the production of consistent 3D initialization and retrieval of 3D human poses (Sminchisescu and colleagues, IEEE Conference, New York, 2006).
Retrieval of 3D human poses from silhouettes by relevance vector regression in the learning-based context was described by Agarwal and Triggs (b) (pp. 4-5). In this method, the 3D human pose is retrieved by direct non-linear regression against silhouette shape-descriptor vectors, which are extracted automatically from the silhouettes. The advantages of this method are that it requires no precise body model and no prior labeling of body parts in the image.
Human body posing from static pictures
Retrieving a 3D human body pose from static monocular (2D) images differs from holography, which does not produce motion from a static attitude. Retrieving human pose from static images is demanding because of the high dimensionality of the state, the presence of image distortion, and the ambiguity of the image projection.
Solving these problems may require the use of multiple cameras, or particle-filter methods that estimate human body motion across time with an observation-based sampling scheme. Other approaches identify a prototype of human body motions using silhouettes as shape descriptors, or map image features onto arrangements of the outlines of human body parts (Lee and Cohen (b), p. 127).
Lee and Cohen (b) (pp. 127-128) suggested constructing an image-generating model made up of three components: a human model including shape and articulated joint structure, scene-to-image projection estimates, and the generation of image features. They used the Markov Chain Monte Carlo (MCMC) technique to address the problem of the high-dimensional state space.
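A minimal random-walk Metropolis sketch shows how MCMC explores a high-dimensional pose-parameter space. The 5-D Gaussian toy posterior and the step size are my assumptions; the authors' proposal moves are far more structured (driven by the image cues discussed below).

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_steps=5000, step=0.2, seed=0):
    """Random-walk Metropolis sampler: propose a Gaussian perturbation of
    the current state, accept it with probability min(1, p'/p)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = []
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=x.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop   # accept the move
        samples.append(x.copy())    # rejected moves repeat the current state
    return np.array(samples)

# toy "pose posterior": a Gaussian over a 5-D pose-parameter vector
target_mean = np.array([0.5, -1.0, 0.0, 2.0, 1.5])
log_post = lambda x: -0.5 * np.sum((x - target_mean) ** 2)

samples = metropolis_hastings(log_post, x0=np.zeros(5))
est = samples[1000:].mean(axis=0)   # discard burn-in, average the rest
```

The appeal for pose estimation is that the sampler only ever needs the posterior up to a constant, so an image-likelihood score can stand in for `log_post` without normalization.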
Three main image features are selected and used to compute the data-driven proposals in the MCMC technique (Lee and Cohen (b), pp. 129-134).
- Face detection: Lee and Cohen (b) used a boosting algorithm (AdaBoost), as described by Schapire, 1990 (after Bartlett and Traskin, p. 2347), to detect the face and estimate the body scale from the detected face location in the image. The same result can then be combined with a normal distribution to infer the distribution of the head position.
- Head-shoulder contour matching: this involves developing a contour model of the head-to-shoulder outline, then matching the contour in test images by extracting the edges.
- Detection of skin regions: this is important for providing cues to exposed body parts such as the arms.
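The boosting idea behind the face-detection cue can be sketched with decision stumps. This is a generic AdaBoost illustration on synthetic 2-D "face vs. non-face" feature vectors, not the detector used by Lee and Cohen; all names and data here are assumptions.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """AdaBoost with depth-1 decision stumps; y holds +1/-1 labels."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)           # sample weights, reweighted each round
    learners = []
    for _ in range(n_rounds):
        best = None
        # pick the stump (feature, threshold, sign) with lowest weighted error
        for j in range(d):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # stump's vote weight
        pred = s * np.where(X[:, j] > t, 1, -1)
        w *= np.exp(-alpha * y * pred)          # upweight misclassified samples
        w /= w.sum()
        learners.append((alpha, j, t, s))
    return learners

def predict(learners, X):
    score = np.zeros(len(X))
    for alpha, j, t, s in learners:
        score += alpha * s * np.where(X[:, j] > t, 1, -1)
    return np.sign(score)

# toy, well-separated positive and negative feature clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 0.5, (50, 2)), rng.normal(-1.0, 0.5, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
model = train_adaboost(X, y)
acc = (predict(model, X) == y).mean()
```

Each round focuses the next weak learner on the examples the current ensemble gets wrong, which is the consistency property analyzed by Bartlett and Traskin.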
Evaluating candidate 3D human body poses from static monocular images depends on two principal factors: region similarity and color similarity. Region similarity requires color segmentation of the input image into a number of component parts. Color similarity mainly reflects the differences between the color of the human body (including clothes) and the colors present in the background (Lee and Cohen (b), pp. 133-135).
Human motion is interpreted by our brains as a visual flow of small-range movements combined into the full range of an intended, purposeful movement (Brain and Walton, p. 436). According to Welch and Foxlin (p. 24), in computer graphics, human body motion tracking serves four basic aims. The first and most frequently encountered is generating near-realistic moving characters imitating full- or part-body motion; this is avatar (embodiment) animation.
The second is providing position and orientation control for rendering graphics in a head-mounted display; this is head-tracker view control. The third is navigating and moving through a virtual computer-graphics world and, with the help of hand-held devices, manipulating objects in that world.
Finally, tracking helps register virtual computer images with their real-world counterparts. This is particularly useful in surgical training and in mechanical assembly, whether in industry or in design for the disabled.
Tracking methods differ between monocular image sequences and 3D image-based techniques. For monocular 2D image sequences, tracking methods use signals or cues such as colors, edges, or differences in intensity (where the measured change is associated with a change in the speed of the object). However, these methods estimate motion from frame to frame, and the errors, however small, accumulate and cause the human pose estimate to drift.
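The drift effect can be illustrated with a one-parameter toy simulation; the frame count and noise level are arbitrary assumptions chosen only to show how small per-frame errors integrate into a growing offset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 1000

# true per-frame motion of one pose parameter, and a noisy frame-to-frame
# estimate of the same motion (small zero-mean error at every frame)
true_delta = 0.01 * np.ones(n_frames)
est_delta = true_delta + rng.normal(scale=0.005, size=n_frames)

# integrating the increments gives the tracked pose over time
true_pose = np.cumsum(true_delta)
est_pose = np.cumsum(est_delta)

# the accumulated error (drift) grows roughly like sqrt(n_frames),
# even though each individual frame's error is tiny
drift = np.abs(est_pose - true_pose)
```

This is why purely incremental 2D trackers are typically paired with some absolute correction (a detector or a 3D model fit) that resets the accumulated error.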
With 3D image-based methods, forces applied to each moving rigid part of the model minimize the model's error against the data at hand. In learning-based retrieval of 3D human poses from monocular images, iterative algorithms provide a framework to model and track motion (Demirdjian, p. 1).
According to Demirdjian (p. 6), an ideal tracking model is unconstrained, or at least constrained to a minimum, for both shape and articulated motion; that is, it should allow the best substitution of motion variables along the articulated motion space.
Applications that include visual simulation demand superior tracking quality to make the user feel that the virtual world is compatible with the user's visual and neurological real-life experiences. Superior tracking quality avoids simulation sickness, which results from disturbing the oculo-vestibular mechanisms (perceptual stability). Tracking deficits, even if they do not affect performance, should always remain below the user's level of perception (Welch and Foxlin, p. 27).
Tracking performance specifications differ for static and dynamic 3D human body estimation. For static estimation, the relevant measures are spatial distortion, spatial jitter (rapid signal fluctuation), and stability. For dynamic estimation, the measures are latency, latency jitter, and dynamic errors other than latency (Welch and Foxlin, p. 27).
A proper tracking method can rely on five physical sensing modalities that make the human pose measurable (Welch and Foxlin, pp. 25-29): mechanical, inertial, acoustic, optical, and magnetic sensing.
No single currently available technique covers all these modalities, and each technique has its own limitations and advantages. The limitations can be modality-related (tied to the particular medium used), measurement-related (tied to the instrument or device used), or circumstantial and inference-based (tied to the application) (Welch and Foxlin, pp. 25-29).
Producing sensible implicit 3D human body images from monocular 2D images challenges human conceptions, physiology, and inherent experiences. Building these images with a model-based method is demanding and needs rigorous work to cover the many possibilities of human motion and articulation. Of particular difficulty in this method is the creation of automatic models, as it confronts the variety of body shapes and of motion styles and ranges.
Learning-based approaches, on the other hand, avoid the problem of preparing software or hardware to develop a precise 3D model. They take advantage of mathematical formulae to build an algorithm through which modeling from a 2D image and tracking of human body motion become possible. There is no clear preference for one method over the other yet; perhaps it is a question of the development team's preference.
Agarwal, A and Triggs, B (a). “Learning to Track 3D Human motion from Silhouettes.” Proceedings of the 21st. International Conference of Machine Learning. Canada, Banff. 2004.
Agarwal A and Triggs B (b). “Recovering 3D Human Pose from Monocular Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence vol 28(1) 2006. p. 1-15.
Agarwal A and Triggs B (c). “Learning Methods for Recovering 3D Human Pose from Monocular Images.” English version. Institut National de Recherche en Informatique et en Automatique. Project LEAR – Learning and Recognition in Vision. 2004. Web.
Agarwal A and Triggs B (d). “A Local Basis Representation for Estimating Human Pose from Cluttered Images.” 7th Asian Conference on Computer Vision. International Institute of Information Technology. India, Hyderabad. 2006.
Bartlett P L and Traskin M. “AdaBoost is Consistent.” Journal of Machine Learning Research vol 8 2007. p. 2347-2368.
Blake R and Shiffrar M. “Perception of human motion.” Annu. Rev. Psychol. vol 58 2007. p. 47-73.
Brain L and Walton J N, et al. Brain’s Diseases Of The Nervous System. Oxford: Oxford University Press, 1969.
Cohen I and Hongxia L I. “Inference of Human Postures by Classification of 3D Human Body Shape.” IEEE International Workshop on Analysis and Modeling of Faces and Gestures. IEEE Computer Society. France, Nice. 2003.
Demirdjian D. “Enforcing the constraints for Human Body Tracking.” MIT CSAIL Vision Research. 2003. MIT Computer Science and Artificial Intelligence Laboratory. Web.
Guo, F and Qian G. “Monocular 3D Tracking of Articulated Human Motion in Silhouette and Pose Manifolds.” Arizona State University. 2005. Department of Electrical Engineering. Web.
Kakadiaris I and Metaxas D. “Model-Based Estimation for 3D Human Motion.” IEEE Transactions of Pattern Analysis and Machine Intelligence vol 22(12) 2000. p. 1453-1459.
Lee M W and Cohen I (a). “A Model-Based Approach for Estimating Human 3D Poses in Static Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence vol 28(6) 2006. p. 905-916.
Lee M W and Cohen I (b). “Human Upper Body Pose Estimation in Static Images.” University of California. 2004. School of Engineering. Web.
Rosales R, Athitsos V, Sigal L and Sclaroff S. “3D Hand Pose Reconstruction Using Specialized Mappings.” IEEE International Conf. on Computer Vision (ICCV). IEEE Computer Society. Canada, Vancouver. 2001.
Sminchisescu C, Kanaujia A and Metaxas D. “Learning Joint Top-Down and Bottom-up Processes for 3D Visual Inference.” IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society. USA, New York. 2006.
Troje N F. “Decomposing biological motion: A framework for analysis and synthesis of human gait patterns.” Journal of Vision vol 2 2002. p. 371-387.
Welch G and Foxlin E. “Motion Tracking: No Silver Bullet, but a Respectable Arsenal.” IEEE Computer Graphics and Applications vol 23(1) 2002. p. 24-38.
Zhang Z and Troje N F. “3D Periodic Human Motion Reconstruction from 2D Motion Sequences.” The Biomotion Lab. 2004. Queen’s University. Web.