Conf. on Image Proc., Comp. Vision, and P. R. | IPCV'07

Study on Human Behaviour Retrieval

Yan CHEN, Faculty of Information Technology, University of Technology, Sydney, Sydney, NSW, Australia
Qiang WU, Faculty of Information Technology, University of Technology, Sydney, Sydney, NSW, Australia
Xiangjian HE, Faculty of Information Technology, University of Technology, Sydney, Sydney, NSW, Australia

Abstract

Human behavior analysis is a hot topic in computer vision and is widely applied. Human behavior retrieval is another frontier technology in the area of multimedia information retrieval; it is related to human behavior analysis but differs from it because of its special application purpose. Although human behaviour retrieval is to some extent similar to human behaviour analysis, the technology used for human behavior analysis cannot be used for human behavior retrieval directly. This paper addresses these differences and reviews several technologies including video retrieval, feature extraction, similarity measurement and human behavior analysis. It also addresses the importance of human behaviour retrieval. The ideas presented in this paper will benefit the research community and indicate a direction for human behavior retrieval research.

Keywords: Behaviour retrieval, Behaviour analysis

1 Introduction

Human behavior analysis is receiving increasing attention in computer vision and has been applied in many areas, such as athletic performance analysis and surveillance. Many such systems exist. For example, the real-time system W4 [1] detects and tracks groups of people and monitors their behaviors, even in the presence of occlusion and in outdoor environments. In [2], Zhu et al. proposed a system that recognizes the actions of tennis players in order to improve the players' performance. The metric currently used to evaluate tools for human behavior analysis is the recognition rate: the higher the recognition rate, the better the tool is regarded to be. However, this metric alone is not enough for human behavior retrieval, which needs another significant metric called the recall rate.

With the development of digital libraries, retrieving an image or a video clip from a video database becomes more and more difficult. Traditionally, keywords are used as text labels for quickly accessing large quantities of visual data. Search engine giants such as Google and Yahoo also use meta-data as keywords for image/video retrieval. Representing visual data with text labels requires a large amount of manual work, which is inefficient and time-consuming. It is therefore important to interpret images and video automatically in order to save this human effort. Content-Based Image/Video Retrieval (CBIR/CBVR) is a solution to the above problems. CBIR/CBVR, which retrieves targets based on visual information such as color, texture and shape, has been studied in previous years [3, 4]. It has found many applications in medical science, art galleries and the military.

Human behavior retrieval [3] is a new research topic in the image retrieval area. Existing approaches require domain knowledge and information about other objects. For example, for behavior retrieval in a tennis game, information about the tennis ball is often used. Furthermore, current methods can only be applied to specific areas [3]. Although human behavior retrieval is similar to human behavior analysis in some aspects, the methods used for human behavior analysis cannot be directly applied to human behavior retrieval.

The remaining sections are organized as follows. Existing work on general video retrieval is reviewed in Section 2. Content-based feature extraction, similarity measurement and preliminary research on human behavior retrieval are reviewed in Section 3. Section 4 indicates possible future work related to human behavior retrieval. Section 5 concludes the paper.

2 Recent Research Development on Video Retrieval and Annotation

Video is rapidly becoming the most popular medium due to its high information and entertainment power. Applications that benefit from video include education and training, marketing support, entertainment and sports. A straightforward approach to video retrieval is to represent visual content in textual form (e.g. keywords). These keywords serve as indices to access the associated visual data and can be obtained when subtitles or transcripts exist. The keyword approach has the advantage that a visual database can be accessed using a standard query language such as SQL. However, it sometimes needs a lot of extra manual processing. Therefore, there has been a new focus on developing content-based video retrieval/annotation systems.

Content-based video retrieval is regarded as an extension of content-based image retrieval. Compared with image retrieval, video retrieval differs in several factors, which are primarily related to the temporal information available from a video document. While these factors may complicate the querying system, they may help in characterizing useful information for querying. The temporal information first of all induces the concept of motion of the objects present in the document.

Generally, content-based video retrieval includes three parts: segmentation, indexing and query processing. Segmentation divides the video into shots or scenes, and selects one or more key frames for each shot. A shot is a set of contiguous frames all acquired through a continuous camera recording. The partitioning of the video into shots generally does not refer to any semantic analysis. Only the temporal information is used.
Video shot cut detection involves identifying the frames where a transition takes place from one shot to another. In the next step, features are extracted from the key frames supplied by the segmentation process and used to create a database index. When a query arrives, segmentation and key frame extraction are performed on it. Then, the necessary features are extracted from the key frames according to the query. A general video retrieval flow is shown in Fig. 1.

Most general video retrieval systems are based on low-level features such as color, texture, shape and motion information. Although they can be applied to a wide variety of generic video, these visual features share a common drawback: they can represent only low-level information. Some specific video retrieval approaches, such as those used for improving sports performance [3], combine domain knowledge with low-level features. This type of retrieval, which uses the technology of human behavior analysis, achieves high accuracy, but its drawback is that it can be used only in a specific area and requires a large amount of domain knowledge. The aim of video retrieval, however, is not only high accuracy but also fast speed and high efficiency, achieved by dramatically reducing the amount of unrelated video data.

3 Content-Based Feature Extraction

Image representation is the first step for image and video retrieval. Most image databases have been preprocessed to obtain image features such as color, texture and shape for retrieval. Which features are used for image/video retrieval depends on the application. Because feature extraction is an important step in image and video retrieval, we review work on the extraction of color, texture and shape below.

3.1 Color

Color is perhaps the most dominant and distinguishing visual feature and is one of the most widely used visual features in retrieval. The color histogram is the most commonly used color descriptor in content-based retrieval research. Color histograms are used in [5-8].
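As a minimal sketch of how a color histogram descriptor can be built (an illustration only, not the method of any cited paper), the following Python function quantizes each RGB channel into a fixed number of levels and counts the pixels falling into each quantized color. The function name `color_histogram`, the choice of 4 levels per channel and the normalization to unit sum are assumptions made for this example.

```python
def color_histogram(pixels, levels=4):
    """Normalized color histogram of an iterable of (r, g, b) pixels in 0..255.

    Each channel is quantized into `levels` equal-width intervals, giving
    levels**3 bins. Bin count and normalization are illustrative choices.
    """
    hist = [0.0] * (levels ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each 8-bit channel value to a quantization level 0..levels-1.
        qr = (r * levels) // 256
        qg = (g * levels) // 256
        qb = (b * levels) // 256
        hist[(qr * levels + qg) * levels + qb] += 1.0
        count += 1
    if count:
        hist = [h / count for h in hist]  # bins now sum to 1
    return hist


# A tiny "image": three reddish pixels and one blue pixel.
image = [(250, 10, 10), (240, 5, 20), (255, 0, 0), (0, 0, 255)]
hist = color_histogram(image)
```

Because only counts are kept, reordering the pixels of `image` leaves `hist` unchanged, so a descriptor of this kind carries no information about where in the image each color occurs.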
Swain and Ballard proposed histogram intersection as the similarity measure for color histograms in [6]. In [5], the authors proposed Gaussian Mixture Vector Quantization (GMVQ) as a quantization method for color histogram generation, and their experimental results showed better retrieval performance for color images than conventional color histogram methods. The color histogram is easy to implement, but it does not take the spatial distribution of color into account. In [9], Deng et al. proposed a compact color descriptor that can be indexed in a 3D color space. The descriptor consists of the representative colors and their percentages in the region. The authors claimed that this descriptor gives more efficient indexing because of its low dimension compared with traditional color histograms.

Besides color histograms, color sets [10] and color moments [11] have also been applied to represent and retrieve images. Smith and Chang proposed the color set in [10], defined as a selection of colors from a quantized color space. Color sets allow fast indexing and search because of their low dimension. Stricker and Orengo [11] used the first three moments to represent colors. One drawback of the color moment descriptor is that the average of all colors can be quite different from any of the original colors. Hence, as with traditional color histograms, it is impossible to recover the actual colors in the image from a color moment descriptor.

3.2 Texture

Texture is defined as a local arrangement of image irradiances projected from a surface patch of perceptually homogeneous irradiances [12]. It is often used for medical images and satellite images. Tuceryan and Jain [13] identified five major categories of features for texture that
are statistical features, geometrical features, structural features, model-based features and features from signal processing. He and Wang [14] proposed the texture spectrum, a statistical way to describe texture features. The basic idea is that a texture image can be represented as a set of essential small units, called texture units. The statistics of all texture units over the entire image reveal a global texture feature.

The wavelet transform, as a signal processing tool, has been studied by many researchers for representing texture [15-18]. Chang and Kuo [18] proposed a tree-structured wavelet transform to improve classification accuracy. In [15], Smith and Chang used statistics extracted from wavelet subbands as a texture representation. In [16], texture features were modeled by the marginal distribution of wavelet coefficients using generalized Gaussian distributions. Thyagarajan, Nguyen and Persons combined the wavelet transform with a co-occurrence matrix to take advantage of both statistics-based and transform-based texture analysis [17].

3.3 Shape

Shape is a key attribute of an image. Shape can be used to describe an object's position, orientation and size. Therefore, a shape representation is required to be invariant to translation, rotation and scale. In general, shape representations can be divided into two categories, boundary-based and region-based. The former uses only the outer boundary of the shape, while the latter uses the entire shape region. Fourier descriptors and moments are used in both categories.

The most successful representation for shape is the Fourier Descriptor (FD). The FD is simple to derive. Coarse (global) shape features are captured by the lower-order coefficients of the FD, and finer shape features by the higher-order coefficients. Furthermore, the FD is robust to noise because noise mostly falls in the very high frequencies, which are truncated in the FD.
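The following Python sketch illustrates one common way to derive a boundary-based Fourier descriptor; it is not taken from any of the cited papers. The boundary points are treated as complex numbers, the zero-frequency coefficient is dropped to remove translation, the magnitudes are divided by the magnitude of the first harmonic to remove scale, and only magnitudes are kept so that rotation and the choice of starting point have no effect. The names `fourier_descriptor` and `flower`, the number of retained coefficients, and the assumption that the contour is traversed counter-clockwise (so that the first harmonic is nonzero) are all choices made for this example.

```python
import cmath
import math


def fourier_descriptor(boundary, n_coeffs=4):
    """Magnitude-based Fourier descriptor of a closed boundary.

    `boundary` is a list of (x, y) points sampled along the contour.
    Returns |F(u)| / |F(1)| for u = -n_coeffs..-1 and u = 2..n_coeffs+1,
    i.e. only low-order coefficients; higher orders are truncated.
    """
    n = len(boundary)
    z = [complex(x, y) for x, y in boundary]

    def dft(u):
        # Plain O(n^2) discrete Fourier transform coefficient F(u).
        return sum(z[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n)) / n

    # F(0) is the centroid; leaving it out removes the effect of translation.
    f1 = abs(dft(1))  # assumed nonzero (counter-clockwise traversal)
    orders = list(range(-n_coeffs, 0)) + list(range(2, n_coeffs + 2))
    # Dividing by |F(1)| removes scale; magnitudes ignore rotation and start point.
    return [abs(dft(u)) / f1 for u in orders]


def flower(n=32, scale=1.0, angle=0.0, shift=(0.0, 0.0), start=0):
    """Hypothetical test contour: a three-lobed closed curve, counter-clockwise."""
    pts = []
    for k in range(n):
        t = 2 * math.pi * ((k + start) % n) / n
        r = scale * (1 + 0.3 * math.cos(3 * t))
        pts.append((r * math.cos(t + angle) + shift[0],
                    r * math.sin(t + angle) + shift[1]))
    return pts


a = fourier_descriptor(flower())
b = fourier_descriptor(flower(scale=2.5, angle=0.7, shift=(10.0, -3.0), start=5))
```

Here `a` and `b` agree up to floating-point rounding, although the second contour is a scaled, rotated, translated copy of the first with a different starting point.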
The FD's characteristics make it a popular shape descriptor. However, the FD cannot capture interior shape content, which is sometimes important for shape description. For example, if two images have the same contour but different interiors, the FD cannot discriminate between them. A 2-D Fourier transform in polar coordinates was employed for shape description in [19], which performed better than the 1-D Fourier transform.

Chuang and Kuo [20] used a wavelet transform to describe shape information, from which a wavelet descriptor was proposed. One advantage of the wavelet descriptor is that it provides global features at the coarser resolution levels and more detailed local features at the finer resolution levels. However, the wavelet descriptor is not invariant to the starting point, because its transform coefficients differ for different starting points.

Moments are also popular for describing shape. Various types of moments have been used for moment-based shape classification [21-23]. Hu introduced seven moments in [21], which are invariant to translation, rotation and scaling. The advantages of moment descriptors are that they are easy to implement and that moments of different orders are independent. However, it is difficult to associate higher-order moments with a physical interpretation, which makes moment descriptors hard to understand, and they are sensitive to even small geometric and photometric distortions [24].

Belongie, Malik and Puzicha introduced a new shape descriptor called shape context in [25]. Shape context can be considered as a 3-D histogram of edge point location and distribution. It is robust to a number of transformations including translation, scaling and rotation. Its drawback is that it relies heavily on its sample points: even a small error in the sample points can make the whole context incorrect.
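Returning to the moment invariants of [21] discussed above, the Python sketch below computes the first two of Hu's seven invariants for a binary shape given as a list of pixel coordinates. It uses the standard normalized central moments; the function name `hu_first_two` and the L-shaped test region are assumptions made for illustration, and the remaining five invariants, which involve third-order moments, are omitted for brevity.

```python
def hu_first_two(points):
    """First two Hu moment invariants of a binary shape given as (x, y) pixels."""
    m00 = float(len(points))
    cx = sum(x for x, _ in points) / m00
    cy = sum(y for _, y in points) / m00

    def eta(p, q):
        # Central moment mu_pq (translation invariant), divided by
        # m00 ** (1 + (p + q) / 2) to normalize for scale.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
        return mu / m00 ** (1 + (p + q) / 2.0)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    phi1 = e20 + e02
    phi2 = (e20 - e02) ** 2 + 4 * e11 ** 2
    return phi1, phi2


# An L-shaped region, and the same region rotated by 90 degrees and translated.
shape = [(x, y) for x in range(6) for y in range(2)] + [(0, y) for y in range(2, 5)]
moved = [(-y + 7, x - 4) for x, y in shape]

original_invariants = hu_first_two(shape)
moved_invariants = hu_first_two(moved)
```

The two results agree up to floating-point rounding. On a pixel grid the invariance is exact for translations and 90-degree rotations; for arbitrary rotations and scalings of a rasterized shape it holds only approximately.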
The challenge for a shape-based CBIR system is that the shape features need very accurate image segmentation to detect the object or region boundary. Reliable segmentation is critical; without it, shape representation is meaningless.

Color, texture and shape are all low-level features of images. More and more research now focuses on high-level features and tries to obtain the semantic features of images from low-level features. A semantics-sensitive approach to content-based image retrieval has been proposed in [26, 27]. A semantic categorization (e.g. graph, photograph, indoor, outdoor) for appropriate feature extraction has been used in [26]. The advantage of semantic categorization is that it improves retrieval accuracy and reduces retrieval time. However, the statement in [26] that images with similar semantics share similar visual features is not always true. Semantic categorization can also be used for human posture retrieval, because human posture is a kind of high-level feature. Human behaviour retrieval will use some of the technology of content-based retrieval, but little work has been done on this specific topic.

3.4 Similarity Measurement

Although the measurement of feature similarity is application oriented, similarity measurement based on statistical analysis has been dominant in content-based retrieval. Measuring the distance between histograms has been an active research stream for content-based retrieval when histograms are used for image representation. In content-based retrieval, histograms have mostly been used in conjunction with color features, but nothing prevents their use with texture or shape properties. Swain and Ballard [6] used the intersection distance

\[ D(H(I), H(Q)) = \sum_{j=1}^{n} \min\bigl(H_j(I), H_j(Q)\bigr) \tag{1} \]

where H(I) and H(Q) are two histograms containing n bins each. In [28], a different approach was proposed.
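Before turning to the approach of [28], the intersection measure of [6] can be sketched in a few lines of Python. The function name `histogram_intersection` and the two four-bin example histograms are assumptions made for illustration; the histograms are taken to be normalized so that their bins sum to one.

```python
def histogram_intersection(h_i, h_q):
    """Sum over all bins of the smaller of the two bin values."""
    if len(h_i) != len(h_q):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h_i, h_q))


# Two normalized four-bin histograms (made-up values).
h_image = [0.5, 0.3, 0.2, 0.0]
h_query = [0.4, 0.4, 0.1, 0.1]
score = histogram_intersection(h_image, h_query)
```

For these inputs `score` is 0.4 + 0.3 + 0.1 + 0.0 = 0.8, up to floating-point rounding. For normalized histograms the value lies between 0 and 1 and grows as the histograms become more alike, so although it is called a distance in the text it behaves as a similarity score.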
The distance between two histograms was defined in vector form as

\[ D_{hist} = (H(I) - H(Q))^{T} A \, (H(I) - H(Q)) \tag{2} \]

where D_{hist} is the distance between the histograms of two images, H(I) and H(Q) are histograms of the color vectors with K components, and A is a K × K similarity matrix. This measurement considers the similarity between values in the feature space. Other commonly used distance functions for color histograms include the Minkowski distance:

\[ D(H(I), H(Q)) = \Bigl[ \sum_{j=1}^{n} \bigl| H_j(I) - H_j(Q) \bigr|^{r} \Bigr]^{1/r} \tag{3} \]

Measures such as (1) and (3) compare histograms bin by bin and do not take into account the similarity between different but related bins of a histogram. In [29], color images were represented in the HSV color space and the moments of each channel were computed separately, resulting in nine parameters: three moments for each of the three color channels. The distance of color layout was also used as a color similarity measure. For color layout, a predefined grid color layout was used as a sample.

For shape comparison, matching is based on transforms, moments, deformation, scale, etc. Because an image can be represented by wavelet or Fourier functions, it can be processed through its coefficients. By truncating the coefficients below a threshold, images can be sparsely represented at the cost of losing some detail. The set of remaining coefficients can be used as a feature vector for matching. One drawback of this approach is that it depends heavily on clear image segmentation. Moments, especially the low-order moments, are often used for image retrieval. Scale space matching is based on progressively simplifying the contour through smoothing [30]. By comparing the signatures of annihilated curvature zero crossings, two shapes are matched in a scale- and rotation-invariant way.

Salient features are used to capture the information in an image in a limited number of salient points. Similarity between images can then be checked in several different ways. One method is to store all salient points from one image in a histogram on the basis of a few characteristics; similarity is then based on the presence of enough group-wise similar points [31]. A second method concentrates only on the spatial relationships among the sets of salient points. In point-by-point methods for shape comparison, as in [32], maximum curvature points on the contour and the lengths between them were used to characterize the object.

Similarity at the semantic level can be found in [33]. A knowledge-based type abstraction hierarchy was used to access image data based on context and a user profile generated automatically from cluster analysis of the databases. In some cases, weights were used when features were fused together to calculate the similarity. The selection of the weights for the features affects the overall retrieval results.

3.5 Human Behavior Analysis

More and more effort has been put into research on human behavior analysis due to its promising applications in many areas such as visual surveillance, content-based image storage and retrieval, video conferencing, athletic performance analysis and virtual reality. A general framework for human behavior analysis is shown in Fig. 2 [34].

Fig. 2 A general framework for human behavior analysis

Almost all methods for vision-based human behavior analysis start with human detection. Human detection aims at segmenting the regions corresponding to people from the rest of an image. It is a significant issue in a human behavior analysis system, since subsequent processes such as tracking and behavior understanding depend greatly on it. This process usually involves motion segmentation and object classification. After human detection, tracking is applied. Tracking can be considered as establishing coherent relations of image features between frames with respect to position, velocity, shape, texture, color, etc.
After moving humans have been successfully tracked from one frame to another in an image sequence, the next step is to understand the behavior. Behavior understanding analyzes and recognizes human motion patterns and produces high-level descriptions of actions and interactions. How to decide that a sequence of actions constitutes one behavior is a problem that needs to be solved for human behavior understanding. General approaches to behavior understanding are template matching [35] and state-space methods [36]. For example, the HMM is an approach used for human behavior understanding, but it is not suitable for behavior retrieval because it requires models built beforehand, and the models may be sensitive to noise.

Using natural language to describe human behavior has received considerable attention. Its purpose is to choose a reasonable group of words or short expressions to represent the behaviors. Kojima et al. [37] proposed a method to generate natural language descriptions of human behaviors appearing in real video sequences.

Human behavior retrieval has been researched in some specific areas, especially sports [38, 39]. Current human behavior retrieval techniques use information about other objects to help determine human behavior. For example, the tennis ball's position, the lines and net of the tennis court, and the human body are all needed to determine a player's action. The advantage of this method is that it can recognize human behavior accurately, but it requires more computation time and plentiful domain-specific knowledge.

4 Future Work on Human Behaviour Retrieval

Although human behaviour retrieval is to some extent similar to human behaviour analysis, the technology used in human behaviour analysis cannot be transferred to human behaviour retrieval directly, because performance is evaluated differently in the two areas. The metric used for human behaviour analysis is recognition rate.
The higher the recognition rate, the better the performance. For human behaviour retrieval, however, the metrics should include not only the recognition rate but also the recall rate and the retrieval speed, and a tradeoff must be found among these three metrics. Efficiently locating the video shots that contain the human behaviour of interest, based on given query data, is the major aim of human behaviour retrieval.

The state-space methods of [36] can be used for human behaviour retrieval if a way can be found to make them resistant to noise and to reduce their computation time. Allowing each state to overlap with others to some degree may help to solve this problem, but the degree to which states can overlap should be carefully examined. The benefit of the state-space method is that it can easily be combined with human behaviour analysis.

Since human behavior is composed of consecutive postures, it is possible to identify a behavior by its consecutive postures. Using postures for behavior retrieval has several benefits. The first is that it is not restricted to a specific area and can be used in general settings. For example, it can be used to find suspicious human behavior in an airport lobby. The second benefit is that it can find the targets very quickly. When postures are used to retrieve human behavior, there is no guarantee that all retrieved videos are target videos; however, in return for sacrificing some accuracy, a large amount of processing time is saved. The third benefit is that postures have semantic meanings, which are more understandable than low-level features.

5 Conclusions

This paper has given a brief review of current techniques for human behavior analysis, video annotation, feature extraction and similarity measurement. All of these techniques are related to human behavior retrieval. This paper has also addressed possible future work on human behavior retrieval.

6 References

[1] I. Haritaoglu, D. Harwood, and L. S.
Davis, "W4: real-time surveillance of people and their activities," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 22, pp. 809-830, 2000.
[2] G. Zhu, C. Xu, Q. Huang, and W. Gao, "Action Recognition in Broadcast Tennis Video," in Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, 2006, pp. 251-254.
[3] I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, "The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments," Image Processing, IEEE Transactions on, vol. 9, pp. 20-37, 2000.
[4] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," International Journal of Computer Vision, vol. 18, pp. 233-254, 1996.
[5] S. Jeong, C. S. Won, and R. M. Gray, "Image retrieval using color histograms generated by Gauss mixture vector quantization," Computer Vision and Image Understanding, vol. 94, pp. 44-66, 2004.
[6] M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.
[7] M. Ortega, Y. Rui, K. Chakrabarti, K. Porkaew, S. Mehrotra, and T. S. Huang, "Supporting ranked Boolean similarity queries in MARS," Knowledge and Data Engineering, IEEE Transactions on, vol. 10, pp. 905-925, 1998.
[8] W. Jia, H. Zhang, X. He, and Q. Wu, "A Comparison on Histogram Based Image Matching Methods," 2006, pp. 97-97.
[9] Y. Deng, B. S. Manjunath, C. Kenney, M. S. Moore, and H. Shin, "An efficient color representation for image retrieval," Image Processing, IEEE Transactions on, vol. 10, pp. 140-147, 2001.
[10] J. R. Smith and S. F. Chang, "Single color extraction and image query," 1995, pp. 528-531 vol. 3.
[11] M. Stricker and M. Orengo, "Similarity of Color Images," in SPIE Storage and Retrieval for Image and Video vol. 2420, 1995, pp. 381-392.
[12] A. C. Bovik, M. Clark, and W. S. Geisler, "Multichannel texture analysis using localized spatial filters," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 12, pp. 55-73, 1990.
[13] M. Tuceryan and A. K. Jain, "Texture Analysis," in The Handbook of Pattern Recognition and Computer Vision (2nd Edition), C. H. Chen, L. F. Pau, and P. S. P. Wang, Eds.: World Scientific Publishing Co., 1998, pp. 207-248.
[14] D.-C. He and L. Wang, "Texture Unit, Texture Spectrum, And Texture Analysis," Geoscience and Remote Sensing, IEEE Transactions on, vol. 28, pp. 509-512, 1990.
[15] J. R. Smith and S.-F. Chang, "Automated binary texture feature sets for image retrieval," 1996, pp. 2239-2242 vol. 4.
[16] M. N. Do and M. Vetterli, "Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance," Image Processing, IEEE Transactions on, vol. 11, pp. 146-158, 2002.
[17] K. S. Thyagarajan, T. Nguyen, and C. E. Persons, "A maximum likelihood approach to texture classification using wavelet transform," 1994, pp. 640-644 vol. 2.
[18] T. Chang and C. C. J. Kuo, "Texture analysis and classification with tree-structured wavelet transform," Image Processing, IEEE Transactions on, vol. 2, pp. 429-441, 1993.
[19] D. Zhang and G. Lu, "Generic Fourier descriptor for shape-based image retrieval," 2002, pp. 425-428 vol. 1.
[20] G. C. H. Chuang and C. C. J. Kuo, "Wavelet descriptor of planar curves: theory and applications," Image Processing, IEEE Transactions on, vol. 5, pp. 56-70, 1996.
[21] M.-K. Hu, "Visual pattern recognition by moment invariants," Information Theory, IEEE Transactions on, vol. 8, pp. 179-187, 1962.
[22] H.-K. Kim and J.-D. Kim, "Region-based shape descriptor invariant to rotation, scale and translation," Signal Processing: Image Communication, vol. 16, pp. 87-93, 2000.
[23] A. Khotanzad and Y. H. Hong, "Invariant image recognition by Zernike moments," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 12, pp. 489-497, 1990.
[24] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, pp. 1615-1630, 2005.
[25] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 509-522, 2002.
[26] J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: semantics-sensitive integrated matching for picture libraries," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, pp. 947-963, 2001.
[27] F. Jing, M. Li, H.-J. Zhang, and B. Zhang, "An efficient and effective region-based image retrieval framework," Image Processing, IEEE Transactions on, vol. 13, pp. 699-709, 2004.
[28] J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and W. Niblack, "Efficient color histogram indexing for quadratic form distance functions," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 17, pp. 729-736, 1995.
[29] M. Stricker and M. Orengo, "Similarity of Color Images," Storage and Retrieval for Image and Video vol. 2420, pp. 381-392, 1995.
[30] F. Mokhtarian, "Silhouette-based isolated object recognition through curvature scale space," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 17, pp. 539-544, 1995.
[31] T. Gevers and A. W. M. Smeulders, "PicToSeek: combining color and shape invariant features for image retrieval," Image Processing, IEEE Transactions on, vol. 9, pp. 102-119, 2000.
[32] L. Jia and L. Kitchen, "Object-based image similarity computation using inductive learning of contour-segment relations," Image Processing, IEEE Transactions on, vol. 9, pp. 80-87, 2000.
[33] C.-C. Hsu, W. W. Chu, and R. K. Taira, "A knowledge-based approach for retrieving images by content," Knowledge and Data Engineering, IEEE Transactions on, vol. 8, pp. 522-532, 1996.
[34] L. Wang, W. Hu, and T. Tan, "Recent Developments in Human Motion Analysis," Pattern Recognition, vol. 36, pp. 585-601, 2003.
[35] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, pp. 257-267, 2001.
[36] A. Galata, N. Johnson, and D. Hogg, "Learning Variable Length Markov Models of Behaviour," Computer Vision and Image Understanding, vol. 81, pp. 398-413, 2001.
[37] A. Kojima, M. Izumi, T. Tamura, and K. Fukunaga, "Generating natural language description of human behavior from video images," 2000, pp. 728-731 vol. 4.
[38] H. Miyamori and S.-I. Iisaku, "Video annotation for content-based retrieval using human behavior analysis and domain knowledge," 2000, pp. 320-325.
[39] G. Sudhir, J. C. M. Lee, and A. K. Jain, "Automatic classification of tennis video for high-level content-based retrieval," 1998, pp. 81-90.