We present a novel probabilistic framework that jointly models individuals and groups for tracking. Managing groups is challenging, primarily because their nonlinear dynamics and complex layout lead to repeated splitting and merging events. The proposed approach assumes a tight relation of mutual support between the modeling of individuals and groups, promoting the idea that groups are better modeled if individuals are considered, and vice versa. This concept is translated into a mathematical model using a decentralized particle filtering framework that deals with a joint individual-group state space. The model factorizes the joint space into two dependent subspaces, where individuals and groups share knowledge of the joint individual-group distribution. The assignment of people to the different groups (and thus group initialization, splitting, and merging) is implemented by two alternative strategies: using classifiers trained beforehand on statistics of group configurations, or through online learning of a Dirichlet process mixture model, assuming that no training data is available before tracking. These strategies lead to two different methods that can be used on top of any person detector (simulated using the ground truth in our experiments). We provide convincing results on two recent challenging tracking benchmarks.
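The filtering loop underlying such trackers can be illustrated with a minimal bootstrap particle filter. This sketch is a generic 1-D predict/weight/resample cycle, not the paper's decentralized joint individual-group formulation; the dynamics and observation models (Gaussian, with assumed noise levels) are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pf(observations, n_particles=500, proc_std=0.5, obs_std=1.0):
    """Minimal 1-D bootstrap particle filter: predict, weight, resample.
    Stand-in illustration only; the paper factorizes a joint
    individual-group state space across two dependent subspaces."""
    particles = rng.normal(0.0, 1.0, n_particles)
    estimates = []
    for z in observations:
        # Predict: propagate particles with the assumed process noise.
        particles = particles + rng.normal(0.0, proc_std, n_particles)
        # Update: weight particles by a Gaussian observation likelihood.
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # Resample (multinomial) to avoid weight degeneracy.
        idx = rng.choice(n_particles, n_particles, p=w)
        particles = particles[idx]
    return estimates

# Track a target drifting at +0.3 per step under noisy observations.
truth = np.cumsum(np.full(30, 0.3))
obs = truth + rng.normal(0.0, 1.0, 30)
est = bootstrap_pf(obs)
```

In the actual framework, each of the two subspace filters would additionally condition its proposal on the other subspace's current estimate of the joint distribution.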
Bag of Visual Words (BoV) is one of the most successful strategies for object recognition, used to represent an image as a vector of counts over a learned vocabulary.
This strategy assumes that the representation is built using patches that are either densely extracted or sampled from the images using feature detectors.
However, the dense strategy also captures noisy background information, whereas the feature detection strategy can lose important parts of the objects.
In this paper we propose a solution in between these two strategies: patches are densely extracted from the image and weighted according to their saliency.
Intuitively, highly salient patches play an important role in describing an object, while those with low saliency are still retained, with low emphasis, instead of being discarded.
We embed this idea in the word encoding mechanism adopted by BoV approaches. The technique is successfully applied to vector quantization and the Fisher vector, on Caltech-101 and Caltech-256.
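For the vector quantization case, the idea of weighting dense patches by saliency amounts to accumulating per-patch saliency instead of raw counts in the BoV histogram. The following is a minimal sketch under that reading; the saliency estimator, vocabulary, and dimensions are hypothetical placeholders.

```python
import numpy as np

def saliency_weighted_bov(descriptors, saliency, vocabulary):
    """Saliency-weighted BoV histogram (sketch): each densely extracted
    patch votes for its nearest visual word with weight equal to its
    saliency, so salient patches count more and low-saliency patches
    are kept with low emphasis rather than discarded."""
    # Hard-assign each descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    # Accumulate saliency weights instead of raw counts.
    hist = np.zeros(len(vocabulary))
    np.add.at(hist, words, saliency)
    # L1-normalize so images with different patch counts are comparable.
    return hist / max(hist.sum(), 1e-12)

rng = np.random.default_rng(1)
desc = rng.normal(size=(100, 8))   # 100 dense patches, 8-D descriptors
sal = rng.uniform(0.1, 1.0, 100)   # per-patch saliency in (0, 1]
vocab = rng.normal(size=(16, 8))   # 16-word learned vocabulary
h = saliency_weighted_bov(desc, sal, vocab)
```

Setting all saliency weights to 1 recovers the standard count-based BoV histogram, which makes the weighting a strict generalization of plain vector quantization.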
We present a statistical behavioural biometric approach for recognizing people by their aesthetic preferences, using colour images.
In the enrollment phase, a model is learnt for each user, using a training set of preferred images. In the recognition/authentication phase, this model is tested with an unseen set of pictures preferred by a probe subject. The approach is dubbed "pump and distill", since the training set of each user is pumped by bagging, producing a set of image ensembles. In the distill step, each ensemble is reduced into a set of surrogates, that is, aggregates of images sharing similar visual content. Finally, LASSO regression is performed on these surrogates; the resulting regressor, employed as a classifier, takes test images belonging to a single user and predicts their identity. The approach improves the state of the art on recognition and authentication tasks on average, on a dataset of 40000 Flickr images and 200 users. In practice, given a pool of 20 preferred images of a user, the approach recognizes their identity with an accuracy of 92% and achieves an authentication accuracy of 91%, in terms of the normalized Area Under the Curve of the CMC and ROC curves, respectively.
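Using a sparse linear regressor as a one-vs-rest classifier can be sketched as follows. This is a generic stand-in, not the paper's pipeline: it omits the bagging and surrogate-distillation steps, uses a plain coordinate-descent LASSO on synthetic features, and all data and parameters are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam=0.05, n_iter=200):
    """Plain coordinate-descent LASSO (least squares + L1 penalty),
    a generic stand-in for the regressor fit per user."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            z = (X[:, j] ** 2).sum() / n
            # Soft-thresholding step enforces sparsity in w.
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return w

rng = np.random.default_rng(2)
# Two synthetic "users", each with a characteristic feature direction.
X1 = rng.normal(size=(30, 10)) + np.eye(10)[0] * 3
X2 = rng.normal(size=(30, 10)) + np.eye(10)[1] * 3
X = np.vstack([X1, X2])
labels = np.array([0] * 30 + [1] * 30)

# One-vs-rest: fit one LASSO regressor per user with +1/-1 targets.
models = [lasso_cd(X, np.where(labels == k, 1.0, -1.0)) for k in (0, 1)]

# Classify a probe set of 20 images from user 0: average the regressor
# scores over the pool and pick the user with the highest mean score.
probe = rng.normal(size=(20, 10)) + np.eye(10)[0] * 3
scores = [float((probe @ w).mean()) for w in models]
pred = int(np.argmax(scores))
```

Averaging scores over the whole probe pool, rather than classifying images one by one, mirrors the pool-of-20-images protocol described in the abstract.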
Automated surveillance of human activities has traditionally been a Computer Vision field interested in the recognition of motion patterns and in the production of high-level descriptions for actions and interactions among entities of interest (Cedras and Shah, 1995; Aggarwal and Cai, 1999; Gavrila, 1999; Moeslund et al., 2006; Buxton, 2003; Hu et al., 2004; Turaga et al., 2008; Dee and Velastin, 2008; Aggarwal and Ryoo, 2011; Borges et al., 2013). In the last five years, the study of human activities has been revitalized by addressing the so-called social signals (Pentland, 2007). These nonverbal cues, inspired by the social, affective, and psychological literature (Vinciarelli et al., 2009b), have allowed a more principled understanding of how humans act and react to other people and to their environment. Social Signal Processing (SSP) is the scientific field that makes a systematic, algorithmic, and computational analysis of social signals, drawing significant concepts from anthropology and social psychology (Vinciarelli et al., 2009b). In particular, SSP does not stop at modeling human activities, but aims at coding and decoding human behavior. In other words, it focuses on unveiling the underlying hidden states that drive one to act in a distinct way, with particular actions. This challenge is supported by decades of investigation in the human sciences (psychology, anthropology, sociology, etc.) showing that humans use nonverbal behavioral cues like facial expressions, vocalizations (laughter, fillers, back-channel, etc.), gestures, or postures to convey, often outside conscious awareness, their attitude towards other people and social environments, as well as emotions (Richmond and McCroskey, 1995). The understanding of these cues is thus paramount in order to understand the social meaning of human activities.
The formal marriage of automated video surveillance with Social Signal Processing had its programmatic start during SISM 2010, the International Workshop on Socially Intelligent Surveillance and Monitoring, associated with the IEEE Computer Vision and Pattern Recognition conference. In that venue, the discussion focused on what kind of social signals can be captured in a generic surveillance scenario, then detailed the specific scenarios where the modeling of social aspects could be most beneficial. After 2010, SSP hybridizations with surveillance applications have grown rapidly in number, and systematic essays on the topic started to appear in the Computer Vision literature (Cristani et al., 2013). In this chapter, after giving a short overview of the surveillance approaches that adopt SSP methodologies, we examine a recent application where the connection between the two worlds promises intriguing results, namely the modeling of interactions via instant messaging platforms. Here, the environment to be monitored is no longer "real": instead, we move into another realm, that of the social web. On instant messaging platforms, one of the most important challenges is the identification of people involved in conversations. It has become important in the wake of social media's penetration into everyday life, together with the possibility of interacting with persons hiding their identity behind nicknames or potentially fake profiles. Under scenarios like these, classification approaches (typical of the classical surveillance literature) can be improved with social signals, by importing behavioral cues that come from conversation analysis. In practice, sets of features are designed to encode effectively how a person converses: since chats are crossbreeds of written text and face-to-face verbal communication, the features inherit equally from textual authorship attribution and from the conversational analysis of speech.
Importantly, the cues completely ignore the semantics of the chat, relying solely on the nonverbal aspects typical of SSP, thus addressing possible privacy and ethical issues. With this modeling, identity safekeeping can be addressed. Finally, in the conclusions, some considerations summarize what has been achieved so far in surveillance from a social and psychological perspective; future perspectives are then given, identifying how and where social signals and surveillance methods could combine most effectively.