Dimitris Metaxas

    Image-based characterization and disease understanding involve integrative analysis of morphological, spatial, and topological information across biological scales. The development of graph convolutional networks (GCNs) has created the opportunity to address this information complexity via graph-driven architectures, since GCNs can perform feature aggregation, interaction, and reasoning with remarkable flexibility and efficiency. These GCN capabilities have spawned a new wave of research in medical image analysis with the overarching goal of improving quantitative disease understanding, monitoring, and diagnosis. Yet daunting challenges remain in designing the crucial image-to-graph transformation for multi-modality medical imaging and in gaining insight into model interpretation and enhanced clinical decision support. In this review, we present recent GCN developments in the context of medical image analysis, including imaging data from radiology and histopathology. We discuss ...
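The feature aggregation this review builds on is the standard GCN propagation rule, H' = sigma(D^{-1/2}(A+I)D^{-1/2} H W): each node averages its neighbours' features (with symmetric degree normalization) before a learned linear map and nonlinearity. A minimal NumPy sketch, with a toy 3-node graph (the graph construction from medical images is the paper's subject and is not shown here):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    A: (n, n) adjacency matrix, H: (n, d) node features, W: (d, k) weights.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)          # ReLU

# Toy graph: 3 nodes in a path, 2-dim features, identity weights.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)
out = gcn_layer(A, H, W)
```

Stacking such layers lets information propagate across multi-hop neighbourhoods, which is what enables the spatial and topological reasoning described above.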
    Magnetic Resonance (MR) image reconstruction from highly undersampled k-space data is critical in accelerated MR imaging (MRI) techniques. In recent years, deep learning-based methods have shown great potential in this task. This paper proposes a learned half-quadratic splitting algorithm for MR image reconstruction and implements the algorithm in an unrolled deep learning network architecture. We compare the performance of our proposed method on a public cardiac MR dataset against DC-CNN and LPDNet; our method outperforms the other methods in both quantitative and qualitative results, with fewer model parameters and faster reconstruction speed. Finally, we enlarge our model to achieve superior reconstruction quality, with improvements of 1.76 dB and 2.74 dB over LPDNet in peak signal-to-noise ratio at 5× and 10× acceleration, respectively. Code for our method is publicly available at https://github.com/hellopipu/HQS-Net.
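Half-quadratic splitting alternates a denoising (prior) step with a data-consistency step that has a per-frequency closed form. The sketch below shows only this structure, under stated assumptions: the learned CNN denoiser of the unrolled HQS-Net is replaced by a simple local average, and a single-coil Cartesian sampling model is assumed; it is not the paper's implementation:

```python
import numpy as np

def hqs_recon(y, mask, mu=0.1, n_iters=5):
    """Half-quadratic splitting sketch for undersampled MRI.

    y: undersampled k-space, mask: binary sampling pattern. The learned
    denoiser is stood in for by a 5-point local average.
    """
    x = np.fft.ifft2(y).real                      # zero-filled initialization
    for _ in range(n_iters):
        # z-step: denoising / proximal step (stand-in for the learned prior)
        z = (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
               + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0
        # x-step: data consistency, solved exactly per k-space coefficient:
        # X(k) = (M(k) y(k) + mu Z(k)) / (M(k) + mu)
        Z = np.fft.fft2(z)
        X = (mask * y + mu * Z) / (mask + mu)
        x = np.fft.ifft2(X).real
    return x

# Toy example: 16x16 image, keep every other k-space line.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
mask = np.zeros((16, 16)); mask[::2, :] = 1.0
y = mask * np.fft.fft2(img)
recon = hqs_recon(y, mask)
```

Unrolling fixes `n_iters` and makes the z-step a trainable network, which is how the paper turns the optimizer into a deep architecture.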
    Previous research on facial expression recognition has mainly focused on near-frontal face images, while in realistic interactive scenarios the subjects of interest may appear in arbitrary non-frontal poses. In this paper, we propose a framework to recognize six prototypical facial expressions, namely anger, disgust, fear, joy, sadness, and surprise, in an arbitrary head pose. We build a multi-pose training set by rendering 3D face scans from the BU-4DFE dynamic facial expression database at 49 different viewpoints. We extract Local Binary Pattern (LBP) descriptors and further utilize multiple instance learning to mitigate the influence of inaccurate alignment in this challenging task. Experimental results demonstrate the effectiveness of the proposed multi-pose facial expression recognition framework.
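The LBP descriptor mentioned above encodes local texture by comparing each pixel with its 8 neighbours and histogramming the resulting 8-bit codes. A minimal sketch (single scale, no region tiling, which the full pipeline would add):

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour Local Binary Pattern codes for an image interior.

    Each interior pixel is compared with its 8 neighbours; a neighbour >=
    the centre sets one bit of an 8-bit code.
    """
    c = gray[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = gray[1 + dy: gray.shape[0] - 1 + dy,
                  1 + dx: gray.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit)
    return code

gray = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "face patch"
codes = lbp_image(gray)
hist = np.bincount(codes.ravel(), minlength=256)  # the LBP descriptor
```

In the multi-pose setting such histograms become the instances that multiple instance learning aggregates, so that one badly aligned region does not dominate the prediction.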
    Past research in deception detection at the University of Arizona has guided the investigation of intent detection. A theoretical foundation and model for the analysis of intent detection is proposed. Available test beds for intent analysis are discussed and two proof-of-concept studies exploring nonverbal communication within the context of intent detection are shared.
    Computer-based sign language recognition from video is a challenging problem because of the spatiotemporal complexities inherent in sign production and the variations within and across signers. However, linguistic information can help constrain sign recognition to make it a more feasible classification problem. We have previously explored recognition of linguistically significant 3D hand configurations, as start and end handshapes represent one major component of signs; others include hand orientation, place of articulation in space, and movement. Thus, although recognition of handshapes (on one or both hands) at the start and end of a sign is essential for sign identification, it is not sufficient. Analysis of hand and arm movement trajectories can provide additional information critical for sign identification. To test the discriminative potential of hand motion analysis, we performed sign recognition based exclusively on hand trajectories while holding the handshape constant. To facilitate this evaluation, we captured a collection of videos involving signs with a constant handshape produced by multiple subjects, and we automatically annotated the 3D motion trajectories. 3D hand locations are normalized in accordance with invariant properties of ASL movements. We trained time-series learning-based models for different signs of constant handshape in our dataset using the normalized 3D motion trajectories. Results show high computer-based sign recognition accuracy across subjects and across a diverse set of signs. Our framework demonstrates the discriminative power and importance of 3D hand motion trajectories for sign recognition, given known handshapes.
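The abstract does not specify the time-series model used; one standard way to compare 3D trajectories of different lengths and speeds, often used as a baseline for such trajectory-based classification, is dynamic time warping. A minimal sketch under that assumption:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between trajectories a (n x 3), b (m x 3).

    Aligns the two sequences nonlinearly in time so that the same path
    produced at different speeds still compares as similar.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 1, 20)
traj_a = np.stack([t, t ** 2, np.zeros_like(t)], axis=1)  # toy hand path
traj_b = traj_a[::2]                                      # same path, slower sampling
traj_c = traj_a + np.array([0.0, 0.0, 1.0])               # displaced path
d_same = dtw_distance(traj_a, traj_b)
d_diff = dtw_distance(traj_a, traj_c)
```

With such a distance, nearest-neighbour classification over normalized trajectories already illustrates why hand motion alone carries discriminative information about sign identity.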
    Without a commonly accepted writing system for American Sign Language (ASL), Deaf or Hard of Hearing (DHH) ASL signers who wish to express opinions or ask questions online must post a video of their signing, if they prefer not to use written English, a language in which they may feel less proficient. Since the face conveys essential linguistic meaning, the face cannot simply be removed from the video in order to preserve anonymity. Thus, DHH ASL signers cannot easily discuss sensitive, personal, or controversial topics in their primary language, limiting engagement in online debate or inquiries about health or legal issues. We explored several recent attempts to address this problem through development of “face swap” technologies to automatically disguise the face in videos while preserving essential facial expressions and natural human appearance. We presented several prototypes to DHH ASL signers (N=16) and examined their interests in and requirements for such technology. After viewing transformed videos of other signers and of themselves, participants evaluated the understandability, naturalness of appearance, and degree of anonymity protection of these technologies. Our study revealed users’ perception of key trade-offs among these three dimensions, factors that contribute to each, and their views on transformation options enabled by this technology, for use in various contexts. Our findings guide future designers of this technology and inform selection of applications and design features.
    The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to high-quality ASL video data, generally including front and side views and a close-up of the face. The manual and non-manual components of the signing have been linguistically annotated using SignStream®. The recently expanded video corpora can be browsed and searched through the Data Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data from our corpora can also be downloaded; annotations are available in an XML export format. We have also developed the ASLLRP Sign Bank, which contains almost 6,000 sign entries for lexical signs, with distinct English-based glosses, with a total of 41,830 examples of lexical signs (in addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is likewise accessible and searchable on the Internet; it can also be accessed from within SignStream® (software to facilitate linguistic annotation and analysis of visual language data) to make annotations more accurate and efficient. Here we describe the available resources. These data have been used for many types of research in linguistics and in computer-based sign language recognition from video; examples of such research are provided in the latter part of this article.
    One of the most routine actions humans perform is walking. To date, however, an automated tool for generating human gait is not available. This paper addresses the gait generation problem through three modular components. We present ElevWalker, a new low-level gait generator based on sagittal elevation angles, which allows curved locomotion - walking along a curved path - to be created easily; ElevInterp, which uses a new inverse motion interpolation algorithm to handle uneven terrain locomotion; and MetaGait, a high-level control module which allows an animator to control a figure's walking simply by specifying a path. The synthesis of these components is an easy-to-use, real-time, fully automated animation tool suitable for off-line animation, virtual environments and simulation.
    In this video, we present a research project on cardiac trabeculae segmentation. Trabeculae are fine muscle columns within human ventricles, both ends of which are attached to the ventricular wall. Extracting these structures is very challenging, even with state-of-the-art image segmentation techniques. We observed that these structures form natural topological handles. Based on this observation, we developed a topological approach that employs advanced computational topology methods and achieves high-quality segmentation results.
    In this paper we present a framework for recognizing American Sign Language (ASL). The main challenges in developing scalable recognition systems are to devise the basic building blocks from which to build up the signs, and to handle simultaneous events, such as signs where both the hand moves and the handshape changes. The latter challenge is particularly thorny, because a naive approach to handling them can quickly result in a combinatorial explosion. We loosely follow the Movement-Hold model to devise a breakdown of the signs into their constituent phonemes, which provide the fundamental building blocks. We also show how to integrate the handshape into this breakdown, and discuss what handshape representation works best. To handle simultaneous events, we split up the signs into a number of channels that are independent from one another. We validate our framework in experiments with a 22-sign vocabulary and up to three channels.
    We present Optimal Transport GAN (OT-GAN), a variant of generative adversarial nets minimizing a new metric measuring the distance between the generator distribution and the data distribution. This metric, which we call mini-batch energy distance, combines optimal transport in primal form with an energy distance defined in an adversarially learned feature space, resulting in a highly discriminative distance function with unbiased mini-batch gradients. Experimentally we show OT-GAN to be highly stable when trained with large mini-batches, and we present state-of-the-art results on several popular benchmark problems for image generation.
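The energy-distance half of the mini-batch energy distance compares two batches via 2E||x-y|| - E||x-x'|| - E||y-y'||. A minimal sketch computing it on raw vectors (OT-GAN evaluates it in an adversarially learned feature space, which is omitted here):

```python
import numpy as np

def mean_pairwise(x, y):
    """Mean Euclidean distance between all rows of x and all rows of y."""
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    return d.mean()

def energy_distance(x, y):
    """Squared energy distance between two mini-batches:
    2 E||x - y|| - E||x - x'|| - E||y - y'||.
    Near zero when the batches come from the same distribution.
    """
    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

rng = np.random.default_rng(1)
batch_p = rng.normal(0.0, 1.0, size=(64, 2))
batch_q = rng.normal(0.0, 1.0, size=(64, 2))   # same distribution
batch_r = rng.normal(3.0, 1.0, size=(64, 2))   # shifted distribution
d_close = energy_distance(batch_p, batch_q)
d_far = energy_distance(batch_p, batch_r)
```

Because every pairwise distance contributes, the statistic becomes more reliable as the mini-batch grows, which is consistent with the observed stability under large mini-batches.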
    Iterative Hard Thresholding (IHT) is a popular class of first-order greedy selection methods for loss minimization under a cardinality constraint. The existing IHT-style algorithms, however, are proposed for minimizing the primal formulation. It remains an open issue to explore duality theory and algorithms for such a non-convex and NP-hard combinatorial optimization problem. To address this issue, we develop in this article a novel duality theory for ℓ2-regularized empirical risk minimization under a cardinality constraint, along with an IHT-style algorithm for dual optimization. Our sparse duality theory establishes a set of sufficient and/or necessary conditions under which the original non-convex problem can be equivalently or approximately solved in a concave dual formulation. In view of this theory, we propose the Dual IHT (DIHT) algorithm as a super-gradient ascent method to solve the non-smooth dual problem with provable guarantees on primal-dual gap convergence and sparsity re...
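For reference, the classical primal IHT iteration that this work dualizes is a gradient step followed by hard thresholding to the k largest-magnitude coordinates. The sketch below shows only that primal version on a least-squares problem; the paper's DIHT applies an analogous iteration to the concave dual and is not reproduced here:

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht_least_squares(A, y, k, step=None, n_iters=200):
    """Primal IHT for min_x ||Ax - y||^2 s.t. ||x||_0 <= k."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L guarantees descent
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = hard_threshold(x - step * grad, k)
    return x

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 100))
x_true = np.zeros(100); x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
y = A @ x_true
x_hat = iht_least_squares(A, y, k=3)
```

The thresholding step is exactly the projection onto the cardinality constraint, which is what makes the problem non-convex and motivates the duality theory above.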
    We introduce a new general framework for sign recognition from monocular video using limited quantities of annotated data. The novelty of the hybrid framework we describe here is that we exploit state-of-the art learning methods while also incorporating features based on what we know about the linguistic composition of lexical signs. In particular, we analyze hand shape, orientation, location, and motion trajectories, and then use CRFs to combine this linguistically significant information for purposes of sign recognition. Our robust modeling and recognition of these sub-components of sign production allow an efficient parameterization of the sign recognition problem as compared with purely data-driven methods. This parameterization enables a scalable and extendable time-series learning approach that advances the state of the art in sign recognition, as shown by the results reported here for recognition of isolated, citation-form, lexical signs from American Sign Language (ASL).
    2017 marked the release of a new version of SignStream® software, designed to facilitate linguistic analysis of ASL video. SignStream® provides an intuitive interface for labeling and time-aligning manual and non-manual components of the signing. Version 3 has many new features. For example, it enables representation of morpho-phonological information, including display of handshapes. An expanding ASL video corpus, annotated through use of SignStream®, is shared publicly on the Web. This corpus (video plus annotations) is Web-accessible—browsable, searchable, and downloadable—thanks to a new, improved version of our Data Access Interface: DAI 2. DAI 2 also offers Web access to a brand new Sign Bank, containing about 10,000 examples of about 3,000 distinct signs, as produced by up to 9 different ASL signers. This Sign Bank is also directly accessible from within SignStream®, thereby boosting the efficiency and consistency of annotation; new items can also be added to the Sign Bank. S...
    Data augmentation is widely used to increase data variance in training deep neural networks. However, previous methods require either comprehensive domain knowledge or high computational cost. Can we learn data transformation automatically and efficiently with limited domain knowledge? Furthermore, can we leverage data transformation to improve not only network training but also network testing? In this work, we propose adaptive data transformation to achieve the two goals. The AdaTransform can increase data variance in training and decrease data variance in testing. Experiments on different tasks prove that it can improve generalization performance.
    In American Sign Language (ASL) as well as other signed languages, different classes of signs (e.g., lexical signs, fingerspelled signs, and classifier constructions) have different internal structural properties. Continuous sign recognition accuracy can be improved through use of distinct recognition strategies, as well as different training datasets, for each class of signs. For these strategies to be applied, continuous signing video needs to be segmented into parts corresponding to particular classes of signs. In this paper we present a multiple instance learning-based segmentation system that accurately labels 91.27% of the video frames of 500 continuous utterances (including 7 different subjects) from the publicly accessible NCSLGR corpus (Neidle and Vogler, 2012). The system uses novel feature descriptors derived from both motion and shape statistics of the regions of high local motion. The system does not require a hand tracker.
    When doing high field (1.5T) magnetic resonance breast imaging, the use of a compression plate during imaging after a contrast-agent injection may critically change the enhancement characteristics of the tumor, making the tracking of its boundaries very difficult. A new method for clinical breast biopsy is presented, based on a deformable finite element model of the breast. The geometry of the model is constructed from MR data, and its mechanical properties are based on a non-linear material model. This method allows imaging the breast without compression before the procedure, then compressing the breast and using the finite element model to predict the tumor’s position. The axial breast contours and the segmented slices are ported to a custom-written MR-image contour analysis program, which generates a finite element model (FEM) input file readable by a commercial FEM software. A deformable silicone gel phantom was built to study the movement of an inclusion inside a deformable env...
    Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification. However, the use of kernel methods for node classification, which is a related problem to graph representation learning, is still ill-posed and the state-of-the-art methods are heavily based on heuristics. Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. Our approach is motivated by graph kernel methodology but extended to learn the node representations capturing the structural information in a graph. We theoretically show that our formulation is as powerful as any positive semidefinite kernels. To efficiently learn the kernel, we propose a novel mechanism for node feature aggregation and a data-driven similarity metric employed during the training phase. More importantly, our framework is flexible and complementary to other graph-based deep learning ...
    Memory-efficient continuous Sign Language Translation is a significant challenge for the development of assisted technologies with real-time applicability for the deaf. In this work, we introduce a paradigm of designing recurrent deep networks whereby the output of the recurrent layer is derived from appropriate arguments from nonparametric statistics. A novel variational Bayesian sequence-to-sequence network architecture is proposed that consists of a) a full Gaussian posterior distribution for data-driven memory compression and b) a nonparametric Indian Buffet Process prior for regularization applied on the Gated Recurrent Unit non-gate weights. We dub our approach Stick-Breaking Recurrent network and show that it can achieve a substantial weight compression without diminishing modeling performance.
    Essential grammatical information is conveyed in signed languages by clusters of events involving facial expressions and movements of the head and upper body. This poses a significant challenge for computer-based sign language recognition. Here, we present new methods for the recognition of nonmanual grammatical markers in American Sign Language (ASL) based on: (1) new 3D tracking methods for the estimation of 3D head pose and facial expressions to determine the relevant low-level features; (2) methods for higher-level analysis of component events (raised/lowered eyebrows, periodic head nods and head shakes) used in grammatical markings―with differentiation of temporal phases (onset, core, offset, where appropriate), analysis of their characteristic properties, and extraction of corresponding features; (3) a 2-level learning framework to combine low- and high-level features of differing spatio-temporal scales. This new approach achieves significantly better tracking and recognition ...
    This paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading (δ, ρ)-modes of the underlying distributions. A point is defined to be a (δ, ρ)-mode if it is a local optimum of the density within a δ-neighborhood under metric ρ. As we increase the "scale" parameter δ, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the (δ, ρ)-modes reveals intrinsic topographical information of the underlying distributions. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple predictions.
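To make the definition concrete, here is a brute-force check of the (δ, ρ)-mode condition over binary vectors with ρ the Hamming metric and a toy two-bump density. This is only the definition, verified by enumeration; the paper's contribution is avoiding exactly this enumeration via the tree-graphical-model structure:

```python
import numpy as np
from itertools import product

def hamming(a, b):
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def is_delta_mode(point, density, delta):
    """True iff no point within Hamming distance delta has strictly
    higher density than `point` (brute force over all binary vectors)."""
    p = np.asarray(point)
    for other in product([0, 1], repeat=len(p)):
        o = np.array(other)
        if 0 < hamming(p, o) <= delta and density(o) > density(p):
            return False
    return True

# Toy density over 4-bit vectors with equal bumps at 0000 and 1111.
def density(x):
    return 0.5 * 0.9 ** hamming(x, np.zeros(4)) + \
           0.5 * 0.9 ** hamming(x, np.ones(4))

mode_at_zero = is_delta_mode([0, 0, 0, 0], density, delta=1)  # a δ=1 mode
mode_mid = is_delta_mode([0, 0, 0, 1], density, delta=1)      # dominated by 0000
```

Raising δ merges nearby local optima, which is the monotone mode-count behaviour described in the abstract.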
    The ability for computational agents to reason about the high-level content of real world scene images is important for many applications. Existing attempts at addressing the problem of complex scene understanding lack representational power, efficiency, and the ability to create robust meta-knowledge about scenes. In this paper, we introduce scenarios as a new way of representing scenes. The scenario is a simple, low-dimensional, data-driven representation consisting of sets of frequently co-occurring objects and is useful for a wide range of scene understanding tasks. We learn scenarios from data using a novel matrix factorization method which we integrate into a new neural network architecture, the ScenarioNet. Using ScenarioNet, we can recover semantic information about real world scene images at three levels of granularity: 1) scene categories, 2) scenarios, and 3) objects. Training a single ScenarioNet model enables us to perform scene classification, scenario recognition, mul...
    Identifying the lineage path of neural cells is critical for understanding the development of the brain. Accurate neural cell detection is a crucial step toward reliable delineation of cell lineage. To solve this task, in this paper we present an efficient neural cell detection method based on the SSD (single shot multibox detector) neural network model. Our method adapts the original SSD architecture and removes unnecessary blocks, leading to a light-weight model. Moreover, we formulate cell detection as a binary regression problem, which makes our model much simpler. Experimental results demonstrate that, with only a small training set, our method quickly and accurately captures neural cells even under severe shape deformation.
    In the animation industry, cartoon videos are usually produced at low frame rate since hand drawing of such frames is costly and time-consuming. Therefore, it is desirable to develop computational models that can automatically interpolate the in-between animation frames. However, existing video interpolation methods fail to produce satisfying results on animation data. Compared to natural videos, animation videos possess two unique characteristics that make frame interpolation difficult: 1) cartoons comprise lines and smooth color pieces. The smooth areas lack textures and make it difficult to estimate accurate motions on animation videos. 2) cartoons express stories via exaggeration. Some of the motions are non-linear and extremely large. In this work, we formally define and study the animation video interpolation problem for the first time. To address the aforementioned challenges, we propose an effective framework, AnimeInterp, with two dedicated modules in a coarse-to-fine manne...
    We propose a Dynamic Graph-Based Spatial-Temporal Attention (DG-STA) method for hand gesture recognition. The key idea is to first construct a fully-connected graph from a hand skeleton, where the node features and edges are then automatically learned via a self-attention mechanism that performs in both spatial and temporal domains. We further propose to leverage the spatial-temporal cues of joint positions to guarantee robust recognition in challenging conditions. In addition, a novel spatial-temporal mask is applied to significantly cut down the computational cost by 99%. We carry out extensive experiments on benchmarks (DHG-14/28 and SHREC'17) and prove the superior performance of our method compared with the state-of-the-art methods. The source code can be found at this https URL.
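The core mechanism is self-attention over a fully connected joint graph, with a mask that zeroes out most node pairs before the softmax; that masking is where the computational saving comes from. A single-head NumPy sketch (the actual DG-STA uses learned projections and separate spatial and temporal heads, which are omitted):

```python
import numpy as np

def masked_self_attention(X, mask):
    """Scaled dot-product self-attention over graph nodes.

    X: (n, d) node features (e.g. hand-joint descriptors);
    mask: (n, n) binary matrix, 1 where attention is allowed. Masked
    scores are set to -inf so they receive zero attention weight.
    """
    d = X.shape[1]
    scores = (X @ X.T) / np.sqrt(d)
    scores = np.where(mask > 0, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 4))                           # 6 joints, 4-dim features
mask = np.eye(6) + np.eye(6, k=1) + np.eye(6, k=-1)   # chain-structured mask
out, W = masked_self_attention(X, mask)
```

Restricting the mask to linguistically or kinematically plausible joint pairs keeps the attention matrix sparse, which is how a well-chosen spatial-temporal mask can remove the bulk of the score computation.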
    The multi-label classification problem involves finding a model that maps a set of input features to more than one output label. Class imbalance is a serious issue in multilabel classification. We introduce an extension of structured forests, a type of random forest used for structured prediction, called Sparse Oblique Structured Hellinger Forests (SOSHF). We explore using structured forests in the general multi-label setting and propose a new imbalance-aware formulation by altering how the splitting functions are learned in two ways. First, we account for cost-sensitivity when converting the multi-label problem to a single-label problem at each node in the tree. Second, we introduce a new objective function for determining oblique splits based on the Hellinger distance, a splitting criterion that has been shown to be robust to class imbalance. We empirically validate our method on a number of benchmarks against standard and state-of-the-art multi-label classification algorithms wit...
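The Hellinger distance underlying the SOSHF split criterion measures how far apart the class distributions on the two sides of a split are, and unlike information gain it does not reward splits that simply mirror a skewed class prior. A minimal sketch of the distance between the two branches' binary class distributions (the paper's oblique, cost-sensitive formulation is more involved and is not reproduced here):

```python
import numpy as np

def hellinger_split_score(y_left, y_right):
    """Hellinger distance between the class distributions induced by a
    binary split; y_left / y_right are arrays of 0/1 labels per branch.
    Ranges from 0 (identical distributions) to 1 (disjoint)."""
    p = np.array([np.mean(y_left == 0), np.mean(y_left == 1)])
    q = np.array([np.mean(y_right == 0), np.mean(y_right == 1)])
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# A split that separates the classes scores higher than an uninformative one.
good = hellinger_split_score(np.array([0, 0, 0, 1]), np.array([1, 1, 1, 0]))
poor = hellinger_split_score(np.array([0, 1, 0, 1]), np.array([0, 1, 1, 0]))
```

Because the score depends only on the branch-wise class proportions through their square roots, rare positive labels still move it appreciably, which is the robustness-to-imbalance property cited above.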
    In this paper, we present a sample-distributed greedy pursuit method for non-convex sparse learning under a cardinality constraint. Given training samples uniformly randomly partitioned across multiple machines, the proposed method alternates between local inexact sparse minimization of a Newton-type approximation and centralized aggregation of the global results. Theoretical analysis shows that, for a general class of convex functions with Lipschitz-continuous Hessian, the method converges linearly with a contraction factor that scales inversely with the local data size, whilst the communication complexity required to reach a desirable statistical accuracy scales logarithmically with the number of machines for some popular statistical learning models. For non-convex objective functions, our method can be shown to converge, up to a local estimation error, to a local stationary sparse solution with sub-linear communication complexity. Numerical results demonstrate the efficiency and accuracy...
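    A minimal caricature of the alternation between local sparse updates and centralized aggregation, using plain gradient steps with hard thresholding in place of the paper's Newton-type local minimization (all names and the averaging rule are assumptions for illustration):

```python
import numpy as np

def hard_threshold(w: np.ndarray, k: int) -> np.ndarray:
    """Project w onto the cardinality constraint ||w||_0 <= k by
    keeping the k largest-magnitude entries and zeroing the rest."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]
    out[idx] = w[idx]
    return out

def distributed_round(w: np.ndarray, grads: list, step: float, k: int) -> np.ndarray:
    """One sketch round: each machine takes a local sparse step on its
    own gradient; the center averages the local iterates and re-projects
    onto the cardinality constraint."""
    local = [hard_threshold(w - step * g, k) for g in grads]
    return hard_threshold(np.mean(local, axis=0), k)
```

    The communication cost per round is one k-sparse vector per machine, which is what makes the communication complexity, rather than the local computation, the quantity of interest.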
    Existing neural cell tracking methods generally use morphological cell features for data association. However, these features are limited by the quality of cell segmentation and are prone to errors in mitosis determination. To overcome these issues, in this work we propose an online multi-object tracking method that leverages both cell appearance and motion features for data association. In particular, we propose a supervised blob-seed network (BSNet) to predict the cell appearance features and an unsupervised optical flow network (UnFlowNet) to capture the cell motions. The data association is then solved using the Hungarian algorithm. Experimental evaluation shows that our approach achieves better performance than existing neural cell tracking methods.
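    The data-association step itself can be sketched with SciPy's Hungarian solver on a blended appearance/motion cost matrix (the weighting `alpha` and the gating threshold are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(app_cost: np.ndarray, motion_cost: np.ndarray,
              alpha: float = 0.5, max_cost: float = 1.0):
    """Match tracked cells (rows) to detections (cols) with the
    Hungarian algorithm on a blended appearance/motion cost.

    app_cost and motion_cost are (n_tracks, n_detections) matrices of
    dissimilarities; alpha trades one cue off against the other.
    """
    cost = alpha * app_cost + (1.0 - alpha) * motion_cost
    rows, cols = linear_sum_assignment(cost)
    # Gate implausible matches so they become track births/deaths.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```

    Combining both cues means a cell whose segmented shape is corrupted can still be matched through its motion, and vice versa.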
    Non-linear regression is a fundamental and yet under-developed methodology for solving many problems in Artificial Intelligence. Canonical control and prediction methods mostly rely on linear or multi-linear models. However, due to the high non-linearity of the underlying systems, such linear prediction models cannot fully capture the complexity of the problems. In this paper, we propose a robust two-stage hierarchical regression approach to a popular human-computer interaction task: unconstrained face keypoint detection in the wild. The inputs are still images, videos, and live camera streams from machine vision. We first propose a holistic regression model to initialize the facial fiducial points under different head pose assumptions. Second, to reduce local shape variance, a hierarchical part-based regression method is further proposed to refine the global regression output. Experiments on several challenging face-in-the-wild datasets demonstrate the c...
    In this study, we address a cross-domain problem of applying computer vision approaches to reason about human facial behaviour when people play The Resistance game. To capture the facial behaviours, we first collect several hours of video in which the participants playing The Resistance game assume the roles of deceivers (spies) vs. truth-tellers (villagers). We develop a novel attention-based neural network (NN) that advances the state of the art in understanding how a NN predicts the players’ roles. This is accomplished by discovering, through learning, the pixels and related frames that are discriminative and contribute the most to the NN’s inference. We demonstrate the effectiveness of our attention-based approach in discovering the frames and facial Action Units (AUs) that contributed to the NN’s class decision. Our results are consistent with current communication theory on deception.
    We present a physics-based deformable model framework for incremental object shape estimation and tracking in image sequences. The model is estimated by an optimization process that relates image-based cost functions to model motion via the Lagrangian dynamics equations. Although previous approaches have investigated various combinations of cues in the context of deformable model shape and motion estimation, they generally assume a fixed, known model parameterization, along with a single model discretization in terms of points. Our technique for object shape estimation and tracking is based on incrementally fusing point information with new line information. Assuming that a deformable model has been initialized to fit part of a complex object (e.g., a bicycle), new line features belonging to the object but excluded from the initial model parameterization are identified during tracking. The identification is based on a set of novel model-based geometric consistency checks re...
    In Section 1 we describe a large—and expanding—set of linguistically annotated video data collected from native ASL signers. This corpus is accessible for use by the linguistics and computer science research communities and for educational purposes through a new Web-based Data Access Interface (DAI) that we have been developing to facilitate viewing, searching, and downloading relevant subsets of the data. The annotations, also available in XML format, have been carried out using SignStream®, for which a Java reimplementation with many new features (especially for efficient input of phonological information) is now underway.
    Sequential face alignment, in essence, deals with non-rigid deformation that changes over time. Although numerous methods have been proposed with impressive success on still images, many of them still suffer from limited performance when it comes to sequential alignment in in-the-wild scenarios, e.g., those involving large pose/expression variations and partial occlusions. The underlying reason is that they usually perform sequential alignment by independently applying models trained offline to each frame in a tracking-by-detection manner, completely ignoring the temporal constraints that become available in a sequence. To address this issue, we propose to exploit incremental learning for person-specific alignment. Our approach takes advantage of part-based representation and cascade regression for robust and efficient alignment on each frame. More importantly, it incrementally updates the representation subspace and simultaneously adapts the cascade regressors in parallel using a unified framework...
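    One standard way to adapt a linear regressor online as new frames arrive, in the spirit of the incremental updates described above, is a rank-one recursive least-squares step (a generic sketch under assumed names, not the paper's exact update rule):

```python
import numpy as np

class RecursiveLS:
    """Recursive least-squares: after each frame, fold one new
    (feature, target) pair into a linear regressor without re-solving
    the full least-squares problem from scratch."""

    def __init__(self, dim: int, lam: float = 1e-2):
        self.P = np.eye(dim) / lam   # inverse regularized covariance estimate
        self.w = np.zeros(dim)       # current regressor weights

    def update(self, x: np.ndarray, y: float) -> None:
        """Rank-one update of P and w from a single sample (x, y)."""
        Px = self.P @ x
        k = Px / (1.0 + x @ Px)              # gain vector
        self.w += k * (y - x @ self.w)       # correct by prediction error
        self.P -= np.outer(k, Px)            # Sherman-Morrison downdate
```

    Each update costs O(dim^2), so the regressor can track a person-specific appearance drift frame by frame instead of relying only on an offline-trained model.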
    We design a new connectivity pattern for the U-Net architecture. Given several stacked U-Nets, we couple each U-Net pair through connections between their semantic blocks, resulting in the coupled U-Nets (CU-Net). The coupling connections make information flow across U-Nets more efficient, and the resulting feature reuse makes each U-Net highly parameter-efficient. We evaluate the coupled U-Nets on two benchmark datasets of human pose estimation, comparing both accuracy and model parameter count. The CU-Net obtains accuracy comparable to state-of-the-art methods while using at least 60% fewer parameters than other approaches.
    The idea behind data augmentation techniques rests on the fact that slight changes to a percept do not change how the brain interprets it. In classification, neural networks exploit this fact by applying transformations to the inputs while learning to predict the same label. However, in deep subspace clustering (DSC), ground-truth labels are not available, so one cannot easily apply data augmentation techniques. We propose a technique to exploit the benefits of data augmentation in DSC algorithms: we learn representations that have consistent subspaces for slightly transformed inputs. In particular, we introduce a temporal ensembling component into the objective function of DSC algorithms, enabling DSC networks to maintain consistent subspaces under random transformations of the input data. In addition, we provide a simple yet effective unsupervised procedure to find efficient data augmentation policies. An augmentation policy is defined as an image processing transformation with...
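    A minimal sketch of a temporal-ensembling consistency term: keep a per-sample exponential moving average of representations across epochs and penalize deviation from it (the class, the momentum value, and the loss form are assumptions for illustration, not the paper's exact objective):

```python
import numpy as np

class TemporalEnsemble:
    """Per-sample exponential moving average (EMA) of representations
    across training epochs. Penalizing deviation from the EMA target
    keeps the subspace assignment of augmented views of a sample
    consistent, even without ground-truth labels."""

    def __init__(self, n_samples: int, dim: int, momentum: float = 0.6):
        self.ema = np.zeros((n_samples, dim))  # one EMA target per sample
        self.momentum = momentum

    def update(self, idx: np.ndarray, z: np.ndarray) -> None:
        """Fold the current representations z of samples idx into the EMA."""
        self.ema[idx] = self.momentum * self.ema[idx] + (1 - self.momentum) * z

    def consistency_loss(self, idx: np.ndarray, z: np.ndarray) -> float:
        """Mean squared deviation of z from the stored EMA targets."""
        return float(np.mean((z - self.ema[idx]) ** 2))
```

    In training, `consistency_loss` would be added to the usual DSC self-expressiveness objective, so randomly transformed views of a sample are pulled toward the same slowly moving target representation.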

    And 571 more