Multimodal Deep Learning Methods for Video Analytics

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (15 December 2019) | Viewed by 58670

Special Issue Editor


Dr. Seungmin Rho
Guest Editor
Department of Industrial Security, Chung-Ang University, Seoul 06974, Republic of Korea
Interests: databases; big data analysis; music retrieval; multimedia systems; machine learning; knowledge management; computational intelligence

Special Issue Information

Dear Colleagues,

Video capturing devices are ubiquitous in the current era, and the nature and range of video data now cover virtually all aspects of our daily lives. The videos they capture span edited content (movies, serials, etc.) at one end and a huge amount of unedited content (consumer videos, egocentric videos, etc.) at the other. Because of this ubiquity, videos contain rich information and knowledge that can be extracted and analyzed for a variety of applications. Video analytics is a broad field that encompasses the design and development of systems capable of automatically analyzing videos to detect spatial and temporal events of interest.

In the last few years, deep learning algorithms have shown tremendous performance in many research areas, especially computer vision and natural language processing (NLP). Deep learning-based algorithms have attained a level of performance in tasks such as image recognition, speech recognition, and NLP that was beyond expectation a decade ago. In multimodal deep learning, data obtained from different sources are used to learn features over multiple modalities, which helps generate a shared representation across the modalities. The use of multiple modalities is expected to yield superior performance. In video analytics, for example, audio, visual, and (possibly) textual data can be analyzed together.
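To make the shared-representation idea concrete, the following is a minimal PyTorch sketch of late fusion: separate audio and visual encoders feed a common classification head. The `AVFusion` name, layer sizes, and class count are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    """Toy late-fusion model: separate audio/visual encoders feed a shared head."""
    def __init__(self, audio_dim=128, visual_dim=512, shared_dim=256, n_classes=7):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, shared_dim), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, shared_dim), nn.ReLU())
        self.head = nn.Linear(2 * shared_dim, n_classes)  # shared representation -> classes

    def forward(self, audio, visual):
        z = torch.cat([self.audio_enc(audio), self.visual_enc(visual)], dim=-1)
        return self.head(z)

model = AVFusion()
logits = model(torch.randn(4, 128), torch.randn(4, 512))  # a batch of 4 clips
print(logits.shape)  # torch.Size([4, 7])
```

Early fusion, attention-based fusion, or cross-weight schemes (as in the topics list below) would replace the simple concatenation in the head.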

The objectives of this Special Issue are to gather work on video analytics using multimodal deep learning-based methods and to introduce new large-scale, real-world applications of video analytics.

We solicit original research and survey papers addressing topics including (but not limited to):

  • Analysis of first-person/wearable videos using multimodal deep learning techniques
  • Analysis of web videos, ego-centric videos, surveillance videos, movies or any other type of videos using multimodal deep learning techniques
  • Data collections, benchmarking, and performance evaluation of deep learning-based video analytics
  • Multimodal deep convolutional neural network for audio-visual emotion recognition
  • Multimodal deep learning framework with cross weights
  • Multimodal information fusion via deep learning or machine learning methods

The topics in video analytics may include (but are not limited to):

  • Object detection and recognition
  • Action recognition
  • Event detection
  • Video highlights, summary and storyboard generation
  • Segmentation and tracking
  • Authoring and editing of videos
  • Scene understanding
  • People analysis
  • Security issues in surveillance videos

Dr. Seungmin Rho
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Audio-Visual Emotion Recognition
  • Deep Learning
  • Natural Language Processing
  • Video Analytics

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (8 papers)


Research

11 pages, 3294 KiB  
Article
Cover the Violence: A Novel Deep-Learning-Based Approach Towards Violence-Detection in Movies
by Samee Ullah Khan, Ijaz Ul Haq, Seungmin Rho, Sung Wook Baik and Mi Young Lee
Appl. Sci. 2019, 9(22), 4963; https://doi.org/10.3390/app9224963 - 18 Nov 2019
Cited by 95 | Viewed by 8894
Abstract
Movies have become one of the major sources of entertainment in the current era and are based on diverse ideas. Action movies, which contain violent scenes, have received the most attention in the last few years because violence, although an undesirable feature for some individuals, is used to create charm and fantasy. However, these violent scenes have a negative impact on kids and can be uncomfortable even for mature viewers. The best way to stop underage people from watching violent scenes in movies is to eliminate these scenes. In this paper, we propose a violence detection scheme for movies that comprises three steps. First, the entire movie is segmented into shots, and a representative frame from each shot is selected based on its level of saliency. Next, the selected frames are passed through a light-weight deep learning model, fine-tuned using a transfer learning approach, to classify the shots in a movie as violent or non-violent. Finally, all the non-violent scenes are merged in sequence to generate a violence-free movie that can be watched by children as well as by people sensitive to violence. The proposed model is evaluated on three violence benchmark datasets, and the experiments show that the proposed scheme provides faster and more accurate detection of violent scenes in movies than state-of-the-art methods.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
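As a rough illustration of the shot-level pipeline described in the abstract above, the sketch below picks one key frame per shot using a simple gradient-energy saliency proxy, classifies it with a placeholder light-weight CNN, and keeps only the shots classified as non-violent. The saliency measure, the untrained classifier, and the `filter_violent_shots` helper are hypothetical stand-ins rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def saliency_score(frame: torch.Tensor) -> float:
    """Stand-in saliency: gradient-magnitude energy of a grayscale frame (C, H, W in [0, 1])."""
    gray = frame.mean(dim=0)
    gx = gray[:, 1:] - gray[:, :-1]
    gy = gray[1:, :] - gray[:-1, :]
    return gx.abs().mean().item() + gy.abs().mean().item()

# Placeholder light-weight classifier (the paper fine-tunes a pretrained light-weight CNN).
classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2)  # 0 = non-violent, 1 = violent
)

def filter_violent_shots(shots):
    """shots: list of shots, each a list of 3xHxW frame tensors. Returns non-violent shots."""
    kept = []
    for shot in shots:
        key = max(shot, key=saliency_score)                # representative frame per shot
        with torch.no_grad():
            label = classifier(key.unsqueeze(0)).argmax(1).item()
        if label == 0:                                     # keep only non-violent shots
            kept.append(shot)
    return kept

demo_shots = [[torch.rand(3, 64, 64) for _ in range(5)] for _ in range(3)]
print(len(filter_violent_shots(demo_shots)))
```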
Figures:

  • Figure 1: Proposed framework for violent scene detection in movies.
  • Figure 2: Sample frames from the movie ‘Undisputed II’ along with their saliency maps. The frames with red bounding boxes are discarded, while the frame in blue is selected as the key frame with a maximum saliency score of 0.7261.
  • Figure 3: Visualization of different kinds of convolution operations: (a) standard convolution; (b) depth-wise convolution; (c) point-wise convolution.
  • Figure 4: Feature maps of violence and non-violence scenes.
  • Figure 5: Sample frames from the datasets used for evaluation. In each row, the first three samples are from the violent class and the last three are from the non-violent class: (a) Violence in Movies, (b) Hockey Fights, and (c) Violence Scenes Detection datasets. The frames in red and blue are detected as violent and non-violent by the proposed method, respectively.
  • Figure 6: Confusion matrices of the (a) Violence in Movies, (b) Hockey Fight, and (c) Violence Scenes Detection datasets, and (d) the combined dataset.
  • Figure 7: Performance evolution of the proposed model.
20 pages, 8168 KiB  
Article
A Study on Development of the Camera-Based Blind Spot Detection System Using the Deep Learning Methodology
by Donghwoon Kwon, Ritesh Malaiya, Geumchae Yoon, Jeong-Tak Ryu and Su-Young Pi
Appl. Sci. 2019, 9(14), 2941; https://doi.org/10.3390/app9142941 - 23 Jul 2019
Cited by 12 | Viewed by 7187
Abstract
One of the recent news headlines is that a pedestrian was killed by an autonomous vehicle because the safety features in the vehicle did not correctly detect an object on the road. Due to this accident, some global automobile companies announced plans to postpone the development of autonomous vehicles, and there is no doubt about the importance of safety features for such vehicles. For this reason, our research goal is the development of a very safe and lightweight camera-based blind spot detection system that can be applied to future autonomous vehicles. The blind spot detection system was implemented in open source software. Approximately 2000 vehicle images and 9000 non-vehicle images were adopted for training the Fully Connected Network (FCN) model. Other data processing concepts such as the Histogram of Oriented Gradients (HOG), heat map, and thresholding were also employed. We achieved 99.43% training accuracy and 98.99% testing accuracy with the FCN model. The source code for all the methodologies was then deployed to an off-the-shelf embedded board for actual testing on a road. Actual testing was conducted with consideration of various factors, and we confirmed 93.75% average detection accuracy with three false positives.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
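The HOG-plus-fully-connected-network idea in the abstract can be sketched roughly as below, using scikit-image's `hog` and a small PyTorch classifier; the layer sizes, synthetic patches, and single training step are assumptions for illustration, and the full system additionally applies sliding windows, a heat map, and thresholding over each video frame.

```python
import numpy as np
from skimage.feature import hog
import torch
import torch.nn as nn

def hog_features(img_gray: np.ndarray) -> np.ndarray:
    # 64x64 grayscale patch -> HOG descriptor (feature length depends on these settings)
    return hog(img_gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Small fully connected classifier: vehicle vs. non-vehicle (layer sizes are illustrative).
feat_dim = hog_features(np.zeros((64, 64))).shape[0]
fcn = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                    nn.Linear(128, 64), nn.ReLU(),
                    nn.Linear(64, 2))

# One training step on synthetic patches, just to show the wiring.
X = torch.tensor(np.stack([hog_features(np.random.rand(64, 64)) for _ in range(8)]),
                 dtype=torch.float32)
y = torch.randint(0, 2, (8,))
opt = torch.optim.Adam(fcn.parameters(), lr=1e-6)  # the paper reports a 1 x 10^-6 learning rate
loss = nn.CrossEntropyLoss()(fcn(X), y)
loss.backward(); opt.step()
print(float(loss))
```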
Figures:

  • Figure 1: Radar module of the blind spot detection system.
  • Figure 2: Camera-based blind spot system.
  • Figure 3: Example of vehicle images.
  • Figure 4: Camera installation for preliminary testing.
  • Figure 5: Example of the extracted images.
  • Figure 6: Data visualization of the vehicle and non-vehicle images.
  • Figure 7: Example of HOG feature extraction for the vehicle and non-vehicle object.
  • Figure 8: Blind spot setting zone.
  • Figure 9: Overall research framework. A total of 2000 vehicle images and approximately 9000 non-vehicle images were extracted for data preprocessing. The HOG descriptor was applied for feature extraction, and 1188 HOG features for both vehicle and non-vehicle objects were fed into the FCN model as inputs. Note that representation (feature) learning was not applied in this research due to hardware constraints. Two hidden layers, a 1 × 10⁻⁶ learning rate, and 10 epochs with the Adam optimizer and ReLU activations were employed for training and testing the FCN model, and the training and testing accuracies achieved were 99.4% and 99.0%, respectively. We set the blind spot to a range of 400–800 px on the x-axis and 450–550 px on the y-axis in the 1300 px × 800 px image and applied the sliding-window technique to detect vehicle features. Finally, the heat map and thresholding were employed to reduce false-positive errors.
  • Figure 10: Experimental results of both scenarios.
  • Figure 11: System architecture of the embedded board. The board comes with 64-bit quad-core and dual-core CPU clusters, a GPU with 256 Compute Unified Device Architecture (CUDA) cores, and 8 GB of 128-bit LPDDR4 RAM. Its main purpose and advantage is use as an embedded AI platform with speed and power efficiency [43], but many software packages must be installed to use the variety of features it offers.
  • Figure 12: Experimental result of the first testing scenario.
  • Figure 13: Experimental result of the second testing scenario.
19 pages, 8018 KiB  
Article
A Video-Based Fire Detection Using Deep Learning Models
by Byoungjun Kim and Joonwhoan Lee
Appl. Sci. 2019, 9(14), 2862; https://doi.org/10.3390/app9142862 - 18 Jul 2019
Cited by 156 | Viewed by 15161
Abstract
Fire is an abnormal event which can cause significant damage to lives and property. In this paper, we propose a deep learning-based fire detection method using a video sequence, which imitates the human fire detection process. The proposed method uses Faster Region-based Convolutional Neural Network (R-CNN) to detect the suspected regions of fire (SRoFs) and of non-fire based on their spatial features. Then, the summarized features within the bounding boxes in successive frames are accumulated by Long Short-Term Memory (LSTM) to classify whether there is a fire or not in a short-term period. The decisions for successive short-term periods are then combined in the majority voting for the final decision in a long-term period. In addition, the areas of both flame and smoke are calculated and their temporal changes are reported to interpret the dynamic fire behavior with the final fire decision. Experiments show that the proposed long-term video-based method can successfully improve the fire detection accuracy compared with the still image-based or short-term video-based method by reducing both the false detections and the misdetections.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
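To illustrate the temporal half of the method, the sketch below runs an LSTM over per-frame feature vectors (which in the paper are summarized Faster R-CNN features within the detected bounding boxes) to make short-term fire/no-fire decisions, then majority-votes them into a long-term decision. The feature size, window lengths, and random inputs are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ShortTermFireClassifier(nn.Module):
    """LSTM over per-frame feature vectors -> fire / no-fire logits for a short clip."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        _, (h, _) = self.lstm(feats)
        return self.fc(h[-1])            # logits per clip

def long_term_decision(clip_logits):
    """Majority vote over successive short-term decisions (1 = fire)."""
    votes = [int(l.argmax()) for l in clip_logits]
    return int(sum(votes) > len(votes) / 2)

clf = ShortTermFireClassifier()
clips = [torch.randn(1, 16, 256) for _ in range(5)]   # 5 short-term windows of 16 frames
with torch.no_grad():
    decision = long_term_decision([clf(c)[0] for c in clips])
print("fire" if decision else "no fire")
```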
Figures:

  • Figure 1: The proposed network architecture.
  • Figure 2: Timing diagram of fire detection.
  • Figure 3: Faster Region-Based Convolutional Neural Network (R-CNN) structure for fire detection.
  • Figure 4: Sample images for training Faster R-CNN; (a), (b), and (c) are flame, smoke, and non-fire images, respectively.
  • Figure 5: The spatial feature extraction from Faster R-CNN.
  • Figure 6: Frames with multiple suspected regions of fire (SRoFs). (a) shows the case where the predicted flame and smoke areas intersect, and (b) shows the case where they do not.
  • Figure 7: The long short-term memory (LSTM) network for fire detection.
  • Figure 8: Calculation of weighted areas of SRoFs.
  • Figure 9: Results of Faster R-CNN fire detection: (a) flame detection results, (b) smoke detection results, and (c) false-positive detection results.
  • Figure 10: Sample still shots including flame, smoke, and non-fire objects taken from video clips for LSTM training. The (a) images are taken from fire videos, while the (b) images are from non-fire video clips.
  • Figure 11: Set of representative images from Foggia et al.’s [11] dataset. The (a) images are taken from videos of fires, while the (b) images are from non-fire videos.
  • Figure 12: Sample still shots taken from video clips for the experiment on majority voting and interpretation of dynamic fire behavior: (a) real fires, and (b) non-fires such as chimney smoke, sunset, cloud, fog, and light.
  • Figure 13: Sample still shots of the video clip shown in the first row and first column.
  • Figure 14: Changes in the areas of flame and smoke with final decisions by majority voting for the video clip of Figure 13.
  • Figure 15: Sample still shots of the sunrise video clip.
  • Figure 16: Changes in the areas of flame and smoke with final decisions by majority voting for the video clip of Figure 15.
  • Figure 17: Sample still shots of a decreasing and then increasing flame video clip.
  • Figure 18: Changes in the areas of flame and smoke with final decisions by majority voting for the video clip of Figure 17.
  • Figure 19: Sample still shots of a decreasing flame video clip.
  • Figure 20: Changes in the areas of flame and smoke with final decisions by majority voting for the video clip of Figure 19.
13 pages, 2578 KiB  
Article
Classification of Marine Vessels with Multi-Feature Structure Fusion
by Erhu Zhang, Kelu Wang and Guangfeng Lin
Appl. Sci. 2019, 9(10), 2153; https://doi.org/10.3390/app9102153 - 27 May 2019
Cited by 23 | Viewed by 4195
Abstract
The classification of marine vessels is one of the important problems of maritime traffic. To fully exploit the complementarity between different features and to more effectively identify marine vessels, a novel feature structure fusion method based on spectral regression discriminant analysis (SF-SRDA) was proposed. Firstly, we selected the different convolutional neural network features that better describe the characteristics of ships and constructed graph-based features using a similarity metric. Then we weighted the concatenated multi-feature and fused their structures according to the linear relationship assumption. Finally, we constructed an optimization formula to solve for the fused features and structure using spectral regression discriminant analysis. Experiments on the VAIS dataset show that the proposed SF-SRDA method can reduce the feature dimension from the original 102,400 dimensions to 5, that the classification accuracy of visible images can reach 87.60%, and that that of infrared images can reach 74.68% in the daytime. The experimental results demonstrate that the proposed method can not only extract the optimal features from the original redundant feature space but also greatly reduce the feature dimensionality. Furthermore, the classification performance of SF-SRDA is also promising.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
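As a loose illustration of the concatenate-then-project idea, the sketch below concatenates two placeholder CNN feature sets and reduces them to a low-dimensional discriminant space. scikit-learn's LinearDiscriminantAnalysis is used here as a stand-in for the paper's spectral regression discriminant analysis, and the feature dimensions, class count, and labels are synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, n_classes = 120, 6
feat_vgg = rng.normal(size=(n, 4096))     # placeholder VGG-19 features
feat_res = rng.normal(size=(n, 2048))     # placeholder ResNet-152 features
labels = rng.integers(0, n_classes, size=n)

X = np.concatenate([feat_vgg, feat_res], axis=1)              # concatenated multi-feature
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)  # at most C - 1 = 5 dimensions
X_low = lda.fit_transform(X, labels)
print(X.shape, "->", X_low.shape)         # high-dimensional features projected to 5 dimensions
```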
Figures:

  • Figure 1: The overall framework of multi-feature structure fusion based on spectral regression discriminant analysis (SF-SRDA). CNN: convolutional neural network; IR: infrared images.
  • Figure 2: The location of the extracted features from VGG-19 or ResNet-152.
  • Figure 3: Internal structure fusion of multiple features.
  • Figure 4: Some visible images from the VAIS dataset.
  • Figure 5: Some infrared images from the VAIS dataset.
  • Figure 6: Fine-grained image examples under the sailing class.
13 pages, 1429 KiB  
Article
Parallel Image Captioning Using 2D Masked Convolution
by Chanrith Poleak and Jangwoo Kwon
Appl. Sci. 2019, 9(9), 1871; https://doi.org/10.3390/app9091871 - 7 May 2019
Cited by 3 | Viewed by 3290
Abstract
Automatically generating a novel description of an image is a challenging and important problem that brings together advanced research in both computer vision and natural language processing. In recent years, image captioning has significantly improved its performance by using long short-term memory (LSTM) as a decoder for the language model. However, despite this improvement, LSTM itself has its own shortcomings as a model because the structure is complicated and its nature is inherently sequential. This paper proposes a model using a simple convolutional network for both encoder and decoder functions of image captioning, instead of the current state-of-the-art approach. Our experiment with this model on a Microsoft Common Objects in Context (MSCOCO) captioning dataset yielded results that are competitive with the state-of-the-art image captioning model across different evaluation metrics, while having a much simpler model and enabling parallel graphics processing unit (GPU) computation during training, resulting in a faster training time.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
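The parallelism argument rests on masked (causal) convolutions: all output positions are computed at once, but position t only sees words up to t. The sketch below shows a 1-D causal convolution block with a gated linear unit as a simplified stand-in for the paper's 2D masked convolution; the sizes and block structure are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvDecoderBlock(nn.Module):
    """Masked (causal) convolution over a word sequence: output at t sees only tokens <= t."""
    def __init__(self, channels=128, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                    # pad only on the left
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x):                             # x: (batch, channels, seq_len)
        h = self.conv(F.pad(x, (self.pad, 0)))        # left padding keeps causality
        return F.glu(h, dim=1) + x                    # gated linear unit + residual

embed = nn.Embedding(1000, 128)
tokens = torch.randint(0, 1000, (4, 12))              # a batch of 4 partial captions
x = embed(tokens).transpose(1, 2)                     # -> (batch, channels, seq_len)
block = CausalConvDecoderBlock()
print(block(x).shape)                                 # all positions computed in parallel
```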
Figures:

  • Figure 1: The architectural overview of our image captioning using the 2D masked convolutional network.
  • Figure 2: The DenseNet architecture of the 2D masked convolution.
  • Figure 3: The detailed implementation of our 2D convolution. After receiving the output from word embedding, we apply a 2D masked convolutional layer followed by a gated linear unit and an attention layer. We stack this 2D convolution four times before applying output embedding to get the output word sequences.
  • Figure 4: The top three sentences generated by the beam search algorithm.
  • Figure 5: The qualitative comparison between the captions generated by our model and the ground truth (GT) captions randomly picked from the MSCOCO dataset.
  • Figure 6: The progression of scores and cross-entropy loss along the epochs: (a) the BLEU-1 evaluation score, starting from 59.5 in Epoch 1 and peaking at 73.6 in Epoch 26 before it stops improving; (b) the CIDEr evaluation score, starting from 50.6 in Epoch 1 and peaking at 99.6 in Epoch 24 before no longer improving; and (c) the cross-entropy loss of our model across the epochs.
16 pages, 1141 KiB  
Article
Bilinear CNN Model for Fine-Grained Classification Based on Subcategory-Similarity Measurement
by Xinghua Dai, Shengrong Gong, Shan Zhong and Zongming Bao
Appl. Sci. 2019, 9(2), 301; https://doi.org/10.3390/app9020301 - 16 Jan 2019
Cited by 9 | Viewed by 4703
Abstract
One of the challenges in fine-grained classification is that subcategories with significant similarity are hard to distinguish because existing algorithms treat all subcategories equally. To solve this problem, a fine-grained image classification method combining a bilinear convolutional neural network (B-CNN) with a measurement of subcategory similarities is proposed. Firstly, an improved weakly supervised localization method is designed to obtain the bounding box of the main object, which allows the model to eliminate the influence of background noise and obtain more accurate features. Then, sample features in the training set are computed by the B-CNN so that the fuzzy similarity matrix for measuring interclass similarities can be obtained. To further improve classification accuracy, the loss function is designed by weighting the triplet loss and the softmax loss. Extensive experiments on two benchmark datasets, Stanford Cars-196 and Caltech-UCSD Birds-200-2011 (CUB-200-2011), show that the newly proposed method outperforms several state-of-the-art weakly supervised classification models in accuracy.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
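Two ingredients named in the abstract can be sketched compactly: bilinear pooling of convolutional feature maps and a loss that weights a triplet term against a softmax (cross-entropy) term. The tensor sizes, the weighting factor `alpha`, and the shared-backbone shortcut are illustrative assumptions; the paper's subcategory-similarity weighting is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bilinear_pool(fa, fb):
    """Bilinear pooling of two feature maps (B, C, H, W) -> (B, C*C) descriptors."""
    B, C, H, W = fa.shape
    fa = fa.reshape(B, C, H * W)
    fb = fb.reshape(B, C, H * W)
    phi = torch.bmm(fa, fb.transpose(1, 2)) / (H * W)      # averaged outer products
    phi = phi.reshape(B, C * C)
    phi = torch.sign(phi) * torch.sqrt(phi.abs() + 1e-10)  # signed square root
    return F.normalize(phi, dim=1)                         # l2 normalization

# Combined objective: weighted triplet loss plus softmax (cross-entropy) loss.
triplet = nn.TripletMarginLoss(margin=0.5)
ce = nn.CrossEntropyLoss()

def combined_loss(anchor, positive, negative, logits, labels, alpha=0.5):
    return alpha * triplet(anchor, positive, negative) + (1 - alpha) * ce(logits, labels)

f = torch.randn(2, 64, 7, 7)
desc = bilinear_pool(f, f)                                 # the two streams may share a backbone
logits, labels = torch.randn(2, 196), torch.randint(0, 196, (2,))   # e.g. 196 car classes
print(desc.shape, float(combined_loss(desc, desc, torch.randn_like(desc), logits, labels)))
```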
Figures:

  • Figure 1: Sample examples on Stanford Cars-196.
  • Figure 2: Schematic view of overall structure.
  • Figure 3: Schematic diagram of weakly supervised localization.
  • Figure 4: Architecture of VGG-16. The blue layer represents the pooling layer while the white layer represents the convolution and activation layer.
  • Figure 5: Schematic diagram of triplet-loss learning.
  • Figure 6: Convergence-rate comparison.
  • Figure 7: Relation between training error and iteration number.
  • Figure 8: Performance of different α values.
19 pages, 3670 KiB  
Article
Deep Learning Based Computer Generated Face Identification Using Convolutional Neural Network
by L. Minh Dang, Syed Ibrahim Hassan, Suhyeon Im, Jaecheol Lee, Sujin Lee and Hyeonjoon Moon
Appl. Sci. 2018, 8(12), 2610; https://doi.org/10.3390/app8122610 - 13 Dec 2018
Cited by 48 | Viewed by 9776
Abstract
Generative adversarial networks (GANs) describe an emerging generative model which has made impressive progress in the last few years in generating photorealistic facial images. As a result, it has become more and more difficult to differentiate between computer-generated and real face images, even with the human eye. If the generated images are used with the intent to mislead and deceive readers, they could cause severe ethical, moral, and legal issues. Moreover, it is challenging to collect a dataset for computer-generated face identification that is large enough for research purposes, because the number of realistic computer-generated images is still limited and scattered across the internet. Thus, the development of a novel decision support system for analyzing and detecting computer-generated face images produced by GANs is crucial. In this paper, we propose a customized convolutional neural network, namely CGFace, which is specifically designed for the computer-generated face detection task by customizing the number of convolutional layers, so it performs well in detecting computer-generated face images. After that, an imbalanced framework (IF-CGFace) is created by altering CGFace’s layer structure to address the imbalanced data issue, extracting features from the CGFace layers and using them to train AdaBoost and eXtreme Gradient Boosting (XGB) classifiers. Next, we explain the process of generating a large computer-generated dataset based on the state-of-the-art PCGAN and BEGAN models. Then, various experiments are carried out to show that the proposed model with augmented input yields the highest accuracy, at 98%. Finally, we provide comparative results by applying the proposed CNN architecture to images generated by other GAN research.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
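A minimal sketch of the imbalanced-framework idea: deep features are extracted by a small convolutional stack (standing in for the CGFace layers) and fed to a boosted classifier. The toy data, layer sizes, and the use of scikit-learn's AdaBoostClassifier in place of the paper's AdaBoost/XGBoost setup are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn
from sklearn.ensemble import AdaBoostClassifier

# Placeholder CNN feature extractor standing in for the CGFace convolutional layers.
features = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten()
)

# Imbalanced toy set: 10 "computer-generated" faces vs. 100 "real" faces (random tensors).
imgs = torch.rand(110, 3, 64, 64)
labels = [1] * 10 + [0] * 100
with torch.no_grad():
    X = features(imgs).numpy()

# Boosted classifier on the deep features (the paper trains AdaBoost and XGBoost here).
clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)
print(clf.predict(X[:5]))
```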
Figures:

  • Graphical abstract
  • Figure 1: Example of computer-generated images generated by a GAN (one of these is a photo of a real face; the other faces are completely created by a GAN).
  • Figure 2: Overall architecture of the proposed method.
  • Figure 3: CGFace model with detailed configuration for each layer.
  • Figure 4: Generated images with different values of the hyperparameter γ (some noise-like regions are marked in red).
  • Figure 5: Differences in the generated images when the number of iterations increases.
  • Figure 6: Images with a size of 64 × 64 generated by the BEGAN model.
  • Figure 7: Generated images from the trained PCGAN model.
  • Figure 8: Generated images from the trained BGAN model.
  • Figure 9: A comparison of AUC for various methods through 4 folds in the imbalanced environment (no. of CG images / no. of real images = 1/10).
  • Figure 10: The AUC performance of the proposed models and two other models on different datasets.
  • Figure 11: The time complexity of the proposed models and other models.
21 pages, 6127 KiB  
Article
Temporal Modeling on Multi-Temporal-Scale Spatiotemporal Atoms for Action Recognition
by Guangle Yao, Tao Lei, Xianyuan Liu and Ping Jiang
Appl. Sci. 2018, 8(10), 1835; https://doi.org/10.3390/app8101835 - 6 Oct 2018
Cited by 2 | Viewed by 3105
Abstract
As an important branch of video analysis, human action recognition has attracted extensive research attention in the computer vision and artificial intelligence communities. In this paper, we propose to model the temporal evolution of multi-temporal-scale atoms for action recognition. An action can be considered as a temporal sequence of action units. These action units, which we refer to as action atoms, can capture the key semantic and characteristic spatiotemporal features of actions at different temporal scales. We first investigate Res3D, a powerful 3D CNN architecture, and create variants of Res3D for different temporal scales. In each temporal scale, we design practices to transfer the knowledge learned from RGB to optical flow (OF) and build RGB and OF streams to extract deep spatiotemporal information using Res3D. Then we propose an unsupervised method to mine action atoms in the deep spatiotemporal space. Finally, we use long short-term memory (LSTM) to model the temporal evolution of atoms for action recognition. The experimental results show that our proposed multi-temporal-scale spatiotemporal atom modeling method achieves recognition performance comparable to that of state-of-the-art methods on two challenging action recognition datasets: UCF101 and HMDB51.
(This article belongs to the Special Issue Multimodal Deep Learning Methods for Video Analytics)
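The abstract's pipeline (3D CNN features, unsupervised atom mining, then an LSTM over the atoms) can be sketched roughly as below; the tiny 3D convolutional stack stands in for Res3D, k-means stands in for the paper's unsupervised atom mining step, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Toy 3D CNN standing in for Res3D: clips (B, C, T, H, W) -> spatiotemporal feature vectors.
res3d = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten()
)

clips = torch.rand(32, 3, 8, 32, 32)             # 32 clips at an "8-frame" temporal scale
with torch.no_grad():
    feats = res3d(clips).numpy()

# Unsupervised atom mining: cluster clip features; cluster centres act as action atoms.
atoms = KMeans(n_clusters=4, n_init=10, random_state=0).fit(feats).cluster_centers_

# The LSTM models the temporal evolution of the atom sequence for action classification.
lstm = nn.LSTM(input_size=atoms.shape[1], hidden_size=64, batch_first=True)
head = nn.Linear(64, 101)                        # e.g. 101 classes as in UCF101
seq = torch.tensor(atoms, dtype=torch.float32).unsqueeze(0)   # atoms sorted by timestamp
_, (h, _) = lstm(seq)
print(head(h[-1]).shape)                         # torch.Size([1, 101])
```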
Figures:

  • Figure 1: The framework of our proposed method.
  • Figure 2: The illustration of a stream in our proposed method.
  • Figure 3: Two consecutive frames, optical flow in the horizontal and vertical directions, and 3-channel optical flow.
  • Figure 4: Comparison of two types of training: (a) fine-tuning using the pre-trained model; and (b) training from scratch. The experiment is conducted in the optical flow (OF) stream of the 8f temporal scale on UCF101 split-1.
  • Figure 5: The mined atoms (S = 4) sorted by timestamps and their corresponding clips in the 8f temporal scale.
  • Figure 6: The structure of the long short-term memory (LSTM) unit [45].
  • Figure 7: Three types of LSTM inference.
  • Figure 8: Fusions of the RGB and OF streams.
  • Figure 9: Sample frames from UCF101 and HMDB51.
  • Figure 10: Visualization of the Conv1 feature maps. Res3D captures the appearance information in the first frame of the clip and thereafter captures the motion information.
  • Figure 11: Visualization of the kernels. The shape of the 64 Conv1 kernels is 3 × 3 × 7 × 7 (C × T × H × W). The visualization shows the learned Conv1 kernels with T = 2, arranged in 8 × 8 with a 4× upscale.
  • Figure 12: Atom embedding visualizations of the 4f, 8f, and 16f temporal scales on UCF101 training split-1. The averaged atom of each action is visualized as a point, and averaged atoms that belong to the same action class have the same color.