We introduce Retrieval Augmented Classification (RAC), a generic approach to augmenting standard image classification pipelines with an explicit retrieval module. RAC consists of a standard base image encoder fused with a parallel retrieval branch that queries a non-parametric external memory of pre-encoded images and associated text snippets. We apply RAC to the problem of long-tail classification and demonstrate a significant improvement over previous state-of-the-art on Places365-LT and iNaturalist2018 (14.5% and 6.7% respectively), despite using only the training datasets themselves as the external information source. We demonstrate that RAC’s retrieval module, without prompting, learns a high level of accuracy on tail classes. This, in turn, frees the base encoder to focus on common classes, and improve its performance thereon. RAC represents an alternative approach to utilizing large, pretrained models without requiring fine-tuning, as well as a first step towards more effecti...
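The two-branch idea described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the memory stores (embedding, label) pairs rather than text snippets, that retrieval uses cosine similarity, and that the retrieval branch's output is simply added to the base encoder's logits. The names `retrieve`, `retrieval_logits` and `classify` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, memory, k):
    """Return the k (similarity, label) pairs most similar to the query.

    `memory` is a list of (embedding, label) pairs; storing labels instead
    of text snippets is an assumption made for this sketch.
    """
    scored = [(cosine(query, emb), label) for emb, label in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def retrieval_logits(neighbours, num_classes):
    """Turn retrieved neighbours into per-class scores by summing similarities."""
    logits = [0.0] * num_classes
    for similarity, label in neighbours:
        logits[label] += similarity
    return logits

def classify(base_logits, query, memory, k=2):
    """Fuse base-encoder logits with retrieval logits (assumed: plain addition)."""
    neighbours = retrieve(query, memory, k)
    extra = retrieval_logits(neighbours, len(base_logits))
    fused = [b + r for b, r in zip(base_logits, extra)]
    return max(range(len(fused)), key=fused.__getitem__)

# Toy memory: class 0 is "common", class 1 is a "tail" class.
memory = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
base_logits = [0.5, 0.0]   # the base encoder alone would pick class 0
query = [0.0, 1.0]
prediction = classify(base_logits, query, memory, k=2)
```

In this toy case the base logits favour the common class, while the retrieved neighbours shift the fused prediction towards the tail class.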
Deep generative models have been used in recent years to learn coherent latent representations in order to synthesize high quality images. In this work we propose a neural network to learn a generative model for sampling consistent indoor scene layouts. Our method learns the co-occurrences, and appearance parameters such as shape and pose, for different object categories through a grammar-based auto-encoder, resulting in a compact and accurate representation for scene layouts. In contrast to existing grammar-based methods with a user-specified grammar, we construct the grammar automatically by extracting a set of production rules by reasoning about object co-occurrences in training data. The extracted grammar is able to represent a scene by an augmented parse tree. The proposed auto-encoder encodes these parse trees to a latent code, and decodes the latent code to a parse-tree, thereby ensuring the generated scene is always valid. We experimentally demonstrate that the proposed aut...
Digital Hampi: Preserving Indian Cultural Heritage, 2017
Heritage artefacts and monuments are important components of social science. They are under constant threat of decay and degradation due to exposure to an unfriendly natural environment and hooliganism. Restoration of heritage artefacts such as murals and paintings is an important task for preserving the social, cultural and political history of a nation. Because they are located inside temples in India, a significant share of murals and paintings are not accessible for physical restoration. This motivates many researchers to put effort into restoring such priceless paintings and reliefs digitally in the augmented reality domain. In this work, we propose an exemplar-based coherent texture synthesis technique to inpaint the digital image of the damaged portion of murals and paintings. Inpainting methods, while maintaining spatial coherency, usually introduce blurring as well as structured noise in the inpainted regions. To overcome this problem, we combine the proposed patch-based diffusion technique with a novel technique for high-frequency generation that leads to edge sharpening and denoising simultaneously. Finally, the proposed constraint and the interactive nature of the method are found to be efficient in handling a rich variety of such paintings. The experimental results with empirical evaluation show the efficacy of the proposed method.
Bishnupur is an attractive tourist place in West Bengal, India, and is known for its terracotta temples. The place is one of the prospective candidates for inclusion in the list of UNESCO World Heritage sites. We intend to preserve this heritage site digitally and also to present some virtual interaction for tourists and researchers. In this paper, we present an image dataset of different temples (namely, Jor Bangla, Kalachand, Madan Mohan, Radha Madhav, Rasmancha, Shyamrai and Nandalal) in Bishnupur for evaluating different types of computer vision and image processing algorithms (like 3D reconstruction, image inpainting, texture classification and content specific image retrieval). The dataset is captured using four different cameras with different parameter settings. Some datasets are extracted and earmarked for certain applications such as texture classification, image inpainting and content specific image retrieval. Example results of baseline methods are also shown for thes...
Heritage preservation and awareness building have become a major application domain for computer vision techniques in recent years. Bishnupur is an important and well-known heritage site in West Bengal, India. This attractive tourist place is famous for its terracotta temples. In this article, we present an image dataset that we created for developing and evaluating various computer vision algorithms for preservation and visualization of heritage artifacts in digital space. The dataset includes images of some important temples, such as Jor Bangla temple, Kalachand temple, Madan Mohan temple, Nandalal temple, Radha Madhav temple, Rasmancha, and Shyamrai temple. Though this dataset can be used for many types of computer vision and image analysis algorithms, we have shown here its usefulness by testing the images for four different applications: 3D reconstruction, image inpainting, texture classification, and content-specific figure spotting and retrieval. Note that we have shown the resul...
Deep generative models have been used in recent years to learn coherent latent representations in order to synthesize high-quality images. In this work, we propose a neural network to learn a generative model for sampling consistent indoor scene layouts. Our method learns the co-occurrences, and appearance parameters such as shape and pose, for different object categories through a grammar-based auto-encoder, resulting in a compact and accurate representation for scene layouts. In contrast to existing grammar-based methods with a user-specified grammar, we construct the grammar automatically by extracting a set of production rules by reasoning about object co-occurrences in training data. The extracted grammar is able to represent a scene by an augmented parse tree. The proposed auto-encoder encodes these parse trees to a latent code, and decodes the latent code to a parse tree, thereby ensuring the generated scene is always valid. We experimentally demonstrate that the proposed au...
Image-based localization is one of the important problems in computer vision due to its wide applicability in robotics, augmented reality, and autonomous systems. There is a rich set of methods described in the literature on how to geometrically register a 2D image w.r.t. a 3D model. In particular, data augmentation methods such as synthetic image generation have been shown to be useful for this task. In this work, we propose a synthetic data augmentation technique and design a deep neural network that can be trained to estimate the absolute pose of an image from synthesized sparse feature descriptors. Our choice of sparse feature descriptors has two major advantages. First, our network is significantly smaller than the CNNs proposed in the literature for this task, thereby making our approach more efficient and scalable. Second, and more importantly, the use of sparse features allows the training data to be augmented with synthetic viewpoints, which leads to substantial improvements...
In this paper, a novel Pixel-Voxel network is proposed for dense 3D semantic mapping, which can perform dense 3D mapping while simultaneously recognizing and labelling the semantic category of each point in the 3D map. In our approach, we fully leverage the advantages of different modalities. That is, the PixelNet can learn the high-level contextual information from 2D RGB images, and the VoxelNet can learn 3D geometrical shapes from the 3D point cloud. Unlike the existing architecture that fuses score maps from different modalities with equal weights, we propose a softmax weighted fusion stack that adaptively learns the varying contributions of PixelNet and VoxelNet and fuses the score maps according to their respective confidence levels. Our approach achieved competitive results on both the SUN RGB-D and NYU V2 benchmarks, while the runtime of the proposed system is boosted to around 13 Hz, enabling near-real-time performance using an eight-core i7 PC with a single Titan X GPU.
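A rough sketch of what softmax-weighted fusion of two score maps could look like for a single 3D point follows. The abstract does not give the exact formulation, so this assumes one learnable weight per class and per branch, normalised by a softmax across the two branches. The function `fuse_scores` and its parameters are hypothetical names.

```python
import math

def softmax(values):
    """Numerically stable softmax over a short list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scores(pixel_scores, voxel_scores, pixel_weights, voxel_weights):
    """Fuse per-class scores from two branches for one point.

    For each class, the two branch weights (assumed learnable) are turned
    into mixing coefficients that sum to 1, so the fusion can favour
    whichever branch is trusted more for that class.
    """
    fused = []
    for p, v, wp, wv in zip(pixel_scores, voxel_scores,
                            pixel_weights, voxel_weights):
        a_pixel, a_voxel = softmax([wp, wv])
        fused.append(a_pixel * p + a_voxel * v)
    return fused

pixel_scores = [2.0, 0.5, 0.1]
voxel_scores = [0.2, 1.5, 0.3]

# Equal weights reduce to the plain average of the two branches.
equal = fuse_scores(pixel_scores, voxel_scores, [0.0] * 3, [0.0] * 3)

# A strongly positive pixel weight makes the fusion follow the 2D branch.
pixel_heavy = fuse_scores(pixel_scores, voxel_scores, [50.0] * 3, [-50.0] * 3)
```

With all weights equal, the sketch falls back to the equal-weight fusion that the abstract contrasts against; unequal weights let one modality dominate per class.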
Papers by Pulak Purkait