Mini Review Volume 4 Issue 3
1System Engineering Department, University of Arkansas at Little Rock, USA
2Computer Science Department, University of Arkansas at Little Rock, USA
Correspondence: Mariofanna Milanova, Computer Science Department, University of Arkansas at Little Rock, USA
Received: April 06, 2018 | Published: May 1, 2018
Citation: Liu X, Milanova M. Visual attention in deep learning: a review. Int Rob Auto J. 2018;4(3):154-155. DOI: 10.15406/iratj.2018.04.00113
Visual attention is an essential part of human visual perception. This review describes the two main classes of attention models. Benefiting from the rapid growth of deep learning, CNN-based and other deep learning-based models are able to establish new state-of-the-art results for this challenging research problem.
Keywords: deep learning, visual attention, CNN, saliency
When facing a complex visual scene, humans can efficiently locate regions of interest and analyze the scene by selectively processing subsets of the visual input. Attention is employed to narrow down the search and speed up processing.1 Visual attention is an active topic in computer vision, neuroscience and deep learning, and it is widely used in object segmentation, object recognition, image caption generation2,3 and visual question answering (VQA).4 In the last few years, deep learning has grown rapidly, and many convolutional neural networks and recurrent neural networks have achieved far better performance than previous traditional methods on various computer vision and natural language processing tasks. Visual attention models are mainly categorized into bottom-up models and top-down models.
Bottom-up models are based on the image features of the visual scene. The goal of a bottom-up model is to find fixation points, regions that stand out from their surroundings and grab our attention at first glance. The main idea of bottom-up visual attention models is that attention is unconsciously driven by low-level stimuli. Most traditional bottom-up attention models use hand-designed low-level image features (e.g., color, intensity) to produce the saliency map and image representation. As classic salient region detection methods, the histogram-based contrast (HC) and region-based contrast (RC) algorithms5 generate saliency maps by evaluating global contrast differences and spatially weighted coherence scores. CNN-based models have since achieved state-of-the-art results by producing more accurate feature maps. The bottom-up attention model proposed in6 was implemented with Faster R-CNN, with spatial regions represented as bounding boxes, and provides a significant improvement on VQA tasks.
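To make the global-contrast idea behind the HC method concrete, the following minimal NumPy sketch quantizes colors into a coarse histogram and scores each pixel by the frequency-weighted color distance to all other pixels. The function name, bin count and normalization are illustrative assumptions, not the reference implementation of the cited work.

```python
import numpy as np

def hc_saliency(image, n_bins=12):
    """Sketch of histogram-based contrast (HC) saliency.

    A pixel is salient when its (quantized) color differs strongly from
    the colors of all other pixels, so each histogram bin is scored by
    its frequency-weighted distance to every other bin.
    """
    h, w, _ = image.shape
    # Quantize each RGB channel into n_bins levels.
    quant = (image.astype(np.float64) / 256.0 * n_bins).astype(np.int64)
    # Collapse the three bin indices into one color code per pixel.
    codes = (quant[..., 0] * n_bins + quant[..., 1]) * n_bins + quant[..., 2]
    colors, inverse, counts = np.unique(codes.ravel(),
                                        return_inverse=True,
                                        return_counts=True)
    # Recover the representative (quantized) color of each occupied bin.
    bins = np.stack([colors // (n_bins * n_bins),
                     (colors // n_bins) % n_bins,
                     colors % n_bins], axis=1).astype(np.float64)
    # Global contrast: distance of each bin color to every other bin,
    # weighted by how many pixels carry that other color.
    dists = np.linalg.norm(bins[:, None, :] - bins[None, :, :], axis=2)
    per_color = dists @ counts.astype(np.float64)
    saliency = per_color[inverse].reshape(h, w)
    # Normalize to [0, 1] so the map can be visualized directly.
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)

# Toy usage on a random "image"; real inputs would be H x W x 3 uint8 arrays.
if __name__ == "__main__":
    img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(hc_saliency(img).shape)  # (64, 64)
```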
Top-down models are driven by the observer’s prior knowledge and current goal. The recurrent attention model (RAM) proposed in7 mimics the human attention and eye-movement mechanism: assuming a ‘bandwidth’ limit for each glimpse, it predicts future eye movements and the location to attend to at the next time step. This approach is computationally efficient compared to the classical sliding-window paradigm, which incurs a high computational cost by convolving filters over the entire image. The RAM model is a recurrent neural network (RNN)8 consisting of a glimpse network,7 a core network, an action network and a location network. The glimpse representation is the input to the core network, which combines it with the internal representation from the previous time step to produce the new internal state. The location network and the action network use the internal state of the model to produce the next location to look at and the action/classification, respectively. Although RAM performs well on digit classification tasks, its ability to deal with multiple objects is limited. The deep recurrent visual attention model (DRAM)9 was proposed to extend it to multiple object recognition. It is a deep recurrent neural network, trained with reinforcement learning, that decides where to focus its computation. DRAM is composed of a glimpse network, a recurrent network, an emission network, a context network and a classification network. As DRAM explores the image sequentially with its attention mechanism, it learns to predict one object at a time, generating a label sequence for multiple objects. An attention-based model for generating neural image captions was proposed in.3 The attention it describes has two variants: stochastic “hard” attention and deterministic “soft” attention; the idea is to focus on different regions of the visual input as the model refines its predictions and moves its focus over time. A salient object detection method10 introduced short connections into the skip-layer structures of the Holistically-Nested Edge Detector (HED) architecture.11 A unified framework12 was proposed to extract richer feature representations for pixel-wise binary regression problems (e.g., salient object segmentation and edge detection); it constructs a horizontal cascade that connects a sequence of stages to transmit multi-level features horizontally. The enriched deep recurrent visual attention model (EDRAM),13 which combines a spatial transformer with a recurrent architecture, was used for multiple object recognition and performs object localization and recognition at the same time. With the spatial transformer14 used as the attention mechanism, the architecture becomes fully differentiable and achieves superior performance on the MNIST Cluttered dataset15 and the SVHN dataset.16–18
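As a hedged illustration of the deterministic “soft” attention variant mentioned above, the sketch below scores each spatial CNN feature location with an additive (MLP-style) scoring function, converts the scores to softmax weights, and forms the context vector as a weighted average of the features. The parameter names, shapes and scoring function are illustrative assumptions rather than the exact formulation in the cited work.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, w_a):
    """Sketch of deterministic "soft" attention over spatial features.

    features: (L, D) CNN feature vectors for L spatial locations
    hidden:   (H,)   decoder RNN hidden state from the previous step
    W_f, W_h, w_a: illustrative projection parameters of an additive
                   scoring function, shapes (D, A), (H, A), (A,)
    Returns the attention weights over locations and the context vector
    fed to the decoder at the current time step.
    """
    # Additive attention score for every spatial location.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # (L,)
    # Softmax turns scores into weights that sum to one.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Expected ("soft") context: weighted average of location features.
    context = weights @ features                             # (D,)
    return weights, context

# Toy usage with random feature maps and parameters.
rng = np.random.default_rng(0)
L, D, H, A = 196, 512, 256, 128          # e.g. a 14x14 feature map
feats = rng.standard_normal((L, D))
h_prev = rng.standard_normal(H)
alpha, z = soft_attention(feats, h_prev,
                          rng.standard_normal((D, A)),
                          rng.standard_normal((H, A)),
                          rng.standard_normal(A))
print(alpha.shape, z.shape)              # (196,) (512,)
```

Because the context vector is a differentiable weighted average, this variant can be trained with plain backpropagation, whereas the stochastic “hard” variant samples a single location and is typically trained with reinforcement-learning-style estimators.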
In this brief review, several widely used and state-of-the-art visual attention models were introduced. With visual attention models being introduced into image classification, object recognition and VQA tasks, the computational cost has been reduced. Benefiting from the rapid growth of deep learning, CNN-based and other deep learning-based models are able to establish new state-of-the-art results on datasets such as MNIST and SVHN, and they can also handle more complicated cases than classical methods. We presented these models by categorizing them into top-down and bottom-up mechanisms.
The research was sponsored by the University of Arkansas at Little Rock.
The authors declare there is no conflict of interest.
©2018 Liu, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon the work non-commercially.