
Information Fusion

Volume 57, May 2020, Pages 89-101

Neural architecture search for image saliency fusion

https://doi.org/10.1016/j.inffus.2019.12.007

Highlights

  • Neural architecture search addressed with Genetic Programming and Backpropagation.

  • Genetic Programming efficiently provides blueprints for neural network architectures.

  • Backpropagation significantly improves the performance of candidate blueprints.

  • Proper fusion of hand-crafted saliency methods can outperform deep learning methods.

  • Proper fusion of deep learning methods outperforms the state of the art.

Abstract

Saliency detection methods proposed in the literature exploit different rationales, visual clues, and assumptions, but there is no single best saliency detection algorithm that achieves good results on all the different benchmark datasets. In this paper we show that fusing different saliency detection algorithms by exploiting neural network architectures makes it possible to obtain better results. Designing the best architecture for a given task is still an open problem, since existing techniques have limits with respect to the problem formulation and the search space, and require very high computational resources. To overcome these problems, in this paper we propose a three-step fusion approach. In the first step, genetic programming techniques are exploited to combine the outputs of existing saliency algorithms using a set of provided operations. Having a discrete search space allows for fast generation of the candidate solutions. In the second step, the obtained solutions are converted into backbone Convolutional Neural Networks (CNNs) where all operations are implemented with differentiable functions, allowing efficient optimization of the corresponding parameters (in a continuous space) by backpropagation. In the last step, to enrich the expressiveness of the initial architectures, the networks are further extended with additional operations on intermediate levels of processing, which are once again efficiently optimized through backpropagation.

Extensive experimental evaluations show that the proposed saliency fusion approach outperforms the state of the art on the MSRA-B dataset and generalizes to unseen data from different benchmark datasets.

Introduction

According to [1], “Visual salience (or visual saliency) is the distinct subjective perceptual quality which makes some items in the world stand out from their neighbors and immediately grab our attention”. The human vision system is able to efficiently detect salient areas in a scene and further process them to extract high-level information [2], [3]. Visual saliency has been primarily studied by neuroscientists and cognitive scientists, and has recently received attention from other research communities working in the fields of computer vision, computer graphics, and multimedia, e.g. [4]. In the areas of multimedia and computer vision, visual saliency can be used to emphasize object-level regions in the scene, which can serve as a pre-processing step for scene recognition [5], [6], object detection [7], [8], segmentation [9], and tracking [10]. It can also be exploited for image manipulation and visualization in applications such as image retargeting [11], image collage [12], and non-photorealistic rendering [13]. Moreover, in multimedia applications saliency can be exploited for image and video summarization [14], [15], [16], enhancement [17], retrieval [18], and image quality or aesthetic assessment [19], [20].

Saliency detection methods can be divided into two categories: bottom-up and top-down. Bottom-up methods are stimuli-driven [21]. Saliency is usually modeled by local or global contrast on hand-crafted visual features, and knowledge about human visual attention is embedded in the model by exploiting heuristic priors such as background [22], compactness [23], or objectness [24]. With these methods, no explicit information about the semantics of the salient regions is provided; it is indirectly embedded via prior assumptions on the location, shape, or visual properties of the salient regions to be detected. Bottom-up methods can be considered general purpose.
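As a toy illustration of the contrast rationale behind many bottom-up methods, the sketch below scores each pixel by its absolute deviation from the global mean intensity (the function and this simplistic cue are ours for illustration only; the surveyed methods rely on far richer features and priors):

```python
# Toy bottom-up cue: global-contrast saliency on a grayscale image.
# A pixel is salient in proportion to how far it deviates from the mean
# intensity of the whole image; the map is normalized to [0, 1].
def global_contrast_saliency(image):
    """image: 2D list of gray levels; returns a saliency map in [0, 1]."""
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    contrast = [[abs(v - mean) for v in row] for row in image]
    peak = max(max(row) for row in contrast) or 1.0
    return [[c / peak for c in row] for row in contrast]
```

On an image with a single bright pixel on a dark background, the bright pixel receives saliency 1.0 while the background stays well below it, matching the contrast prior while remaining blind to semantics.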

Top-down saliency methods are designed to find regions in the images that are relevant for a given task. They are often also referred to as task-driven approaches. These methods usually formulate the saliency detection as a supervised learning problem [25]. The rationale of top-down saliency methods is to identify image regions that belong to a pre-defined object category [26]. For this reason, these methods are theoretically more robust for identifying salient regions in cluttered backgrounds where bottom-up methods may fail. Top-down approaches rely on the use of training data to build the detection model. They can be very robust for the specific task on which they are trained but may not generalize well to other tasks.

In order to make the detection more robust and to improve the generalization capabilities, saliency methods often integrate different features [27] that can be either hand-crafted or learned by Convolutional Neural Networks (CNNs) [28], [29], [30], or fuse saliency maps generated by different methods [31]. However, the feature definition and selection, and the combination strategies, are usually empirically designed.

Since multiple observers may consider different regions of a scene salient, depending on the scene context and/or on the observer’s cultural background, saliency detection is an ill-posed problem [22], [32]. Saliency detection methods proposed in the literature exploit different rationales, visual clues, and assumptions, but as demonstrated by the experiments in [33], there is no single best saliency detection algorithm that achieves good results on all the different benchmark datasets.

In our previous works [34], [35], we exploited genetic programming (GP) to build the rationale with which to combine the binary outputs of several change detection algorithms. Using a priori defined unary, binary, and n-ary operators, the GP approach automatically combined the inputs and built an optimal, task-driven solution (i.e. a program) in the form of a hierarchical tree structure.
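As a toy sketch of such a GP-built combination (the operator set and tree encoding here are our invented simplifications, not the paper's actual operators, and the maps are real-valued as in the saliency case rather than binary), each tree node is either a leaf referencing an input map or an operator applied to its children:

```python
# Hypothetical fusion-tree evaluator. A tree is either an int (index of an
# input map) or a tuple ("op", child, ...). The three toy operators work
# element-wise on 2D maps with values in [0, 1].
OPS = {
    "avg": lambda a, b: [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)],
    "max": lambda a, b: [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)],
    "inv": lambda a: [[1.0 - x for x in row] for row in a],
}

def eval_tree(node, maps):
    if isinstance(node, int):  # leaf: pick one of the input maps
        return maps[node]
    op, *children = node       # internal node: apply the operator to its children
    return OPS[op](*(eval_tree(c, maps) for c in children))
```

For example, the tree `("avg", 0, ("inv", 1))` averages the first map with the inverted second map; GP searches over such discrete tree structures, which is what makes candidate generation fast.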

In this work we further investigate and extend this approach to combine gray-level saliency maps, a domain we first addressed in [36]. We first create a candidate solution for combining the saliency maps using GP with a set of operations whose parameters are fixed a priori. To further improve this solution, we should also tune these parameters, but they cannot be easily (or efficiently) optimized within the GP framework. To optimize the parameters, we use the candidate solution obtained by the GP as a blueprint upon which to design the architecture of a backbone Convolutional Neural Network (CNN). Within the CNN optimization framework, it is much easier and more efficient to search for the optimal parameters of the operations of the GP solution. Another important advantage of the backbone CNN implementation is that, once the proposed solutions have been evaluated, we can easily and safely create deeper variants of the CNN by including other operations (e.g. post-processing) on intermediate results. These operations, initialized as identities, are further optimized or can be completely ignored by the CNN during training.
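The identity-initialized extra operations can be illustrated with a scalar toy example (ours, not the paper's actual layers): a pointwise affine map y = w·x + b starts as the identity (w = 1, b = 0), so inserting it leaves the network output unchanged; gradient descent on the loss can then adapt it, or keep it at the identity if it does not help:

```python
# Toy "identity-initialized" operation: a pointwise affine map trained by
# gradient descent on the mean squared error against target values.
def fit_affine(xs, targets, lr=0.1, steps=200):
    w, b = 1.0, 0.0  # identity initialization: output == input at step 0
    n = len(xs)
    for _ in range(steps):
        gw = sum(2 * (w * x + b - t) * x for x, t in zip(xs, targets)) / n
        gb = sum(2 * (w * x + b - t) for x, t in zip(xs, targets)) / n
        w, b = w - lr * gw, b - lr * gb
    return w, b
```

If the targets are the inputs shifted by a constant, training recovers w ≈ 1 and a matching bias; if the targets equal the inputs, the gradients vanish and the layer stays exactly at the identity.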

The extensive experiments on benchmark datasets, both qualitative and quantitative, validate the effectiveness of the proposed fusion strategy.

Finally, beyond the focus on saliency estimation for the scope of this paper, the proposed information fusion technique can be considered a general purpose method, with possible applications to other fields such as change detection [35] and semantic segmentation [37].

Section snippets

Saliency detection algorithms

Borji et al. [33] benchmarked 41 different saliency detection algorithms, each based on different assumptions and heuristics. For example, Li et al. [38] compute saliency from the perspective of the image reconstruction error of background images generated at different levels of detail. A graph-based approach is used instead by Yang et al. [39]. Again, superpixels are the base for the saliency computation. Foreground and background region queries are used to rank each image region using a

Proposed method

Our proposed saliency estimation approach aims at combining the advantages of Genetic Programming with those of Convolutional Neural Networks. With our approach, we design and optimize GP-generated solutions for saliency estimation in three steps. In the first step, Genetic Programming techniques are exploited to combine existing saliency maps using a set of provided operations. The output of this step is a fusion tree that encodes the optimal fusion strategy with respect to the defined

Experiments

In this section, we first describe the experimental setup, by introducing the input saliency estimation algorithms, the datasets that have been adopted at different phases of the optimization, and the evaluation metrics. We then present the following experiments: we select different fusion trees from the Genetic Programming phase, generate the corresponding CNNs, and evaluate them on various datasets for a comparison with the input algorithms.

Conclusions

We have proposed a general purpose neural architecture search strategy, with a focus on the estimation of image saliency. Specifically, we have devised a three-step optimization process that combines the output of existing algorithms for saliency estimation.

First, a fusion tree is generated through genetic programming, working on a set of predefined operators. The discrete search space of the operators to be used and combined is efficiently handled by the evolutionary algorithm. This initial

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. The research leading to these results has received funding from TEINVEIN: TEcnologie INnovative per i VEicoli Intelligenti, CUP (Codice Unico Progetto – Unique Project Code): E96D17000110009 – Call “Accordi per la Ricerca e l’Innovazione”, cofunded by POR FESR 2014–2020 (Programma Operativo Regionale, Fondo Europeo di Sviluppo Regionale – Regional Operational

References (80)

  • A. Azaza et al.

    Context proposals for saliency detection

    Comput. Vis. Image Underst.

    (2018)
  • R. Miikkulainen et al.

    Evolving deep neural networks

    Artificial Intelligence in the Age of Neural Networks and Brain Computing

    (2019)
  • L. Itti

    Visual saliency

    Scholarpedia

    (2007)
  • R.M. Shiffrin et al.

Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory

    Psychol. Rev.

    (1977)
  • W. Schneider et al.

Controlled and automatic human information processing: I. Detection, search, and attention

    Psychol. Rev.

    (1977)
  • L. Itti et al.

    Computational modelling of visual attention

    Nat. Rev. Neurosci.

    (2001)
  • D. Gao et al.

    Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • Z. Ren et al.

Region-based saliency detection and its application in object recognition

    IEEE Trans. Circuits Syst. Video Technol.

    (2014)
  • S. Mitri et al.

    Robust object detection at regions of interest with an application in ball recognition

    Proceedings of the IEEE International Conference on Robotics and Automation

    (2005)
  • V. Navalpakkam et al.

    An integrated model of top-down and bottom-up attention for optimizing detection speed

    Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2006)
  • Q. Li et al.

    Saliency based image segmentation

    Proceedings of the 2011 International Conference on Multimedia Technology

    (2011)
  • V. Mahadevan et al.

    Saliency-based discriminant tracking

    Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition

    (2009)
  • S. Avidan et al.

    Seam carving for content-aware image resizing

    ACM Trans. Graph.

    (2007)
  • R. Margolin et al.

    Saliency for image manipulation

    Vis. Comput.

    (2013)
  • D. DeCarlo et al.

    Stylization and abstraction of photographs

    ACM Trans. Graph.

    (2002)
  • N. Ouerhani et al.

    Adaptive color image compression based on visual attention

    Proceedings of the 11th International Conference on Image Analysis and Processing (ICIAP)

    (2001)
  • S. Corchs et al.

    Video summarization using a neurodynamical model of visual attention

    Proceedings of the 6th IEEE Workshop on Multimedia Signal Processing

    (2004)
  • C. Guo et al.

    A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression

    IEEE Trans. Image Process.

    (2010)
  • F. Gasparini et al.

    Low quality image enhancement using visual attention

    Opt. Eng.

    (2007)
  • Y. Gao et al.

    Database saliency for fast image retrieval

    IEEE Trans. Multimed.

    (2015)
  • A. Li et al.

    Color image quality assessment combining saliency and FSIM

    Proceedings of the Fifth International Conference on Digital Image Processing (ICDIP 2013)

    (2013)
  • L. Wong et al.

    Saliency retargeting: an approach to enhance image aesthetics

    Proceedings of the 2011 IEEE Workshop on Applications of Computer Vision (WACV)

    (2011)
  • L. Itti et al.

    A model of saliency-based visual attention for rapid scene analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1998)
  • Y. Wei et al.

    Geodesic saliency using background priors

    Proceedings of the European Conference On Computer vision

    (2012)
  • F. Perazzi et al.

    Saliency filters: contrast based filtering for salient region detection

    Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2012)
  • Y. Li et al.

    The secrets of salient object segmentation

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • H. Jiang et al.

    Salient object detection: a discriminative regional feature integration approach

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2013)
  • H. Cholakkal et al.

Top-down saliency with locality-constrained contextual sparse coding

    Proceedings of the 2015 BMVC

    (2015)
  • T. Liu et al.

    Learning to detect a salient object

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • G. Li et al.

    Deep contrast learning for salient object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Q. Hou et al.

    Deeply supervised salient object detection with short connections

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • S. Bianco et al.

    Multiscale fully convolutional network for image saliency

    J. Electron. Imaging

    (2018)
  • L. Mai et al.

    Saliency aggregation: a data-driven approach

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2013)
  • M. Amirul Islam et al.

    Revisiting salient object detection: simultaneous detection, ranking, and subitizing of multiple salient objects

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • A. Borji et al.

    Salient object detection: a benchmark

    IEEE Trans. Image Process.

    (2015)
  • S. Bianco et al.

    How far can you get by combining change detection algorithms?

    Proceedings of the International Conference on Image Analysis and Processing – ICIAP 2017

    (2017)
  • S. Bianco et al.

    Combination of video change detection algorithms by genetic programming

    IEEE Trans. Evol. Comput.

    (2017)
  • M. Buzzelli et al.

    Combining saliency estimation methods

    Proceedings of the International Conference on Image Analysis and Processing – ICIAP 2019

    (2019)
  • D. Mazzini et al.

    A CNN architecture for efficient semantic segmentation of street scenes

    Proceedings of the 8th IEEE International Conference on Consumer Electronics – Berlin (ICCE-Berlin)

    (2018)
  • X. Li et al.

    Saliency detection via dense and sparse reconstruction

    Proceedings of the 2013 IEEE International Conference on Computer Vision

    (2013)