eISSN: 2574-8092

International Robotics & Automation Journal

Review Article Volume 3 Issue 3

Harnessing model uncertainty via hierarchical weakly informative priors in Bayesian neural networks

Wenjun Bai, Changqin Quan

Department of System Informatics, Kobe University, Japan

Correspondence: Changqin Quan, Department of System Informatics, Kobe University, Japan, Tel 81-78-803-6068

Received: September 20, 2017 | Published: October 25, 2017

Citation: Bai W, Quan C. Harnessing model uncertainty via hierarchical weakly informative priors in Bayesian neural networks. Int Rob Auto J. 2017;3(3):307-308. DOI: 10.15406/iratj.2017.03.00057


Abstract

Despite its superior pattern recognition performance, the conventional neural network cannot express model uncertainty. To address this issue, the Bayesian neural network, whose predictions carry model uncertainty, is succinctly reviewed. However, a misspecified prior can render the posterior useless. In contrast with conventional informative priors, e.g., Normal and Laplace priors, we therefore suggest employing weakly informative priors in the Bayesian neural network for their tolerance of prior misspecification. For empirical datasets with semantic annotations, hierarchical weakly informative priors are further introduced to boost the discriminability of the model. An evaluation experiment revealed the effectiveness of hierarchical weakly informative priors over other priors in a hand-written digit classification task.

Keywords: neural network, Bayesian neural network, weakly informative prior

Introduction

The recent renaissance of the deep neural network (DNN) has driven leaps forward in image and natural language processing tasks.1 However, the boosted discriminability of the DNN does not remedy its intrinsic inability to output probabilistic predictions, i.e., model uncertainty. The Bayesian neural network (BNN), which models a distribution over the weights rather than the single point estimate of the conventional neural network, allows the DNN to make probabilistic inferences.2,3 For illustration, a simple simulated BNN and the model uncertainty it produces are depicted below in Figure 1. In a nutshell, a BNN is merely a neural network with a properly defined prior distribution on its weights.4 The distinction between the DNN and the BNN is not superficial: the former targets stacked non-linear extrapolations of input-output relations, whereas the latter performs a chain of inference, from setting up the prior to applying the likelihood (evidence) that corrects the prior into the posterior.

Figure 1 Model uncertainty, measured via the posterior predictive variance from the defined prior to the posterior over weights. In this BNN simulation, the model is less certain about predictions in the shaded (middle) area than about predictions in the dark area when classifying a datum as class 1 (coloured red).
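To make the notion concrete, the predictive variance depicted in Figure 1 can be estimated by Monte Carlo: draw weight samples from the (approximate) posterior, run each through the network, and compute the variance of the resulting predictions. Below is a minimal sketch in Python; `weight_samples` and `forward` are hypothetical placeholders for whatever inference procedure produced the posterior draws.

```python
import numpy as np

def predictive_moments(weight_samples, x, forward):
    """Monte Carlo estimate of the posterior predictive mean and variance.

    weight_samples: an iterable of draws from the (approximate) posterior
        over the network weights (hypothetical; produced elsewhere).
    forward: the network's forward pass, forward(w, x) -> prediction.
    """
    preds = np.stack([forward(w, x) for w in weight_samples])
    # High variance across posterior draws signals low model confidence,
    # as in the shaded region of Figure 1.
    return preds.mean(axis=0), preds.var(axis=0)
```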

It is clear that, given a pre-determined likelihood function, different priors reflecting our heterogeneous prior beliefs about the to-be-modelled task should result in diverse posterior weight distributions. Unfortunately, previous research on BNNs has uniformly focused on one specific type of prior, the informative prior, such as a zero-mean spherical Gaussian, for its prior-posterior conjugacy and computational convenience.5 The issue with such a standard informative prior lies in its implicit assumption that the model is properly specified, as its relatively small variance and fixed shape constrain the variability of the posterior weight distributions. In practical problems, however, the specification of a model is commonly sub-optimal in the first place; hence, a prior that carries some information, yet is not as overwhelming as an informative prior, is demanded. To fulfil this enquiry, we resort to weakly informative priors, which exert controlled influence on the posterior compared to informative ones (see6 for a detailed review). However, previously documented weakly informative priors, such as the uniform prior, are not ideal for BNNs, because the improper posteriors they can yield produce biased and less interpretable probabilistic predictions. As a result, we gravitate towards weakly informative priors from the Cauchy family for their flexibility. Moreover, it is advisable to place these weakly informative priors in a hierarchical order to penalise large weights in the BNN.
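The tolerance of the Cauchy prior to misspecification is visible directly in its tails. A small sketch, assuming SciPy, compares the density each candidate prior assigns to an unexpectedly large weight:

```python
from scipy.stats import norm, laplace, cauchy

# Density each prior assigns to a "surprisingly large" weight value.
# The heavy-tailed Cauchy retains non-negligible mass far from zero,
# so the likelihood can still pull the posterior there if the data demand it.
w = 10.0
print('Normal(0, 1):   %.2e' % norm.pdf(w, loc=0.0, scale=1.0))     # ~7.7e-23
print('Laplace(0, 1):  %.2e' % laplace.pdf(w, loc=0.0, scale=1.0))  # ~2.3e-05
print('Cauchy(0, 2.5): %.2e' % cauchy.pdf(w, loc=0.0, scale=2.5))   # ~7.5e-03
```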

Discussion

For an empirical evaluation of the above-mentioned hierarchical weakly informative priors in the BNN, an experiment on the Digits dataset7 was conducted. Consider a three-layer neural network with one densely connected hidden layer, where both input and output variables are in the classification setting. The likelihood function for this BNN is defined as equation (1):

$$ p(y_n \mid w, x_n, \sigma^2) = \mathrm{Normal}\big(y_n \mid \mathrm{NN}(x_n; w),\, \sigma^2\big) \quad (1) $$

where NN stands for the conventional neural network whose weights and biases come from the latent weight variable $w$ ($\sigma^2$ is assumed known in this case). To verify whether the proposed hierarchical weakly informative priors produce more robust performance in Digits classification than the ordinarily used Normal and Laplace priors, we explicitly compared the performance of BNNs equipped with the three corresponding priors. The parameterisations of the three priors are expressed in equations (2) to (4):

Normal Prior:
$$ f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}}\, \exp\Big\{-\frac{\tau}{2}(x-\mu)^2\Big\} \quad (2) $$

Laplace Prior:
$$ f(x \mid \mu, b) = \frac{1}{2b}\, \exp\Big\{-\frac{|x-\mu|}{b}\Big\} \quad (3) $$

Cauchy Prior:
$$ f(x \mid \alpha, \beta) = \frac{1}{\pi\beta\Big[1 + \big(\frac{x-\alpha}{\beta}\big)^2\Big]} \quad (4) $$

In this experiment, we set $\mu = 0$ and $\tau = 1$ for the Normal prior, and $\mu = 0$ and $b = 1$ for the Laplace prior. For the hierarchical weakly informative priors, we placed two Cauchy priors, both centered ($\alpha$) at 0, with scales ($\beta$) of 2.5 and 1.0, respectively. The applied priors are demonstrated in Figure 2. After parameter estimation, we resorted to scalar-valued metrics to assess the performance of the trained models. From Table 1, it is clear that our proposed hierarchical weakly informative priors, i.e., stacked Cauchy priors with gradually reduced scales, improved the discriminability of the BNN to the greatest extent, compared with the informative priors, i.e., the Normal and Laplace priors.

Figure 2 Demonstration of the applied priors: the Normal, Laplace, and hierarchical weakly informative priors (different Cauchy priors for the weights in different layers).
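As a concrete reference for the compared models, the following is a minimal sketch of a BNN of this shape with the hierarchical Cauchy priors, written against PyMC3; the hidden-layer size, the tanh activation, and the fixed noise scale are our assumptions, not details reported above.

```python
import pymc3 as pm

# Hypothetical shapes for the Digits task: 64 input pixels, 10 classes.
n_in, n_hidden, n_out = 64, 32, 10

def build_bnn(X, Y):
    """Three-layer BNN with the equation (1) likelihood; Y is one-hot."""
    with pm.Model() as model:
        # Hierarchical weakly informative priors: a wider Cauchy (scale 2.5)
        # on the input-to-hidden weights and a tighter one (scale 1.0) on the
        # hidden-to-output weights, both centered at 0 as in the text.
        w_in = pm.Cauchy('w_in', alpha=0.0, beta=2.5, shape=(n_in, n_hidden))
        w_out = pm.Cauchy('w_out', alpha=0.0, beta=1.0, shape=(n_hidden, n_out))
        act = pm.math.tanh(pm.math.dot(X, w_in))
        nn_out = pm.math.dot(act, w_out)
        # Gaussian likelihood of equation (1), with sigma assumed known.
        pm.Normal('y_obs', mu=nn_out, sigma=0.1, observed=Y)
    return model
```

The posterior can then be approximated with, e.g., `pm.sample()` for MCMC or `pm.fit(method='advi')` for variational inference; swapping the Cauchy priors for `pm.Normal` or `pm.Laplace` reproduces the two baselines.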

Priors                             Precision    Recall    F-1 Score
Laplace                            0.938        0.936     0.936
Normal                             0.945        0.946     0.945
Hierarchical Weakly Informative    0.966        0.961     0.965

Table 1 Performance comparison among the three priors on the Digits classification task (all values refer to test-set performance)
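For reference, the three metrics in Table 1 can be computed from test-set predictions; a brief sketch, assuming scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def report_metrics(y_true, y_pred):
    """Compute the Table 1 metrics from held-out labels and predictions."""
    # Macro-averaging treats the ten digit classes equally; the averaging
    # scheme is our assumption, as the text does not specify one.
    p = precision_score(y_true, y_pred, average='macro')
    r = recall_score(y_true, y_pred, average='macro')
    f = f1_score(y_true, y_pred, average='macro')
    print('Precision %.3f | Recall %.3f | F-1 %.3f' % (p, r, f))
```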

Conclusion

As briefly reviewed in this article, the Bayesian neural network can complement the vanilla neural network with induced model uncertainty. In contrast with ordinarily applied informative priors, such as the Normal and Laplace priors, the adoption of hierarchical weakly informative priors, i.e., stacked Cauchy priors, leads to more flexible model specification and consequently gives rise to superior discriminative performance, as reflected in an empirical experiment. However, current practical implementations of the BNN still suffer from prolonged and biased approximation of the intractable posterior.8 This calls for future research on improving the posterior inference process.

Acknowledgments

This study is partially supported by the Okawa Foundation for Information and Telecommunications, and the National Natural Science Foundation of China under Grant No. 61472117.

Conflict of interest

No conflict of interest exists.

References
