There have been multiple waves of Artificial Intelligence (AI) research. The first wave began when Alan Turing posed the question "Can machines think?" in his seminal 1950 paper,1 ushering in the quest for AI. This quest followed many paths. Prof. Frank Rosenblatt proposed and developed a single-layer artificial neural net called the "Perceptron," hardware that can now be found in the Smithsonian museum in Washington DC; unfortunately, he passed away too early to advance the work further. In response to the perceptron, MIT Prof. Marvin Minsky advocated a rule-based ("If…, Then…") approach for computers, which set the direction for AI funding and progress.2 The second wave began in March 2016, when Google DeepMind's AlphaGo beat the Korean Go champion Lee Sedol 4-1; AlphaGo used supervised deep learning with labeled training data. Since 2017, Dr. Harold Szu and his collaborators have systematically developed a learning system that emulates three intelligences found in the brain: logical IQ, emotional IQ, and claustrum IQ (cf. the compilation book from the open journal MedCrave Bionics & Biomechanics). The following is a new contribution answering an important question raised by the two co-authors: what is the essential difference between Artificial Neural Networks trained by supervised learning via Least Mean Squares and those trained by unsupervised learning via Minimum Free Energy? The quick answer: no difference; but the devil, if any, is in the details.
We believe that the machine-learning ability of the n-th wave of Artificial Intelligence (AI) will eventually approach the Darwinian animal-survival level of Natural Intelligence (NI) as n > 4. We observe that animals satisfying the sufficient conditions of "having homeostatic brains at constant temperature regardless of the external environment, and being equipped with the power of paired sensors" shall exhibit NI at the survival level. For example, Homo sapiens have 5 pairs of sensors: two eyes, two ears, two nostrils, the two sides of the tongue, and two sensing hands. We believe this is an adaptive trait for fast pre-processing for survival, i.e., "when the sensors agree, there is a signal; when the sensors disagree, it is noise." In this short communication, we shall show that NI follows a minimum free energy cost function for its learning rule, derived from thermodynamics, rather than the classical least mean squares (LMS) cost function, derived from statistics, which has previously organized many machine-learning systems. The performance cost function will be the essential difference. We begin with a brief numerical aside on the paired-sensor heuristic, followed by the summary theorem.
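The following minimal sketch is our own illustration (the toy sinusoidal signal, the noise level, and the sensor model are assumptions, not taken from the text): the half-sum of two co-located sensors reinforces whatever they agree on, while the half-difference cancels the common signal and isolates the disagreement, which is treated as noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# A common signal seen by both paired sensors, plus independent noise in each.
t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 5 * t)
left = signal + 0.3 * rng.standard_normal(t.size)
right = signal + 0.3 * rng.standard_normal(t.size)

# "When the sensors agree, there is a signal": the half-sum keeps the common
# component and averages down the independent noise.
agreement = 0.5 * (left + right)

# "When the sensors disagree, it is noise": the half-difference cancels the
# common signal and keeps only the disagreement (a noise estimate).
disagreement = 0.5 * (left - right)

print("RMS error of one sensor alone  :", np.sqrt(np.mean((left - signal) ** 2)))
print("RMS error of the agreement     :", np.sqrt(np.mean((agreement - signal) ** 2)))
print("RMS of the disagreement (noise):", np.sqrt(np.mean(disagreement ** 2)))
```

The agreement channel has a lower root-mean-square error than either single sensor, which is the fast, label-free pre-processing advantage claimed for paired sensors.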
Theorem of minimum free energy for natural intelligence
Unsupervised learning based on Minimum Free Energy may be derived from the first two laws of thermodynamics. The second law states that the change of heat energy is proportional to the change of Boltzmann entropy, the proportionality constant being the Kelvin absolute temperature.
$\Delta Q = T\,\Delta S$  (1)
Then we can begin with Ludwig Boltzmann's definition of entropy, as formulated by Max Planck:3
$S = k_B \log W$  (2)
(3)
We define the free energy of the brain, or the useful energy of the brain, as the total energy less the thermal energy:
$H = E - TS$  (4)
Derivation
- From the first law of thermodynamics, energy is conserved between the heat energy of the environmental thermal reservoir, kept at the constant homeostatic temperature, and the internal energy of the brain. Thus, when we integrate and drop the integration constant, the change of the brain's internal energy is tracked by the heat exchanged with the reservoir.
- We have thereby arrived at the Minimum Free Energy principle: at constant brain temperature the free energy (4) can only decrease, so its minimum serves as the learning cost function, where the temperature is the constant homeostatic temperature of the brain.
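For completeness, here is a minimal sketch of one standard route from the two laws to the Minimum Free Energy principle; the intermediate steps are our own spelling-out, using the Clausius form of the second law and the notation of Eqs. (1)–(4), with $T_o$ denoting the constant homeostatic brain temperature:

$\Delta S \ge \dfrac{\Delta Q}{T_o}$ (second law; equality, as in Eq. (1), holds for reversible exchange),

$\Delta E = \Delta Q$ (first law, assuming no mechanical work is exchanged),

hence $\Delta H = \Delta E - T_o\,\Delta S \le 0$ at constant $T_o$.

Every spontaneous change therefore lowers the free energy $H = E - T_oS$, which is why its minimum can serve as the cost function for learning.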
Now we must move to the anatomy of brain neural physiology. Our brains have approximately 10 billion neurons, which have sigmoid-threshold output firing rates (while the sigmoid is approximately linear near the threshold, it saturates nonlinearly away from the threshold). Neurons are represented by the following model (Figure 1):
$v_i = \sigma(u_i) = \dfrac{1}{1+\exp(-u_i/k_B T_o)}$  (5)
and approximately 100 billion neuroglial cells working in the glymphatic system of our brains. Our unsupervised learning rule requires their symbiotic collaboration as follows (Figure 2). We shall mathematically introduce A.M. Lyapunov's control theory of monotonic convergence4,5 as a constraint on our model of brain free energy.
Appendix A: Derivation of Sigmoid Logic
The two neuronal input/output (I/O) states must be normalized, and the normalization turns out to be the sigmoid logic. This is consistently obtained from the canonical probability of the usable brain energy as follows:

$\sigma(u_i) = \dfrac{\exp(u_i/k_B T_o)}{1 + \exp(u_i/k_B T_o)} = \dfrac{1}{1 + \exp(-u_i/k_B T_o)}$

where $u_i$ is the net input of the i-th neuron relative to its firing threshold, $k_B$ is the Boltzmann constant, and $T_o$ is the constant homeostatic brain temperature, for the two-state (firing versus resting) canonical ensemble.
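As a quick numerical check that the two-state canonical (Boltzmann) probability is exactly the sigmoid logic, here is a small sketch; the temperature scale kT and the input range are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def canonical_firing_probability(u, kT=1.0):
    """Two-state canonical probability: firing state at energy -u, resting state at energy 0."""
    z_fire = np.exp(u / kT)   # Boltzmann weight of the firing state
    z_rest = 1.0              # Boltzmann weight of the resting state
    return z_fire / (z_fire + z_rest)

def sigmoid(u, kT=1.0):
    """The sigmoid logic of Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-u / kT))

u = np.linspace(-5.0, 5.0, 11)
# The two expressions agree to machine precision: normalizing the two I/O
# states by the canonical probability yields the sigmoid.
print(np.allclose(canonical_firing_probability(u), sigmoid(u)))  # True
```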
Let us consider the calcium ions used as communication vehicles among neurons: repelling one another, they cross the axon road like ducks walking and quacking in a line-up, ushered along by the housekeeping neuroglial cells, which are roughly ten times more numerous and ten times smaller.
The streaming term is set to zero at the wave front of the calcium-ion diffusion, and we have thus derived Albert Einstein's diffusion equation (a familiar macroscopic example is the smoke diffusion of the infamous San Francisco fire, cf.4):

$\dfrac{\partial \rho}{\partial t} = D\,\nabla^{2}\rho$  (6)

where $\rho$ is the calcium-ion concentration and $D$ is the diffusion coefficient.
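To illustrate Eq. (6) numerically, the following is a minimal one-dimensional finite-difference sketch of how an initial pulse of calcium concentration spreads; the grid size, diffusion coefficient, and time step are illustrative assumptions chosen only to satisfy the explicit-scheme stability condition.

```python
import numpy as np

# Illustrative parameters (assumptions, not from the paper); dt <= dx**2 / (2*D) for stability.
n, dx, D, dt, steps = 101, 1.0, 1.0, 0.2, 500

rho = np.zeros(n)
rho[n // 2] = 1.0  # an initial pulse of calcium-ion concentration

for _ in range(steps):
    # Explicit finite-difference form of d(rho)/dt = D * d^2(rho)/dx^2,
    # with reflecting (zero-flux) boundaries.
    lap = np.zeros_like(rho)
    lap[1:-1] = rho[2:] - 2.0 * rho[1:-1] + rho[:-2]
    lap[0] = rho[1] - rho[0]
    lap[-1] = rho[-2] - rho[-1]
    rho = rho + dt * D * lap / dx**2

# The pulse spreads out while the total amount of calcium is conserved.
print("total concentration:", rho.sum())
print("peak concentration :", rho.max())
```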
If and only if the following learning rule holds will the learning converge exponentially:
$\dfrac{dW_{ij}}{dt} = -\dfrac{\partial H}{\partial W_{ij}}$  (7)
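The role of Lyapunov's monotonic-convergence theorem invoked above can be spelled out with a short chain-rule sketch (our own spelling-out, assuming the free energy depends on time only through the synaptic weights):

$\dfrac{dH}{dt} = \sum_{ij}\dfrac{\partial H}{\partial W_{ij}}\,\dfrac{dW_{ij}}{dt} = -\sum_{ij}\left(\dfrac{\partial H}{\partial W_{ij}}\right)^{2} \le 0$

so the free energy acts as a Lyapunov function: it decreases monotonically under the learning rule (7) and is stationary only when every gradient component vanishes.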
We introduce the dendritic sum of the output firing rates as follows:
$u_i = \sum_j W_{ij}\,v_j$  (8)
(9)
Canadian neurophysiologist Donald O. Hebb observed decades ago that the rule for changing the synaptic weight matrix is "neurons that fire together wire together,"6
which, in our model, defines the neuroglial cells as supplying the negative slope of the free energy, matching Hebb's product of firing rates:

$\dfrac{dW_{ij}}{dt} = -\dfrac{\partial H}{\partial W_{ij}} = v_i\,v_j$  (10)
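A minimal code sketch of what such a Hebbian, free-energy-descending update looks like (our illustration; the network size, learning rate, and the Hopfield-style quadratic free energy are assumptions made only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

n_neurons = 8
v = rng.random(n_neurons)             # output firing rates of Eq. (5), in [0, 1]
W = np.zeros((n_neurons, n_neurons))  # synaptic weight matrix
eta = 0.1                             # learning rate (assumed)

def free_energy(W, v):
    """Hopfield-style quadratic free energy H = -1/2 v^T W v (an assumed form)."""
    return -0.5 * v @ W @ v

# Hebbian update: "neurons that fire together wire together".
# For the quadratic H above, -dH/dW_ij is proportional to v_i * v_j, so the
# Hebbian outer product points downhill along the free-energy slope of Eq. (7).
for _ in range(20):
    W += eta * np.outer(v, v)

print("free energy after Hebbian learning:", free_energy(W, v))  # below its initial value of 0
```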
In the standard PDP book,4 Prof. Geoffrey Hinton (of the University of Toronto, Canada, and later also of Google in Silicon Valley; cf. Yoshua Bengio7) gives Backward Error Propagation supervised learning as:
so that the learning change of the synaptic weight matrix follows the negative slope of the error energy,
$\dfrac{dW_{ij}}{dt} = -\dfrac{\partial E_{LMS}}{\partial W_{ij}}$  (12)
(13a)
(13b)
This one and the same learning rule governs our brain models, whether as unsupervised Minimum Free Energy (MFE) or supervised least mean squares (LMS) learning.
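To underline that only the cost function differs while the sequential update itself is shared, here is a minimal sketch (ours; the toy one-neuron model, the data, the learning rate, and in particular the unlabeled surrogate cost standing in for the free energy are all assumptions) in which the very same update W <- W - eta * dCost/dW is driven either by a labeled LMS error or by an unlabeled cost:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

x = rng.random((50, 3))                                   # toy inputs (assumed)
d = (x @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # toy labels, used only by LMS

def lms_cost(W):
    """Supervised least-mean-squares error energy: requires the labels d."""
    v = sigmoid(x @ W)
    return 0.5 * np.mean((d - v) ** 2)

def unlabeled_cost(W):
    """Unsupervised surrogate cost (an assumed stand-in for the free energy):
    the mean output entropy, which uses no labels at all."""
    v = np.clip(sigmoid(x @ W), 1e-9, 1 - 1e-9)
    return np.mean(-v * np.log(v) - (1 - v) * np.log(1 - v))

def numerical_grad(cost, W, h=1e-6):
    """Finite-difference gradient, so the identical update serves either cost."""
    g = np.zeros_like(W)
    for k in range(W.size):
        Wp, Wm = W.copy(), W.copy()
        Wp[k] += h
        Wm[k] -= h
        g[k] = (cost(Wp) - cost(Wm)) / (2.0 * h)
    return g

eta = 1.0
for cost, name in [(lms_cost, "supervised LMS"), (unlabeled_cost, "unsupervised surrogate")]:
    W = 0.1 * rng.standard_normal(3)
    for _ in range(300):
        W -= eta * numerical_grad(cost, W)   # one and the same sequential learning rule
    print(name, "-> final cost:", round(float(cost(W)), 4))
```

Only the cost function (and hence whether labels are consulted) changes between the two runs; the update rule is literally the same line of code, which is the sense in which the devil is only in the details.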
We have reviewed the fundamentals of artificial neural network (ANN) modeling of Biological Neural Networks (BNN), which may yield an ANN that can compete on the basis of Charles Darwin's "survival of the fittest." We found that for weakly nonlinear systems there is only one sequential learning rule. Thus Natural Intelligence (NI) and Artificial Intelligence (AI) share a similar learning rule even though they are derived from different origins: the former from thermodynamics and the latter from statistics. This might answer the question as posed by Prof. Yann LeCun7 of the NYU Courant Institute in his YouTube lectures, or by Prof. Andrew Ng of Stanford, who teaches Deep Learning at the commercial online school Coursera. The only difference between the methods is when to apply each: with a labeled dataset, use supervised learning; with an unlabeled dataset, use unsupervised learning.
Funding was received from Parian LLC.
None.