Input-Output Maps are Strongly Biased Towards Simple Outputs

About this paper

Input-Output Maps are Strongly Biased Towards Simple Outputs, Kamaludin Dingle, Chico Q. Camargo and Ard A. Louis, Nature Communications 9, 761 (2018).

Notes

On Saturday I went to my alma mater’s Morning of Theoretical Physics, which was actually on “the Physics of Life” (or Active Matter as theoretical physicists seem to call it). Professor Louis presented this work in relation to RNA folding, but it’s the relevance to neural networks that interested me.

The assertion made in this paper is that if you have a map of a lot of inputs to a (significantly smaller, but still large) collection of outputs, the outputs are not equally likely to occur. Instead, the simpler outputs are preferentially selected.

A quick demonstration of the intuition behind this argument: imagine randomly assembling a fixed number of lego bricks into a shape. Some particular unique shape with weird branches can only be formed by an individual configuration of the bricks. On the other hand, a simpler shape with large degree of symmetry can be formed from different configurations. Therefore the process of randomly selecting a shape will preferentially pick the symmetric shape.

The complexity metric that’s useful here is called Kolmogorov complexity, and roughly speaking it’s a measure of the length of a Universal Turing Machine program needed to describe the shape (or other object). Consider strings. A random string of 40 characters, say a56579418dc7908ce5f0b24b05c78e085cb863dc, may not be representable in any more efficient way than its own characters. But the string aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, which is 40 characters, can be written with the Python program:

'a'*40

which is seven characters long including the newline. Assuming eight bits per character, the random string needs 40*8=320 bits to be represented. The forty as can be found by actually finding the Python program, which is 56 bits. The assertion is that a “find a program that generates character sequences of length 40” algorithm (with some particular assumptions in place) will find the a56579… string with probablity 2^-320, but will find the aaa… string with probability 2^-56, which is much, much more likely.

In fact, this paper shows that the upper and lower bounds on the probability of a map yielding a particular output for random input are both dependent on the Kolmogorov complexity of the output. It happens that due to the halting problem, you can’t calculate Kolmogorov complexity for arbitrary outputs. But you can approximate it, for example using Lempel-Ziv complexity (i.e. the length of the input to a lossless compression algorithm needed to recover the same output).

Where does this meet neural networks? In a preprint of a paper selected for the ICLR 2019, with two of the same authors as this paper. Here, we find that a neural network can be thought of as a map between the inputs and weights to a function that does the inference.

Typically neural network architectures have lots more parameters than there are points in the training set, so how is it that they manage to generalise so well? And why is it that different training techniques, including stochastic gradient descent and genetic algorithms, result in networks with comparable performance?

The authors argue that a generalising function is much less complex than an overfitting function, using the same idea of complexity shown above. And that as the training process for the network is sampling the map of inputs->functions, it is more likely to hit on the simple functions than the complex ones. Therefore the fact that neural networks generalise well is intrinsic to the way they select functions from a wealth of possibilities.

My hope is that this is a step toward a predictive theory of neural network architectures. That by knowing the something of the function we want to approximate, we can set a lower bound on the complexity of a network needed to discover sufficiently generalisable functions. This would be huge for both reducing the training effort needed for networks, and for reducing the evaluation runtime. That, in turn, would make it easier to use pretrained networks on mobile and IoT devices.