$ cabal install zlib

# Get a Brain

Let’s build a neural network from scratch. Our artificial brain should run on just the core Haskell system.

We follow the free online book
*Neural Networks and Deep Learning*
by Michael Nielsen. We will train a network to recognize handwritten
digits, specifically those in the MNIST
database of handwritten digits.

## Give me your digits

Let us first dispense with the unpleasantries. While our neural network shall run on the base install, to read the input data we must venture slightly further because the database files are gzip-compressed. Accordingly, we run:

We download four input files:

The MNIST page describes their format, which the following program can read:

```
import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as BS
import Data.Functor
import System.Random
render n = let s = " .:oO@" in s !! (fromIntegral n * length s `div` 256)
main = do
s <- decompress <$> BS.readFile "train-images-idx3-ubyte.gz"
l <- decompress <$> BS.readFile "train-labels-idx1-ubyte.gz"
n <- (`mod` 60000) <$> randomIO
putStr . unlines $
[(render . BS.index s . (n*28^2 + 16 + r*28 +)) <$> [0..27] | r <- [0..27]]
print $ BS.index l (n + 8)
```

A sample run:

:@@@@:. :O@@@@@@O .::@@@@@@@@@@: OO@@@@@@@@Oo@@@@ :OO@@@@@: O@@@ OO @@@o O@@@. .@@@O .@@@@: @@@@: :@@@: O@@. :@@O .@@@. O@@. o@@@ @@@O @@O @@@. :@O 7

## Ample Samples

Later on, we will need to sample from a normal distribution. Packages exist for this, but we stay within core Haskell by implementing the Box-Muller transform. Though our version is inefficient, it’s good enough for our purposes.

```
gauss :: Float -> IO Float
gauss stdev = do
x1 <- randomIO
x2 <- randomIO
return $ stdev * sqrt (-2 * log x1) * cos (2 * pi * x2)
```

## Weight Training

A single neuron takes in a bunch of floats from other neurons or directly from
the input data, multiplies each by a *weight*, and sums them. It adds this
total to a *bias*, passes it through the *activation function*, and outputs the
result.

This is easier to state with code. Let `ws`

be the weights, `b`

be the bias,
and `as`

be the inputs, and `f`

be the activation function. Then the output
is:

f (sum (zipWith (*) ws as) + b)

We choose the
*rectifier*
(*ReLU*) to be our activation function. This differs from Nielsen, who uses
the sigmoid function, and it means we should initialize our biases differently.

relu = max 0

Like Nielsen, our network has an input layer of 28^{2} = 784 raw numbers which
feed into a hidden layer of 30 neurons, which in turn feeds into an output
layer of 10 neurons. Each pair of adjacent layers has full mesh topology,
namely, every node of one layer is connected to every node of the next.

The input values lie in [0,1] and indicate the darkness of a pixel.

The output neurons give nonnegative values: we’ll train the nth neuron to output at least 1 if the digit is n, and 0 otherwise.

This scheme results in a peculiar cost function. If we had also chosen the
sigmoid function, then our outputs would be bounded above by 1 and we could
train our network to aim for exactly 1. But we’ve chosen `relu`

, whose outputs
can be arbitarily large. Inituitively, I feel it makes sense to interpret
anything that is at least 1 as a strong positive indication, and we should
train for 0 otherwise. I don’t know what the literature recommends.

For each layer of neurons, we store the biases in a list of floats `[Float]`

:
the ith float is the bias for the ith neuron. Similarly, we store the weights
in a two-dimensional matrix of floats `[[Float]]`

: the ith row holds the
weights of the inputs to the ith node. Since our neural network is a bunch of
layers, we store it as a list of biases and weights for each layer:
`[([Float], [[Float]])]`

.

We initialize all the biases to 1, and the weights to come from a normal distribution with mean 0 and standard deviation 0.01.

```
newBrain :: [Int] -> [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts
main = do
putStrLn "3 inputs, a hidden layer of 4 neurons, and 2 output neurons:"
print $ newBrain [3, 4, 2]
```

We’ll want to hang on to the values before they are fed to the
activation functions; these are called the *weighted inputs*. We compute an
entire layer of weighted inputs from the previous layer as follows:

```
zLayer :: [Float] -> ([Float], [[Float]]) -> [Float]
zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs
```

I’ve used `wvs`

to suggest "weight vectors", to remind us this is a list of
rows of a matrix.

We can run the activation function on the weighted inputs to compute the actual
outputs with `(relu <$>)`

. A suitable fold takes us from the start of the
neural network to the end:

```
feed :: [Float] -> [([Float], [[Float]])] -> [Float]
feed = foldl' (((relu <$>) . ) . zLayer)
-- Generate a neural network, and supply the inputs [0.1, 0.2, 0.3].
main = newBrain [3, 4, 2] >>= print . feed [0.1, 0.2, 0.3]
```

## Lesson Plan

We train the weights and biases using *gradient descent*, so we need the
derivative of the activation function (with its discontinuity filled in):

relu' | x < 0 = 0 | otherwise = 1

We use the same desired output vector format as Nielsen, that is,
if a given training example is labeled with the digit `i`

, then we set the ith
component to 1, and the others to 0.

However, since we consider any value greater than 1 to play the same role as 1, in our output vectors, our derivative of the cost with respect to the output value is:

```
dCost a y | y == 1 && a >= y = 0
| otherwise = a - y
```

Now, *backpropagation* requires the weighted inputs and activations of every
neuron, that is, the values just before and after we run the activation
function during feedforward. Because backpropagation goes through the layers
in reverse order, it helps to have these values in reverse order, which
leads to the following awkwardly-named utility function:

```
-- xv: vector of inputs
-- Returns a list of (weighted inputs, activations) of each layer,
-- from last layer to first.
revaz :: [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
revaz xv = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
zs' = zLayer av (bs, wms) in (((relu <$> zs'):avs), (zs':zs))) ([xv], [])
```

From these values we compute the output error `delta0`

, and backpropagate:

```
-- xv: vector of inputs
-- yv: vector of desired outputs
-- Returns list of (activations, deltas) of each layer in order.
deltas :: [Float] -> [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
deltas xv yv layers = let
(avs@(av:_), zv:zvs) = revaz xv layers
delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
f _ [] dvs = dvs
f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
zipWith (*) [(sum $ zipWith (*) row dv) | row <- wm] (relu' <$> zv)
```

We opt for *online learning*, in other words, we modify our weights and biases
immediately after processing a case, rather than batching the changes.
Our *learning rate* is 0.002:

```
eta = 0.002
descend av dv = zipWith (-) av ((eta *) <$> dv)
learn :: [Float] -> [Float] -> [([Float], [[Float]])] -> [([Float], [[Float]])]
learn xv yv layers = let (avs, dvs) = deltas xv yv layers
in zip (zipWith descend (fst <$> layers) dvs) $
zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av)) wvs dv)
(snd <$> layers) avs dvs
```

When trained on the first 10000 cases, our neural network correctly classifies about 80% of the test cases.

## The Numbers Game

The following program assumes the four input MNIST database files reside in the current directory. We randomly select a test case as an example and print it. We then show how training pays off with crude graphs that represent the neural network’s response to our example. Finally, we try all 10000 test cases and count the number we get right.

The neural network takes about 40 lines and compiles on a base Haskell system. The code to read the data, train the network, and print results at certain checkpoints takes about another 40 lines.

```
import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as BS
import Control.Monad
import Data.Functor
import Data.Ord
import Data.List
import System.Random
gauss scale = do
x1 <- randomIO
x2 <- randomIO
return $ scale * sqrt (-2 * log x1) * cos (2 * pi * x2)
newBrain :: [Int] -> IO [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts
relu = max 0
relu' x | x < 0 = 0
| otherwise = 1
zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs
feed = foldl' (((relu <$>) . ) . zLayer)
revaz xs = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
zs' = zLayer av (bs, wms) in ((relu <$> zs'):avs, zs':zs)) ([xs], [])
dCost a y | y == 1 && a >= y = 0
| otherwise = a - y
deltas xv yv layers = let
(avs@(av:_), zv:zvs) = revaz xv layers
delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
f _ [] dvs = dvs
f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
zipWith (*) [sum $ zipWith (*) row dv | row <- wm] (relu' <$> zv)
eta = 0.002
descend av dv = zipWith (-) av ((eta *) <$> dv)
learn xv yv layers = let (avs, dvs) = deltas xv yv layers
in zip (zipWith descend (fst <$> layers) dvs) $
zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av)) wvs dv)
(snd <$> layers) avs dvs
getImage s n = fromIntegral . BS.index s . (n*28^2 + 16 +) <$> [0..28^2 - 1]
getX s n = (/ 256) <$> getImage s n
getLabel s n = fromIntegral $ BS.index s (n + 8)
getY s n = fromIntegral . fromEnum . (getLabel s n ==) <$> [0..9]
render n = let s = " .:oO@" in s !! (fromIntegral n * length s `div` 256)
main = do
[trainI, trainL, testI, testL] <- mapM ((decompress <$>) . BS.readFile)
[ "train-images-idx3-ubyte.gz"
, "train-labels-idx1-ubyte.gz"
, "t10k-images-idx3-ubyte.gz"
, "t10k-labels-idx1-ubyte.gz"
]
b <- newBrain [784, 30, 10]
n <- (`mod` 10000) <$> randomIO
putStr . unlines $
take 28 $ take 28 <$> iterate (drop 28) (render <$> getImage testI n)
let
example = getX testI n
bs = scanl (foldl' (\b n -> learn (getX trainI n) (getY trainL n) b)) b [
[ 0.. 999],
[1000..2999],
[3000..5999],
[6000..9999]]
smart = last bs
cute d score = show d ++ ": " ++ replicate (round $ 70 * min 1 score) '+'
bestOf = fst . maximumBy (comparing snd) . zip [0..]
forM_ bs $ putStrLn . unlines . zipWith cute [0..9] . feed example
putStrLn $ "best guess: " ++ show (bestOf $ feed example smart)
let guesses = bestOf . (\n -> feed (getX testI n) smart) <$> [0..9999]
let answers = getLabel testL <$> [0..9999]
putStrLn $ show (sum $ fromEnum <$> zipWith (==) guesses answers) ++
" / 10000"
```

Let’s walk through a sample run. The program first randomly picks a reasonably legible "9'. Without any training, the 9th output is large, but sadly, so is every other output.

.o@@@@@: o@@@@@@@@ o@@@@@@@@@: .@@@O. .@@@. :@@: :@@@@ :@@@@@@@@@@ O@@@@@@@@: :@@@@@@O O@@o @@@ :@@O O@@: .@@@ O@@o :@@@. .@@@o o@@O @@@: .@@@ o@: 0: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 4: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 5: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 6: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 7: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 8: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 9: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

After 1000 iterations, all outputs are lower. Trouble is brewing apparently: the 9th output is tiny, and there are 6 relatively large outputs.

0: ++++++ 1: +++++ 2: 3: +++++++++ 4: +++++++++ 5: ++ 6: +++++++++++ 7: ++++++++++++ 8: + 9: +

After 2000 more iterations, we see marked improvement: now the most likely contenders are 7 and 9, though it likes 7 more than 9.

0: 1: +++++ 2: 3: +++ 4: +++++++++++ 5: +++++++++++ 6: 7: ++++++++++++++++++++++++++++++++++ 8: +++++ 9: +++++++++++++++++++

By the 10000th iteration, the 9th output has caught up to the 7th. In fact, the 9th output is larger, but the graph is too coarse to show this. We wind up classifying our example correctly.

Out of the 10000 test cases, we classify 8467 cases correctly.

0: 1: 2: 3: +++++++ 4: 5: 6: 7: +++++++++++++++++++++++++++++++++ 8: 9: +++++++++++++++++++++++++++++++++ best guess: 9 8467 / 10000

## Beginner’s Luck

This project was guided by expedience. I wanted to code a neural network from scratch, but I also wanted it running as soon as possible, so I cut corners.

I chose online learning so I could avoid code for batching deltas. My code also trains on the data in the order of appearance.

I chose ReLU because I heard it can converge faster that the sigmoid. This meant I was unable to mimic Nielsen’s code for picking weights and biases, because a ReLU neuron that is initially likely to output zero will be hard to train. Luckily, a brief web search revealed that we can set all the initial biases to 1. As for the weights, the normal distribution around 0 with standard deviation 0.01 felt about right.

I stored matrices in Haskell lists: great for economy of expression, but not so great for performance. To compensate, the code only trains for one session, and on only one sixth of the data. To further alleviate my impatience, my program gives updates on its progress.

Despite so many compromises, the above reliably produces neural networks that correctly classify roughly 75% to 85% of the test cases, taking about a minute to run on my laptop.

## Haste Makes Sloth

I obtained better results by randomizing the order of the training cases and feeding them all to the network, though sometimes the results were worse. One run produced a neural net scoring 9202 out of 10000.

I made a webpage to show this off, complete with a crude editor.

I had hoped to code in Haskell exclusively, but the Haste compiler introduced too much overhead. I resorted to moving both the neural net feedforward routine and the sample image data directly into JavaScript.

The source is available in the following Git repository.

*blynn@cs.stanford.edu*💡