Get a Brain

Let’s build a neural network from scratch. Our artificial brain should run on just the core Haskell system.

We follow the free online book Neural Networks and Deep Learning by Michael Nielsen. We will train a network to recognize handwritten digits, specifically those in the MNIST database of handwritten digits.

Give me your digits

Let us first dispense with the unpleasantries. While our neural network shall run on the base install, to read the input data we must venture slightly further because the database files are gzip-compressed. Accordingly, we run:

$ cabal install zlib

We download four input files: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, and t10k-labels-idx1-ubyte.gz.

The MNIST page describes their format, which the following program can read:

import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as BS
import Data.Functor
import System.Random

render n = let s = " .:oO@" in s !! (fromIntegral n * length s `div` 256)

main = do
  s <- decompress <$> BS.readFile "train-images-idx3-ubyte.gz"
  l <- decompress <$> BS.readFile "train-labels-idx1-ubyte.gz"
  n <- (`mod` 60000) <$> randomIO
  putStr . unlines $
    [(render . BS.index s . (n*28^2 + 16 + r*28 +)) <$> [0..27] | r <- [0..27]]
  print $ BS.index l (n + 8)

A sample run:

             :@@@@:.
           :O@@@@@@O
       .::@@@@@@@@@@:
     OO@@@@@@@@Oo@@@@
    :OO@@@@@:    O@@@
       OO        @@@o
                O@@@.
               .@@@O
              .@@@@:
              @@@@:
             :@@@:
             O@@.
            :@@O
           .@@@.
           O@@.
          o@@@
          @@@O
          @@O
          @@@.
          :@O

7

Ample Samples

Later on, we will need to sample from a normal distribution. Packages exist for this, but we stay within core Haskell by implementing the Box-Muller transform. Though our version is inefficient, it’s good enough for our purposes.

-- Box-Muller transform: a normal sample with mean 0 and the given standard
-- deviation. We discard the second (sine) variate, hence the inefficiency.
gauss :: Float -> IO Float
gauss stdev = do
  x1 <- randomIO
  x2 <- randomIO
  return $ stdev * sqrt (-2 * log x1) * cos (2 * pi * x2)

Weight Training

A single neuron takes in a bunch of floats from other neurons or directly from the input data, multiplies each by a weight, and sums them. It adds this total to a bias, passes it through the activation function, and outputs the result.

This is easier to state with code. Let ws be the weights, b the bias, as the inputs, and f the activation function. Then the output is:

f (sum (zipWith (*) ws as) + b)
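
For concreteness, here is a toy evaluation with made-up numbers (the rectifier activation, max 0, is introduced just below):

-- A made-up single neuron: weights [0.5, -1.0], inputs [2, 3], bias 0.1,
-- and max 0 as the activation function.
neuronExample :: Float
neuronExample = f (sum (zipWith (*) ws as) + b) where
  f  = max 0
  ws = [0.5, -1.0]
  as = [2, 3]
  b  = 0.1
-- = max 0 (0.5*2 + (-1.0)*3 + 0.1) = max 0 (-1.9) = 0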

We choose the rectifier (ReLU) to be our activation function. This differs from Nielsen, who uses the sigmoid function, and it means we should initialize our biases differently.

relu = max 0

Like Nielsen, our network has an input layer of 28^2 = 784 raw numbers which feed into a hidden layer of 30 neurons, which in turn feeds into an output layer of 10 neurons. Each pair of adjacent layers has full mesh topology, namely, every node of one layer is connected to every node of the next.

The input values lie in [0,1] and indicate the darkness of a pixel.

The output neurons give nonnegative values: we’ll train the nth neuron to output at least 1 if the digit is n, and 0 otherwise.

This scheme results in a peculiar cost function. Had we chosen the sigmoid function like Nielsen, our outputs would be bounded above by 1 and we could train our network to aim for exactly 1. But we’ve chosen relu, whose outputs can be arbitrarily large. Intuitively, I feel it makes sense to interpret anything that is at least 1 as a strong positive indication, and to train for 0 otherwise. I don’t know what the literature recommends.

For each layer of neurons, we store the biases in a list of floats [Float]: the ith float is the bias for the ith neuron. Similarly, we store the weights in a two-dimensional matrix of floats [[Float]]: the ith row holds the weights of the inputs to the ith node. Since our neural network is a bunch of layers, we store it as a list of biases and weights for each layer: [([Float], [[Float]])].
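
For instance, a hypothetical network with 2 inputs, a hidden layer of 3 neurons, and 1 output neuron could be written out as the value below; the weights are made up, and tinyBrain is merely a name for this illustration.

tinyBrain :: [([Float], [[Float]])]
tinyBrain =
  [ ([1, 1, 1], [ [ 0.01, -0.02]      -- hidden layer: 3 neurons,
                , [ 0.00,  0.03]      -- each with 2 incoming weights
                , [-0.01,  0.02] ])
  , ([1], [[0.02, -0.01, 0.01]])      -- output layer: 1 neuron, 3 incoming weights
  ]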

We initialize all the biases to 1, and the weights to come from a normal distribution with mean 0 and standard deviation 0.01.

newBrain :: [Int] -> IO [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
  zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts

main = do
  putStrLn "3 inputs, a hidden layer of 4 neurons, and 2 output neurons:"
  newBrain [3, 4, 2] >>= print

We’ll want to hang on to the values before they are fed to the activation functions; these are called the weighted inputs. We compute an entire layer of weighted inputs from the previous layer as follows:

zLayer :: [Float] -> ([Float], [[Float]]) -> [Float]
zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs

I’ve used wvs to suggest "weight vectors", to remind us this is a list of rows of a matrix.
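
As a sanity check with made-up numbers, take two inputs and two neurons whose weight rows form the identity matrix:

-- zLayer [1, 2] ([0.1, 0.2], [[1, 0], [0, 1]])
--   = zipWith (+) [0.1, 0.2] [1*1 + 2*0, 1*0 + 2*1]
--   = [1.1, 2.2]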

We can run the activation function on the weighted inputs to compute the actual outputs with (relu <$>). A suitable fold takes us from the start of the neural network to the end:

feed :: [Float] -> [([Float], [[Float]])] -> [Float]
feed = foldl' (((relu <$>) . ) . zLayer)

-- Generate a neural network, and supply the inputs [0.1, 0.2, 0.3].
main = newBrain [3, 4, 2] >>= print . feed [0.1, 0.2, 0.3]

Lesson Plan

We train the weights and biases using gradient descent, so we need the derivative of the activation function (with its discontinuity filled in):

relu' x | x < 0     = 0
        | otherwise = 1

We use the same desired output vector format as Nielsen, that is, if a given training example is labeled with the digit i, then we set the ith component to 1, and the others to 0.

However, since we consider any output value of at least 1 to play the same role as a 1 in the desired output vector, our derivative of the cost with respect to an output value is:

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y
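
For example, if a training case is labeled 3, then the desired output vector is [0,0,0,1,0,0,0,0,0,0]. A few made-up activations show how dCost behaves:

-- dCost 1.7 1  ==  0      -- correct neuron already at least 1: no penalty
-- dCost 0.4 1  == -0.6    -- correct neuron too low: descent pushes it up
-- dCost 0.2 0  ==  0.2    -- wrong neuron above 0: descent pushes it down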

Now, backpropagation requires the weighted inputs and activations of every neuron, that is, the values just before and after we run the activation function during feedforward. Because backpropagation goes through the layers in reverse order, it helps to have these values in reverse order, which leads to the following awkwardly-named utility function:

-- xv: vector of inputs
-- Returns a list of (weighted inputs, activations) of each layer,
-- from last layer to first.
revaz :: [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
revaz xv = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
  zs' = zLayer av (bs, wms) in ((relu <$> zs'):avs, zs':zs)) ([xv], [])

From these values we compute the output error delta0, and backpropagate:

-- xv: vector of inputs
-- yv: vector of desired outputs
-- Returns (activations of each layer, deltas of each layer), both in order;
-- the activations list begins with the input vector itself.
deltas :: [Float] -> [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
deltas xv yv layers = let
  (avs@(av:_), zv:zvs) = revaz xv layers
  delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
      zipWith (*) [(sum $ zipWith (*) row dv) | row <- wm] (relu' <$> zv)

We opt for online learning; in other words, we modify our weights and biases immediately after processing each training case, rather than batching the changes. Our learning rate is 0.002:

eta = 0.002

descend av dv = zipWith (-) av ((eta *) <$> dv)

learn :: [Float] -> [Float] -> [([Float], [[Float]])] -> [([Float], [[Float]])]
learn xv yv layers = let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
    zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av)) wvs dv)
      (snd <$> layers) avs dvs
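
As a quick check of descend with our eta of 0.002, each parameter takes a small step against its gradient component:

-- descend [1, 2, 3] [10, 10, 10]
--   = zipWith (-) [1, 2, 3] [0.02, 0.02, 0.02]
--   = [0.98, 1.98, 2.98]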

When trained on the first 10000 cases, our neural network correctly classifies about 80% of the test cases.

The Numbers Game

The following program assumes the four input MNIST database files reside in the current directory. We randomly select a test case as an example and print it. We then show how training pays off with crude graphs that represent the neural network’s response to our example. Finally, we try all 10000 test cases and count the number we get right.

The neural network takes about 40 lines and compiles on a base Haskell system. The code to read the data, train the network, and print results at certain checkpoints takes about another 40 lines.

import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as BS

import Control.Monad
import Data.Functor
import Data.Ord
import Data.List
import System.Random

gauss scale = do
  x1 <- randomIO
  x2 <- randomIO
  return $ scale * sqrt (-2 * log x1) * cos (2 * pi * x2)

newBrain :: [Int] -> IO [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
  zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts

relu = max 0
relu' x | x < 0      = 0
        | otherwise  = 1

zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs

feed = foldl' (((relu <$>) . ) . zLayer)

revaz xs = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
  zs' = zLayer av (bs, wms) in ((relu <$> zs'):avs, zs':zs)) ([xs], [])

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y

deltas xv yv layers = let
  (avs@(av:_), zv:zvs) = revaz xv layers
  delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
      zipWith (*) [sum $ zipWith (*) row dv | row <- wm] (relu' <$> zv)

eta = 0.002

descend av dv = zipWith (-) av ((eta *) <$> dv)

learn xv yv layers = let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
    zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av)) wvs dv)
      (snd <$> layers) avs dvs

getImage s n = fromIntegral . BS.index s . (n*28^2 + 16 +) <$> [0..28^2 - 1]
getX     s n = (/ 256) <$> getImage s n
getLabel s n = fromIntegral $ BS.index s (n + 8)
getY     s n = fromIntegral . fromEnum . (getLabel s n ==) <$> [0..9]

render n = let s = " .:oO@" in s !! (fromIntegral n * length s `div` 256)

main = do
  [trainI, trainL, testI, testL] <- mapM ((decompress  <$>) . BS.readFile)
    [ "train-images-idx3-ubyte.gz"
    , "train-labels-idx1-ubyte.gz"
    ,  "t10k-images-idx3-ubyte.gz"
    ,  "t10k-labels-idx1-ubyte.gz"
    ]
  b <- newBrain [784, 30, 10]
  n <- (`mod` 10000) <$> randomIO
  putStr . unlines $
    take 28 $ take 28 <$> iterate (drop 28) (render <$> getImage testI n)

  let
    example = getX testI n
    bs = scanl (foldl' (\b n -> learn (getX trainI n) (getY trainL n) b)) b [
     [   0.. 999],
     [1000..2999],
     [3000..5999],
     [6000..9999]]
    smart = last bs
    cute d score = show d ++ ": " ++ replicate (round $ 70 * min 1 score) '+'
    bestOf = fst . maximumBy (comparing snd) . zip [0..]

  forM_ bs $ putStrLn . unlines . zipWith cute [0..9] . feed example

  putStrLn $ "best guess: " ++ show (bestOf $ feed example smart)

  let guesses = bestOf . (\n -> feed (getX testI n) smart) <$> [0..9999]
  let answers = getLabel testL <$> [0..9999]
  putStrLn $ show (sum $ fromEnum <$> zipWith (==) guesses answers) ++
    " / 10000"

Let’s walk through a sample run. The program first randomly picks a reasonably legible "9". Without any training, the 9th output is large, but sadly, so is every other output.

            .o@@@@@:
           o@@@@@@@@
          o@@@@@@@@@:
         .@@@O. .@@@.
         :@@:  :@@@@
         :@@@@@@@@@@
          O@@@@@@@@:
           :@@@@@@O
               O@@o
               @@@
              :@@O
              O@@:
             .@@@
             O@@o
            :@@@.
           .@@@o
           o@@O
           @@@:
          .@@@
           o@:

0: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
7: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
8: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
9: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

After 1000 iterations, all outputs are lower. Trouble is apparently brewing: the 9th output is tiny, and there are 6 relatively large outputs.

0: ++++++
1: +++++
2:
3: +++++++++
4: +++++++++
5: ++
6: +++++++++++
7: ++++++++++++
8: +
9: +

After 2000 more iterations, we see marked improvement: now the most likely contenders are 7 and 9, though it likes 7 more than 9.

0:
1: +++++
2:
3: +++
4: +++++++++++
5: +++++++++++
6:
7: ++++++++++++++++++++++++++++++++++
8: +++++
9: +++++++++++++++++++

By the 10000th iteration, the 9th output has caught up to the 7th. In fact, the 9th output is larger, but the graph is too coarse to show this. We wind up classifying our example correctly.

Out of the 10000 test cases, we classify 8467 cases correctly.

0:
1:
2:
3: +++++++
4:
5:
6:
7: +++++++++++++++++++++++++++++++++
8:
9: +++++++++++++++++++++++++++++++++

best guess: 9
8467 / 10000

Beginner’s Luck

This project was guided by expedience. I wanted to code a neural network from scratch, but I also wanted it running as soon as possible, so I cut corners.

I chose online learning so I could avoid code for batching deltas. My code also trains on the data in the order of appearance.

I chose ReLU because I heard it can converge faster than the sigmoid. This meant I was unable to mimic Nielsen’s code for picking weights and biases, because a ReLU neuron that is initially likely to output zero will be hard to train. Luckily, a brief web search revealed that we can set all the initial biases to 1. As for the weights, the normal distribution around 0 with standard deviation 0.01 felt about right.

I stored matrices in Haskell lists: great for economy of expression, but not so great for performance. To compensate, the code only trains for one session, and on only one sixth of the data. To further alleviate my impatience, my program gives updates on its progress.

Despite so many compromises, the above reliably produces neural networks that correctly classify roughly 75% to 85% of the test cases, taking about a minute to run on my laptop.

Haste Makes Sloth

I obtained better results by randomizing the order of the training cases and feeding them all to the network, though sometimes the results were worse. One run produced a neural net scoring 9202 out of 10000.
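
For reference, one minimal way to randomize the training order while staying within the base install plus System.Random is to tag each index with a random key and sort on the keys. This is only a sketch of the idea, not the code behind the run above; shuffledIndices is a name invented for this illustration.

import Control.Monad (replicateM)
import Data.List (sortOn)
import System.Random (randomIO)

-- Shuffle the indices 0..n-1 by sorting on freshly generated random keys.
shuffledIndices :: Int -> IO [Int]
shuffledIndices n = do
  keys <- replicateM n (randomIO :: IO Double)
  return $ map snd $ sortOn fst $ zip keys [0 .. n - 1]

Training would then fold learn over these shuffled indices instead of the fixed ranges above.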

I made a webpage to show this off, complete with a crude editor.

I had hoped to code in Haskell exclusively, but the Haste compiler introduced too much overhead. I resorted to moving both the neural net feedforward routine and the sample image data directly into JavaScript.

The source is available in the following Git repository.


Ben Lynn blynn@cs.stanford.edu