Machine Learning & MNIST Handwriting Recognition

The MNIST database, found here, is an excellent source of test data for anyone experimenting with basic machine learning algorithms. It consists of 60,000 samples of handwritten digits (0-9), each 28×28 pixels, with an associated set of labels, plus a second set of 10,000 similar images. The idea is that you develop a learning algorithm, feed it the 60,000 samples for training, then validate the result against the 10,000 images in the test set.

There are plenty of online tutorials and guides that show how to build a simple Neural Network. Most examples use Python, which I actually find a little irritating: Python coding often seems to end up being an experiment in expressing as much logic as possible in each line of code, meaning much of the implementation detail ends up being hidden, which isn’t always what you want when you are trying to understand the algorithms. So I’ve been setting up my own tests in C++ as an exercise in proving to myself that I actually do understand what’s going on.

My first go at setting up an actual Network based on the example here got me a score on the test data of 97.76%, which seems pretty amazing really, though apparently with a bit of work I should be able to do better. My network uses 3 layers, 784, 100 and 10 neurons wide, as per the tutorial, and I found the data mapped neatly to flat arrays which could be processed very efficiently by the CPU.

One thing I didn’t quite follow from the sample is that much of the online literature talks about using a mini-batch processing approach to speed up the training iterations, but I found that running the back propagation for every sample (called online mode) increased the overall epoch time only a little and almost always resulted in significantly faster learning. Aside from the possibilities for parallelizing the forward propagation of many batches, or possibly as a way of training against massive data sets (millions of samples), I’m not sure I’m really seeing the value in that approach here. In this case increasing the batch size from 1 to 10 decreased the iteration (epoch) time from 14.7 seconds down to 12.1 seconds, but also decreased the result after 30 epochs of training from 97.76% down to 95.7%.
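To make the online-versus-mini-batch distinction concrete, here is a toy sketch of my own (a single weight fitting y = 2x, nothing from the MNIST code): the only difference between the two modes is whether the accumulated gradient is applied after every sample or once per batch.

```cpp
#include <cmath>

// Toy illustration (my own sketch, not the MNIST network): fitting y = 2x
// with a single weight, contrasting per-sample ("online") updates against
// accumulating the gradient over a mini-batch before applying it.
float trainToy(int batchSize) {
	float w = 0.0f;                    // single weight, starts at zero
	const float learningRate = 0.005f;
	float gradSum = 0.0f;
	int count = 0;
	for (int epoch = 0; epoch < 100; epoch++) {
		for (int i = 1; i <= 10; i++) {
			float x = (float)i;
			float err = (w * x) - (2.0f * x);  // dE/dw of 0.5*err^2 is err*x
			gradSum += err * x;
			if (++count % batchSize == 0) {
				// online mode (batchSize 1) applies this after every sample;
				// mini-batch applies the averaged gradient once per batch.
				w -= learningRate * gradSum / (float)batchSize;
				gradSum = 0.0f;
			}
		}
	}
	return w;                          // both modes should end up near 2.0
}
```

Both modes converge on this trivial problem, but online mode takes ten small steps per epoch where the batch version takes one averaged step, which matches the faster learning I saw above.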

In case anyone reading this wants to have a go at building a C++ Neural Net, the following code can be used to quickly load either the test or training data.

class MNISTData {
	void endianFlip4(int32_t& v) {
		v = (v << 24) |
			((v << 8) & 0x00ff0000) |
			((v >> 8) & 0x0000ff00) |
			((v >> 24) & 0x000000ff);
	}

	uint8_t* bytesBuffer;

public:
	struct Image {
		uint8_t* bytes;
		uint8_t label;
	};

	int32_t numberOfImages;
	int32_t numberOfRows;
	int32_t numberOfColumns;
	std::vector<Image> images;

	MNISTData() : bytesBuffer(nullptr) { }
	~MNISTData() {
		if (bytesBuffer)
			free(bytesBuffer);
	}

	bool load(const char* p_images, const char* p_labels) {
		bool success = false;
		// open files.
		FILE* f_image = fopen(p_images, "rb");
		FILE* f_label = fopen(p_labels, "rb");
		if (f_image && f_label) {
			// read out magic numbers (the files are big-endian).
			int32_t magicNumber[2];
			fread(&magicNumber[0], sizeof(int32_t), 1, f_image);
			fread(&magicNumber[1], sizeof(int32_t), 1, f_label);
			endianFlip4(magicNumber[0]);
			endianFlip4(magicNumber[1]);
			if (magicNumber[0] == 2051 && magicNumber[1] == 2049) {
				// read headers.
				int32_t numberOfLabels;
				fread(&numberOfImages, sizeof(int32_t), 1, f_image);
				fread(&numberOfRows, sizeof(int32_t), 1, f_image);
				fread(&numberOfColumns, sizeof(int32_t), 1, f_image);
				fread(&numberOfLabels, sizeof(int32_t), 1, f_label);
				endianFlip4(numberOfImages);
				endianFlip4(numberOfRows);
				endianFlip4(numberOfColumns);
				endianFlip4(numberOfLabels);
				if (numberOfImages == numberOfLabels) {
					// load the image data as a single block.
					int32_t numBytes = numberOfRows * numberOfColumns * numberOfImages;
					bytesBuffer = (uint8_t*)malloc(numBytes);
					fread(bytesBuffer, numBytes, 1, f_image);
					// load the labels as a single block.
					std::vector<uint8_t> labels(numberOfImages);
					fread(, numberOfImages, 1, f_label);
					// build the output image array.
					images.resize(numberOfImages);
					for (int32_t i = 0; i < numberOfImages; i++) {
						images[i].bytes = bytesBuffer + (numberOfRows * numberOfColumns) * i;
						images[i].label = labels[i];
						assert(images[i].label <= 9);
					}
					success = true;
				}
			}
		}
		if (f_image) fclose(f_image);
		if (f_label) fclose(f_label);
		return success;
	}
};

The Neural net itself basically consists of a set of 1D and 2D arrays of floats. To make these arrays a little easier to manage I start by setting up a class that holds a 2D array of data, meaning the array has a width and a height; the network will be made up of instances of this array. An extension to this, not shown here, is that as well as aligning the data to a 16-byte boundary as shown, we can also align up the overall allocation size, creating padding on the end, which is useful for SSE code. We can go a step further and do that for every row of the array, introducing a pitch variable that might differ from the width, sort of like how a texture might be held in memory by a rendering API. Even better, we could swap out the allocations for CUDA allocations, which would be a step towards running our net on the GPU. For now I’ve shown this in its simplest form though…

template <class T> class type_array {
public:
	size_t width;
	size_t height;
	size_t sizeInBytes;
	T* data;
	type_array() : width(0), height(0), sizeInBytes(0), data(nullptr) { }
	~type_array() {
		if (data) _aligned_free(data);
	}
	void allocate(size_t _height, size_t _width) {
		width = _width;
		height = _height;
		size_t rowSizeInBytes = sizeof(T) * width;
		sizeInBytes = rowSizeInBytes * height;
		data = (T*)_aligned_malloc(sizeInBytes, 16);
	}
	void allocate(size_t _width) {
		allocate(1, _width);
	}
};
class value_array : public type_array<float> {	};
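For what it’s worth, the pitch idea mentioned above might look something like this sketch (the class and member names here are my own, and I’ve used the portable aligned_alloc rather than the MSVC-specific _aligned_malloc used above):

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical pitch-aligned variant of type_array: every row is padded out
// to a 16-byte multiple so each row starts on an aligned boundary, much like
// a texture pitch in a rendering API.
template <class T> class pitched_array {
public:
	size_t width;   // elements actually used per row
	size_t pitch;   // elements allocated per row (>= width)
	size_t height;
	T* data;
	pitched_array() : width(0), pitch(0), height(0), data(nullptr) { }
	~pitched_array() { if (data) free(data); }
	void allocate(size_t _height, size_t _width) {
		width = _width;
		height = _height;
		// round each row up to the next multiple of 16 bytes.
		size_t rowSizeInBytes = (sizeof(T) * width + 15) & ~(size_t)15;
		pitch = rowSizeInBytes / sizeof(T);
		data = (T*)aligned_alloc(16, rowSizeInBytes * height);
	}
	T* row(size_t y) { return data + y * pitch; }
};
```

Any code walking the array then steps by pitch rather than width, which is exactly the shape SSE or CUDA kernels want.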

The network itself uses three layers, an input layer, an output layer and a hidden layer, made up of instances of the value_array type. In C++ we can then build a structure to represent each layer, defined and initialized with code similar to this. The output and the set of correct answers are defined as pointers rather than arrays because we’d want to swap those in from our set of samples one by one, possibly in a random order.

// input layer
struct InputLayer {					
	size_t size;
	const value_array* out;
} i;								
// hidden layer						
struct HiddenLayer {				
	size_t size;
	value_array weights;
	value_array bias;
	value_array out;
	value_array delta;
	value_array delta2;				
} h;								
// output layer						
struct OutputLayer {				
	size_t size;
	value_array weights;
	value_array bias;
	value_array out;
	value_array delta;
	value_array delta2;	
	const value_array* answer;
} o;

// input layer
i.size = inputNeurons;
i.out = nullptr;
// hidden layer
h.size = hiddenLayerNeurons;
h.weights.allocate(hiddenLayerNeurons, inputNeurons);
h.bias.allocate(hiddenLayerNeurons);
h.out.allocate(hiddenLayerNeurons);
h.delta.allocate(hiddenLayerNeurons);
h.delta2.allocate(hiddenLayerNeurons, inputNeurons);
// output layer
o.size = outputNeurons;
o.weights.allocate(outputNeurons, hiddenLayerNeurons);
o.bias.allocate(outputNeurons);
o.out.allocate(outputNeurons);
o.delta.allocate(outputNeurons);
o.delta2.allocate(outputNeurons, hiddenLayerNeurons);
o.answer = nullptr;

After allocating the network buffers we would then initialize our arrays of weights and biases to small random numbers, say in the range -0.01 to 0.01, and from there we’d be ready to start training the network. Training the network involves swapping in samples (i.out and o.answer) one by one, evaluating (feeding forward) the network, and then backpropagating the errors up through the network to bring the weights and biases closer to values that provide a correct answer for our sample data.
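For the initialization step, a small sketch of my own (the function name is mine, not from the original code) using the standard &lt;random&gt; facilities to fill an array with values in that range:

```cpp
#include <cstddef>
#include <random>

// fill a block of floats with small random values in [-0.01, 0.01].
void randomize(float* data, size_t count, std::mt19937& rng)
{
	std::uniform_real_distribution<float> dist(-0.01f, 0.01f);
	for (size_t i = 0; i < count; i++)
		data[i] = dist(rng);
}
```

This would be called once per weight and bias array, e.g. randomize(, h.weights.width * h.weights.height, rng), before the first training sample is fed in.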

Feeding forward is more or less a series of dot products across our arrays, followed by a call to our activation function for each neuron. In this case the activation function is sigmoid, but others are available. The code would look something like this.

void feedForward() {
	// (1) forward propagate from input layer to hidden layer.
	for (size_t y = 0; y < h.weights.height; y++) {
		float weightedSum = 0.0f;
		for (size_t x = 0; x < h.weights.width; x++)
			weightedSum +=[y * h.weights.width + x] * i.out->data[x];[y] = sigmoid(weightedSum +[y]);
	}
	// (2) forward propagate from hidden layer to output layer.
	for (size_t y = 0; y < o.weights.height; y++) {
		float weightedSum = 0.0f;
		for (size_t x = 0; x < o.weights.width; x++)
			weightedSum +=[y * o.weights.width + x] *[x];[y] = sigmoid(weightedSum +[y]);
	}
}
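The code above calls a sigmoid() helper, and the back-propagation code relies on a sigmoidToDerivative() helper; neither is shown elsewhere, so minimal definitions might look like this:

```cpp
#include <cmath>

// logistic sigmoid: squashes any input into the range (0, 1).
inline float sigmoid(float x)
{
	return 1.0f / (1.0f + expf(-x));
}

// derivative of the sigmoid expressed in terms of its output s = sigmoid(x):
// sigmoid'(x) = s * (1 - s), so we can reuse the stored layer outputs
// directly instead of recomputing the weighted sums.
inline float sigmoidToDerivative(float s)
{
	return s * (1.0f - s);
}
```

Expressing the derivative in terms of the output is what lets back-propagation work purely from the values feedForward() already stored.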

Back-propagation is a little more complex. Here we work backwards evaluating the error at each layer of the network, working out the direction we need to move the weights and biases in order to get a better answer. This is based on the derivative of our activation function, which provides a way for us to work out which direction we need to adjust in to move towards a correct response. At each layer we then feed back the error in proportion to the weight of the connections, kind of like we are trying to determine how much each input connection is to blame.

The code for this looks something like this.

void backPropagation() {
	// clear deltas
	memset(, 0,;
	memset(, 0, o.delta2.sizeInBytes);
	memset(, 0,;
	memset(, 0, h.delta2.sizeInBytes);
	// calculate the sum of the deltas for the output layer
	for (size_t x = 0; x < o.size; x++) {
		float o_delta_x = ([x] - o.answer->data[x]) * sigmoidToDerivative([x]);[x] += o_delta_x;
		for (size_t y = 0; y < h.size; y++)[x * o.delta2.width + y] += o_delta_x *[y];
	}
	// calculate the sum of the deltas for the hidden layer
	for (size_t x = 0; x < h.size; x++) {
		float weightedSum = 0.0f;
		for (size_t y = 0; y < o.size; y++)
			weightedSum +=[y] *[y * h.size + x];
		float h_delta_x = weightedSum *[x] * (1.0f -[x]);[x] += h_delta_x;
		for (size_t y = 0; y < i.size; y++)[x * h.delta2.width + y] += h_delta_x * i.out->data[y];
	}
	// apply corrections (scaled by the learning rate) to the weights and biases.
	for (size_t x = 0; x < o.weights.width * o.weights.height; x++)[x] -=[x] * learningRate;
	for (size_t x = 0; x < h.weights.width * h.weights.height; x++)[x] -=[x] * learningRate;
	for (size_t x = 0; x < o.bias.width * o.bias.height; x++)[x] -=[x] * learningRate;
	for (size_t x = 0; x < h.bias.width * h.bias.height; x++)[x] -=[x] * learningRate;
}

And that’s more or less it. If you load the MNIST data set of 60,000 samples, feed every sample in a random order into the feed-forward function followed by a call to back-propagation, 30 times over (1.8M iterations in all), and then feed the 10,000 test samples through the same network using feed-forward only and check the results, you should hopefully get a score of a little over 97%.
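For that final scoring pass, the network’s answer for a test sample is just the index of the largest of the 10 output values; a small helper along these lines (my own sketch) is all that’s needed:

```cpp
#include <cstddef>

// the predicted digit is the index of the strongest output neuron.
size_t argMax(const float* out, size_t count)
{
	size_t best = 0;
	for (size_t i = 1; i < count; i++)
		if (out[i] > out[best])
			best = i;
	return best;
}
```

Per test sample that amounts to calling feedForward() and counting how often argMax(, o.size) matches the sample’s label.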
