Machine Learning & MNIST Handwriting Recognition

The MNIST database, found here, is an excellent source of test data for anyone experimenting with basic machine learning algorithms. It consists of 60,000 samples of handwritten digits (0-9), each 28×28 pixels in size with an associated label, and a second set of 10,000 similar images. The idea is that you develop a learning algorithm, feed it the 60,000 samples for training, then validate the result against the 10,000 images in the test set.

There are plenty of online tutorials and guides that show how to build a simple Neural Network. Most examples use Python, which I actually find a little irritating, as Python coding often seems to end up being an experiment in expressing as much logic as possible in each line of code. That hides much of the implementation detail, which isn’t always what you want when you are trying to understand the algorithms. So I’ve been setting up my own tests in C++ as an exercise in proving to myself that I actually do understand what’s going on.

My first go at setting up an actual Network, based on the example here, got me a score on the test data of 97.76%, which seems pretty amazing really, though apparently with a bit of work I should be able to do better. My network uses 3 layers, being 784, 100 and 10 wide, as per the tutorial, and I found the data mapped neatly to flat arrays which could be processed very efficiently by the CPU.

One thing I didn’t quite follow from the sample is that much of the online literature talks about using a mini-batch processing approach to speed up the training iterations, but I found that running the back propagation for every sample (called online mode) increased the overall epoch time only a little and almost always resulted in significantly faster learning. Aside from the possibilities for parallelizing the forward propagation of many batches, or possibly as a way of training against massive data sets (millions of samples), I’m not sure I’m really seeing the value in that approach here. In this case increasing the batch size from 1 to 10 decreased the iteration (epoch) time from 14.7 seconds down to 12.1 seconds, but also decreased the result after 30 epochs of training from 97.76% down to 95.7%.
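
To make the difference between the two modes concrete, here is a toy sketch in Python rather than the real network. It assumes a single weight w fitting y = w * x, with the gradients summed across each batch and applied once per batch, so a batch size of 1 is online mode. The names and the data are made up for illustration, and it says nothing about which mode learns faster on MNIST; it only shows where the weight updates happen.

```python
def train_epoch(w, samples, learning_rate, batch_size):
    """One pass over samples for the toy model y = w * x.

    Gradients are summed over each batch and applied once per batch,
    so batch_size=1 corresponds to online mode.
    """
    updates = 0
    for start in range(0, len(samples), batch_size):
        grad_sum = 0.0
        for x, y in samples[start:start + batch_size]:
            error = w * x - y          # prediction minus target
            grad_sum += error * x      # d(0.5 * error^2) / dw
        w -= learning_rate * grad_sum  # one weight update per batch
        updates += 1
    return w, updates


# Made-up samples that follow y = 2x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
print(train_epoch(0.0, samples, 0.01, batch_size=1))
print(train_epoch(0.0, samples, 0.01, batch_size=2))
```

With a batch size of 1 the weight changes after every sample, so later samples in the same epoch already see the corrected weight. With a larger batch every sample in the batch is evaluated against the same stale weight.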

In case anyone reading this wants to have a go at building a C++ Neural Net, the following code can be used to quickly load either the test or training data.

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

class MNISTData {
	// MNIST headers are stored big-endian, so flip them on a little-endian CPU such as x86.
	void endianFlip4(int32_t& v) {
		uint32_t u = (uint32_t)v;
		u = (u << 24) |
			((u << 8) & 0x00ff0000) |
			((u >> 8) & 0x0000ff00) |
			((u >> 24) & 0x000000ff);
		v = (int32_t)u;
	}

	uint8_t* bytesBuffer;

public:
	struct Image {
		uint8_t* bytes;
		uint8_t label;
	};

	int32_t numberOfImages;
	int32_t numberOfRows;
	int32_t numberOfColumns;
	std::vector<Image> images;

	MNISTData() : bytesBuffer(nullptr), numberOfImages(0), numberOfRows(0), numberOfColumns(0) { }

	~MNISTData() {
		if (bytesBuffer)
			free(bytesBuffer);
	}

	bool load(const char* p_images, const char* p_labels) {
		bool loaded = false;
		// open files.
		FILE* f_image = fopen(p_images, "rb");
		FILE* f_label = fopen(p_labels, "rb");
		if (f_image && f_label) {
			// read out magic numbers.
			int32_t magicNumber[2];
			fread(&magicNumber[0], sizeof(int32_t), 1, f_image);
			fread(&magicNumber[1], sizeof(int32_t), 1, f_label);
			endianFlip4(magicNumber[0]);
			endianFlip4(magicNumber[1]);
			if (magicNumber[0] == 2051 && magicNumber[1] == 2049) {
				// read headers.
				int32_t numberOfLabels;
				fread(&numberOfImages, sizeof(int32_t), 1, f_image);
				fread(&numberOfRows, sizeof(int32_t), 1, f_image);
				fread(&numberOfColumns, sizeof(int32_t), 1, f_image);
				fread(&numberOfLabels, sizeof(int32_t), 1, f_label);
				endianFlip4(numberOfImages);
				endianFlip4(numberOfRows);
				endianFlip4(numberOfColumns);
				endianFlip4(numberOfLabels);
				if (numberOfImages == numberOfLabels) {
					// load the image data as a single block.
					int32_t numBytes = numberOfRows * numberOfColumns * numberOfImages;
					bytesBuffer = (uint8_t*)malloc(numBytes);
					fread(bytesBuffer, numBytes, 1, f_image);
					// load the labels as a single block.
					std::vector<uint8_t> labels(numberOfImages);
					fread(labels.data(), numberOfImages, 1, f_label);
					// build the output image array.
					images.resize(numberOfImages);
					for (int32_t i = 0; i < numberOfImages; i++) {
						images[i].bytes = bytesBuffer + (numberOfRows * numberOfColumns) * i;
						images[i].label = labels[i];
						assert(images[i].label <= 9);
					}
					loaded = true;
				}
			}
		}
		if (f_image) fclose(f_image);
		if (f_label) fclose(f_label);
		// report success only if the headers matched and the data was read.
		return loaded;
	}
};
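
If you want to sanity check the file headers before writing any C++, the layout is easy to poke at from a few lines of Python. This is only a sketch: the function name is mine, and the header bytes are built by hand here rather than read from the real file. It shows the four big-endian 32-bit values at the start of the image file, and what happens to the magic number if the endian flip is forgotten on a little-endian machine.

```python
import struct


def parse_idx_image_header(header_bytes):
    """Parse the 16-byte header of an MNIST image file (big-endian int32s)."""
    magic, count, rows, cols = struct.unpack(">iiii", header_bytes[:16])
    if magic != 2051:
        raise ValueError("not an MNIST image file: magic = %d" % magic)
    return count, rows, cols


# Hand-built header laid out the way the training image file stores it.
fake_header = struct.pack(">iiii", 2051, 60000, 28, 28)
print(parse_idx_image_header(fake_header))

# Reading the magic number as little-endian gives a very different value,
# which is why the C++ loader needs its endian flip on x86.
print(struct.unpack("<i", fake_header[:4])[0])
```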


The Neural net itself basically consists of a set of 1D and 2D arrays of floats. To make these arrays a little easier to manage I start by setting up a class that holds a 2D array of data, meaning the array has a width and height. The network will be made up of instances of this array. An extension to this, not shown here, is that as well as aligning the data to a 16-byte boundary as shown, we can also round up the overall allocation size to create padding on the end, which is useful for SSE code. We can go a step further and do that for every row of the array, introducing a pitch variable that might differ from the width, sort of like how a texture might be held in memory by a rendering API. Even better, we could swap out the allocations for CUDA allocations, which would be a step towards running our net on the GPU. For now I’ve shown this in its simplest form though…

#include <malloc.h>	// _aligned_malloc and _aligned_free (MSVC specific)

template <class T> class type_array {
public:
	size_t width;
	size_t height;
	size_t sizeInBytes;
	T* data;
	type_array() : width(0), height(0), sizeInBytes(0), data(nullptr) { }
	~type_array() {
		if (data) _aligned_free(data);
	}
	// allocate a 2D array, aligned to a 16-byte boundary.
	void allocate(size_t _height, size_t _width) {
		width = _width;
		height = _height;
		size_t rowSizeInBytes = sizeof(T) * width;
		sizeInBytes = rowSizeInBytes * height;
		data = (T*)_aligned_malloc(sizeInBytes, 16);
	}
	// allocate a 1D array, held as a single row.
	void allocate(size_t _width) {
		allocate(1, _width);
	}
};
class value_array : public type_array<float> {	};

The network itself uses three layers: an input layer, a hidden layer and an output layer, made up of instances of the value_array type. In C++ we can then build a structure to represent each layer, defined and initialized with code similar to this. The input layer’s output and the set of correct answers are defined as pointers rather than arrays because we’d want to swap those in from our set of samples one by one, possibly in a random order.

// input layer
struct InputLayer {					
	size_t size;
	const value_array* out;
} i;								
// hidden layer						
struct HiddenLayer {				
	size_t size;
	value_array weights;
	value_array bias;
	value_array out;
	value_array delta;
	value_array delta2;				
} h;								
// output layer						
struct OutputLayer {				
	size_t size;
	value_array weights;
	value_array bias;
	value_array out;
	value_array delta;
	value_array delta2;	
	const value_array* answer;
} o;

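// input layer
i.size = inputNeurons;
i.out = nullptr;
// hidden layer
h.size = hiddenLayerNeurons;
h.weights.allocate(hiddenLayerNeurons, inputNeurons);
h.bias.allocate(hiddenLayerNeurons);
h.out.allocate(hiddenLayerNeurons);
h.delta.allocate(hiddenLayerNeurons);
h.delta2.allocate(hiddenLayerNeurons, inputNeurons);
// output layer
o.size = outputNeurons;
o.weights.allocate(outputNeurons, hiddenLayerNeurons);
o.bias.allocate(outputNeurons);
o.out.allocate(outputNeurons);
o.delta.allocate(outputNeurons);
o.delta2.allocate(outputNeurons, hiddenLayerNeurons);
o.answer = nullptr;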

After allocating the network buffers we would then initialize our arrays of weights and biases to small random numbers, say in the range -0.01 to 0.01, and from there we’d be ready to start training the network. Training the network involves swapping in samples (i.out and o.answer) one by one, evaluating (feeding forward) the network, and then backpropagating the errors up through the network to bring the weights and biases closer to values that provide a correct answer for our sample data.
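
As a rough illustration of the initialization step, sketched in Python for brevity, the following fills flat row-major arrays with values in the range -0.01 to 0.01. The helper name and the fixed seed are my own assumptions; the real code would fill h.weights, h.bias, o.weights and o.bias in the same way.

```python
import random


def random_weights(height, width, scale=0.01, rng=random):
    """Return a flat row-major list of height * width values in [-scale, scale]."""
    return [rng.uniform(-scale, scale) for _ in range(height * width)]


rng = random.Random(1234)  # fixed seed so runs are repeatable
hidden_weights = random_weights(100, 784, rng=rng)  # 100 neurons, 784 inputs each
hidden_bias = random_weights(1, 100, rng=rng)       # one bias per hidden neuron
print(len(hidden_weights), len(hidden_bias))
```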

Feeding forward is more or less a series of dot products across our arrays, followed by a call to our activation function for each neuron. In this case the activation function is sigmoid, but others are available. The code would look something like this.

void feedForward() {
	// (1) forward propagate from input layer to hidden layer.
	for (size_t y = 0; y < h.weights.height; y++) {
		float weightedSum = 0.0f;
		for (size_t x = 0; x < h.weights.width; x++)
			weightedSum += h.weights.data[y * h.weights.width + x] * i.out->data[x];
		h.out.data[y] = sigmoid(weightedSum + h.bias.data[y]);
	}
	// (2) forward propagate from hidden layer to output layer.
	for (size_t y = 0; y < o.weights.height; y++) {
		float weightedSum = 0.0f;
		for (size_t x = 0; x < o.weights.width; x++)
			weightedSum += o.weights.data[y * o.weights.width + x] * h.out.data[x];
		o.out.data[y] = sigmoid(weightedSum + o.bias.data[y]);
	}
}
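
The sigmoid function itself is not shown above. A minimal Python sketch of the same per-layer calculation, assuming the standard logistic definition of sigmoid and the same flat row-major weight layout, might look like this. The function names are illustrative only.

```python
import math


def sigmoid(x):
    """Standard logistic function, squashing x into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def feed_layer(weights, bias, inputs):
    """Feed one layer forward.

    weights is a flat row-major list with len(bias) rows and len(inputs) columns.
    """
    width = len(inputs)
    outputs = []
    for y in range(len(bias)):
        weighted_sum = 0.0
        for x in range(width):
            weighted_sum += weights[y * width + x] * inputs[x]
        outputs.append(sigmoid(weighted_sum + bias[y]))
    return outputs


# Two inputs feeding two neurons, with made-up weights and biases.
out = feed_layer([1.0, -1.0, 0.5, 0.5], [0.0, -1.0], [2.0, 2.0])
print(out)
```

The first neuron sees a weighted sum of zero, so it lands exactly halfway up the sigmoid curve, while the second neuron sees a positive sum and so outputs something above 0.5.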

Back-propagation is a little more complex. Here we work backwards evaluating the error at each level of the network, working out the direction we need to move the weights and biases in order to get a better answer. This is based on the derivative of our activation function, which provides a way for us to work out which direction we need to adjust in to move towards a correct response. At each level we are then feeding back the error in proportion to the weight of the connections, kind of like we are trying to determine how much each input connection is to blame.

The code for this looks something like this.

// needs <cstring> for memset
void backPropogation() {
	// clear deltas
	memset(o.delta.data, 0, o.delta.sizeInBytes);
	memset(o.delta2.data, 0, o.delta2.sizeInBytes);
	memset(h.delta.data, 0, h.delta.sizeInBytes);
	memset(h.delta2.data, 0, h.delta2.sizeInBytes);
	// calculate the sum of the deltas for the output layer
	for (size_t x = 0; x < o.size; x++) {
		float o_delta_x = (o.out.data[x] - o.answer->data[x]) * sigmoidToDerivative(o.out.data[x]);
		o.delta.data[x] += o_delta_x;
		for (size_t y = 0; y < h.size; y++) {
			o.delta2.data[x * o.delta2.width + y] += o_delta_x * h.out.data[y];
		}
	}
	// calculate the sum of the deltas for the hidden layer
	for (size_t x = 0; x < h.size; x++) {
		float weightedSum = 0.0f;
		for (size_t y = 0; y < o.size; y++)
			weightedSum += o.delta.data[y] * o.weights.data[y * h.size + x];
		float h_delta_x = weightedSum * h.out.data[x] * (1.0f - h.out.data[x]);
		h.delta.data[x] += h_delta_x;
		for (size_t y = 0; y < i.size; y++) {
			h.delta2.data[x * h.delta2.width + y] += h_delta_x * i.out->data[y];
		}
	}
	// apply corrections (scaled by the learning rate) to the weights and biases.
	for (size_t x = 0; x < o.weights.width * o.weights.height; x++)
		o.weights.data[x] -= o.delta2.data[x] * learningRate;
	for (size_t x = 0; x < h.weights.width * h.weights.height; x++)
		h.weights.data[x] -= h.delta2.data[x] * learningRate;
	for (size_t x = 0; x < o.bias.width * o.bias.height; x++)
		o.bias.data[x] -= o.delta.data[x] * learningRate;
	for (size_t x = 0; x < h.bias.width * h.bias.height; x++)
		h.bias.data[x] -= h.delta.data[x] * learningRate;
}
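
The listing calls sigmoidToDerivative without showing it. Given that the hidden layer step uses out * (1 - out) inline, a reasonable assumption is that it takes the already-activated output rather than the raw weighted sum. Here is that assumption as a Python sketch, checked against a finite difference.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_to_derivative(s):
    """Derivative of sigmoid written in terms of its output s = sigmoid(x)."""
    return s * (1.0 - s)


# Compare against a central finite difference at an arbitrary point.
x, h = 0.3, 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
analytic = sigmoid_to_derivative(sigmoid(x))
print(abs(numeric - analytic) < 1e-8)
```

Working from the output is convenient because the feed-forward pass has already stored it, so no extra call to exp is needed during back-propagation.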

And that’s more or less it. Load the MNIST data set of 60,000 samples and, 30 times over, feed every sample in a random order into the feed forward function followed by a call to back-propagation (1.8M iterations in all). Then feed the 10,000 test samples through the same network using feed-forward only and check the results, and you hopefully get a score of a little over 97%.
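
Scoring the test set comes down to picking the largest of the 10 output neurons and comparing it with the label. A small Python sketch of that, plus the per-epoch shuffle, is below. The helper names are hypothetical and the outputs are made-up values rather than real network results.

```python
import random


def predicted_digit(outputs):
    """Index of the largest output neuron, taken as the network's answer."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def accuracy(all_outputs, labels):
    """Percentage of samples where the predicted digit matches the label."""
    correct = sum(1 for out, label in zip(all_outputs, labels)
                  if predicted_digit(out) == label)
    return 100.0 * correct / len(labels)


def epoch_order(sample_count, rng):
    """A fresh random visiting order for one epoch of training."""
    order = list(range(sample_count))
    rng.shuffle(order)
    return order


# Made-up outputs for a 3-class toy problem; the last one is wrong.
outputs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
print(accuracy(outputs, [1, 0, 2, 1]))
print(epoch_order(5, random.Random(0)))
```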

Shadow Casting Spot Lights In WebGL


Spot Lights

In this post I'm going to add shadows to one of my lighting shaders. I've already got point lights and directional lights working, but for the sake of making my first attempt at shadows as easy as possible I'm going to use a spot light this time. Spot lights are very similar to point lights, with the only major difference being that we need to take into account angular attenuation as well as distance attenuation. The result is that we form a cone of light.

If we assume we have an inner and outer angle for the falloff, then we can calculate the cosine of each and hand the resulting values to our shader, defined as the Z and W values of the light-params vector in the following code. From there, assuming we also have the light direction and L, which is the vector from the surface to the light, we can calculate a falloff factor to multiply with our existing distance falloff value, using something similar to the following code.

 float cosSpotInnerAngle = u_lightParams1.z;
 float cosSpotOuterAngle = u_lightParams1.w;
 float cosAngle = dot(lightDir, -L); // lightDir: the spot light's direction (variable name assumed here)
 float falloff = clamp(  (cosAngle - cosSpotOuterAngle) 
                       / (cosSpotInnerAngle - cosSpotOuterAngle), 0.0, 1.0);
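
As a worked example of the same falloff calculation, here it is in Python with angles given in degrees. The function name and the 20 and 30 degree cone angles are just assumptions for illustration; in the shader the two cosines would be precomputed and passed in as shown above.

```python
import math


def spot_falloff(angle_deg, inner_deg, outer_deg):
    """Angular attenuation: 1 inside the inner cone, 0 outside the outer cone."""
    cos_inner = math.cos(math.radians(inner_deg))
    cos_outer = math.cos(math.radians(outer_deg))
    cos_angle = math.cos(math.radians(angle_deg))
    t = (cos_angle - cos_outer) / (cos_inner - cos_outer)
    return max(0.0, min(1.0, t))  # same job as clamp(t, 0.0, 1.0) in GLSL


# Angle between the light direction and the ray to the surface.
for angle in (0.0, 25.0, 45.0):
    print(angle, spot_falloff(angle, 20.0, 30.0))
```

Because the comparison is done on cosines, the falloff between the inner and outer angles is not linear in the angle itself, but it is cheap and looks fine in practice.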

Another small change is that where we use a sphere mesh for a point light, we can get away with a frustum mesh instead, scaled to fit the volume created by the light cone. This isn't required but reduces overdraw a lot.

Shadow Maps

Shadow maps store the depth of each shadow casting surface from the point of view of the light. Given this data our lighting shader will be able to compute the distance to each lit surface and along the same ray the distance to the nearest occluder, and by doing a depth comparison determine if the surface is occluded (in shadow) or not.

For a spot light we can use a perspective projection matching the shape of the light cone, and we can treat the light matrix more or less as a camera matrix for the purpose of rendering the occluders. Imagine we were treating the light itself as a camera for this step.

We render back faces, rather than front faces, as this helps to cut out self shadowing artifacts. If we rendered front faces there would be lots of surfaces where the nearest occluder and the lit surface were at the same depth, and then tiny inaccuracies in the calculations lead to errors in the results, which we don't want.

To keep things simple I'm going to allocate a 512x512 shadow map per light, but a better solution might be to start with one large shadow map and cut it up dynamically based on the number of lights active.

Shadow Calculations

Our lighting shaders already reconstruct the eye space position of each shaded fragment for lighting. To run the shadow tests we need to transform these positions into light clip space. To achieve that we can combine a few matrices and hand a matrix to the shader that encodes the full transformation. The sequence of transformations we need is...

Camera eye space -> World space (inverse camera view matrix)
World space -> Light eye space (light view matrix)
Light eye space -> Light clip space (light projection matrix)

If the matrices representing each of these are combined we end up with the required camera eye space to light clip space matrix.
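
To show the order of multiplication, here is a small Python sketch using 4x4 matrices stored as lists of rows and the column-vector convention (so the rightmost matrix is applied first). The helper names are mine, and simple translation and scale matrices stand in for the real view and projection matrices.

```python
def mat_mul(a, b):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]


def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]


def eye_to_light_matrix(light_projection, light_view, inverse_camera_view):
    # Column-vector convention: the rightmost matrix is applied first.
    return mat_mul(light_projection, mat_mul(light_view, inverse_camera_view))


# Stand-in matrices, not real camera or light transforms.
inverse_camera_view = translation(0, 0, 5)
light_view = translation(-2, 0, 0)
light_projection = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

combined = eye_to_light_matrix(light_projection, light_view, inverse_camera_view)
eye_pos = [1, 1, 1, 1]
print(transform(combined, eye_pos))
```

Applying the combined matrix gives the same result as applying the three matrices one after another, which is the whole point of building it once on the CPU.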

The resulting coordinates provide a direct mapping into the light's shadow map. The XY values just need a scale and offset applied to map them to the 0 to 1 range used by UVs. The Z value can then be compared to the depth stored in the shadow map, once the stored value has been remapped from the 0 to 1 range to the -1 to 1 range of clip space.

A single shadow map comparison will yield a binary on/off result, but we can do several and average them to get a percentage-in-shadow result, which allows us to soften the edges of the shadow a little.

Our shadow calculation function looks something like this, where we also upload a bias value in the Y component of our lightParams to counter minor inaccuracies in the calculations.

float CalculateSpotShadowFactor(vec3 eyePos) {

	// Transform from eye-pos to light clip space.
	vec4 lightClipPos4 = u_eyeToLightMatrix * vec4(eyePos, 1.0);
	vec3 lightClipPos = lightClipPos4.xyz / lightClipPos4.www;

	// Work out UV coords for the shadow-map.
	vec2 shadowMapUV = lightClipPos.xy * 0.5 + 0.5;

	// Work out the biased depth to compare against the shadow map.
	float lightClipDepth = lightClipPos.z;
	float lightClipBias = u_lightParams1.y;
	float shadowMapCompare = lightClipDepth - lightClipBias;

	// Run 4 comparisons.
	vec4 lightDepth4;
	lightDepth4.x = texture2D(shadowMapSampler, shadowMapUV.xy + u_filterPattern.xy).r;
	lightDepth4.y = texture2D(shadowMapSampler, shadowMapUV.xy + u_filterPattern.zw).r;
	lightDepth4.z = texture2D(shadowMapSampler, shadowMapUV.xy - u_filterPattern.xy).r;
	lightDepth4.w = texture2D(shadowMapSampler, shadowMapUV.xy - u_filterPattern.zw).r;
	// Remap the stored depths from [0, 1] to the [-1, 1] range of clip space.
	lightDepth4 = lightDepth4 * 2.0 - 1.0;

	// 1.0 where the surface is nearer than the stored occluder depth (lit), else 0.0.
	vec4 inLight4 = vec4(1.0) - step(lightDepth4, vec4(shadowMapCompare));
	return dot(inLight4, vec4(0.25, 0.25, 0.25, 0.25));
}
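
The comparison logic is easier to check outside a shader, so here is the same idea sketched in Python. It assumes the sampled occluder depths are already in the same range as the surface depth, and the function names are illustrative. It also shows why the bias matters when the surface and the nearest occluder are at the same depth.

```python
def clip_to_uv(clip_x, clip_y):
    """Map clip-space XY in [-1, 1] to texture UV in [0, 1]."""
    return clip_x * 0.5 + 0.5, clip_y * 0.5 + 0.5


def shadow_factor(occluder_depths, surface_depth, bias):
    """Fraction of samples for which the surface is lit (0 = fully shadowed)."""
    compare = surface_depth - bias
    lit = [1.0 if compare < depth else 0.0 for depth in occluder_depths]
    return sum(lit) / len(lit)


print(clip_to_uv(-1.0, 1.0))

# Two of the four samples hit a nearer occluder, so the surface is half lit.
print(shadow_factor([0.5, 0.5, 0.2, 0.2], 0.4, 0.01))

# Surface and stored depth are equal: lit with a bias, shadowed without one.
print(shadow_factor([0.3] * 4, 0.3, 0.01))
print(shadow_factor([0.3] * 4, 0.3, 0.0))
```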

Multiple Lights

As a finishing touch I've set up the example to use two lights rather than just one. Supporting multiple lights is easy as you just need to sum together the contributions from each light using standard additive blending.

An Attempt At Using CUDA 8 With Visual Studio 2017

I don’t think this is all that useful, but at least it documents my attempt to get CUDA to use the Visual Studio 2017 build tools, which is something NVIDIA don’t support at this time.

If you want to see a more successful attempt at getting CUDA to work with something it wasn’t supposed to, see my earlier post relating to CUDA and Visual Studio 2015 Express Edition here.

Again, I’m happy to build from the command line. I don’t find the debugger all that useful when building highly threaded systems anyway since the idea that you can step through the code one line at a time and follow the logic of the program has limited relevance to parallel programming, and I usually end up falling back on logs or attempts to visualize what programs are doing.

1. Set Up the VS 2017 Environment

I stuck the following line into a batch file called env.bat so I could quickly set up the environment needed to access the build tools. Just run env.bat from the command line once this is in place…

call "D:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" x86_amd64

2. Add vcvars64.bat

The CUDA compiler, nvcc, looks for this file before launching the various Visual Studio tools. After spending a bit of time looking at what nvcc does via Windows Process Monitor I figured out that we need to create the following path under Program Files, or wherever VS2017 was installed…

\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.10.25017\VC\bin\amd64

Note that the folder with three numbers relates to the specific version of the build tools, and I expect Microsoft to change or add variations of the build tools during the life of VS2017, so this number won’t always be correct. Microsoft do provide a way to query the location of the build tools, but as this requires the use of a COM interface it’s not that easy to integrate into a simple test like this, so I’m happy to hard code it. Once the path is there, make a file called vcvars64.bat containing the text “CALL setenv /x64”.
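
If hard coding the version number becomes a nuisance, one low-tech alternative is to list the MSVC folder and pick the highest version. This Python sketch is an assumption on my part rather than Microsoft’s supported query mechanism; it relies only on the version folders being named as dot-separated numbers, and it is demonstrated against a temporary folder with made-up version names rather than a real install.

```python
import os
import tempfile


def newest_msvc_tools_dir(msvc_root):
    """Return the highest-versioned subfolder of msvc_root, or None if there is none."""
    versions = []
    for name in os.listdir(msvc_root):
        parts = name.split(".")
        is_dir = os.path.isdir(os.path.join(msvc_root, name))
        if is_dir and all(p.isdigit() for p in parts):
            # Compare as tuples of ints so that 14.10 sorts above 14.9.
            versions.append((tuple(int(p) for p in parts), name))
    if not versions:
        return None
    return os.path.join(msvc_root, max(versions)[1])


# Demonstrate against a throwaway folder; these version names are made up.
with tempfile.TemporaryDirectory() as root:
    for version in ("14.10.25017", "14.9.1", "14.11.25503"):
        os.makedirs(os.path.join(root, version))
    print(os.path.basename(newest_msvc_tools_dir(root)))
```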

3. Build Something

A typical command line for CUDA under VS2015 looked like this, where all we’ve changed for 2017 is the path to cl.exe. We can’t set the cl-version parameter to 2017 because if we do it throws an error back at us, but this seems to at least find the correct tools.

nvcc -o main.exe --gpu-architecture=compute_50 --gpu-code=sm_50,sm_52 --machine 64 --cl-version 2015 -ccbin "D:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.10.25017\bin\HostX64\x64"

One more problem is that my sample app hit a compile-time check in host_config.h complaining that the version of Visual Studio was not recognized. Locating and commenting out the offending line of code seemed to clear the error.

host_config.h(133): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, and 2015 are supported!

After that it seems to work, at least in my simple Hello World app.

Note that I do get a lot of warnings like this, but they don’t appear to affect the output of the program.

math_functions.h(7691): warning: a __host__ function("fmodf") redeclared with __host__ __device__,
 hence treated as a __host__ __device__ function

Generally I wouldn’t advise doing this for any serious attempt at CUDA coding. I’m sure official Visual Studio 2017 support is just around the corner.

Sentiment Analysis of Twitter Data Using Python

I’m sure this has been covered by many people before, but I wanted to explore the current state of natural language processing. As an example, I’ve chosen to pull some data relating to various stocks from Twitter and from there explore how we can automatically evaluate average user sentiment.

Creating a Twitter Developer Account and App

The Twitter Developer site can be found here. From there you can register an app by clicking the ‘Create New App’ button and filling in the required fields. Registration of an app gives you the various keys you need to pull data automatically from Twitter. We’ll come back to how we use them a bit later.

Google Natural Language API

This seems like a really good solution. I’d trust Google to provide a best-in-class service here. According to Google the first 5K requests are free, then you are billed $1 per month for the next million or so. I’m probably going to exceed 5K per month just testing this small test-app alone, but I can afford a dollar per month, so that’s fine.

We can set up access to this API from here. Google require billing details in order to activate this service. So far, so good, until we try to set up the billing details, only to find that Google Cloud Services are only available for business users. I’m not a business user :-(.

So I’m stuck looking for another solution.

Stanford CoreNLP

This is available from here. It takes the form of a Java app that spins up a server on your PC that local processes can then access via an API.

To run this, there is a large download, after which you need to install the JDK from here, and finally spin up the server by executing the following from the command line:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

For testing, to check the service is running, you can access the service from the browser using http://localhost:9000/ which should present you with a nice UI and various visual representations of the analysis that’s carried out.


I’m going to build a simple app using Python. There are APIs for all the services I want to use that can be accessed easily from Python, so it’s a great way of bringing them together. For reference I’m using Python 3.5.1.

To access Twitter data I’m using python-twitter from here. That can be installed easily, along with a module that allows my app to talk to the CoreNLP server, by executing the following commands:

python -m pip install python-twitter

python -m pip install pycorenlp

Working Example #1

This program will download the most recent 100 English tweets relating to Microsoft stock and print them out. Replace the XX fields with values taken from the Twitter application page earlier.

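import twitter
from pycorenlp.corenlp import StanfordCoreNLP

# Replace the XX fields with the keys from the Twitter application page.
api = twitter.Api(
 consumer_key='XX',
 consumer_secret='XX',
 access_token_key='XX',
 access_token_secret='XX')

host = "http://localhost"
port = "9000"
nlp = StanfordCoreNLP(host + ":" + port)

search = api.GetSearch(term='$MSFT', lang='en', result_type='recent', count=100, max_id='')
for t in search:
 # Ask the CoreNLP server to annotate the tweet text, including sentiment.
 output = nlp.annotate(
  t.text,
  properties={
   "outputFormat": "json",
   "annotators": "depparse,ner,entitymentions,sentiment"
  })

 # Print each sentence along with the sentiment label CoreNLP assigned to it.
 for s in output["sentences"]:
  print("%d: '%s': %s" % (
   s["index"],
   " ".join([token["word"] for token in s["tokens"]]),
   s["sentiment"]))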

Working Example #1 – Results

In short the results are not great. Almost everything is classified as either Neutral or Negative for some reason. The very few that turn positive are typically garbage posts that have nothing to do with the subject at all. For example I’ve fed the app the tickers of the top 5 tech companies and in each case sentiment was heavily negative. Microsoft fared worst, and didn’t have a single positive post, though that might be because of the negative headlines surrounding the current WannaCrypt outbreak, but generally this doesn’t feel right, and doesn’t feel representative of actual user sentiment.

With a bit more research I can see that I’m not the first person to hit this problem. Apparently the Stanford NLP service is trained using movie reviews and the resulting analysis doesn’t fit twitter feeds very well.

Again, I’m looking for another solution.


VADER

VADER is a Python module specifically written to perform Twitter analysis. On the face of it, it’s a less impressive piece of technology when compared to the Google and Stanford offerings, but being more focused on the task in hand I’m hoping for a good result.

The library is documented on its GitHub page here, and can be installed by executing the following command:

python -m pip install vaderSentiment

Working Example #2

As before, this example will grab the most recent 100 posts relating to Microsoft stock and hand them to the sentiment analyzer. This time as well as gathering per post scores we also compute and print an average.

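import twitter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Replace the XX fields with the keys from the Twitter application page.
api = twitter.Api(
 consumer_key='XX',
 consumer_secret='XX',
 access_token_key='XX',
 access_token_secret='XX')

# One analyzer can be reused for every post.
analyzer = SentimentIntensityAnalyzer()
scores = []

search = api.GetSearch(term='$MSFT', lang='en', result_type='recent', count=100, max_id='')
for t in search:
 sentence = t.text
 vs = analyzer.polarity_scores(sentence)
 print("%s %s" % (sentence, vs))
 # Keep the compound score, VADER's single overall value in the range -1 to 1.
 scores.append(vs['compound'])

# Average the per-post scores, guarding against an empty result set.
average = 0.0
for score in scores:
 average += score
if scores:
 average /= len(scores)
print("Average = %f" % average)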

Working Example #2 – Results

The results are much better. For the most part the scores seem to be a genuine reflection of sentiment! So far so good…
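
For turning the compound scores into something easier to eyeball, the VADER documentation suggests cut-offs of 0.05 and -0.05 for positive and negative. A small Python sketch of bucketing and averaging, using made-up scores in place of real results and helper names of my own, could look like this.

```python
def label_for(compound):
    """Bucket a VADER compound score using the commonly quoted 0.05 cut-offs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def summarize(compounds):
    """Return the average compound score and a count of posts per bucket."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for c in compounds:
        counts[label_for(c)] += 1
    average = sum(compounds) / len(compounds) if compounds else 0.0
    return average, counts


# Made-up scores standing in for real polarity_scores(...)['compound'] values.
print(summarize([0.6, -0.3, 0.0, 0.2]))
```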

Working Example #3 – Charting the results

As a further experiment I’m going to see what it would take to create charts from the results. There is a library for Python called matplotlib that seems to be more or less what I’m after. You can feed it a data set and with just a few commands it produces a chart of the data.

I’ve installed version 1.5.1 of the module because the latest version seems to be broken and fails with missing dependencies on Freetype and PNG modules, but this version seems to work just fine and can be installed like so:

python -m pip install matplotlib==1.5.1

This code shows how we would adapt the previous example code to produce a chart. In this case we are pulling data for each day across a specified time-frame. Then for each day we produce an average of the sentiment scores. Finally we hand the data to the charting module to plot sentiment against time, with dates along the bottom.

import twitter
from datetime import datetime
from datetime import timedelta
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Replace the XX fields with the keys from the Twitter application page.
api = twitter.Api(
 consumer_key='XX',
 consumer_secret='XX',
 access_token_key='XX',
 access_token_secret='XX')

start_date = "2017-05-05"
stop_date = "2017-05-13"
start = datetime.strptime(start_date, "%Y-%m-%d")
stop = datetime.strptime(stop_date, "%Y-%m-%d")

analyzer = SentimentIntensityAnalyzer()
scores = []
dates = []

while start < stop:

 start = start + timedelta(days=1)  # move on to the next day
 start_p1 = start + timedelta(days=1)  # end of the one-day search window

 start_str = start.strftime("%Y-%m-%d")
 start_p1_str = start_p1.strftime("%Y-%m-%d")

 day_score = 0.0
 day_count = 0

 search = api.GetSearch(term='$MSFT', lang='en', result_type='recent', count=1000, max_id='',
  since=start_str, until=start_p1_str)

 for t in search:
  vs = analyzer.polarity_scores(t.text)

  day_score += vs['compound']
  day_count += 1

 if day_count > 0:

  day_score = day_score / day_count

  print("%s : %f" % (start_str, day_score))

  # record the day's average so it can be charted
  dates.append(start)
  scores.append(day_score)

x = dates
y = scores

# plot sentiment against time, with dates along the bottom
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(x, y)
plt.gcf().autofmt_xdate()
plt.show()
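
An alternative to issuing one search per day would be to fetch the posts once and group them by date afterwards. The following sketch assumes each post has already been reduced to a (datetime, compound score) pair, which is not something the code above does, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime


def daily_averages(scored_posts):
    """Group (datetime, compound_score) pairs by day and average each day.

    Returns (dates, averages), sorted by date and ready to hand to plt.plot.
    """
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum of scores, post count]
    for when, score in scored_posts:
        bucket = totals[when.date()]
        bucket[0] += score
        bucket[1] += 1
    dates = sorted(totals)
    return dates, [totals[d][0] / totals[d][1] for d in dates]


# Made-up posts standing in for real search results.
posts = [
    (datetime(2017, 5, 6, 12, 0), 0.2),
    (datetime(2017, 5, 5, 9, 0), 0.5),
    (datetime(2017, 5, 5, 17, 30), -0.1),
]
print(daily_averages(posts))
```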

Working Example #3 – Results

The result looks something like this!

Interestingly Microsoft’s stock price climbed during this period and then fell somewhat on the 11th May. The sentiment chart doesn’t seem too far off, though in this case at best it seems to be a trailing rather than leading indicator of performance, or maybe it doesn’t mean anything at all.

Using CUDA 8 with Visual Studio Express 2015

Somehow I’ve never had any real exposure to CUDA. I’d quite like to experiment with CUDA a little and so would like to get my local environment set up to build CUDA programs.

There is an official CUDA installation guide here. On reading that I’m immediately concerned that support for the free versions of Visual Studio seems a little limited. I have Visual Studio 2015 Express installed, which I use more as a text editor than an IDE, though I occasionally use the C++ debugger, but the installer claims support for Visual Studio Community 2015 and above only, which I don’t have!

I did spend some time searching for Visual Studio Community 2015 but Microsoft make it very hard to discover and install old versions of Visual Studio. All the links on Google and MSDN point to 2017 which has no CUDA support at all at this time, and when you do finally get through to a page for downloading old versions it’s implied you need an MSDN subscription. I’m just going to go ahead and assume I can make it work. I only need the IDE for editing the source anyway. I’m happy to build and run the apps from the command line if I have to, and that approach probably results in more portable code anyway, for example if I ever want to push code into the cloud.

Next I install the toolkit. The installer makes changing the installation folders very difficult. There is no copy/paste for the default folders it displays, then the path selection dialog doesn’t allow you to type a path, and doesn’t allow you to create folders, so you then have to go into Windows Explorer, create the full paths to the target folders, and go back to the installer and browse to them. Or you can just accept a big loss of space on your SSD!

During installation I am told that no valid version of Visual Studio was installed and forced to confirm I was OK with this. Again I assume I’m going to make it work somehow right?!

Once installed we need a program to build. The following seems to be a fairly standard definition of a CUDA hello world app, perfect for testing that I can compile a source file. I stick this code into a .cu source file.

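#include <stdio.h>
#include <stdint.h>

__global__ void kernel(void)
{
	printf("Hello, world, from the GPU\n");
}

int main(void)
{
	printf("Hello, world, from the CPU\n");
	kernel<<<1, 1>>>();
	// wait for the kernel to finish so its printf output is flushed
	cudaDeviceSynchronize();
	return 0;
}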

Now from the Windows Command Line I run the following to set up the Visual Studio environment. Note that my installation is on D rather than C here.

call "D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" x86_amd64 

After a bit of experimentation I found that the following command line should compile the source, though as shown here it just gives me an error message relating to vcvars64.bat.

nvcc -o main.exe 
  --machine 64 
  --cl-version 2015 
  -ccbin "D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64"

nvcc fatal : Microsoft Visual Studio configuration file 'vcvars64.bat' could not be found for
 installation at 'D:/Program Files (x86)/Microsoft Visual Studio 14.0/VC/BIN/x86_amd64/../../..'

I’m going to assume the vcvars64.bat issue is some difference between the Express and standard versions of Visual Studio. Whatever the cause a bit of googling told me I should just create the file, which I do, putting the new file in “D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\amd64”, and containing the string “CALL setenv /x64”. After doing that the above command line completes without error, printing the following as it does so.
   Creating library main.lib and object main.exp

As well as the lib and exp file it also made an exe! When I run that exe from the command line I get the expected “Hello World” response!