This cuts the runtime by about 70%, which is nice, and it's
a better algorithm for the job anyway.
I've also refactored the Convolution layer so that there's
only one actual implementation instead of two, and with that
provided a few more instances for 2D and 3D input and output
shapes.
Updates to the README and mnist show higher levels of composition.