Tiago Ramalho Just another AI researcher

Dimensionality reduction 101: linear algebra, hidden variables and generative models

Suppose you are faced with a high dimensional dataset and want to find some structure in the data: often only a few underlying causes are at play, but they generate lots of different data points because of noise. How can we infer these causes? Here I’m going to cover the simplest method to do this inference: we will assume the data is generated by a linear transformation of the hidden causes. In this case it is quite simple to recover the parameters of that transformation and therefore determine the hidden (or latent) variables which represent those causes.

Mathematically, let’s say the data is given by a high-dimensional (size N) vector $u$ which is generated by a low-dimensional (size M) vector $v$ via

$$u=\sum_{i=1}^M W_i v_i + \text{noise}$$

where $W_i$ are N-dimensional vectors which encode the transformation. For simplicity I will assume $u$ to have mean value 0. Let’s assume the noise is gaussian. Then the distribution for the data is

$$P(u|v) = \frac{1}{\sqrt{(2\pi)^N\left|\Sigma\right|}}\exp\left(-\frac{1}{2}(u-W\cdot v)^T \Sigma^{-1}(u-W\cdot v)\right)$$

with $\Sigma$ a diagonal matrix. This means we assume that any correlation structure in the data arises from the transformation $W$, and not from noise correlations. This model is the defining assumption of the procedure known as Factor analysis.
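
To make the generative picture concrete, here is a tiny synthetic sketch of the model (the dimensions, weights and noise level are arbitrary choices for illustration):

import numpy as np

rng = np.random.RandomState(0)
N, M, n_samples = 10, 2, 500           # observed dimension, latent dimension, sample count

W = rng.normal(size=(N, M))            # mixing weights
v = rng.normal(size=(n_samples, M))    # hidden causes
noise = 0.1 * rng.normal(size=(n_samples, N))

u = np.dot(v, W.T) + noise             # each row is one observed data point u = W v + noise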

To fit the model we need both the weights $W$ and factors $v$ which best fit the data (in the case of no prior information for $v$, this corresponds to maximising the likelihood function). It is convenient to take the logarithm of the likelihood function:

$$\log P(u|v) = -\frac{1}{2}(u-W\cdot v)^T \Sigma^{-1}(u-W\cdot v) -\frac{1}{2} \log |\Sigma| - \frac{N}{2} \log 2\pi$$

So we are looking for $\arg\max_{W,v,\Sigma} \log P(u|v)$. The obvious way to obtain the parameters is to use expectation-maximisation: we compute the expected factors given $W$ and $\Sigma$, and then update those parameters given the new factors in such a way as to maximise the log-likelihood. Iterating this process converges to a final set of weights and factors, which will be a local maximum of the likelihood.
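
For the curious, here is a rough sketch of what that EM loop could look like for this model, written straight from the standard update equations (this is my own illustrative implementation, not the one sklearn uses, and it is not tuned for numerical robustness):

import numpy as np

def factor_analysis_em(U, M, n_iter=100, seed=0):
    # U: (n_samples, N) zero-mean data matrix; M: number of latent factors
    rng = np.random.RandomState(seed)
    n, N = U.shape
    W = rng.normal(size=(N, M))
    psi = np.var(U, axis=0)                       # diagonal of Sigma

    for _ in range(n_iter):
        # E-step: posterior covariance G and posterior means Ev of the factors
        G = np.linalg.inv(np.eye(M) + np.dot(W.T / psi, W))
        Ev = np.dot(np.dot(U, W / psi[:, None]), G)
        sum_vv = n * G + np.dot(Ev.T, Ev)         # sum over samples of E[v v^T]

        # M-step: update the weights and the diagonal noise variances
        W = np.dot(np.dot(U.T, Ev), np.linalg.inv(sum_vv))
        psi = np.mean(U ** 2, axis=0) - np.sum(W * np.dot(U.T, Ev), axis=1) / n

    return W, psi

# e.g. with the synthetic data from the sketch above:
# W_hat, psi_hat = factor_analysis_em(u - u.mean(axis=0), M=2)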

Let’s examine how this works with a very simple example: the Iris dataset. This is a 4 dimensional dataset, with each dimension corresponding to a physical measurement of the sepals and petals of a flower. There are 3 different species, and we can assume that each species’ genome codes for different physical features. So the question is: is there a ‘genome variable’ $v$ which maps into the observed physical features $u$? Let’s load the dataset directly using sklearn:

from sklearn import datasets

iris = datasets.load_iris()
x = iris.data # physical measurements
y = iris.target # actual species for visualization

And now we can plug this directly into the Factor analysis model:

import matplotlib.pyplot as plt
from sklearn.decomposition import FactorAnalysis

model = FactorAnalysis(n_components=2)
x_reduced = model.fit_transform(x)
plt.scatter(x_reduced[:, 0], x_reduced[:, 1], c=y, cmap=plt.cm.Paired)

Let’s visualize the hidden factors for the actual data points we have by inverting the transformation. Below you can see the data with $v_1$ in the x-axis and $v_2$ in the y-axis.

Factor analysis with 2 components.

In fact, it seems that the first component $v_1$ manages to accurately describe the variation between the 3 different species. Let’s try Factor analysis with just one component and see what comes out (I added some randomness in the y-direction to visualize the points better).

Factor analysis with 1 component.

With 1 component we obtain an estimate for the noise variances $\mathrm{diag}(\Sigma) = [ 0.16350346, 0.15436415, 0.00938691, 0.04101233]$, which correspond to the variation in the data that is not explained by the linear transformation of $v$. What if we forced all diagonal elements of $\Sigma$ to be identical? In that case the vectors $W_i$ turn out to form an orthogonal basis (i.e. each one is perpendicular to the others). This in turn means that when we project the data onto the space of the hidden variables, each $v_i$ will be uncorrelated with the others (which does not happen with Factor analysis).

Linear algebra explains why this happens: after some calculation, one can show that the resulting $W_i$ are the principal eigenvectors (i.e. those associated with the largest eigenvalues) of the sample covariance matrix $C= \frac{1}{n}\sum_{j=1}^n (u^j - \bar{u}) (u^j-\bar{u})^T$, where $n$ is the number of samples and the superscripts denote the sample index. This method is known as Probabilistic Principal Component Analysis (PPCA), and in the limit $\Sigma \to 0$ we recover the popular vanilla PCA.

Typically we sort the eigenvectors such that $W_1$ corresponds to the largest eigenvalue of $C$ meaning that the hidden variable $v_1$ explains the most variation in the observed data. Now $W_2$ must be perpendicular to $W_1$, which means that $v_2$ will explain the most remaining variation in the data, and so on. If we allow $M=N$ then it would be possible to completely reconstruct the data from $v$.
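
We can check this claim numerically on the Iris data: the components found by sklearn's PCA should match the leading eigenvectors of the sample covariance matrix up to sign (a quick sketch, reloading the data so it stands on its own):

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

x = datasets.load_iris().data
C = np.cov((x - x.mean(axis=0)).T)      # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # sort eigenvectors by decreasing eigenvalue
top2 = eigvecs[:, order[:2]].T

pca = PCA(n_components=2).fit(x)
print np.abs(top2)
print np.abs(pca.components_)           # same vectors, up to sign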

PCA with 2 components.

In the Iris dataset we can observe that the subspace found by probabilistic PCA is different from the one found by Factor analysis. We can confirm that the PCA components are orthogonal by printing $W_1 \cdot W_2$ and comparing with Factor analysis:

import numpy as np
from sklearn.decomposition import PCA

model = PCA(n_components=2)
x_reduced = model.fit_transform(x)
print np.dot(model.components_[0], model.components_[1])
# 3.43475248243e-16

model = FactorAnalysis(n_components=2)
x_reduced = model.fit_transform(x)
print np.dot(model.components_[0], model.components_[1])
# 0.146832660483

In the case of the handwritten digits, the first two components found by PCA appear to capture more of the structure of the problem than the first two factors found by factor analysis. However, since this is a strongly nonlinear problem, neither of them can accurately separate all 10 digit classes.

Digits projected with FA and with PCA.
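
For reference, a quick sketch that reproduces this comparison (my own code, not necessarily the exact script behind the figures; note it reuses the names x and y for the digits data):

from sklearn import datasets
from sklearn.decomposition import PCA, FactorAnalysis
import matplotlib.pyplot as plt

digits = datasets.load_digits()
x = digits.data      # 64-dimensional pixel vectors
y = digits.target

plt.figure(figsize=(10, 4))
for i, model in enumerate([FactorAnalysis(n_components=2), PCA(n_components=2)]):
    x_reduced = model.fit_transform(x)
    plt.subplot(1, 2, i + 1)
    plt.scatter(x_reduced[:, 0], x_reduced[:, 1], c=y, cmap=plt.cm.Paired, s=10)
    plt.title(model.__class__.__name__)
plt.show()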

If we look at the equivalent log-likelihood for PCA we see a very deep connection: up to constants, $\log P(u|v) = -\frac{1}{2}(u-W\cdot v)^2$. The $W, v$ that come out of PCA (i.e. that maximise the log-likelihood) therefore also minimise the least squares reconstruction error of the simplest linear model. From a data compression perspective, this means we can compress the information contained in $u$ by saving only $W$ and $v$. This pays off when we have a large number of samples of $u$, since $W$ is shared by all samples and each $v$ is smaller than each $u$.
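
As a rough example with the digits data used below: it contains 1797 samples of dimension 64, so the raw data amounts to $1797 \times 64 \approx 115000$ numbers, while with $M=10$ components we only keep $64 \times 10$ numbers for $W$ plus $1797 \times 10$ for the factors, about $18600$ numbers in total, a roughly six-fold reduction (at the cost of some reconstruction error, of course).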

With the fitted distribution in hand, we can even generate new samples of $u$: we just draw some samples of the low-dimensional $v$ and pass them through the linear model. Let’s see what happens when we apply this idea to the digits dataset, sampling $v$ from a gaussian approximation to the distribution of the factors of the actual data.

# project the data into the latent space of the fitted model
v = model.transform(x)
# fit a gaussian to the empirical distribution of the factors and draw 100 samples
v_samples = np.random.multivariate_normal(
    np.mean(v, axis=0), np.cov(v.T), size=100)
# map the sampled factors back into pixel space
x_samples = model.inverse_transform(v_samples)
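
To actually look at the generated samples, something like the following should do (a quick sketch, assuming x_samples comes from the snippet above and that we drew 100 of them):

fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for ax, sample in zip(axes.ravel(), x_samples):
    ax.imshow(sample.reshape(8, 8), cmap=plt.cm.gray, interpolation='nearest')
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()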

Digits from generative distribution. Given that we are applying a linear model to a nonlinear problem, I'm really happy to be able to make out a few digits!

To finish this long post, let’s look at what happens if we don’t assume a flat prior for $v$. Two common options are to take $P(v)$ to be a gaussian or a laplace (double exponential) distribution. The log-posterior (getting rid of constants, as usual) then becomes:

$$\log P(v|u) = -\frac{1}{2}(u-W\cdot v)^2 - \alpha v^2 - \beta |v|$$

Those two terms correspond to the often used $L_2$ and $L_1$ regularization, which for linear models go by the fancy names of ridge regression and lasso respectively (not really sure why they need fancy names). Writing this in a more CS notation, our problem consists of minimising:

$$||u-W\cdot v||^2 + \alpha ||v||^2 + \beta ||v||_1$$
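
To see the connection with ordinary regularized regression more directly: for a fixed $W$, finding the factors of a single sample is exactly a ridge or lasso problem. A tiny sketch with made-up data (the weights, sample and alpha values are all arbitrary):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
W_fixed = rng.normal(size=(64, 10))     # pretend these are learned weights
u_sample = np.dot(W_fixed, rng.normal(size=10)) + 0.1 * rng.normal(size=64)

v_l2 = Ridge(alpha=0.1, fit_intercept=False).fit(W_fixed, u_sample).coef_
v_l1 = Lasso(alpha=0.1, fit_intercept=False).fit(W_fixed, u_sample).coef_
print v_l1    # the L1 penalty shrinks the factors and can set some exactly to zero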

When we learn $W$ and a sparse $v$ for all samples jointly, this problem goes by the name of Sparse PCA, and we can find an implementation in scikit-learn. Let’s see what happens when we apply it to the digits dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import SparsePCA
from sklearn import datasets

dset = datasets.load_digits()
x = dset.data
y = dset.target

model = SparsePCA(n_components=10, alpha=0)
x_reduced = model.fit_transform(x)

print np.dot(model.components_[0], model.components_[1])

n_row = 5
n_col = 2
plt.figure(figsize=(6, 15))
for i in xrange(10):
    comp = model.components_[i]
    plt.subplot(n_row, n_col, i + 1)
    vmax = max(comp.max(), -comp.min())
    plt.imshow(comp.reshape([8, 8]), cmap=plt.cm.gray,
               vmin=-vmax, vmax=vmax, interpolation='nearest')
    plt.xticks(())
    plt.yticks(())
plt.tight_layout()
plt.savefig('digits_SPCA_rec.png')

Immediately we notice that we lose the orthogonality of the components, since nothing in the new objective enforces it. Let’s see what they look like:

10 components found by sparse PCA.

Here is a gist with all the code used for this post.

Parallel programming with opencl and python: parallel scan

This post continues the subject of how to implement common algorithms on a parallel processor, which I started to discuss here. Today we come to the second pattern, the scan. An example is the cumulative sum, where you iterate over an array and calculate the sum of all elements up to the current one. Like reduce, the algorithm we’ll talk about is not exclusive to the sum, but works for any binary associative operation (max is another common example). There are two ways to do a parallel scan: the Hillis-Steele scan, which needs $\log n$ steps, and the Blelloch scan, which needs $2 \log n$ steps. The Blelloch scan is useful when you have more data than processors, because it only needs to do $\mathcal{O}(n)$ operations (this quantity is also referred to as the work efficiency), while the Hillis-Steele scan needs $\mathcal{O}(n \log n)$ operations. So let’s look at how to implement both of them with opencl kernels.

Hillis Steele

This is the simplest scan to implement, and it is also quite simple to understand. It is worth noting that this is an inclusive scan, meaning the $i$-th element of the output array $o$ is given in terms of the input array $a$ as follows:

$$o_i = a_0 + a_1 + \cdots + a_i$$

Naturally, you could replace the summation with any appropriate operator. Rather than describing the algorithm with words, I think a picture will make it much clearer than I could:


Visualization of the Hillis-Steele inclusive scan. Credit: nvidia.

As you can see, the key is for each element to add the value of its neighbor $2^d$ positions to its left, or to do nothing if no such neighbor exists. This is quite simple to translate into an OpenCL kernel:

#define SWAP(a,b) {__local float *tmp=a;a=b;b=tmp;}
__kernel void scan(__global float *a,
                   __global float *r,
                   __local float *b,
                   __local float *c)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    /* copy this work-item's element into both local buffers */
    c[lid] = b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* s is the distance to the left neighbor; it doubles every step */
    for(uint s = 1; s < gs; s <<= 1) {
        if(lid > (s-1)) {
            c[lid] = b[lid]+b[lid-s];
        } else {
            c[lid] = b[lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        SWAP(b,c);
    }
    r[gid] = b[lid];
}

The loop variable s represents the neighbor distance: it is doubled every iteration until it reaches the work-group size. Note the use of the SWAP macro: it swaps the b and c pointers, which always denote the current step (c) and the previous step (b), without needing to copy memory.
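
For completeness, here is roughly what the host side could look like with pyopencl (I'm assuming the kernel above is saved in a file called scan_hillis_steele.cl and that the whole array fits into a single work-group):

import numpy as np
import pyopencl as cl

n_threads = 32
a = np.random.rand(n_threads).astype(np.float32)
r = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
r_buf = cl.Buffer(ctx, mf.WRITE_ONLY, r.nbytes)
loc_b = cl.LocalMemory(4 * n_threads)    # the two local scratch buffers, 4 bytes per float
loc_c = cl.LocalMemory(4 * n_threads)

prg = cl.Program(ctx, open('scan_hillis_steele.cl').read()).build()
prg.scan(queue, (n_threads,), (n_threads,), a_buf, r_buf, loc_b, loc_c)

cl.enqueue_copy(queue, r, r_buf)
print np.allclose(r, np.cumsum(a))       # should print True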

Blelloch

The Blelloch scan is an exclusive scan, which means the sum is computed up to the current element but excluding it. In practice the result is the same as the inclusive scan, but shifted by one position to the right:

$$o_0 = 0, \qquad o_i = a_0 + a_1 + \cdots + a_{i-1}$$

The idea of the algorithm is to avoid repeating summations of the same numbers. As an example, if you look at the picture above you can see that to calculate $o_5$ we need to add together $\{x_5, x_4, x_2+x_3, x_1+x_0\}$, which means we are essentially repeating the calculation of $o_3$ for nothing. To avoid this we need to come up with a non-overlapping set of partial sums that each calculation can reuse. That’s what this algorithm does!

As before, the algorithm is better explained with a picture or two:


The up sweep portion of the Blelloch scan. Here the partial sums are calculated. Credit: nvidia.


The down sweep portion of the Blelloch scan. Here the partial sums are used to calculate the final answer.

You can see how the up sweep part consists of calculating partial sums, and the down sweep combines them in such a way that we end up with the correct results in the correct memory positions. Let’s see how to do this in OpenCL:

__kernel void scan(__global float *a,
                   __global float *r,
                   __local float *b,
                   uint n_items)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);
    uint dp = 1;

    /* each work-item loads two elements into local memory */
    b[2*lid] = a[2*gid];
    b[2*lid+1] = a[2*gid+1];

    /* up sweep: build the tree of partial sums in place */
    for(uint s = n_items>>1; s > 0; s >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(lid < s) {
            uint i = dp*(2*lid+1)-1;
            uint j = dp*(2*lid+2)-1;
            b[j] += b[i];
        }

        dp <<= 1;
    }

    /* clear the last element before the down sweep */
    if(lid == 0) b[n_items - 1] = 0;

    /* down sweep: push the partial sums back down the tree */
    for(uint s = 1; s < n_items; s <<= 1) {
        dp >>= 1;
        barrier(CLK_LOCAL_MEM_FENCE);

        if(lid < s) {
            uint i = dp*(2*lid+1)-1;
            uint j = dp*(2*lid+2)-1;

            float t = b[j];
            b[j] += b[i];
            b[i] = t;
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    r[2*gid] = b[2*lid];
    r[2*gid+1] = b[2*lid+1];
}

It took me a little while to wrap my head around both phases of the algorithm, but the final code is pretty similar to the Hillis-Steele one. There are two loops instead of one and the indexing is a bit tricky, but comparing the code to the pictures should make it straightforward. You can also find a CUDA version in the nvidia paper from which I took the pictures.
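
If the indexing still feels opaque, a serial NumPy transcription of the two phases might help (my own sketch, assuming the input length is a power of two):

import numpy as np

def blelloch_scan(a):
    b = np.array(a, dtype=np.float64)
    n = len(b)

    # up sweep: build the tree of partial sums in place
    dp, s = 1, n >> 1
    while s > 0:
        for lid in range(s):
            i = dp * (2 * lid + 1) - 1
            j = dp * (2 * lid + 2) - 1
            b[j] += b[i]
        dp <<= 1
        s >>= 1

    b[n - 1] = 0    # clear the last element

    # down sweep: push the partial sums back down the tree
    s = 1
    while s < n:
        dp >>= 1
        for lid in range(s):
            i = dp * (2 * lid + 1) - 1
            j = dp * (2 * lid + 2) - 1
            t = b[j]
            b[j] += b[i]
            b[i] = t
        s <<= 1
    return b

a = np.arange(8, dtype=np.float32)
print blelloch_scan(a)                            # [ 0. 0. 1. 3. 6. 10. 15. 21.]
print np.concatenate(([0], np.cumsum(a)[:-1]))    # same exclusive scan computed with numpy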

Parallel programming with opencl and python: parallel reduce

Once you know how to use python to run opencl kernels on your device (read Part I and Part II of this series) you need to start thinking about the programming patterns you will use. While many tasks are inherently parallel (like calculating the value of a function for N different inputs) and you can just run N copies of your kernel on your processors, most interesting tasks involve dependencies in the data. For instance, if you want to sum N numbers in the simplest possible way, the thread doing the summing needs to know about all N numbers, so you can only run one thread, leaving most of your cores unused. So we need to come up with clever ways to decompose the problem into individual parts which can be run in parallel, and then combine them all at the end.

This is the strong point of the udacity course Intro to Parallel Programming, which is available for free. If you really want to learn about this topic in depth you should watch the whole course. Here I will only show how I implemented some of the algorithms discussed there in OpenCL (the course uses CUDA). I’ll discuss three algorithms: reduce, scan and histogram. They show how you can use some properties of mathematical operators to decompose a long operation into a series of many small, independent operations.

Reduce is the simplest of the three. Let’s say you have to sum N numbers. On a serial computer, you’d create a temporary variable and add the value of each number to it in turn. It looks like there is not much to parallelize here. The key is that addition is a binary associative operator: $a + b + c + d = (a+b) + (c+d)$. So if I have two cores I can add the first two numbers on one, the last two on the other, and then sum the two intermediate results. You can convince yourself that we only need $\mathcal{O}(\log N)$ steps if we have enough parallel processing units, as opposed to $\mathcal{O}(N)$ steps in the serial case. A reduction kernel in OpenCL is straightforward, assuming N is no larger than the number of threads in a single work-group:

__kernel void reduce(__global float *a,
                     __global float *r,
                     __local float *b)
{
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    /* copy one element per work-item into local memory */
    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* each step halves the number of active threads: the first half
       of the block adds the second half to itself */
    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
          b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    /* one partial result per work-group goes back to global memory */
    if(lid == 0) r[wid] = b[lid];
}

Full code here. First we copy the numbers into shared memory. To make sure we always access memory in a contiguous fashion, we then take the first N/2 numbers and add to them the other N/2, so now we have N/2 numbers to add up. Then we take half of those and add the next half, until there is only one number remaining. That’s the one we copy back to main memory.


Reduction stages. Image from nvidia slides.

If N is bigger than the number of threads we can fit in a single work-group, we need to call the kernel multiple times. Each work-group computes its partial result, and then we run reduce again on the array of partial results until we have our final number. In the python code I linked to, you can see how we enqueue the kernel twice, the second time with fewer threads:

'''run kernels the first time for all N numbers, this will result
in N/n_threads numbers left to sum'''
evt = prg.reduce(queue, (N,), (n_threads,), a_buf, r_buf, loc_buf)
evt.wait()
print evt.profile.end - evt.profile.start

'''because I'd set N=n_threads*n_threads, there are n_threads numbers
left to sum, we sum those here, leaving 1 number'''
evt = prg.reduce(queue, (n_threads,), (n_threads,), r_buf, o_buf, loc_buf)
evt.wait()
print evt.profile.end - evt.profile.start

'''copy the scalar back to host memory'''
cl.enqueue_copy(queue, o, o_buf)
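
As a quick sanity check (assuming a and o are the host arrays defined in the linked code), the device result should agree with numpy up to float32 rounding:

print o[0], a.sum()
print np.allclose(o[0], a.sum(), rtol=1e-4)    # the summation order differs, so allow a small tolerance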

Parallel programming with opencl and python: vectors and concurrency

Last time we saw how to run some simple code on the gpu. Now let’s look at some aspects of parallel programming we should be aware of. Since gpus are massively parallel processors, you’d expect that you could write your kernel code for a single piece of data and, by running enough copies of the kernel, max out your device’s performance. Well, you’d be wrong! I’m going to focus on the three most obvious issues which could hamper your parallel code’s performance:

  • Each of the individual cores is actually a vector processor, which means it can perform an operation on multiple numbers at a time.
  • At some point the individual threads might need to write to the same position in memory (i.e. to accumulate a value). To make sure the result is correct, they need to take turns doing it, which means they spend time waiting for each other doing nothing.
  • Most code is limited by memory bandwidth, not compute performance. This means that the gpu can’t get the data to the processing cores as fast as they can actually perform the computation required.

Let’s look at the vector instructions first. Say you need to add a bunch of floating point numbers. You could have each core add one number at a time, but in reality you can do better: each processing core can perform that addition on multiple numbers at a time (currently it is common to process the equivalent of 4 floats at a time, or a 128 bit vector). You can determine the maximum vector width for each type by querying the CL_DEVICE_PREFERRED_VECTOR_WIDTH_ device properties (see http://documen.tician.de/pyopencl/runtime.html).
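
In pyopencl that property can be queried like this (a small sketch; the numbers you get will of course depend on your hardware):

import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        width = dev.get_info(cl.device_info.PREFERRED_VECTOR_WIDTH_FLOAT)
        print dev.name, width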

To take advantage of this feature, openCL defines types such as float4 and int2 with overloaded math operations. Let’s take a look at a map operation that uses vectors:

__kernel void sq(__global float4 *a,
__global float *c)
{
    int gid = get_global_id(0);
    float4 pix = a[gid];
    c[gid] = .299f * pix.x + .587f * pix.y + .114f * pix.z;
}

Full code here. A float4 just corresponds to a memory position with 4 floats next to each other. The data structure has 4 fields, x, y, z and w, which are accessors for each of those memory positions. So if you want to create an array of floats to be processed by this code, just pack each set of 4 floats you want to access together contiguously. Interestingly, if you read float3 vectors from an array, openCL will still jump between every set of 4 floats, which means you lose one number and might end up with nasty bugs. Either you leave every 4th position unused, as I did here, or you start doing your own pointer arithmetic. A small warning: if you choose to load arbitrary memory positions your memory bandwidth might suffer, because it is faster to read 32 (or some multiple) contiguous bytes at a time (if you are interested in knowing more about this topic, google memory coalescing).
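
On the host side, preparing data in this layout just means leaving every 4th slot empty when building the array. A small numpy sketch matching the kernel above (the image size is made up):

import numpy as np

h, w = 64, 64
rgb = np.random.rand(h * w, 3).astype(np.float32)    # some fake RGB pixels

pixels = np.zeros((h * w, 4), dtype=np.float32)      # float4 layout: x, y, z plus an unused w
pixels[:, :3] = rgb
gray = np.empty(h * w, dtype=np.float32)             # one float per pixel for the kernel output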

This previous code was still just a map, where every computation result would be written to a different memory position. What if all threads want to write to the same memory position? OpenCL provides atomic operations, which make sure no other thread is writing to that memory position before doing it. Let’s compare a naive summation to a memory position with atomic summation:

import pyopencl as cl
import numpy as np

n_threads = 100000
N = 10
a = np.zeros(10).astype(np.int32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.COPY_HOST_PTR, hostbuf=a)

prg = cl.Program(ctx, """
    __kernel void naive(__global int *a,
    int N)
    {
      int gid = get_global_id(0);
      int pos = gid % N;
      a[pos] = a[pos] + 1;
    }
    __kernel void atomics(__global int *a,
    int N)
    {
      int gid = get_global_id(0);
      int pos = gid % N;
      atomic_inc(a+pos);
    }
    """).build()

n_workers = (n_threads,)

naive_res = np.empty_like(a)
evt = prg.naive(queue, n_workers, None, a_buf, np.int32(N))
evt.wait()
print evt.profile.end - evt.profile.start
cl.enqueue_copy(queue, naive_res, a_buf)
print naive_res

a_buf = cl.Buffer(ctx, mf.COPY_HOST_PTR, hostbuf=a)
atomics_res = np.empty_like(a)
evt = prg.atomics(queue, n_workers, None, a_buf, np.int32(N))
evt.wait()
print evt.profile.end - evt.profile.start
cl.enqueue_copy(queue, atomics_res, a_buf)
print atomics_res

The first kernel runs fast, but returns the wrong result. The second kernel returns the correct result, but is an order of magnitude slower, because the threads have to wait for each other:

1582240
[25 25 25 25 25 25 25 25 25 25]
31392000
[10000 10000 10000 10000 10000 10000 10000 10000 10000 10000]

There is really no better way to do concurrent writes, so you’ll have to live with the slowdown if you absolutely need them. But often you can restructure your algorithm to minimize the number of concurrent writes, which will speed up your code. Another way to avoid concurrency issues is to take advantage of the memory hierarchy in a gpu, which we’ll discuss next.

As in a cpu, there is a hierarchy of memory pools, with the closer, faster pools being smaller and the off-die ‘slow memory’ being relatively large (a few GBs as of now). In openCL these different memory spaces have the following names: private memory refers to the registers in each core, and is initialized in the kernel code itself; shared memory (what openCL itself calls local memory, hence the __local keyword) refers to a cache in each compute unit, shared by all cores within that compute unit; and global memory refers to the off-die ram (there are other memory types too, but let’s stick to the basics for now).


Diagram of the memory hierarchy in a typical GPU. Credit: AMD.

In the previous code examples we always read from and wrote to global memory, but every load from there costs hundreds of clock cycles of waiting. So it is better to load the initial data from global memory, use shared memory to store intermediate results, and then write the final result back to global memory, to be read by the host. Let’s look at a program which blurs an image by dividing it into blocks, loading each block into a compute unit’s shared memory and having each core blur one specific pixel.

__kernel void blur(__global uchar4 *c,
            __global uchar4 *res,
            __local uchar4 *c_loc,
            uint w, uint h)
{
	uint xg = get_global_id(0);
	uint yg = get_global_id(1);
	uint xl = get_local_id(0)+1;
	uint yl = get_local_id(1)+1;
	uint wm = get_local_size(0)+2;
	uint wl = get_local_size(0);
	uint hl = get_local_size(1);
	c_loc[xl+wm*yl] = c[xg+w*yg];
	uint left = (uint)clamp((int)xg-1, 0, (int)w-1);
	if(xl==1) c_loc[0+wm*yl] = c[left+w*yg];
	uint right = (uint)clamp((int)xg+1, 0, (int)w-1);
	if(xl==wl) c_loc[(wl+1)+wm*yl] = c[right+w*yg];
	uint top = (uint)clamp((int)yg-1, 0, (int)h-1);
	if(yl==1) c_loc[xl+wm*0] = c[xg+w*top];
	uint bot = (uint)clamp((int)yg+1, 0, (int)h-1);
	if(yl==hl) c_loc[xl+wm*(hl+1)] = c[xg+w*bot];
	barrier(CLK_LOCAL_MEM_FENCE);
	uchar4 blr = c_loc[xl+wm*(yl-1)]/(uchar)5 +
	          c_loc[xl-1+wm*yl]/(uchar)5 +
	          c_loc[xl+wm*yl]/(uchar)5 +
	          c_loc[xl+1+wm*yl]/(uchar)5 +
	          c_loc[xl+wm*(yl+1)]/(uchar)5;
	res[xg+w*yg] = blr;
}

Whoa, that’s a lot of lines! But they are all super simple, so let’s go through the code. In the function declaration you can see the __local uchar4 pointer: that points to the shared memory we are going to use. Unfortunately we cannot initialize it from the host (we can only copy values to global memory), so we use lines 13-21 to read values from global memory into the local buffer, taking into account the boundary conditions. In this code the threads are distributed in a 2d configuration, so each thread has an id in both the x and y dimensions (notice the argument to get_global_id and get_local_id denoting the dimension), and we can read off the x and y positions we want to process, plus the block size, directly in lines 6-12.

Once the block values have been copied to shared memory, line 22 tells all the threads within one work-group to wait for each other. This makes sure everyone has loaded their corresponding values into shared memory before the code continues, which is important because each thread will read many of those values in the following computation. Line 23 is a simple five-point averaging stencil which does the actual blurring. Finally, we write the result back into global memory.

So how do we set up this code in pyopencl? I won’t reproduce the full code here, so let’s just look at the few important lines:

n_local = (16,12)
nn_buf = cl.LocalMemory(4*(n_local[0]+2)*(n_local[1]+2))
n_workers = (cat.size[0], cat.size[1])

prg.blur(queue, n_workers, n_local, pix_buf, pixb_buf, nn_buf, np.uint32(cat.size[0]), np.uint32(cat.size[1]))

LocalMemory is how we tell pyopencl we are going to use some shared memory in our kernel: we need to tell it how many bytes to reserve for this particular buffer. In this case that is the block width plus two (for the boundaries) times the block height plus two, times 4 bytes for the size of a uchar4. n_workers is a tuple with the picture’s width and height, which means we launch one thread per pixel of the image, distributed in the aforementioned 2d configuration. n_local is the local block size, i.e. how many threads will share a compute unit and its shared memory. I encourage you to look at the full code and run it!

Next time I’ll cover some practical algorithms we can run on a gpu.

Parallel programming with opencl and python

In the next few posts I’ll cover my experiences with learning how to write efficient parallel programs for gpus using opencl. Because the machine I got was a mac pro with the top of the line gpus (7 teraflops) I needed to use opencl, which is a bit complex and confusing at first glance. It also requires a lot of boilerplate code, which makes it really hard to just jump in and start experimenting. I ultimately decided to use pyopencl, which lets us do the boring boilerplate stuff in just a few lines of python and focus on the actual parallel programs (the kernels).

First, a few pointers on what I read. A great introduction to the abstract concepts of parallel programming is the udacity course Introduction to parallel programming. It uses C and CUDA to illustrate the concepts, which means you can’t directly apply what you see there on a computer with a non-nvidia gpu. To learn the opencl api itself, I used the book OpenCL in Action: How to Accelerate Graphics and Computation. As for pyopencl, the documentation is a great place to start. You can also find all the python code I used on github.

I assume you know the basics of how gpus work and what they are useful for. My intention is to ‘translate’ the existing tutorials into pyopencl, which lets you start running code much sooner than any C based framework. Additionally, because we are using openCL, we can run our simple code on most computers. To start with, let’s look at how to access the data structures which contain information about the available openCL devices on our computer:

import pyopencl as cl
plat = cl.get_platforms()
plat[0].get_devices()

In a given computer you can have different implementations of OpenCL (e.g. an amd driver and an nvidia driver); these are known as platforms. Usually you’ll only have one platform on your computer. A platform contains the devices it is responsible for, so by querying the platform data structure we can look at all the devices in our system. The mac pro shows the following list of available devices:

[<pyopencl.Device 'Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz' on 'Apple' at 0xffffffff>, <pyopencl.Device 'ATI Radeon HD - FirePro D700 Compute Engine' on 'Apple' at 0x1021c00>, <pyopencl.Device 'ATI Radeon HD - FirePro D700 Compute Engine' on 'Apple' at 0x2021c00>]

To actually run something on these devices, we need to create a context to manage the queues of kernels which will be executed there. So say I want to run something on my first gpu. I’d create a context with:

import pyopencl as cl
plat = cl.get_platforms()
devices = plat[0].get_devices()
ctx = cl.Context([devices[1]])
ctx.get_info(cl.context_info.DEVICES)

The final line queries the device associated with the context we just created:

[<pyopencl.Device 'ATI Radeon HD - FirePro D700 Compute Engine' on 'Apple' at 0x1021c00>]

Why would we need to query the devices in a context if we put the devices there in the first place? One reason is that you can create a context without bothering to look up any platforms or devices beforehand. Just run

import pyopencl as cl
ctx = cl.create_some_context()

And you’re done! When you run the script, a prompt will ask you to pick one of the available devices, or you can set an environment variable to specify which one you want by default. In the following posts I’ll always use this method to create a context, but if you want more control over which devices you choose, this example is quite enlightening.
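
The environment variable in question is PYOPENCL_CTX. For example, to always pick device 1 on platform 0 without being prompted, you can do something like this (the '0:1' value is just an example for my machine):

import os
os.environ['PYOPENCL_CTX'] = '0:1'    # platform 0, device 1

import pyopencl as cl
ctx = cl.create_some_context()        # no interactive prompt now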

Now that we know how to access the devices, let’s take a look at how to run code there. I’ll start with a simple parallel pattern, the map. We are going to apply a function to each of the data points independently, which allows for maximal parallelism. Here is the code:

import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a,
    __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)

cl.enqueue_copy(queue, res, dest_buf)

print a, res

In line 7 we create the context, as before. Then we create a queue in line 8, which is what schedules kernels to run on the device. Now let’s skip a few lines and look at the actual opencl code, in lines 14-21. You can see that the opencl code itself is written in the C programming language and is passed to the program object as a string. In a real project we should keep this code in a .cl file separate from the python project and read it from there, but for these simple examples I’ll leave the code as a string. Once the program object is initialized with some code, we call its build method to compile it to a binary native to the gpu.

You can see from the kernel’s signature that it expects to receive pointers to two memory locations. These point to the gpu’s main memory and must be initialized before the kernel runs. That’s what lines 10-12 are for: they let pyopencl know that two blocks of memory must be allocated on the gpu before the kernel is run and, if necessary, which values should be copied into that memory (the hostbuf parameter, which points to the source array in the host’s main memory). The memory is only actually allocated and copied when the kernel reaches the top of the queue, right before it runs.

We add the kernel to the queue in line 23, telling pyopencl which queue to add it to; then how many instances of the kernel to run (we want to spawn many instances to take advantage of the parallel nature of the gpu) and how they will be distributed (the None parameter, which we’ll cover in a later post); and finally the arguments which should be passed (the memory locations). Then, in line 25 we copy the values at res back to an array in host memory. If we did this in C we would have needed 100+ lines of code by now, so I’m really happy pyopencl exists.

Finally let’s look at the kernel code itself, the actual opencl code and see what it does:

__kernel void sq(__global const float *a, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] * a[gid];
}

Each of the tiny processors on the GPU will run a copy of this code. First we get the thread’s unique global id, which tells the processor which piece of memory it should work on. Then it loads the corresponding value from the a array, squares it, and stores the result in the matching position of the c array. Simple! Next time we’ll look at some more advanced operations we can perform on data.