Column-Slicing in PyCUDA

PyCUDA’s GPUArray class is designed to be an analogue to the Numpy ndarray, but while PyCUDA is still under heavy development it is still missing some crucial functionality that would make it a real drop-in replacement for Numpy. Most importantly, slicing of GPUArrays has only been recently implemented in version 2013.1, but unfortunately, the implementation is a little lacking in that it seems to implement general slicing, but actually doesn’t. There is also potential for a quite subtle bug that may alter the content of your array if you don’t pay attention.

First, let’s initialize our GPU and create a GPUArray to work on:

>>> import numpy as np
>>> import pycuda.autoinit
>>> from pycuda import gpuarray
>>> from pycuda.curandom import rand as curand

>>> float = np.float32
>>> height = 100
>>> width = 200
>>> X = curand((height, width), float)

>>> X.flags.c_contiguous               # New GPUArray is C-contiguous
True

If we take a column-slice of this array, the returned slice is no longer a contiguous block of memory:

>>> Y = X[:,:100]
>>> Y.flags.forc                       # Array is no longer contiguous
False

Unfortunately, most of the operations in GPUArray are not implemented for non-contiguous arrays, so using the slicing operator to get a column slice actually doesn’t have much utility yet. However, what’s worse, if we get a new view on the non-contiguous array, the flag signaling that the array is non-contiguous is discarded and the view treats the memory as contiguous:

>>> Y_view = Y.view()
>>> Y_view.flags.c_contiguous          # Magically, Y_view appears contiguous now
True
>>> Y_view.get() == X.get()[:,:100]       # compare to slicing on CPU
array([[ True,  True,  True, ...,  True,  True,  True],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]], dtype=bool)

Ouch, that’s not what wanted. The GPUArray.view() function does not remember the actual memory layout of Y and therefore the data in all rows after the first is wrong.

To work around this, you can use the pycuda.driver.Memcpy2D function to copy the data to a new contiguous array. Here is a function that creates a new GPUArray and performs a memory copy:

def extract_columns(mat, start=0, stop=None):
    dtype = mat.dtype
    itemsize = np.dtype(dtype).itemsize
    N, M = mat.shape
    m = stop - start

    assert mat.flags.c_contiguous
    assert start >= 0 and start <= M and stop >= 0 and stop <= M and stop > start

    new_mat = gpuarray.empty((N, m), dtype)

    copy = drv.Memcpy2D()
    copy.set_src_device(mat.gpudata)
    copy.src_x_in_bytes = start * itemsize    # Offset of the first column in bytes
    copy.set_dst_device(new_mat.gpudata)
    copy.src_pitch = M * itemsize   # Width of a row in bytes in the source array
    copy.dst_pitch = copy.width_in_bytes = m * itemsize  # Width of sliced row
    copy.height = N
    copy(aligned=True)

    return new_mat

Now we can use the function as follows:

>>> Y = extract_columns(X, 0, 100)
>>> np.all(Y.get() == X.get()[:,:100]) # Indeed, we got the slice we wanted
True

Comments