<b>OpenCL</b>: this site contains OpenCL notes, tutorials, benchmarks, news.<br />
<br />
<b>Dynamic parallelism in OpenCL 2.0</b> (2013-09-29)<br />
<br />
The provisional specification of OpenCL 2.0 was released a few months ago. One of its most interesting features is support for dynamic parallelism. In the CUDA world this has existed for about a year, but only on the most expensive devices with compute capability 3.5 (Titan, GTX 780; both built on the GK110 chip). On the AMD side it is a slightly different story: they haven't said anything about dynamic parallelism, but they did introduce GCN 2.0, which might support it. In addition they introduced Mantle, a new GPU API which promises up to 9 times more draw calls than comparable APIs (OpenGL, DirectX). This hints that draw calls might be issued from the GPU itself.<br />
<br />
How will dynamic parallelism be used? Very simply: kernels will enqueue other kernels to a device queue:<br />
<pre class="prettyprint linenums">int enqueue_kernel (
    queue_t queue,
    kernel_enqueue_flags_t flags,
    const ndrange_t ndrange,
    void (^block)(void));   // e.g. my_block_A below
</pre>
The first argument is the device queue. It is created from the host (as an on-device queue); inside the kernel the default one can be obtained with get_default_queue().<br />
<br />
Take care: this function is asynchronous, so the parent kernel's work-items do not block waiting for the child kernel. The second argument of enqueue_kernel defines whether the child kernel may start running while the parent kernel is still running (CLK_ENQUEUE_FLAGS_NO_WAIT), must wait for the parent kernel to finish (CLK_ENQUEUE_FLAGS_WAIT_KERNEL), or must wait only for the enqueuing work-group of the parent kernel to finish (CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP).<br />
<br />
The third argument defines how many work-items to run. For example, ndrange_1D(global_work_size, local_work_size) can be used.
<br />
<br />
The last argument is the actual function to run as a kernel, defined using the Block syntax:<br />
<pre class="prettyprint linenums">void (^my_block_A)(void) =
^{ size_t id = get_global_id(0);
b[id] += a[id];
};
</pre>
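Putting the pieces together, here is a minimal sketch of a parent kernel that enqueues a child kernel and checks the return value. It is my illustration, not from the specification: it assumes a default on-device queue was created on the host, and the names a, b and n are placeholders:<br />
<pre class="prettyprint linenums">kernel void parent(global int* a, global int* b, int n){
    // let a single work-item launch the child grid
    if(get_global_id(0) == 0){
        queue_t q = get_default_queue();   // default device queue (assumed created on the host)
        ndrange_t child = ndrange_1D(n);   // n work-items, implementation picks the local size
        int err = enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, child,
                                 ^{ size_t id = get_global_id(0); b[id] += a[id]; });
        if(err != CLK_SUCCESS){
            // e.g. the device queue ran out of space; handle or retry here
        }
    }
}
</pre>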
Usage of enqueue_kernel seems quite easy, but there are possible problems. How many kernels can we enqueue this way? We can run out of queue memory. If that happens, enqueue_kernel returns CLK_ENQUEUE_FAILURE (with debugging enabled, the more specific CLK_DEVICE_QUEUE_FULL). So it seems that OpenCL 2.0 code will require a lot of error handling, which is not ideal for SIMD machinery.<br />
<br />
<b>Blender 2.67b and OpenCL is working better</b> (2013-06-02)<br />
<br />
I just updated to the new Blender 2.67b and found out that something in its OpenCL support changed for the better. The last time I checked a previous version of Blender, it was not possible to select the CPU as the compute device. Now it is. It's even possible to use a combination of CPU and GPU. Take a look at the next picture:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9aVsSmS9aWjLZcB17kWmBmW4Uw0lglQLqG7NDmBxnM1XfZCIWqhxwfXxBed_Ra2Fjn2MYBgzU56QI8H0mJYb-WvjpAf5M_f6Vyk0GeKPa7yC9mjV2BHAqWcFkAt9jTCAQslhw3OLRi6Q/s1600/opencl_blender_ok.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9aVsSmS9aWjLZcB17kWmBmW4Uw0lglQLqG7NDmBxnM1XfZCIWqhxwfXxBed_Ra2Fjn2MYBgzU56QI8H0mJYb-WvjpAf5M_f6Vyk0GeKPa7yC9mjV2BHAqWcFkAt9jTCAQslhw3OLRi6Q/s320/opencl_blender_ok.png" width="320" /></a></div>
<br />
I can use the Intel Core i5 and/or the AMD Radeon graphics card as the compute device. This is nice.<br />
<br />
You can see Intel Core i5 listed twice. The reason is that I have two OpenCL implementations installed, one from Intel and one from AMD. Sadly I can't tell which entry is which, but most users will not have that problem.<br />
<br />
What if we try to run Cycles on OpenCL? Let's start with the Intel Core i5 and Intel's OpenCL implementation. In the console we get the next output:<br />
<pre class="prettyprint ">Compiling OpenCL kernel ...
OpenCL kernel build output:
Compilation started
In file included from <built-in>:132:
<command line>:2:36: warning: ISO C99 requires whitespace after the macro name
Compilation done
Linking started
Linking done
Kernel <kernel_ocl_path_trace> was not vectorized
Kernel <kernel_ocl_tonemap> was successfully vectorized
Done.
Kernel compilation finished in 17.80s.
</pre>
<br />
You can see that the Cycles kernel is quite a massive piece of code: compilation takes 17.8 s. The people who wrote Cycles put a lot of work into it. Rendering the default cube takes 2.9 s.<br />
<br />
What about AMD's implementation on the CPU backend? The console output is less verbose in this case:<br />
<pre class="prettyprint ">Compiling OpenCL kernel ...
Kernel compilation finished in 5.13s.
</pre>
The rendering time is 2.2 s, which is strange: AMD's implementation takes less time than Intel's implementation on Intel's own CPU! We're using AMD APP 1214.3 and the Intel SDK 2013. But I'm not alone here; Phoronix found similar results: <a href="http://www.phoronix.com/scan.php?page=article&item=amd_intel_openclsdk&num=1">http://www.phoronix.com/scan.php?page=article&item=amd_intel_openclsdk&num=1</a>.<br />
<br />
If we select not OpenCL but the pure CPU implementation, it takes 1.8 s. It seems some work could still be done to optimize the whole thing. In my opinion a pure CPU implementation is not needed any more; the OpenCL implementation is enough. For machines which don't have OpenCL preinstalled, a default OpenCL implementation could be bundled with Blender.<br />
<br />
I noticed that Blender caches compiled OpenCL kernels. Good work! On the next start of Blender, the first render is significantly faster.<br />
<br />
What about the GPU? At first we get a lot of trivial warnings which can be ignored:<br />
<pre class="prettyprint">Compiling OpenCL kernel ...
OpenCL kernel build output:
"/tmp/OCLawnF6S.cl", line 16307: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phi = M_2PI_F * randv;
^
"/tmp/OCLawnF6S.cl", line 16323: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phi = M_2PI_F * randv;
^
"/tmp/OCLawnF6S.cl", line 16337: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phi = M_2PI_F*u2;
^
"/tmp/OCLawnF6S.cl", line 22875: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phi = M_2PI_F * randu;
^
"/tmp/OCLawnF6S.cl", line 23165: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phiM = M_2PI_F * randv;
^
"/tmp/OCLawnF6S.cl", line 23394: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phiM = M_2PI_F * randv;
^
"/tmp/OCLawnF6S.cl", line 24051: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
float phi = M_2PI_F * randu;
^
"/tmp/OCLawnF6S.cl", line 24427: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
const float tolerance = 1e-8;
^
"/tmp/OCLawnF6S.cl", line 24497: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
return ss->alpha_*(1.0f/M_4PI_F)*(Rdr + Rdv);
^
"/tmp/OCLawnF6S.cl", line 26172: warning: double-precision constant is
represented as single-precision constant because double is not
enabled
return atan2f(y, x) / M_2PI_F + 0.5f;
^
Error:E013:Insufficient Private Resources!
OpenCL build failed: errors in console
</pre>
<br />
But at the end we get:<br />
<pre class="prettyprint">Error:E013:Insufficient Private Resources!
</pre>
It looks like our GPU, an AMD Radeon 5470, is too low-end. But at least all the code compiles. It would be nice to get Cycles working on low-end GPUs, but thinking further, it's probably not worth the effort: serious Blender users will use better GPUs anyway.<br />
<br />
The question is why it doesn't work. Is there too little local memory? Or is the program too complex? Since the error talks about private resources, I think the Cycles program is too complex for our GPU: it uses too many registers, or the program is too long. Maybe splitting Cycles into several smaller kernels would help. To find the exact problem, one would need the KernelAnalyzer from AMD APP and try to compile the kernel for all GPUs.<br />
<br />
<b>Tutorial: Simple start with OpenCL and C++</b> (2013-06-01)<br />
<br />
Beginning to program in OpenCL is always hard. Let's try the basic example: we want to sum two arrays together.<br />
<br />
First you need to install the OpenCL libraries and other files. AMD provides AMD APP for CPUs and its GPUs: <a href="http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/">http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/</a>. Intel has its OpenCL SDK at <a href="http://software.intel.com/en-us/vcsource/tools/opencl-sdk">http://software.intel.com/en-us/vcsource/tools/opencl-sdk</a>. And Nvidia has everything at <a href="https://developer.nvidia.com/cuda-downloads">https://developer.nvidia.com/cuda-downloads</a>. In some cases the graphics drivers already include all the files you need. I recommend continuing with the next step; if anything goes wrong, return to this step and install the needed OpenCL SDK.<br />
<br />
We will program in C++11. To ease everything we will use the OpenCL C++ binding 1.1 from <a href="http://www.khronos.org/registry/cl/api/1.1/cl.hpp">www.khronos.org/registry/cl/api/1.1/cl.hpp</a>. The manual for this binding is available at <a href="http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf">www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf</a>. cl.hpp may already be installed on your computer; if not, simply download the C++ binding into your project folder. Don't forget to turn on C++11. In QtCreator, add the next line to the .pro file:<br />
<pre class="prettyprint "> QMAKE_CXXFLAGS += -std=c++0x
</pre>
Also don't forget to link the OpenCL library. In QtCreator, add the next line to the .pro file:<br />
<pre class="prettyprint "> LIBS+= -lOpenCL </pre>
If you get any errors, you need to adjust the system variables to point to your OpenCL installation folder. You can also set the path to the OpenCL libraries manually:<br />
<pre class="prettyprint "> LIBS+= -Lpath_to_openCL_libraries
</pre>
Or you can simply hard-code the path to the OpenCL library:<br />
<pre class="prettyprint "> LIBS+=/usr/.../libOpenCL.so
</pre>
Let's start coding. We will create a simple console program which uses OpenCL to sum two arrays as C = A + B. For our simple sample we need only two headers:<br />
<pre class="prettyprint ">#include <iostream>
#include <CL/cl.hpp>
</pre>
Everything else happens inside the main function. First we need to get one of the OpenCL platforms. A platform is effectively the driver you installed previously, so it can be from Nvidia, Intel, AMD...<br />
<pre class="prettyprint ">int main(){
//get all platforms (drivers)
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if(all_platforms.size()==0){
std::cout<<" No platforms found. Check OpenCL installation!\n";
exit(1);
}
cl::Platform default_platform=all_platforms[0];
std::cout << "Using platform: "<<default_platform.getInfo<CL_PLATFORM_NAME>()<<"\n";</pre>
<br />
Having selected the first platform (<b>default_platform</b>), we will use it in the next steps. Now we need to get a device of our platform. For example, AMD's platform supports multiple devices (CPUs and GPUs). We select the first device (<b>default_device</b>):<br />
<pre class="prettyprint "> //get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if(all_devices.size()==0){
std::cout<<" No devices found. Check OpenCL installation!\n";
exit(1);
}
cl::Device default_device=all_devices[0];
std::cout<< "Using device: "<<default_device.getInfo<CL_DEVICE_NAME>()<<"\n";
</pre>
<br />
Now we need to create a context. Think of the context as the runtime link to our device and platform:<br />
<pre class="prettyprint "> cl::Context context({default_device});
</pre>
<br />
Next we need to create the program which we want to execute on our device:<br />
<pre class="prettyprint "> cl::Program::Sources sources;
</pre>
<br />
The actual source of our program (the <b>kernel</b>) is here:<br />
<pre class="prettyprint "> // kernel calculates for each element C=A+B
std::string kernel_code=
" void kernel simple_add(global const int* A, global const int* B, global int* C){ "
" C[get_global_id(0)]=A[get_global_id(0)]+B[get_global_id(0)]; "
" } "; ";</pre>
This code simply calculates C=A+B. Since we want each work-item (thread) to sum only one element, we use get_global_id(0), which returns the id of the current work-item. Ids range from 0 to get_global_size(0) - 1, where get_global_size(0) is the total number of work-items. And what does the 0 mean? It is the first dimension: OpenCL supports running kernels on 1D, 2D and 3D problems. We use a 1D array, which means a 1D problem.<br />
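As an aside, here is what a 2D problem would look like. This kernel is only an illustration (assuming an image stored row by row with width w), not part of our program:<br />
<pre class="prettyprint linenums">// illustration only: each work-item handles one pixel of a 2D problem
void kernel clear_pixel(global float* img, int w){
    size_t x = get_global_id(0);   // id in the first dimension
    size_t y = get_global_id(1);   // id in the second dimension
    img[y*w + x] = 0.0f;           // row-major indexing
}
</pre>
Such a kernel would be launched with a 2D range, e.g. cl::NDRange(w, h).<br />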
<br />
Next we need to build our kernel source. We also check for errors while building:<br />
<pre class="prettyprint "> sources.push_back({kernel_code.c_str(),kernel_code.length()});
cl::Program program(context,sources);
if(program.build({default_device})!=CL_SUCCESS){
std::cout<<" Error building: "<<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<<"\n";
exit(1);
}</pre>
For the arrays A, B and C we need to allocate space on the device:<br />
<pre class="prettyprint "> // create buffers on the device
cl::Buffer buffer_A(context,CL_MEM_READ_WRITE,sizeof(int)*10);
cl::Buffer buffer_B(context,CL_MEM_READ_WRITE,sizeof(int)*10);
cl::Buffer buffer_C(context,CL_MEM_READ_WRITE,sizeof(int)*10);</pre>
The arrays will have 10 elements each. We want to calculate the sum of the following arrays A and B:<br />
<pre class="prettyprint "> int A[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int B[] = {0, 1, 2, 0, 1, 2, 0, 1, 2, 0};</pre>
We need to copy arrays A and B to the device, that is, from the host to the device (the <b>host</b> is our <b>main</b> program). First we need to create a queue through which we will send commands to our device:<br />
<pre class="prettyprint "> //create queue to which we will push commands for the device.
cl::CommandQueue queue(context,default_device);</pre>
Now we can copy data from arrays A and B to buffer_A and buffer_B which represent memory on the device:<br />
<pre class="prettyprint "> //write arrays A and B to the device
queue.enqueueWriteBuffer(buffer_A,CL_TRUE,0,sizeof(int)*10,A);
queue.enqueueWriteBuffer(buffer_B,CL_TRUE,0,sizeof(int)*10,B);</pre>
Now we can run the kernel which sums A and B in parallel and writes the result to C. We do this with a KernelFunctor, which runs the kernel on the device. Note <i>"simple_add"</i>: this is the name of the kernel we wrote before. You can also see the number 10; it is the number of threads we want to run (our array size is 10):<br />
<br />
<pre class="prettyprint "> cl::KernelFunctor simple_add(cl::Kernel(program,"simple_add"),queue,cl::NullRange,cl::NDRange(10),cl::NullRange);
</pre>
Here we actually set the arguments to kernel <b>simple_add</b> and run the kernel:
<br />
<pre class="prettyprint "> simple_add(buffer_A, buffer_B, buffer_C);
</pre>
At the end we want to print buffer C from the device. First we transfer the data from the device back to our program (the host):<br />
<pre class="prettyprint "> int C[10];
//read result C from the device to array C
queue.enqueueReadBuffer(buffer_C,CL_TRUE,0,sizeof(int)*10,C);
std::cout<<" result: \n";
for(int i=0;i<10;i++){
std::cout<<C[i]<<" ";
}
return 0;
}
</pre>
<br />
That's it. The complete code is here:<br />
<br />
<br />
<pre class="prettyprint linenums">#include <iostream>
#include <CL/cl.hpp>
int main(){
//get all platforms (drivers)
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if(all_platforms.size()==0){
std::cout<<" No platforms found. Check OpenCL installation!\n";
exit(1);
}
cl::Platform default_platform=all_platforms[0];
std::cout << "Using platform: "<<default_platform.getInfo<CL_PLATFORM_NAME>()<<"\n";
//get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if(all_devices.size()==0){
std::cout<<" No devices found. Check OpenCL installation!\n";
exit(1);
}
cl::Device default_device=all_devices[0];
std::cout<< "Using device: "<<default_device.getInfo<CL_DEVICE_NAME>()<<"\n";
cl::Context context({default_device});
cl::Program::Sources sources;
// kernel calculates for each element C=A+B
std::string kernel_code=
" void kernel simple_add(global const int* A, global const int* B, global int* C){ "
" C[get_global_id(0)]=A[get_global_id(0)]+B[get_global_id(0)]; "
" } ";
sources.push_back({kernel_code.c_str(),kernel_code.length()});
cl::Program program(context,sources);
if(program.build({default_device})!=CL_SUCCESS){
std::cout<<" Error building: "<<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<<"\n";
exit(1);
}
// create buffers on the device
cl::Buffer buffer_A(context,CL_MEM_READ_WRITE,sizeof(int)*10);
cl::Buffer buffer_B(context,CL_MEM_READ_WRITE,sizeof(int)*10);
cl::Buffer buffer_C(context,CL_MEM_READ_WRITE,sizeof(int)*10);
int A[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int B[] = {0, 1, 2, 0, 1, 2, 0, 1, 2, 0};
//create queue to which we will push commands for the device.
cl::CommandQueue queue(context,default_device);
//write arrays A and B to the device
queue.enqueueWriteBuffer(buffer_A,CL_TRUE,0,sizeof(int)*10,A);
queue.enqueueWriteBuffer(buffer_B,CL_TRUE,0,sizeof(int)*10,B);
//run the kernel
cl::KernelFunctor simple_add(cl::Kernel(program,"simple_add"),queue,cl::NullRange,cl::NDRange(10),cl::NullRange);
simple_add(buffer_A,buffer_B,buffer_C);
//alternative way to run the kernel
/*cl::Kernel kernel_add=cl::Kernel(program,"simple_add");
kernel_add.setArg(0,buffer_A);
kernel_add.setArg(1,buffer_B);
kernel_add.setArg(2,buffer_C);
queue.enqueueNDRangeKernel(kernel_add,cl::NullRange,cl::NDRange(10),cl::NullRange);
queue.finish();*/
int C[10];
//read result C from the device to array C
queue.enqueueReadBuffer(buffer_C,CL_TRUE,0,sizeof(int)*10,C);
std::cout<<" result: \n";
for(int i=0;i<10;i++){
std::cout<<C[i]<<" ";
}
return 0;
}</pre>
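Assuming the file is saved as main.cpp, it can also be built outside QtCreator with a plain compiler call like the next one; the result line is what the program computes for our A and B:<br />
<pre class="prettyprint ">g++ -std=c++0x main.cpp -o simple_add -lOpenCL
./simple_add
# result: 0 2 4 3 5 7 6 8 10 9
</pre>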
<br />
<b>Atomic operations and floating point numbers in OpenCL</b> (2013-05-30)<br />
<br />
I have often asked myself why atomic operations are not supported on floating-point numbers. There are two reasons for that:<br />
<ol>
<li>floating point approximation</li>
<li>hardware costs</li>
</ol>
What does the first reason mean? OpenCL doesn't define thread scheduling, so the order in which threads run can be arbitrary. If we used atomics, the order of the arithmetic operations would be arbitrary too. With floating-point numbers, that makes the results arbitrary as well, which nobody wants. Don't believe it? Take a look at the next example:<br />
<br />
<pre class="prettyprint linenums">// needs #include <iostream> and #include <iomanip>
float sum=0;
for(int i=0;i<10000000;i++){
    sum+=1.0f;
}
sum+=100000000.0f;
std::cout<<std::setprecision(20) << "sum is: "<<sum<<"\n";
</pre>
<pre class="prettyprint linenums">// needs #include <iostream> and #include <iomanip>
float sum=100000000.0f;
for(int i=0;i<10000000;i++){
    sum+=1.0f;
}
std::cout<<std::setprecision(20) << "sum is: "<<sum<<"\n";
</pre>
<br />
<br />
Now the question for 1M$: what will the first cout print, and what the second one? It looks like both should print 110000000, but this is not the case. Only the first one prints the expected result; the second cout prints 100000000. Why? A float has only 32 bits to store the number. To support the large dynamic range of floating-point numbers (roughly −10<sup>38</sup> through +10<sup>38</sup> for a 32-bit float), they are stored as a pair of mantissa and exponent. The number 100000000 from the second case is internally stored as 1.0*10^8 (1.0 is the mantissa, 10^8 the exponent). When we add small ones to this very big value (1.0*10^8 + 1.0*10^0), the problem is how to represent 100000001: the mantissa has only 24 bits, and at a magnitude of 10^8 the gap between adjacent floats is 8. The ones are simply lost, as each small one is negligible against the big number 1.0*10^8.<br />
In the first case we get the correct result because every integer up to 2<sup>24</sup> = 16777216 is exactly representable in a 32-bit float, so counting to 1.0*10^7 one by one is exact. On the line before the print we then add 1.0*10^7 and 1.0*10^8 in a single operation and get 1.1*10^8, which is representable.<br />
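A quick way to see that gap on the host (a small standalone check, not from the original post):<br />
<pre class="prettyprint linenums">#include <cmath>
#include <cstdio>

int main(){
    float big = 100000000.0f;                      // 1.0*10^8
    float next = std::nextafterf(big, 2.0e8f);     // smallest float above big
    std::printf("gap at 1e8: %.1f\n", next - big); // prints 8.0
    std::printf("1e8 + 1 == 1e8? %d\n", big + 1.0f == big); // prints 1 (true)
    return 0;
}
</pre>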
<br />
What about the second reason, the hardware costs? It's well known that an integer arithmetic unit requires far fewer transistors than a floating-point one. Atomic arithmetic operations on a GPU can be implemented in two ways:<br />
<ol>
<li>serialization of the memory operations</li>
<li>utilizing arithmetic unit in the memory controller or in the special queue</li>
</ol>
The first is simple to implement but really slow, as all threads which access the same memory location must serialize. But at least atomic operations work.<br />
<br />
The second way is the preferred one, but it requires more transistors. To support fast atomics we need some kind of queue to which we send commands like "add the value 5 to memory location X". This requires additional arithmetic units in a special unit or at the memory controller.<br />
<br />
As floating-point arithmetic units are more costly, there is no economic reason to include them in special units that would sit unused most of the time. After all, you would probably use atomics only in rare cases, right?<br />
<br />
Now you know why OpenCL has no atomic operations on floating-point numbers. If you would still like to have them, you can serialize the memory access as done in the next code:<br />
<br />
<pre class="prettyprint linenums">void atomic_add_global(volatile global float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}
</pre>
</pre>
<br />
<pre class="prettyprint linenums">void atomic_add_local(volatile local float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile local unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}
</pre>
</pre>
<br />
<br />
I found this code on the next blog: <a href="http://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html">http://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html</a>. Many thanks.<br />
<br />
The first function works on global memory, the second on local memory. The only difference is the global/local keyword.<br />
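For illustration, a hedged usage sketch (the kernel and buffer names are mine, not from the original post): each work-item accumulates its element into a single float in global memory:<br />
<pre class="prettyprint linenums">// assumes atomic_add_global from above is defined in the same program source
kernel void sum_floats(global const float* data, volatile global float* result){
    // result must be zero-initialized by the host before the launch
    atomic_add_global(result, data[get_global_id(0)]);
}
</pre>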
<br />
How do these functions work? They use a union, meaning the value at memory location X can be accessed either as an integer or as a floating-point number. The union replaces type-casting of pointers.<br />
<br />
Next you see the do-while loop, which actually serializes the memory access. atomic_cmpxchg writes our new sum to location X only if the value there is still the one we previously read; it returns the old value, so if another thread has written to the same location in the meantime, the comparison fails and we repeat the loop. You can see this approach can get very slow, especially if many threads write to the same location.<br />
<br />
If you would like an atomic_mul or atomic_div, you can simply replace the + with your operator (/, *, -).<br />
<br />
<b>Be warned, this is slow, as we figured out before!</b><br />
<br />
<b>OpenCL in Blender 2.67</b> (2013-05-18)<br />
<br />
Last time I wrote about Blender 2.66a and its OpenCL support: the support is experimental and doesn't work with the AMD OpenCL implementation. What about the new Blender 2.67? I found out that it still doesn't work, but at least some code has changed:<br />
<br />
<pre class="prettyprint ">Compiling OpenCL kernel ...
OpenCL build failed: errors in console
"/tmp/OCLhNcF82.cl", line 24079: warning: double-precision constant is
represented as single-precision constant because double is not enabled
const float tolerance = 1e-8;
^
"/tmp/OCLhNcF82.cl", line 24149: error: identifier "M_PI" is undefined
return ss->alpha_*(1.0f/(4.0f*(float)M_PI))*(Rdr + Rdv);
^
"/tmp/OCLhNcF82.cl", line 30225: error: expected a ")"
int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0)
^
"/tmp/OCLhNcF82.cl", line 30356: error: too few arguments in function call
shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID);
^
"/tmp/OCLhNcF82.cl", line 31558: error: too few arguments in function call
shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);
^
4 errors detected in the compilation of "/tmp/OCLhNcF82.cl".
Internal error: clc compiler invocation failed.
</pre>
This might be because of changes on the CUDA side (the CUDA and OpenCL implementations share some of the code). The 'expected a ")"' error, for instance, points at a default argument (int segment = ~0), which is valid in CUDA C++ but not in OpenCL C. I still believe OpenCL is useful for production systems; CUDA is more useful for experimental and academic purposes. Not all machines have Nvidia hardware, but most support OpenCL, at least on the CPU. OpenCL is even used on tablets and phones.
Another question: why can't I select the CPU as the compute device?<br />
<br />
<b>OpenCL and Blender (Cycles)</b> (2013-04-10)<br />
<br />
It seems that OpenCL is not so important to the Blender community (Blender 2.66a). The Cycles engine works quite nicely with CUDA, but when you try to turn on OpenCL support you first need to set the CYCLES_OPENCL_TEST environment variable. Once done, you might think everything will work as it should, but it doesn't. When trying to render something I got the next <b>compile</b> errors:<br />
<br />
<pre class="prettyprint ">"/tmp/OCLpiZAxQ.cl", line 27089: error: expected a ")"
int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0)
^
"/tmp/OCLpiZAxQ.cl", line 27226: error: too few arguments in function call
shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID);
^
"/tmp/OCLpiZAxQ.cl", line 28436: error: too few arguments in function call
shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);
</pre>
<br />
They say at <a href="http://wiki.blender.org/index.php/Dev:2.6/Source/Render/Cycles/OpenCL">http://wiki.blender.org/index.php/Dev:2.6/Source/Render/Cycles/OpenCL</a> that OpenCL drivers are not mature enough. But according to <a href="http://www.luxrender.net/luxmark/">http://www.luxrender.net/luxmark/</a>, this is not the case: LuxRender has a quite stable OpenCL renderer which can even work in GPU+CPU mode.<br />
<br />
The problem I see with the Cycles renderer is that it uses one too-big kernel. That is a no-go for GPU computing in its basic concept. Why? Register pressure is not equal across the kernel (yes, I know registers can be spilled to global memory too), so some sections of the kernel execute suboptimally. Such problems might be partly solved with dynamic parallelism, but what about backward compatibility? And don't forget that GPUs excel at the SIMD (SIMT) paradigm. Should we spend GPU registers on raw arithmetic power, or rather on making development easier?<br />
<br />
<b>Performance of atomics</b> (2013-04-10)<br />
<br />
Atomics in OpenCL are very useful, but if they are not used carefully, severe performance penalties can appear. Let's create a simple OpenCL kernel which computes a sum of ones using atomics:
<br />
<pre class="prettyprint">kernel void AtomicSum(global int* sum){
atomic_add(sum,1);
}
</pre>
<br />
Let's try to test this kernel running 1024x1024x128 threads:
<br />
<pre class="prettyprint linenums">int sum=0;
cl::Buffer bufferSum = cl::Buffer(context, CL_MEM_READ_WRITE, 1 * sizeof(int));
queue.enqueueWriteBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
cl::Kernel kernel=cl::Kernel(program, "AtomicSum");
kernel.setArg(0,bufferSum);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024*1024*128), cl::NullRange);
queue.finish();
queue.enqueueReadBuffer(bufferSum,CL_TRUE,0,1 * sizeof(int),&sum);
std::cout << "Sum: " << sum << "\n";
</pre>
<br />
<br />
<br />
The expected sum is 134217728 (1024·1024·128).<br />
<i>Our test machine uses the OpenCL implementation from AMD. The CPU is an Intel(R) Core(TM) i5 M 430 @ 2.27 GHz, the GPU an AMD Mobility Radeon HD 5470.</i>
How much time should this code take on the CPU, and how much on the GPU? We usually expect operations on the GPU to be much faster than on the CPU. Are they really? Our test returned the next results:<br />
<ul>
<li>CPU: 1.809s </li>
<li>GPU: 3.262s </li>
</ul>
<br />
This can be quite unexpected. Is it possible to speed the whole thing up? The short answer is yes. OpenCL supports local memory (on-chip), which is much faster than global memory. Let's change the previous AtomicSum kernel:
<br />
<pre class="prettyprint linenums">kernel void AtomicSum(global int* sum){
local int tmpSum[1];
if(get_local_id(0)==0){
tmpSum[0]=0;
}
barrier(CLK_LOCAL_MEM_FENCE);
atomic_add(&tmpSum[0],1);
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0)==(get_local_size(0)-1)){
atomic_add(sum,tmpSum[0]);
}
}
</pre>
<br />
<br />
This kernel does the atomic add at the work-group level using local memory. At the end, the last thread of each work-group does a single atomic add to global memory. This approach lowers the number of accesses to global memory. It looks promising, and the results do too:<br />
<ul>
<li>CPU: 0.815s</li>
<li>GPU: 0.24s</li>
</ul>
<br />
The speedup on the GPU is now more than 10x, and the CPU result is also not bad. Overall quite a nice speedup. Can we go even faster? Let's assume that atomic operations on local memory also have a significant cost. This cost can be lowered by using more local memory, with threads spreading their atomic adds across different memory locations:<br />
<br />
<pre class="prettyprint linenums">kernel void AtomicSum(global int* sum){
local int tmpSum[4];
if(get_local_id(0)<4){
tmpSum[get_local_id(0)]=0;
}
barrier(CLK_LOCAL_MEM_FENCE);
atomic_add(&tmpSum[get_global_id(0)%4],1);
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0)==(get_local_size(0)-1)){
atomic_add(sum,tmpSum[0]+tmpSum[1]+tmpSum[2]+tmpSum[3]);
}
}
</pre>
<br />
<br />
<br />
We again got a nice speedup, but it's not four times faster as one might expect:<br />
<ul>
<li>CPU: 0.858s</li>
<li>GPU: 0.173s</li>
</ul>
<br />
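Before the takeaway, note that the atomics can be avoided entirely with a classic local-memory tree reduction. A minimal sketch (my illustration, assuming a power-of-two work-group size; the host sums the per-group partials at the end):<br />
<pre class="prettyprint linenums">kernel void ReduceSum(global int* partial, local int* scratch){
    size_t lid = get_local_id(0);
    scratch[lid] = 1;                  // each work-item contributes a one, as in AtomicSum
    barrier(CLK_LOCAL_MEM_FENCE);
    for(size_t s = get_local_size(0)/2; s > 0; s /= 2){
        if(lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0)
        partial[get_group_id(0)] = scratch[0]; // one partial sum per work-group
}
</pre>
With the C++ binding, the scratch argument is given a size rather than data, e.g. kernel.setArg(1, cl::Local(local_size * sizeof(int))).<br />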
We found out that atomics cost quite some time. It's recommended to avoid atomics on global memory. Atomics in local memory are better, but they are still not always the best solution. This applies especially to GPUs, as they run many more threads in parallel than CPUs. Global atomics on CPUs don't have such a big impact on performance, which means the same code can even run faster on a CPU than on a GPU.<br />
<br />
<b>Calling kernels with many parameters</b> (2013-04-09)<br />
<br />
Suppose we have an OpenCL kernel with 10 parameters. In order to call the kernel we need to call clSetKernelArg 10 times:<br />
<pre class="prettyprint linenums">clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceMemory0);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &deviceMemory1);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &deviceMemory2);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &deviceMemory3);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &deviceMemory4);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &deviceMemory5);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &deviceMemory6);
clSetKernelArg(kernel, 7, sizeof(cl_mem), &deviceMemory7);
clSetKernelArg(kernel, 8, sizeof(cl_mem), &deviceMemory8);
clSetKernelArg(kernel, 9, sizeof(cl_mem), &deviceMemory9);
</pre>
<br />
This is not a very elegant solution. The official C++ binding to OpenCL, available at http://www.khronos.org/registry/cl/, solves most of the problem. The first option is to simply use the C++ binding:<br />
<pre class="prettyprint linenums">kernel.setArg(0,deviceMemory0);
kernel.setArg(1,deviceMemory1);
kernel.setArg(2,deviceMemory2);
kernel.setArg(3,deviceMemory3);
kernel.setArg(4,deviceMemory4);
kernel.setArg(5,deviceMemory5);
kernel.setArg(6,deviceMemory6);
kernel.setArg(7,deviceMemory7);
kernel.setArg(8,deviceMemory8);
kernel.setArg(9,deviceMemory9);
</pre>
<br />
<br />
<br />
As we can see, there is no need to specify the size of the parameters (it's autodetected). Here kernel is an object of class cl::Kernel, and deviceMemory0 to deviceMemory9 are objects of cl::Buffer. But one problem persists: we still need to call setArg 10 times. The C++ binding has a solution for this too (it's not official yet ;) ). It's called KernelFunctor (KernelFunctorGlobal in the OpenCL C++ binding 1.2):<br />
<pre class="prettyprint linenums">cl::KernelFunctor kernelFunctor(kernel, queue, offset, globalSize, localSize);
kernelFunctor(deviceMemory0, deviceMemory1, deviceMemory2, deviceMemory3, deviceMemory4,
deviceMemory5, deviceMemory6, deviceMemory7, deviceMemory8, deviceMemory9);
</pre>
<br />
This is quite short now. First we create the kernel functor, then we use operator() to set the parameters and call the kernel. If we look inside cl.hpp, we see many definitions of operator(), with the number of parameters ranging from 0 to 15. Such an issue can be solved with variadic templates in C++11:<br />
<pre class="prettyprint linenums">inline void _setKernelParameters(cl::Kernel &k,int i){}//do nothing, terminating function
template<typename T, typename... Args>
inline void _setKernelParameters(cl::Kernel &kernel,int i, const T &firstParameter, const Args& ...restOfParameters){
kernel.setArg(i, firstParameter);
_setKernelParameters(kernel,i+1,restOfParameters...);
}
template<typename... Args>
inline void setKernelParameters(cl::Kernel &kernel,const Args& ...args){
_setKernelParameters(kernel, 0, args...);//first number of parameter is 0
}
</pre>
In order to set the parameters of our kernel, we now need only the next line:
<br />
<pre class="prettyprint linenums">setKernelParameters(kernel, deviceMemory0, deviceMemory1, deviceMemory2, deviceMemory3, deviceMemory4,
                    deviceMemory5, deviceMemory6, deviceMemory7, deviceMemory8, deviceMemory9);
</pre>
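After setting the arguments this way, the kernel is launched as usual; a short sketch (queue, offset, globalSize and localSize as in the KernelFunctor example above):<br />
<pre class="prettyprint linenums">queue.enqueueNDRangeKernel(kernel, offset, globalSize, localSize);
queue.finish(); // block until the kernel completes
</pre>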
I don't know exactly why the OpenCL C++ binding doesn't use this approach. A possible reason is backward compatibility with older compilers (C++11 is still quite a new thing).