OpenCL: 2013

Sunday, September 29, 2013

Dynamic parallelism in OpenCL 2.0

The provisional OpenCL 2.0 specification was released a few months ago. One of its most interesting features is support for dynamic parallelism. In the CUDA world it has existed for about a year, but still only on the most expensive devices with compute capability 3.5 (Titan, GTX 780; both built on the GK110 chip). On the AMD side the story is a little different. They haven't said anything about dynamic parallelism, but they did introduce GCN 2.0, which might support it. In addition they introduced Mantle, a new GPU API which promises up to 9 times more draw calls than comparable APIs (OpenGL, DirectX). This hints that draw calls might be issued from the GPU itself.

How will dynamic parallelism be used? Very simply. Kernels will enqueue kernels to a device queue:

Blender 2.67b and OpenCL is working better

I just updated to the new Blender 2.67b and found that something in its OpenCL support has changed for the better. The last time I checked, with the previous version of Blender, it was not possible to select the CPU as the compute device. Now it is. It's even possible to use a combination of CPU and GPU. Take a look at the next picture:

I can use the Intel Core i5 and/or the AMD Radeon graphics card as the compute device. This is nice.

Tutorial: Simple start with OpenCL and C++

Getting started with OpenCL programming is always hard. Let's try a basic example. We want to add two arrays together.
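Before touching any SDK, it can help to see the computation itself on the CPU. The sketch below assumes the sum is element-wise (c[i] = a[i] + b[i]) and uses hypothetical names (addOne, addArrays) that are not part of OpenCL. The loop index stands in for what get_global_id(0) would return inside a real kernel, so each loop iteration corresponds to one work-item.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of a hypothetical kernel: one work-item handles one element.
// `gid` plays the role of get_global_id(0) in OpenCL C.
void addOne(std::size_t gid, const std::vector<int>& a,
            const std::vector<int>& b, std::vector<int>& c) {
    c[gid] = a[gid] + b[gid];
}

// CPU stand-in for launching the kernel over a 1-D range of a.size() work-items.
std::vector<int> addArrays(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> c(a.size());
    for (std::size_t gid = 0; gid < a.size(); ++gid) {
        addOne(gid, a, b, c);
    }
    return c;
}
```

On a real device the loop disappears: the runtime starts one work-item per element and the order in which they run is not defined.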

First you need to install the OpenCL libraries and other files. AMD provides AMD APP for CPUs and its own GPUs: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/. Intel has its OpenCL libraries at http://software.intel.com/en-us/vcsource/tools/opencl-sdk. And Nvidia has everything at https://developer.nvidia.com/cuda-downloads. In some cases the graphics drivers already include all the files you need. I recommend continuing with the next step; if anything goes wrong, return to this step and install the needed OpenCL SDK.

Atomic operations and floating point numbers in OpenCL

Many times I have asked myself why atomic operations are not supported on floating point numbers. There are two reasons for that:

floating point approximation
hardware costs

What does the first reason mean? OpenCL doesn't define thread scheduling, so the order in which threads run can be arbitrary. If we used atomics, the order of the arithmetic operations would be arbitrary too. With floating point numbers that would make the results arbitrary as well, which nobody wants. Don't believe it? Let's take a look at the next example:

float sum=0;
for(int i=0;i<10000000;i++){
    sum+=1.0f;
}
sum+=100000000.0f;
std::cout<<std::setprecision(20) << "sum is: "<<sum<<"\n";

float sum=100000000.0f;
for(int i=0;i<10000000;i++){
    sum+=1.0f;
}
std::cout<<std::setprecision(20) << "sum is: "<<sum<<"\n";
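The two snippets above can be put side by side in one self-contained sketch. The function names (addOnes, smallFirst, largeFirst) are my own, and the volatile qualifier is an addition that forces every partial sum to be rounded to single precision, so the effect does not depend on how the compiler keeps intermediate values. Near 100000000 the spacing between neighbouring float values is 8, so adding 1.0f changes nothing; adding the ones first keeps every step exact.

```cpp
#include <cassert>

// Adds `count` ones to `start`, one at a time, in single precision.
// volatile forces each partial sum to be stored (and rounded) as a float.
float addOnes(float start, int count) {
    volatile float sum = start;
    for (int i = 0; i < count; ++i) {
        sum = sum + 1.0f;
    }
    return sum;
}

// Small values first: the ones accumulate exactly, then the large value is added.
float smallFirst() {
    return addOnes(0.0f, 10000000) + 100000000.0f;
}

// Large value first: every 1.0f is smaller than half the float spacing
// near 1e8 and is rounded away.
float largeFirst() {
    return addOnes(100000000.0f, 10000000);
}
```

With IEEE 754 single precision, smallFirst() gives 110000000 and largeFirst() gives 100000000: the same numbers added in a different order produce different sums.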

OpenCL in Blender 2.67

Last time I wrote about Blender 2.66a and its OpenCL support. The support is experimental, and it doesn't work with the AMD OpenCL implementation. What about the new Blender 2.67? I found that it still doesn't work, but at least some of the code has changed:

Compiling OpenCL kernel ...

OpenCL build failed: errors in console

"/tmp/OCLhNcF82.cl", line 24079: warning: double-precision constant is
          represented as single-precision constant because double is not enabled
        const float tolerance = 1e-8;

                                ^

"/tmp/OCLhNcF82.cl", line 24149: error: identifier "M_PI" is undefined
        return ss->alpha_*(1.0f/(4.0f*(float)M_PI))*(Rdr + Rdv);

                                             ^

"/tmp/OCLhNcF82.cl", line 30225: error: expected a ")"
        int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0)

                                                                                             ^


"/tmp/OCLhNcF82.cl", line 30356: error: too few arguments in function call
        shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID);

                                                                                               ^

"/tmp/OCLhNcF82.cl", line 31558: error: too few arguments in function call
                        shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);

                                                 ^

4 errors detected in the compilation of "/tmp/OCLhNcF82.cl".

Internal error: clc compiler invocation failed.

This might be because of changes on the CUDA side (the CUDA and OpenCL implementations share some of the code). The default argument int segment = ~0 in the log, for example, is valid C++ but not valid OpenCL C. I still believe that OpenCL is useful for production systems, while CUDA is more useful for experimental and academic purposes. Not all machines have Nvidia hardware, but most machines support OpenCL, at least on the CPU. OpenCL is even used on tablets and phones. One more question: why can't I select the CPU as the compute device?

Wednesday, April 10, 2013

OpenCL and Blender (Cycles)

It seems that OpenCL is not so important to the Blender community (Blender 2.66a). The Cycles engine works quite nicely with CUDA, but to turn on OpenCL support you first need to set the CYCLES_OPENCL_TEST environment variable. Once that is done you might think that everything will work as it should, but it doesn't. When I tried to render something I got the following compile errors:

"/tmp/OCLpiZAxQ.cl", line 27089: error: expected a ")"

        int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0)

                                                                                             ^

"/tmp/OCLpiZAxQ.cl", line 27226: error: too few arguments in function call

        shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID);

                                                                                                 ^

"/tmp/OCLpiZAxQ.cl", line 28436: error: too few arguments in function call

                        shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);

The Blender wiki says at http://wiki.blender.org/index.php/Dev:2.6/Source/Render/Cycles/OpenCL that the OpenCL drivers are not mature enough. But according to http://www.luxrender.net/luxmark/ this is not the case. The LuxRender project has a quite stable OpenCL renderer which can even work in GPU+CPU mode.

The problem I see with the Cycles renderer is that it uses too big a kernel. This is a no-go for GPU computing as a basic concept. Why? Register pressure is not uniform across the kernel (yes, I know, you can spill registers to global memory too), so some sections of the kernel can execute suboptimally. Such problems might be partly solved with dynamic parallelism, but what about backward compatibility? And please don't forget that GPUs rock at the SIMD (SIMT) paradigm. Should we use GPU registers for raw arithmetic power, or rather to make development easier?

Performance of atomics

Atomics in OpenCL are very useful, but if they are not used carefully, severe performance penalties can appear. Let's create a simple OpenCL kernel which sums ones using atomics:

kernel void AtomicSum(global int* sum){
    atomic_add(sum,1);
}

Let's test this kernel by running 1024x1024x128 threads:

int sum=0;
cl::Buffer bufferSum = cl::Buffer(context, CL_MEM_READ_WRITE, 1 * sizeof(int));
queue.enqueueWriteBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
cl::Kernel kernel=cl::Kernel(program, "AtomicSum");
kernel.setArg(0,bufferSum);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024*1024*128), cl::NullRange);
queue.finish();

queue.enqueueReadBuffer(bufferSum,CL_TRUE,0,1 * sizeof(int),&sum);
std::cout << "Sum: " << sum << "\n";

Calling kernels with many parameters

Suppose we have an OpenCL kernel with 10 parameters. In order to call the kernel we need to call clSetKernelArg 10 times:

clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceMemory0);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &deviceMemory1);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &deviceMemory2);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &deviceMemory3);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &deviceMemory4);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &deviceMemory5);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &deviceMemory6);
clSetKernelArg(kernel, 7, sizeof(cl_mem), &deviceMemory7);
clSetKernelArg(kernel, 8, sizeof(cl_mem), &deviceMemory8);
clSetKernelArg(kernel, 9, sizeof(cl_mem), &deviceMemory9);

This is not a very elegant solution. The official C++ binding to OpenCL, which is available at http://www.khronos.org/registry/cl/, solves most of the problems. The first solution would be to simply use the C++ binding:

kernel.setArg(0,deviceMemory0);
kernel.setArg(1,deviceMemory1);
kernel.setArg(2,deviceMemory2);
kernel.setArg(3,deviceMemory3);
kernel.setArg(4,deviceMemory4);
kernel.setArg(5,deviceMemory5);
kernel.setArg(6,deviceMemory6);
kernel.setArg(7,deviceMemory7);
kernel.setArg(8,deviceMemory8);
kernel.setArg(9,deviceMemory9);
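One possible next step, which is my own sketch and not part of the C++ binding, is a small variadic helper that numbers the arguments automatically. The names setArgs, setArgsFrom and MockKernel are hypothetical. MockKernel only stands in for cl::Kernel so that the example compiles without the OpenCL headers; the helper itself assumes nothing more than a setArg(index, value) member, which cl::Kernel provides.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in for cl::Kernel: records each (index, value) pair it receives.
struct MockKernel {
    std::vector<std::pair<unsigned, int>> calls;
    void setArg(unsigned index, int value) { calls.push_back({index, value}); }
};

// Base case: no arguments left to set.
template <typename Kernel>
void setArgsFrom(Kernel&, unsigned) {}

// Recursive case: set argument `index`, then continue with index + 1.
template <typename Kernel, typename First, typename... Rest>
void setArgsFrom(Kernel& kernel, unsigned index,
                 const First& first, const Rest&... rest) {
    kernel.setArg(index, first);
    setArgsFrom(kernel, index + 1, rest...);
}

// Entry point: arguments are assigned indices 0, 1, 2, ... in call order.
template <typename Kernel, typename... Args>
void setArgs(Kernel& kernel, const Args&... args) {
    setArgsFrom(kernel, 0u, args...);
}
```

With a real cl::Kernel the ten calls above would then collapse into a single setArgs(kernel, deviceMemory0, deviceMemory1, ..., deviceMemory9), where the dots stand for the remaining buffers. The trade-off is that the argument order in the call must match the kernel signature exactly, since the indices are no longer written out.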

Sunday, June 2, 2013

Blender 2.67b and OpenCL is working better

Saturday, June 1, 2013