This site contains OpenCL notes, tutorials, benchmarks, news.

Wednesday, April 10, 2013

OpenCL and Blender (Cycles)

It seems that OpenCL is not so important for Blender community (Blender 2.66a). Cycles engine works quite nice with CUDA but when you try to turn on the OpenCL support you need at first to set CYCLES_OPENCL_TEST environment variable. When done you might think that everything will work as it should, but it doesn't. When trying to render something I got next compile errors:

"/tmp/", line 27089: error: expected a ")"

        int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0)


"/tmp/", line 27226: error: too few arguments in function call

        shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID);


"/tmp/", line 28436: error: too few arguments in function call

                        shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);

They are saying at that drivers for OpenCL are not mature enough. But according this is not the case. They have quite stable OpenCL renderer which can even work in GPU+CPU mode.

The problem I see with Cycles renderer is that they use to big kernel. This is no go for GPU computing in basic concept. Why? Register pressure is not equal all accross the kernel (yes I know, you can save registers to global memory too). Some sections of kernel can be executed suboptimally. Such problems might be partly solved with Dynamic parallelism but what about backward compatibility? And please don't forget that GPUs rock at SIMD (SIMT) paradigm. And should we use GPU registers more for arithmetic raw power or rather to make development easier?

Performance of atomics

Atomics in OpenCL are very useful, but if they are not used carefully, severe performance penalties can appear. Let's create simple OpenCL kernel which does sum of ones utilizing atomics:
kernel void AtomicSum(global int* sum){

Let's try to test this kernel running 1024x1024x128 threads:
int sum=0;
cl::Buffer bufferSum = cl::Buffer(context, CL_MEM_READ_WRITE, 1 * sizeof(float));
queue.enqueueWriteBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
cl::Kernel kernel=cl::Kernel(program, "AtomicSum");
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024*1024*128), cl::NullRange);

queue.enqueueReadBuffer(bufferSum,CL_TRUE,0,1 * sizeof(int),&sum);
std::cout << "Sum: " << sum << "\n";

Tuesday, April 9, 2013

Calling kernels with many parameters

Suppose we have an OpenCL kernel with 10 parameters. In order to call the kernel we need to call clSetKernelArg 10 times:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceMemory0);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &deviceMemory1);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &deviceMemory2);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &deviceMemory3);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &deviceMemory4);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &deviceMemory5);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &deviceMemory6);
clSetKernelArg(kernel, 7, sizeof(cl_mem), &deviceMemory7);
clSetKernelArg(kernel, 8, sizeof(cl_mem), &deviceMemory8);
clSetKernelArg(kernel, 9, sizeof(cl_mem), &deviceMemory9);

This is not so elegant solution. Official C++ binding to OpenCL, which is available at, solves most of the problems. First solution would be to simply use C++ binding: