The provisional specification of OpenCL 2.0 was released a few months ago. One of its most interesting features is support for dynamic parallelism. In the CUDA world this has existed for about a year, but still only on the most expensive devices with compute capability 3.5 (Titan, GTX 780; both based on the GK110 chip). On the AMD side it's a slightly different story. They haven't said anything about dynamic parallelism, but they did introduce GCN 2.0, which might support it. In addition, they introduced Mantle, a new GPU API which promises up to 9 times more draw calls than comparable APIs (OpenGL, DirectX). This hints that draw calls might be issued from the GPU itself.
How will dynamic parallelism be used? Very simply: kernels will enqueue kernels into a device queue:
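Based on the provisional specification, a parent kernel hands a child kernel to the device's default queue roughly like this. This is only a sketch written from the provisional spec (`enqueue_kernel`, `get_default_queue`, `ndrange_1D` and the block syntax are taken from it); there is no implementation yet to test it against:

```c
// Sketch of OpenCL 2.0 device-side enqueue (provisional spec, untested).
kernel void parent(global int* data, int n) {
    if (get_global_id(0) == 0) {
        // The parent enqueues more work without a round trip to the host.
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] *= 2; });
    }
}
```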
Sunday, September 29, 2013
Sunday, June 2, 2013
Blender 2.67b: OpenCL is working better
I just updated to the new Blender 2.67b and found out that something in the OpenCL support has changed for the better. Last time I checked the previous version of Blender, it was not possible to select the CPU as the compute device. Now it is. It's even possible to use a combination of CPU and GPU. Take a look at the next picture:
I can use the Intel Core i5 and/or the AMD Radeon graphics card as the compute device. This is nice.
Saturday, June 1, 2013
Tutorial: Simple start with OpenCL and C++
Getting started with OpenCL programming is always hard. Let's try a basic example: we want to sum two arrays together.
First you need to install the OpenCL libraries and other files. AMD provides the AMD APP SDK for CPUs and their GPUs: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/. Intel has their OpenCL SDK at http://software.intel.com/en-us/vcsource/tools/opencl-sdk. And Nvidia has everything at https://developer.nvidia.com/cuda-downloads. In some cases the graphics drivers already include all the files you need. I recommend that you continue with the next step, and if anything goes wrong, return to this step and install the needed OpenCL SDK toolkit.
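With the SDK in place, the whole example can be sketched as follows. This is a minimal version using the official C++ bindings (cl.hpp); error handling is omitted, and it assumes a working OpenCL runtime with at least one platform and device:

```cpp
#include <CL/cl.hpp>
#include <iostream>
#include <string>
#include <vector>

// The kernel: each work-item sums one pair of elements.
std::string source =
    "kernel void vector_add(global const int* A, global const int* B, global int* C) {"
    "    int i = get_global_id(0);"
    "    C[i] = A[i] + B[i];"
    "}";

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    cl::Program program(context, source);
    program.build(devices);

    const int n = 10;
    int A[n], B[n], C[n];
    for (int i = 0; i < n; i++) { A[i] = i; B[i] = n - i; }

    cl::Buffer bufA(context, CL_MEM_READ_ONLY, n * sizeof(int));
    cl::Buffer bufB(context, CL_MEM_READ_ONLY, n * sizeof(int));
    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, n * sizeof(int));
    queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, n * sizeof(int), A);
    queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, n * sizeof(int), B);

    cl::Kernel kernel(program, "vector_add");
    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, n * sizeof(int), C);

    // Since A[i] + B[i] = i + (n - i), every element should be n.
    for (int i = 0; i < n; i++) std::cout << C[i] << " ";
    std::cout << "\n";
}
```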
Thursday, May 30, 2013
Atomic operations and floating point numbers in OpenCL
I have often asked myself why atomic operations are not supported on floating point numbers. There are two reasons for that:
- floating point approximation
- hardware costs
Consider two snippets that add exactly the same numbers, just in a different order:

float sum = 0;
for (int i = 0; i < 10000000; i++) {
    sum += 1.0f;
}
sum += 100000000.0f;
std::cout << std::setprecision(20) << "sum is: " << sum << "\n";

And now with the large number first:

float sum = 100000000.0f;
for (int i = 0; i < 10000000; i++) {
    sum += 1.0f;
}
std::cout << std::setprecision(20) << "sum is: " << sum << "\n";

The first snippet prints 110000000, but the second prints 100000000: each added 1.0f is smaller than half the spacing between representable floats near 100000000.0f, so it is rounded away. Since floating point addition depends on the order of operations, and atomics give no ordering guarantees, an atomic float sum would be nondeterministic.
Saturday, May 18, 2013
OpenCL in Blender 2.67
Last time I wrote about Blender 2.66a and its OpenCL support. OpenCL support is experimental, and it doesn't work with the AMD OpenCL implementation. What about the new Blender 2.67? I found out that it still doesn't work, but at least some code has changed:
Compiling OpenCL kernel ...
OpenCL build failed: errors in console

"/tmp/OCLhNcF82.cl", line 24079: warning: double-precision constant is represented as single-precision constant because double is not enabled
    const float tolerance = 1e-8;

"/tmp/OCLhNcF82.cl", line 24149: error: identifier "M_PI" is undefined
    return ss->alpha_*(1.0f/(4.0f*(float)M_PI))*(Rdr + Rdv);

"/tmp/OCLhNcF82.cl", line 30225: error: expected a ")"
    int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0)

"/tmp/OCLhNcF82.cl", line 30356: error: too few arguments in function call
    shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID);

"/tmp/OCLhNcF82.cl", line 31558: error: too few arguments in function call
    shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);

4 errors detected in the compilation of "/tmp/OCLhNcF82.cl".
Internal error: clc compiler invocation failed.

This might be because of changes on the CUDA side (the CUDA and OpenCL implementations share some of the code). I still believe that OpenCL is useful for production systems. CUDA is more useful for experimental and academic purposes. Not all machines have Nvidia hardware, but most machines support OpenCL at least on the CPU. OpenCL is even used on tablets and phones. One more question: why can't I select the CPU as the compute device?
Wednesday, April 10, 2013
OpenCL and Blender (Cycles)
It seems that OpenCL is not so important to the Blender community (Blender 2.66a). The Cycles engine works quite nicely with CUDA, but to turn on OpenCL support you first need to set the CYCLES_OPENCL_TEST environment variable. Once that is done, you might think that everything will work as it should, but it doesn't. When trying to render something I got the following compile errors:
"/tmp/OCLpiZAxQ.cl", line 27089: error: expected a ")" int shader, int object, int prim, float u, float v, float t, float time, int segment = ~0) ^ "/tmp/OCLpiZAxQ.cl", line 27226: error: too few arguments in function call shader_setup_from_sample(kg, sd, P, Ng, I, shader, object, prim, u, v, 0.0f, TIME_INVALID); ^ "/tmp/OCLpiZAxQ.cl", line 28436: error: too few arguments in function call shader_setup_from_sample(kg, &sd, ls->P, ls->Ng, I, ls->shader, ls->object, ls->prim, u, v, t, time);
They say at http://wiki.blender.org/index.php/Dev:2.6/Source/Render/Cycles/OpenCL that the OpenCL drivers are not mature enough. But according to http://www.luxrender.net/luxmark/ this is not the case: they have a quite stable OpenCL renderer which can even work in GPU+CPU mode.
The problem I see with the Cycles renderer is that it uses one kernel that is too big. That is a no-go for GPU computing in its basic form. Why? Register pressure is not equal across the kernel (yes, I know, you can spill registers to global memory too), so some sections of the kernel execute suboptimally. Such problems might be partly solved with dynamic parallelism, but what about backward compatibility? And don't forget that GPUs shine at the SIMD (SIMT) paradigm. Should we spend GPU registers on raw arithmetic power, or rather on making development easier?
Performance of atomics
Atomics in OpenCL are very useful, but if they are not used carefully, severe performance penalties can appear. Let's create a simple OpenCL kernel which computes a sum of ones using atomics:
kernel void AtomicSum(global int* sum) {
    atomic_add(sum, 1);
}
Let's try to test this kernel running 1024x1024x128 threads:
int sum = 0;
cl::Buffer bufferSum(context, CL_MEM_READ_WRITE, 1 * sizeof(int));
queue.enqueueWriteBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
cl::Kernel kernel(program, "AtomicSum");
kernel.setArg(0, bufferSum);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024*1024*128), cl::NullRange);
queue.finish();
queue.enqueueReadBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
std::cout << "Sum: " << sum << "\n";
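With 1024x1024x128 threads all hammering one global counter, every addition is serialized. One common way to reduce the contention is to accumulate inside each work-group first, so only one global atomic is issued per group. A sketch of that idea (not benchmarked here):

```c
// Sum of ones with reduced contention: local atomics within the
// work-group, then a single global atomic per group.
kernel void AtomicSumLocal(global int* sum) {
    local int localSum;
    if (get_local_id(0) == 0)
        localSum = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    atomic_add(&localSum, 1);          // contention only inside the group
    barrier(CLK_LOCAL_MEM_FENCE);

    if (get_local_id(0) == 0)
        atomic_add(sum, localSum);     // one global atomic per work-group
}
```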
Tuesday, April 9, 2013
Calling kernels with many parameters
Suppose we have an OpenCL kernel with 10 parameters. In order to call the kernel we need to call clSetKernelArg 10 times:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceMemory0);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &deviceMemory1);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &deviceMemory2);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &deviceMemory3);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &deviceMemory4);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &deviceMemory5);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &deviceMemory6);
clSetKernelArg(kernel, 7, sizeof(cl_mem), &deviceMemory7);
clSetKernelArg(kernel, 8, sizeof(cl_mem), &deviceMemory8);
clSetKernelArg(kernel, 9, sizeof(cl_mem), &deviceMemory9);
This is not a very elegant solution. The official C++ bindings for OpenCL, available at http://www.khronos.org/registry/cl/, solve most of the problems. The first option is to simply use the C++ bindings:
kernel.setArg(0, deviceMemory0);
kernel.setArg(1, deviceMemory1);
kernel.setArg(2, deviceMemory2);
kernel.setArg(3, deviceMemory3);
kernel.setArg(4, deviceMemory4);
kernel.setArg(5, deviceMemory5);
kernel.setArg(6, deviceMemory6);
kernel.setArg(7, deviceMemory7);
kernel.setArg(8, deviceMemory8);
kernel.setArg(9, deviceMemory9);