This site contains OpenCL notes, tutorials, benchmarks, news.

Wednesday, April 10, 2013

Performance of atomics

Atomics in OpenCL are very useful, but if they are not used carefully, severe performance penalties can appear. Let's create simple OpenCL kernel which does sum of ones utilizing atomics:
kernel void AtomicSum(global int* sum){
    atomic_add(sum,1);
}

Let's try to test this kernel running 1024x1024x128 threads:
int sum=0;
cl::Buffer bufferSum = cl::Buffer(context, CL_MEM_READ_WRITE, 1 * sizeof(float));
queue.enqueueWriteBuffer(bufferSum, CL_TRUE, 0, 1 * sizeof(int), &sum);
cl::Kernel kernel=cl::Kernel(program, "AtomicSum");
kernel.setArg(0,bufferSum);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024*1024*128), cl::NullRange);
queue.finish();

queue.enqueueReadBuffer(bufferSum,CL_TRUE,0,1 * sizeof(int),&sum);
std::cout << "Sum: " << sum << "\n";





Expected sum is: 134217728.  
Our test machine uses OpenCL implementation from AMD. CPU is Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz, GPU is AMD Mobility Radeon HD 5470. How much time should this code take on CPU and how much on GPU? We usually expect that operations on GPU are much faster than operations on CPU. Are they really faster? Our test returned next results:
  • CPU: 1.809s 
  • GPU: 3.262s 

This can be quite unexpected. Is it possible to speed up whole thing? Short answer is yes. OpenCL supports utilization of local memory (on chip) which is much faster than global memory. Let's change previous kernel AtomicSum:
kernel void AtomicSum(global int* sum){
    local int tmpSum[1];
    if(get_local_id(0)==0){
        tmpSum[0]=0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_add(&tmpSum[0],1);
    barrier(CLK_LOCAL_MEM_FENCE);
    if(get_local_id(0)==(get_local_size(0)-1)){
        atomic_add(sum,tmpSum[0]);
    }
}


This kernel does atomic add at level of work groups by utilizing local memory. At the end each work group does atomic add on global memory (last thread). This approach lovers the access to global memory. It looks promising as the results look too:
  • CPU: 0.815s
  • GPU: 0.24s

Speedup on GPU is now more that 10x. On CPU is also not so bad. Overall this is quite a nice speedup. Can we do it even faster? Let's assume that atomic operations on local memory have significant costs to. This cost can be lowered by using more local memory, where each thread tries to do atomic add at different memory locations:

kernel void AtomicSum(global int* sum){
    local int tmpSum[4];
    if(get_local_id(0)<4){
        tmpSum[get_local_id(0)]=0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_add(&tmpSum[get_global_id(0)%4],1);
    barrier(CLK_LOCAL_MEM_FENCE);
    if(get_local_id(0)==(get_local_size(0)-1)){
        atomic_add(sum,tmpSum[0]+tmpSum[1]+tmpSum[2]+tmpSum[3]);
    }
}



We got again nice speedup, but it's not four times faster than expected:
  • CPU: 0.858s
  • GPU: 0.173s

We found out that atomics cost quite some time. It's recommended to omit atomics on global memory. Atomics at local memory are better but they are always also not the best solution. This applies especially to GPUs, as they can run much more threads in parallel that CPUs. Global atomics on CPUs don't have so big impact on performance. This means that same code can run even faster on CPU than on GPU.

17 comments:

  1. Interesting post. It would be helpful if you could discuss the hardware architecture, because atomic performance varies strongly across devices. Also using 32 or 64 bit implementations has a strong effect on performance.

    ReplyDelete
  2. Thanks for Sharing This Article.this very halp for develop openCL concept.It is very so much valuable content."Nice blog I really appreciate your words,Nice post. It is really amazing and helpful.
    DevOps Training in Chennai

    DevOps Online Training in Chennai

    DevOps Training in Bangalore

    DevOps Training in Hyderabad

    DevOps Training in Coimbatore

    DevOps Training

    DevOps Online Training



    ReplyDelete
  3. Java is a good choice for many reasons. It's certainly a nicer language than PHP. You have great dev tools for Java - IDEs, build systems, CI tools, Containers, Databases both SQL and NoSQL, REST service frameworks, JSON parsers.
    https://www.acte.in/java-training-in-chennai
    https://www.acte.in/java-training-in-bangalore
    https://www.acte.in/java-training-in-hyderabad
    https://www.acte.in/java-training-in-coimbatore
    https://www.acte.in/java-training

    ReplyDelete
  4. At least for now. No. The majority of all Android Apps, libraries, tutorials and books is still Java and Kotlin is far behind. If you like to use Java for Android development then just do it.
    Java Training in Chennai

    Java Training in Bangalore

    Java Training in Hyderabad

    Java Training in Coimbatore

    Java Training

    ReplyDelete
  5. Thanks for Sharing This Article.this very halp for develop openCL concept.It is very so much valuable content."Nice blog I really appreciate your words,Nice post. It is really amazing and helpful.


    AWS Course in Chennai

    AWS Course in Bangalore

    AWS Course in Hyderabad

    AWS Course in Coimbatore

    AWS Course

    AWS Certification Course

    AWS Certification Training

    AWS Online Training

    AWS Training

    ReplyDelete
  6. Android development is a bit more complex than web development. For Android programming, the Java language is used that required more codding as compared with the iOS Swift programming. ... For example, if you want to learn iOS programming Swift then you must have a MacBook with you for the mobile app learning.thanks !!

    Android Training in Chennai

    Android Online Training in Chennai

    Android Training in Bangalore

    Android Training in Hyderabad

    Android Training in Coimbatore

    Android Training

    Android Online Training


    ReplyDelete
  7. Betway Casino - Review & Ratings | JT Hub
    Read the full Betway Casino review & find out about its 청주 출장안마 games, mobile app and 하남 출장마사지 banking options. 세종특별자치 출장마사지 See 밀양 출장샵 everything you need to know to get  Rating: 용인 출장안마 4 · ‎Review by JT Hub

    ReplyDelete
  8. V-Ray 5 for SketchUp Cracked gives you an essential collection of free ready-to-render assets and HDRI skies. Plus, it has boosted CPU denoising with Intel Open .V-Ray License Key

    ReplyDelete
  9. Movavi Video Editor 23.0.1 Crack is a great video editing software. It is still a user-friendly system for Windows. Movavi Video Editor Plus Crack

    ReplyDelete