OpenCL - Local Memory
I understand the difference between global and local memory in general, but I have problems using local memory.

1) Is it worth transforming global-memory variables into local-memory variables?
2) How do I use local barriers?

Maybe you can give me a little example.

I tried a Jacobi computation using local memory, but I get 0 as a result. Maybe you can give me some advice.
Here is my working solution:
#define idx(_m, _i, _j) (_m)[(_i) * n + (_j)]
#define u(_i, _j) idx(ul, _i, _j)

__kernel void jacobi(__global value* u, __global value* f,
                     __global value* tmp, value factor)
{
    int i  = get_global_id(0);
    int j  = get_global_id(1);
    int il = get_local_id(0);
    int jl = get_local_id(1);

    __local value ul[(n+2)*(n+2)];
    __local value fl[(n+2)*(n+2)];

    idx(ul, il, jl) = idx(u, i, j);
    idx(fl, il, jl) = idx(f, i, j);
    barrier(CLK_LOCAL_MEM_FENCE);

    idx(tmp, i, j) = (value)0.25 * ( u(il-1, jl) + u(il, jl-1)
                                   + u(il, jl+1) + u(il+1, jl)
                                   - factor * idx(fl, il, jl));
}
Thanks.
1) Query the CL_DEVICE_LOCAL_MEM_SIZE value; it is 16 kB minimum and increases on different hardware. If your local variables can fit in it, and if they are re-used many times, you should put them into local memory before usage. If you don't, the automatic use of the L2 cache when accessing the GPU's global memory can still give effective utilization of the cores.
If the global-to-local copy is taking an important slice of the time, you can do an async work-group copy while the cores are calculating other things.
Another important point: more free local-memory space means more concurrent threads per core. If the GPU has 64 cores per compute unit, only 64 threads can run when all local memory is used; when more space is free, 128, 192, ..., 2560 threads can run at the same time if there are no other limitations.
A profiler can show the bottlenecks, so you can decide whether it is worth trying or not.
For example, a naive matrix-matrix multiplication using nested loops relies on the L1 and L2 caches, while submatrices can fit in local memory. Maybe 48x48 submatrices of floats can fit in a mid-range graphics card's compute unit and can be used N times over the whole calculation before being replaced by the next submatrix.
Querying CL_DEVICE_LOCAL_MEM_TYPE can return LOCAL or GLOBAL; if it returns GLOBAL, using local memory is not recommended.
Lastly, the size of any memory-space allocation (except __private) must be known at compile time (on the device side, not the host side), because the compiler must know how many wavefronts can be issued to achieve maximum performance (and/or maybe for other compiler optimizations). That is why no recursive function is allowed in OpenCL 1.2. But you can copy a function and rename it N times to get pseudo-recursiveness.
2) Barriers are a meeting point for the threads of a work-group. Similar to cyclic barriers, they all stop there and wait until everyone arrives before continuing. If it is a local barrier, all work-group threads finish their local-memory operations before departing from that point. If you want to write the numbers 1, 2, 3, 4, ... into a local array, you can't be sure whether the threads are still writing these numbers or have finished; only after the local barrier is passed is the array guaranteed to have its final values written.
All work-group threads must hit the same barrier. If one of them cannot reach it, the kernel gets stuck or you get an error.
__local int localArray[64];      // shared by the whole work-group, one copy
                                 // per compute unit, not one per thread

if(localThreadId != 0)
    localArray[localThreadId] = localThreadId;   // 63 values written in parallel,
    // but we cannot be sure the 2nd thread is done writing, nor the last one

if(localThreadId == 0)           // 1st thread of the work-group loads from VRAM
    localArray[localThreadId] = globalArray[globalThreadId];

barrier(CLK_LOCAL_MEM_FENCE);    // all threads wait for each other here
                                 // (maybe for the 1st SIMD or even the
                                 //  whole 1st wavefront!)

// Here every thread has written its own id into the local array, so it is
// safe to read. The first element holds a value from global memory.
// Let's add that value to all the other values:
if(localThreadId != 0)
    localArray[localThreadId] += localArray[0];
Working example (local work-group size = 64):

Inputs: 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, ...
__kernel void vecadd(__global float* x)
{
    int id  = get_global_id(0);
    int idl = get_local_id(0);

    __local float loc[64];
    loc[idl] = x[id];
    barrier(CLK_LOCAL_MEM_FENCE);

    float distance_square_sum = 0;
    for(int i = 0; i < 64; i++)
    {
        float diff = loc[idl] - loc[i];
        float diff_squared = diff * diff;
        distance_square_sum += diff_squared;
    }
    x[id] = distance_square_sum;
}
Output: 30, 74, 246, 546, 974, 30, 30, 30, ...