ROSE Compiler Framework/OpenMP Acclerator Model Implementation
Overview
editWe are experimenting a trial implementation of the OpenMP accelerator extension in OpenMP 4.0 specification. It is not meant to be the official or complete implementation due to the fast changing nature of the specification and our resource limitations.
The implementation is based on the existing OpenMP implementation in ROSE.
Implemented Features
editThe implementation loosely follows the technical report http://www.openmp.org/mp-documents/TR1_167.pdf and later http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf .
- pragma omp _target + pragma omp parallel for
- map clauses
- reduction on GPUs
- pragma omp _target data
- collapse clause
Compile and Run
editWe incrementally release the work as part of the ROSE (EDG 4.x based version) compiler's hosted at https://github.com/rose-compiler/rose-develop
- clone it
- configure and build librose.so under buildtree/src
- build the translator roseompacc under tests/nonsmoke/functional/roseTests/ompLoweringTests/ by typing "make roseompacc"
The translator source is rose-omp-accelerator/tests/nonsmoke/functional/roseTests/ompLoweringTests/roseompacc.C
The command lines (two steps for now) to use roseompacc is exemplified in rose-omp-accelerator/tests/nonsmoke/functional/roseTests/ompLoweringTests/Makefile.am
# Experimental translation for OpenMP accelerator model directives # no final compilation for now, which requires CUDA compiler for the generated code test_omp_acc:axpy_ompacc.o matrixmultiply-ompacc.o jacobi-ompacc.o rose_axpy_ompacc.cu:roseompacc ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/axpy_ompacc.c rose_matrixmultiply-ompacc.cu:roseompacc ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/matrixmultiply-ompacc.c rose_jacobi-ompacc.cu:roseompacc ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/jacobi-ompacc.c # build executables using nvcc axpy_ompacc.out:rose_axpy_ompacc.cu nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@ matrixmultiply-ompacc.out:rose_matrixmultiply-ompacc.cu nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@ jacobi-ompacc.out:rose_jacobi-ompacc.cu nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
Implementation Details
editParsing
editflex/bison
- https://github.com/rose-compiler/rose-develop/blob/master/src/frontend/SageIII/omplexer.ll
- https://github.com/rose-compiler/rose-develop/blob/master/src/frontend/SageIII/ompparser.yy
AST attribute storing parsing results
- https://github.com/rose-compiler/rose-develop/blob/master/src/frontend/SageIII/OmpAttribute.h
- https://github.com/rose-compiler/rose-develop/blob/master/src/frontend/SageIII/OmpAttribute.C
IR generation: ROSETTA, convert attribute to dedicated AST
Translation
edittranslation:
- https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/omp_lowering.cpp
- void transOmp_targetParallel() translate a parallel region associated with a _target directive
- void transOmp_targetLoop() translate an omp for loop associated with a _target directive
- void transOmpMapVariables() translate data mapping clauses
Internally, it invokes the outliner acting in mode of classic (passing individual parameters without any wrapping array or structure)
- Outliner::enable_classic = true;
Runtime Support
editruntime support
- https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/libxomp.h
- https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/xomp.c
- https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu
- https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/xomp_cuda_lib_inlined.cu
Rules about what to put into each file:
- libxomp.h runtime interface to compiler writers
- xomp.c runtime support for CPU threading
- xomp_cuda_lib.cu provides the regular C functions wrapping CUDA code (without the __device__ keyword) called by the host code.
- xomp_cuda_lib_inlined.cu provides the __device__ functions called by the CUDA kernel functions (with __global__ keyword).
At the time the work being done, CUDA and/or nvcc does not have linker for device code (linking __global__ and _device_ functions from different files). We have to put some common device functions into this file. So they can be included in the same file in which the generated CUDA kernels are generated (__global__ functions can only call __device__ functions from the same file).
Device Data Management
editThe runtime provides two ways to manage device data
- explicit manage of data allocation, copy-to, copy-back, and deallocation
- automatic management using device data environment (DDE) functions
- xomp_deviceDataEnvironmentEnter(); // enter an environment , pushed to an environment stack
- xomp_deviceDataEnvironmentPrepareVariable(); // reuse data if already in parent environments, or allocate, register with the current environment stack
- xomp_deviceDataEnvironmentExit(); // auto deallocation, copy back data. pop environment stack
Examples explicit data management
int mmm() { { float *_dev_a; int _dev_a_size = sizeof(float ) * N * M; _dev_a = ((float *)(xomp_deviceMalloc(_dev_a_size))); xomp_memcpyHostToDevice(((void *)_dev_a),((const void *)a),_dev_a_size); float *_dev_b; int _dev_b_size = sizeof(float ) * M * K; _dev_b = ((float *)(xomp_deviceMalloc(_dev_b_size))); xomp_memcpyHostToDevice(((void *)_dev_b),((const void *)b),_dev_b_size); float *_dev_c; int _dev_c_size = sizeof(float ) * N * M; _dev_c = ((float *)(xomp_deviceMalloc(_dev_c_size))); xomp_memcpyHostToDevice(((void *)_dev_c),((const void *)c),_dev_c_size); /* Launch CUDA kernel ... */ int _threads_per_block_ = xomp_get_maxThreadsPerBlock(); int _num_blocks_ = xomp_get_max1DBlock(1023 - 0 + 1); OUT__1__9221__<<<_num_blocks_,_threads_per_block_>>>(_dev_a,_dev_b,_dev_c); xomp_freeDevice(_dev_a); xomp_freeDevice(_dev_b); xomp_memcpyDeviceToHost(((void *)c),((const void *)_dev_c),_dev_c_size); xomp_freeDevice(_dev_c); } return 0; }
Automated data management supporting reusing nested device data environments
int mmm() { { xomp_deviceDataEnvironmentEnter(); float *_dev_a; int _dev_a_size = sizeof(float ) * 1024 * 1024; _dev_a = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)a),_dev_a_size,1,0))); float *_dev_b; int _dev_b_size = sizeof(float ) * 1024 * 1024; _dev_b = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)b),_dev_b_size,1,0))); float *_dev_c; int _dev_c_size = sizeof(float ) * 1024 * 1024; _dev_c = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)c),_dev_c_size,1,1))); /* Launch CUDA kernel ... */ int _threads_per_block_ = xomp_get_maxThreadsPerBlock(); int _num_blocks_ = xomp_get_max1DBlock(1023 - 0 + 1); OUT__1__9221__<<<_num_blocks_,_threads_per_block_>>>(_dev_a,_dev_b,_dev_c); xomp_deviceDataEnvironmentExit(); } return 0; }
Environment variables
editControl the number of devices being used in the multi-GPU support by default
export OMP_NUM_DEVICES=5
The runtime will automatically detect this environment variable and use it to control the GPU count.
You can retrieve this count n the code, like
int GPU_N = xomp_get_num_devices(); // this function will obtain the env variable.
If OMP_NUM_DEVICES is specified, xomp_get_num_devices() will internally call int omp_get_max_devices(void) to obtain the max number of devices supported by the hardware.
Relevant commit
Loop scheduling
editRuntime loop scheduling must be provided since the naive 1-iteration-to-1-thread mapping in CUDA tutorials won't work for large iteration space. (iterations count > total GPU threads).
// Jacobi 2-D kernel: // Naive 1-iteration to 1 GPU thread scheduling: each GPU thread gets one iteration to work on. // Problem: won't scale to large iteration space. int i = blockIdx.x * blockDim.x + threadIdx.x + 1; int j = blockIdx.y * blockDim.y + threadIdx.y + 1; newa[j*m+i] = w0*a[j*m+i] + w1 * (a[j*m+i-1] + a[(j-1)*m+i] + a[j*m+i+1] + a[(j+1)*m+i]) + w2 * (a[(j-1)*m+i-1] + a[(j+1)*m+i-1] + a[(j-1)*m+i+1] + a[(j+1)*m+i+1]);
We provide two loop schedulers in the runtime so each CUDA thread may get more than 1 iterations to handle large iteration space.
- a static even scheduler: this scheduler evenly divides up the iteration space into roughly equal sized chunks. It then assigns each chunk to a CUDA thread. This scheduling policy may stress the memory too much since each thread will touch a large range of data brought in by the iteration chunk
- a round-robin scheduler: Each thread grabs one iteration (or more iterations) at a time. It is identical to OpenMP's schedule(static, chunk) policy.
Testing use of the schedulers (using corresponding CPU versions) are given at:
GPU version use example: the round-robin scheduler (now the new default scheduler)
void OUT__2__10550__(int n,int *_dev_u) { int ij; int _dev_lower, _dev_upper; // variables for adjusted loop info considering both original chunk size and step(strip) int _dev_loop_chunk_size; int _dev_loop_sched_index; int _dev_loop_stride; // 1-D thread block: int _dev_thread_num = omp_get_num_threads(); int _dev_thread_id = omp_get_thread_num(); printf ("thread count = %d, current thread id = %d\n", _dev_thread_num, _dev_thread_id); int orig_start =0; // must be correct!! int orig_end = n-1; // use inclusive bound int orig_step = 1; int orig_chunk_size = 1; XOMP_static_sched_init (orig_start, orig_end, orig_step, orig_chunk_size, _dev_thread_num, _dev_thread_id, \ & _dev_loop_chunk_size , & _dev_loop_sched_index, & _dev_loop_stride); printf ("Initialized chunk size = %d, sched indx =%d, stride = %d\n",_dev_loop_chunk_size, _dev_loop_sched_index, _dev_loop_stride); while (XOMP_static_sched_next (&_dev_loop_sched_index, orig_end, orig_step, _dev_loop_stride, _dev_loop_chunk_size, _dev_thread_num, _dev_thread_id, & _dev_lower , & _dev_upper)) { printf ("Thread ID: %d Allocated lower = %d upper = %d\n", _dev_thread_id, _dev_lower, _dev_upper); for (ij = _dev_lower ; ij <= _dev_upper; ij ++) { // using inclusive bound here _dev_u[ij] += (n - ij); } } }
Example of how to use the static-even scheduler (no longer the default scheduler in recent release due to performance drawbacks.):
// using a scheduler __global__ void OUT__1__11058__(int j,int k,float *_dev_a,float *_dev_b,float *_dev_c) { int _dev_i; long _dev_lower, _dev_upper; XOMP_accelerator_loop_default (0, MSIZE -1 , 1, &_dev_lower, &_dev_upper); for (_dev_i = _dev_lower; _dev_i<= _dev_upper; _dev_i ++) { for (j = 0; j < MSIZE; j++) { float c= 0.0; for (k = 0; k < MSIZE; k++) c += _dev_a[_dev_i * MSIZE + k] * _dev_b[k * MSIZE + j]; _dev_c[_dev_i * MSIZE + j] = c; } } }
Testing
editInput files under https://github.com/rose-compiler/rose-develop/tree/master/tests/nonsmoke/functional/CompileTests/OpenMP_tests
- axpy_ompacc.c
- axpy_ompacc2.c
- matrixmultiply-ompacc.c
- jacobi-ompacc.c
- jacobi-ompacc-opt1.c // use "_target data" region
Test directory https://github.com/rose-compiler/rose-develop/tree/master/tests/nonsmoke/functional/roseTests/ompLoweringTests
- Test translator: roseompacc.C
- Test driver: Makefile.am
- make ompacc_test within the build tree will trigger the associated testing
- Make sure you have nvidia nvcc compiler installed to generate the executables. Otherwise only cuda file will be generated and the final compilation will fail.
Some sample generated CUDA files can be found at:
publication
edit- C. Liao, Y. Yan, B. de Supinsky, D. Quinlan, B. Chapman, Early Experiences With The OpenMP Accelerator Model, IWOMP 2013, https://e-reports-ext.llnl.gov/pdf/755563.pdf