ROSE Compiler Framework/OpenMP Acclerator Model Implementation

ROSE Compiler Framework/OpenMP Support>

Overview

edit

We are experimenting a trial implementation of the OpenMP accelerator extension in OpenMP 4.0 specification. It is not meant to be the official or complete implementation due to the fast changing nature of the specification and our resource limitations.

The implementation is based on the existing OpenMP implementation in ROSE.

Implemented Features

edit

The implementation loosely follows the technical report http://www.openmp.org/mp-documents/TR1_167.pdf and later http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf .

  • pragma omp _target + pragma omp parallel for
  • map clauses
  • reduction on GPUs
  • pragma omp _target data
  • collapse clause

Compile and Run

edit

We incrementally release the work as part of the ROSE (EDG 4.x based version) compiler's hosted at https://github.com/rose-compiler/rose-develop

  • clone it
  • configure and build librose.so under buildtree/src
  • build the translator roseompacc under tests/nonsmoke/functional/roseTests/ompLoweringTests/ by typing "make roseompacc"

The translator source is rose-omp-accelerator/tests/nonsmoke/functional/roseTests/ompLoweringTests/roseompacc.C

The command lines (two steps for now) to use roseompacc is exemplified in rose-omp-accelerator/tests/nonsmoke/functional/roseTests/ompLoweringTests/Makefile.am

# Experimental translation for OpenMP accelerator model directives
# no final compilation for now, which requires CUDA compiler for the generated code
test_omp_acc:axpy_ompacc.o matrixmultiply-ompacc.o jacobi-ompacc.o
rose_axpy_ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/axpy_ompacc.c 
rose_matrixmultiply-ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/matrixmultiply-ompacc.c 
rose_jacobi-ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/jacobi-ompacc.c 

# build executables using nvcc
axpy_ompacc.out:rose_axpy_ompacc.cu
        nvcc $<  $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
matrixmultiply-ompacc.out:rose_matrixmultiply-ompacc.cu
        nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
jacobi-ompacc.out:rose_jacobi-ompacc.cu
        nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@

Implementation Details

edit

Parsing

edit

flex/bison

AST attribute storing parsing results

IR generation: ROSETTA, convert attribute to dedicated AST

Translation

edit

translation:

Internally, it invokes the outliner acting in mode of classic (passing individual parameters without any wrapping array or structure)

  • Outliner::enable_classic = true;

Runtime Support

edit

runtime support

Rules about what to put into each file:

  • libxomp.h runtime interface to compiler writers
  • xomp.c runtime support for CPU threading
  • xomp_cuda_lib.cu provides the regular C functions wrapping CUDA code (without the __device__ keyword) called by the host code.
  • xomp_cuda_lib_inlined.cu provides the __device__ functions called by the CUDA kernel functions (with __global__ keyword).

At the time the work being done, CUDA and/or nvcc does not have linker for device code (linking __global__ and _device_ functions from different files). We have to put some common device functions into this file. So they can be included in the same file in which the generated CUDA kernels are generated (__global__ functions can only call __device__ functions from the same file).

Device Data Management

edit

The runtime provides two ways to manage device data

  • explicit manage of data allocation, copy-to, copy-back, and deallocation
  • automatic management using device data environment (DDE) functions
    • xomp_deviceDataEnvironmentEnter(); // enter an environment , pushed to an environment stack
    • xomp_deviceDataEnvironmentPrepareVariable(); // reuse data if already in parent environments, or allocate, register with the current environment stack
    • xomp_deviceDataEnvironmentExit(); // auto deallocation, copy back data. pop environment stack

Examples explicit data management

int mmm()
{
{
    float *_dev_a;
    int _dev_a_size = sizeof(float ) * N * M;
    _dev_a = ((float *)(xomp_deviceMalloc(_dev_a_size)));
    xomp_memcpyHostToDevice(((void *)_dev_a),((const void *)a),_dev_a_size);
    float *_dev_b;
    int _dev_b_size = sizeof(float ) * M * K;
    _dev_b = ((float *)(xomp_deviceMalloc(_dev_b_size)));
    xomp_memcpyHostToDevice(((void *)_dev_b),((const void *)b),_dev_b_size);
    float *_dev_c;
    int _dev_c_size = sizeof(float ) * N * M;
    _dev_c = ((float *)(xomp_deviceMalloc(_dev_c_size)));
    xomp_memcpyHostToDevice(((void *)_dev_c),((const void *)c),_dev_c_size);
/* Launch CUDA kernel ... */
    int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
    int _num_blocks_ = xomp_get_max1DBlock(1023 - 0 + 1);
    OUT__1__9221__<<<_num_blocks_,_threads_per_block_>>>(_dev_a,_dev_b,_dev_c);
    xomp_freeDevice(_dev_a);
    xomp_freeDevice(_dev_b);
    xomp_memcpyDeviceToHost(((void *)c),((const void *)_dev_c),_dev_c_size);
    xomp_freeDevice(_dev_c);
  }
  return 0;
}

Automated data management supporting reusing nested device data environments

int mmm()
{
{
    xomp_deviceDataEnvironmentEnter();
    float *_dev_a;
    int _dev_a_size = sizeof(float ) * 1024 * 1024;
    _dev_a = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)a),_dev_a_size,1,0)));
    float *_dev_b;
    int _dev_b_size = sizeof(float ) * 1024 * 1024;
    _dev_b = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)b),_dev_b_size,1,0)));
    float *_dev_c;
    int _dev_c_size = sizeof(float ) * 1024 * 1024;
    _dev_c = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)c),_dev_c_size,1,1)));
/* Launch CUDA kernel ... */
    int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
    int _num_blocks_ = xomp_get_max1DBlock(1023 - 0 + 1);
    OUT__1__9221__<<<_num_blocks_,_threads_per_block_>>>(_dev_a,_dev_b,_dev_c);
    xomp_deviceDataEnvironmentExit();
  }
  return 0;
}

Environment variables

edit

Control the number of devices being used in the multi-GPU support by default

 export OMP_NUM_DEVICES=5

The runtime will automatically detect this environment variable and use it to control the GPU count.

You can retrieve this count n the code, like

 int GPU_N = xomp_get_num_devices(); // this function will obtain the env variable. 

If OMP_NUM_DEVICES is specified, xomp_get_num_devices() will internally call int omp_get_max_devices(void) to obtain the max number of devices supported by the hardware.

Relevant commit

Loop scheduling

edit

Runtime loop scheduling must be provided since the naive 1-iteration-to-1-thread mapping in CUDA tutorials won't work for large iteration space. (iterations count > total GPU threads).

// Jacobi 2-D kernel: 
// Naive 1-iteration to 1 GPU thread scheduling: each GPU thread gets one iteration to work on.
// Problem: won't scale to large iteration space.
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    newa[j*m+i] = w0*a[j*m+i] +
	    w1 * (a[j*m+i-1] + a[(j-1)*m+i] +
		  a[j*m+i+1] + a[(j+1)*m+i]) +
	    w2 * (a[(j-1)*m+i-1] + a[(j+1)*m+i-1] +
		  a[(j-1)*m+i+1] + a[(j+1)*m+i+1]);


We provide two loop schedulers in the runtime so each CUDA thread may get more than 1 iterations to handle large iteration space.

  1. a static even scheduler: this scheduler evenly divides up the iteration space into roughly equal sized chunks. It then assigns each chunk to a CUDA thread. This scheduling policy may stress the memory too much since each thread will touch a large range of data brought in by the iteration chunk
  2. a round-robin scheduler: Each thread grabs one iteration (or more iterations) at a time. It is identical to OpenMP's schedule(static, chunk) policy.

Testing use of the schedulers (using corresponding CPU versions) are given at:

GPU version use example: the round-robin scheduler (now the new default scheduler)

void OUT__2__10550__(int n,int *_dev_u)
{
  int ij;
  int _dev_lower, _dev_upper;


  // variables for adjusted loop info considering both original chunk size and step(strip)
  int _dev_loop_chunk_size;
  int _dev_loop_sched_index;
  int _dev_loop_stride;

  // 1-D thread block:
  int _dev_thread_num = omp_get_num_threads();
  int _dev_thread_id = omp_get_thread_num();
  printf ("thread count = %d, current thread id = %d\n", _dev_thread_num, _dev_thread_id);

  int orig_start =0; // must be correct!!
  int orig_end = n-1; // use inclusive bound
  int orig_step = 1;
  int orig_chunk_size = 1;

  XOMP_static_sched_init (orig_start, orig_end, orig_step, orig_chunk_size, _dev_thread_num, _dev_thread_id, \
      & _dev_loop_chunk_size , & _dev_loop_sched_index, & _dev_loop_stride);

  printf ("Initialized chunk size = %d, sched indx =%d, stride = %d\n",_dev_loop_chunk_size, _dev_loop_sched_index, _dev_loop_stride);

  while (XOMP_static_sched_next (&_dev_loop_sched_index, orig_end, orig_step, _dev_loop_stride, _dev_loop_chunk_size, _dev_thread_num, _dev_thread_id, & _dev_lower
        , & _dev_upper))
  {
    printf ("Thread ID: %d Allocated lower = %d upper = %d\n", _dev_thread_id, _dev_lower, _dev_upper);
    for (ij = _dev_lower ; ij <= _dev_upper; ij ++) { // using inclusive bound here
        _dev_u[ij] += (n - ij);         
    }
  }
}

Example of how to use the static-even scheduler (no longer the default scheduler in recent release due to performance drawbacks.):

// using a scheduler
__global__ void OUT__1__11058__(int j,int k,float *_dev_a,float *_dev_b,float *_dev_c)
{
  int _dev_i;
  long _dev_lower, _dev_upper;
  XOMP_accelerator_loop_default (0, MSIZE -1 , 1, &_dev_lower, &_dev_upper);

  for (_dev_i = _dev_lower; _dev_i<= _dev_upper; _dev_i ++) 
  {
    for (j = 0; j < MSIZE; j++)
    {
      float c= 0.0;
      for (k = 0; k < MSIZE; k++)
        c += _dev_a[_dev_i * MSIZE + k] * _dev_b[k * MSIZE + j];
      _dev_c[_dev_i * MSIZE + j] = c;
    }
  }
}

Testing

edit

Input files under https://github.com/rose-compiler/rose-develop/tree/master/tests/nonsmoke/functional/CompileTests/OpenMP_tests

  • axpy_ompacc.c
  • axpy_ompacc2.c
  • matrixmultiply-ompacc.c
  • jacobi-ompacc.c
  • jacobi-ompacc-opt1.c // use "_target data" region

Test directory https://github.com/rose-compiler/rose-develop/tree/master/tests/nonsmoke/functional/roseTests/ompLoweringTests

  • Test translator: roseompacc.C
  • Test driver: Makefile.am
  • make ompacc_test within the build tree will trigger the associated testing
    • Make sure you have nvidia nvcc compiler installed to generate the executables. Otherwise only cuda file will be generated and the final compilation will fail.

Some sample generated CUDA files can be found at:

publication

edit
  NODES
HOME 1
Intern 2
languages 2
Note 1
OOP 20
os 73