OpenMP

OpenMP is an API for parallel programming on shared-memory machines. It provides simple, high-level constructs for parallelism while hiding the details, keeping it easy to use and quick to write. OpenMP is supported in C, C++, and Fortran.

Structure

Generally, an OpenMP program will use the following structure.

  • Master: a master thread is started, which is used to set up the environment and initialize variables.

  • Slave: slave threads are created for sections of code marked by a special directive; these are the threads that run the parallel sections.

Each thread will have its own ID, which can be obtained using the function omp_get_thread_num(), but more on that later.

          __________ Slave
         /__________ Slave
        /
Master ------------- Master
        \___________ Slave
         \__________ Slave

Compiling and running OpenMP

A simple "hello world" program can be parallelized using the #pragma omp parallel directive

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }
    return 0;
}

Compile it like this

# The OpenMP flag depends on the compiler
# intel : -openmp
# gcc   : -fopenmp
# pgcc  : -mp
gcc -fopenmp hello.c -o Hello

Running it should output

Hello, World!
...
Hello, World!

The exact number of "Hello, World!"s depends on the number of cores on your machine; for example, I got 12 on my laptop.

Threads and processes

You can change the default number of threads by setting the OMP_NUM_THREADS environment variable.
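For example, using the Hello binary built above

export OMP_NUM_THREADS=8   # request 8 threads for subsequent runs
./Hello                    # now prints "Hello, World!" 8 times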

Here are some useful functions from the omp.h runtime library

// Check the number of threads
printf("Max Threads: %d\n", omp_get_max_threads());
printf("Current number of threads: %d\n", omp_get_num_threads());
printf("Current Thread ID: %d\n", omp_get_thread_num());

// Modify the number of threads
omp_set_num_threads(int);

// Check if we are in a parallel region
omp_in_parallel();

// Dynamically vary the number of threads
omp_set_dynamic(int);
omp_get_dynamic();

// Check the number of processors
printf("Number of processors: %d\n", omp_get_num_procs());

Private and shared variables

// Variables in parallel sections can be either private or shared.

/* Private variables are private to each thread, as each thread has its own
 * private copy. These variables are not initialized or maintained outside
 * the thread.
 */
#pragma omp parallel private(x, y)

/* Shared variables are visible and accessible by all threads. By default,
 * all variables in the work-sharing region are shared except the loop
 * iteration counter.
 *
 * Shared variables should be used with care as they can cause race
 * conditions.
 */
#pragma omp parallel shared(a, b, c)

// They can be declared together as follows
#pragma omp parallel private(x, y) shared(a, b, c)
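Putting the two together, here is a minimal runnable sketch (the variable names are invented for this example). Each thread writes its own private x, while updates to the shared sum are guarded with an atomic directive (covered in the next section)

#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;  // shared: a single copy visible to all threads
    int x;        // private: each thread gets its own copy

    #pragma omp parallel private(x) shared(sum)
    {
        x = omp_get_thread_num();  // safe: x is private to this thread
        #pragma omp atomic         // guard the shared update against races
        sum += x;
    }

    printf("Sum of thread IDs: %d\n", sum);
    return 0;
}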

Synchronization

OpenMP provides a number of directives to control the synchronization of threads

#pragma omp parallel
{
    /* `critical`: the enclosed code block will be executed by only one thread
     * at a time, and never simultaneously by multiple threads. It is often
     * used to protect shared data from race conditions.
     */
    #pragma omp critical
    data += data + computed;

    /* `single`: used when a block of code needs to be run by only a single
     * thread in a parallel section. Good for managing control variables.
     */
    #pragma omp single
    printf("Current number of threads: %d\n", omp_get_num_threads());

    /* `atomic`: ensures that a specific memory location is updated atomically
     * to avoid race conditions. */
    #pragma omp atomic
    counter += 1;

    /* `ordered`: the structured block is executed in the order in which
     * iterations would be executed in a sequential loop. */
    #pragma omp for ordered
    for (int i = 0; i < N; ++i) {
        #pragma omp ordered
        process(data[i]);
    }

    /* `barrier`: forces all threads to wait until all threads reach this
     * point before proceeding. */
    #pragma omp barrier

    /* `nowait`: allows threads to proceed with their next task without
     * waiting for other threads to complete the current one. */
    #pragma omp for nowait
    for (int i = 0; i < N; ++i) {
        process(data[i]);
    }

    /* `reduction`: combines the results of each thread's computation into a
     * single result. Inside an existing parallel region this uses `omp for`;
     * a standalone loop would use `omp parallel for` instead. */
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
}
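As a complete, runnable sketch of the reduction clause (the loop body is invented for this example; it sums the first N terms of the harmonic series)

#include <stdio.h>

int main() {
    const int N = 1000000;
    double sum = 0.0;

    // Each thread accumulates into its own private copy of `sum`;
    // OpenMP combines the partial results when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        sum += 1.0 / (i + 1);
    }

    printf("Harmonic sum of %d terms: %f\n", N, sum);
    return 0;
}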

Example of the use of barrier

#include <omp.h>
#include <stdio.h>

int main() {

    // Current number of active threads (1 here, since we are
    // outside any parallel region)
    printf("Num of threads is %d\n", omp_get_num_threads());

#pragma omp parallel
    {
        // Current thread ID
        printf("Thread ID: %d\n", omp_get_thread_num());

        // Wait here until all other threads have reached this point
#pragma omp barrier
        if (omp_get_thread_num() == 0)
        {
            printf("\nNumber of active threads: %d\n", omp_get_num_threads());
        }
    }
    return 0;
}

Parallelizing Loops

It is simple to parallelize loops using OpenMP. Using a work-sharing directive, we can do the following

#pragma omp parallel
{
    #pragma omp for
    // for loop to be parallelized
    for (...) ...
}

A loop must be easily parallelizable for OpenMP to divide its iterations among threads. If there are any data dependencies from one iteration to the next, OpenMP can't parallelize it.
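For instance, in the invented sketch below, the first loop has fully independent iterations and parallelizes safely, while the second (a running sum) reads the result of the previous iteration and therefore cannot be parallelized this way

#include <stdio.h>

#define N 8

int main() {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[N], c[N];

    // Independent iterations: safe to parallelize
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        b[i] = a[i] * 2;
    }

    // Each iteration depends on c[i - 1], so adding
    // `#pragma omp parallel for` here would give wrong results
    c[0] = a[0];
    for (int i = 1; i < N; ++i) {
        c[i] = c[i - 1] + a[i];
    }

    printf("b[%d] = %d, c[%d] = %d\n", N - 1, b[N - 1], N - 1, c[N - 1]);
    return 0;
}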

Speed Comparison

The following is a C++ program which compares parallelized code with serial code

#include <iostream>
#include <vector>
#include <chrono>
#include <omp.h>

int main() {
    const int num_elements = 100000000;

    std::vector<double> a(num_elements, 1.0);
    std::vector<double> b(num_elements, 2.0);
    std::vector<double> c(num_elements, 0.0);

    // Serial version
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_elements; i++) {
        c[i] = a[i] * b[i];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration_serial = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    // Parallel version with OpenMP
    start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for
    for (int i = 0; i < num_elements; i++) {
        c[i] = a[i] * b[i];
    }
    end_time = std::chrono::high_resolution_clock::now();
    auto duration_parallel = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    std::cout << "Serial execution time: " << duration_serial << " ms" << std::endl;
    std::cout << "Parallel execution time: " << duration_parallel << " ms" << std::endl;
    std::cout << "Speedup: " << static_cast<double>(duration_serial) / duration_parallel << std::endl;

    return 0;
}

This results in

Serial execution time: 488 ms
Parallel execution time: 148 ms
Speedup: 3.2973

It should be noted that this example is fairly contrived: the actual speedup depends on the implementation, and serial code may sometimes run faster than parallel code due to cache performance.

Example

The following example uses OpenMP to calculate the Mandelbrot set

#include <iostream>
#include <fstream>
#include <complex>
#include <vector>
#include <omp.h>

const int width = 2000;
const int height = 2000;
const int max_iterations = 1000;

int mandelbrot(const std::complex<double> &c) {
    std::complex<double> z = c;
    int n = 0;
    while (std::abs(z) <= 2 && n < max_iterations) {
        z = z * z + c;
        n++;
    }
    return n;
}

int main() {
    std::vector<std::vector<int>> values(height, std::vector<int>(width));

    // Calculate the Mandelbrot set using OpenMP
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            double real = (x - width / 2.0) * 4.0 / width;
            double imag = (y - height / 2.0) * 4.0 / height;
            std::complex<double> c(real, imag);

            values[y][x] = mandelbrot(c);
        }
    }

    // Prepare the output image
    std::ofstream image("mandelbrot_set.ppm");
    image << "P3\n" << width << " " << height << " 255\n";

    // Write the output image in serial
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int value = values[y][x];
            int r = (value % 8) * 32;
            int g = (value % 16) * 16;
            int b = (value % 32) * 8;

            image << r << " " << g << " " << b << " ";
        }
        image << "\n";
    }

    image.close();
    std::cout << "Mandelbrot set image generated as mandelbrot_set.ppm." << std::endl;

    return 0;
}
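Assuming the source is saved as mandelbrot.cpp (the filename is arbitrary), it compiles and runs like the earlier examples

g++ -fopenmp mandelbrot.cpp -o mandelbrot
./mandelbrot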

Resources