OpenMP

OpenMP is an API for parallel programming on shared-memory machines. It provides simple, high-level constructs for parallelism while hiding the details, keeping it easy to use and quick to write. OpenMP is supported in C, C++, and Fortran.

Structure

Generally, an OpenMP program will use the following structure.

  • Master: a master thread is started, which is used to set up the environment and initialize variables.

  • Slave: slave threads are created for sections of code marked by a special directive; these are the threads that run the parallel sections.

Each thread will have its own ID, which can be obtained using the function omp_get_thread_num(), but more on that later.

          __________ Slave
         /__________ Slave
        /
Master ------------- Master
        \___________ Slave
         \__________ Slave

Compiling and running OpenMP

A simple "hello world" program can be parallelized using the #pragma omp parallel directive

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }
    return 0;
}

Compile it like this

# The OpenMP flag depends on the compiler
# intel : -openmp
# gcc   : -fopenmp
# pgcc  : -mp
gcc -fopenmp hello.c -o Hello

Running it should output

Hello, World!
...
Hello, World!

The exact number of "Hello, World!"s depends on the number of cores on your machine; for example, I got 12 on my laptop.

Threads and processes

You can change the default number of threads by setting the OMP_NUM_THREADS environment variable.
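For example, using the Hello binary built above

export OMP_NUM_THREADS=8   # request 8 threads for subsequent runs
./Hello                    # now prints "Hello, World!" 8 times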

Here are some useful functions from the omp.h runtime library

// Check the number of threads
printf("Max Threads: %d\n", omp_get_max_threads());
printf("Current number of threads: %d\n", omp_get_num_threads());
printf("Current Thread ID: %d\n", omp_get_thread_num());

// Modify the number of threads
omp_set_num_threads(int);

// Check if we are in a parallel region
omp_in_parallel();

// Dynamically vary the number of threads
omp_set_dynamic(int);
omp_get_dynamic();

// Check the number of processors
printf("Number of processors: %d\n", omp_get_num_procs());

Private and shared variables

// Variables in parallel sections can be either private or shared.

/* Private variables are private to each thread, as each thread has its own
 * private copy. These variables are not initialized or maintained outside
 * the thread.
 */
#pragma omp parallel private(x, y)

/* Shared variables are visible and accessible by all threads. By default,
 * all variables in the work-sharing region are shared except the loop
 * iteration counter.
 *
 * Shared variables should be used with care as they can cause race
 * conditions.
 */
#pragma omp parallel shared(a, b, c)

// They can be declared together as follows
#pragma omp parallel private(x, y) shared(a, b, c)
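Putting the two together, here is a minimal runnable sketch (the variable names are invented for this example). Each thread writes its own private x, while updates to the shared sum are guarded with an atomic directive (covered in the next section)

#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;  // shared: a single copy visible to all threads
    int x;        // private: each thread gets its own copy

    #pragma omp parallel private(x) shared(sum)
    {
        x = omp_get_thread_num();  // safe: x is private to this thread
        #pragma omp atomic         // guard the shared update against races
        sum += x;
    }

    printf("Sum of thread IDs: %d\n", sum);
    return 0;
}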

Synchronization

OpenMP provides a number of directives to control the synchronization of threads

#pragma omp parallel
{
    /* `critical`: the enclosed code block will be executed by only one thread
     * at a time, and never simultaneously by multiple threads. It is often
     * used to protect shared data from race conditions.
     */
    #pragma omp critical
    data += data + computed;

    /* `single`: used when a block of code needs to be run by only a single
     * thread in a parallel section. Good for managing control variables.
     */
    #pragma omp single
    printf("Current number of threads: %d\n", omp_get_num_threads());

    /* `atomic`: ensures that a specific memory location is updated atomically
     * to avoid race conditions. */
    #pragma omp atomic
    counter += 1;

    /* `ordered`: the structured block is executed in the order in which
     * iterations would be executed in a sequential loop. */
    #pragma omp for ordered
    for (int i = 0; i < N; ++i) {
        #pragma omp ordered
        process(data[i]);
    }

    /* `barrier`: forces all threads to wait until all threads reach this
     * point before proceeding. */
    #pragma omp barrier

    /* `nowait`: allows threads to proceed with their next task without
     * waiting for other threads to complete the current one. */
    #pragma omp for nowait
    for (int i = 0; i < N; ++i) {
        process(data[i]);
    }

    /* `reduction`: combines the results of each thread's computation into a
     * single result. Inside an existing parallel region this uses `omp for`;
     * a standalone loop would use `omp parallel for` instead. */
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
}
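As a complete, runnable sketch of the reduction clause (the loop body is invented for this example; it sums the first N terms of the harmonic series)

#include <stdio.h>

int main() {
    const int N = 1000000;
    double sum = 0.0;

    // Each thread accumulates into its own private copy of `sum`;
    // OpenMP combines the partial results when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        sum += 1.0 / (i + 1);
    }

    printf("Harmonic sum of %d terms: %f\n", N, sum);
    return 0;
}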

Example of the use of barrier

#include <omp.h>
#include <stdio.h>

int main() {

    // Current number of active threads (1 here, since we are
    // outside any parallel region)
    printf("Num of threads is %d\n", omp_get_num_threads());

#pragma omp parallel
    {
        // Current thread ID
        printf("Thread ID: %d\n", omp_get_thread_num());

        // Wait here until all other threads have reached this point
#pragma omp barrier
        if (omp_get_thread_num() == 0)
        {
            printf("\nNumber of active threads: %d\n", omp_get_num_threads());
        }
    }
    return 0;
}

Parallelizing Loops

It is simple to parallelize loops using OpenMP. Using a work-sharing directive, we can do the following

#pragma omp parallel
{
    #pragma omp for
    // for loop to be parallelized
    for (...) ...
}

A loop must be easily parallelizable for OpenMP to divide its iterations among threads. If there are any data dependencies from one iteration to the next, OpenMP can't parallelize it.
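For instance, in the invented sketch below, the first loop has fully independent iterations and parallelizes safely, while the second (a running sum) reads the result of the previous iteration and therefore cannot be parallelized this way

#include <stdio.h>

#define N 8

int main() {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[N], c[N];

    // Independent iterations: safe to parallelize
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        b[i] = a[i] * 2;
    }

    // Each iteration depends on c[i - 1], so adding
    // `#pragma omp parallel for` here would give wrong results
    c[0] = a[0];
    for (int i = 1; i < N; ++i) {
        c[i] = c[i - 1] + a[i];
    }

    printf("b[%d] = %d, c[%d] = %d\n", N - 1, b[N - 1], N - 1, c[N - 1]);
    return 0;
}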

Speed Comparison

The following is a C++ program which compares parallelized code with serial code

#include <iostream>
#include <vector>
#include <chrono>
#include <omp.h>

int main() {
    const int num_elements = 100000000;

    std::vector<double> a(num_elements, 1.0);
    std::vector<double> b(num_elements, 2.0);
    std::vector<double> c(num_elements, 0.0);

    // Serial version
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_elements; i++) {
        c[i] = a[i] * b[i];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration_serial = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    // Parallel version with OpenMP
    start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for
    for (int i = 0; i < num_elements; i++) {
        c[i] = a[i] * b[i];
    }
    end_time = std::chrono::high_resolution_clock::now();
    auto duration_parallel = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    std::cout << "Serial execution time: " << duration_serial << " ms" << std::endl;
    std::cout << "Parallel execution time: " << duration_parallel << " ms" << std::endl;
    std::cout << "Speedup: " << static_cast<double>(duration_serial) / duration_parallel << std::endl;

    return 0;
}

This results in

Serial execution time: 488 ms
Parallel execution time: 148 ms
Speedup: 3.2973

It should be noted that this example is fairly contrived: the actual speedup depends on the implementation, and serial code may sometimes run faster than parallel code due to cache performance.

Example

The following example uses OpenMP to calculate the Mandelbrot set

#include <iostream>
#include <fstream>
#include <complex>
#include <vector>
#include <omp.h>

const int width = 2000;
const int height = 2000;
const int max_iterations = 1000;

int mandelbrot(const std::complex<double> &c) {
    std::complex<double> z = c;
    int n = 0;
    while (std::abs(z) <= 2 && n < max_iterations) {
        z = z * z + c;
        n++;
    }
    return n;
}

int main() {
    std::vector<std::vector<int>> values(height, std::vector<int>(width));

    // Calculate the Mandelbrot set using OpenMP
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            double real = (x - width / 2.0) * 4.0 / width;
            double imag = (y - height / 2.0) * 4.0 / height;
            std::complex<double> c(real, imag);

            values[y][x] = mandelbrot(c);
        }
    }

    // Prepare the output image
    std::ofstream image("mandelbrot_set.ppm");
    image << "P3\n" << width << " " << height << " 255\n";

    // Write the output image in serial
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int value = values[y][x];
            int r = (value % 8) * 32;
            int g = (value % 16) * 16;
            int b = (value % 32) * 8;

            image << r << " " << g << " " << b << " ";
        }
        image << "\n";
    }

    image.close();
    std::cout << "Mandelbrot set image generated as mandelbrot_set.ppm." << std::endl;

    return 0;
}
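Assuming the source is saved as mandelbrot.cpp (the filename is arbitrary), it compiles and runs like the earlier examples

g++ -fopenmp mandelbrot.cpp -o mandelbrot
./mandelbrot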

Resources