QuickMP (multi-processing) is a simple C++ loop parallelization API contained in
a single header file. It provides automatic scalable performance based on the
number of available cores. Parallelizing a loop is easy:
// Normal loop uses 1 thread.
for (int i=0; i<1000000; ++i)
{
    processData(i);
}
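The parallelized version uses the QMP macros described under Basic Usage below:

// Parallelized loop uses one thread per available core.
QMP_PARALLEL_FOR(i, 0, 1000000)
    processData(i);
QMP_END_PARALLEL_FOR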
QuickMP is intended for shared-memory programs, similar to OpenMP.* It is
cross-platform (tested on Windows, Mac OS X, and Linux), it chooses the thread
count automatically at runtime, and its API includes scheduling options and
shared data synchronization.
* OpenMP is a great standard. If your
preferred compiler supports it, you should use it instead. If not, QuickMP can
provide OpenMP-like functionality for any C++ compiler. (Notably, the popular
free Visual C++ Express compiler does not support OpenMP as of version 2008.
GCC has supported OpenMP since version 4.2.0, but many programmers still use
older versions.)
Below are some basic usage instructions, example results, and a variety of tips
and considerations.
Also see QuickTest (unit testing) and
QuickProf (performance profiling).
Basic Usage
Step 1: Include quickmp.h in your code.
#include <quickmp/quickmp.h>
Step 2: Link with standard platform-specific threading libraries.
Unix: link with libpthread.so (i.e. linker flag -lpthread). Also pass
pthread flags to the compiler and linker (e.g. -pthread for gcc).
Windows: most likely nothing to do. You must link with a multithreaded
version of the C runtime libraries (in Visual Studio this is under
C/C++ => Code Generation => Runtime Library in the project properties);
however, VC++ 2005 and later have removed the single-threaded versions, so
this is usually not an issue.
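For example, a typical build command with gcc on Linux (the file names here
are placeholders):

g++ -pthread myprogram.cpp -o myprogram -lpthread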
Step 3: Parallelize your loop.
#include <quickmp/quickmp.h>

void processData(int i); // defined elsewhere

int main(void)
{
    // Same as: for (int i=0; i<1000000; ++i) {...}
    QMP_PARALLEL_FOR(i, 0, 1000000)
        processData(i);
    QMP_END_PARALLEL_FOR
    return 0;
}
Example Results
The following plots show performance results for a simple ray tracing program
(included in the QuickMP source distribution) using various numbers of threads:
The second example compares results from two different CPU architectures. First,
note the minimal overhead required for a 1-thread parallel loop over the control
(a non-QuickMP loop). Notice the excellent scalability on the dual-core Pentium
D going from 1 thread to 2 threads (about 1.95X normal speed); going to
3 and 4 threads adds little benefit because there are only 2 processors. The
dual-core Xeon only improves by about 20% going from 1 thread to 2, possibly
because both threads are running on one hyper-threaded core, which is not quite
as good as two separate cores. Going from 2 to 3 threads adds a much larger
performance increase (roughly 1.8X normal speed) because at least one thread is
running on each core.
The ray tracing program renders a set of randomly-positioned spheres.
General Info
By default a parallel for loop will use one thread per processor (determined
at runtime). Each thread will be assigned a subset of the loop
iterations.
QuickMP is intended for shared memory programming, i.e. multithreaded
programs/libraries running on a single machine with multiple processors all
sharing the same memory.
QuickMP is designed for
data parallelism, not task parallelism. Thus, it is ideal for
applications where the same operations are applied repeatedly on lots of
data.
QuickMP follows a fork-join model. The idea is that the main thread
encounters a distinct parallel section, forks into a bunch of worker threads
which split up the work, and then joins back into a single thread again.
(Internally there is a thread pool which waits in the background until
needed.)
API macros are provided for general thread management (setting number of
threads, getting thread indices, etc.), declaring shared variables, data
synchronization (barriers and critical sections), scheduling hints for load
balancing, and more. See the API reference for more details.
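As a sketch, explicitly setting the thread count for subsequent loops might
look like this (QMP_SET_NUM_THREADS is an assumed name based on the
description above; check the API reference for the exact macro):

// Assumed macro name; see the QuickMP API reference.
QMP_SET_NUM_THREADS(2); // subsequent parallel loops use two threads

QMP_PARALLEL_FOR(i, 0, 1000)
    processData(i); // processData is a placeholder
QMP_END_PARALLEL_FOR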
The Rules
Like OpenMP, QuickMP requires the loop iterations to be order-independent:
one iteration cannot depend on the results of another iteration. In other
words, your code should give the same result even if the iterations run
backwards or in a random order. (A counterexample appears in the sketch
after these rules.)
The loop index always begins at a given starting index and counts up.
The code inside of the loop is special; it has a different scope from the
surrounding code. It cannot directly access outside variables without using
the shared data macros (described below). It can, however, call any function
and create local variables that would be allowed within a normal for loop in
the same scope.
Iteration over STL containers is possible for those with random access
iterators. Elements are accessed using the loop index, either with the []
operator or iterator arithmetic (e.g. container.begin() + i). (See the
sketch after these rules.)
Creating nested parallel for loops results in undefined behavior.
Creating multiple simultaneous parallel for loops (e.g. two different
threads each start their own loops) results in undefined behavior.
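The following sketch illustrates both points above (computeValue is a
placeholder, and the vector is hypothetical data):

// BAD: order-dependent. Each iteration reads the previous iteration's
// result, so this prefix sum cannot be parallelized:
//     for (int i=1; i<1000; ++i) { data[i] += data[i-1]; }

// GOOD: each iteration touches only its own element of an STL
// container with random access iterators.
std::vector<float> data(1000);
QMP_SHARE(data);
QMP_PARALLEL_FOR(i, 0, 1000)
    QMP_USE_SHARED(data, std::vector<float>);
    data[i] = computeValue(i); // independent of all other iterations
QMP_END_PARALLEL_FOR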
Shared Data
There is a 2-step process for using shared variables within a loop:
int sharedValue = 8;
std::vector<int> sharedData(100); // element type shown as int for illustration
QMP_SHARE(sharedValue);
QMP_SHARE(sharedData);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(sharedValue, int);
    QMP_USE_SHARED(sharedData, std::vector<int>);
    processData(sharedData, i, sharedValue);
QMP_END_PARALLEL_FOR
Shared variables that are modified must also be protected with a critical
section:
int sum = 0;
QMP_SHARE(sum);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(sum, int);
    QMP_CRITICAL(0);
    ++sum;
    QMP_END_CRITICAL(0);
QMP_END_PARALLEL_FOR
Performance Considerations
Parallel for loops are most effective at a very high level in your program,
not in the inner loops.
In practice runtime performance will be proportional to the number of
available processors but will probably not increase linearly with more
processors. (See Amdahl's Law.)
QuickMP adds minimal runtime overhead for thread management. In the
single-threaded case the overhead is negligible.
Threads are created and destroyed only when the program starts and exits and
when the number of threads is explicitly changed.
The parallel for loop macro provides an optional "schedule hint" argument to
specify how the iterations should be distributed. "Sequential" gives each
thread equal chunks of the loop range and is preferred if each iteration
takes the same amount of time. "Interleaved" gives the threads alternating
loop iterations and is preferred if the iterations could take different
amounts of time. The following example shows how to use this option:
int imageWidth = 1600;
int imageHeight = 1200;
int numPixels = imageWidth * imageHeight;
std::vector<Color> image(numPixels); // the output buffer must be shared, too
QMP_SHARE(imageWidth);
QMP_SHARE(imageHeight);
QMP_SHARE(scene);
QMP_SHARE(image);

QMP_PARALLEL_FOR(i, 0, numPixels, quickmp::INTERLEAVED)
    QMP_USE_SHARED(imageWidth, int);
    QMP_USE_SHARED(imageHeight, int);
    QMP_USE_SHARED(scene, std::vector<Sphere>); // element type illustrative
    QMP_USE_SHARED(image, std::vector<Color>);
    Color c = fireRay(scene, imageWidth, imageHeight, i);
    image[i] = c;
QMP_END_PARALLEL_FOR
Keep critical sections short to reduce thread idleness. If possible, avoid
critical sections altogether (see the sketch after this list).
If possible, avoid sharing data among threads to improve cache
performance.
Remember that calling global functions within threads can involve shared
memory (e.g. rand() uses shared global state variables).
Parallel reads from the same memory location should cause little slowdown,
but if that memory changes (e.g. when one thread writes to it), all threads
using that memory must update their cached copies.
CPU architecture makes a big difference, especially when data is shared
among threads. Newer multicore chips are designed for better multithreaded
performance (shared memory caches, faster interconnections). (An interesting
example is a program with lots of shared data running on a machine with two
processor dies, each containing two cores. This program will probably scale
very well from 1 to 2 threads because of the fast communication between two
cores on the same die. However, it might not scale as well from 2 to 3 or 4
threads, depending on the communication speed between the two dies.) Note
that a shared cache can slow things down even when two processors sharing
the cache access totally different memory locations: if each is frequently
accessing more memory than half of the cache size, it causes constant cache
misses for the other processor.
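As a sketch of avoiding a critical section entirely (computeValue is a
placeholder), have each iteration write to its own element of a shared
container, then reduce serially after the loop:

std::vector<int> results(1000000);
QMP_SHARE(results);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(results, std::vector<int>);
    results[i] = computeValue(i); // disjoint writes; no synchronization needed
QMP_END_PARALLEL_FOR

// Serial reduction after the parallel loop has joined.
int sum = 0;
for (int i = 0; i < 1000000; ++i)
{
    sum += results[i];
}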
Miscellaneous Tips
The quickmp.h header file is rather long and includes other system headers.
If desired, it can be split into a .h and .cpp file for faster compilation.
The ideal split point has been marked in the header file.
Almost all C rand() implementations are not thread-safe because their
internal state variables are not protected from concurrent access by
multiple threads. If you are concerned about repeatable results, consider
using rand_r() (a reentrant version of rand() available on some platforms)
or some other reentrant random number generator, as in the sketch below.
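A minimal sketch, assuming a POSIX platform that provides rand_r()
(processData here is a placeholder). Seeding from the loop index keeps
results repeatable no matter how the iterations are distributed among
threads:

QMP_PARALLEL_FOR(i, 0, 1000000)
    // Each iteration owns its seed, so no state is shared between threads.
    unsigned int seed = static_cast<unsigned int>(i) + 1;
    int value = rand_r(&seed); // reentrant; safe to call concurrently
    processData(i, value);
QMP_END_PARALLEL_FOR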