active: early 2008 | download: sf.net/projects/quickmp

QuickMP

Simple C++ Loop Parallelization

QuickMP (multi-processing) is a simple C++ loop parallelization API contained in a single header file. It provides automatic scalable performance based on the number of available cores. Parallelizing a loop is easy:

// Normal loop uses 1 thread.
for (int i=0; i<1000000; ++i)
{
  processData(i);
}
// Parallel loop uses 1 thread per core.
QMP_PARALLEL_FOR(i, 0, 1000000)
  processData(i);
QMP_END_PARALLEL_FOR

QuickMP is intended for shared-memory programs, similar to OpenMP.* It is cross-platform (tested on Windows, Mac OS X, and Linux), it chooses the thread count automatically at runtime, and its API includes scheduling options and shared data synchronization.

* OpenMP is a great standard. If your preferred compiler supports it, you should use it instead. If not, QuickMP can provide OpenMP-like functionality for any C++ compiler. (Notably, the popular free Visual C++ Express compiler does not support OpenMP as of version 2008. GCC has added OpenMP support as of version 4.2.0, but many programmers prefer older versions.)

Below are some basic usage instructions, example results, and a variety of tips and considerations.

Also see QuickTest (unit testing) and QuickProf (performance profiling).

Basic Usage

Step 1: Include quickmp.h in your code.

#include <quickmp/quickmp.h>

Step 2: Link with standard platform-specific threading libraries.

Step 3: Parallelize your loop.

#include <quickmp/quickmp.h>
int main(int argc, char* argv[])
{
  // Same as: for (int i=0; i<1000000; ++i) {...}
  QMP_PARALLEL_FOR(i, 0, 1000000)
    processData(i);
  QMP_END_PARALLEL_FOR
  return 0;
}

Example Results

The following plots show performance results on a simple ray tracing program (included in QuickMP source) using various numbers of threads:

Ray tracing benchmark, 4 cores
Ray tracing benchmark, two different 2-core architectures

The second example compares results from two different CPU architectures. First, note the minimal overhead of a 1-thread parallel loop relative to the control (a non-QuickMP loop). Notice the excellent scalability on the dual-core Pentium D going from 1 thread to 2 threads (about a 1.95X speedup); going to 3 and 4 threads adds little benefit because there are only 2 processors. The dual-core Xeon improves by only about 20% going from 1 thread to 2, possibly because both threads run on one hyper-threaded core, which is not quite as good as two separate cores. Going from 2 to 3 threads yields a much larger gain (roughly a 1.8X total speedup) because at least one thread then runs on each core.

The ray tracing program renders a set of randomly-positioned spheres:

Ray tracing benchmark output image

General Info

The Rules

Shared Data

Using shared variables within a loop is a two-step process: declare each variable shared before the loop with QMP_SHARE, then access it inside the loop with QMP_USE_SHARED:

int sharedValue = 8;
std::vector<int> sharedData(100);
QMP_SHARE(sharedValue);
QMP_SHARE(sharedData);
QMP_PARALLEL_FOR(i, 0, 1000000)
  QMP_USE_SHARED(sharedValue, int);
  QMP_USE_SHARED(sharedData, std::vector<int>);
  processData(sharedData, i, sharedValue);
QMP_END_PARALLEL_FOR

Shared variables that are modified must also be protected with a critical section:

int sum = 0;
QMP_SHARE(sum);
QMP_PARALLEL_FOR(i, 0, 1000000)
  QMP_USE_SHARED(sum, int);
  QMP_CRITICAL(0);
  ++sum;
  QMP_END_CRITICAL(0);
QMP_END_PARALLEL_FOR

Performance Considerations

Miscellaneous Tips