QuickMP (multi-processing) is a simple C++ loop parallelization API contained in
a single header file. It provides automatic scalable performance based on the
number of available cores. Parallelizing a loop is easy:
// Normal loop uses 1 thread.
for (int i=0; i<1000000; ++i)
{
    processData(i);
}
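The parallelized version uses the QMP macros described under Basic Usage below:

// Parallelized loop uses one thread per available core.
QMP_PARALLEL_FOR(i, 0, 1000000)
    processData(i);
QMP_END_PARALLEL_FOR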
QuickMP is intended for shared-memory programs, similar to OpenMP.* It is
cross-platform (tested on Windows, Mac OS X, and Linux), it chooses the thread
count automatically at runtime, and its API includes scheduling options and
shared data synchronization.
* OpenMP is a great standard. If your
preferred compiler supports it, you should use it instead. If not, QuickMP can
provide OpenMP-like functionality for any C++ compiler. (Notably, the popular
free Visual C++ Express compiler does not support OpenMP as of version 2008.
GCC has supported OpenMP since version 4.2.0, but many programmers still use
older versions.)
Below are some basic usage instructions, example results, and a variety of tips
and considerations.
Also see QuickTest (unit testing) and
QuickProf (performance profiling).
Basic Usage
Step 1: Include quickmp.h in your code.
#include <quickmp/quickmp.h>
Step 2: Link with standard platform-specific threading libraries.
Unix: link with libpthread.so (i.e. linker flag -lpthread). Also pass
pthread flags to the compiler and linker (e.g. -pthread for gcc).
Windows: most likely nothing to do. You must link with a multithreaded
version of the C runtime libraries (in Visual Studio this is under
C/C++ => Code Generation => Runtime Library in the project properties);
however, VC++ 2005 and later have removed the single-threaded versions, so
this is usually not an issue.
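For example, a typical build command with gcc on Linux (the file names here
are placeholders):

g++ -pthread myprogram.cpp -o myprogram -lpthread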
Step 3: Parallelize your loop.
#include <quickmp/quickmp.h>

void processData(int i); // defined elsewhere

int main(void)
{
    // Same as: for (int i=0; i<1000000; ++i) {...}
    QMP_PARALLEL_FOR(i, 0, 1000000)
        processData(i);
    QMP_END_PARALLEL_FOR
    return 0;
}
Example Results
The following plots show performance results for a simple ray tracing program
(included in the QuickMP source distribution) using various numbers of threads:
The second example compares results from two different CPU architectures. First,
note the minimal overhead required for a 1-thread parallel loop over the control
(a non-QuickMP loop). Notice the excellent scalability on the dual-core Pentium
D going from 1 thread to 2 threads (about 1.95X normal speed); going to
3 and 4 threads adds little benefit because there are only 2 processors. The
dual-core Xeon only improves by about 20% going from 1 thread to 2, possibly
because both threads are running on one hyper-threaded core, which is not quite
as good as two separate cores. Going from 2 to 3 threads adds a much larger
performance increase (roughly 1.8X normal speed) because at least one thread is
running on each core.
The ray tracing program renders a set of randomly-positioned spheres.
General Info
By default a parallel for loop will use one thread per processor (determined
at runtime). Each thread will be assigned a subset of the loop
iterations.
QuickMP is intended for shared memory programming, i.e. multithreaded
programs/libraries running on a single machine with multiple processors all
sharing the same memory.
QuickMP is designed for
data parallelism, not task parallelism. Thus, it is ideal for
applications where the same operations are applied repeatedly on lots of
data.
QuickMP follows a fork-join model. The idea is that the main thread
encounters a distinct parallel section, forks into a bunch of worker threads
which split up the work, and then joins back into a single thread again.
(Internally there is a thread pool which waits in the background until
needed.)
API macros are provided for general thread management (setting number of
threads, getting thread indices, etc.), declaring shared variables, data
synchronization (barriers and critical sections), scheduling hints for load
balancing, and more. See the API reference for more details.
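As a sketch, explicitly setting the thread count for subsequent loops might
look like this (QMP_SET_NUM_THREADS is an assumed name based on the
description above; check the API reference for the exact macro):

// Assumed macro name; see the QuickMP API reference.
QMP_SET_NUM_THREADS(2); // subsequent parallel loops use two threads

QMP_PARALLEL_FOR(i, 0, 1000)
    processData(i); // processData is a placeholder
QMP_END_PARALLEL_FOR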
The Rules
Like OpenMP, QuickMP requires the loop iterations to be order-independent:
one iteration cannot depend on the results of another iteration. In other
words, your code should give the same result even if the iterations run
backwards or in a random order. (A counterexample appears in the sketch
after these rules.)
The loop index always begins at a given starting index and counts up.
The code inside of the loop is special; it has a different scope from the
surrounding code. It cannot directly access outside variables without using
the shared data macros (described below). It can, however, call any function
and create local variables that would be allowed within a normal for loop in
the same scope.
Iteration over STL containers is possible for those with random access
iterators. Elements are accessed using the loop index, either with the []
operator or iterator arithmetic (e.g. container.begin() + i). (See the
sketch after these rules.)
Creating nested parallel for loops results in undefined behavior.
Creating multiple simultaneous parallel for loops (e.g. two different
threads each start their own loops) results in undefined behavior.
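The following sketch illustrates both points above (computeValue is a
placeholder, and the vector is hypothetical data):

// BAD: order-dependent. Each iteration reads the previous iteration's
// result, so this prefix sum cannot be parallelized:
//     for (int i=1; i<1000; ++i) { data[i] += data[i-1]; }

// GOOD: each iteration touches only its own element of an STL
// container with random access iterators.
std::vector<float> data(1000);
QMP_SHARE(data);
QMP_PARALLEL_FOR(i, 0, 1000)
    QMP_USE_SHARED(data, std::vector<float>);
    data[i] = computeValue(i); // independent of all other iterations
QMP_END_PARALLEL_FOR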
Shared Data
There is a 2-step process for using shared variables within a loop:
int sharedValue = 8;
std::vector<int> sharedData(100); // element type shown as int for illustration
QMP_SHARE(sharedValue);
QMP_SHARE(sharedData);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(sharedValue, int);
    QMP_USE_SHARED(sharedData, std::vector<int>);
    processData(sharedData, i, sharedValue);
QMP_END_PARALLEL_FOR
Shared variables that are modified must also be protected with a critical
section:
int sum = 0;
QMP_SHARE(sum);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(sum, int);
    QMP_CRITICAL(0);
    ++sum;
    QMP_END_CRITICAL(0);
QMP_END_PARALLEL_FOR
Performance Considerations
Parallel for loops are most effective at a very high level in your program,
not in the inner loops.
In practice runtime performance will be proportional to the number of
available processors but will probably not increase linearly with more
processors. (See Amdahl's Law.)
QuickMP adds minimal runtime overhead for thread management. In the
single-threaded case the overhead is negligible.
Threads are created and destroyed only when the program starts and exits and
when the number of threads is explicitly changed.
The parallel for loop macro provides an optional "schedule hint" argument to
specify how the iterations should be distributed. "Sequential" gives each
thread equal chunks of the loop range and is preferred if each iteration
takes the same amount of time. "Interleaved" gives the threads alternating
loop iterations and is preferred if the iterations could take different
amounts of time. The following example shows how to use this option:
int imageWidth = 1600;
int imageHeight = 1200;
int numPixels = imageWidth * imageHeight;
std::vector<Color> image(numPixels); // the output buffer must be shared, too
QMP_SHARE(imageWidth);
QMP_SHARE(imageHeight);
QMP_SHARE(scene);
QMP_SHARE(image);

QMP_PARALLEL_FOR(i, 0, numPixels, quickmp::INTERLEAVED)
    QMP_USE_SHARED(imageWidth, int);
    QMP_USE_SHARED(imageHeight, int);
    QMP_USE_SHARED(scene, std::vector<Sphere>); // element type illustrative
    QMP_USE_SHARED(image, std::vector<Color>);
    Color c = fireRay(scene, imageWidth, imageHeight, i);
    image[i] = c;
QMP_END_PARALLEL_FOR
Keep critical sections short to reduce thread idleness. If possible, avoid
critical sections altogether (see the sketch after this list).
If possible, avoid sharing data among threads to improve cache
performance.
Remember that calling global functions within threads can involve shared
memory (e.g. rand() uses shared global state variables).
Parallel reads from the same memory location should cause little slowdown,
but if that memory changes (e.g. when one thread writes to it), all threads
using that memory must update their cached copies.
CPU architecture makes a big difference, especially when data is shared
among threads. Newer multicore chips are designed for better multithreaded
performance (shared memory caches, faster interconnections). (An interesting
example is a program with lots of shared data running on a machine with two
processor dies, each containing two cores. This program will probably scale
very well from 1 to 2 threads because of the fast communication between two
cores on the same die. However, it might not scale as well from 2 to 3 or 4
threads, depending on the communication speed between the two dies.) Note
that a shared cache can slow things down even when two processors sharing
the cache access totally different memory locations: if each is frequently
accessing more memory than half of the cache size, it causes constant cache
misses for the other processor.
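As a sketch of avoiding a critical section entirely (computeValue is a
placeholder), have each iteration write to its own element of a shared
container, then reduce serially after the loop:

std::vector<int> results(1000000);
QMP_SHARE(results);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(results, std::vector<int>);
    results[i] = computeValue(i); // disjoint writes; no synchronization needed
QMP_END_PARALLEL_FOR

// Serial reduction after the parallel loop has joined.
int sum = 0;
for (int i = 0; i < 1000000; ++i)
{
    sum += results[i];
}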
Miscellaneous Tips
The quickmp.h header file is rather long and includes other system headers.
If desired, it can be split into a .h and .cpp file for faster compilation.
The ideal split point has been marked in the header file.
Almost all C rand() implementations are not thread-safe because their
internal state variables are not protected from concurrent access by
multiple threads. If you are concerned about repeatable results, consider
using rand_r() (a reentrant version of rand() available on some platforms)
or some other reentrant random number generator, as in the sketch below.
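A minimal sketch, assuming a POSIX platform that provides rand_r()
(processData here is a placeholder). Seeding from the loop index keeps
results repeatable no matter how the iterations are distributed among
threads:

QMP_PARALLEL_FOR(i, 0, 1000000)
    // Each iteration owns its seed, so no state is shared between threads.
    unsigned int seed = static_cast<unsigned int>(i) + 1;
    int value = rand_r(&seed); // reentrant; safe to call concurrently
    processData(i, value);
QMP_END_PARALLEL_FOR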