MARS - A Software Dedicated to Advanced Concrete Modeling

by **ES3** » Fri Oct 02, 2015 12:10 am

1. Introduction

MARS implements both OpenMP and MPI parallelism. This article aims to clarify some basic concepts and provide guidance on running an OpenMP or MPI simulation.

In MARS, OpenMP is used to speed up for loops (as in C++); MPI is used to divide the model into sub-domains so that multiple MPI processes (instances of MARS) can work on different sub-domains at the same time.

In this article, the MARS executable with OpenMP is called marsO, and the one with both OpenMP and MPI is called marsM.

2. Terminology

In an HPC environment, a node usually refers to a computing unit that is conceptually similar to a desktop workstation or a personal computer. A core is the basic physical processing unit. For example, the Intel i5-6500 CPU has 4 cores. A node usually has multiple cores.

A processor is the physical hardware package that contains the cores. A socket is the slot on the motherboard that holds a processor. A node can have multiple sockets, each holding one processor, and each processor has multiple cores. For example, a node in an HPC cluster might have two sockets, each with an Intel Xeon E5-2667 v3 processor with 8 cores, which gives 16 cores in total on the node.
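On a Linux node, the socket and core counts described above can usually be read with `lscpu`. The following is only a sketch, not part of MARS: the helper name `physical_cores` is made up for illustration, and it assumes the English field labels `Socket(s)` and `Core(s) per socket`, which may differ with locale or `lscpu` version. The sample text mirrors the two-socket node from the example.

```shell
# physical_cores: read lscpu-style text on stdin and print
# sockets x cores-per-socket (hypothetical helper, not a MARS tool).
physical_cores() {
  awk -F: '
    /Socket\(s\):/          { gsub(/[ \t]/, "", $2); sockets = $2 }
    /Core\(s\) per socket:/ { gsub(/[ \t]/, "", $2); cores = $2 }
    END { print sockets * cores }
  '
}

# Sample text for a node with two sockets and 8 cores per socket.
sample='Socket(s):           2
Core(s) per socket:  8'

total=$(printf '%s\n' "$sample" | physical_cores)
echo "physical cores: $total"
```

On a real node the live value would come from `lscpu | physical_cores`.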

A process is a running instance of a program. For example, Task Manager (on Windows) or Activity Monitor (on Mac) shows the processes currently running on the system.

A pure OpenMP execution of a program starts one process, and this process spawns multiple threads. An MPI execution of a program starts multiple running instances of the program, each of which is a process.

3. Using OpenMP

In an OpenMP execution, the number of threads can be set using

Code:
export OMP_NUM_THREADS=N

The threads are spawned by a single process, and the memory is shared by all threads in this process.

To start an OpenMP execution with 4 threads, one can use

Code:
export OMP_NUM_THREADS=4
marsO input.mrs
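Note that `export` keeps the variable set for the rest of the shell session. As an alternative sketch, standard shells also accept the assignment as a prefix to a single command, so that only that command sees it. Since marsO may not be installed where this is tried, a small `sh -c` child process stands in for it here and simply reports the value it inherits.

```shell
# Start from a clean state so the effect is visible.
unset OMP_NUM_THREADS

# Prefix form: the assignment applies to this single command only.
# (sh -c is a stand-in for `marsO input.mrs`.)
one_shot=$(OMP_NUM_THREADS=4 sh -c 'echo "${OMP_NUM_THREADS:-unset}"')

# Later commands started from the same shell do not see the value.
after=$(sh -c 'echo "${OMP_NUM_THREADS:-unset}"')

echo "child saw: $one_shot, later commands see: $after"
```

With marsO available, the equivalent one-liner would be `OMP_NUM_THREADS=4 marsO input.mrs`.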

4. Using MPI

In an MPI execution, the user can start multiple MPI processes, and each process can spawn a number of threads. Combining the two is called a hybrid MPI execution.

In general, one should follow this rule:

Code:
[number of processes requested by MPI] x [number of threads set in OpenMP] = [number of physical cores available]

For example, on a node with 8 physical cores, one can use

Code:
export OMP_NUM_THREADS=4
mpirun -np 2 marsM -B input.mrs

to start a hybrid MPI execution with 2 MPI processes, 4 threads per process, or use

Code:
export OMP_NUM_THREADS=1
mpirun -np 8 marsM -B input.mrs

to start a pure MPI execution with 8 MPI processes, each MPI process running a single thread.
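The rule above can be checked with a few lines of shell arithmetic before launching a run. This is a minimal sketch, and the function name `threads_per_process` is hypothetical: given the core count and the number of MPI processes, it prints the matching OMP_NUM_THREADS value, or fails if the processes do not divide the cores evenly.

```shell
# threads_per_process CORES NPROCS
# Prints the thread count that satisfies processes x threads = cores.
# Returns non-zero when NPROCS does not divide CORES evenly.
threads_per_process() {
  cores=$1
  nprocs=$2
  if [ "$nprocs" -lt 1 ] || [ $((cores % nprocs)) -ne 0 ]; then
    echo "error: $nprocs processes do not divide $cores cores evenly" >&2
    return 1
  fi
  echo $((cores / nprocs))
}

hybrid=$(threads_per_process 8 2)   # the hybrid example: 2 processes
pure=$(threads_per_process 8 8)     # the pure MPI example: 8 processes
echo "hybrid: $hybrid threads per process, pure MPI: $pure thread per process"
```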

Here is another example, this time on an HPC cluster. Suppose we request 4 nodes from the cluster and 16 physical cores on each node (assuming each node has at least 16 cores), so 64 cores are available in total. To use all 64 cores, we can use

Code:
export OMP_NUM_THREADS=4
mpirun -n 16 marsM -B input.mrs

to start a hybrid MPI execution with 16 MPI processes, 4 threads per process, or use

Code:
export OMP_NUM_THREADS=1
mpirun -n 64 marsM -B input.mrs

to start a pure MPI execution with 64 MPI processes, each MPI process running a single thread.
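The cluster arithmetic above can be sketched as a small helper that derives the process count from the node count, cores per node, and threads per process. The function name `hybrid_commands` is hypothetical, and the sketch only prints the commands; how the processes are placed across nodes is left to the MPI launcher and the cluster's scheduler.

```shell
# hybrid_commands NODES CORES_PER_NODE THREADS
# Prints the two commands for a run that uses every requested core.
hybrid_commands() {
  total=$(( $1 * $2 ))
  threads=$3
  if [ "$threads" -lt 1 ] || [ $((total % threads)) -ne 0 ]; then
    echo "error: $threads threads do not divide $total cores evenly" >&2
    return 1
  fi
  echo "export OMP_NUM_THREADS=$threads"
  echo "mpirun -n $((total / threads)) marsM -B input.mrs"
}

# 4 nodes x 16 cores with 4 threads per process, as in the example above.
cmds=$(hybrid_commands 4 16 4)
echo "$cmds"
```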

5. Performance and Scalability

There is no quick answer to how many processes or threads to use in a MARS simulation; it depends entirely on the problem. However, using more processes or threads does not always give better performance. At some point, the overhead introduced by parallelism outweighs the speedup it brings. The best way to find the optimal configuration is to try different process/thread combinations and compare the wall time per step (wtime) printed by MARS.
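One way to organize such a trial is to enumerate every process/thread pair that fills the available cores and run each in turn. The sketch below is a dry run for a single 8-core node: it only prints the candidate commands, using the prefix form of the variable assignment, and leaves reading wtime from the MARS output to the user. The function name `list_combinations` is hypothetical.

```shell
# list_combinations CORES
# Prints every "processes threads" pair whose product equals CORES.
list_combinations() {
  cores=$1
  nprocs=1
  while [ "$nprocs" -le "$cores" ]; do
    if [ $((cores % nprocs)) -eq 0 ]; then
      echo "$nprocs $((cores / nprocs))"
    fi
    nprocs=$((nprocs + 1))
  done
}

# Dry run: print the commands to try instead of executing them.
plan=$(list_combinations 8 | while read -r nprocs threads; do
  echo "OMP_NUM_THREADS=$threads mpirun -np $nprocs marsM -B input.mrs"
done)
echo "$plan"
```

Removing the `echo` inside the loop would launch the runs one after another.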


OpenMP, MPI, thread and process.
