Make sure at the execution of parallel code (with "mpirun" command) that number of processes equals to product of subdomains number along x by subdomains number along y ( "x_domains" and "y_domains" parameters in "param" input file ) :

We take in this case 256 processes, then running command is :

$ mpirun -np256 ./explicitPar
Time step too large in 'param' file - Taking convergence criterion
Time step = 2.374899779E-07
Convergence = 0.100000000 after 40595 steps
Problem size =1048576
Wall Clock = 487.038250208
Computed solution in outputPar.dat file

We use the FTCS (Forward-Time Central-Space) method which is part of Finite Difference Methdod and commonly used to solve numerically heat diffusion equation and more generally parabolic partial differential equations. This scheme is based on central difference in space and the forward Euler method in time. So discretization has 2 types : spatial and temporal. To use light notations, we write the 2D spatial discretized element $(i,j)$ at time $n$ under the form $\theta_{\,i\,,\,j}^{\,n}$.

In 2D case, we get the same relations for $y$ variable. By taking $h_{x}=size_{x}/N_{x}$ , $h_{y}=size_{y}/N_{y}$ and with convention $\theta(x_{i},y_{j})=\theta_{\,i\,,j}$, we have :

This formula is implemented in this way for the C language parallel version (cf explUtilPar.c file) ; You can notice that xs, xe, ys, ye (with "s" for "start" and "e" for "end") variables represent the coordinates of the subdomain corresponding to the current process of rank "me" :

// Perform an explicit update on the points within the domain for (i=xs[me];i<=xe[me];i++) for (j=ys[me];j<=ye[me];j++)
x[i][j] = weightx*(x0[i-1][j] + x0[i+1][j] + x0[i][j]*diagx)
+ weighty*(x0[i][j-1] + x0[i][j+1] + x0[i][j]*diagy)

Convergence criterion links the exact solution with the numerically computed solution. If we have $h_{x}=h_{y}=h$, explicit schema \eqref{eq9} is steady if :

We are going to breakdown domain (also named grid) into subdomains assigning for each one a process. Gain on runtime is achieved since computing iterative schema is done parallely and updates for the 4 sides of a given subdomain is performed with communications of surrounding processes : we called this breakdown a "Cartesian topology of processes" because subdomains are squared or rectangular. Convention taken here, for indices of 2D arrays with both Fortran90 and C languages, is (i,j)=(row,column). Below an example showing the breakdown with 8 processes along x_domains=2 and y_domains=4 :

Figure 1 : Ranks of processes with convention (row,column)$\,\equiv\,$(x_domains,y_domains)

To create this topology, we use the MPI_Cart_create function. Caution, convention imposed by MPI for this function is the opposite of our convention for 2D arrays, i.e (i,j)=(row,column)$\,\equiv\,$(y_domains,x_domains) turns into (row,column)$\,\equiv\,$(x_domains,y_domains) as it is illustrated in figure above. So one has to switch (Ox) and (Oy) axis in the following way :

With Fortran, elements of 2D array are memory aligned along columns : it is called "column major". In C language, elements are memory aligned along rows : it is qualified of "row major".

As we will see below into part 5.3, one has to exchange rows and columns between processes. C language naturally allows to handle data with row type and Fortran90 with column type. To get both functionalities for each language, we define, into function updateBound, "column" type (column_type) and "row" type (row_type) respectively for C language and Fortran90.

C language :

// Create column data type to communicate with East and West neighBors
MPI_Type_vector(xcell, 1, size_total_y, MPI_DOUBLE, &column_type);
MPI_Type_commit(&column_type);

Fortran90 :

! Create row data type to communicate with South and North neighBorscall MPI_Type_vector(ycell, 1, size_tot_x, MPI_DOUBLE_PRECISION, row_type, infompi)
call MPI_Type_commit(row_type, infompi)

From an optimization point of view, we have to make sure to iterate in loops on right indices : the most inner loop must be executed on the first index for Fortran90 and on the second one for C language.

Parallelization implies to use communications between the processes of each subdomain. As we can see in the code snippet below, for a given iteration process with rank "me", we compute the next values to the corresponding subdomain with "computeNext" function, then we send with "updateBound" rows and columns respectively to processes located in North-South and East-West of the current process "me".

// Main loop : until convergence while (!convergence) { // Increment step and time
step = step+1;
t = t+dt; // Perform one step of the explicit scheme
computeNext(x0, x, dt, hx, hy, &localDiff, me, xs, ys, xe, ye, k0); // Update the partial solution along the interface
updateBound(x0, neighBor, comm2d, column_type, me, xs, ys, xe, ye, ycell); // Sum reduction to get global difference
MPI_Allreduce(&localDiff, &result, 1, MPI_DOUBLE, MPI_SUM, comm); // Current global difference with convergence
result= sqrt(result); // Break if convergence reached or step greater than maxStep if ((result<epsilon) || (step>maxStep)) break;
}

The following figure shows an example of communication for a breaking down of global domain into 4 subdomains, which means with 4 processes :

Note that ghost cells are initialized at the beginning of code to the constant value of the edge of grid.

Below the content of "updateBound" function which defines the communications between processes :

C language :

// North/South communication// Send boundary to North and receive from South
MPI_Sendrecv(&x[xe[me]][ys[me]], ycell, MPI_DOUBLE, neighBor[N], flag,
&x[xs[me]-1][ys[me]], ycell, MPI_DOUBLE, neighBor[S], flag,
comm2d, &status);
// Send boundary to South and receive from North
MPI_Sendrecv(&x[xs[me]][ys[me]], ycell, MPI_DOUBLE, neighBor[S], flag,
&x[xe[me]+1][ys[me]], ycell, MPI_DOUBLE, neighBor[N], flag,
comm2d, &status);
// East/West communication// Send boundary to East and receive from West
MPI_Sendrecv(&x[xs[me]][ye[me]], 1, column_type, neighBor[E], flag,
&x[xs[me]][ys[me]-1], 1, column_type, neighBor[W], flag,
comm2d, &status);
// Send boundary to West and receive from East
MPI_Sendrecv(&x[xs[me]][ys[me]], 1, column_type, neighBor[W], flag,
&x[xs[me]][ye[me]+1], 1, column_type, neighBor[E], flag,
comm2d, &status);

Fortran90 :

! North/South communication! Send boundary to North and receive from Southcall MPI_Sendrecv(x(xe(me), ys(me)), 1, row_type, neighBor(N), flag, &
x(xs(me)-1, ys(me)), 1, row_type, neighBor(S), flag, &
comm2d, status, infompi)
! Send boundary to South and receive from Northcall MPI_Sendrecv(x(xs(me), ys(me)), 1, row_type, neighBor(S), flag, &
x(xe(me)+1, ys(me)), 1, row_type, neighBor(N), flag, &
comm2d, status, infompi)
! East/West communication ! Send boundary to East and receive from Westcall MPI_Sendrecv(x(xs(me), ye(me)), xcell, MPI_DOUBLE_PRECISION, neighBor(E), flag, &
x(xs(me), ys(me)-1), xcell, MPI_DOUBLE_PRECISION, neighBor(W), flag, &
comm2d, status, infompi)
! Send boundary to West and receive from Eastcall MPI_Sendrecv(x(xs(me), ys(me)), xcell, MPI_DOUBLE_PRECISION, neighBor(W), flag, &
x(xs(me), ye(me)+1), xcell, MPI_DOUBLE_PRECISION, neighBor(E), flag, &
comm2d, status, infompi)

To get an animated gif showing the temporal evolution of diffusion, we need output data files at different time intervals. We use this script generate_outputs_heat2d which will generate it. Then, we use the Matlab script generate_gif_heat2d.m to generate a gif file. Here is an animation made with a 256x96 mesh :

To evaluate the performance of the code, we do a benchmark by varying the number of processes for three different grid sizes (512^2, 1024^2, 2048^2). The script run_benchmark_heat2d allows to get execution time for each of these two parameters.

The following graph, produced with the Matlab script plot_benchmark_heat2d.m , synthesizes this benchmark :

Performance for one single process represents the execution of the sequential code. Except for N = 2048, the runtimes for a number of processes from 2 to 128 are lower than the sequential version. We can see that the best performance is done with 8 processes, regardless of the size of the mesh. This could be explained by the fact that this benchmark has been achieved with 8-core i7 processor. Indeed, Core i7 implements "Hyper-Threading" : this multitasking technique allows a physical optimization for 8 processes and a logical optimization for a higher number of processes. For a number of processes equal to 256, we find roughly the same time as with sequential code; this reflects the overhead caused by the high number of communications on the effective computation time in each subdomain of the grid.

ps : join like me the Cosmology@Home project whose aim is to refine the model that best describes our Universe