MPI Numerical solving of the 3D Heat equation

DOURNAC.ORG
	Français English

Coding > MPI Parallelization for numerically solving the 3D Heat equation

1.Execution
2.Unsteady Heat equation 3D
3.Discretization

3.1 Spatial discretization
3.2 Temporal discretization

4.Convergence criterion
5.MPI Implementation

5.1 Cartesian topology of processes
5.2 Remarks on contiguity
5.3 Communications between processes

6.Results
7.Benchmark

1.Execution :

Sequential and parallelized codes are in the following archives :

C language version : Heat3D_MPI_Solving_C.tar.gz
Fortran 90 version : Heat3D_MPI_Solving_F90.tar.gz

Caution : for previous versions before 05/23/18, bug fixed on convergence criterion (Space step $h$ mus be square powered and not third)

Make sure at the execution of parallel code (with "mpirun" command) that number of processes equals to product of subdomains number along x by subdomains number along y and along z ( "x_domains", "y_domains" and "z_domains" parameters in "param" input file ) :

$ cat param
  size_x
16
  size_y
8
  size_z
32
  x_domains
8
  y_domains
4
  z_domains
2
  MaxStep
1000000
  dt
1.0e-1
  cnv_tol
1.0e-1

We take in this case 64 processes, then running command is :

$ mpirun -np 64 ./explicitPar

Time step too large in 'param' file - Taking convergence criterion

   Time step =   3.18033787909627519E-006

   Convergence = 0.100000000 after      3333 steps

   Problem size =         4096

   Wall Clock =        2.126691

   Computed solution in "outputPar.dat" file

2.Unsteady Heat equation 3D :

The general form of Heat equation is :

\begin{equation} \dfrac{\partial T}{\partial t}=\kappa\,\Delta T\,\,\,\,\,\,\text{with}\,\,\,\,\,\Delta=\sum_{i=1}^{n}\dfrac{\partial^{2}}{\partial x_{i}^{2}}\,\,\,\,\text{the Laplacian in $n$ dimension} \label{eq1} \end{equation}

$\kappa$ coefficient is the thermal conductivity. So, 3D Heat equation can be written :

\begin{equation} \dfrac{\partial\theta}{\partial t}=\kappa\,\bigg(\dfrac{\partial^{2}\theta}{\partial x^{2}}+\dfrac{\partial^{2}\theta}{\partial y^{2}}+\dfrac{\partial^{2}\theta}{\partial z^{2}}\bigg) \label{eq2} \end{equation}

3.Discretization :

Discretization for solving unsteady 3D Heat equation has 2 types : spatial and temporal.

3.1 Spatial discretization :

We start by discretize the right member, i.e in spatial domain. For this, we use the two following Taylor series :

\begin{equation} \theta_{m+1}=\theta(x_{m}+h)=\theta(x_{m})+h\,\theta'(x_{m})+\dfrac{h^{2}}{2}\,\theta''(x_{m})+\dfrac{h^{3}}{6}\,\theta^{'''}(x_{m})+O(h^{4}) \label{eq3} \end{equation} \begin{equation} \theta_{m-1}=\theta(x_{m}-h)=\theta(x_{m})-h\,\theta'(x_{m})+\dfrac{h^{2}}{2}\,\theta''(x_{m})-\dfrac{h^{3}}{6}\,\theta^{'''}(x_{m})+O(h^{4}) \label{eq4} \end{equation}

By adding the relations \eqref{eq3} and \eqref{eq4}, we get an expression of second derivate at $x_{m}$ point :

\begin{equation} \theta''(x_{m})=\dfrac{\theta_{m+1}-2\theta_{m}+\theta_{m-1}}{h^{2}}+O(h^{2}) \label{eq5} \end{equation}

We get the same relations for $y$ variable. By taking $h_{x}=size_{x}/N_{x}$, $h_{y}=size_{y}/N_{y}$, $h_{z}=size_{z}/N_{z}$ and with convention $\theta(x_{m},y_{m},z_{m})=\theta[i][j][k]$, we have :

\begin{align} \dfrac{\partial^{2}\theta}{\partial x^{2}}+\dfrac{\partial^{2}\theta}{\partial y^{2}}+\dfrac{\partial^{2}\theta}{\partial z^{2}} &= \dfrac{\theta[i+1][j][k]-2\theta[i][j][k]+\theta[i-1][j][k]}{h_{x}^{2}}\label{eq6}\\ & +\dfrac{\theta[i][j+1][k]-2\theta[i][j][k]+\theta[i][j-1][k]}{h_{y}^{2}}\notag \\ &+\dfrac{\theta[i][j][k+1]-2\theta[i][j][k]+\theta[i][j][k-1]}{h_{z}^{2}} \notag \end{align}

3.2 Temporal discretization :

Left member of equation \eqref{eq2} can be expressed :

\begin{equation} \dfrac{\partial \theta(x_{m},y_{m},z_{m},t_{n})}{\partial t}=\dfrac{\theta(x_{m},y_{m},z_{m},t_{n+1})-\theta(x_{m},y_{m},z_{m},t_{n})}{\Delta t} \label{eq7} \end{equation} \begin{equation} \Rightarrow \,\,\,\,\dfrac{\partial \theta_{n}(x_{m},y_{m},z_{m})}{\partial t}=\dfrac{\theta_{n+1}[i][j][k]-\theta_{n}[i][j][k]}{\Delta t} \label{eq8} \end{equation}

Thus, by combining equations \eqref{eq2}, \eqref{eq6} and \eqref{eq8}, recurrence formula equals to :

\begin{align} \theta_{n+1}[i][j][k] &= \theta_{n}[i][j][k]+\kappa \Delta t\bigg[\dfrac{\theta_{n}[i+1][j][k]-2\theta_{n}[i][j][k]+\theta_{n}[i-1][j][k]}{h_{x}^{2}}\label{eq9}\\ &+\dfrac{\theta_{n}[i][j+1][k]-2\theta_{n}[i][j][k]+\theta_{n}[i][j-1][k]}{h_{y}^{2}}\notag \\ &+\dfrac{\theta_{n}[i][j][k+1]-2\theta_{n}[i][j][k]+\theta_{n}[i][j][k-1]}{h_{z}^{2}} \notag \bigg] \end{align}

This formula is implemented in this way for the C language parallel version (cf explUtilPar.c file) ; You can notice that xs, xe, ys, ye, zs, ze (with "s" for "start" and "e" for "end") variables represent the coordinates of the subdomain corresponding to the current process of rank "me" :

   // Factors for explicit scheme
   diagx = -2.0 + hx*hx/(2*k0*dt);
   diagy = -2.0 + hy*hy/(2*k0*dt);
   diagz = -2.0 + hz*hz/(2*k0*dt);
   weightx = k0 * dt/(hx*hx);
   weighty = k0 * dt/(hy*hy);
   weightz = k0 * dt/(hz*hz);

   // Perform an explicit update on the points within the domain
   for (i=xs[me];i<=xe[me];i++)
      for (j=ys[me];j<=ye[me];j++)
         for (k=zs[me];k<=ze[me];k++)
            x[i][j][k] = weightx*(x0[i-1][j][k] + x0[i+1][j][k] + x0[i][j][k]*diagx)
                       + weighty*(x0[i][j-1][k] + x0[i][j+1][k] + x0[i][j][k]*diagy)
                       + weightz*(x0[i][j][k-1] + x0[i][j][k+1] + x0[i][j][k]*diagz);

4.Convergence criterion :

Convergence criterion links the exact solution with the numerically computed solution. If we have $h_{x}=h_{y}=h_{z}=h$, explicit schema \eqref{eq9} is steady if :

\begin{equation} \Delta t\leqslant \dfrac{h^{2}}{8\kappa} \label{eq10} \end{equation}

This stability condition gives an estimation for choosing the time step $\Delta t$ :

\begin{equation} \Delta t=\dfrac{1}{8\kappa}\,\text{min}(h_{x},h_{y},h_{z})^{2} \label{11} \end{equation}

5.MPI Implementation :

We discuss in this part about the way that numerical solving is structured and implemented using "Message Passing Interface" (MPI) API.

5.1 Cartesian topology of processes :

We are going to breakdown domain (also named grid) into subdomains assigning for each one a process. Gain on runtime is achieved since computing iterative schema is done parallely and updates for the 4 sides of a given subdomain is performed with communications of surrounding processes : we called this breakdown a "Cartesian topology of processes" because subdomains are squared or rectangular. Convention taken here, for indices of 2D arrays with both Fortran90 and C languages, is (i,j)=(row,column). Below an example showing the breakdown with 16 processes along x_domains=2, y_domains=4 and z_domains=2 :

Figure 1 : Ranks of processes with convention (row,column)$\,\equiv\,$(x_domains,y_domains) and z_domains dimension

5.2 Remarks on contiguity :

With Fortran, elements of 2D array are memory aligned along columns : it is called "column major". In C language, elements are memory aligned along rows : it is qualified of "row major".

5.3 Communications between processes :

Parallelization is done using communications between the processes of each subdomain. As we can see in the code snippet below, for a given iteration process with rank "me", we compute the next values to the corresponding subdomain with "computeNext" function, then we send with "updateBound" each face of cubes (see figure below), i.e matrixes, to processes located in North-South, East-West and Front-Back of the current process "me".

   // Until convergence
   while(!convergence)
   {
     step = step + 1;
     t = t + dt ;

     // Perform one step of the explicit scheme
     computeNext(x0, x, dt, &resLoc, hx, hy, hz, me, xs, ys, zs, xe, ye, ze, k0);

     // Update the partial solution along the interface
     updateBound(x0, neighBor, comm3d, column_type, me, xs, ys, zs, xe, ye, ze,
                 xcell, ycell, zcell);

     // Sum reduction to get error
     MPI_Allreduce(&resLoc, &result, 1, MPI_DOUBLE, MPI_SUM, comm);

     // Current error
     result= sqrt(result);

     // Break if convergence reached or step greater than maxStep
     if ((result<epsilon) || (step>maxStep)) break;
   }

The following figure shows an example of communication for a breaking down of global domain into 8 subdomains, which means with 8 processes. Actually, only one plane of 4 processes is visible, one has to imagine a second parallel plane to this one in order to represent all 8 subdomains :

**Figure 2 : Example of communications between 8 processes**

Like in 2D case, note that ghost cells are initialized at the beginning of code to the constant value of the edge of the grid.

6.Results :

To get an animated gif showing the temporal evolution of diffusion, we need output data files at different time intervals. We use this script generate_outputs_heat3d which will generate it. Then, we use the Matlab script generate_gif_heat3d.m to generate a gif file. Here is an animation of a slice plot made with a 32x64x16 mesh :

**Figure 3 : Animated Gif showing temperature evolution on domain**

7.Benchmark :

To evaluate the performance of the code, we do a benchmark by varying the number of processes for three different grid sizes (64^3, 128^3 and 256^3). The script run_performances_heat3d allows to get execution time for each of these two parameters.

The following graph, produced with the Matlab script plot_performances_heat3d.m , synthesizes this benchmark :

**Figure 4 : SpeedUp as a function of processes number and size of domain**

Performance for one single process represents the execution of the sequential code. For a number of processes greater than 1 (parallelized versions), the runtimes are lower than the sequential version. We can see that the best performance is done with 8 processes, regardless of the size of the mesh. This could be explained by the fact that this benchmark has been achieved with 8-core i7 processor. Indeed, Core i7 implements "Hyper-Threading" : this multitasking technique allows a physical optimization for 8 processes and a logical optimization for a higher number of processes.

ps : join like me the Cosmology@Home project whose aim is to refine the model that best describes our Universe