DASHMM

The Dynamic Adaptive System for Hierarchical Multipole Methods (DASHMM) is an easy-to-use, extensible library implementing multipole methods. DASHMM uses a dynamic adaptive runtime system (HPX-5) to improve the efficiency and scalability of multipole method calculations. DASHMM is extensible through user-defined methods and expansions, letting a domain scientist implement the details of their particular problem while the library handles the efficient parallelization of the computation.

The current version of DASHMM (1.2.0) operates on both shared and distributed memory architectures. It includes implementations of three multipole methods (Barnes-Hut, the classical Fast Multipole Method (FMM), and a variant of FMM that uses exponential expansions, referred to as FMM97) and permits easy extension to other multipole or multipole-like methods. It provides built-in Laplace, Yukawa, and Helmholtz kernels, covering a wide range of application use cases. DASHMM also comes with three demonstration codes that exercise the basic interface, demonstrate how DASHMM can be used in a time-stepping scheme, and show how a user might register their own expansion with the library.
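
As a point of reference for what these methods accelerate, the sketch below shows the direct O(N²) summation of the Laplace potential, which FMM and FMM97 approximate in O(N) time and Barnes-Hut in O(N log N) time. It is an illustrative stand-alone routine, not the DASHMM API; the Point type and function name are hypothetical.

    // Illustrative only: the direct O(N^2) summation that a multipole method
    // approximates for the Laplace kernel 1/|x - y|. Not the DASHMM API.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point { double x, y, z, q; };  // position and charge (or mass)

    // Direct evaluation of the Laplace potential at every target point.
    std::vector<double> direct_laplace(const std::vector<Point>& sources,
                                       const std::vector<Point>& targets) {
      std::vector<double> phi(targets.size(), 0.0);
      for (std::size_t i = 0; i < targets.size(); ++i) {
        for (const Point& s : sources) {
          double dx = targets[i].x - s.x;
          double dy = targets[i].y - s.y;
          double dz = targets[i].z - s.z;
          double r = std::sqrt(dx * dx + dy * dy + dz * dz);
          if (r > 0.0) phi[i] += s.q / r;  // Yukawa would use s.q * exp(-k*r) / r
        }
      }
      return phi;
    }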



Shown below is a strong scaling test on the Cori supercomputer. The library was tested for all three kernels using the FMM97 method (Laplace in blue, Yukawa in yellow, and Helmholtz in red) for two different source and target distributions (inside the volume of a cube with square marks, and on the surface of a sphere with circular marks). The left panel shows the absolute time for the evaluation and the right panel shows the speedup relative to one node.



For more details, please see https://www.crest.iu.edu/projects/dashmm/.

LULESH

Proxy applications representing kernels of scientific computation toolkits have been created and implemented in multiple programming models by the Department of Energy's co-design centers, in part to directly compare performance characteristics and to explore and quantify the benefits of each model. One of these proxy applications is LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) [1], which has already played a key role in comparing and contrasting programmability and performance among key programming models. Karlin et al. [2] presented a comparison of LULESH implementations in OpenMP, MPI, MPI+OpenMP, CUDA, Chapel, Charm++, Liszt, and Loci, with a focus on programmer productivity, performance, and ease of optimization. LULESH has also been implemented in HPX-5 to help further comparisons among these programming models.

The LULESH algorithm simulates the Sedov blast wave problem in three dimensions. The problem solution is spherically symmetric, and the code solves the problem in a parallelepiped region, shown in Figure 1. In the figure, symmetric boundary conditions are imposed on the colored faces such that the normal components of the velocities are always zero; free boundary conditions are imposed on the remaining boundaries.



The LULESH algorithm is implemented as a hexahedral mesh-based code with two centerings. The element centering stores thermodynamic variables such as energy and pressure, while the nodal centering stores kinematic values such as positions and velocities. The simulation is run via time integration using a Lagrange leapfrog algorithm. There are three main computational phases within each time step: advance node quantities, advance element quantities, and calculate time constraints. There are three communication patterns, all regular, static, and uniform: face-adjacent, 26-neighbor, and 13-neighbor communications, illustrated below:



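The three computational phases described above can be summarized by the schematic time-step loop below. The Domain type and phase functions are hypothetical stand-ins for the corresponding LULESH routines, not the actual code.

    // Schematic of one Lagrange leapfrog time step with its three phases.
    // Domain and the phase functions are placeholders, not the LULESH code.
    struct Domain {
      double time = 0.0;
      double dt   = 1.0e-7;  // illustrative initial time step
      // Nodal fields (positions, velocities) and element fields
      // (energy, pressure) would live here.
    };

    void advance_node_quantities(Domain&)    { /* update positions, velocities */ }
    void advance_element_quantities(Domain&) { /* update energy, pressure, ... */ }
    double calculate_time_constraints(const Domain& d) {
      return d.dt;  // the real code applies Courant and hydro constraints
    }

    void lagrange_leapfrog(Domain& d, double t_end) {
      while (d.time < t_end) {
        advance_node_quantities(d);            // phase 1: nodal kinematics
        advance_element_quantities(d);         // phase 2: element thermodynamics
        d.dt = calculate_time_constraints(d);  // phase 3: next stable time step
        d.time += d.dt;
      }
    }
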
As a regular, uniform, and static mini-application, LULESH is well suited for a programming model like MPI, showing very few inefficiencies in computational phase diagrams such as the following, obtained using VampirTrace:



In this figure, white phases indicate waiting for communication while red phases indicate computation. LULESH computational phases are dominated by computation.

The HPX-5 version of LULESH comes in two flavors. The first, known as the parcels implementation, does not take advantage of parallel-for-loop opportunities and is directly comparable to the MPI-only implementation of LULESH. The other flavor, known as the ompparcels version, does take advantage of parallel for loops and is more comparable to the MPI+OpenMP implementation of LULESH. Both use command line arguments to control the behavior of the mini-application. The -n option sets the number of domains (local grids) to be simulated; this number must be a perfect cube. The -x option sets the number of points (cubed) in each domain. The ompparcels version is intended to be run with only one domain per locality, which requires that the local grid size be scaled appropriately by the number of cores per locality. More specific instructions can be found by invoking help on the application executable.

[1] Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). Available from: https://codesign.llnl.gov/lulesh.php

[2] Karlin I, Bhatele A, Keasler J, Chamberlain BL, Cohen J, DeVito Z, et al. Exploring Traditional and Emerging Parallel Programming Models using a Proxy Application. In: Proc. of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2013.

libPXGL

Following the design philosophy of the Standard Template Library (STL) and the evolutionary trail of the Parallel Boost Graph Library (PBGL), libPXGL is the beginning of an endeavor to develop a next-generation graph library based on HPX-5.


Our first effort is invested in implementing graph algorithms for solving the Single Source Shortest Path (SSSP) problem, as SSSP is a good representative of a class of irregular graph problems. Given a graph and a source, the SSSP problem asks for the shortest possible distances from the source to all other vertices in the graph. For example, the graph in figure (a) has five vertices s, t, x, y, and z. The distances between adjacent vertices are shown on the directed edges. Let us choose vertex s as the source. Figure (b) shows the shortest paths from s to all other vertices.

Figure: (a) an example graph; (b) the shortest-path distances from source s.

Four graph algorithms are implemented in HPX-5: chaotic relaxation, Delta-stepping, distributed control, and k-level asynchronous. In what follows, we give a brief description of how each algorithm works.

In the chaotic algorithm, the distances from the source are updated asynchronously without imposing any order. No barrier is required for the algorithm to progress, and the distances are guaranteed to reach their minimum values by the end of the algorithm.
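
A sequential sketch of the chaotic scheme is shown below: vertices are drawn from an unordered worklist and their outgoing edges are relaxed until no distance can be lowered further. The adjacency-list graph type and non-negative integer weights are illustrative assumptions, not the libPXGL implementation.

    // Sequential sketch of chaotic relaxation: pull vertices from an unordered
    // worklist and relax their outgoing edges until no distance improves.
    // Assumes non-negative edge weights.
    #include <deque>
    #include <limits>
    #include <vector>

    struct Edge { int to; long w; };
    using Graph = std::vector<std::vector<Edge>>;  // adjacency list

    std::vector<long> sssp_chaotic(const Graph& g, int source) {
      const long INF = std::numeric_limits<long>::max();
      std::vector<long> dist(g.size(), INF);
      std::deque<int> work;                  // no ordering is imposed
      dist[source] = 0;
      work.push_back(source);
      while (!work.empty()) {
        int u = work.front();                // any element would do
        work.pop_front();
        for (const Edge& e : g[u]) {
          if (dist[u] + e.w < dist[e.to]) {  // relax edge (u, e.to)
            dist[e.to] = dist[u] + e.w;
            work.push_back(e.to);            // re-examine the updated vertex
          }
        }
      }
      return dist;
    }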

In the Delta-stepping algorithm, the distance space is divided into intervals of width delta, and each interval is treated as a bucket. The algorithm proceeds by processing all vertices in the current bucket without imposing any order among them. Once the current bucket is finished, the algorithm progresses to the next bucket. Between buckets there is an explicit barrier on which all localities must wait before proceeding. The following figure is a high-level schematic of the Delta-stepping algorithm.

 Delta-stepping algorithm
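
A compact sequential sketch of the bucketed processing follows, assuming non-negative integer edge weights and a bucket width delta; the light/heavy edge split of the full algorithm is omitted for brevity, and a comment marks where the distributed barrier would sit. This is illustrative, not the libPXGL code.

    // Sequential sketch of Delta-stepping: distances are grouped into buckets
    // of width delta; vertices within a bucket are relaxed in no particular
    // order, and a barrier (conceptually) separates consecutive buckets.
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Edge { int to; long w; };
    using Graph = std::vector<std::vector<Edge>>;

    std::vector<long> sssp_delta_stepping(const Graph& g, int source, long delta) {
      const long INF = std::numeric_limits<long>::max();
      std::vector<long> dist(g.size(), INF);
      std::vector<std::vector<int>> buckets(1);
      dist[source] = 0;
      buckets[0].push_back(source);
      for (std::size_t b = 0; b < buckets.size(); ++b) {   // bucket by bucket
        // Re-process bucket b until it stops growing: short edges can
        // reinsert vertices into the current bucket.
        while (!buckets[b].empty()) {
          std::vector<int> current;
          current.swap(buckets[b]);
          for (int u : current) {
            if (dist[u] / delta != static_cast<long>(b)) continue;  // stale entry
            for (const Edge& e : g[u]) {
              long nd = dist[u] + e.w;
              if (nd < dist[e.to]) {
                dist[e.to] = nd;
                std::size_t nb = static_cast<std::size_t>(nd / delta);
                if (nb >= buckets.size()) buckets.resize(nb + 1);
                buckets[nb].push_back(e.to);
              }
            }
          }
        }
        // <-- in the distributed algorithm, the explicit barrier sits here
      }
      return dist;
    }
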
Distributed control is an approach that achieves optimistic parallelism by removing any requirement for synchronization. At the same time, it strives for local ordering, based on thread-local priority queues on each locality, to minimize redundant work. The following figure is a high-level schematic of the distributed control algorithm.

 Distributed control algorithm
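
The sketch below illustrates the idea sequentially: each "locality" owns a min-priority queue, repeatedly relaxes its locally best vertex, and forwards updates to the queue of the locality that owns the updated vertex; there is no global ordering and no barrier. The ownership rule (v % num_localities) and the round-robin driver loop are illustrative simplifications, not the libPXGL implementation.

    // Sequential illustration of distributed control: per-locality priority
    // queues give local (but not global) ordering; stale entries are skipped
    // to reduce redundant work. Assumes non-negative edge weights.
    #include <functional>
    #include <limits>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Edge { int to; long w; };
    using Graph = std::vector<std::vector<Edge>>;
    using PQItem = std::pair<long, int>;  // (tentative distance, vertex)

    std::vector<long> sssp_distributed_control(const Graph& g, int source,
                                               int num_localities) {
      const long INF = std::numeric_limits<long>::max();
      std::vector<long> dist(g.size(), INF);
      // One local min-priority queue per locality; vertex v is owned by
      // locality v % num_localities.
      std::vector<std::priority_queue<PQItem, std::vector<PQItem>,
                                      std::greater<PQItem>>> pq(num_localities);
      dist[source] = 0;
      pq[source % num_localities].push({0, source});
      bool work_left = true;
      while (work_left) {
        work_left = false;
        for (auto& q : pq) {              // each locality acts independently
          if (q.empty()) continue;
          auto [d, u] = q.top();
          q.pop();
          work_left = true;
          if (d != dist[u]) continue;     // stale entry: skip redundant work
          for (const Edge& e : g[u]) {
            if (dist[u] + e.w < dist[e.to]) {
              dist[e.to] = dist[u] + e.w;
              pq[e.to % num_localities].push({dist[e.to], e.to});
            }
          }
        }
      }
      return dist;
    }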

The fourth algorithm is based on the k-level asynchronous (KLA) paradigm. In this approach, the algorithm progresses k levels at a time, and within each group of k levels it can progress asynchronously. For example, if k = 2, then in the first step the algorithm processes all vertices within reach of the source by at most 2 steps without imposing any order; the next step processes all vertices reachable within at most 4 steps, and so on.
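
A sequential sketch of the KLA superstep structure for SSSP appears below: within a superstep, relaxations propagate at most k hops from the current frontier in any order, and any work beyond that horizon waits for the barrier between supersteps. The worklist of (vertex, hops) pairs is an illustrative simplification, not the libPXGL code.

    // Sequential sketch of k-level asynchronous (KLA) SSSP: asynchronous
    // within a k-hop horizon, synchronous between supersteps.
    // Assumes non-negative edge weights and k >= 1.
    #include <limits>
    #include <utility>
    #include <vector>

    struct Edge { int to; long w; };
    using Graph = std::vector<std::vector<Edge>>;

    std::vector<long> sssp_kla(const Graph& g, int source, int k) {
      const long INF = std::numeric_limits<long>::max();
      std::vector<long> dist(g.size(), INF);
      std::vector<std::pair<int, int>> frontier;  // (vertex, hops used so far)
      dist[source] = 0;
      frontier.push_back({source, 0});
      while (!frontier.empty()) {                 // one iteration = one superstep
        std::vector<std::pair<int, int>> next;    // work beyond the k-hop horizon
        while (!frontier.empty()) {               // asynchronous within the horizon
          auto [u, hops] = frontier.back();
          frontier.pop_back();
          for (const Edge& e : g[u]) {
            if (dist[u] + e.w < dist[e.to]) {
              dist[e.to] = dist[u] + e.w;
              if (hops + 1 < k) frontier.push_back({e.to, hops + 1});
              else              next.push_back({e.to, 0});  // wait for the barrier
            }
          }
        }
        frontier.swap(next);                      // barrier between supersteps
      }
      return dist;
    }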

The SSSP algorithms can be run with the input graphs and problem specification files of the 9th DIMACS Implementation Challenge. The DIMACS distribution contains the generators and the makefiles necessary for generating all the DIMACS input files. The algorithms can also be run with Graph 500 input graphs; a graph generator based on the Graph 500 specification is included in HPX-5 to generate graphs at different scales, as well as the corresponding problem instances.

Wavelet Adaptive Multiresolution Representation (Wavelet AMR)

While the wavelet transformation has historically figured prominently in compression algorithms, it has become a crucial tool in solving partial differential equations. The hierarchical adaptive wavelet approach is well suited for solving partial differential equations in which localized structures develop intermittently in the computational domain. The advantage of using wavelets here is that far fewer collocation points are required than in other comparable algorithms, while the wavelet amplitudes provide a direct measure of the local approximation error at each point, giving a built-in refinement criterion. The HPX-5 implementation of Wavelet-AMR spawns threads for each collocation point while managing the sparse grid of collocation points inside the global address space. There are three systems of equations implemented in Wavelet-AMR: the Einstein equations, the relativistic magnetohydrodynamics equations, and the Euler equations. Its communication patterns are irregular, dynamic, and nonuniform. This application comes in three flavors: serial, Cilk, and HPX-5.
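
To illustrate the built-in refinement criterion, the 1-D sketch below computes an interpolating-wavelet detail coefficient at each fine-grid point as its deviation from an interpolation of its coarse-grid neighbors, and flags points whose detail exceeds a tolerance. The linear interpolation stencil and the tolerance are illustrative assumptions, not the Wavelet-AMR implementation.

    // 1-D illustration of the wavelet refinement criterion: the detail
    // (wavelet amplitude) at a fine-grid point is its deviation from an
    // interpolation of its coarse-grid neighbors; points whose detail exceeds
    // eps are kept as collocation points.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    std::vector<std::size_t> flag_for_refinement(const std::vector<double>& f,
                                                 double eps) {
      std::vector<std::size_t> flagged;
      for (std::size_t i = 1; i + 1 < f.size(); i += 2) {  // odd (fine-grid) points
        double predicted = 0.5 * (f[i - 1] + f[i + 1]);    // coarse-grid interpolation
        double detail = std::fabs(f[i] - predicted);       // local approximation error
        if (detail > eps) flagged.push_back(i);            // refine here
      }
      return flagged;
    }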

Here is a movie solving a point blast wave using the Euler equations

Here is a movie solving a point blast wave in cylindrical coordinates using the relativistic magnetohydrodynamics (RMHD) equations

To appreciate the sparse grid of collocation points used for the RMHD point blast wave evolution, the collocation points at one timestep are shown below:

High Performance Conjugate Gradients (HPCG)

High Performance Conjugate Gradients (HPCG) is a benchmark project aimed at creating a more relevant metric for ranking HPC systems. The HPX-5 HPCG is a comprehensive port of the MPI+OpenMP HPCG reference implementation; it closely follows that implementation but uses active messages and HPX-5 semantics. Alterations to the sparse matrix-vector multiply include the addition of RDMA-based put and get approaches. Also added is a new demonstration of an HPX+MPI legacy support modality in which HPX and MPI coexist: some kernels use HPX while others use MPI, demonstrating a path forward for porting large legacy applications.

Its communication patterns are regular, static, and nonuniform. The benchmark includes the following kernels (a sketch of the preconditioned CG iteration built from them follows the list):

  • Sparse matrix-vector multiplication
  • Sparse triangular solve
  • Global dot products
  • Multigrid preconditioned conjugate gradient
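
Below is a compact, self-contained sketch of a preconditioned conjugate gradient iteration assembled from these kernels. The CSR matrix type and the Jacobi (diagonal) preconditioner are illustrative stand-ins; HPCG itself applies a symmetric Gauss-Seidel smoother inside a multigrid preconditioner.

    // Sketch of preconditioned CG built from SpMV, dot products, and a
    // preconditioner apply. Illustrative only; not the HPCG reference code.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct CSRMatrix {  // compressed sparse row storage
      std::size_t n = 0;
      std::vector<std::size_t> row_ptr, col;
      std::vector<double> val;
    };

    // Sparse matrix-vector multiplication: y = A * x.
    std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
      std::vector<double> y(A.n, 0.0);
      for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
          y[i] += A.val[k] * x[A.col[k]];
      return y;
    }

    // Global dot product.
    double dot(const std::vector<double>& a, const std::vector<double>& b) {
      double s = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
      return s;
    }

    // Jacobi (diagonal) preconditioner: z = D^{-1} r. A stand-in for HPCG's
    // multigrid preconditioner with symmetric Gauss-Seidel smoothing.
    std::vector<double> precondition(const CSRMatrix& A, const std::vector<double>& r) {
      std::vector<double> z(A.n, 0.0);
      for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
          if (A.col[k] == i && A.val[k] != 0.0) z[i] = r[i] / A.val[k];
      return z;
    }

    // Preconditioned CG for a symmetric positive definite matrix A.
    std::vector<double> pcg(const CSRMatrix& A, const std::vector<double>& b,
                            std::size_t max_iter, double tol) {
      std::vector<double> x(A.n, 0.0), r = b, z = precondition(A, r), p = z;
      double rz = dot(r, z);
      for (std::size_t it = 0; it < max_iter && std::sqrt(dot(r, r)) > tol; ++it) {
        std::vector<double> Ap = spmv(A, p);
        double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < A.n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        z = precondition(A, r);
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < A.n; ++i) p[i] = z[i] + beta * p[i];
      }
      return x;
    }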

CoMD

CoMD is a classical molecular dynamics proxy application with OpenCL, OpenMP, and MPI reference implementations for comparison. It was developed as part of the ExMatEx co-design center. Information on CoMD can be found on the CoMD proxy application website: http://www.exmatex.org/comd.html. The beta release of CoMD-HPX is feature complete, though not yet performance oriented.

Coming soon...

PICSAR

Particle-In-Cell Scalable Application Resource (PICSAR) is a Fortran-based Particle-In-Cell (PIC) code with an MPI+OpenMP reference implementation. The algorithm does not include Poisson solves, thereby eliminating a key algorithmic global barrier that many other PIC codes have. PICSAR is developed at Lawrence Berkeley National Laboratory.