# Multinode Multi-GPU Two-Electron Integrals: Code Generation Using the Regent Language

K.G. Johnson, S. Mirchandaney, E. Hoag, A. Heirich, A. Aiken and T.J. Martinez. *J. Chem. Theory Comput.* 2022.

The computation of two-electron repulsion integrals (ERIs) is often the most expensive step of integral-direct self-consistent field methods. Formally it scales as $O(N^4)$, where $N$ is the number of Gaussian basis functions used to represent the molecular wave function. In practice, this scaling can be reduced to $O(N^2)$ or less by neglecting small integrals with screening methods. The contributions of the ERIs to the Fock matrix are of Coulomb (J) and exchange (K) type and require separate algorithms to compute matrix elements efficiently. We previously implemented highly efficient GPU-accelerated J-matrix and K-matrix algorithms in the electronic structure code TeraChem. Although these implementations supported the use of multiple GPUs on a node, they did not support the use of multiple nodes. This presents a key bottleneck to cutting-edge ab initio simulations of large systems, e.g., excited-state dynamics of photoactive proteins. We present our implementation of multinode multi-GPU J- and K-matrix algorithms in TeraChem using the Regent programming language. Regent directly supports distributed computation in a task-based model and can generate code for a variety of architectures, including NVIDIA GPUs. We demonstrate multinode scaling up to 45 GPUs (3 nodes) and benchmark against hand-coded TeraChem integral code. We also outline our metaprogrammed Regent implementation, which enables flexible code generation for integrals of different angular momenta.
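As a rough illustration of the screening idea and of the separate J and K contractions, the sketch below uses a toy stand-in for the integrals and a Schwarz-type bound. It is not the TeraChem or Regent implementation: `pair_weight`, `toy_eri`, and `build_jk` are hypothetical names, the integral values are synthetic, and the abstract does not say which screening method is used, so the Schwarz bound is an assumption.

```python
import math

def pair_weight(i, j, spacing=1.5):
    """Toy overlap-like factor that decays with the distance between centers i and j."""
    return math.exp(-(spacing * (i - j)) ** 2)

def toy_eri(i, j, k, l):
    """Hypothetical stand-in for (ij|kl); a real code evaluates Gaussian integrals here."""
    return pair_weight(i, j) * pair_weight(k, l)

def build_jk(n, density, threshold):
    """Accumulate J and K, skipping quartets whose Schwarz bound is below threshold."""
    # Schwarz bound: |(ij|kl)| <= sqrt((ij|ij)) * sqrt((kl|kl))
    q = [[math.sqrt(toy_eri(i, j, i, j)) for j in range(n)] for i in range(n)]
    J = [[0.0] * n for _ in range(n)]
    K = [[0.0] * n for _ in range(n)]
    evaluated = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    if q[i][j] * q[k][l] < threshold:
                        continue  # negligible integral, never computed
                    v = toy_eri(i, j, k, l)
                    evaluated += 1
                    J[i][j] += v * density[k][l]  # Coulomb: J_ij += (ij|kl) D_kl
                    K[i][k] += v * density[j][l]  # exchange: K_ik += (ij|kl) D_jl
    return J, K, evaluated

n = 6
D = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]  # toy density
J_full, K_full, count_full = build_jk(n, D, threshold=0.0)
J_scr, K_scr, count_scr = build_jk(n, D, threshold=1e-6)
print(count_full, count_scr)  # screened count is smaller than n**4
```

The same integral value feeds two different index patterns (J pairs `ij` with `kl`, K pairs `ik` with `jl`), which is one way to see why the two matrices are usually built by separate algorithms.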

In situ Visualization with Task-based Parallelism

In Proceedings of In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV’17), Denver, CO, USA, November 12–17, 2017.

- This short paper describes an experimental prototype of in situ visualization in a task-based parallel programming framework. A set of reusable visualization tasks were composed with an existing simulation. The visualization tasks include a local OpenGL renderer, a parallel image compositor, and a display task. These tasks were added to an existing fluid-particle-radiation simulation and weak scaling tests were run on up to 512 nodes of the Piz Daint supercomputer. Benchmarks showed that the visualization components scaled and did not reduce the simulation throughput. The compositor latency increased logarithmically with increasing node count.
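One way a compositor's latency can grow logarithmically with node count is a pairwise tree reduction, sketched below. This is an assumption for illustration only; the summary above does not state which compositing algorithm the prototype uses, and `composite_pair`, `tree_composite`, and the synthetic pixel data are hypothetical.

```python
def composite_pair(a, b):
    """Depth-composite two images: per pixel, keep the fragment nearer the camera."""
    return [pa if pa[0] <= pb[0] else pb for pa, pb in zip(a, b)]

def tree_composite(images):
    """Pairwise reduction over per-node images; returns (final image, rounds)."""
    level = list(images)
    rounds = 0
    while len(level) > 1:
        merged = [composite_pair(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd image out advances unchanged
            merged.append(level[-1])
        level = merged
        rounds += 1
    return level[0], rounds

# Hypothetical data: each node renders 4 pixels as (depth, label) pairs.
nodes = 512
images = [[((r * 7 + p * 13) % nodes, f"node{r}") for p in range(4)]
          for r in range(nodes)]
final, rounds = tree_composite(images)
print(rounds)  # 9 rounds for 512 nodes, since 2**9 == 512
```

Each round halves the number of partial images, so doubling the node count adds only one more round.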

Multipass Shader Partitioning by Dynamic Programming (Graphics Hardware 2005)

- The multipass shader partitioning problem was defined in a pair of papers by Chan et al. [1,2]. The first successful solution to this problem was presented by Riffel et al. [3]. The current paper presents another solution based on dynamic programming and argues that it is scalable [4]. After publication of the paper, errors were found in the data presented to support this argument. Nonetheless, the algorithm is semi-scalable as claimed. Dynamic programming is commonly used for instruction selection, and this problem is an instance of instruction selection. If scalability failed on this problem, it would also fail on instruction selection.

[1] *Efficient Partitioning of Fragment Shaders for Multipass Rendering on Programmable Graphics Hardware*. Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot, and Pat Hanrahan.
[2] Proceedings of Graphics Hardware (2004). *Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware*. Tim Foley, Mike Houston and Pat Hanrahan.
[3] Proceedings of Graphics Hardware (2004). *Mio: Fast Multipass Partitioning via Priority-Based Instruction Scheduling*. Andrew Riffel, Aaron E. Lefohn, Kiril Vidimce, Mark Leone, and John D. Owens.
[4] Proceedings of Graphics Hardware (2005). *Optimal Automatic Multi-pass Shader Partitioning by Dynamic Programming*. A. Heirich.
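To give a flavor of partitioning by dynamic programming, the sketch below splits a linear instruction sequence into passes under a resource limit. It is a deliberately simplified stand-in and not the algorithm of [4]: real shaders form a DAG rather than a chain, and the cost model (`pass_overhead`, `live_out`) and the function name `partition_passes` are assumptions made for illustration.

```python
def partition_passes(costs, live_out, limit, pass_overhead=1.0):
    """Split instructions 0..n-1 into contiguous passes at minimum total cost.

    costs[i]    : resource units instruction i consumes within a pass
    live_out[i] : cost of saving intermediates if a pass ends after instruction i
    limit       : resource units available to a single pass
    """
    n = len(costs)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[e] = cheapest way to schedule instructions [0, e)
    cut = [0] * (n + 1)     # cut[e]  = start index of the last pass in that schedule
    best[0] = 0.0
    for end in range(1, n + 1):
        used = 0
        for start in range(end - 1, -1, -1):
            used += costs[start]
            if used > limit:
                break  # pass [start, end) no longer fits in one pass
            save = live_out[end - 1] if end < n else 0.0  # final pass saves nothing
            candidate = best[start] + pass_overhead + save
            if candidate < best[end]:
                best[end], cut[end] = candidate, start
    if best[n] == inf:
        raise ValueError("an instruction exceeds the per-pass resource limit")
    passes, end = [], n
    while end > 0:  # walk the recorded cuts backwards to recover the passes
        passes.append((cut[end], end))
        end = cut[end]
    passes.reverse()
    return best[n], passes

cost, passes = partition_passes([2, 2, 2, 2], [1, 5, 1, 0], limit=4)
print(cost, passes)  # 5.0 [(0, 1), (1, 3), (3, 4)]
```

In this example the cheapest schedule uses three passes rather than two, because cutting after the second instruction would force an expensive save of intermediate values.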

Competitive Analysis of Load Balancing Strategies for Parallel Ray Tracing, with James Arvo [5].

This paper gives a fundamental formula for predicting workload imbalance as a result of static load balancing strategies like tiling and randomization in parallel ray tracing. The results can equally be applied to any parallel algorithm for graphics or image processing. It predicts that these strategies will fail at large numbers of computers, and for NTSC resolution images this was true at 128-way parallelism and above. The solution to this failure is dynamic load balancing such as the diffusion strategy in [6,7].
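The effect of static strategies on imbalance can be explored with a small simulation such as the one below. It does not reproduce the paper's formula; the cost model (a "hot" image corner that is far more expensive to trace), the imbalance metric (maximum load over mean load), and all function names are assumptions chosen for illustration.

```python
import random

def imbalance(loads):
    """Ratio of the busiest worker's load to the mean load (1.0 is perfect balance)."""
    return max(loads) / (sum(loads) / len(loads))

def pixel_cost(x, y):
    """Hypothetical cost model: a 16x16 hot corner costs 51 units, other pixels 1."""
    return 51.0 if (x < 16 and y < 16) else 1.0

def tiled_loads(size=64, tiles_per_side=4):
    """Static tiling: each worker owns one contiguous square tile."""
    tile = size // tiles_per_side
    loads = [0.0] * (tiles_per_side ** 2)
    for y in range(size):
        for x in range(size):
            loads[(y // tile) * tiles_per_side + (x // tile)] += pixel_cost(x, y)
    return loads

def randomized_loads(size=64, workers=16, seed=0):
    """Static randomization: each pixel goes to a uniformly random worker."""
    rng = random.Random(seed)
    loads = [0.0] * workers
    for y in range(size):
        for x in range(size):
            loads[rng.randrange(workers)] += pixel_cost(x, y)
    return loads

print("tiled:     ", imbalance(tiled_loads()))
print("randomized:", imbalance(randomized_loads()))
```

With this cost model the tiled assignment puts the whole hot region on one worker, while randomization spreads it out. Varying `workers` is one way to explore how the randomized imbalance behaves as the pixels per worker shrink.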

[5] The Journal of Supercomputing. *A Competitive Analysis of Load Balancing Strategies for Parallel Ray Tracing*. A. Heirich and J. Arvo, vol. 12, no. 1/2, pp. 57-68 (1998).
[6] The International Journal for Foundations of Computer Science (1997). *A Scalable Diffusion Algorithm for Dynamic Mapping and Load Balancing on Networks of Arbitrary Topology*. A. Heirich, vol. 8, no. 3, September 1997, pp. 329-346.
[7] In proceedings of the International Conference on Parallel Processing (1995). *A Parabolic Load Balancing Method*. A. Heirich and S. Taylor, vol. III, pp. 192-202. Winner of "outstanding paper of the year". With thanks to Andrew Conley.
[8] Analysis of scalable algorithms for dynamic load balancing and mapping with application to photo-realistic rendering. (Dissertation)
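For readers unfamiliar with diffusion-style dynamic load balancing, here is a minimal sketch of the general idea: each node repeatedly exchanges a fraction of its load difference with its neighbors. This is a generic first-order explicit scheme on a ring, not the specific methods of [6,7]; the topology, the coefficient `alpha`, and the name `diffuse` are assumptions.

```python
def diffuse(loads, alpha=0.25, steps=200):
    """Explicit diffusion on a ring: move alpha * (difference) across each link per step."""
    n = len(loads)
    u = [float(x) for x in loads]
    for _ in range(steps):
        u = [u[i] + alpha * ((u[(i - 1) % n] - u[i]) + (u[(i + 1) % n] - u[i]))
             for i in range(n)]
    return u

start = [100, 0, 0, 0, 0, 0, 0, 0]  # all work begins on one node
balanced = diffuse(start)
print([round(x, 3) for x in balanced])  # every node approaches the mean, 12.5
```

Each exchange is symmetric across a link, so total work is conserved while local differences decay. Only neighbor-to-neighbor communication is needed, which is what makes this family of methods attractive at scale.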

Movie: Scalable Interactive Volume Rendering

Scalable Graphics Cluster Supercomputer Architecture

These papers describe work at HP/Compaq to build a commodity-based scalable graphics architecture. The results include world-record-setting performance and scalability on volume rendering [10] and a commercial product, the HP Scalable Visualization Array. A similar project called Lightning-2 was developed by Stoll et al. [14]. Although it was intended as a sort-last architecture, Lightning-2 was not scalable and could not support volume rendering or applications that require ordered blending.
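To illustrate why ordered blending matters for volume rendering in a sort-last design, the sketch below applies the standard Porter-Duff "over" operator to premultiplied samples. It is a generic illustration and says nothing about how the hardware described here implements blending; the sample values and the single-channel representation are assumptions.

```python
def over(front, back):
    """Porter-Duff 'over' for premultiplied (color, alpha) samples."""
    cf, af = front
    cb, ab = back
    return (cf + (1.0 - af) * cb, af + (1.0 - af) * ab)

a = (0.8, 0.8)  # bright, mostly opaque sample
b = (0.0, 0.5)  # dark, half-transparent sample
c = (0.3, 0.3)  # faint sample behind both

print(over(a, b), over(b, a))  # swapping the order changes the color
left = over(over(a, b), c)     # composite a over b first, then over c
right = over(a, over(b, c))    # composite b over c first, then a over that
print(left, right)             # same result up to rounding
```

The operator is associative but not commutative, so partial images may be combined in any grouping as long as their front-to-back order is preserved. A compositor that cannot guarantee that order cannot blend semi-transparent volume samples correctly.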

[9] Parallel Computing (2003). *Distributed Rendering of Interactive Soft Shadows*. M. Isard, M. Shand and A. Heirich. Parallel Computing, vol. 29, no. 3, March 2003, pp. 311-323.
[10] IEEE Visualization 2002. Workshop on commodity-based visualization clusters (presentation October 27, 2002). *Alpha/Depth Acquisition Through DVI*. A. Heirich, M. Shand, E. Oertli, G. Lupton and P. Ezolt.
[11] IEEE Parallel and Large-Data Visualization and Graphics Symposium (2001). *Scalable Interactive Volume Rendering Using Off-the-Shelf Components*. S. Lombeyda, L. Moll, M. Shand, D. Breen and A. Heirich.
[12] IEEE Parallel Visualization and Graphics Symposium (1999). *Scalable Distributed Visualization Using Off-the-Shelf Components*. A. Heirich and L. Moll.
[13] IEEE Symposium on Field Programmable Custom Computing Machines (1999). *Sepia: Scalable 3D Compositing Using PCI Pamette*. L. Moll, A. Heirich, and M. Shand.
[14] In Proceedings of ACM SIGGRAPH (2001). *Lightning-2: a high-performance display subsystem for PC clusters*. G. Stoll, M. Eldridge, D. Patterson, A. Webb, S. Berman, R. Levy, C. Caywood, M. Taveira, S. Hunt and P. Hanrahan.
[15] *Scalability in large-data scientific visualization*. A. Heirich. Pittsburgh Supercomputing Center (2002).

- with Maneesh Agrawala, Ravi Ramamoorthi and Laurent Moll.

[16] Proceedings of ACM SIGGRAPH (2000). *Efficient image-based methods for rendering soft shadows*. M. Agrawala, R. Ramamoorthi, L. Moll and A. Heirich.

Neuronal Simulations

[17] In Connectionist Models: Proceedings of the 1990 Summer School (Eds. Touretzky, Elman, Sejnowski & Hinton, pp. 369-374). *Neuronal Signal Strength Is Enhanced By Rhythmic Firing*. A. Heirich & C. Koch.