# A selection of papers on numerical linear algebra and parallel computing published in 2019

#### Andrea

This year has been particularly exciting for the fields of numerical linear algebra and parallel computing. I managed to add 209 entries to my bibliography, which means a new paper about every 1.7 days on average! To me, this is a big step up compared to last year’s 56 papers.

This year, I slightly changed my base Google Scholar query to

algorithm sparse method parallel performance matrix numerical optimization graph

and I supplemented the query’s results with interesting publications I came across.

## Some general trends

• Machine/deep learning is still (by far) the most popular application in the fields of numerical linear algebra and parallel computing.
• Several papers discuss BLAS-level operations on sparse matrices and show that the community is still looking for good ways to tailor data structures to applications. Parallel graph algorithms are often used in this context to optimise sparse representations.
• Machine learning is also being used to select the best data structure to use for a given application. I find this quite interesting, although I think there’s value in understanding why some data structures are more suited than others in specific cases.
• Mixed-precision algorithms are gaining traction quite quickly, especially on GPUs. Also, the new IEEE standard for floating-point operations was released this year. I’m looking forward to seeing the impact this will have on GPU architectures and the robustness of numerical software.

## List of publications, with abstracts

A. Abdelfattah, S. Tomov, and J. Dongarra, “Towards half-precision computation for complex matrices: A case study for mixed-precision solvers on GPUs,” Innovative Computing Laboratory, University of Tennessee, Tech. Rep., 2019.

Abstract: The use of low-precision computations is popular in accelerating machine learning and artiﬁcial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit ﬂoating-point arithmetic (i.e., half-precision). While half precision provides a natural 2×/4× speedup against the performance of single/double precisions, respectively, modern GPUs are equipped with hardware accelerators that further boost the FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU’s TC units. Unlike similar eﬀorts that have discussed accelerating Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 precisions. The developed solution uses a “half-complex” precision to accelerate the solution of Ax = b while maintaining complex FP32 precision accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs, and uses the TCs’ full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Similar to the way fast GEMMs power the performance of LAPACK, the mixed-precision CGEMMFP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16s into the development of mixed-precision LU factorizations of complex matrices. Finally, an iterative reﬁnement solver is used to deliver complex FP32 accuracy using a preconditioned GMRES solver. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-complex precision solver.

A. Abdullahi Hassan, V. Cardellini, P. D’Ambra, D. di Seraﬁno, and S. Filippone, “Eﬃcient algebraic multigrid preconditioners on clusters of GPUs,” Parallel Processing Letters, vol. 29, no. 01, p. 1 950 001, 2019. doi: 10.1142/S0129626419500014.

Abstract: Many scientiﬁc applications require the solution of large and sparse linear systems of equations using Krylov subspace methods; in this case, the choice of an eﬀective preconditioner may be crucial for the convergence of the Krylov solver. Algebraic MultiGrid (AMG) methods are widely used as preconditioners, because of their optimal computational cost and their algorithmic scalability. The wide availability of GPUs, now found in many of the fastest supercomputers, poses the problem of implementing eﬃciently these methods on high-throughput processors. In this work we focus on the application phase of AMG preconditioners, and in particular on the choice and implementation of smoothers and coarsest-level solvers capable of exploiting the computational power of clusters of GPUs. We consider block-Jacobi smoothers using sparse approximate inverses in the solve phase associated with the local blocks. The choice of approximate inverses instead of sparse matrix factorizations is driven by the large amount of parallelism exposed by the matrix-vector product as compared to the solution of large triangular systems on GPUs. The selected smoothers and solvers are implemented within the AMG preconditioning framework provided by the MLD2P4 library, using suitable sparse matrix data structures from the PSBLAS library. Their behaviour is illustrated in terms of execution speed and scalability, on a test case concerning groundwater modelling, provided by the Jülich Supercomputing Center within the Horizon 2020 Project EoCoE.

S. Acer, A. Yaşar, S. Rajamanickam, M. Wolf, and Ü. V. Çatalyürek, “Scalable triangle counting on distributed-memory systems,” in Proceedings of the IEEE High Performance Extreme Computing Conference, ser. HPEC ’19, Sep. 2019, pp. 1–5. doi: 10.1109/HPEC.2019.8916302.

Abstract: Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the “Static Graph Challenge”. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the beneﬁts of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate a speed up of 2.7× on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph when the graph is not replicated.

H. Adoni, W. Yves, T. Nahhal, M. Krichen, B. Aghezzaf, and A. Elbyed, “A survey of current challenges in partitioning and processing of graph-structured data in parallel and distributed systems,” Distributed and Parallel Databases, Nov. 2019. doi: 10.1007/s10619-019-07276-9.

Abstract: One of the concepts that attracts attention since entering of big data era is the graph-structured data. Suitable frameworks to handle such data would face several constraints, especially scalability, partitioning challenges, processing complexity and hardware conﬁgurations. Unfortunately, although several works deal with big data issues, there is a lack of literature review concerning the challenges related to query answering on large-scale graph data. In this survey paper, we review current problems related to the partitioning and processing of graph-structured data. We discuss existing graph processing systems and provide some insights to know how to choose the right system for parallel and distributed processing of large-scale graph data. Finally, we survey current open challenges in this ﬁeld.

A. Agarwal, J. Peng, and O. Milenkovic, “Online convex matrix factorization with representative regions,” in Proceddings of the 33rd Conference on Neural Information Processing Systems, ser. NeurIPS’ 19, Vancouver, CA, 2019.

Abstract: Matrix factorization (MF) is a versatile learning method that has found wide applications in various data-driven disciplines. Still, many MF algorithms do not adequately scale with the size of available datasets and/or lack interpretability. To improve the computational eﬃciency of the method, an online (streaming) MF algorithm was proposed in [1]. To enable data interpretability, a constrained version of MF, termed convex MF, was introduced in [2]. In the latter work, the basis vectors are required to lie in the convex hull of the data samples, thereby ensuring that every basis can be interpreted as a weighted combination of data samples. No current algorithmic solutions for online convex MF are known as it is challenging to ﬁnd adequate convex bases without having access to the complete dataset. We address both problems by proposing the ﬁrst online convex MF algorithm that maintains a collection of constant-size sets of representative data samples needed for interpreting each of the basis [2] and has the same almost sure convergence guarantees as the online learning algorithm of [1]. Our proof techniques combine random coordinate descent algorithms with specialized quasi-martingale convergence analysis. Experiments on synthetic and real world datasets show signiﬁcant computational savings of the proposed online convex MF method compared to classical convex MF. Since the proposed method maintains small representative sets of data samples needed for convex interpretations, it is related to a body of work in theoretical computer science, pertaining to generating point sets [3], and in computer vision, pertaining to archetypal analysis [4]. Nevertheless, it diﬀers from these lines of work both in terms of the objective and algorithmic implementations.

E. Agullo, L. Giraud, and L. Poirel, “Robust preconditioners via generalized eigenproblems for hybrid sparse linear solvers,” SIAM Journal on Matrix Analysis and Applications, 2019. [Online]. Available: https://hal.inria.fr/hal-02074474/document.

Abstract: The solution of large sparse linear systems is one of the most time consuming kernels in many numerical simulations. The domain decomposition community has developed many eﬃcient and robust methods in the last decades. While many of these solvers fall into the abstract Schwarz (aS) framework, their robustness has originally been demonstrated on a case-by-case basis. In this paper, we propose a bound for the condition number of all deﬂated aS methods provided that the coarse grid consists of the assembly of local components that contain the kernel of some local operators. We show that classical results from the literature on particular instances of aS methods can be retrieved from this bound. We then show that such a coarse grid correction can be explicitly obtained algebraically via generalized eigenproblems, leading to a condition number independent of the number of domains. This result can be readily applied to retrieve or improve the bounds previously obtained via generalized eigenproblems in the particular cases of Neumann-Neumann (NN), Additive Schwarz (AS) and optimized Robin but also generalizes them when applied with approximate local solvers. Interestingly, the proposed methodology turns out to be a comparison of the considered particular aS method with generalized versions of both NN and AS for tackling the lower and upper part of the spectrum, respectively. We furthermore show that the application of the considered grid corrections in an additive fashion is robust in the AS case although it is not robust for aS methods in general. In particular, the proposed framework allows for ensuring the robustness of the AS method applied on the Schur complement (AS/S), either with deﬂation or additively, and with the freedom of relying on an approximate local Schur complement. Numerical experiments illustrate these statements.

C. Alappat, G. Hager, O. Schenk, J. Thies, A. Basermann, A. R. Bishop, H. Fehske, and G. Wellein, “A recursive algebraic coloring technique for hardware-eﬃcient symmetric sparse matrix-vector multiplication,” 2019. arXiv: arXiv:1907.06487.

Abstract: The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today’s multicore platforms with up to 100 cores is diﬃcult due to the need to manage conﬂicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware eﬃciency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the rooﬂine model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software is applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.

J. I. Aliaga, E. Dufrechou, P. Ezzatti, and E. S. Quintana-Ortí, “Accelerating the task/data-parallel version of ILUPACK’s BiCG in multi-CPU/GPU conﬁgurations,” Parallel Computing, 2019, issn: 0167-8191. doi: 10.1016/j.parco.2019.02.005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819118301777.

Abstract: ILUPACK is a valuable tool for the solution of sparse linear systems via iterative Krylov subspace-based methods. Its relevance for the solution of real problems has motivated several eﬀorts to enhance its performance on parallel machines. In this work we focus on exploiting the task-level parallelism derived from the structure of the BiCG method, in addition to the data-level parallelism of the internal matrix computations, with the goal of boosting the performance of a GPU (graphics processing unit) implementation of this solver. First, we revisit the use of dual-GPU systems to execute independent stages of the BiCG concurrently on both accelerators, while leveraging the extra memory space to improve the data access patterns. In addition, we extend our ideas to compute the BiCG method eﬃciently in multicore platforms with a single GPU. In this line, we study the possibilities oﬀered by hybrid CPU-GPU computations, as well as a novel synchronization-free sparse triangular linear solver. The experimental results with the new solvers show important acceleration factors with respect to the previous data-parallel CPU and GPU versions.

M. Anastos, A. Lamaison, R. Steiner, and T. Szabó, “Majority colorings of sparse digraphs,” Nov. 5, 2019. arXiv: ArXIv:1911.01954 [math.CO].

Abstract: A majority coloring of a directed graph is a vertex-coloring in which every vertex has the same color as at most half of its out-neighbors. Kreutzer, Oum, Seymour, van der Zypen and Wood proved that every digraph has a majority 4-coloring and conjectured that every digraph admits a majority 3-coloring. We verify this conjecture for digraphs with chromatic number at most 6 or dichromatic number at most 3. We obtain analogous results for list coloring: We show that every digraph with list chromatic number at most 6 or list dichromatic number at most 3 is majority 3-choosable. We deduce that digraphs with maximum out-degree at most 4 or maximum degree at most 7 are majority 3-choosable. On the way to these results we investigate digraphs admitting a majority 2-coloring. We show that every digraph without odd directed cycles is majority 2-choosable. We answer an open question posed by Kreutzer et al. negatively, by showing that deciding whether a given digraph is majority 2-colorable is NP-complete. Finally we deal with a fractional relaxation of majority coloring proposed by Kreutzer et al. and show that every digraph has a fractional majority 3.9602-coloring. We show that every digraph with minimum out-degree Ω has a fractional majority (2 + 𝜀)-coloring.

E. Anzalone, M. Capra, R. Peloso, M. Martina, and G. Masera, “Low-power hardware accelerator for sparse matrix convolution in deep neural network,” 2019.

Abstract: Deep Neural Networks (DNN) have reached an outstanding accuracy in the past years, often going beyond human abilities. Nowadays, DNNs are widely used in many Artiﬁcial Intelligence (AI) applications such as computer vision, natural language processing and autonomous driving. However, these incredible performance come at a high computational cost, requiring complex hardware platforms. Therefore, the need for dedicated hardware accelerators able to drastically speed up the execution by preserving a low-power attitude arise. This paper presents innovative techniques able to tackle matrix sparsity in convolutional DNNs due to non-linear activation functions. Developed architectures allow to skip unnecessary operations, like zero multiplications, without sacriﬁcing accuracy or throughput and improving the energy eﬃciency. Such improvement could enhance the performance of embedded limited-budget battery applications, where cost-eﬀective hardware, accuracy and duration are critical to expanding the deployment of AI.

H. Anzt, Y.-C. Chen, T. Cojean, J. Dongarra, G. Flegar, P. Nayak, E. S. Quintana-Ortí, Y. M. Tsai, and W. Wang, “Towards continuous benchmarking: An automated performance evaluation framework for high performance software,” in Proceedings of the Platform for Advanced Scientiﬁc Computing Conference, ser. PASC ’19, Zurich, Switzerland: ACM, 2019, pp. 1–11, isbn: 978-1-4503-6770-7. doi: 10.1145/3324989.3325719.

Abstract: We present an automated performance evaluation framework that enables an automated workﬂow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community eﬀort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between diﬀerent software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository, and visualizes it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many diﬀerent formats (as speciﬁed by the script). The combination of these approaches creates a workﬂow which enables performance reproducibility and software sustainability of scientiﬁc software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for speciﬁc problem characteristics.

H. Anzt, T. Cojean, and E. Kühn, “Towards a new peer review concept for scientiﬁc computing ensuring technical quality, software sustainability, and result reproducibility,” ser. PAMM ’19 1, vol. 19, 2019. doi: 10.1002/pamm.201900490. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/pamm.201900490.

Abstract: In this position paper we argue for implementing an alternative peer review process for scientiﬁc computing contributions that promotes high quality scientiﬁc software developments as fully-recognized conference submission. The idea is based on leveraging the code reviewers’ feedback on scientiﬁc software contributions to community software developments as a third-party review involvement. Providing open access to this technical review would complement the scientiﬁc review of the contribution, eﬃciently reduce the workload of the undisclosed reviewers, improve the algorithm implementation quality and software sustainability, and ensure full reproducibility of the reported results. Using this process creates incentives to publish scientiﬁc algorithms in open source software – instead of designing prototype algorithms with the unique purpose of publishing a paper. In addition, the comments and suggestions of the community being archived in the versioning control systems ensure that also community reviewers are receiving credit for the review contributions – unlike reviewers in the traditional peer review process. Finally, it reﬂects the particularity of the scientiﬁc computing community using conferences rather than journals as the main publication venue.

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Orti, “Adaptive precision in block-jacobi preconditioning for iterative sparse linear system solvers,” Concurrency and Computation: Practice and Experience, vol. 31, no. 6, e4460, 2019. doi: 10.1002/cpe.4460.

Abstract: We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in diﬀerent precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the eﬀects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem’s data in memory. Given this observation, we propose a model that quantiﬁes the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a ﬂoating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential beneﬁts of the adaptive block-Jacobi preconditioning scheme.

H. Anzt and G. Flegar, “Are we doing the right thing? a critical analysis of the academic HPC community,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops, ser. IPDPSW’19, IEEE, May 2019. doi: 10.1109/ipdpsw.2019.00122.

Abstract: Like in any other research ﬁeld, academically surviving in the High Performance Computing (HPC) community generally requires to publish papers, in the bast case many of them and in high-ranked journals or at top-tier conferences. As a result, the number of scientiﬁc papers published each year in this relatively small community easily outnumbers what a single researcher can read. At the same time, many of the proposed and analyzed strategies, algorithms, and hardware-optimized implementations never make it beyond the prototype stage, as they are abandoned once they served the single purpose of yielding (another) publication. In a time and ﬁeld where high-quality manpower is a scarce resource, this is extremely ineﬃcient. In this position paper we promote a radical paradigm shift towards accepting high-quality software patches to community software packages as legitimate conference contributions. In consequence, the reputation and appointability of researchers is no longer based on the classical scientiﬁc metrics, but on the quality and documentation of open source software contributions – eﬀectively improving and accelerating the collaborative development of community software.

H. Anzt, G. Flegar, T. Grützmacher, and E. S. Quintana-Ortí, “Toward a modular precision ecosystem for high-performance computing,” The International Journal of High Performance Computing Applications, 2019. doi: 10.1177/1094342019846547.

Abstract: With the memory bandwidth of current computer architectures being signiﬁcantly slower than the (ﬂoating point) arithmetic performance, many scientiﬁc computations only leverage a fraction of the computational power in today’s high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between ﬂoating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientiﬁc applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a “modular precision ecosystem” that allows for more ﬂexibility in terms of customized data access. For memory-bounded scientiﬁc applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm – two applications that are popular in the communities and at the same characteristic representatives for the ﬁeld of numerical linear algebra and data analytics, respectively.

S. Apers and R. de Wolf, “Quantum speedup for graph sparsiﬁcation, cut approximation and laplacian solving,” Nov. 17, 2019. arXiv: arXiv:1911.07306 [quant-ph].

Abstract: Graph sparsiﬁcation underlies a large number of algorithms, ranging from approximation algorithms for cut problems to solvers for linear systems in the graph Laplacian. In its strongest form, ”spectral sparsiﬁcation” reduces the number of edges to near-linear in the number of nodes, while approximately preserving the cut and spectral structure of the graph. The breakthrough work by Benczúr and Karger (STOC’96) and Spielman and Teng (STOC’04) showed that sparsiﬁcation can be done optimally in time near-linear in the number of edges of the original graph.

In this work we show that quantum algorithms allow to speed up spectral sparsiﬁcation, and thereby many of the derived algorithms. Given adjacency-list access to a weighted graph with n nodes and m edges, our algorithm outputs an 𝜖-spectral sparsiﬁer in time (∕𝜖). We prove that this is tight up to polylog-factors. The algorithm builds on a string of existing results, most notably sparsiﬁcation algorithms by Spielman and Srivastava (STOC’08) and Koutis and Xu (TOPC’16), a spanner construction by Thorup and Zwick (STOC’01), a single-source shortest-paths quantum algorithm by Dürr et al. (ICALP’04) and an eﬃcient k-wise independent hash construction by Christiani, Pagh and Thorup (STOC’15). Combining our sparsiﬁcation algorithm with existing classical algorithms yields the ﬁrst quantum speedup, roughly from (m) to (), for approximating the max cut, min cut, min st-cut, sparsest cut and balanced separator of a graph. Combining our algorithm with a classical Laplacian solver, we demonstrate a similar speedup for Laplacian solving, for approximating eﬀective resistances, cover times and eigenvalues of the Laplacian, and for spectral clustering.

A. Argueta and D. Chiang, “Accelerating sparse matrix operations in neural networks on graphics processing units,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, IT, 2019, pp. 6215–6224.

Abstract: Graphics Processing Units (GPUs) are commonly used to train and evaluate neural networks eﬃciently. While previous work in deep learning has focused on accelerating operations on dense matrices/tensors on GPUs, eﬀorts have concentrated on operations involving sparse data structures. Operations using sparse structures are common in natural language models at the input and output layers, because these models operate on sequences over discrete alphabets. We present two new GPU algorithms: one at the input layer, for multiplying a matrix by a few-hot vector (generalizing the more common operation of multiplication by a one-hot vector) and one at the output layer, for a fused softmax and top-N selection (commonly used in beam search). Our methods achieve speedups over state-of-theart parallel GPU baselines of up to 7× and 50×, respectively. We also illustrate how our methods scale on diﬀerent GPU architectures.

Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth, “Lower bounds for non-convex stochastic optimization,” Dec. 5, 2019. arXiv:1912.02365 [math.OC].

Abstract: We lower bound the complexity of ﬁnding 𝜖-stationary points (with gradient norm at most 𝜖) using stochastic ﬁrst-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least 𝜖−4 queries to ﬁnd an 𝜖 stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of 𝜖−3 queries, establishing the optimality of recently proposed variance reduction techniques.

A. G. Artemov. (2019). Sparse approximate matrix multiplication in a fully recursive distributed task-parallel framework. arXiv:1906.0814.

Abstract: In this paper we consider parallel implementations of approximate multiplication of large matrices with exponential decay of elements. Such matrices arise in computations related to electronic structure calculations and some other ﬁelds of science. Commonly, sparsity is introduced by truncation of input matrices. In turn, the sparse approximate multiplication algorithm [M. Challacombe and N. Bock, arXiv preprint 1011.3534, 2010] performs truncation of sub-matrix products. We consider these two methods and their combination, i.e. truncation of both input matrices and sub-matrix products. Implementations done using the Chunks and Tasks programming model and library [E. H. Rubensson and E. Rudberg, Parallel Comput., 40:328343, 2014] are presented and discussed. The absolute error asymptotic behavior is derived. A comparison between the three methods in terms of performance is done on a model problem. The algorithms are also applied to matrices coming from large chemical systems with ≈106 atoms.

T. Augustine, J. Sarma, L.-N. Pouchet, and G. Rodríguez, “Generating piecewise-regular code from irregular structures,” in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2019, Phoenix, AZ, USA: ACM, 2019, pp. 625–639, isbn: 978-1-4503-6712-7. doi: 10.1145/3314221.3314615.

Abstract: Irregular data structures, as exempliﬁed with sparse matrices, have proved to be essential in modern computing. Numerous sparse formats have been investigated to improve the overall performance of Sparse Matrix-Vector multiply (SpMV). But in this work we propose instead to take a fundamentally diﬀerent approach: to automatically build sets of regular sub-computations by mining for regular sub-regions in the irregular data structure. Our approach leads to code that is specialized to the sparsity structure of the input matrix, but which does not need anymore any indirection array, thereby improving SIMD vectorizability. We particularly focus on small sparse structures (below 10M nonzeros), and demonstrate substantial performance improvements and compaction capabilities compared to a classical CSR implementation and Intel MKL IE’s SpMV implementation, evaluating on 200+ diﬀerent matrices from the SuiteSparse repository.

A. Ayala, S. Tomov, X. Luo, H. Shaiek, A. Haidar, G. Bosilca, and J. Dongarra, “Impacts of multi-GPU MPI collective communications on large FFT computation,” in Proceedings of the Workshop on Exascale MPI (ExaMPI), ser. SC’19, Denver, CO, 2019.

Abstract: Most applications targeting exascale, such as those part of the Exascale Computing Project (ECP), are designed for heterogeneous architectures and rely on the Message Passing Interface (MPI) as their underlying parallel programming model. In this paper we analyze the limitations of collective MPI communication for the computation of fast Fourier transforms (FFTs), which are relied on heavily for large-scale particle simulations. We present experiments made at one of the largest heterogeneous platforms, the Summit supercomputer at ORNL. We discuss communication models from state-of-the-art FFT libraries, and propose a new FFT library, named HEFFTE (Highly Eﬃcient FFTs for Exascale), which supports heterogeneous architectures and yields considerable speedups compared with CPU libraries, while maintaining good weak as well as strong scalability.

T. Ayall, H. Duan, and C. Liu, “Edge property based stream order reduce the performance of stream edge graph partition,” Journal of Physics: Conference Series, vol. 1395, pp. 1–8, Nov. 2019. doi: 10.1088/1742-6596/1395/1/012010.

Abstract: The graph data exist everywhere in various disciplines such as in social network, biological network, chemical compound, and computer vision, etc. Currently, the size of a graph data have dramatically increased; all disciplines have extracted knowledge from a graph by partitioning and distributing a graph into diﬀerent clusters node using the distributed graph processing system or graph database, however the graph partition has reduced the performance of those system. Even if the stream edge graph partition has shown better partition quality than vertex graph partition for skew degree distribution of a graph and support big graph partition, the stream order has aﬀected the quality of stream edge graph partition. In this study,we propose two edge properties based stream order models such as TFB (Tree edges First then Backward edges follow), BFT (Backward edges First then Tree edges follow) and study the eﬀect of stream order on stream edge graph partition algorithms. The result shows that TFB and BFT models signiﬁcantly aﬀect the quality of stream edge partition except hashing and all algorithms show best quality of partition on Random order than other order such as TFB, BFT, BFS (Breadth First Search), and DFS (Depth First Search).

V. Balaji and B. Lucia, “Combining data duplication and graph reordering to accelerate parallel graph processing,” in Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, AX, USA, Jun. 2019.

Abstract: Performance of single-machine, shared memory graph processing is aﬀected by expensive atomic updates and poor cache locality. Data duplication, a popular approach to eliminate atomic updates by creating thread-local copies of shared data, incurs extreme memory overheads due to the large sizes of typical input graphs. Even memory-eﬃcient duplication strategies that exploit the power-law structure common to many graphs (by duplicating only the highly-connected ”hub” vertices) suﬀer from overheads for having to dynamically identify the hub vertices. Degree Sorting, a popular graph reordering technique that re-assigns hub vertices consecutive IDs in a bid to improve spatial locality, is eﬀective for single-threaded graph applications but suﬀers from increased false sharing in parallel executions.
The main insight of this work is that the combination of data duplication and Degree Sorting eliminates the overheads of each optimization. Degree Sorting improves the eﬃciency of data duplication by assigning hub vertices consecutive IDs which enables easy identiﬁcation of the hub vertices. Additionally, duplicating the hub vertex data eliminates false sharing in Degree Sorting since each thread updates its local copy of the hub vertex data. We evaluate this mutually-enabling combination of power-law-speciﬁc data duplication and Degree Sorting in a system called RADAR. RADAR improves performance by eliminating atomic updates for hub vertices and improving the cache locality of graph applications, providing speedups of up to 166x (1.88x on average) across diﬀerent graph applications and input graphs.

G. Ballard, J. Demmel, I. Dumitriu, and A. Rusciano, “A generalized randomized rank-revealing factorization,” 2019. arXiv:1909.06524 [math.NA].

Abstract: We introduce a Generalized Randomized QR-decomposition that may be applied to arbitrary products of matrices and their inverses, without needing to explicitly compute the products or inverses. This factorization is a critical part of a communication-optimal spectral divide-and-conquer algorithm for the nonsymmetric eigenvalue problem. In this paper, we establish that this randomized QR-factorization satisﬁes the strong rank-revealing properties. We also formally prove its stability, making it suitable in applications. Finally, we present numerical experiments which demonstrate that our theoretical bounds capture the empirical behavior of the factorization.

S. Barratt and S. Boyd. (2019). Least squares auto-tuning. arXiv:1904.05460.

Abstract: Least squares is by far the simplest and most commonly applied computational method in many ﬁelds. In almost all applications, the least squares objective is rarely the true objective. We account for this discrepancy by parametrizing the least squares problem and automatically adjusting these parameters using an optimization algorithm. We apply our method, which we call least squares auto-tuning, to data ﬁtting.

S. Behnezhad, S. Brandt, M. Derakhshan, M. Fischer, M. Hajiaghayi, R. M. Karp, and J. Uitto, “Massively parallel computation of matching and mis in sparse graphs,” in Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, ser. PODC ’19, Toronto ON, Canada: ACM, 2019, pp. 481–490, isbn: 978-1-4503-6217-7. doi: 10.1145/3293611.3331609.

Abstract: The Massively Parallel Computation (MPC) model serves as a common abstraction of many modern large-scale parallel computation frameworks and has recently gained a lot of importance, especially in the context of classic graph problems. In this work, we mainly consider maximal matching and maximal independent set problems in the MPC model. These problems are known to admit eﬃcient MPC algorithms if the space available per machine is near-linear in the number n of nodes. This is not only often signiﬁcantly more than what we can aﬀord, but also allows for easy if not trivial solutions for sparse graphs – which are common in real-world large-scale graphs. We are, therefore, interested in the low-memory MPC model, where the space per machine is restricted to be strongly sublinear, that is, nδ for any constant 0 < δ < 1.
We parametrize our algorithms by the arboricity λ of the input graph. Our key ingredient is a degree reduction technique that reduces these problems in graphs with arboricity λ to the corresponding problems in graphs with maximum degree poly(λ,logn) in O(log2logn) rounds, giving rise to O( ⋅ loglogλ + log2logn)-round algorithms.
Our result is particularly interesting for graphs with polylogn arboricity as for such graphs, we get O(log2logn)-round algorithms. This covers most natural families of sparse graphs and almost exponentially improves over previous algorithms that all required log O(1) n rounds in this regime of MPC.
Finally, our maximal matching algorithm can be employed to obtain a (1 + 𝜖)-approximate maximum cardinality matching, a (2 + 𝜖)-approximate maximum weighted matching, as well as a 2-approximate minimum vertex cover in essentially the same number of rounds.

J. Bento and S. Ioannidis, “A family of tractable graph metrics,” Applied Network Science, vol. 4, no. 1, p. 107, Nov. 15, 2019, issn: 2364-8228. doi: 10.1007/s41109-019-0219-z.

Abstract: Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classiﬁcation over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical distance and the Chartrand-Kubiki-Shultz distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We deﬁne a broad family of graph distances, that includes both the chemical and the Chartrand-Kubiki-Shultz distances, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.

M. Bernaschi, M. Carrozzo, A. Franceschini, and C. Janna, “A dynamic pattern factored sparse approximate inverse preconditioner on graphics processing units,” SIAM Journal on Scientiﬁc Computing, vol. 41, no. 3, pp. C139–C160, 2019. doi: 10.1137/18M1197461.

Abstract: One of the most time-consuming tasks in the procedures for the numerical study of PDEs is the solution to linear systems of equations. To that purpose, iterative solvers are viewed as a promising alternative to direct methods on high-performance computers since, in theory, they are almost perfectly parallelizable. Their main drawback is the need of ﬁnding a suitable preconditioner to accelerate convergence. The factorized sparse approximate inverse (FSAI), mainly in its adaptive form, has proven to be an eﬀective parallel preconditioner for several problems. In the present work, we report about two novel ideas to dynamically compute, on graphics processing units (GPUs), the FSAI sparsity pattern, which is the main task in its setup. The ﬁrst approach, borrowed from the CPU implementation, uses a global array as a nonzero indicator, whereas the second one relies on a merge-sort procedure of multiple arrays. We will show that the second approach requires signiﬁcantly less memory and overcomes issues related to the limited global memory available on GPUs. Numerical tests prove that the GPU implementation of FSAI allows for an average speed-up of 7.5 over a parallel CPU implementation. Moreover, we will show that the preconditioner computation is still feasible using single precision arithmetic with a further 20% reduction of the setup cost. Finally, the strong scalability of the overall approach in shown in a multi-GPU setting.

D. Bertsimas and B. Stellato, “Online mixed-integer optimization in milliseconds,” 2019. arXiv:1907.02206.

Abstract: We propose a method to solve online mixed-integer optimization (MIO) problems at very high speed using machine learning. By exploiting the repetitive nature of online optimization, we are able to greatly speedup the solution time. Our approach encodes the optimal solution into a small amount of information denoted as strategy using the Voice of Optimization framework proposed in [BS18]. In this way the core part of the optimization algorithm becomes a multiclass classiﬁcation problem which can be solved very quickly. In this work we extend that framework to real-time and high-speed applications focusing on parametric mixed-integer quadratic optimization (MIQO). We propose an extremely fast online optimization algorithm consisting of a feedforward neural network (NN) evaluation and a linear system solution where the matrix has already been factorized. Therefore, this online approach does not require any solver nor iterative algorithm. We show the speed of the proposed method both in terms of total computations required and measured execution time. We estimate the number of ﬂoating point operations (ﬂops) required to completely recover the optimal solution as a function of the problem dimensions. Compared to state-of-the-art MIO routines, the online running time of our method is very predictable and can be lower than a single matrix factorization time. We benchmark our method against the state-ofthe-art solver Gurobi obtaining from two to three orders of magnitude speedups on benchmarks with real-world data.

M. Besta, S. Weber, L. Gianinazzi, R. Gerstenberger, A. Ivanov, Y. Oltchik, and T. Hoeﬂer, “Slim graph: Practical lossy graph compression for approximate graph processing, storage, and analytics,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19, Denver, Colorado: ACM, 2019, 35:1–35:25, isbn: 978-1-4503-6229-0. doi: 10.1145/3295500.3356182.

Abstract: We propose Slim Graph: the ﬁrst programming model and framework for practical lossy graph compression that facilitates high-performance approximate graph processing, storage, and analytics. Slim Graph enables the developer to express numerous compression schemes using small and programmable compression kernels that can access and modify local parts of input graphs. Such kernels are executed in parallel by the underlying engine, isolating developers from complexities of parallel programming. Our kernels implement novel graph compression schemes that preserve numerous graph properties, for example connected components, minimum spanning trees, or graph spectra. Finally, Slim Graph uses statistical divergences and other metrics to analyze the accuracy of lossy graph compression. We illustrate both theoretically and empirically that Slim Graph accelerates numerous graph algorithms, reduces storage used by graph datasets, and ensures high accuracy of results. Slim Graph may become the common ground for developing, executing, and analyzing emerging lossy graph compression schemes.

P. Blanchard, D. J. Higham, and N. J. Higham, “Accurate computation of the log-sum-exp and softmax functions,” Sep. 8, 2019. arXiv:1909.03469 [math.NA].

Abstract: Evaluating the log-sum-exp function or the softmax function is a key step in many modern data science algorithms, notably in inference and classiﬁcation. Because of the exponentials that these functions contain, the evaluation is prone to overﬂow and underﬂow, especially in low precision arithmetic. Software implementations commonly use alternative formulas that avoid overﬂow and reduce the chance of harmful underﬂow, employing a shift or another rewriting. Although mathematically equivalent, these variants behave diﬀerently in ﬂoating-point arithmetic. We give rounding error analyses of diﬀerent evaluation algorithms and interpret the error bounds using condition numbers for the functions. We conclude, based on the analysis and numerical experiments, that the shifted formulas are of similar accuracy to the unshifted ones and that the shifted softmax formula is typically more accurate than a division-free variant.

P. Blanchard, N. J. Higham, F. Lopez, T. Mary, and S. Pranesh, “Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores,” 2019. MIMS EPrint: MIMS EPrint:2019.18. [Online]. Available: http://eprints.maths.manchester.ac.uk/2727/1/paper.pdf.

Abstract: Block low-rank (BLR) matrices possess a blockwise low-rank property that can be exploited to reduce the complexity of numerical linear algebra algorithms. The impact of these lowrank approximations on the numerical stability of the algorithms in ﬂoating-point arithmetic has not previously been analyzed. We present rounding error analysis for the solution of a linear system by LU factorization of BLR matrices. Assuming that a stable pivoting scheme is used, we prove backward stability: the relative backward error is bounded by a modest constant times 𝜖, where the low-rank threshold 𝜖 is the parameter controlling the accuracy of the blockwise low-rank approximations. In addition to this key result, our analysis oﬀers three new insights into the numerical behavior of BLR algorithms. First, we compare the use of a global or local low-rank threshold and ﬁnd that a global one should be preferred. Second, we show that performing intermediate recompressions during the factorization can signiﬁcantly reduce its cost without compromising numerical stability. Third, we consider diﬀerent BLR factorization variants and determine the compress–factor–update (CFU) variant to be the best. Tests on a wide range of matrices from various real-life applications show that the predictions from the analysis are realized in practice.

T. Blaß and M. Philippsen, “Which graph representation to select for static graph-algorithms on a CUDA-capable GPU,” in Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, ser. GPGPU ’19, Providence, RI, USA: ACM, 2019, pp. 22–31, isbn: 978-1-4503-6255-9. doi: 10.1145/3300053.3319416.

Abstract: GPUs seem to be ideal for algorithms that work in parallel. A number of ways to represent graphs in GPU memory are known. But so far there are no guidelines to select the representation that is likely to result in the best performance.
This a comprehensive study investigates for CUDA-capable GPUs how diﬀerent graph representations inﬂuence the performance of highly optimized graph processing algorithms that traverse the graphs without modifying them. We evaluate three diﬀerent graph exchange formats and how eﬃciently they can be imported into eight graph data structures. We use ten state-of-the-art benchmarks that employ diﬀerent traversals pattern. We evaluate them on 19 input graphs with diﬀerent characteristics. The measurements show that there is not a single best data structure; the runtime performance can vary up to a factor of 2 between two representations.
The main contribution is a set of rules that helps in picking the best-performing graph representation for a given situation.

M. Bollhöfer, O. Schenk, R. Janalík, S. Hamm, and K. Gullapalli, “State-of-the-art sparse direct solvers,” in. 2019. arXiv:1907.05309.

Abstract: In this chapter we will give an insight into modern sparse elimination methods. These are driven by a preprocessing phase based on combinatorial algorithms which improve diagonal dominance, reduce ﬁll-in and improve concurrency to allow for parallel treatment. Moreover, these methods detect dense submatrices which can be handled by dense matrix kernels based on multi-threaded level-3 BLAS. We will demonstrate for problems arising from circuit simulation how the improvement in recent years have advanced direct solution methods signiﬁcantly.

A. Boulmier, F. Raynaud, N. Abdennadher, and B. Chopard, “On the beneﬁts of anticipating load imbalance for performance optimization of parallel applications,” Sep. 16, 2019. arXiv:1909.07168 [cs.DC].

B. Brock, Y. Chen, J. Yan, J. Owens, A. Buluç, and K. Yelick, “RDMA vs. RPC for implementing distributed data structures,” Oct. 4, 2019. arXiv:1910.02158 [cs.DC].

Abstract: Distributed data structures are key to implementing scalable applications for scientiﬁc simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-oﬀs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suﬀer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures.

P. Burkhardt, “Optimal algebraic breadth-ﬁrst search for sparse graphs,” 2019. arXiv:1906.03113v1.

Abstract: There has been a rise in the popularity of algebraic methods for graph algorithms given the development of the GraphBLAS library and other sparse matrix methods. These are useful in practice because many graph algorithms are amenable to sparse matrix multiplication. An exemplar for these approaches is Breadth-First Search (BFS). The algebraic BFS algorithm is simply a recursion of matrix-vector multiplications with the n×n adjacency matrix. Despite many redundant operations over nonzeros that ultimately lead to suboptimal performance, the algebraic BFS is appealing for practical implementations because it is simple and embarrassingly parallel. By using highly tuned matrix libraries it can be faster in practice than the theoretically optimal combinatorial algorithm. Therefore an optimal algebraic BFS should be of keen interest especially if it is easily integrated with existing matrix methods.
Current methods, notably in the GraphBLAS, use a Sparse Matrix Sparse Vector (SpMSpV) multiplication in which the input vector is kept in a sparse representation in each step of the BFS. But simply applying SpMSpV in BFS does not lead to optimal runtime. Each nonzero in the vector must be masked in subsequent steps. This has been an area of recent recent in GraphBLAS and other libraries. While in theory these masking methods are asymptotically optimal on sparse graphs, many add work that leads to suboptimal runtime. We give a new optimal, algebraic BFS for sparse graphs that is also a constant factor faster than theoretically optimal SpMSpV methods. We show how to eliminate redundant operations so an element in the adjacency matrix is operated upon no more than once, thus taking O(m) operations for a graph with O(m) edges.
Our method multiplies progressively smaller submatrices of the adjacency matrix at each step. The matrix remains unchanged, rather we are masking the rows and columns in the matrix that corresponds to previously visited vertices. The input vector in each step is also eﬀectively masked so it is a sparse vector. Thus our method multiplies a sparse submatrix by a sparse vector in decreasing size each step. Our sequential algebraic BFS algorithm takes O(m) algebraic operations on a sparse graph as opposed to O(mn) operations of other sparse matrix approaches. Our analysis closes a gap in the literature.

A. Buttari, D. Orban, D. Ruiz, and D. Titley-Peloquin, “A tridiagonalization method for symmetric saddle-point systems,” SIAM Journal on Scientiﬁc Computing, vol. 41, no. 5, S409–S432, 2019. doi: 10.1137/18M1194900.

Abstract: We propose an iterative method for the solution of symmetric saddle-point systems that exploits the orthogonal tridiagonalization method of Saunders, Simon, and Yip (1988). By contrast with methods based on the Golub and Kahan (1965) bidiagonalization process, our method takes advantage of two initial vectors and splits the system into the sum of a least-squares and a least-norm problem. Our method typically requires fewer operator-vector products than MINRES, yet performs a comparable amount of work per iteration and has comparable storage requirements.

A. Buttari, S. Hauberg, and C. Kodsi, “Parallel QR factorization of block-tridiagonal matrices,” Institut de recherche en informatique de Toulouse (IRIT), Tech. Rep., 2019.

Abstract: In this work, we deal with the QR factorization of block-tridiagonal matrices, where the blocks are dense and rectangular. This work is motivated by a novel method for computing geodesics over Riemannian manifolds. If blocks are reduced sequentially along the diagonal, only limited parallelism is available. We propose a matrix permutation approach based on the Nested Dissection method which improves parallelism at the cost of additional computations and storage. We provide a detailed analysis of the approach showing that this extra cost is bounded. Finally, we present an implementation for shared memory systems relying on task parallelism and the use of a runtime system. Experimental results support the conclusions of our analysis and show that the proposed approach leads to good performance and scalability.

J. Camacho, A. K. Smilde, E. Saccenti, and J. A. Westerhuis, “All sparse pca models are wrong, but some are useful. part i: Computation of scores, residuals and explained variance,” 2019. arXiv:1907.03989.

Abstract: Sparse Principal Component Analysis (sPCA) is a popular matrix factorization approach based on Principal Component Analysis (PCA) that combines variance maximization and sparsity with the ultimate goal of improving data interpretation. When moving from PCA to sPCA, there are a number of implications that the practitioner needs to be aware of. A relevant one is that scores and loadings in sPCA may not be orthogonal. For this reason, the traditional way of computing scores, residuals and variance explained that is used in the classical PCA cannot directly be applied to sPCA models. This also aﬀects how sPCA components should be visualized. In this paper we illustrate this problem both theoretically and numerically using simulations for several state-ofthe-art sPCA algorithms, and provide proper computation of the diﬀerent elements mentioned. We show that sPCA approaches present disparate and limited performance when modeling noisefree, sparse data. In a follow-up paper, we discuss the theoretical properties that lead to this problem.

E. C. Carson, “An adaptive s-step conjugate gradient algorithm with dynamic basis updating,” Aug. 12, 2019. arXiv: 1908.04081v1 [math.NA].

Abstract: The adaptive s-step CG algorithm is a solver for sparse, symmetric positive deﬁnite linear systems designed to reduce the synchronization cost per iteration while still achieving a user-speciﬁed accuracy requirement. In this work, we improve the adaptive s-step conjugate gradient algorithm by use of iteratively updated estimates of the largest and smallest Ritz values, which give approximations of the largest and smallest eigenvalues of A, using a technique due to Meurant and Tichý [G. Meurant and P. Tichý, Numer. Algs. (2018), pp. 1–32]. The Ritz value estimates are used to dynamically update parameters for constructing Newton or Chebyshev polynomials so that the conditioning of the s-step bases can be continuously improved throughout the iterations. These estimates are also used to automatically set a variable related to the ratio of the sizes of the error and residual, which was previously treated as an input parameter. We show through numerical experiments that in many cases the new algorithm improves upon the previous adaptive s-step approach both in terms of numerical behavior and reduction in number of synchronizations.

Y.-J. Chang, M. Fischer, M. Ghaﬀari, J. Uitto, and Y. Zheng, “The complexity of Δ + 1 coloring in congested clique, massively parallel computation, and centralized local computation,” in Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, ser. PODC ’19, Toronto ON, Canada: ACM, 2019, pp. 471–480, isbn: 978-1-4503-6217-7. doi: 10.1145/3293611.3331607.

Abstract: In this paper, we present new randomized algorithms that improve the complexity of the classic (Δ+1)-coloring problem, and its generalization (Δ+1)-list-coloring, in three well-studied models of distributed, parallel, and centralized computation: Distributed Congested Clique: We present an O(1)-round randomized algorithm for (Δ+1)-list-coloring in the congested clique model of distributed computing. This settles the asymptotic complexity of this problem. It moreover improves upon the O(log⋆Δ)-round randomized algorithms of Parter and Su [DISC’18] and O((loglogΔ) ⋅ log⋆Δ)-round randomized algorithm of Parter [ICALP’18].
Massively Parallel Computation: We present a randomized (Δ + 1)-list-coloring algorithm with round complexity O() in the Massively Parallel Computation (MPC) model with strongly sublinear memory per machine. This algorithm uses a memory of O(nα) per machine, for any desirable constant α > 0, and a total memory of Õ(m), where m is the number of edges in the graph. Notably, this is the ﬁrst coloring algorithm with sublogarithmic round complexity, in the sublinear memory regime of MPC. For the quasilinear memory regime of MPC, an O(1)-round algorithm was given very recently by Assadi et al. [SODA’19].
Centralized Local Computation: We show that (Δ + 1)-list-coloring can be solved by a randomized algorithm with query complexity ΔO(1)…O(logn), in the centralized local computation model. The previous state of the art for (Δ + 1)-list-coloring in the centralized local computation model are based on simulation of known LOCAL algorithms. The deterministic O( + log⋆n)-round LOCAL algorithm of Fraigniaud et al. [FOCS’16] can be implemented in the centralized local computation model with query complexity ΔO()…O(log⋆n); the randomized O(log⋆Δ) + 2O()-round LOCAL algorithm of Chang et al. [STOC’18] can be implemented in the centralized local computation model with query complexity ΔO(log ⋆ Δ)…O(logn).

A. Charara, D. Keyes, and H. Ltaief, “Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs,” ACM Transactions on Mathematcal Software, vol. 45, no. 2, 15:1–15:28, May 2019, issn: 0098-3500. doi: 10.1145/3267101.

Abstract: Batched dense linear algebra kernels are becoming ubiquitous in scientiﬁc applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing threads synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.

A. Chehade and Z. Shi, “The sparse reverse of principal component analysis for fast low-rank matrix completion,” Oct. 4, 2019. arXiv:1910.02155 [cs.LG].

Abstract: Matrix completion constantly receives tremendous attention from many research ﬁelds. It is commonly applied for recommender systems such as movie ratings, computer vision such as image reconstruction or completion, multi-task learning such as collaboratively modeling time-series trends of multiple sensors, and many other applications. Matrix completion techniques are usually computationally exhaustive and/or fail to capture the heterogeneity in the data. For example, images usually contain a heterogeneous set of objects, and thus it is a challenging task to reconstruct images with high levels of missing data. In this paper, we propose the sparse reverse of principal component analysis for matrix completion. The proposed approach maintains smoothness across the matrix, produces accurate estimates of the missing data, converges iteratively, and it is computationally tractable with a controllable upper bound on the number of iterations until convergence. The accuracy of the proposed technique is validated on natural images, movie ratings, and multisensor data. It is also compared with common benchmark methods used for matrix completion.

W. Cheng and Y.-H. Dai, “An active set newton-cg method for ℓ1 optimization,” Applied and Computational Harmonic Analysis, 2019, issn: 1063-5203. doi: 10.1016/j.acha.2019.08.005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520318300459.

Abstract: In this paper, we investigate the active set identiﬁcation technique of ISTA and provide some good properties. An active set Newton-CG method is then proposed for ℓ~1~ optimization. Under appropriate conditions, we show that the proposed method is globally convergent with some nonmonotone line search. The numerical comparisons with several state-of-art methods demonstrate the eﬃciency of the proposed method.

B. Choi, A. Christlieb, and Y. Wang, “Multiscale high-dimensional sparse fourier algorithms for noisy data,” 2019. arXiv:1907.03692.

Abstract: We develop an eﬃcient and robust high-dimensional sparse Fourier algorithm for noisy samples. Earlier in the paper Multi-dimensional sublinear sparse Fourier algorithm (2016) [3], an eﬃcient sparse Fourier algorithm with O(dslogs) average-case runtime and O(ds) sampling complexity under certain assumptions was developed for signals that are s-sparse and bandlimited in the d-dimensional Fourier domain, i.e. there are at most s energetic frequencies and they are in [−N∕2,N∕2)d ∩ ℤd. However, in practice the measurements of signals often contain noise, and in some cases may only be nearly sparse in the sense that they are well approximated by the best s Fourier modes. In this paper, we propose a multiscale sparse Fourier algorithm for noisy samples that proves to be both robust against noise and eﬃcient.

D. Choi, J.-G. Jang, and U. Kang, “S3cmtf: Fast, accurate, and scalable method for incomplete coupled matrix-tensor factorization,” PloS one, vol. 14, no. 6, 2019, issn: 1932-6203. doi: 10.1371/journal.pone.0217316.

Abstract: How can we extract hidden relations from a tensor and a matrix data simultaneously in a fast, accurate, and scalable way? Coupled matrix-tensor factorization (CMTF) is an important tool for this purpose. Designing an accurate and eﬃcient CMTF method has become more crucial as the size and dimension of real-world data are growing explosively. However, existing methods for CMTF suﬀer from lack of accuracy, slow running time, and limited scalability. In this paper, we propose S3CMTF, a fast, accurate, and scalable CMTF method. In contrast to previous methods which do not handle large sparse tensors and are not parallelizable, S3CMTF provides parallel sparse CMTF by carefully deriving gradient update rules. S3CMTF asynchronously updates partial gradients without expensive locking. We show that our method is guaranteed to converge to a quality solution theoretically and empirically. S3CMTF further boosts the performance by carefully storing intermediate computation and reusing them. We theoretically and empirically show that S3CMTF is the fastest, outperforming existing methods. Experimental results show that S3CMTF is up to 930× faster than existing methods while providing the best accuracy. S3CMTF shows linear scalability on the number of data entries and the number of cores. In addition, we apply S3CMTF to Yelp rating tensor data coupled with 3 additional matrices to discover interesting patterns.

D. Cifuentes and A. Moitra, “Polynomial time guarantees for the Burer-Monteiro method,” Dec. 3, 2019. arXiv:1912.01745 [math.OC].

Abstract: The Burer-Monteiro method is one of the most widely used techniques for solving large-scale semideﬁnite programs (SDP). The basic idea is to solve a nonconvex program in Y , where Y is an n×p matrix such that X = Y Y T. In this paper, we show that this method can solve SDPs in polynomial time in an smoothed analysis setting. More precisely, we consider an SDP whose domain satisﬁes some compactness and smoothness assumptions, and slightly perturb the cost matrix and the constraints. We show that if p ≳, where m is the number of constraints and η > 0 is any ﬁxed constant, then the Burer-Monteiro method can solve SDPs to any desired accuracy in polynomial time, in the setting of smooth analysis. Our bound on p approaches the celebrated Barvinok-Pataki bound in the limit as η goes to zero, beneath which it is known that the nonconvex program can be suboptimal. Previous analyses were unable to give polynomial time guarantees for the Burer-Monteiro method, since they either assumed that the criticality conditions are satisﬁed exactly, or ignored the nontrivial problem of computing an approximately feasible solution. We address the ﬁrst problem through a novel connection with tubular neighborhoods of algebraic varieties. For the feasibility problem we consider a least squares formulation, and provide the ﬁrst guarantees that do not rely on the restricted isometry property.

W. Cook, “Computing in combinatorial optimization,” in Lecture Notes in Computer Science, W. G. Steﬀen B., Ed. Springer, Cham., 2019, vol. 10000, isbn: 978-3-319-91908-9. doi: 10.1007/978-3-319-91908-9˙3.

Abstract: Research in combinatorial optimization successfully combines diverse ideas drawn from computer science, mathematics, and operations research. We give a tour of this work, focusing on the early development of the subject and the central role played by linear programming. The paper concludes with a short wish list of future research directions.

S. Cools, J. Cornelis, P. Ghysels, and W. Vanroose, “Improving strong scaling of the conjugate gradient method for solving large linear systems using global reduction pipelining,” in Proceedings of the 2019 EuroMPI conference, ser. EuroMPI’19, 2019. arXiv:1905.06850.

Abstract: This paper presents performance results comparing MPI-based implementations of the popular Conjugate Gradient (CG) method and several of its communication hiding (or “pipelined”) variants. Pipelined CG methods are designed to eﬃciently solve SPD linear systems on massively parallel distributed memory hardware, and typically display signiﬁcantly improved strong scaling compared to classic CG. This increase in parallel performance is achieved by overlapping the global reduction phase (MPI_Iallreduce) required to compute the inner products in each iteration by (chieﬂy local) computational work such as the matrix-vector product as well as other global communication. This work includes a brief introduction to the deep pipelined CG method for readers that may be unfamiliar with the speciﬁcs of the method. A brief overview of implementation details provides the practical tools required for implementation of the algorithm. Subsequently, easily reproducible strong scaling results on the US Department of Energy (DoE) NERSC machine “Cori” (Phase I – Haswell nodes) on up to 1024 nodes with 16 MPI ranks per node are presented using an implementation of p(l)-CG that is available in the open source PETSc library. Observations on the staggering and overlap of the asynchronous, non-blocking global communication phases with communication and computational kernels are drawn from the experiments.

M. Croci, M. B. Giles, and P. E. Farrell, “Multilevel quasi monte carlo methods for elliptic PDEs with random ﬁeld coeﬃcients via fast white noise sampling,” 2019. arXiv:1911.12099 [math.NA].

Abstract: When solving partial diﬀerential equations with random ﬁelds as coeﬃcients the eﬃcient sampling of random ﬁeld realisations can be challenging. In this paper we focus on the fast sampling of Gaussian ﬁelds using quasi-random points in a ﬁnite element and multilevel quasi Monte Carlo (MLQMC) setting. Our method uses the SPDE approach combined with a new fast (ML)QMC algorithm for white noise sampling. We express white noise as a wavelet series expansion that we divide in two parts. The ﬁrst part is sampled using quasi-random points and contains a ﬁnite number of terms in order of decaying importance to ensure good QMC convergence. The second part is a correction term which is sampled using standard pseudo-random numbers. We show how the sampling of both terms can be performed in linear time and memory complexity in the number of mesh cells via a supermesh construction, yielding an overall linear cost. Furthermore, our technique can be used to enforce the MLQMC coupling even in the case of non-nested mesh hierarchies. We demonstrate the eﬃcacy of our method with numerical experiments.

İ. Çuğu and M. Manguoğlu, “A parallel multithreaded sparse triangular linear system solver,” Computers & Mathematics with Applications, 2019, issn: 0898-1221. doi: 10.1016/j.camwa.2019.09.012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0898122119304602.

Abstract: We propose a parallel sparse triangular linear system solver based on the Spike algorithm. Sparse triangular systems are required to be solved in many applications. Often, they are a bottleneck due to their inherently sequential nature. Furthermore, typically many successive systems with the same coeﬃcient matrix and with diﬀerent right hand side vectors are required to be solved. The proposed solver decouples the problem at the cost of extra arithmetic operations as in the banded case. Compared to the banded case, there are extra savings due to the sparsity of the triangular coeﬃcient matrix. We show the parallel performance of the proposed solver against the state-of-the-art parallel sparse triangular solver in Intel’s Math Kernel Library (MKL) on a multicore architecture. We also show the eﬀect of various sparse matrix reordering schemes. Numerical results show that the proposed solver outperforms MKL’s solver in ∼80% of cases by a factor of 2.47, on average.

S. Das, J. Demmel, K. Fountoulakis, L. Grigori, and M. W. Mahoney, “Parallel and communication avoiding least angle regression,” 2019. arXiv:1905.11340.

Abstract: We are interested in parallelizing the Least Angle Regression (LARS) algorithm for ﬁtting linear regression models to high-dimensional data. We consider two parallel and communication avoiding versions of the basic LARS algorithm. The two algorithms apply to data that have diﬀerent layout patterns (one is appropriate for row-partitioned data, and the other is appropriate for column-partitioned data), and they have diﬀerent asymptotic costs and practical performance. The ﬁrst is bLARS, a block version of LARS algorithm, where we update b columns at each iteration. Assuming that the data are row-partitioned, bLARS reduces the number of arithmetic operations, latency, and bandwidth by a factor of b. The second is Tournament-bLARS (T-bLARS), a tournament version of LARS, in which case processors compete, by running several LARS computations in parallel, to choose b new columns to be added into the solution. Assuming that the data are column-partitioned, T-bLARS reduces latency by a factor of b. Similarly to LARS, our proposed methods generate a sequence of linear models. We present extensive numerical experiments that illustrate speed-ups up to 25× compared to LARS.

T. A. Davis, M. Aznaveh, and S. Kolodziej, “Write quick, run fast: Sparse deep neural network in 20 minutes of development time via suitesparse:graphblas,” in Proceedings of the 23rd IEEE Conference on High Performance Extreme Computing, ser. HPEC’19, Waltham, MA, USA, 2019.

Abstract: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a ﬁrst-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and ﬁle format, writing code to read in the ﬁles that deﬁne the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the ﬁrst time it was compiled. The sequential performance of the GraphBLAS solution is 3× to 5× faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10× to 15× speedup on a 20-core Intel processor, 17× on an IBM Power8 system, and 20× on a Power9 system, for the largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development eﬀort that took one week, primarily because of diﬃculties in resolving a load-balancing issue in the MPI-based parallel algorithm.

D. Davydov and M. Kronbichler, “Algorithms and data structures for matrix-free ﬁnite element operators with mpi-parallel sparse multi-vectors,” Jul. 2019. arXiv:1907.01005 [cs.MS].

Abstract: Traditional solution approaches for problems in quantum mechanics scale as O(M3), where M is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling O(M). One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically eﬃcient implementation of sparse parallel vectors within the open-source ﬁnite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a rooﬂine model. Our matrix-free implementation of ﬁnite element operators with sparse multivectors achieves the performance of 157 GFlop/s on Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic ﬁnite element bases.

T. J. Deakin, S. N. McIntosh-Smith, J. Price, A. Poenaru, P. R. Atkinson, C. Popa, and J. Salmon, “Performance portability across diverse computer architectures,” English, in 2019 IEEE International Workshop on Performance, Portability and Productivity in HPC, ser. P3HPC ’19, United States: Institute of Electrical and Electronics Engineers (IEEE), Sep. 25, 2019.

Abstract: Previous studies into performance portability have typically analysed a single application (and its various implementations) in isolation. In this study we explore the wider landscape of performance portability by considering a number of applications from across the space of dwarfs, written in multiple parallel programming models, and across a diverse set of architectures. We apply rigorous performance portability metrics, as deﬁned by Pennycook et al [1]. We believe this is the broadest and most rigorous performance portability study to date, representing a far reaching exploration of the state of performance portability that is achievable today. We will present a summary of the performance portability of each application and programming model across our diverge range of twelve computer architectures, including six diﬀerent server CPUs from ﬁve diﬀerent vendors, ﬁve diﬀerent GPUs from two diﬀerent vendors, and one vector architecture. We will conclude with an analysis of the performance portability of key programming models in general, across diﬀerent application spaces as well across diﬀering architectures, allowing us to comment on more general performance portability principles.

G. V. Demirci and C. Aykanat, “Scaling sparse matrix-matrix multiplication in the accumulo database,” Distributed and Parallel Databases, Jan. 28, 2019, issn: 1573-7578. doi: 10.1007/s10619-019-07257-y.

Abstract: We propose and implement a sparse matrix-matrix multiplication (SpGEMM) algorithm running on top of Accumulo’s iterator framework which enables high performance distributed parallelism. The proposed algorithm provides write-locality while ingesting the output matrix back to database via utilizing row-by-row parallel SpGEMM. The proposed solution also alleviates scanning of input matrices multiple times by making use of Accumulo’s batch scanning capability which is used for accessing multiple ranges of key-value pairs in parallel. Even though the use of batch-scanning introduces some latency overheads, these overheads are alleviated by the proposed solution and by using node-level parallelism structures. We also propose a matrix partitioning scheme which reduces the total communication volume and provides a balance of workload among servers. The results of extensive experiments performed on both real-world and synthetic sparse matrices show that the proposed algorithm scales signiﬁcantly better than the outer-product parallel SpGEMM algorithm available in the Graphulo library. By applying the proposed matrix partitioning, the performance of the proposed algorithm is further improved considerably.

J. Demmel, L. Grigori, and A. Rusciano, “An improved analysis and uniﬁed perspective on deterministic and randomized low rank matrix approximations,” Oct. 1, 2019. arXiv:1910.00223 [math.NA].

Abstract: We introduce a Generalized LU-Factorization (GLU) for low-rank matrix approximation. We relate this to past approaches and extensively analyze its approximation properties. The established deterministic guarantees are combined with sketching ensembles satisfying Johnson-Lindenstrauss properties to present complete bounds. Particularly good performance is shown for the sub-sampled randomized Hadamard transform (SRHT) ensemble. Moreover, the factorization is shown to unify and generalize many past algorithms. It also helps to explain the eﬀect of sketching on the growth factor during Gaussian Elimination.

L. Dhulipala, C. McGuﬀey, H. Kang, Y. Gu, G. E. Blelloch, P. B. Gibbons, and J. Shun, “Semi-asymmetric parallel graph algorithms for NVRAMs,” Oct. 27, 2019. arXiv:1910.12310 [cs.DC].

Abstract: Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory-density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved eﬃciently at a modest cost on a single machine. However, a signiﬁcant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be signiﬁcantly more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the approach assumes that the vertices of the graph ﬁt in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other beneﬁts. We present a model, the Parallel Semi-Asymmetric Model (PSAM), to analyze algorithms in the setting, and run experiments on a 48-core NVRAM system to validate the eﬀectiveness of these algorithms. To this end, we study over a dozen graph problems. We develop parallel algorithms for each that are eﬃcient, often work-optimal, in the model. Experimentally, we run all of the algorithms on the largest publicly-available graph and show that our PSAM algorithms outperform the fastest prior algorithms designed for DRAM or NVRAM. We also show that our algorithms running on NVRAM nearly match the fastest prior algorithms running solely in DRAM, by eﬀectively hiding the costs of repeatedly accessing NVRAM versus DRAM.

D. Ernst, G. Hager, J. Thies, and G. Wellein, “Performance engineering for a tall & skinny matrix multiplication kernel on GPUs,” 2019. arXiv:1905.03136.

Abstract: General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall and skinny matrices, which are much taller than wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the rooﬂine model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate diﬀerent approaches of parallelization and thread distribution, and devise a ﬂexible, conﬁgurable mapping scheme. A code generation approach enables a simultaneously ﬂexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU.

X. Dong, L. Liu, P. Zhao, G. Li, J. Li, X. Wang, and X. Feng, “Acorns: A framework for accelerating deep neural networks with input sparsity,” in Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’19, Sep. 2019, pp. 178–191. doi: 10.1109/PACT.2019.00022.

Abstract: Deep neural networks have been employed in a broad range of applications, including face detection, natural language processing, and autonomous driving. Yet, the neural networks with the capability to tackle real-world problems are intrinsically expensive in computation, hindering the usage of these models. Sparsity in the input data of neural networks provides an optimizing opportunity. However, harnessing the potential performance improvement on modern CPU faces challenges raised by sparse computations of the neural network, such as cache-unfriendly memory accesses and eﬃcient sparse kernel implementation. In this paper, we propose Acorns, a framework to accelerate deep neural networks with input sparsity. In Acorns, sparse input data is organized into our designed sparse data layout, which allows memory-friendly access for kernels in neural networks and opens the door for many performance-critical optimizations. Upon that, Acorns generates eﬃcient sparse kernels for operators in neural networks from kernel templates, which combine directions that express speciﬁc optimizing transformations to be performed, and straightforward code that describes the computation. Comprehensive evaluations demonstrate Acorns can outperform state-of-the-art baselines by signiﬁcant speedups. On the real-world detection task in autonomous driving, Acorns demonstrates 1.8-22.6× performance improvement over baselines. Speciﬁcally, the generated programs achieve 1.8-2.4× speedups over Intel MKL-DNN, 3.0-8.8× speedups over TensorFlow, and 11.1-13.2× speedups over Intel MKL-Sparse.

C. Donnat and S. Holmes, “Convex hierarchical clustering for graph-structured data,” Nov. 8, 2019. arXiv:1911.03417 [stat.AP].

Abstract: Convex clustering is a recent stable alternative to hierarchical clustering. It formulates the recovery of progressively coalescing clusters as a regularized convex problem. While convex clustering was originally designed for handling Euclidean distances between data points, in a growing number of applications, the data is directly characterized by a similarity matrix or weighted graph. In this paper, we extend the robust hierarchical clustering approach to these broader classes of similarities. Having deﬁned an appropriate convex objective, the crux of this adaptation lies in our ability to provide: (a) an eﬃcient recovery of the regularization path and (b) an empirical demonstration of the use of our method. We address the ﬁrst challenge through a proximal dual algorithm, for which we characterize both the theoretical eﬃciency as well as the empirical performance on a set of experiments. Finally, we highlight the potential of our method by showing its application to several real-life datasets, thus providing a natural extension to the current scope of applications of convex clustering.

E. Dufrechou, P. Ezzatti, and E. S. Quintana-Orti, “Automatic selection of sparse triangular linear system solvers on GPUs through machine learning techniques,” in Proceeding of the 31st International Symposium on Computer Architecture and High Performance Computing, ser. SBAC-PAD ’19, Oct. 2019, pp. 41–47. doi: 10.1109/SBAC-PAD.2019.00020.

Abstract: The solution of sparse triangular linear systems is often the most time-consuming stage of preconditioned iterative methods to solve general sparse linear systems, where it has to be applied several times for the same sparse matrix. For this reason, its computational performance has a strong impact on a wide range of scientiﬁc and engineering applications, which has motivated the study of its eﬃcient execution on massively parallel platforms. In this sense, several methods have been proposed to tackle this operation on graphics processing units (GPUs), which can be classiﬁed under either the level-set or the self-scheduling paradigms. The results obtained from the experimental evaluation of the diﬀerent methods suggest that both paradigms perform well for certain problems but poorly for others. Additionally, the relation between the properties of the linear systems and the performance of the diﬀerent solvers is not evident a-priori. In this context, techniques that allow to predict inexpensively which is be the best solver for a particular linear system can lead to important runtime reductions. Our approach leverages machine learning techniques to select the best sparse triangular solver for a given linear system, with focus on the case where a small number of triangular systems has to be solved for the same matrix. We study the performance of several methods using diﬀerent features derived from the sparse matrices, obtaining models with more than 80% of accuracy and acceptable prediction speed. These results are an important advance towards the automatic selection of the best GPU solver for a given sparse triangular linear system, and the characterization of the performance of these kernels.

A. Dumitrasc, P. Leleux, and U. Rüde, “Block partitioning of sparse rectangular matrices,” ser. PAMM ’19 1, vol. 19, 2019. doi: 10.1002/pamm.201900287.

Abstract: Abstract We present a means of reordering large, sparse rectangular matrices such that their nonzeros are closer to the diagonal. This enables a block-partitioning which is useful in parallel contexts. We use the Reverse Cuthill-McKee (RCM) algorithm on the adjacency matrix of the associated bipartite graph. The resulting, reordered matrix has a block bidiagonal structure.

A. Elafrou, G. Goumas, and N. Koziris, “Conﬂict-free symmetric sparse matrix-vector multiplication on multicore architectures,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19, Denver, Colorado: ACM, 2019, 48:1–48:15, isbn: 978-1-4503-6229-0. doi: 10.1145/3295500.3356148.

Abstract: Exploiting the numeric symmetry in sparse matrices to reduce their memory footprint is very tempting for optimizing the memory-bound Sparse Matrix-Vector Multiplication (SpMV) kernel. Despite being very beneﬁcial for serial computation, storing the upper or lower triangular part of the matrix introduces race conditions in the updates to the output vector in a parallel execution. Previous work has suggested using local, per-thread vectors to circumvent this problem, introducing a work-ineﬃcient reduction step that limits the scalability of SpMV. In this paper, we address this issue with Conﬂict-Free Symmetric (CFS) SpMV, an optimization strategy that organizes the parallel computation into phases of conﬂict-free execution. We identify such phases through graph coloring and propose heuristics to improve the coloring quality for SpMV in terms of load balancing and locality to the input and output vectors. We evaluate our approach on two multicore shared-memory systems and demonstrate improved performance over the state-of-the-art.

A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, and B. Reinwald, “Compressed linear algebra for declarative large-scale machine learning,” Communications of the ACM, vol. 62, no. 5, pp. 83–91, Apr. 2019, issn: 0001-0782. doi: 10.1145/3318221.

Abstract: Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. Hence, it is crucial for performance to ﬁt the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compression ratios and fast decompression for block-wise uncompressed operations. Therefore, we introduce Compressed Linear Algebra (CLA) for lossless matrix compression. CLA encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations. We contribute eﬀective column compression schemes, cache-conscious operations, and an eﬃcient sampling-based compression algorithm. Our experiments show good compression ratios and operations performance close to the uncompressed case, which enables ﬁtting larger datasets into available memory. We thereby obtain signiﬁcant end-to-end performance improvements.

E. Elsen, M. Dukhan, T. Gale, and K. Simonyan, “Fast sparse convnets,” Nov. 21, 2019. arXiv:1911.09723 [cs.CV].

Abstract: Historically, the pursuit of eﬃcient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include: the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher eﬃciency, but also higher accuracy, and found wide adoption in the ﬁeld. In this work, we further expand the arsenal of eﬃcient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world eﬃciency gains. We aim to correct this misconception by introducing a family of eﬃcient sparse kernels for ARM and WebAssembly, which we open-source for the beneﬁt of the community as part of the XNNPACK library. Equipped with our eﬃcient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EﬃcientNet architectures substantially outperform strong dense baselines on the eﬃciency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by 1.3 − 2.4× – equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our ﬁndings will facilitate wider adoption of sparsity as a tool for creating eﬃcient and accurate deep learning architectures.

S. Enayati and O. Y. Özaltın, “Optimal inﬂuenza vaccine distribution with equity,” European Journal of Operational Research, Nov. 2019. doi: 10.1016/j.ejor.2019.11.025.

Abstract: This paper is concerned with the optimal inﬂuenza vaccine distribution in a heterogeneous population consisting of multiple subgroups. We employ a compartmental model for inﬂuenza transmission and formulate a mathematical program to minimize the number of vaccine doses distributed to eﬀectively extinguish an emerging outbreak in its early stages. We propose an equity constraint to help public health authorities consider fairness when making vaccine distribution decisions. We develop an exact solution approach that generates a vaccine distribution policy with a solution quality guarantee. We perform sensitivity analyses on key epidemic parameters in order to illustrate the application of the proposed model. We then analyze the scalability of the solution approach for a population consisting of subgroups based on geographic location and age. We ﬁnally demonstrate the proposed model’s ability to consider vaccine coverage inequity and discuss a derivative-free optimization approach, as an alternative solution method which can consider various diﬀerent objective functions and constraints. Our results indicate that consideration of group-speciﬁc transmission dynamics is paramount to the optimal distribution of inﬂuenza vaccines.

A. Falco, “Bridging the gap between h-matrices and sparse direct methods for the solution of large linear systems,” PhD thesis, Université de Bordeaux, 2019.

Abstract: Many physical phenomena may be studied through modeling and numerical simulations, commonplace in scientiﬁc applications. To be tractable on a computer, appropriated discretization techniques must be considered, which often lead to a set of linear equations whose features depend on the discretization techniques. Among them, the Finite Element Method usually leads to sparse linear systems whereas the Boundary Element Method leads to dense linear systems. The size of the resulting linear systems depends on the domain where the studied physical phenomenon develops and tends to become larger and larger as the performance of the computer facilities increases. For the sake of numerical robustness, the solution techniques based on the factorization of the matrix associated with the linear system are the methods of choice when aﬀordable. In that respect, hierarchical methods based on low-rank compression have allowed a drastic reduction of the computational requirements for the solution of dense linear systems over the last two decades. For sparse linear systems, their application remains a challenge which has been studied by both the community of hierarchical matrices and the community of sparse matrices. On the one hand, the ﬁrst step taken by the community of hierarchical matrices most often takes advantage of the sparsity of the problem through the use of nested dissection. While this approach beneﬁts from the hierarchical structure, it is not, however, as eﬃcient as sparse solvers regarding the exploitation of zeros and the structural separation of zeros from non-zeros. On the other hand, sparse factorization is organized so as to lead to a sequence of smaller dense operations, enticing sparse solvers to use this property and exploit compression techniques from hierarchical methods in order to reduce the computational cost of these elementary operations. Nonetheless, the globally hierarchical structure may be lost if the compression of hierarchical methods is used only locally on dense submatrices. We here review the main techniques that have been employed by both those communities, trying to highlight their common properties and their respective limits with a special emphasis on studies that have aimed to bridge the gap between them. With these observations in mind, we propose a class of hierarchical algorithms based on the symbolic analysis of the structure of the factors of a sparse matrix. These algorithms rely on a symbolic information to cluster and construct a hierarchical structure coherent with the non-zero pattern of the matrix. Moreover, the resulting hierarchical matrix relies on low-rank compression for the reduction of the memory consumption of large submatrices as well as the time to solution of the solver. We also compare multiple ordering techniques based on geometrical or topological properties. Finally, we open the discussion to a coupling between the Finite Element Method and the Boundary Element Method in a uniﬁed computational framework.

A. Fuchs and D. Wentzlaﬀ, “The accelerator wall: Limits of chip specialization,” in Proceedings of the 25th IEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’19, 2019.

Abstract: Specializing chips using hardware accelerators has become the prime means to alleviate the gap between the growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the beneﬁts of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns, ultimately hitting an accelerator wall.
In this work, we tackle the question of what are the limits of future accelerators and chip specialization? We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using datasheets of thousands of chips. We identify key concepts used in chip specialization, and explore case studies to understand how specialization has progressed over time in diﬀerent applications and chip platforms (e.g., GPUs, FPGAs, ASICs). Utilizing these insights, we build a model which projects forward to see what future gains can and cannot be enabled from chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them.

D. Fujiki, N. Chatterjee, D. Lee, and M. O’Connor, “Near-memory data transformation for eﬃcient sparse matrix multi-vector multiplication,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19, Denver, Colorado: ACM, 2019, 55:1–55:17, isbn: 978-1-4503-6229-0. doi: 10.1145/3295500.3356154.

Abstract: Eﬃcient manipulation of sparse matrices is critical to a wide range of HPC applications. Increasingly, GPUs are used to accelerate these sparse matrix operations. We study one common operation, Sparse Matrix Multi-Vector Multiplication (SpMM), and evaluate the impact of the sparsity, distribution of non-zero elements, and tile-traversal strategies on GPU implementations. Using these insights, we determine that operating on these sparse matrices in a Densiﬁed Compressed Sparse Row (DCSR) is well-suited to the parallel warp-synchronous execution model of the GPU processing elements.
Preprocessing or storing the sparse matrix in the DCSR format, however, often requires signiﬁcantly more memory storage than conventional Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) formats. Given that SpMM kernels are often bottlenecked on DRAM bandwidth, the increase in DRAM traﬃc to access the larger DCSR formatted data structure can result in a slowdown for many matrices.
We propose a near-memory transform engine to dynamically create DCSR formatted tiles for the GPU processing elements from the CSC formatted matrix in memory. This work enhances a GPU’s last-level cache/memory controller unit to act as an eﬃcient translator between the compute-optimized representation of data and its corresponding storage/bandwidth-optimized format to accelerate sparse workloads. Our approach achieves 2.26× better performance on average compared to the vendor supplied optimized library for sparse matrix operations, cuSPARSE.

M. Garstka, M. Cannon, and P. Goulart, “COSMO: A conic operator splitting method for large convex problems,” in 2019 18th European Control Conference (ECC), IEEE, Jun. 2019. doi: 10.23919/ecc.2019.8796161.

Abstract: This paper describes the Conic Operator Splitting Method (COSMO), an operator splitting algorithm for convex optimisation problems with quadratic objective function and conic constraints. At each step the algorithm alternates between solving a quasi-deﬁnite linear system with a constant coeﬃcient matrix and a projection onto convex sets. The solver is able to exploit chordal sparsity in the problem data and to detect infeasible problems. The low per-iteration computational cost makes the method particularly eﬃcient for large problems, e.g. semideﬁnite programs in portfolio optimisation, graph theory, and robust control. Our Julia implementation is open-source, extensible, integrated into the Julia optimisation ecosystem and performs well on a variety of large convex problem classes.

M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “SLATE: Design of a modern distributed and accelerated linear algebra library,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19, Denver, Colorado: ACM, 2019, 26:1–26:18, isbn: 978-1-4503-6229-0. doi: 10.1145/3295500.3356223.

Abstract: The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retroﬁtted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components.

I. Georgieva, S. Harizanov, and C. Hofreither, “Iterative low-rank approximation solvers for the extension method for fractional diﬀusion,” Johann Radon Institute for Computational and Applied Mathematics (RICAM), Tech. Rep. RICAM-Report 2019-14, 2019.

Abstract: We consider the numerical method for fractional diﬀusion problems which is based on an extension to a mixed boundary value problem for a local operator in a higher dimensional space. We observe that, when this problem is discretized using tensor product spaces as is commonly done, the solution can be very well approximated by low-rank tensors. This motivates us to apply iterative low-rank approximation algorithms in order to eﬃciently solve this extended problem. In particular, we employ a recently proposed greedy Tucker approximation method as well as a more classical greedy rank one update method. Throughout, all objects of interest are kept in suitable low-rank approximations, which dramatically reduces the required amount of memory compared to the full formulation of the extended problem.
Our approach can be used for general, non-structured space discretizations. If the space discretization itself has tensor product structure, we can further decompose the problem in order to deal with even lower dimensional objects. We also note that the approach can be directly applied to higher-order discretizations both in space and the extended variable.
In several numerical examples, we demonstrate the convergence behaviour of the proposed methods. In particular, the Tucker approximation approach requires only a few iterations in order to reach the discretization error in all tested settings.

M. B. Giles, A. Jentzen, and T. Welti, “Generalised multilevel picard approximations,” Nov. 8, 2019. arXiv: http://arxiv.org/abs/1911.03188v1 [math.NA].

Abstract: It is one of the most challenging problems in applied mathematics to approximatively solve high-dimensional partial diﬀerential equations (PDEs). In particular, most of the numerical approximation schemes studied in the scientiﬁc literature suﬀer under the curse of dimensionality in the sense that the number of computational operations needed to compute an approximation with an error of size at most 𝜀 > 0 grows at least exponentially in the PDE dimension d ∈ ℕ or in the reciprocal of 𝜀. Recently, so-called full-history recursive multilevel Picard (MLP) approximation methods have been introduced to tackle the problem of approximately solving high-dimensional PDEs. MLP approximation methods currently are, to the best of our knowledge, the only methods for parabolic semi-linear PDEs with general time horizons and general initial conditions for which there is a rigorous proof that they are indeed able to beat the curse of dimensionality. The main purpose of this work is to investigate MLP approximation methods in more depth, to reveal more clearly how these methods can overcome the curse of dimensionality, and to propose a generalised class of MLP approximation schemes, which covers previously analysed MLP approximation schemes as special cases. In particular, we develop an abstract framework in which this class of generalised MLP approximations can be formulated and analysed and, thereafter, apply this abstract framework to derive a computational complexity result for suitable MLP approximations for semi-linear heat equations. These resulting MLP approximations for semi-linear heat equations essentially are generalisations of previously introduced MLP approximations for semi-linear heat equations.

S. Goldenberg, A. Stathopoulos, and E. Romero, “A golub–kahan davidson method for accurately computing a few singular triplets of large sparse matrices,” SIAM Journal on Scientiﬁc Computing, vol. 41, no. 4, A2172–A2192, 2019. doi: 10.1137/18M1222004.

Abstract: Obtaining high accuracy singular triplets for large sparse matrices is a signiﬁcant challenge, especially when searching for the smallest triplets. Due to the diﬃculty and size of these problems, eﬃcient methods must function iteratively, with preconditioners, and under strict memory constraints. In this research, we present a Golub–Kahan Davidson method (GKD), which satisﬁes these requirements and includes features such as soft-locking with orthogonality guarantees, an inner correction equation similar to Jacobi–Davidson, locally optimal +k restarting, and the ability to ﬁnd real zero singular values in both square and rectangular matrices. Additionally, our method achieves full accuracy while avoiding the augmented matrix, which often converges slowly for the smallest triplets due to the diﬃculty of interior eigenvalue problems. We describe our method in detail, including implementation issues that arise. Our experimental results conﬁrm the eﬃciency and stability of our method over the current implementation of PHSVDS in the PRIMME software package.

A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “SparTen: A sparse tensor accelerator for convolutional neural networks,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52, Columbus, OH, USA: ACM, 2019, pp. 151–165, isbn: 978-1-4503-6938-1. doi: 10.1145/3352460.3358291.

Abstract: Convolutional neural networks (CNNs) are emerging as powerful tools for image processing. Recent machine learning work has reduced CNNs’ compute and data volumes by exploiting the naturally-occurring and actively-transformed zeros in the feature maps and ﬁlters. While previous semi-sparse architectures exploit one-sided sparsity either in the feature maps or the ﬁlters, but not both, a recent fully-sparse architecture, called Sparse CNN (SCNN), exploits two-sided sparsity to improve performance and energy over dense architectures. However, sparse vector-vector dot product, a key primitive in sparse CNNs, would be ineﬃcient using the representation adopted by SCNN. The dot product requires ﬁnding and accessing non-zero elements in matching positions in the two sparse vectors – an inner join using the position as the key with a single value ﬁeld. SCNN avoids the inner join by performing a Cartesian product capturing the relevant multiplications. However, SCNN’s approach incurs several considerable overheads and is not applicable to non-unit-stride convolutions. Further, exploiting reuse in sparse CNNs fundamentally causes systematic load imbalance not addressed by SCNN. We propose SparTen which achieves eﬃcient inner join by providing support for native two-sided sparse execution and memory storage. To tackle load imbalance, SparTen employs a software scheme, called greedy balancing, which groups ﬁlters by density via two variants, a software-only one which uses whole-ﬁlter density and a software-hardware hybrid which uses ﬁner-grain density. Our simulations show that, on average, SparTen performs 4.7×, 1.8×, and 3× better than a dense architecture, one-sided sparse architecture, and SCNN, respectively. An FPGA implementation shows that SparTen performs 4.3× and 1.9× better than a dense architecture and a one-sided sparse architecture, respectively.

Z. Gong, H. Ji, C. Fletcher, C. Hughes, and J. Torrellas, “Sparsetrain:leveraging dynamic sparsity in training dnns on general-purpose simd processors,” Nov. 22, 2019. arXiv:1911.10175 [cs.LG].

Abstract: Our community has greatly improved the eﬃciency of deep learning applications, including by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. We propose a scheme to leverage dynamic sparsity during training. In particular, we exploit zeros introduced by the ReLU activation function to both feature maps and their gradients. This is challenging because the sparsity degree is moderate and the locations of zeros change over time. We also rely purely on software. We identify zeros in a dense data representation without transforming the data and performs conventional vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our method signiﬁcantly outperforms a highly-optimized dense direct convolution on several popular deep neural networks. At realistic sparsity, we speed up the training of the non-initial convolutional layers in VGG16, ResNet-34, ResNet-50, and Fixup ResNet-50 by 2.19×, 1.37×, 1.31×, and 1.51× respectively on an Intel Skylake-X CPU.

L. le Gorrec, S. Mouysset, I. S. Duﬀ, P. A. Knight, and D. Ruiz, “Uncovering hidden block structure for clustering,” in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery, ser. ECMLPKDD’19, 2019.

Abstract: We present a multistage procedure to cluster directed and undirected weighted graphs by ﬁnding the block structure of their adjacency matrices. A central part of the process is to scale the adjacency matrix into a doubly-stochastic form, which permits detection of the whole matrix block structure with minimal spectral information (theoretically a single pair of singular vectors suﬃces).We present the diﬀerent stages of our method, namely the impact of the doubly-stochastic scaling on singular vectors, detection of the block structure by means of these vectors, and details such as cluster reﬁnement and a stopping criterion. Then we test thealgorithm’s eﬀectiveness by using it on two unsupervised classiﬁcation tasks: community detection in networks and shape detection in cloudsof points in two dimensions. By comparing results of our approach with thoseof widely used algorithms designed for speciﬁc purposes, we observe that our method is competitive (for community detection) if not superior (for shape detection) in comparison with existing methods.

T. Grützmacher, T. Cojean, G. Flegar, F. Göbel, and H. Anzt, “A customized precision format based on mantissa segmentation for accelerating sparse linear algebra,” Concurrency and Computation: Practice and Experience, 2019. doi: 10.1002/cpe.5418. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.5418.

Abstract: In this work, we pursue the idea of radically decoupling the ﬂoating point format used for arithmetic operations from the format used to store the data in memory. We complement this idea with a customized precision memory format derived by splitting the mantissa (signiﬁcand) of standard IEEE formats into segments, such that values can be accessed faster if lower accuracy is acceptable. Combined with precision-aware algorithms that dynamically adapt the data access accuracy to the numerical requirements, the customized precision memory format can render attractive runtime savings without impacting the memory footprint of the data or the accuracy of the ﬁnal result. In an experimental analysis using the adaptive precision Jacobi method on diagonalizable test problems, we assess the beneﬁts of the mantissa-segmenting customized precision format on recent multi- and manycore architectures.

J. Han, K. Xiong, F. Nie, and X. Li, “Structured graph reconstruction for scalable clustering,” IEEE Transactions on Knowledge and Data Engineering, 2019. doi: 10.1109/TKDE.2019.2948850.

Abstract: Spectral clustering is a quite simple but eﬀective method for solving graph clustering problem. It ﬁrst embeds the original data into a lower dimensional space with spectral analysis, and then relys on an algorithm to obtain the ﬁnal cluster labels. Since it involves eigen-decomposition of the graph Laplacian matrix for spectral embedding, spectral clustering suﬀers from high computational cost as data grow in scale. It is also limited by the performance of post-processing algorithm such as kmeans. To address these two issues, in this paper, we propose a novel approach denoted by Orthogonal and Nonnegative Graph Reconstruction (ONGR) for large scale clustering. The two constraints are served as the structure constraint under which the graph reconstructed by the indicator matrix is structured. The proposed method mainly needs to perform economical singular value decomposition for small size matrix thus it scales linearly with the data size. Moreover, the interpretability of the indicator matrix is oﬀered due to the nonnegative constraint. Therefore, the ﬁnal cluster labels can be directly obtained without post-processing. Extensive experiments show the eﬀectiveness of the proposed method.

R. Han, Y. You, and J. Demmel, “Auto-precision scaling for distributed deep learning,” Nov. 20, 2019. arXiv:1911.08907 [cs.DC].

Abstract: In recent years, large-batch optimization is becoming the key of distributed deep learning. However, large-batch optimization is hard. Straightforwardly porting the code often leads to a signiﬁcant loss in testing accuracy. As some researchers suggested that large batch optimization leads to a low generalization performance, and they further conjectured that large-batch training needs a higher ﬂoating-point precision to achieve a higher generalization performance. To solve this problem, we conduct an open study in this paper. Our target is to ﬁnd the number of bits that large-batch training needs. To do so, we need a system for customized precision study. However, state-of-the-art systems have some limitations that lower the eﬃciency of developers and researchers. To solve this problem, we design and implement our own system CPD: A High Performance System for Customized-Precision Distributed DL. In our experiments, our application often loses accuracy if we use a very-low precision (e.g. 8 bits or 4 bits). To solve this problem, we proposed the APS (Auto-Precision-Scaling) algorithm, which is a layer-wise adaptive scheme for gradients shifting. With APS, we are able to make the large-batch training converge with only 4 bits.

R. R. Hannah, “Fundamental results on asynchronous parallel optimization algorithms,” PhD thesis, UCLA, 2019.

Abstract: In this thesis, we present a body of work on the performance and convergence properties of asynchronous-parallel algorithms completed over the course of my doctorate degree (Hannah, Feng, and Wotao Yin 2018; Hannah and Wotao Yin 2017b; T. Sun, Hannah, and Wotao Yin 2017; Hannah and Wotao Yin 2017a). Asynchronous algorithms eliminate the costly synchronization penalty of traditional synchronous-parallel algorithms. They do this by having computing nodes utilize the most recently available information to compute updates. However, it’s not immediately clear whether the trade-oﬀ of eliminating synchronization penalty at the cost of using outdated information is favorable.
We ﬁrst give a comprehensive theoretical justiﬁcation of the performance advantages of asynchronous algorithms, which we summarize as ”Faster Iterations, Same Quality” (Hannah and Wotao Yin 2017a). Under a well-justiﬁed model, we show that asynchronous algorithms complete ”Faster Iterations”. Using renewal theory, we demonstrate how network delays, heterogeneous sub-problem diﬃculty and computing power greatly hinder synchronous algorithms, but have no impact on their asynchronous counterparts. We next prove the ﬁrst exact convergence rate results for a variety of synchronous algorithms including synchronous ARock and synchronous randomized block coordinate descent (sync-RBCD). This allows us to make a fair comparison between these algorithms and their asynchronous counterparts.
Finally we show that a variety of asynchronous algorithms have a convergence rate that essentially matches the previously derived exact rates for synchronous counterparts so long as the delays are not too large. Hence asynchronous algorithms complete faster iteration that are of the ”Same Quality” as synchronous algorithms. Therefore we conclude that a wide variety of asynchronous algorithms will always outcompete their synchronous counterparts if the delays are not too large, and especially at scale.
Next we present the ﬁrst asynchonous Nesterov-accelerated algorithm that attains a speedup: A2BCD (Hannah, Feng, and Wotao Yin 2018). We ﬁrst prove that A2BCD attains NU_ACDM’s complexity to highest order. NU_ACDM is a state-of-the-art accelerated coordinate descent algorithm (Allen-Zhu, Qu, et al. 2016). Then we show that both A2BCD and NU_ACDM both have optimal complexity. Hence because A2BCD has faster iterations, and optimal complexity, it should be the fastest coordinate descent algorithm. We verify this with numerical experiments comparing A2BCD with NU_ACDM. We ﬁnd that A2BCD is up to 4-5× faster than NU_ACDM, and hence conclude that our algorithm is the current fastest coordinate descent algorithm that exists. Finally we derive a second-order ODE, which is the continuoustime limit of A2BCD. The ODE analysis motivates and clariﬁes our proof strategy.
Lastly, we present earlier foundational work that comprises the basis of the technical innovations that made the previous results possible (Hannah and Wotao Yin 2017b). We show that ARock and its many special cases may converge even under unbounded delays (both stochastic and deterministic). These results sidestep longstanding impossibility results derived in the 1980s by making slightly stronger assumptions. They were also an early demonstration of the power of meticulous Lyapunov-function construction techniques pioneered in this body of work.

D. Harvey and J. van der Hoeven, “Integer multiplication in time o(nlogn),” HAL archives, Tech. Rep. hal-02070778, 2019. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02070778.

Abstract: We present an algorithm that computes the product of two n-bitintegers in O(nlogn) bit operations.

K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “ExTensor: An accelerator for sparse tensor algebra,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52, Columbus, OH, USA: ACM, 2019, pp. 319–333, isbn: 978-1-4503-6938-1. doi: 10.1145/3352460.3358275.

A. E. Helal, A. M. Aji, M. L. Chu, B. M. Beckmann, and W.-c. Feng, “Adaptive task aggregation for high-performance sparse solvers on GPUs,” in The 28th International Conference on Parallel Architectures and Compilation, ser. PACT’19, 2019.

Abstract: Sparse solvers are heavily used in computational ﬂuid dynamics (CFD), computer-aided design (CAD), and other important application domains. These solvers remain challenging to execute on massively parallel architectures, due to the sequential dependencies between the ﬁne-grained application tasks. In particular, parallel sparse solvers typically suﬀer from substantial scheduling and dependency-management overheads relative to the compute operations. We propose adaptive task aggregation (ATA) to eﬃciently execute such irregular computations on GPU architectures via hierarchical dependency management and lowlatency task scheduling. On a gamut of representative problems with diﬀerent data-dependency structures, ATA signiﬁcantly outperforms existing GPU task-execution approaches, achieving a geometric mean speedup of 2.2× to 3.7× across diﬀerent sparse kernels (with speedups of up to two orders of magnitude).

T. M. Hernandez, R. V. Beeumen, M. A. Caprio, and C. Yang, “A greedy algorithm for computing eigenvalues of a symmetric matrix,” Nov. 20, 2019. arXiv:1911.10041 [physics.comp-ph].

Abstract: We present a greedy algorithm for computing selected eigenpairs of a large sparse matrix H that can exploit localization features of the eigenvector. When the eigenvector to be computed is localized, meaning only a small number of its components have large magnitudes, the proposed algorithm identiﬁes the location of these components in a greedy manner, and obtains approximations to the desired eigenpairs of H by computing eigenpairs of a submatrix extracted from the corresponding rows and columns of H. Even when the eigenvector is not completely localized, the approximate eigenvectors obtained by the greedy algorithm can be used as good starting guesses to accelerate the convergence of an iterative eigensolver applied to H. We discuss a few possibilities for selecting important rows and columns of H and techniques for constructing good initial guesses for an iterative eigensolver using the approximate eigenvectors returned from the greedy algorithm. We demonstrate the eﬀectiveness of this approach with examples from nuclear quantum many-body calculations and many-body localization studies of quantum spin chains.

J. Herrmann, M. Özkaya, B. Uçar, K. Kaya, and Ü. Çatalyürek, “Multilevel algorithms for acyclic partitioning of directed acyclic graphs,” SIAM Journal on Scientiﬁc Computing, vol. 41, no. 4, A2117–A2145, 2019. doi: 10.1137/18M1176865.

Abstract: We investigate the problem of partitioning the vertices of a directed acyclic graph into a given number of parts. The objective function is to minimize the number or the total weight of the edges having end points in diﬀerent parts, which is also known as the edge cut. The standard load balancing constraint of having an equitable partition of the vertices among the parts should be met. Furthermore, the partition is required to be acyclic; i.e., the interpart edges between the vertices from diﬀerent parts should preserve an acyclic dependency structure among the parts. In this work, we adopt the multilevel approach with coarsening, initial partitioning, and reﬁnement phases for acyclic partitioning of directed acyclic graphs. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and eﬃcient coarsening and reﬁnement heuristics. The quality of the computed acyclic partitions is assessed by computing the edge cut. We also propose eﬀective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset consisting of (i) graphs coming from an application and (ii) some others corresponding to matrices from a public collection. We report signiﬁcant improvements compared to the current state of the art.

C. F. Higham and D. J. Higham, “Deep learning: An introduction for applied mathematicians,” SIAM Review, vol. 61, no. 3, pp. 860–891, Jan. 2019. doi: 10.1137/18m1165748.

Abstract: Multilayered artiﬁcial neural networks are becoming a pervasive tool in a host of application ﬁelds. At the heart of this deep learning revolution are familiar concepts from applied and computational mathematics, notably from calculus, approximation theory, optimization, and linear algebra. This article provides a very brief introduction to the basic ideas that underlie deep learning from an applied mathematics perspective. Our target audience includes postgraduate and ﬁnal year undergraduate students in mathematics who are keen to learn about the area. The article may also be useful for instructors in mathematics who wish to enliven their classes with references to the application of deep learning techniques. We focus on three fundamental questions: What is a deep neural network? How is a network trained? What is the stochastic gradient method? We illustrate the ideas with a short MATLAB code that sets up and trains a network. We also demonstrate the use of state-of-the-art software on a large scale image classiﬁcation problem. We ﬁnish with references to the current literature.

N. Higham and T. Mary, “Solving block low-rank linear systems by LU factorization is numerically stable,” 2019. MIMS EPrint: MIMS EPrint:2019.15. [Online]. Available: http://eprints.maths.manchester.ac.uk/2733/1/paper.pdf.

Abstract: Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientiﬁc computing. However, these units are inherently mixed precision and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give rounding error analyses of both kernels. An important application is to GMRES-based iterative reﬁnement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which diﬀer in the precision used for the output of the tensor cores. Our experiments on an NVDIA V100 GPU conﬁrm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, while achieving almost the same performance.

N. Higham and S. Pranesh, “Exploiting lower precision arithmetic in solving symmetric positive deﬁnite linear systems and least squares problems,” 2019. MIMS EPrint: MIMS EPrint:2019.20. [Online]. Available: http://eprints.maths.manchester.ac.uk/2736/1/paper.pdf.

Abstract: What is the fastest way to solve a linear system Ax = b in arithmetic of a given precision when A is symmetric positive deﬁnite and otherwise unstructured? The usual answer is by Cholesky factorization, assuming that A can be factorized. We develop an algorithm that can be faster, given an arithmetic of precision lower than the working precision as well as (optionally) one of higher precision. The arithmetics might, for example, be of precisions half, single, and double; half and double, possibly with quadruple; or single and double, possibly with quadruple. We compute a Cholesky factorization at the lower precision and use the factors as preconditioners in GMRES-based iterative reﬁnement. To avoid breakdown of the factorization we shift the matrix by a small multiple of its diagonal. We explain why this is preferable to the common approach of shifting by a multiple of the identity matrix, We also incorporate scaling in order to avoid overﬂow and reduce the chance of underﬂow when working in IEEE half precision arithmetic. We extend the algorithm to solve a linear least squares problem with a well conditioned coeﬃcient matrix by forming and solving the normal equations. In both algorithms most of the work is done at low precision provided that iterative reﬁnement and the inner iterative solver converge quickly. We explain why replacing GMRES by the conjugate gradient method causes convergence guarantees to be lost, but show that this change has little eﬀect on convergence in practice. Our numerical experiments conﬁrm the potential of the new algorithms to provide faster solutions in environments that support multiple precisions of arithmetic.

M. Hoemmen, J. Badwaik, M. Brucher, A. ( Iliopoulos, and J. Michopoulos, “Historical lessons for c++ linear algebra library standardization,” ISO C++ standards meeting (Kona), Tech. Rep. P1417R0, 2019. [Online]. Available: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1417r0.pdf.

C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, “Adaptive sparse tiling for sparse matrix multiplication,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19, Washington, District of Columbia: ACM, 2019, pp. 300–314, isbn: 978-1-4503-6225-2. doi: 10.1145/3293883.3295712.

Abstract: Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern of sparse matrix multiplication makes it challenging to use tiling to enhance data reuse. In this paper, we devise an adaptive tiling strategy and apply it to enhance the performance of two primitives: SpMM (product of sparse matrix and dense matrix) and SDDMM (sampled dense-dense matrix multiplication). In contrast to studies that have resorted to non-standard sparse-matrix representations to enhance performance, we use the standard Compressed Sparse Row (CSR) representation, within which intra-row reordering is performed to enable adaptive tiling. Experimental evaluation using an extensive set of matrices from the Sparse Suite collection demonstrates signiﬁcant performance improvement over currently available state-of-the-art alternatives.

S. Hossain and M. S. Mahmud, “On computing with diagonally structured matrices,” in Proceedings of the IEEE High Performance Extreme Computing Conference, ser. HPEC ’19, Sep. 2019, pp. 1–6. doi: 10.1109/HPEC.2019.8916325.

Abstract: We present a storage scheme for storing matrices by diagonals and algorithms for performing matrix-matrix and matrix-vector multiplication by diagonals. Matrix elements are accessed with stride-1 and involve no indirect referencing. Access to the transposed matrix requires no additional eﬀort. The proposed storage scheme handles dense matrices and matrices with special structure e.g., banded, triangular, symmetric in a uniform manner. Test results from preliminary numerical experiments with an OpenMP implementation of our method are encouraging.

Y. Hu, T.-M. Li, L. Anderson, J. Ragan-Kelley, and F. Durand, “Taichi: A language for high-performance computation on spatially sparse data structures,” ACM Transactions on Graphics, vol. 38, no. 6, 2019.

Abstract: 3D visual computing data are often spatially sparse. To exploit such sparsity, people have developed hierarchical sparse data structures, such as multilevel sparse voxel grids, particles, and 3D hash tables. However, developing and using these high-performance sparse data structures is challenging, due to their intrinsic complexity and overhead. We propose Taichi, a new data-oriented programming language for eﬃciently authoring, accessing, and maintaining such data structures. The language oﬀers a high-level, data structure-agnostic interface for writing computation code. The user independently speciﬁes the data structure. We provide several elementary components with diﬀerent sparsity properties that can be arbitrarily composed to create a wide range of multi-level sparse data structures. This decoupling of data structures from computation makes it easy to experiment with diﬀerent data structures without changing computation code, and allows users to write computation as if they are working with a dense array. Our compiler then uses the semantics of the data structure and index analysis to automatically optimize for locality, remove redundant operations for coherent accesses, maintain sparsity and memory allocations, and generate eﬃcient parallel and vectorized instructions for CPUs and GPUs.
Our approach yields competitive performance on common computational kernels such as stencil applications, neighbor lookups, and particle scattering. We demonstrate our language by implementing simulation, rendering, and vision tasks including a material point method simulation, ﬁnite element analysis, a multigrid Poisson solver for pressure projection, volumetric path tracing, and 3D convolution on sparse grids. Our computation-data structure decoupling allows us to quickly experiment with diﬀerent data arrangements, and to develop high-performance data structures tailored for speciﬁc computational tasks. With th as many lines of code, we achieve 4.55× higher performance on average, compared to hand-optimized reference implementations.

T. K. Huckle. (2019). Accelerated jacobi iterations for bidiagonal and sparse triangular matrices, [Online]. Available: https://www5.in.tum.de/persons/huckle/it_triang.pdf.

Abstract: In many applications a sparse linear system of equations Ax = b has to be solved. For applying iterative solvers like preconditioned conjugate gradient (pcg) or GMRES, eﬀective preconditioners are necessary, e.g. Jacobi, Gauss-Seidel, or incomplete LU factorization (ILU). Often, eﬀective preconditioners are given via sparse triangular matrices L, that have to be solved in every iteration step. Recent work by Edmond Chow introduced an easy to parallelize ﬁxed-point iteration for computing approximations to (I)LU factorizations. Therefore, the aching handicap in parallel solution methods for sparse matrices is the solving of sparse triangular systems, e.g. bidiagonal matrices. In a parallel environment direct solvers can take only restricted advantage of parallelism. Therefore, in this paper we develop a fast iterative solution method for sparse triangular matrices. In contrast to direct solvers for triangular matrices L like graph-based methods, sparse factorization methods, or Sherman-Morrison-Woodbury, here we want to consider stationary Jacobi iterations. In its original form the Jacobi iteration for ill-conditioned matrices can lead to very slow convergence. Therefore, we introduce diﬀerent acceleration tools like preconditioning (block Jacobi and Incomplete Sparse Approximate Inverse ISAI), and a recursive acceleration of the Jacobi method. Here the Neumann series is replaced by the Euler expansion (see [4, 19, 8]). This is derived by a recursive computation of the Neumann series using powers of the initial Jacobi iteration matrix. The goal is to shift the major part of the operations from cheap but numerous iteration steps to better parallelizable cheap and sparse matrix-matrix products reducing the number of necessary iterations considerably, e.g. to less than log ~2~(n) for an n × n matrix.

I. Duﬀ, J. Hogg, and F. Lopez. A new sparse symmetric indeﬁnite solver using A Posteriori Threshold Pivoting. Technical Report RAL-TR-2018-008. Science & Technology Facilities Council, UK, 2019.

Abstract: The factorization of sparse symmetric indeﬁnite systems is particularly challenging since pivoting is required to maintain stability of the factorization. Pivoting techniques generally oﬀer limited parallelism and are associated with signiﬁcant data movement hindering the scalability of these methods. Variants of the Threshold Partial Pivoting (TPP) algorithm for example have been often used because of its numerical robustness but standard implementations exhibit poor parallel performance. On the other hand, some methods trade stability for performance on parallel architectures such as the Supernode Bunch-Kaufman (SBK) used in the PARDISO solver. In this case, however, the factors obtained might not be used to accurately compute the solution of the system. For this reason we have designed a task-based LDLT factorization algorithm based on a new pivoting strategy called A Posteriori Threshold Pivoting (APTP) that is much more suitable for modern multicore architectures and has the same numerical robustness as the TPP strategy. We implemented our algorithm in a new version of the SPRAL Sparse Symmetric Indeﬁnite Direct Solver (SSIDS) which initially supported GPU-only factorization. We have used OpenMP 4 task features to implement a multifrontal algorithm with dense factorizations using the novel APTP, and we show that it performs favourably compared to the state-of-the-art solvers HSL_MA86, HSL_MA97 and PARDISO both in terms of performance on a multicore machine and in terms of numerical robustness. Finally we show that this new solver is able to make use of GPU devices for accelerating the factorization on heterogeneous architectures.

S. Idreos, N. Dayan, W. Qin, M. Akmanalp, S. Hilgard, A. Ross, J. Lennon, V. Jain, H. Gupta, D. Li, and Z. Zhu, “Design continuums and the path toward self-designing key-value stores that know and learn,” in Biennial Conference on Innovative Data Systems Research, 2019.

Abstract: We introduce the concept of design continuums for the data layout of key-value stores. A design continuum uniﬁes major distinct data structure designs under the same model. The critical insight and potential long-term impact is that such unifying models 1) render what we consider up to now as fundamentally diﬀerent data structures to be seen as views” of the very same overall design space, and 2) allow seeing” new data structure designs with performance properties that are not feasible by existing designs. The core intuition behind the construction of design continuums is that all data structures arise from the very same set of fundamental design principles, i.e., a small set of data layout design concepts out of which we can synthesize any design that exists in the literature as well as new ones. We show how to construct, evaluate, and expand, design continuums and we also present the ﬁrst continuum that uniﬁes major data structure designs, i.e., B+Tree, BeTree, LSM-tree, and LSH-Table.
The practical beneﬁt of a design continuum is that it creates a fast inference engine for the design of data structures. For example, we can near instantly predict how a speciﬁc design change in the underlying storage of a data system would aﬀect performance, or reversely what would be the optimal data structure (from a given set of designs) given workload characteristics and a memory budget. In turn, these properties allow us to envision a new class of self-designing key-value stores with a substantially improved ability to adapt to workload and hardware changes by transitioning between drastically diﬀerent data structure designs to assume a diverse set of performance properties at will.

E. I. Ioannidis, N. Cheimarios, A. N. Spyropoulos, and A. G. Boudouvis, “On the performance of various parallel GMRES implementations on CPU and GPU clusters,” 2019. arXiv:1906.04051v1.

T. Iwashita, S. Li, and T. Fukaya, “Hierarchical block multi-color ordering: A new parallel ordering method for vectorization and parallelization of the sparse triangular solver in the iccg method,” 2019. arXiv: 1908.00741v1 [cs.DC].

Abstract: In this paper, we propose a new parallel ordering method to vectorize and parallelize the sparse triangular solver, which is called hierarchical block multi-color ordering. In this method, the parallel forward and backward substitutions can be vectorized while preserving the advantages of block multi-color ordering, that is, fast convergence and fewer thread synchronizations. To evaluate the proposed method in a parallel ICCG (Incomplete Cholesky Conjugate Gradient) solver, numerical tests were conducted using ﬁve test matrices on three types of computational nodes. The numerical results indicate that the proposed method outperforms the conventional block and nodal multi-color ordering methods in 13 out of 15 test cases, which conﬁrms the eﬀectiveness of the method.

H. Jagode, A. Danalis, H. Anzt, and J. Dongarra, “Papi software-deﬁned events for in-depth performance analysis,” The International Journal of High Performance Computing Applications, 2019. doi: 10.1177/1094342019846287.

Z. Jia, M. Maggioni, J. Smith, and D. P. Scarpazza, “Dissecting the NVidia Turing T4 GPU via microbenchmarking,” Mar. 18, 2019. arXiv:1903.07486v1 [cs.DC].

Abstract: In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August 2018 of Turing, NVidia’s latest architecture, pressed us to update our study. In this report, we examine Turing and compare it quantitatively against previous NVidia GPU generations. Speciﬁcally, we study the T4 GPU: a low-power board aiming at inference applications. We describe its improvements against its inference-oriented predecessor: the P4 GPU based on the Pascal architecture. Both T4 and P4 GPUs achieve signiﬁcantly higher frequency-per-Watt ﬁgures than their full-size counterparts. We study the performance of the T4’s TensorCores, ﬁnding a much higher throughput on low-precision operands than on the P4 GPU. We reveal that Turing introduces new instructions that express matrix math more succinctly. We map Turing’s instruction space, ﬁnding the same encoding as Volta, and additional instructions. We reveal that the Turing TU104 chip has the same memory hierarchy depth as the Volta GV100; cache levels sizes on the TU104 are frequently twice as large as those found on the Pascal GP104. We benchmark each constituent of the T4 memory hierarchy and ﬁnd substantial overall performance improvements over its P4 predecessor. We studied how clock throttling aﬀects compute-intensive workloads that hit power or thermal limits. Many of our ﬁndings are novel, published here for the ﬁrst time. All of them can guide high-performance software developers get closer to the GPU’s peak performance.

Y. Jiang, D. Kouzoupis, H. Yin, M. Diehl, and B. Houska, “Decentralized optimization over tree graphs,” Oct. 21, 2019. arXiv:1910.09206 [math.OC].

Abstract: This paper presents a decentralized algorithm for non-convex optimization over tree-structured networks. We assume that each node of this network can solve small-scale optimization problems and communicate approximate value functions with its neighbors based on a novel multi-sweep communication protocol. In contrast to existing parallelizable optimization algorithms for non-convex optimization the nodes of the network are neither synchronized nor assign any central entity. None of the nodes needs to know the whole topology of the network, but all nodes know that the network is tree-structured. We discuss conditions under which locally quadratic convergence rates can be achieved. The method is illustrated by running the decentralized asynchronous multi-sweep protocol on a radial AC power network case study.

K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi, S. Koppula, N. M. Ghiasi, T. Shahroodi, J. G. Luna, and O. Mutlu, “SMASH: Co-designing software compression and hardware-accelerated indexing for eﬃcient sparse matrix operations,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52, Columbus, OH, USA: ACM, 2019, pp. 600–614, isbn: 978-1-4503-6938-1. doi: 10.1145/3352460.3358286.

Abstract: Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an eﬀective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce signiﬁcant instruction overhead and expensive pointer-chasing operations to discover the positions of the non-zero elements. In this paper, we identify the discovery of the positions (i.e., indexing) of non-zero elements as a key bottleneck in sparse matrix-based workloads, which greatly reduces the beneﬁts of compression.
We propose SMASH, a hardware-software cooperative mechanism that enables highly-eﬃcient indexing and storage of sparse matrices. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit sparsity in data. To this end, we devise a novel software encoding based on a hierarchy of bitmaps. This encoding can be used to eﬃciently compress any sparse matrix, regardless of the extent and structure of sparsity. At the same time, the bitmap encoding can be directly interpreted by the hardware. We design a lightweight hardware unit, the Bitmap Management Unit (BMU), that buﬀers and scans the bitmap hierarchy to perform highly-eﬃcient indexing of sparse matrices. SMASH exposes an expressive and rich ISA to communicate with the BMU, which enables its use in accelerating any sparse matrix computation.
We demonstrate the beneﬁts of SMASH on four use cases that include sparse matrix kernels and graph analytics applications. Our evaluations show that SMASH provides average performance improvements of 38% for Sparse Matrix Vector Multiplication and 44% for Sparse Matrix Matrix Multiplication, over a state-of-the-art CSR implementation, on a wide variety of matrices with diﬀerent characteristics. SMASH incurs a very modest hardware area overhead of up to 0.076% of an out-of-order CPU core.

K. Kawaguchi and L. Pack Kaelbling. (Apr. 2019). Every local minimum is a global minimum of an induced model. arXiv:1904.03673 [stat.ML].

Abstract: For non-convex optimization in machine learning, this paper proves that every local minimum achieves the global optimality of the perturbable gradient basis model at any diﬀerentiable point. As a result, non-convex machine learning is theoretically as supported as convex machine learning with a hand-crafted basis in terms of the loss at diﬀerentiable local minima, except in the case when a preference is given to the hand-crafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modiﬁcation of practical methods. Furthermore, as special cases of our general results, this paper improves or complements several state-of-the-art theoretical results in the literature with a simple and uniﬁed proof technique.

H. Kim, S. Hong, J. Park, and H. Han, “Static code transformations for thread-dense memory accesses in GPU computing,” Concurrency and Computation: Practice and Experience, Oct. 2019. doi: 10.1002/cpe.5512.

U. Kiran, S. Sanfui, S. K. Ratnakar, S. S. Gautam, and D. Sharma, “Comparative analysis of GPU-based solver libraries for a sparse linear system of equations,” in Advances in Computational Methods in Manufacturing, R. G. Narayanan, S. N. Joshi, and U. S. Dixit, Eds., Singapore: Springer Singapore, 2019, pp. 889–897, isbn: 978-981-32-9072-3.

Abstract: In this paper, a comparison of GPU-based linear solverLinear solver libraries for the solution of sparse positive-deﬁnite matrices is presented. These large sparse matrices arise in a number of computational disciplines seeking a solution for partial diﬀerential equations. The solution of these matrices is often a time-consuming process that can be reduced by parallel computingParallel computing. Since the development of GPU for general-purpose computing, a number of numerical solver libraries have evolved that can accelerate the solution procedure. The performance of three solver libraries has been evaluated in this paper for ﬁve diﬀerent test matrices. These test matrices have been taken from diﬀerent application domains with diﬀerent sparsity patterns. Results demonstrate a higher speedup from the iterative solver over the direct solver on GPU and also over a multithreaded CPU implementation.

B. Klowckiewicz and E. Darve, “Sparse hierarchical preconditioners using piecewise smooth approximations of eigenvectors,” 2019. arXiv:1907.0.

Abstract: When solving linear systems arising from PDE discretizations, iterative methods (such as Conjugate Gradient, GMRES, or MINRES) are often the only practical choice. To converge in a small number of iterations, however, they have to be coupled with an eﬃcient preconditioner. The eﬃciency of the preconditioner depends largely on its accuracy on the eigenvectors corresponding to small eigenvalues, and unfortunately, black-box methods typically cannot guarantee suﬃcient accuracy on these eigenvectors. Thus, constructing the preconditioner becomes a very problemdependent task. We describe a hierarchical approximate factorization approach which addresses this issue by focusing on improving the accuracy on smooth eigenvectors (such eigenvectors typically correspond to the small eigenvalues). The improved accuracy is achieved by preserving the action of the factorized matrix on piecewise polynomial functions of the PDE domain. Based on the factorization, we propose a family of sparse preconditioners with O(n) or O(nlogn) construction complexities. Our methods exhibit the optimal O(n) solution times in benchmarks run on large elliptic problems of diﬀerent types, arising for example in ﬂow or mechanical simulations. In the case of the linear elasticity equation the preconditioners are exact on the near-kernel rigid body modes.

F. Kong, “A parallel monolithic multilevel schwarz preconditioner for the neutron transport criticality calculations with a nonlinear diﬀusion acceleration method,” 2019. arXiv: 1907.12590v1 [math.NA].

Abstract: The multigroup neutron transport criticality calculations using modern supercomputers have been widely employed in a nuclear reactor analysis for studying whether or not a system is self-sustaining. However, the design and development of an eﬃcient parallel algorithm for the transport criticality calculations is a challenging task especially when the number of processor cores is large and the unstructured mesh is adopted since both the compute time and the memory usage need to be taken into consideration. In this paper, we study a monolithic multilevel Schwarz preconditioner for the transport criticality calculations using the nonlinear diﬀusion acceleration (NDA). In NDA, the linear systems of equations arising from the discretizations of the nonlinear diﬀusion equations and the transport equations need to be eﬃciently solved. To achieve this goal, we propose a monolithically coupled approach equipped with several important ingredients; e.g., subspace-based coarsening, aggressive coarsening and strength matrix thresholding. The proposed monolithic multilevel method is capable of eﬃciently handling the linear systems of equations for both the transport system and the diﬀusion system. In the multilevel method, the construction of coarse spaces is nontrivial and expensive. We propose a subspace-based coarsening algorithm to resolve this issue by exploring the matrix structures of the transport equations and the nonlinear diﬀusion equations. We numerically demonstrate that the monolithic multilevel preconditioner with the subspace-based coarsening algorithm is twice as fast as that equipped with a full space based coarsening approach on thousands of processor cores for an unstructured mesh neutron transport problem with billions of unknowns.

F. Kong, “Parallel memory-eﬃcient all-at-once algorithms for the sparse matrix triple products in multigrid methods,” The International Journal of High Performance Computing Applications, 2019. arXiv:1905.08423.

Q. Kong, Y.-F. Jing, T.-Z. Huang, and H.-B. An, “Acceleration of the scheduled relaxation jacobi method: Promising strategies for solving large, sparse linear systems,” Journal of Computational Physics, p. 108 862, 2019, issn: 0021-9991. doi: 10.1016/j.jcp.2019.108862. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0021999119305467.

Abstract: The main aim of this paper is to develop two algorithms based on the Scheduled Relaxation Jacobi (SRJ) method [J. Comput. Phys., 274 (2014), pp. 695-708] for solving problems arising from the ﬁnite-diﬀerence discretization of elliptic partial diﬀerential equations on large grids. These two algorithms are the Alternating Anderson-Scheduled Relaxation Jacobi (AASRJ) method by utilizing Anderson mixing after each SRJ iteration cycle and the Minimal Residual Scheduled Relaxation Jacobi (MRSRJ) method by minimizing residual after each SRJ iteration cycle, respectively. Through numerical experiments, we show that AASRJ is competitive with the optimal version of the SRJ method [J. Comput. Phys., 332 (2017), pp. 446-460] in most problems we considered here, and MRSRJ outperforms SRJ in all cases. The properties of AASRJ and MRSRJ are demonstrated. Both of them are promising strategies for solving large, sparse linear systems while maintaining the simplicity of the Jacobi method.

I. Koutis and H. Le, “Spectral modiﬁcation of graphs for improved spectral clustering,” in Proceedings of the 33rd Conference on Neural Information Processing Systems, ser. NeurIPS’ 19, Vancouver, CA, 2019.

Abstract: Spectral clustering algorithms provide approximate solutions to hard optimization problems that formulate graph partitioning in terms of the graph conductance. It is well understood that the quality of these approximate solutions is negatively aﬀected by a possibly signiﬁcant gap between the conductance and the second eigenvalue of the graph. In this paper we show that for any graph G, there exists a ‘spectral maximizer’ graph H which is cut-similar to G, but has eigenvalues that are near the theoretical limit implied by the cut structure of G. Applying then spectral clustering on H has the potential to produce improved cuts that also exist in G due to the cut similarity. This leads to the second contribution of this work: we describe a practical spectral modiﬁcation algorithm that raises the eigenvalues of the input graph, while preserving its cuts. Combined with spectral clustering on the modiﬁed graph, this yields demonstrably improved cuts.

G. Kouyialis, X. Wang, and R. Misener, “Symmetry detection for quadratic optimization using binary layered graphs,” Processes, vol. 7, no. 11, p. 838, Nov. 2019. doi: 10.3390/pr7110838.

Abstract: Symmetry in mathematical optimization may create multiple, equivalent solutions. In nonconvex optimization, symmetry can negatively aﬀect algorithm performance, e.g., of branch-and-bound when symmetry induces many equivalent branches. This paper develops detection methods for symmetry groups in quadratically-constrained quadratic optimization problems. Representing the optimization problem with adjacency matrices, we use graph theory to transform the adjacency matrices into binary layered graphs. We enter the binary layered graphs into the software package nauty that generates important symmetric properties of the original problem. Symmetry pattern knowledge motivates a discretization pattern that we use to reduce computation time for an approximation of the point packing problem. This paper highlights the importance of detecting and classifying symmetry and shows that knowledge of this symmetry enables quick approximation of a highly symmetric optimization problem.

S. R. Kuppannagari, R. Rajat, R. Kannan, A. Dasu, and V. K. Prasanna, “Ip cores for graph kernels on fpgas,” in Proceedings of the IEEE High Performance Extreme Computing Conference, ser. HPEC ’19, Sep. 2019, pp. 1–7. doi: 10.1109/HPEC.2019.8916363.

Abstract: Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for performing large scale graph analytics has led to widespread adoption of dedicated hardware accelerators such as FPGA for this purpose. In this work, we develop IP cores for several key graph kernels. Our IP cores use graph processing over partitions (GPOP) programming paradigm to perform computations over graph partitions. Partitioning the input graph into nonoverlapping partitions improves on-chip data reuse. Additional optimizations to exploit intra and interpartition parallelism and to reduce external memory accesses are also discussed. We generate FPGA designs for general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). We target a platform consisting of large external DDR4 memory to store the graph data and Intel Stratix FPGA to accelerate the processing. Experimental results show that our accelerators sustain a high throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR, SSSP and WCC, respectively. Compared with several highly-optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, 16.4× speedup for PR, 3.5× speedup for SSSP, and 35.1× speedup for WCC, and compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× speedup for SpMV, 1.64× speedup for PR, and 1.8× speedup for WCC, respectively. We develop a performance model for our GPOP paradigm. We then perform performance predictions of our designs assuming the graph is stored in HBM2 instead of DRAM. We further discuss extensions to our optimizations to improve the throughput.

J. Kurzak, Y. M. Tsai, M. Gates, A. Abdelfattah, and J. Dongarra, “Massively parallel automated software tuning,” in Proceedings of the 48th International Conference on Parallel Processing, ser. ICPP 2019, Kyoto, Japan: ACM, 2019, 92:1–92:10, isbn: 978-1-4503-6295-5. doi: 10.1145/3337821.3337908.

Abstract: This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined.

J. Lagravière, J. Langguth, M. Prugger, L. Einkemmer, P. H. Ha, and X. Cai, “Performance optimization and modeling of ﬁne-grained irregular communication in UPC,” Scientiﬁc Programming, vol. 2019, 2019. doi: 10.1155/2019/6825728.

Abstract: .e Uniﬁed Parallel C (UPC) programming language oﬀers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory subsystems. One convenient feature of UPC is its ability to automatically execute betweenthread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. .e programmer friendliness, however, can come at the cost of substantial performance penalties. .is is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a ﬁne-grained pattern. In this paper, we study performance enhancement strategies speciﬁcally targeting such ﬁnegrained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require ﬁne-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the diﬀerent scenarios, in the form of quantiﬁable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. .ese performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.

C.-L. Lee, C.-T. Chao, J.-K. Lee, C.-W. Huang, and M.-Y. Hung, “Sparse-matrix compression primitives with opencl framework to support halide,” in Proceedings of the International Workshop on OpenCL, ser. IWOCL’19, Boston, MA, USA: ACM, 2019, pp. 1–2, isbn: 978-1-4503-6230-6. doi: 10.1145/3318170.3318179.

Abstract: Halide and OpenCL now play important roles for heterogeneous multi-core computing. OpenCL provides vendor-level support and Halide provides domain-speciﬁc support such as vision processing and AI model (TVM Halide IR). Halide also provides ﬂexible scheduling for applications on target machines. OpenCL plays a supporting role for Halide environments. In this work, we investigate the research issues in supporting sparse computation with Halide and their corresponding OpenCL support. We present sparse matrix compression primitives on Halide for sparse matrix matrix (SpMM) multiplication with OpenCL framework. Halide is a programming language designed to process image and array from numerous algorithms and scheduling primitives to achieve state-of-art performance including SIMD and heterogeneous computation. This paper proposed the implementation of sparse matrix compression for Halide scheduling primitives including COO, CSR, and hybrid CSR. The design of experiments includes Halide primitives for sparse matrix compression and matrix computations. The experimental result of computation with compressing matrix shows the performance are improved by up to 85% compared to the baseline without compression.

D. Lee, J. Oh, and H. Yu, “OCam: Out-of-core coordinate descent algorithm for matrix completion,” Information Sciences, 2019, issn: 0020-0255. doi: 10.1016/j.ins.2019.09.077. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025519309284.

Abstract: Recently, there are increasing reports that most datasets can be actually stored in disks of a single oﬀ-the-shelf workstation, and utilizing out-of-core methods is much cheaper and even faster than using a distributed system. For these reasons, out-of-core methods have been actively developed for machine learning and graph processing. The goal of this paper is to develop an eﬃcient out-of-core matrix completion method based on coordinate descent approach. Coordinate descent-based matrix completion (CD-MC) has two strong beneﬁts over other approaches: 1) it does not involve heavy computation such as matrix inversion and 2) it does not have step-size hyper-parameters, which reduces the eﬀort for hyper-parameter tuning. Existing solutions for CD-MC have been developed and analyzed for in-memory setting and they do not take disk-I/O into account. Thus, we propose OCam, a novel out-of-core coordinate descent algorithm for matrix completion. Our evaluation results and cost analyses provide sound evidences supporting the following beneﬁts of OCam: (1) Scalability – OCam is a truly scalable out-of-core method and thus decomposes a matrix larger than the size of memory, (2) Eﬃciency – OCam is super fast. OCam is up to 10× faster than the state-of-the-art out-of-core method, and up to 4.1× faster than a competing distributed method when using eight machines. The source code of OCam will be available for reproducibility.

D. Lei, M. Du, H. Chen, Z. Li, and Y. Wu, “Distributed parallel sparse multinomial logistic regression,” IEEE Access, vol. 7, pp. 55 496–55 508, 2019, issn: 2169-3536. doi: 10.1109/ACCESS.2019.2913280.

Abstract: Sparse Multinomial Logistic Regression (SMLR) is widely used in the ﬁeld of image classiﬁcation, multi-class object recognition, and so on, because it has the function of embedding feature selection during classiﬁcation. However, it cannot meet the time and memory requirements for processing large-scale data. We have reinvestigated the classiﬁcation accuracy and running eﬃciency of the algorithm for solving SMLR problems using the Alternating Direction Method of Multipliers (ADMM), which is called fast SMLR (FSMLR) algorithm in this paper. By reformulating the optimization problem of FSMLR, we transform the serial convex optimization problem to the distributed convex optimization problem, i.e., global consensus problem and sharing problem. Based on the distributed optimization problem, we propose two distribute parallel SMLR algorithms, sample partitioning-based distributed SMLR (SP-SMLR), and feature partitioning-based distributed SMLR (FP-SMLR), for a large-scale sample and large-scale feature datasets in big data scenario, respectively. The experimental results show that the FSMLR algorithm has higher accuracy than the original SMLR algorithm. The big data experiments show that our distributed parallel SMLR algorithms can scale for massive samples and large-scale features, with high precision. In a word, our proposed serial and distribute SMLR algorithms outperform the state-of-the-art algorithms.

J. Li, B. Uçar, Ü. V. Çatalyürek, J. Sun, K. Barker, and R. Vuduc, “Eﬃcient and eﬀective sparse tensor reordering,” in Proceedings of the ACM International Conference on Supercomputing, ser. ICS ’19, Phoenix, Arizona: ACM, 2019, pp. 227–237, isbn: 978-1-4503-6079-1. doi: 10.1145/3330345.3330366.

Abstract: This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations with it, and proposes two reordering algorithms for this problem, which we call BFS-MCS and Lexi-Order. The BFS-MCS method is a Breadth First Search (BFS)-like heuristic approach based on the maximum cardinality search family; Lexi-Order is an extension of doubly lexical ordering of matrices to tensors. We show the eﬀects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when storing the tensor in three previously proposed sparse tensor formats: coordinate (COO), compressed sparse ﬁber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for HiCOO format to improve load balance. On modern multicore CPUs, we show Lexi-Order obtains up to 4.14× speedup on sequential HiCOO-Mttkrp and 11.88× speedup on its parallel counterpart. The performance of COO- and CSF-based Mttkrps also improves. Our two reordering methods are more eﬀective than state-of-the-art approaches. The code is released as part of Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI.

M. Li, Y. Liu, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, “Accelerating sparse cholesky factorization on sunway manycore architecture,” IEEE Transactions on Parallel and Distributed Systems, 2019, issn: 2161-9883. doi: 10.1109/TPDS.2019.2953852.

Abstract: To improve the performance of Sparse Cholesky factorization, existing research divides the adjacent columns of the sparse matrix with the same nonzero patterns into supernodes for parallelization. However, due to the various structures of sparse matrices, the computation of the generated supernodes varies signiﬁcantly, and thus hard to optimize when computed by dense matrix kernels. Therefore, how to eﬃciently map sparse Choleksy factorization to the emerging architectures, such as Sunway many-core processor, remains an active research direction. In this paper, we propose swCholesky, which is a highly optimized implementation of sparse Cholesky factorization on Sunway processor. Speciﬁcally, we design three kernel task queues and a dense matrix library to dynamically adapt to the kernel characteristics and architecture features. In addition, we propose an auto-tuning mechanism to search for the optimal settings of the important parameters in swCholesky. Our experiments show that swCholesky achieves better performance than state-of-the-art implementations.

M. Li, P. Hawrylak, and J. Hale, “Combining opencl and mpi to support heterogeneous computing on a cluster,” in Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), ser. PEARC ’19, Chicago, IL, USA: ACM, 2019, 5:1–5:6, isbn: 978-1-4503-7227-5. doi: 10.1145/3332186.3333059.

Abstract: This paper presents an implementation of a heterogeneous programming model which combines Open Computing Language (OpenCL) and Message Passing Interface (MPI). The model is applied to solving a Markov decision process (MDP) with value iteration method. The performance test is conducted on a high performance computing cluster. At peak performance, the model is able to achieve a 57× speedup over a serial implementation. For an extremely large input MDP, which has 1,000,000 states, the obtained speedup is still over 12×, showing that this heterogeneous programming model can solve MDPs more eﬃciently than the serial solver does.

Y. Li, P. Xie, X. Chen, J. Liu, B. Yang, S. Li, C. Gong, X. Gan, and H. Xu, “Vbsf: A new storage format for simd sparse matrix–vector multiplication on modern processors,” The Journal of Supercomputing, Apr. 10, 2019, issn: 1573-0484. doi: 10.1007/s11227-019-02835-4.

Abstract: Sparse matrix–vector multiplication (SpMV) is one of the most indispensable kernels of solving problems in numerous applications, but its performance of SpMV is limited by the need for frequent memory access. Modern processors exploit data-level parallelism to improve the performance using single-instruction multiple data (SIMD). In order to take full advantage of SIMD acceleration technology, a new storage format called Variable Blocked-σ-SIMD Format (VBSF) is proposed in this paper to change the irregular nature of traditional matrix storage formats. This format combines the adjacent nonzero elements into variable size blocks to ensure that SpMV can be computed with SIMD vector units. We compare the VBSF-based SpMV with traditional storage formats using 15 matrices as a benchmark suite on three computing platforms (FT2000, Intel Xeon E5 and Intel Silver) with diﬀerent SIMD length. For the matrices in the benchmark suite, the VBSF obtains great performance improvement on three platforms, respectively, and it proves to have better storage eﬃciency compared with other storage formats.

C. Liu, H. Yang, X. Liu, Z. Luan, and D. Qian, “Intelligent-unrolling: Exploiting regular patterns in irregular applications,” Oct. 24, 2019. arXiv:1910.13346 [cs.DC].

Abstract: Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorization codes. However, such patterns in irregular applications are unknown until runtime due to the input dependence. Thus, either compiler’s static optimization or proﬁle-guided optimization based on speciﬁc inputs cannot predict the patterns for any common input, which leads to suboptimal code generation. To address this challenge, we develop Intelligent-Unroll, a framework to automatically optimize irregular applications with vectorization. Intelligent-Unroll allows the users to depict the computation task using code seed with the memory access and computation patterns represented in feature table and information-code tree, and generates highly eﬃcient codes. Furthermore, Intelligent-Unroll employs several novel optimization techniques to optimize reduction operations and gather/scatter instructions. We evaluate Intelligent-Unroll with sparse matrix-vector multiplication (SpMV) and graph applications. Experimental results show that Intelligent-Unroll is able to generate more eﬃcient vectorization codes compared to the state-of-the-art implementations.

H. Liu, Y. Tian, H. Zong, Q. Ma, M. Y. Wang, and L. Zhang, “Fully parallel level set method for large-scale structural topology optimization,” Computers & Structures, vol. 221, pp. 13–27, 2019, issn: 0045-7949. doi: 10.1016/j.compstruc.2019.05.010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0045794918316511.

Abstract: To realize large-scale or high-resolution structural topology optimization design, a fully parallel parameterized level set method with compactly supported radial basis functions (CSRBFs) is developed based on both the uniform and non-uniform structured meshes. In this work, the whole computation process is parallelized, including mesh generation, sensitivity analysis, calculation and assembly of the element stiﬀness matrices, solving of the structural state equation, parameterization and updating of the level set function, and output of the computational results during the optimization iterations. In addition, some typical numerical examples, in which the calculation scale is up to 7 million 8-node hexahedral elements, are carried out for verifying the eﬀectiveness of the proposed method. Finally, the computing time is also analyzed in detail. It is found that: (1) In the optimized structures, the thin sheet-like components gradually replace the truss-like ones when reﬁning the mesh, (2) the parameterization process of the level set function will become fast as long as the non-uniformity of mesh is not very high and the supported radius of CSRBF is small enough, and (3) more than 80% of the total computing time is always consumed for solving the structural state equation during the ﬁnite element analysis (FEA).

J. A. Loe, H. K. Thornquist, and E. G. Boman, “Polynomial preconditioned gmres to reduce communication in parallel computing,” 2019. arXiv:1907.00072.

Abstract: Polynomial preconditioning with the GMRES minimal residual polynomial has the potential to greatly reduce orthogonalization costs, making it useful for communication reduction. We implement polynomial preconditioning in the Belos package from Trilinos and show how it can be eﬀective in both serial and parallel implementations. We further show it is a communication-avoiding technique and is a viable option to CAGMRES for large-scale parallel computing.

S. Ma, Z. Liu, S. Chen, L. Huang, Y. Guo, Z. Wang, and M. Zhang, “Coordinated dma: Improving the dram access eﬃciency for matrix multiplication,” IEEE Transactions on Parallel and Distributed Systems, pp. 1–1, 2019, issn: 1045-9219. doi: 10.1109/TPDS.2019.2906891.

Abstract: High performance implementation of matrix multiplication is essential for scientiﬁc computing. The memory access procedure is quite possible to be the bottleneck of matrix multiplication. The widely used GotoBLAS GEMM implementation divides the integral matrix into several partitions to be assigned to diﬀerent cores for parallelization. Traditionally, each core deploys a DMA transfer to access its own partition in the DRAM memory. However, deploying an independent DMA transfer for each core cannot eﬃciently exploit the inter-core locality. Also, multiple concurrent DMA transfers interfere with each other, further reducing the DRAM access eﬃciency. We observe that the same row of neighboring partitions is in the same DRAM page, which means that there is signiﬁcant locality inherent in the address layout. We propose the coordinated DMA to eﬃciently exploit the locality. It invokes one transfer to serve all cores and moves data in a row-major manner to improve the DRAM access eﬃciency. Compared with a baseline design, the coordinated DMA improves the bandwidth by 84.8% and reduces DRAM energy consumption by 43.1% for micro-benchmarks. It achieves higher performance for the GEMM and Linpack benchmark. With much less hardware costs, the coordinated DMA signiﬁcantly outperforms an out-of-order memory controller.

H. J. Macintosh, J. E. Banks, and N. A. Kelson, “Implementing and evaluating an heterogeneous, scalable, tridiagonal linear system solver with OpenCL to target FPGAs, GPUs, and CPUs,” International Journal of Reconﬁgurable Computing, vol. 2019, pp. 1–13, Oct. 2019. doi: 10.1155/2019/3679839.

Abstract: Solving diagonally dominant tridiagonal linear systems is a common problem in scientiﬁc high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator’s compute performance. The truncated SPIKE FPGA solver is developed ﬁrst for optimising OpenCL device kernel performance, global memory bandwidth, and interleaved host to device memory transactions. The FPGA OpenCL kernel code is then refactored and optimised to best exploit the underlying architecture of the CPU and GPU. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150%, and 170% faster, respectively, than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU + FPGA combination is found to have the best compute performance and the FPGA-only conﬁguration is found to have the best overall estimated energy eﬃciency.

P. Mamooler, “The domain decomposition method of bank and jimack as an optimized schwarz method,” PhD thesis, University of Geneva, 2019.

Abstract: The aim of this thesis is to introduce the Bank-Jimack domain decomposition method and study its convergence behavior. We are interested in understanding what the precise contribution of the outer coarse mesh is to the convergence behavior of the domain decomposition method proposed by Bank and Jimack. We show for a two subdomain decomposition that the outer coarse mesh can be interpreted as computing an approximation to the optimal transmission condition represented by the Dirichlet to Neumann map, and thus the method of Bank and Jimack can be viewed as an optimized Schwarz method, i.e. a Schwarz method that uses Robin or higher order transmission conditions instead of theclassical Dirichlet ones.

S. M. V. N. Marques, T. S. Medeiros, F. D. Rossi, M. C. Luizelli, A. G. Girardi, A. C. S. Beck, and A. F. Lorenzon, “The impact of turbo frequency on the energy, performance, and aging of parallel applications,” in Proceedings of the IFIP/IEEE 27th International Conference on Very Large Scale Integration, ser. VLSI-SoC ’19, Oct. 2019, pp. 149–154. doi: 10.1109/VLSI-SoC.2019.8920389.

Abstract: Technologies that improve the performance of parallel applications by increasing the nominal operating frequency of processors respecting a given TDP (Thermal Design Power) have been widely used. However, they may impact on other non-functional requirements in diﬀerent ways (e.g. increasing energy consumption or aging). Therefore, considering the huge number of conﬁgurations available, represented by the range of all possible combinations among diﬀerent parallel applications, amount of threads, dynamic voltage and frequency scaling (DVFS) governors, boosting technologies and simultaneous multithreading (SMT), selecting the one that oﬀers the best tradeoﬀ for a non-functional requirement is extremely challenging for software designers. Given that, in this work we assess the impact of changing these conﬁgurations on the energy consumption, performance, and aging of parallel applications on a turbo-compliant processor. Results show that there is no single conﬁguration that would provide the best solution for all nonfunctional requirements at once. For instance, we demonstrate that the conﬁguration that oﬀers the best performance is the same one that has the worst impact on aging, accelerating it by up to 1.75 times. With our experiments, we provide guidelines for the developer when it comes to tuning performance using turbo boosting to save as much energy as possible and increase the lifespan of the hardware components1.1This study was ﬁnanced in part by the Coordenao de Aperfeioamento de Pessoal de Nvel Superior – Brasil (CAPES) – Finance Code 001.

M. Massias, S. Vaiter, A. Gramfort, and J. Salmon, “Dual extrapolation for sparse generalized linear models,” 2019. arXiv:1907.05830.

Abstract: Generalized Linear Models (GLM) form a wide class of regression and classiﬁcation models, where prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity inducing regularizations have proven to be useful while oﬀering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets diminish the size of the optimization problem at hand, either by progressively removing variables, or by solving a growing sequence of smaller problems. For both techniques, signiﬁcant variables are identiﬁed thanks to convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit a Vector AutoRegressive (VAR) behavior after sign identiﬁcation, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. Exploiting this regularity, one can construct dual points that oﬀer tighter certiﬁcates of optimality, enhancing the performance of screening rules and helping to design competitive working set algorithms.

T. Mattson, T. A. Davis, M. Kumar, A. Buluc, S. McMillan, J. Moreira, and C. Yang, “LAGraph: A community eﬀort to collect graph algorithms built on top of the graphblas,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops, ser. IPDPSW’19, May 2019, pp. 276–284. doi: 10.1109/IPDPSW.2019.00053.

Abstract: In 2013, we released a position paper to launch a community eﬀort to deﬁne a common set of building blocks for constructing graph algorithms in the language of linear algebra. This led to the GraphBLAS. We released a speciﬁcation for the C programming language binding to the GraphBLAS in 2017. Since that release, multiple libraries that conform to the GraphBLAS C speciﬁcation have been produced. In this position paper, we launch the next phase of this ongoing community eﬀort: a project to assemble a set of high level graph algorithms built on top of the GraphBLAS. While many of these algorithms are well-known with high quality implementations available, they have not been assembled in one place and integrated with the GraphBLAS. We call this project the LAGraph graph algorithms project and with this position paper, we put out a call for collaborators to join us. While the initial goal is to just assemble these algorithms into a single framework, the long term goal is a library of production-worthy code, with the LAGraph library serving as an open source repository of veriﬁed graph algorithms that use the GraphBLAS.

H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, M. Urban, M. Burkart, M. Dippel, M. Lindauer, and F. Hutter, “Towards automatically-tuned deep neural networks,” in Automated Machine Learning: Methods, Systems, Challenges, F. Hutter, L. Kotthoﬀ, and J. Vanschoren, Eds. Cham: Springer International Publishing, 2019, pp. 135–149, isbn: 978-3-030-05318-5. doi: 10.1007/978-3-030-05318-5˙7.

Abstract: Recent advances in AutoML have led to automated tools that can compete with machine learning experts on supervised learning tasks. In this work, we present two versions of Auto-Net, which provide automatically-tuned deep neural networks without any human intervention. The ﬁrst version, Auto-Net 1.0, builds upon ideas from the competition-winning system Auto-sklearn by using the Bayesian Optimization method SMAC and uses Lasagne as the underlying deep learning (DL) library. The more recent Auto-Net 2.0 builds upon a recent combination of Bayesian Optimization and HyperBand, called BOHB, and uses PyTorch as DL library. To the best of our knowledge, Auto-Net 1.0 was the ﬁrst automatically-tuned neural network to win competition datasets against human experts (as part of the ﬁrst AutoML challenge). Further empirical results show that ensembling Auto-Net 1.0 with Auto-sklearn can perform better than either approach alone, and that Auto-Net 2.0 can perform better yet.

K. Meng, J. Li, G. Tan, and N. Sun, “A pattern based algorithmic autotuner for graph processing on GPUs,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19, Washington, District of Columbia: ACM, 2019, pp. 201–213, isbn: 978-1-4503-6225-2. doi: 10.1145/3293883.3295716.

Abstract: This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of algorithmic patterns that allow for the conﬁgurable assembly of variants of the algorithm. The fast transition of Gswitch is based on a machine learning model trained using 644 real graphs. Moreover, Gswitch provides a simple programming interface that conceals low-level tuning details from the user. We evaluate Gswitch on typical graph algorithms (BFS, CC, PR, SSSP, and BC) using Nvidia Kepler and Pascal GPUs. The results show that Gswitch runs up to 10times faster than the best conﬁguration of the state-of-the-art programmable GPU-based graph processing libraries on 10 representative graphs. Gswitch outperforms Gunrock on 92.4% cases of 644 graphs which is the largest dataset evaluation reported to date.

D. J. Milroy, A. H. Baker, J. M. Dennis, A. Gettelman, and D. M. Hammerling, “Investigating the impact of mixed precision on correctness for a large climate code,” in Proceedings of the Third International Workshop on Software Correctness for HPC Applications, 2019.

Abstract: Earth system models (ESMs) are computationally expensive and represent many complex processes on a wide range of scales from molecular to global. Certain ESM computations require high precision while others, such as atmospheric microphysics (e.g., precipitation) which are approximated by bulk properties, should not. As such, atmospheric microphysics models are prime candidates for conversion to single precision, which aﬀord distinct computational and memory advantages over typical double-precision numbers. However, care must be taken as indiscriminate type casting to single precision can result in numerical instability and divergent output when applied naively. In this work we relate our experiences attempting to improve the performance of the Morrison-Gettelman microphysics package (MG2) in a popular ESM by modifying it to compute in single precision without sacriﬁcing correctness. We ﬁnd that modiﬁcation of the entire MG2 package to compute with singleprecision ﬂoats achieves a respectable performance increase but does not appear to be correct in terms of maintaining consistency with double-precision MG2. On the other hand, narrowing the scope of our conversion to a couple expensive subprograms yields more satisfying results in terms of correctness but with negligible overall performance improvement. We evaluate correctness with both an objective statistical tool and traditional approaches more familiar to climate scientists. While we are still working toward our ultimate goal of improving the performance of MG2 without negatively aﬀecting model output, we believe that our experiences may be helpful to other groups pursuing similar goals.

S. M. Mniszewski, “Graph partitioning as quadratic unconstrained binary optimization (QUBO) on spiking neuromorphic hardware,” in Proceedings of the International Conference on Neuromorphic Systems, ser. ICONS’19, Knoxville, TN, USA: ACM, 2019, 4:1–4:5, isbn: 978-1-4503-7680-8. doi: 10.1145/3354265.3354269.

Abstract: In this work, graph partitioning (GP) is explored using quadratic unconstrained binary optimization (QUBO) on the IBM TrueNorth spiking neuromorphic architecture. GP splits a graph into similar-sized parts while minimizing the number of cut edges between parts. Classical approaches to GP rely on heuristics and approximation algorithms. The GP QUBO formulation was inspired by previous work using the D-Wave quantum annealer. This approach is not limited to graph algorithms, but is applicable to solving a spectrum of NP-hard optimization problems. A classical pseudo simulated annealing metaheuristic is used to solve the QUBO. Implementation on the IBM TrueNorth using a spiking framework is described. Results as converged high-energy solutions are shown to be ”good enough” or optimal for partitioning a graph into 2 parts.

M. S. Mohammadi, T. Yuki, K. Cheshmi, E. C. Davis, M. Hall, M. M. Dehnavi, P. Nandy, C. Olschanowsky, A. Venkat, and M. M. Strout, “Sparse computation data dependence simpliﬁcation for eﬃcient compiler-generated inspectors,” in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2019, Phoenix, AZ, USA: ACM, 2019, pp. 594–609, isbn: 978-1-4503-6712-7. doi: 10.1145/3314221.3314646.

Abstract: This paper presents a combined compile-time and runtime loop-carried dependence analysis of sparse matrix codes and evaluates its performance in the context of wavefront parallellism. Sparse computations incorporate indirect memory accesses such as x[col[j]] whose memory locations cannot be determined until runtime. The key contributions of this paper are two compile-time techniques for signiﬁcantly reducing the overhead of runtime dependence testing: (1) identifying new equality constraints that result in more eﬃcient runtime inspectors, and (2) identifying subset relations between dependence constraints such that one dependence test subsumes another one that is therefore eliminated. New equality constraints discovery is enabled by taking advantage of domain-speciﬁc knowledge about index arrays, such as col[j]. These simpliﬁcations lead to automatically-generated inspectors that make it practical to parallelize such computations. We analyze our simpliﬁcation methods for a collection of seven sparse computations. The evaluation shows our methods reduce the complexity of the runtime inspectors signiﬁcantly. Experimental results for a collection of ﬁve large matrices show parallel speedups ranging from 2x to more than 8x running on a 8-core CPU.

E. Montagne and R. Surós, “Systolic sparse matrix vector multiply in the age of tpus and accelerators,” in 2019 Spring Simulation Conference (SpringSim), Apr. 2019, pp. 1–10. doi: 10.23919/SpringSim.2019.8732860.

Abstract: Tensor Processing Units has brought back systolic arrays as a computational alternative to high performance computing. Recently Google presented a Tensor Processing Unit for handling matrix multiplication using systolic arrays. This unit is designed for dense matrices only. As they stated, sparse architectural support was omitted momentarily but they will focus on sparsity in future designs. We propose a systolic array to compute the Sparse Matrix Vector product in T2(n) ≈⌈⌉ + 2n + 2 using 2 n+2 processing elements. The systolic array we propose also use accumulators to collect the partial results of the resulting vector and supports adapting tiling.

A. Montoison and D. Orban, “BiLQ: An iterative method for nonsymmetric linear systems with a quasi-minimum error property,” Oct. 7, 2019. doi: 10.13140/RG.2.2.18287.59042. arXiv:1910.02598 [math.NA].

Abstract: We introduce an iterative method named BiLQ for solving general square linear systems Ax = b based on the Lanczos biorthogonalization process deﬁned by least-norm subproblems, and that is a natural companion to BiCG and QMR. Whereas the BiCG (Fletcher, 1976), CGS (Sonneveld, 1989) and BiCGSTAB (van der Vorst, 1992) iterates may not exist when the tridiagonal projection of A is singular, BiLQ is reliable on compatible systems even if A is ill-conditioned or rank deﬁcient. As in the symmetric case, the BiCG residual is often smaller than the BiLQ residual and, when the BiCG iterate exists, an inexpensive transfer from the BiLQ iterate is possible. Although the Euclidean norm of the BiLQ error is usually not monotonic, it is monotonic in a diﬀerent norm that depends on the Lanczos vectors. We establish a similar property for the QMR (Freund and Nachtigal, 1991) residual. BiLQ combines with QMR to take advantage of two initial vectors and solve a system and an adjoint system simultaneously at a cost similar to that of applying either method. We derive an analogous combination of USYMLQ and USYMQR based on the orthogonal tridiagonalization process (Saunders, Simon, and Yip, 1988). The resulting combinations, named BiLQR and TriLQR, may be used to estimate integral functionals involving the solution of a primal and an adjoint system. We compare BiLQR and TriLQR with Minres-qlp on a related augmented system, which performs a comparable amount of work and requires comparable storage. In our experiments, BiLQR terminates earlier than TriLQR and MINRES-QLP in terms of residual and error of the primal and adjoint systems.

A. Mukkara, N. Beckmann, and D. Sanchez, “PHI: Architectural support for synchronization- and bandwidth-eﬃcient commutative scatter updates,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on MIcroarchitecture, ser. MICRO-52, Columbus, OH, USA, Oct. 2019, p. 14.

Abstract: Many applications perform frequent scatter update operations to large data structures. For example, in push-style graph algorithms, processing each vertex requires updating the data of all its neighbors. Neighbors are often scattered over the whole graph, so these scatter updates have poor spatial and temporal locality. In current systems, scatter updates suﬀer high synchronization costs and high memory traﬃc. These drawbacks make push-style execution unattractive, and, when algorithms allow it, programmers gravitate towards pull-style implementations based on gather reads instead.
We present PHI, a push cache hierarchy that makes scatter updates synchronization- and bandwidth-eﬃcient. PHI adds support for pushing sparse, commutative updates from cores towards main memory. PHI adds simple compute logic at each cache level to buﬀer and coalesce these commutative updates throughout the hierarchy. This avoids synchronization, exploits temporal locality, and produces a load balanced execution. Moreover, PHI exploits spatial locality by selectively deferring updates with poor spatial locality, batching them to achieve sequential main memory transfers.
PHI is the ﬁrst system to leverage both the temporal and spatial locality beneﬁts of commutative scatter updates, some of which do not apply to gather reads. As a result, PHI not only makes push algorithms eﬃcient, but makes them consistently faster than pull ones. We evaluate PHI on graph algorithms and other sparse applications processing large inputs. PHI improves performance by 4.7× on average (and by up to 11×), and reduces memory traﬃc by 2× (and by up to 5×).

R. Muro, A. Fujii, and T. Tanaka, “Acceleration of symmetric sparse matrix-vector product using improved hierarchical diagonal blocking format,” in Proceedings of the International Conference on High Performance Computing in Asia-Paciﬁc Region, ser. HPC Asia 2019, Guangzhou, China: ACM, 2019, pp. 63–70, isbn: 978-1-4503-6632-8. doi: 10.1145/3293320.3293332.

Abstract: In the previous study, Guy et al. proposed sparse matrix-vector product (SpMV) acceleration using the Hierarchical Diagonal Blocking (HDB) format that recursively repeated partitioning, reordering, and blocking on symmetric sparse matrix. The HDB format stores sparse matrix hierarchically using tree structure. Each node of tree structure of HDB format store small sparse matrices using CSR format.
In this present study, we examined two problems with the HDB format and provided a solution for each problem.
First, SpMV using the HDB format has a partial dependent relationship among hierarchies. The problem with the HDB format is that the parallelism of computation decreases as the hierarchy of nodes gets closer to the root. Thus, we propose cutting of dependency using work vectors to solve this problem.
Second, each node of the conventional HDB format is stored in Compressed Sparse Row (CSR) format. Block compressed Sparse Row (BSR) format often becomes faster than CSR format in SpMV performance. Thus, we evaluated the eﬀectiveness of our proposed method with work vectors also for BSR-HDB format.
In addition, we compare the performance in the general format (CSR format, BSR format) using the Intel Math Kernel Library (MKL), the conventional HDB format, and the expanded HDB format by using 22 types of sparse matrix that from various ﬁeld. The results showed that the SpMV performance was highest in the HDB format that we expanded in 19 types of sparse matrix, which was 1.99 times faster than the CSR format.

Y. Nagasaka, S. Matsuoka, A. Azad, and A. Buluç, “Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors,” Parallel Computing, p. 102 545, Aug. 2019. doi: 10.1016/j.parco.2019.102545.

Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware speciﬁc optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We ﬁrstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Speciﬁcally targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Diﬀerent from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-ﬁrst search or triangle counting. Our hash-table and heap-based algorithms are showing signiﬁcant speedups from libraries in the majority of the cases while diﬀerent algorithms dominate the other scenarios with diﬀerent matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. We build the performance model for hash-table and heap-based algorithms, which supports the recipe. A critical ﬁnding is that hash-table-based SpGEMM gets a signiﬁcant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6×.

F. Nataf, “Adaptive domain decomposition method for saddle point problem in matrix form,” 2019. arXiv:1911.01858 [cs.DC].

Abstract: We introduce an adaptive domain decomposition (DD) method for solving saddle point problems deﬁned as a block two by two matrix. The algorithm does not require any knowledge of the constrained space. We assume that all sub matrices are sparse and that the diagonal blocks are the sum of positive semi deﬁnite matrices. The latter assumption enables the design of adaptive coarse space for DD methods.

J. Nie, C. Zhang, D. Zou, F. Xia, L. Lu, X. Wang, and F. Zhao, “Adaptive sparse matrix-vector multiplication on CPU-GPU heterogeneous architecture,” in Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, ser. HPCCT’19, ACM Press, 2019. doi: 10.1145/3341069.3341072.

Abstract: SpMV is the core algorithm in solving the sparse linear equations, which is widely used in many research and engineering application ﬁeld. GPU is the most common coprocessor in high-performance computing domain, and has already been proven to researchers the practical value in accelerating various algorithms. A lot of reletead work has been carried out to optimize parallel SpMV on CPU-GPU platforms, which mainly focuses on reducing the computing overhead on the GPU, including branch divergence and cache missing, and little attention was paid to the overall eﬃciency of the heterogeneous platform. In this paper, we describe the design and implementation of an adaptive sparse matrix-vector multiplication (SpMV) on CPU-GPU heterogeneous architecture. We propose a dynamic task scheduling framework for CPU-GPU platform to improve the utilization of both CPU and GPU. A double buﬀering scheme is also presented to hide the data transfer overhead between CPU and GPU. Two deeply optimized SpMV kernels are deployed for CPU and GPU respectively. The evaluation on typical sparse matrices indicates that the proposed algorithm obtains both signiﬁcant performance increase and adaptability to diﬀerent types of sparse matrices.

Q. Nie and S. Malik, “SpFlow: Memory-driven data ﬂow optimization for sparse matrix-matrix multiplication,” in Proceedings of the IEEE International Symposium on Circuits and Systems, ser. ISCAS’19, May 2019, pp. 1–5. doi: 10.1109/ISCAS.2019.8702111.

Abstract: To improve the performance of sparse matrix-matrix multiplication (SpMM) running on a specialized architecture, orchestrating a data ﬂow that maximizes data reuse in local memory is critical but challenging due to the irregular non-zero element locations and the wide range of sparsity. In this work, we proposed SpFlow, a memory-driven data ﬂow optimization framework for SpMM. SpFlow can realize 54X fewer DRAM accesses and 97X fewer SRAM accesses on average than a GPU running the cuSPARSE kernel. And in comparison with a state-of-the-art accelerator, the performance can be improved by 3X, and SRAM accesses reduced by 5X on average.

I. J. Nisa, “Architecture-aware algorithm design of sparse tensor/matrix primitives for GPUs,” PhD thesis, Ohio State University, 2019.

Abstract: Sparse matrix/tensor operations have been a common computational motif in a wide spectrum of domains – numerical linear algebra, graph analytics, machine learning, health-care, etc. Sparse kernels play a key role in numerous machine learning algorithms and the rising popularity of this domain increases the signiﬁcance of the primitives like SpMV (Sparse Matrix-Vector Multiplication), SDDMM (Sampled Dense-Dense Matrix Multiplication), MF/TF(Sparse Matrix/Tensor Factorization), etc. These primitives are data-parallel and highly suitable for GPU-like architectures that provide massive parallelism. Real-world matrices and tensors are large-scale and have millions of data points, which is suﬃcient to utilize all the cores of a GPU. Yet, a data parallel algorithm can become the bottleneck of an application and perform way below than the upper bound of the rooﬂine model. Some common reasons are frequent irregular global memory access, low data reuse, and imbalanced work distribution. However, eﬃcient utilization of GPU memory hierarchy, reduced thread communication, increased data locality , and an even workload distribution can provide ample opportunities for signiﬁcant performance improvement. The challenge lies in utilizing the techniques across applications and achieve an even performance in spite of the irregularity of the input matrices or tensors. In this work, we systematically identify the performance bottlenecks of the important sparse algorithms and provide optimized and high performing solutions.
At the beginning of this dissertation, we explore the application of cost-eﬀective ML techniques in solving the format selection and performance modeling problem in the SpMV domain. By identifying a small set of sparse matrix features to use in training the ML models, we are able to select the best storage format, and predict the execution time of an SpMV kernel as well. Next, we optimize the SDDMM kernel, which is a key bottleneck in factor analysis and topic modeling algorithms like ALS, LDA, GaP, ALS, etc. The performance constraints are addressed by exploiting data reuse and increasing parallelism using virtual warping, multi-level tiling and eﬀective use of on-chip memory (shared memory), etc. Rest of the following works are on the optimization of factorization techniques of sparse matrix and tensors on GPUs. For matrix factorization, we optimize the cyclic coordinate descent (CCD++), which is the state-of-the-art factorization method. An eﬃcient GPU implementation is devised by using kernel fusion, tiling and binning. Next, we extend the optimization of the factorization problem to higher order data – tensor. MTTKRP (Matricized Tensor Times Khatri-Rao Products) is a key bottleneck of one of the most common tensor factorization techniques – CPD (CANDECOMP/PARAFAC decomposition). We develop new storage-eﬃcient representations, B-CSF and HB-CSF, for tensors that enables high-performance and load-balanced execution of MTTKRP on GPUs. However, for a tensor with d modes, CPD requires a sequence of d tensor computations. To guarantee eﬃcient memory access with respect to diﬀerent modes, many storage formats store d distinct representations despite d-fold space overhead. Hence, we devise MM-CSF, a compact mixed-mode representation where better performance is achieved compared to existing solutions while utilizing a small fraction of the space.

I. Nisa, J. Li, A. Sukumaran-Rajam, P. S. Rawat, S. Krishnamoorthy, and P. Sadayappan, “An eﬃcient mixed-mode representation of sparse tensors,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19, Denver, Colorado: ACM, 2019, 49:1–49:25, isbn: 978-1-4503-6229-0. doi: 10.1145/3295500.3356216.

Abstract: The Compressed Sparse Fiber (CSF) representation for sparse tensors is a generalization of the Compressed Sparse Row (CSR) format for sparse matrices. For a tensor with d modes, typical tensor methods such as CANDECOMP/PARAFAC decomposition (CPD) require a sequence of d tensor computations, where eﬃcient memory access with respect to diﬀerent modes is required for each of them. The straightforward solution is to use d distinct representations of the tensor, with each one being eﬃcient for one of the d computations. However, a d-fold space overhead is often unacceptable in practice, especially with memory-constrained GPUs. In this paper, we present a mixed-mode tensor representation that partitions the tensor’s nonzero elements into disjoint sections, each of which is compressed to create ﬁbers along a diﬀerent mode. Experimental results demonstrate that better performance can be achieved while utilizing only a small fraction of the space required to keep d distinct CSF representations.

I. Nisa, J. Li, A. Sukumaran-Rajam, R. Vuduc, and P. Sadayappan, “Load-balanced sparse MTTKRP on GPUs,” in Proceedings of the 2019 International Parallel and Distributed Processing Symposium, ser. IPDPS’19, 2019.

Abstract: Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-ofthe-art CSF (compressed sparse ﬁber) format from CPUs to GPUs. A signiﬁcant challenge with GPUs compared to multicore CPUs is that of utilizing the much greater degree of parallelism in a load-balanced fashion for irregular computations like sparse MTTKRP. To address this issue, we develop a new storage-eﬃcient representation for tensors that enables highperformance, load-balanced execution of MTTKRP on GPUs. A GPU implementation of sparse MTTKRP using the new sparse tensor representation is shown to outperform all currently known parallel sparse CPU and GPU MTTKRP implementations.

I. Nowak, P. Muts, and E. M. T. Hendrix, “Multi-tree decomposition methods for large-scale mixed integer nonlinear optimization,” in Large Scale Optimization in Supply Chains and Smart Manufacturing: Theory and Applications, J. Velásquez-Bermúdez, M. Khakiﬁrooz, and M. Fathi, Eds. Cham: Springer International Publishing, 2019, pp. 27–58, isbn: 978-3-030-22788-3. doi: 10.1007/978-3-030-22788-3˙2.

Abstract: Most industrial optimization problems are sparse and can be formulated as block-separable mixed-integer nonlinear programmingMixed integer nonlinear programming(MINLP) problems, deﬁned by linking low-dimensional sub-problems by (linear) coupling constraints. Decomposition methods solve a block-separable MINLP by alternately solving master problems and sub-problems. In practice, decomposition methods are sometimes the only possibility to compute high-quality solutions of large-scale optimization problems. However, eﬃcient implementations may require expert knowledge and problem-speciﬁc features. Recently, there is renewed interest in making these methods accessible to general users by developing generic decomposition frameworks and modelling support. The focus of this chapter is on so-called multi-tree decomposition methods, which iteratively approximate the feasible area without using a single (global) branch-and-bound tree, i.e. branch-and-bound is only used for solving sub-problems. After an introduction, we describe ﬁrst outer approximation (OA) decomposition methods, Outer approximationincluding the adaptive, multivariate partitioning (AMP)Adaptive, Multivariate Partitioning (AMP) algorithmand the novel decomposition-based outer approximation (DECOA) algorithmDecomposition-based outer approximation (DECOA). This is followed by a description of multi-tree methods using a reduced master problem for solving large-scale industrial optimization problems. The ﬁrst method to be described applies parallel column generationColumn generation(CG) and iterative ﬁxing for solving nonconvex transport optimization problems with several hundred millions of variables and constraints. The second method is based on a novel approach combining CG and compact outer approximation. The last methodology to be discussed is the general Benders decomposition methodBenders decomposition methodfor globally solving large nonconvex stochastic programs using a reduced mixed-integer programming (MIP) master problem.

M. Osama, M. Truong, C. Yang, A. Buluç, and J. D. Owens, “Graph coloring on the GPU,” UC Davis: College of Engineering, Tech. Rep., 2019. [Online]. Available: https://escholarship.org/uc/item/6kp4p18t.

Abstract: We design and implement parallel graph coloring algorithms on the GPU using two diﬀerent abstractions—one datacentric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic aﬀect performance. Our Gunrock graph coloring implementation has a peak 2× speed-up, a geomean speed-up of 1.3× and produces 1.6× more colors over previous hardwired state-of-theart implementations on real-world datasets. Our GraphBLAS implementation of Luby’s algorithm produces 1.9× fewer colors than the previous state-of-the-art parallel implementation at the cost of 3× extra runtime, and 1.014× fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6×.

S. Pandey, X. S. Li, A. Buluç, J. Xu, and H. Liu, “H-INDEX: Hash-indexing for parallel triangle counting on gpus,” in Proceedings of the IEEE High Performance Extreme Computing Conference, ser. HPEC ’19, Sep. 2019, pp. 1–7. doi: 10.1109/HPEC.2019.8916492.

Abstract: Triangle counting is a graph algorithm that calculates the number of triangles involving each vertex in a graph. Brieﬂy, a triangle encompasses three vertices from a graph, where every vertex possesses at least one incidental edge to the other two vertices from the triangle. Consequently, list intersection, which identiﬁes the incidental edges, becomes the core algorithm for triangle counting. At the meantime, attracted by the enormous parallel computing potential of Graphics Processing Units (GPUs), numerous eﬀorts have been devoted to deploy triangle counting algorithms on GPUs.While state-of-the-art intersection algorithms, such as merge-path and binary-search, perform well on traditional multi-core CPU systems, deploying them on massively parallel GPUs turns out to be challenging. In particular, merge-path based approach experiences the hardship of evenly distributing the workload across vast GPU threads and irregular memory accesses. Binary-search based approach often suﬀers from the potential problem of high time complexity. Furthermore, both approaches require sorted neighbor lists from the input graphs, which involves nontrivial preprocessing overhead. To this end, we introduce H-INDEX, a hash-indexing assisted triangle counting algorithm that overcomes all the aforementioned shortcomings. Notably, HINDEX achieves 141.399 billion TEPS computing rate on a Protein K-mer V2a graph with 64 GPUs. To the best of our knowledge, this is the ﬁrst work that advances triangle counting beyond the 100 billion TEPS rate.

S. Peng and S. X. D. Tan, “GLU3.0: Fast GPU-based parallel sparse LU factorization for circuit simulation,” 2019. arXiv: 1908.00204v2 [cs.DC].

Abstract: In this article, we propose a new GPU-based sparse LU factorization method, called GLU3.0, solves the aforementioned problems. First, it introduces a much more eﬃcient double-U dependency detection algorithm to make the detection much simpler. Second, we observe that the potential parallelism is diﬀerent as the matrix factorization goes on. We then develop three diﬀerent modes of GPU kernel to adapt to diﬀerent stages to accommodate the computing task changes in the factorization. As a result, the new GLU can dynamically allocate GPU blocks and wraps based on the number of columns in a level to better balance the computing demands and resources during the LU factorization process. Experimental results on circuit matrices from University of Florida Sparse Matrix Collection (UFL) show that the GLU3.0 can deliver 2-3 orders of magnitude speedup over GLU2.0 for the data dependency detection. Furthermore, GLU3.0 achieve 13.0× (arithmetic mean) and 6.7× (geometric mean) speedup over GLU2.0 and 7.1× (arithmetic mean) and 4.8× (geometric mean) over the recently proposed enhanced GLU2.0 sparse LU solver on the same set of circuit matrices.

D. Piccinotti, E. Ramalli, A. Parravicini, R. Brondolin, and M. Santambrogio, “Solving write conﬂicts in GPU-accelerated graph computation: A PageRank case-study,” in Proceedings of the IEEE 5th International forum on Research and Technology for Society and Industry, ser. RTSI ’19, Sep. 2019, pp. 144–148. doi: 10.1109/RTSI.2019.8895572.

Abstract: Graph ranking algorithms, such as PageRank, are widely used in a number of real-world applications like web search. As the size of the graphs on which these algorithms are applied gets bigger and bigger, it becomes necessary to devise powerful and ﬂexible techniques to accelerate and parallelize the computation, both at software and hardware level. Leveraging GPUs is a promising direction due to their highly parallel computing capabilities, but execution time is often hampered by write conﬂicts. In this paper, we present a solution to handle write conﬂicts in GPU computations exploiting high level of parallelism, and show how this technique can eﬀectively be used to accelerate the computation of PageRank by a factor of 5×, with respect to a baseline in which conﬂicts are not handled. Our solution is implemented at software level, and doesn’t require speciﬁc hardware resources.

R. Quirynen and S. D. Cairano, “PRESAS: Block-structured preconditioning of iterative solvers within a primal active-set method for fast MPC,” Dec. 4, 2019. arXiv:1912.02122 [math.OC].

Abstract: Model predictive control (MPC) for linear dynamical systems requires solving an optimal control structured quadratic program (QP) at each sampling instant. This paper proposes a primal active-set strategy (PRESAS) for the eﬃcient solution of such block-sparse QPs, based on a preconditioned iterative solver to compute the search direction in each iteration. Rank-one factorization updates of the preconditioner result in a per-iteration computational complexity of 𝒪(Nm2), where m denotes the number of state and control variables and N the number of control intervals. Three diﬀerent block-structured preconditioning techniques are presented and their numerical properties are studied further. In addition, an augmented Lagrangian based implementation is proposed to avoid a costly initialization procedure to ﬁnd a primal feasible starting point. Based on a standalone C code implementation, we illustrate the computational performance of PRESAS against current state of the art QP solvers for multiple linear and nonlinear MPC case studies. We also show that the solver is real-time feasible on a dSPACE MicroAutoBox-II rapid prototyping unit for vehicle control applications, and numerical reliability is illustrated based on experimental results from a testbench of small-scale autonomous vehicles.

H. M. D. Rais, S. A. Abed, and J. Watada, “Computational comparison of major proposed methods for graph partitioning problem,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 23, no. 1, pp. 5–17, 2019. doi: 10.20965/jaciii.2019.p0005.

Abstract: k-way graph partitioning is an NP-complete problem, which is applied to various tasks such as route planning, image segmentation, community detection, and high-performance computing. The approximate methods constitute a useful solution for these types of problems. Thus, many research studies have focused on developing meta-heuristic algorithms to tackle the graph partitioning problem. Local search is one of the earliest methods that has been applied eﬃciently to this type of problem. Recent studies have explored various types of local search methods and have improved them such that they can be used with the partitioning process. Moreover, local search methods are widely integrated with population-based approaches, to provide the best diversiﬁcation and intensiﬁcation for the problem space. This study emphasizes the local search approaches, as well as their combination with other graph partitioning approaches. At present, none of the surveys in the literature has focused on this class of state of the art approaches in much detail. In this study, the vital parts of these approaches including neighborhood structure, acceptance criterion, and the ways of combining them with other approaches, are highlighted. Additionally, we provide an experimental comparison that shows the variance in the performance of the reviewed methods. Hence, this study clariﬁes these methods to show their advantages and limitations for the targeted problem, and thus can aid in the direction of research ﬂow towards the area of graph partitioning.

C. Ramesh, “Hardware-software co-design accelerators for sparse blas,” PhD thesis, Centre for Nano Science and Engineering (CeNSE), Indian Institute of Science, Bangalore, 2019. eprint: http://etd.iisc.ac.in/handle/2005/4276.

Abstract: Sparse Basic Linear Algebra Subroutines (Sparse BLAS) is an important library. Sparse BLAS includes three levels of subroutines. Level 1, Level2 and Level 3 Sparse BLAS routines. Level 1 Sparse BLAS routines do computations over sparse vector and spare/dense vector. Level 2 deals with sparse matrix and vector operations. Level 3 deals with sparse matrix and dense matrix operations. The computations of these Sparse BLAS routines on General Purpose Processors (GPPs) not only suﬀer from less utilization of hardware resources but also takes more compute time than the workload due to poor data locality of sparse vector/matrix storage formats. In the literature, tremendous eﬀorts have been put into software to improve these Sparse BLAS routines performance on GPPs. GPPs best suit for applications with high data locality, whereas Sparse BLAS routines operate on applications with less data locality hence, GPPs performance is poor. Various Custom Function Units (Hardware Accelerators) are proposed in the literature and are proved to be eﬃcient than soft wares which tried to accelerate Sparse BLAS subroutines. Though existing hardware accelerators improved the Sparse BLAS performance compared to software Sparse BLAS routines, there is still lot of scope to improve these accelerators. This thesis describes both the existing software and hardware software co-designs (HW/SW co-design) and identiﬁes the limitations of these existing solutions. We propose a new sparse data representation called Sawtooth Compressed Row Storage (SCRS) and corresponding SpMV and SpMM algorithms. SCRS based SpMV and SpMM are performing better than existing software solutions. Even though SCRS based SpMV and SpMM algorithms perform better than existing solutions, they still could not reach theoretical peak performance. The knowledge gained from the study of limitations of these existing solutions including the proposed SCRS based SpMV and SpMM is used to propose new HW/SW co-designs. Software accelerators are limited by the hardware properties of GPPs, and GPUs itself, hence, we propose HW/SW co-designs to accelerate few basic Sparse BLAS operations (SpVV and SpMV). Our proposed Parallel Sparse BLAS HW/SW co-design achieves near theoretical peak performance with reasonable hardware resources.

S. Regev and M. A. Saunders, “SSAI: A symmetric sparse approximate inverse preconditioner for the conjugate gradient method,” Stanford University, Tech. Rep., 2019.

Abstract: We propose a method for solving a Hermitian positive deﬁnite linear system Ax = b, where A is an explicit sparse matrix (real or complex). A sparse approximate right inverse M is computed and replaced by = (M + MH)∕2, which is used as a left-right preconditioner in a modiﬁed version of the preconditioned conjugate gradient (PCG) method. M is formed column by column and can therefore be computed in parallel. PCG requires only matrix-vector multiplications with A and (not solving a linear system with the preconditioner), and so too can be carried out in parallel. We compare it with incomplete Cholesky factorization (the gold standard for PCG) and with MATLAB’s backslash operator (sparse Cholesky) on matrices from various applications.

I. Z. Reguly, G. R. Mudalige, M. B. Giles, and S. Maheswaran, “Improving resilience of scientiﬁc software through a domain-speciﬁc approach,” Journal of Parallel and Distributed Computing, 2019, issn: 0743-7315. doi: 10.1016/j.jpdc.2019.01.015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0743731519300917.

Abstract: In this paper we present research on improving the resilience of the execution of scientiﬁc software, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library to carry out checkpointing automatically, without the intervention of the user. The target applications are a hydrodynamics benchmark application from the Mantevo Suite, CloverLeaf 3D, the sparse linear solver proxy application TeaLeaf, and the OpenSBLI compressible Navier–Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL’s Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than the reference in all cases.

S. C. Regunta, S. H. Tondomker, and K. Kothapalli, “Brics – eﬃcient techniques for estimating the farness-centrality in parallel,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops, ser. IPDPSW’19, May 2019, pp. 645–654. doi: 10.1109/IPDPSW.2019.00110.

Abstract: In this paper, we study scalable parallel algorithms for estimating the farness-centrality value of the nodes in a given undirected and connected graph. Our algorithms consider approaches that are more suitable for sparse graphs. To this end, we propose four optimization techniques based on removing redundant nodes, removing identical nodes, removing chain nodes, and making use of decomposition based on the biconnected components of the input graph. We test our techniques on a collection of real-world graphs for the time taken and the average error percentage. We further analyze the applicability of our techniques on various classes of real-world graphs. We suggest why certain techniques work better on certain classes of graphs.

T. Ribizel and H. Anzt, “Approximate and exact selection on GPUs,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops, ser. IPDPSW’19, May 2019, pp. 471–478. doi: 10.1109/IPDPSW.2019.00088.

Abstract: We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for diﬀerent GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – ”pleasant” data distributions. At the same time, as the SampleSelect does not work on the actual values but the ranks of the elements only, it is robust to the input data and can complete signiﬁcantly faster for adversarial data distributions. Additionally to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.

T. Ribizel and H. Anzt, “Parallel selection on GPUs,” Parallel Computing, p. 102 588, 2019, issn: 0167-8191. doi: 10.1016/j.parco.2019.102588. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819119301796.

Abstract: We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for diﬀerent GPU generations, always leveraging the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – “pleasant” data distributions. At the same time, as the proposed SampleSelect algorithm does not work on the actual element values but on the element ranks of the elements only, it is robust to the input data and can complete signiﬁcantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.

F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, and F. Franchetti, “Eﬃcient SpMV operation for large and highly sparse matrices using scalable multi-way merge parallelization,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52, Columbus, OH, USA: ACM, 2019, pp. 347–358, isbn: 978-1-4503-6938-1. doi: 10.1145/3352460.3358330.

Abstract: The importance of Sparse Matrix dense Vector multiplication (SpMV) operation in graph analytics and numerous scientiﬁc applications has led to development of custom accelerators that are intended to over-come the diﬃculties of sparse data operations on general purpose architectures. However, eﬃcient SpMV operation on large problem (i.e. working set exceeds on-chip storage) is severely constrained due to strong dependence on limited amount of fast random access memory to scale. Additionally, unstructured matrix with high sparsity pose diﬃculties as most solutions rely on exploitation of data locality. This work presents an algorithm co-optimized scalable hardware architecture that can eﬃciently operate on very large ( billion nodes) and/or highly sparse (avg. degree ¡10) graphs with signiﬁcantly less on-chip fast memory than existing solutions. A novel parallelization methodology for implementing large and high throughput multi-way merge network is the key enabler of this high performance SpMV accelerator. Additionally, a data compression scheme to reduce oﬀ-chip traﬃc and special computation for nodes with exceptionally large number of edges, commonly found in power-law graphs, are presented. This accelerator is demonstrated with 16-nm fabricated ASIC and Stratix 10 FPGA platforms. Experimental results show more than an order of magnitude improvement over current custom hardware solutions and more than two orders of magnitude improvement over commercial oﬀ-the-shelf (COTS) architectures for both performance and energy eﬃciency.

D. Sahasrabudhe, E. T. Phipps, S. Rajamanickam, and M. Berzins, “A portable SIMD primitive using kokkos for heterogeneous architectures,” in Proceedings of the 6th Workshop on Accelerator Programming Using Directives, Denver, Colorado: ACM, 2019, pp. 1–23.

Abstract: As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-speciﬁc development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more eﬃcient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A uniﬁed, portable, Single Instruction Multiple Data (SIMD) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on gpus. This uniﬁed primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new SIMD primitive is used on diﬀerent architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against eﬃciently vectroized baseline, and to evaluate the new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.

P. Sao, R. Kannan, X. S. Li, and R. Vuduc, “A communication-avoiding 3d sparse triangular solver,” in Proceedings of the ACM International Conference on Supercomputing, ser. ICS ’19, Phoenix, Arizona: ACM, 2019, pp. 127–137, isbn: 978-1-4503-6079-1. doi: 10.1145/3330345.3330357.

Abstract: We present a novel distributed memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n1∕4) and O(n1∕6) for problems arising from the ﬁnite element discretizations of 2D ”planar” and 3D ”non-planar” PDEs, respectively. We implement our algorithm for use in SuperLU_DIST3D, using a hybrid MPI+OpenMP programming model. Our 3D triangular solve algorithm, when run on 12k cores of Cray XC30, outperforms the current state-of-the-art 2D algorithm by 7.2× for planar and 2.7× for the non-planar sparse matrices, respectively.

P. Sao, X. S. Li, and R. Vuduc, “A communication-avoiding 3d algorithm for sparse lu factorization on heterogeneous systems,” Journal of Parallel and Distributed Computing, 2019, issn: 0743-7315. doi: 10.1016/j.jpdc.2019.03.004. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0743731518305197.

Abstract: We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades oﬀ increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (speciﬁcally for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of Ologn and latency by a factor of Ologn. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by On13 times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Oﬄoad (Halo) algorithm for co-processor oﬄoad [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.

J. Scott and M. Tuma, “Sparse stretching for solving sparse-dense linear least-squares problems,” SIAM Journal on Scientiﬁc Computing, 2019, issn: 1095-7197.

Abstract: Large-scale linear least-squares problems arise in a wide range of practical applications. In some cases, the system matrix contains a small number of dense rows. These make the problem signiﬁcantly harder to solve because their presence limits the direct applicability of sparse matrix techniques. In particular, the normal matrix is (close to) dense, so that forming it is impractical. One way to help overcome the dense row problem is to employ matrix stretching. Stretching is a sparse matrix technique that improves sparsity by making the least-squares problem larger. We show that standard stretching can still result in the normal matrix for the stretched problem having an unacceptably large amount of ﬁll. This motivates us to propose a new sparse stretching strategy that performs the stretching so as to limit the ﬁll in the normal matrix and its Cholesky factor. Numerical examples from real problems are used to illustrate the potential gains.

D. di Seraﬁno and D. Orban, “Constraint-preconditioned krylov solvers for regularized saddle-point systems,” Oct. 7, 2019. doi: 10.5281/zenodo.3473542. arXiv:1910.02552 [math.NA].

Abstract: We consider the iterative solution of regularized saddle-point systems. When the leading block is symmetric and positive semi-deﬁnite on an appropriate subspace, Dollar, Gould, Schilders, and Wathen (2006) describe how to apply the conjugate gradient (CG) method coupled with a constraint preconditioner, a choice that has proved to be eﬀective in optimization applications. We investigate the design of constraint-preconditioned variants of other Krylov methods for regularized systems by focusing on the underlying basis-generation process. We build upon principles laid out by Gould, Orban, and Rees (2014) to provide general guidelines that allow us to specialize any Krylov method to regularized saddle-point systems. In particular, we obtain constraint-preconditioned variants of Lanczos and Arnoldi-based methods, including the Lanczos version of CG, MINRES, SYMMLQ, GMRES(m) and DQGMRES. We also provide MATLAB implementations in hopes that they are useful as a basis for the development of more sophisticated software. Finally, we illustrate the numerical behavior of constraint-preconditioned Krylov solvers using symmetric and nonsymmetric systems arising from constrained optimization.

H. Shaiek, S. Tomov, A. Ayala, A. Haidar, and J. Dongarra, “GPUDirect MPI communications and optimizations to accelerate FFTs on exascale systems,” ICL, Extended Abstract icl-ut-19-06, Sep. 2019.

Abstract: Fast Fourier transforms (FFTs) are used in applications ranging from molecular dynamics and spectrum estimation to machine learn- ing, fast convolution and correlation, signal modulation, wireless multimedia applications, and others. However, FFTs are memory bound, and therefore, to accelerate them, it is crucial to avoid and optimize the FFTs communications. To this end, we present a 3-D FFT design for distributed graphics processing unit (GPU) systems that: (1) eﬃciently uses GPUs high bandwidth, (2) reduces global communications algorithmically, when possible, and (3) employs GPUDirect technologies as well as MPI optimizations in the development of high-performance FFTs for large-scale GPU-accelerated systems. We show that these developments and optimizations lead to very good strong scalability and a performance that is close to 90% of the theoretical peak.

Y. Shi, “Eﬃcient tensor operations via compression and parallel computation,” PhD thesis, University of California, Irvine, 2019. [Online]. Available: https://escholarship.org/uc/item/2wm4k3sn.

Abstract: Linear algebra is the foundation of machine learning, especially for handling big data. We want to extract useful information that can represent the behavior of the data. For data with underlying known structures, it is straightforward to apply algorithms that maintain that structure. For instance, singular value decomposition (SVD) is one way to approximate lowrank matrices. The generalized SVD, tensor decomposition, is the crux of model estimation for tensors. However, not all data has a trivial structure. Multi-modality data that contains information from diﬀerent sources can be complex and hard to extract the structure. A data-independent randomized algorithm, such as sketching, is the solution for this case. Under both scenarios, the information extraction process may be statistically challenging as the problems are non-convex optimization problems. More importantly, the large size and the high-dimensionality of the data have been signiﬁcant obstacles in discovering hidden variables and summarizing them. Thus, how to improve high-dimensional data computation eﬃciency is vitally important.
This thesis contains the theoretical analysis for learning the underlying information from high-dimensional structured or non-structured data via tensor operations such as tensor decomposition and tensor sketching. It is easy to consider tensors as multi-dimensional vectors or matrices and apply vector/matrix-based algorithms to ﬁnd the solution. However, these methods omit multi-dimensionality of the data and can be computational ineﬃcient than considering the tensor as a whole. We show the superiority of our approximation algorithms over these methods from computation and memory eﬃciency point of views.
This thesis also discusses optimizing tensor operation computations from the high-performance computing aspect. Conventional methods treat tensors as ﬂattened matrices or vectors. Operations between tensors may require lots of permutations and reshapes. We propose new tensor algebra computation routines that avoid the prepossessing as much as possible. The value of this approach and its applications are recognized by NVIDIA. The proposed interface exists in the CUBLAS 8.0.

Z. Shi and A. Eryilmaz, “Cubic regularized ADMM with convergence to a local minimum in non-convex optimization,” in Proceedings of the 57th Annual Allerton Conference on Communication, Contrl, and Computing, ser. Allerton, 2019.

Abstract: How to escape saddle points is a critical issue in nonconvex optimization. Previous methods on this issue mainly assume that the objective function is Hessian-Lipschitz, which leave a gap for applications using non-Hessian-Lipschitz functions. In this paper, we propose Cubic Regularized Alternating Direction Method of Multipliers (CR-ADMM) to escape saddle points of separable non-convex functions containing a non-HessianLipschitz component. By carefully choosing a parameter, we prove that CR-ADMM converges to a local minimum of the original function with a rate of O() in time horizon T, which is faster than gradient-based methods. We also show that when one or more steps of CR-ADMM are not solved exactly, CRADMM can converge to a neighborhood of the local minimum. Through the experiments of matrix factorization problems, CRADMM is shown to have a faster rate and a lower optimality gap compared with other gradient-based methods. Our approach can also ﬁnd applications in other scenarios where regularized non-convex cost minimization is performed, such as parameter optimization of deep neural networks.

W. M. Sid-Lakhdar, M. M. Aznaveh, X. S. Li, and J. W. Demmel, “Multitask and transfer learning for autotuning exascale applications,” Aug. 15, 2019. arXiv:1908.05792 [cs.LG].

Abstract: Multitask learning and transfer learning have proven to be useful in the ﬁeld of machine learning when additional knowledge is available to help a prediction task. We aim at deriving methods following these paradigms for use in autotuning, where the goal is to ﬁnd the optimal performance parameters of an application treated as a black-box function. We show comparative results with state-of-the-art autotuning techniques. For instance, we observe an average 1.5× improvement of the application runtime compared to the OpenTuner and HpBandSter autotuners. We explain how our approaches can be more suitable than some state-of-the-art autotuners for the tuning of any application in general and of expensive exascale applications in particular.

F. Silvestri and F. Vella, “A computational model for tensor core units,” Aug. 19, 2019. arXiv:1908.06649v1 [cs.DS].

Abstract: To respond to the need of eﬃcient training and inference of deep neural networks, a pletora of domain-speciﬁc hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for eﬃciently computing a dense matrix multiplication of a given small size. In order to broad the class of algorithms that exploit these systems, we propose a computational model, named TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for linear algebra problems, including dense and sparse matrix multiplication, FFT, integer multiplication, and polynomial evaluation. We ﬁnally highlight a relation between the TCU model and the external memory model.

I. Sivkov, A. Lazzaro, and J. Hutter, “DBCSR: A library for dense matrix multiplications on distributed gpu-accelerated systems,” Oct. 10, 2019. arXiv:1910.04796 [cs.DC].

Abstract: Most, if not all the modern scientiﬁc simulation packages utilize matrix algebra operations. Among the operation of the linear algebra, one of the most important kernels is the multiplication of matrices, dense and sparse. Examples of application of such a kernel are in electronic structure calculations, machine learning, data mining, graph processing, and digital signal processing. Several optimized libraries exist that can achieve high-performance on distributed systems. Only a few of them target distributed GPU-accelerated systems. In most of the cases, these libraries are provided and optimized by system vendors for their speciﬁc computer systems. In this paper, we present the DBCSR library (Distributed Block Compressed Sparse Row) for the distributed dense matrix-matrix multiplications. Although the library is speciﬁcally designed for block-sparse matrix-matrix multiplications, we optimized it for the dense case on GPU-accelerated systems. We show that the DBCSR outperforms the multiplication of matrices of diﬀerent sizes and shapes provided by a vendor optimized GPU version of the ScaLAPACK library up to 2.5× (1.4× on average).

I. Sivkov, P. Seewald, A. Lazzaro, and J. Hutter, “DBCSR: A blocked sparse tensor algebra library,” Oct. 29, 2019. arXiv:1910.13555 [cs.DC].

Abstract: Advanced algorithms for large-scale electronic structure calculations are mostly based on processing multi-dimensional sparse data. Examples are sparse matrix-matrix multiplications in linear-scaling Kohn-Sham calculations or the eﬃcient determination of the exact exchange energy. When going beyond mean ﬁeld approaches, e.g. for Moller-Plesset perturbation theory, RPA and Coupled-Cluster methods, or the GW methods, it becomes necessary to manipulate higher-order sparse tensors. Very similar problems are also encountered in other domains, like signal processing, data mining, computer vision, and machine learning. With the idea that the most of the tensor operations can be mapped to matrices, we have implemented sparse tensor algebra functionalities in the frames of the sparse matrix linear algebra library DBCSR (Distributed Block Compressed Sparse Row). DBCSR has been speciﬁcally designed to eﬃciently perform blocked-sparse matrix operations, so it becomes natural to extend its functionality to include tensor operations. We describe the newly developed tensor interface and algorithms. In particular, we introduce the tensor contraction based on a fast rectangular sparse matrix multiplication algorithm.

D. Steck and C. Kanzow, “Regularization of limited memory quasi-newton methods for large-scale nonconvex minimization,” Nov. 11, 2019. arXiv:1911.04584 [math.OC].

Abstract: This paper deals with the unconstrained optimization of smooth objective functions. It presents a class of regularized quasi-Newton methods whose globalization turns out to be more eﬃcient than standard line search or trust-region strategies. The focus is therefore on the solution of large-scale problems using limited memory quasi-Newton techniques. Global convergence of the regularization methods is shown under mild assumptions. The details of the regularized limited memory quasi-Newton updates are discussed including their compact representations. Numerical results using all large-scale test problems from the CUTEst collection indicate that the regularization method outperforms the standard line search limited memory BFGS method.

G. Stramondo, C. B. Ciobanu, C. Laat, and A. L. Varbanescu, “Designing and building application-centric parallel memories,” Concurrency and Computation: Practice and Experience, Aug. 2019. doi: 10.1002/cpe.5485.

Abstract: Memory bandwidth is a critical performance factor for many applications and architectures. Intuitively, a parallel memory could be a good solution for any bandwidth-limited application, yet building application-centric custom parallel memories remains a challenge. In this work, we present a comprehensive approach to tackle this challenge and demonstrate how to systematically design and implement application-centric parallel memories. Speciﬁcally, our approach (1) analyzes the application memory access traces to extract parallel accesses, (2) conﬁgures our parallel memory for maximum performance, and (3) builds the actual application-centric memory system. We further provide a simple performance prediction model for the constructed memory system. We evaluate our approach with two sets of experiments. First, we demonstrate how our parallel memories provide performance beneﬁts for a broad range of memory access patterns. Second, we prove the feasibility of our approach and validate our performance model by implementing and benchmarking the designed parallel memories using FPGA hardware and a sparse version of the STREAM benchmark.

Y.-H. Tang, O. Selvitopi, D. Popovici, and A. Buluç, “A high-throughput solver for marginalized graph kernels on GPU,” Oct. 14, 2019. arXiv:1910.06310 [cs.DC].

Abstract: We present the design and optimization of a solver for eﬃcient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-ﬂy without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.

B. Tasseﬀ, C. Coﬀrin, A. Wächter, and C. Laird, “Exploring beneﬁts of linear solver parallelism on modern nonlinear optimization applications,” Sep. 17, 2019. arXiv:1909.08104 [math.OC].

Abstract: The advent of eﬃcient interior point optimization methods has enabled the tractable solution of large-scale linear and nonlinear programming (NLP) problems. A prominent example of such a method is seen in Ipopt, a widely-used, open-source nonlinear optimization solver. Algorithmically, Ipopt depends on the use of a sparse symmetric indeﬁnite linear system solver, which is heavily employed within the optimization of barrier subproblems. As such, the performance and reliability of Ipopt is dependent on the properties of the selected linear solver. Inspired by a trend in mathematical programming toward solving larger and more challenging NLPs, this work explores two core questions: ﬁrst, how does the scalability of available linear solvers, many of which exhibit shared-memory parallelism, impact Ipopt performance; and second, does the best linear solver vary across NLP problem classes, including nonlinear network problems and problems constrained by partial diﬀerential equations? To better understand these properties, this paper ﬁrst describes available open- and closed-source, serial and parallel linear solvers and the fundamental diﬀerences among them. Second, it introduces the coupling of a new open-source linear solver capable of heterogeneous parallelism over multi-core central processing units and graphics processing units. Third, it compares linear solvers using a variety of mathematical programming problems, including standard test problems for linear and nonlinear optimization, optimal power ﬂow benchmarks, and scalable two- and three-dimensional partial diﬀerential equation and optimal control problems. Finally, linear solver recommendations are provided to maximize Ipopt performance across diﬀerent application domains.

S. Tomov, A. Haidar, A. Ayala, H. Shaiek, and J. Dongarra, “FFT-ECP implementation optimizations and features phase,” Innovative Computing Laboratory, University of Tennessee, ECP WBS 2.3.3.09 Milestone Report FFT-ECP ST-MS-10-1440, Sep. 2019, revision 09-2019. [Online]. Available: https://www.icl.utk.edu/files/publications/2019/icl-utk-1263-2019.pdf.

Abstract: The goal of this milestone was the imlementation optimizations and features phase of 3-D FFTs in the FFT-ECP project. The target architectures are large-scale distributed GPU-accelerated platforms.
In this milestone we describe the implmentation optimizations, features, and performance of the 3-D FFTs that we developed for heterogeneous systems with GPUs. Speciﬁcally, this milestone delivered on the following sub-tasks: – Extend FFT-ECP to support various precisions, including real, and investigate the feasibility of mixed precision FFT solvers; – Develop support for ﬂexible data layouts and enable the new library to handle data conversion/communication on the backend in an optimized and dynamic adaptive fashion based on the communication cost model analyzed in the previous milestone; – Optimize the distributed 3-D FFT-ECP solver to enable multiple FFTs per MPI process (with accelerators) and multiple GPUs per node. A main part of this milestone were the performance optimizations and the additions of features that targeted ECP applications need.
The artifacts delivered include the performance optimizations and features added to the solvers, and a tuned FFT-ECP software, freely available on the FFT-ECP’s Git repository hosted on Bitbucket, [https://bitbucket.org/icl/heffte/](https://bitbucket.org/icl/heffte/). This is the ﬁrst software release under the FFT-ECP project. Released is a new FFT library, called heFFTe version 0.1 (Highly Eﬃcient FFTs for Exascale).
See also the FFT-ECP website, [http://icl.utk.edu/fft/](http://icl.utk.edu/fft/) for more details on the FFT-ECP project.

B. Uçar, “Partitioning, matching, and ordering: Combinatorial scientiﬁc computing with matrices and tensors,” École Normale Supérieure de Lyon, Tech. Rep., 2019.

Abstract: This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. The general research area is called combinatorial scientiﬁc computing (CSC). In CSC, the contributions have practical and theoretical ﬂavor. For all problems discussed in this document, we have the design, analysis, and implementation of algorithms along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to the original papers for the omitted proofs. A similar approach is taken for presenting the experiments. While most results for observing theoretical ﬁndings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analysis).

S. Usman, R. Mehmood, I. Katib, A. Albeshri, and S. M. Altowaijri, “Zaki: A smart method and tool for automatic performance optimization of parallel spmv computations on distributed memory machines,” Mobile Networks and Applications, 2019. doi: 10.1007/s11036-019-01318-3.

Abstract: SpMV is a vital computing operation of many scientiﬁc, engineering, economic and social applications, increasingly being used to develop timely intelligence for the design and management of smart societies. Several factors aﬀect the performance of SpMV computations, such as matrix characteristics, storage formats, software and hardware platforms. The complexity of the computer systems is on the rise with the increasing number of cores per processor, diﬀerent levels of caches, processors per node and high speed interconnect. There is an ever-growing need for new optimization techniques and eﬃcient ways of exploiting parallelism. In this paper, we propose ZAKI, a data-driven, machine-learning approach and tool, to predict the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine. The aim herein is to allow application scientists to automatically obtain the best conﬁguration, and hence the best performance, for the execution of SpMV computations. We train and test the tool using nearly 2000 real world matrices obtained from 45 application domains including computational ﬂuid dynamics (CFD), computer vision, and robotics. The tool uses three machine learning methods, decision trees, random forest, gradient boosting, and is evaluated in depth. A discussion on the applicability of our proposed tool to energy eﬃciency optimization of SpMV computations is given. This is the ﬁrst work where the sparsity structure of matrices have been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using diﬀerent base and ensemble machine learning methods.

J. Wang, Z. Huang, L. Kong, J. Xiao, P. Wang, L. Zhang, and C. Li, “Performance of training sparse deep neural networks on GPUs,” in Proceedings of the IEEE High Performance Extreme Computing Conference, ser. HPEC ’19, Sep. 2019, pp. 1–5. doi: 10.1109/HPEC.2019.8916506.

Abstract: Deep neural networks have revolutionized the ﬁeld of machine learning by dramatically improving the state-of-the-art in various domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to fast store and train them. Over the past few decades, researches have explored the prospect of sparse DNNs before, during, and after training by pruning edges from the underlying topology. After the above operation, the generated neural network is known as a sparse neural network. More recent works have demonstrated the remarkable results that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. Although existing methods ease the situation that high demand for computation resources severely hinders the deployment of large-scale DNNs in resource-constrained devices, DNNs can be trained at a faster speed and lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs and accelerate the training of DNNs. FSSL can: (1) learn a compact structure from large sparse DNN to reduce computation cost; (2) obtain a hardware-friendly to accelerate the DNNs evaluation eﬃciently. Experimental results of the training time and the compression rate show that superior performance and eﬃciency than the Matlab example code. These speedups are about twice speedups of non-structured sparsity.

D. B. Williams-Young, P. G. Beckman, and C. Yang, “A shift selection strategy for parallel shift-invert spectrum slicing in symmetric self-consistent eigenvalue computation,” Aug. 16, 2019. arXiv:1908.06043 [math.NA].

Abstract: The central importance of large scale eigenvalue problems in scientiﬁc computation necessitates the development massively parallel algorithms for their solution. Recent advances in dense numerical linear algebra have enabled the routine treatment of eigenvalue problems with dimensions on the order of hundreds of thousands on the world’s largest supercomputers. In cases where dense treatments are not feasible, Krylov subspace methods oﬀer an attractive alternative due to the fact that they do not require storage of the problem matrices. However, demonstration of scalability of either of these classes of eigenvalue algorithms on computing architectures capable of expressing excessive parallelism is non-trivial due to communication requirements and serial bottlenecks, respectively. In this work, we introduce the SISLICE method: a parallel shift-invert algorithm for the solution of the symmetric self-consistent ﬁeld (SCF) eigenvalue problem. The SISLICE method drastically reduces the communication requirement of current parallel shift-invert eigenvalue algorithms through various shift selection and migration techniques based on density of states estimation and k-means clustering, respectively. This work demonstrates the robustness and parallel performance of the SISLICE method on a representative set of SCF eigenvalue problems and outlines research directions which will be explored in future work.

M. Winter, D. Mlakar, R. Zayer, H.-P. Seidel, and M. Steinberger, “Adaptive sparse matrix-matrix multiplication on the GPU,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19, Washington, District of Columbia: ACM, 2019, pp. 68–81, isbn: 978-1-4503-6225-2. doi: 10.1145/3293883.3295701.

Abstract: In the ongoing eﬀorts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse Matrix-Vector multiplication (SpMV). While both are equally important, this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM.
In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations, our approach guarantees a homogeneous, load-balanced access pattern to the ﬁrst input matrix and improves memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes the time eﬃcient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern guarantees bit-stable results during repetitive execution, a property missing from other approaches. Evaluation on an extensive sparse matrix benchmark suggests our approach being the fastest SpGEMM implementation for highly sparse matrices (80% of the set). When bit-stable results are sought, our approach is the fastest across the entire test set.

R. Wu, “Dynamic scheduling strategy for block parallel cholesky factorization based on activity on edge network,” IEEE Access, vol. 7, pp. 66 317–66 324, 2019, issn: 2169-3536. doi: 10.1109/ACCESS.2019.2917714.

Abstract: The eﬃcient development of system software and design applications in parallel architecture is a notable challenge considering various aspects, such as load balancing, memory spaces, communication, and synchronization. This paper presents a block parallel Cholesky factorization algorithm for a multicore system, which is developed based on activity on edge network. First, the basic block computing tasks and their dependencies are taken as vertices and edges, respectively, and a directed acyclic graph corresponding to the speciﬁc block parallel Cholesky factorization is generated. Next, each edge of the directed acyclic graph is assigned to a weight equal to the processing time of the initial vertex of the edge, and the directed acyclic graph becomes an activity on edge network with only one starting and one ending vertex. Finally, a queuing algorithm is designed for the basic block computing tasks according to the edge activity on edge network, and a dynamic scheduling strategy is developed for block parallel Cholesky factorization. The results of the experiments concerning the parallel execution time of the algorithm in multicore systems with diﬀerent conﬁgurations demonstrate that the proposed algorithm has notable advantages compared with the traditional static scheduling algorithm, and it exhibits satisfactory load balancing, parallelism, and scalability capacities.

G. Xiao, K. Li, Y. Chen, W. He, A. Zomaya, and T. Li, “CASpMV: A customized and accelerative spmv framework for the sunway taihulight,” IEEE Transactions on Parallel and Distributed Systems, pp. 1–1, 2019, issn: 1045-9219. doi: 10.1109/TPDS.2019.2907537.

Abstract: The Sunway TaihuLight, equipped with 10 million cores, is currently the world’s third fastest supercomputer. SpMV is one of core algorithms in many high-performance computing applications. This paper implements a ﬁne-grained design for generic parallel SpMV based on the special Sunway architecture and ﬁnds three main performance limitations, i.e., storage limitation, load imbalance, and huge overhead of irregular memory accesses. To address these problems, this paper introduces a customized and accelerative framework for SpMV (CASpMV) on the Sunway. The CASpMV customizes an auto-tuning four-way partition scheme for SpMV based on the proposed statistical model, which describes the sparse matrix structure characteristics, to make it better ﬁt in with the computing architecture and memory hierarchy of the Sunway. Moreover, the CASpMV provides an accelerative method and customized optimizations to avoid irregular memory accesses and further improve its performance on the Sunway. Our CASpMV achieves a performance improvement that ranges from 588.05% to 2118.62% over the generic parallel SpMV on a CG (which corresponds to an MPI process) of the Sunway on average and has good scalability on multiple CGs. The performance comparisons of the CASpMV with state-of-the-art methods on the Sunway indicate that the sparsity and irregularity of data structures have less impact on CASpMV.

J. Xie and Y. Liang, “Spart: Optimizing cnns by utilizing both sparsity of weights and feature maps,” in Advanced Parallel Processing Technologies, P.-C. Yew, P. Stenström, J. Wu, X. Gong, and T. Li, Eds., Cham: Springer International Publishing, 2019, pp. 71–85, isbn: 978-3-030-29611-7.

Abstract: Intense convolution computation and great memory requirement in CNNs constraint their wider deployments and applications. Although both the weights and feature maps in CNNs can be sparse, directly mapping sparse convolution to spGEMM in HPC domain fails to improve the actual performance. Besides, existing sparse formats like CSR are not suitable for encoding the sparse feature maps because convolution operates across rows.

Z. Xie, G. Tan, W. Liu, and N. Sun, “IA-SpGEMM: An input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication,” in Proceedings of the 33rd ACM Conference on Supercomputing, ser. ICS ’19, Phoenix, AZ, USA, 2019. [Online]. Available: https://folk.idi.ntnu.no/weifengl/papers/spgemm_xie_ics19.pdf.

Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a sparse kernel that is used in a number of scientiﬁc applications. Although several SpGEMM algorithms have been proposed, almost all of them are restricted to the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. The particular format and algorithm that yield the best performance for SpGEMM also remain undetermined.
In this work, we conduct a prospective study on format-speciﬁc parallel SpGEMM algorithms, and analyze their pros and cons. We then propose IA-SpGEMM, an input-aware auto-tuning Framework for SpGEMM, that provides a uniﬁed programming interface in the CSR format and automatically determines the best format and algorithm for arbitrary sparse matrices. For this purpose, we set-up an algorithm set and design a deep learning model called MatNet that is trained by over 2,700 matrices from the SuiteSparse Matrix Collection to quickly and accurately predict the best solution by using sparse features and density representations. We evaluate our framework on CPUs and a GPU, and the results show that IA-SpGEMM is on average 3.27× and 13.17× faster than MKL on an Intel and an AMD platform, respectively, and is 2.23× faster than cuSPARSE on an NVIDIA GPU.

W. Xu, S. Fan, T. Wang, and Y. Zhou, “Blocking and sparsity for optimization of convolution calculation algorithm on gpus,” Sep. 22, 2019. arXiv:1909.09927 [cs.DC].

Abstract: Convolution neural network (CNN) plays a paramount role in machine learning, which has made signiﬁcant contributions, such as medical image classiﬁcation, natural language processing, and recommender system. The success convolution neural network achieved excellent performance with fast execution time. Due to the convolution operation dominate the total operation time of Convolution neural network. In this paper, we propose a novel convolution method of Graphic Processing Units (GPUs), which reduce the convolution operation time and improve the execution speed approximately 2× than the state of the art convolution algorithm. Our work based on the observation is that the sparsity of the input feature map of convolution operation is relatively large, and the zero value of the feature map is redundancy for convolution result. Therefore, we skip the zero value calculation and improve the speed by compressing the feature map. Besides, the shape of the feature map for the deep network is small, and the number of threads is limited. Therefore, for a limited number of threads, it is necessary to reduce the amount of calculation to increase the calculation speed. Our algorithm has a good eﬀect on the convolution operation of the feature map of the deep network with large sparsity and small size. In this work, our contributions can be summarized as follows: 1) A novel store format for hight-sparsity feature map. 2) A novel convolution algorithm based on block compression and Shared memory is proposed. 3) A feature map data-set for convolution algorithm optimization. 4) We performed a single-layer convolution comparison experiment with CuDNN for diﬀerent models, and it is best to achieve 3.5× speedup. We also implemented the algorithm on the VGG-19 model, which can achieve 1.3×∼2.9× speedup in deep convolution operation, and the entire network can achieve 2.3× speedup.

Z. Xu, X. Chen, J. Shen, Y. Zhang, C. Chen, and C. Yang, “GARDENIA: A graph processing benchmark suite for next-generation accelerators,” ACM Journal on Emerging Technologies in Computing Systems, vol. 15, no. 1, 9:1–9:13, Jan. 2019, issn: 1550-4832. doi: 10.1145/3283450.

Abstract: This article presents the Graph Algorithm Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular graph algorithms on massively parallel accelerators. Applications with limited control and data irregularity are the main focus of existing generic benchmarks for accelerators, while available graph processing benchmarks do not apply state-of-the-art algorithms and/or optimization techniques. GARDENIA includes emerging graph processing workloads from graph analytics, sparse linear algebra, and machine-learning domains, which mimic massively multithreaded commercial programs running on modern large-scale datacenters. Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior, which is quite diﬀerent from structured workloads and straightforward-implemented graph benchmarks.

Y. Yamamoto, “High-performance algorithms for numerical linear algebra,” in The Art of High Performance Computing for Computational Science, Vol. 1: Techniques of Speedup and Parallelization for General Purposes, M. Geshi, Ed. Singapore: Springer Singapore, 2019, pp. 113–136, isbn: 9789811361944. doi: 10.1007/978-981-13-6194-4˙7.

Abstract: Matrix computations lie at the heart of many scientiﬁc computations. While sophisticated algorithms have been established for various numerical linear algebra problems such as the solution of linear simultaneous equationsLinear simultaneous equation and eigenvalue problemsEigenvalue problem, they require considerable modiﬁcation with the advent of exaFLOPS- scale supercomputers, which are expected to have a huge number of computing cores, deep memory hierarchyMemory hierarchy, and increased probability of hardware errors. In this chapter, we discuss projected hardware characteristics of exaFLOPS machines and summarize the challenges to be faced by numerical linear algebra algorithms in the near future. Based on these preparations, we present a brief survey of recent research eﬀorts in the ﬁeld of numerical linear algebra targeted at meeting these challenges.

C. Y. Yang, “High-performance linear algebra-based graph framework on the GPU,” PhD thesis, University of California, Davis, 2019. [Online]. Available: https://escholarship.org/uc/item/37j8j27d.

Abstract: High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs, because of three challenges: (1) diﬃculty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic ratio. To address these challenges, GraphBLAS is an innovative, on-going eﬀort by the graph analytics community to propose building blocks based in sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. Initial research eﬀorts in implementing GraphBLAS on GPUs has been promising, but performance still trails by an order of magnitude compared to state-of-the-art graph frameworks using the traditional graph-centric approach of describing operations on vertices or edges.
This dissertation examines the performance challenges of a linear algebra-based approach to building graph frameworks and describes new design principles for overcoming these bottlenecks. Among the new design principles is making exploiting input sparsity a ﬁrst-class citizen in the framework. This is an especially important optimization, because it allows users to write graph algorithms without specifying certain implementation details thus permitting the software backend to choose the optimal implementation based on the input sparsity. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. We examine when it is proﬁtable to exploit this output sparsity to reduce computational complexity. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with diﬀerent characteristics.
The design principles described in the thesis have been implemented in GraphBLAST, an open-source high-performance graph framework on GPU developed as part of this dissertation. It is notable for being the ﬁrst graph framework based in linear algebra to get comparable or faster performance compared to the traditional, vertex-centric backends. The beneﬁts of design principles described in this thesis have been shown to be important for single GPU, and it will grow in importance when it serves as a building block for distributed implementation in the future and as a single GPU backend for higher-level languages such as Python. A graph framework based in linear algebra not only improves performance of existing graph algorithms, but in quickly prototyping new algorithms as well.

C. Yang, A. Buluç, and J. D. Owens, “GraphBLAST: A high-performance linear algebra-based graph framework on the GPU,” 2019. arXiv: 1908.01407v2 [cs.DC].

Abstract: High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs, because of three challenges: (1) diﬃculty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address these challenges, GraphBLAS is an innovative, on-going eﬀort by the graph analytics community to propose building blocks based in sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with diﬀerent characteristics. The design principles described in this paper have been implemented in ”GraphBLAST”, the ﬁrst open-source linear algebra-based graph framework on GPU targeting high-performance computing. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while oﬀering a simpler and more concise programming model.

T. Yang, “Advancing non-convex and constrained learning: Challenges and opportunities,” AI Matters, vol. 5, pp. 29–39, 3 2019. doi: 10.1145/3362077.3362085.

A. Yaşar and Ü. V. Çatalyürek, “Heuristics for symmetric rectilinear matrix partitioning,” Sep. 26, 2019. arXiv:1909.12209 [cs.DS].

Abstract: Partitioning sparse matrices and graphs is a common and important problem in many scientiﬁc and graph analytics applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices, which is needed for tiled (or blocked) execution of sparse matrix and graph analytics kernels. More speciﬁcally, in this work, we address the problem of symmetric rectilinear partitioning of square matrices. By symmetric, we mean having the same partition on rows and columns of the matrix, yielding a special tiling where the diagonal tiles (blocks) will be squares. We propose ﬁve heuristics to solve two diﬀerent variants of this problem, and present a thorough experimental evaluation showing the eﬀectiveness of the proposed algorithms.

Yaşar, Abdurrahman, Sivasankaran Rajamanickam, et al. “Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments”. In: Proceedings of the IEEE High Performance Extreme Computing Conference. HPEC ’19. Sept. 2019, pp. 1–4. doi: 10.1109/HPEC.2019.8916492.

Abstract: Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on ﬁne-grained tasking based on a tile layout. We adopt this task based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8× speed up over past year’s graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-toend execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs.

Yu, Ting and Mengchi Liu. “A Memory Eﬃcient Clique Enumeration Method for Sparse Graphs with a Parallel Implementation”. In: Parallel Computing (2019). issn: 0167-8191. doi: 10.1016/j.parco.2019.05.005. url: http://www.sciencedirect.com/science/article/pii/S0167819118301297.

Abstract: Maximal clique enumeration (MCE) is a widely studied problem that plays a crucial role in structure mining of undirected graphs. The increasing scale of real-world graphs has brought the challenges of high memory cost and high CPU workload to the problem. In this paper, we propose a memory eﬃcient method named CMC-bit for MCE on sparse graphs. It reduces the memory cost via minimizing the candidate cliques and representing them by the data structure bitset. It generates an appropriate order for the vertex set according to two optimized principles to reduce the CPU cost. We further design a partition-based CMC-bit algorithm with a one-side extending strategy to solve the memory-limited problem. We parallelize the CMC-bit algorithm based on MapReduce with a range-based partition strategy to make an optimal trade-oﬀ between the shuﬄing workload of graph decomposition and load balance in the Reduce phase. We conduct extensive experiments on 30 real-world datasets. The results demonstrate that both the CMC-bit algorithm and its parallel implementation signiﬁcantly outperform the respective state-of-the-art algorithms in speed. We also show that the parallel CMC-bit algorithm achieves good performance on the scalability with respect to both the reducer number and the CPU number.

Yuan, Ganzhao, Li Shen, and Wei-Shi Zheng. “A Block Decomposition Algorithm for Sparse Optimization”. 2019. arXiv:1905.11031.

Abstract: Sparse optimization is a central problem in machine learning and computer vision. However, this problem is inherently NP-hard and thus diﬃcult to solve in general. Combinatorial search methods ﬁnd the global optimal solution but are conﬁned to small-sized problems, while coordinate descent methods are eﬃcient but often suﬀer from poor local minima. This paper considers a new block decomposition algorithm that combines the eﬀectiveness of combinatorial search methods and the eﬃciency of coordinate descent methods. Speciﬁcally, we consider a random strategy or/and a greedy strategy to select a subset of coordinates as the working set, and then perform a global combinatorial search over the working set based on the original objective function. We show that our method ﬁnds stronger stationary points than Amir Beck et al.’s coordinate-wise optimization method. In addition, we establish the global convergence and convergence rate of our block decomposition algorithm. Our experiments on solving sparse regularized and sparsity constrained least squares optimization problems demonstrate that our method achieves state-ofthe-art performance in terms of accuracy.

Zhang, Kai et al. “Greedy Orthogonal Pivoting Algorithm for Non-negative Matrix Factorization”. In: Proceedings of the 36th International Conference on Machine Learning. PMLR’19. 2019. url: http://proceedings.mlr.press/v97/zhang19r/zhang19r.pdf.

Abstract: Non-negative matrix factorization is a powerful tool for learning useful representations in the data and has been widely applied in many problems such as data mining and signal processing. Orthogonal NMF, which can further improve the locality of decomposition, has drawn considerable interest in clustering problems. However, imposing simultaneous non-negative and orthogonal structure can be diﬃcult, and so existing algorithms can only solve it approximately. To address this challenge, we propose an innovative procedure called Greedy Orthogonal Pivoting Algorithm (GOPA). The GOPA method fully exploits the sparsity of non-negative orthogonal solutions to break the global problem into a series of local optimizations, in which an adaptive subset of coordinates are updated in a greedy, closed-form manner. The biggest advantage of GOPA is that it promotes exact orthogonality and provides solid empirical evidence that stronger orthogonality does contribute favorably to better clustering performance. On the other hand, we have designed randomized and batch-mode version of GOPA, which can further reduce the computational cost and improve accuracy, making it suitable for large data.

Zhang, Yongzhe, Ariful Azad, and Zhenjiang Hu. “FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence”. Oct. 14, 2019. arXiv:1910.05971 [cs.DS].

Abstract: This paper presents a new distributed-memory algorithm called FastSV for ﬁnding connected components in an undirected graph. Our algorithm simpliﬁes the classic Shiloach-Vishkin algorithm and employs several novel and eﬃcient hooking strategies for faster convergence. We map diﬀerent steps of FastSV to linear algebraic operations and implement them with the help of scalable graph libraries. FastSV uses sparse operations to avoid redundant work and optimized MPI communication to avoid bottlenecks. The resultant algorithm shows high-performance and scalability as it can ﬁnd the connected components of a hyperlink graph with over 134B edges in 30 seconds using 262K cores on a Cray XC40 supercomputer. FastSV outperforms the state-of-the-art algorithm by an average speedup of 2.21× (max 4.27×) on a variety of real-world graphs.

Zhang, Yunming et al. “PriorityGraph: A Uniﬁed Programming Model for Optimizing Ordered Graph Algorithms”. Nov. 17, 2019. arXiv:1911.07260 [cs.PL].

Abstract: Many graph problems can be solved using ordered parallel graph algorithms that achieve signiﬁcant speedup over their unordered counterparts by reducing redundant work. This paper introduces PriorityGraph, a new programming framework that simpliﬁes writing high-performance parallel ordered graph algorithms. The key to PriorityGraph is a novel priority-based abstraction that enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. PriorityGraph is implemented as a language and compiler extension to GraphIt. The PriorityGraph compiler extension leverages new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together diﬀerent rounds of ordered algorithms to reduce synchronization overhead, resulting in 1.2× to 3× speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. PriorityGraph achieves up to 3× speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms.

Zhang, Zecheng et al. “Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels”. Oct. 6, 2019. arXiv:1910.02371 [cs.DC].

Abstract: Tensor computations are increasingly prevalent numerical techniques in data science.However, innovation and deployment of methods on large sparse tensor datasets are made challenging by the diﬃculty of eﬃcient implementation thereof.We provide a Python extension to the Cyclops tensor algebra library, which fully automates the management of distributed-memory parallelism and sparsity for NumPy-style operations on multidimensional arrays.We showcase this functionality with novel high-level implementations of three algorithms for the tensor completion problem: alternating least squares (ALS) with an implicit conjugate gradient method, stochastic gradient descent (SGD), and coordinate descent (CCD++).To make possible tensor completion for very sparse tensors, we introduce a new multi-tensor routine that is asymptotically more eﬃcient than pairwise tensor contraction for key components of the tensor completion methods.Further, we add support for hypersparse matrix representations to Cyclops.We provide microbenchmarking results on the Stampede2 supercomputer to demonstrate the eﬃciency of this functionality.Finally, we study the accuracy and performance of the tensor completion methods for a synthetic tensor with 10 billion nonzeros and the Netﬂix dataset.

Zhao, K. et al. “SQL-G: Eﬃcient Graph Analytics by SQL”. In: IEEE Transactions on Knowledge and Data Engineering (2019). doi: 10.1109/TKDE.2019.2950620.

Abstract: Querying graphs and conducting graph analytics become important in data processing since there are many real applications dealing with massive graphs, such as online social networks, Semantic Web, knowledge graphs, etc. Over the years, many distributed graph processing systems have been developed to support graph analytics using various programming models, and many graph querying languages have been proposed. A natural question that arises is how to integrate graph data and traditional non-graph data in a distributed system for users to conduct analytics. There are two issues. One issue is related to expressiveness on how to specify graph analytics as well as data analytics by a querying language. The other issue is related to eﬃciency on how to process analytics in a distributed system. For the ﬁrst issue, SQL is a best candidate, since SQL is a well-accepted language for data processing. We concentrate on SQL for graph analytics. Our early work shows that graph analytics can be supported by SQL in a way from ”semiring + while” to ”relational algebra + while” via the enhanced recursive SQL queries. In this paper, we focus on the second issue on how to process such enhanced recursive SQL queries based on the GAS (Gather-Apply-Scatter) model under which eﬃcient graph processing systems can be developed. To demonstrate the eﬃciency, we implemented a system by tightly coupling Spark SQL and GraphX on Spark which is one of the most popular in-memory data-ﬂow processing platforms. First, we enhance Spark SQL by adding the capability of supporting the enhanced recursive SQL queries for graph analytics. In this regard, graph analytics can be processed using a distributed SQL engine alone. Second, we further propose new transformation rules to optimize/translate the operations for recursive SQL queries to the operations by GraphX. In this regard, graph analytics by SQL can be processed in a similar way as done by a distributed graph processing system using the APIs provided by the system. We conduct extensive performance studies to test graph analytics using large real graphs. We show that our approach can achieve similar or even higher eﬃciency, in comparison to the built-in graph algorithms in the existing graph processing systems.

Zheng, Hua, Seakweng Vong, and Ling Liu. “A direct preconditioned modulus-based iteration method for solving nonlinear complementarity problems of H-matrices”. In: Applied Mathematics and Computation 353 (2019), pp. 396–405. issn: 0096-3003. doi: 10.1016/j.amc.2019.02.015. url: http://www.sciencedirect.com/science/article/pii/S0096300319301134.

Abstract: In this paper, we establish a direct preconditioned modulus-based iteration method for solving a class of nonlinear complementarity problems with the system matrix being an H-matrix. The convergence theorems of the proposed method are given, which generalize and improve the existing ones. Numerical examples show that the proposed method is eﬃcient.

Zheng, Tianqi, Zhibin Zhang, and Xueqi Cheng. “SilverChunk: An Eﬃcient In-Memory Parallel Graph Processing System”. In: Database and Expert Systems Applications. Ed. by Sven Hartmann et al. Springer International Publishing, 2019, pp. 222–236. isbn: 978-3-030-27618-8.

Abstract: One of the main constructs of graph processing is the two-level nested loop structure. Parallelizing nested loops is notoriously unfriendly to both CPU and memory access when dealing with real graph data due to its skewed distribution. To address this problem, we present SilverChunk, a high performance graph processing system. SilverChunk builds edge chunks of equal size from original graphs and unfolds nested loops statically in pull-based executions (VR-Chunk) and dynamically in push-based executions (D-Chunk). VR-Chunk slices the entire graph into several chunks. A virtual vertex is generated pointing to the ﬁrst half of each sliced edge list so that no edge list lives in more than one chunk. D-Chunk builds its chunk list via binary searching over the preﬁx degree sum array of the active vertices. Each chunk has a local buﬀer for conﬂict-free maintenance of the next frontier. By changing the units of scheduling from edges to chunks, SilverChunk achieves better CPU and memory utilization. SilverChunk provides a high level programming interface combined with multiple optimization techniques to help developing eﬃcient graph processing applications. Our evaluation results reveal that SilverChunk outperforms state-of-the-art shared-memory graph processing systems by up to 4 × 4, including Gemini, Grazelle, etc. Moreover, it has lower memory overheads and nearly zero pre-processing time.

Zhou, Gan et al. “GPU-accelerated sparse matrices parallel inversion algorithm for large-scale power systems”. In: International Journal of Electrical Power & Energy Systems 111 (2019), pp. 34–43. issn: 0142-0615. doi: 10.1016/j.ijepes.2019.03.074. url: http://www.sciencedirect.com/science/article/pii/S0142061518325109.

Abstract: State-of-the-art Graphics Processing Unit (GPU) has superior performances on ﬂoat-pointing calculation and memory bandwidth, and therefore has great potential in many computationally intensive power system applications, one of which is the inversion of large-scale sparse matrix. It is a fundamental component for many power system analyses which requires to solve massive number of forward and backward substitution (F&B) subtasks and seems to be a good GPU-accelerated candidate application. By means of solving multiple F&B subtasks concurrently and a serial of performance tunings in compliance with GPU’s architectures, we successfully develop a batch F&B algorithm on GPUs, which not only extracts the intra-level and intra-level parallelisms inside single F&B subtask but also explores a more regular parallelism among massive F&B subtasks, called inter-task parallelism. Case study on a 9241-dimension case shows that the proposed batch F&B solver consumes 2.92 μs per forward substitution (FS) subtask when the batch size is equal to 3072, achieving 65 times speedup relative to KLU library. And on the basis the complete design process of GPU-based inversion algorithm is proposed. By oﬄoading the tremendous computational burden to GPU, the inversion of 9241-dimension case consumes only 97 ms, which can achieve 8.1 times speedup relative to the 12-core CPU inversion solver based on KLU library. The proposed batch F&B solver is practically very promising in many other power system applications requiring solving massive F&B subtasks, such as probabilistic power ﬂow analysis.

Zhu, G., P. Jiang, and G. Agrawal. “A Methodology for Characterizing Sparse Datasets and Its Application to SIMD Performance Prediction”. In: Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques. PACT ’19. Sept. 2019, pp. 445–456. doi: 10.1109/PACT.2019.00042.

Abstract: Irregular computations are commonly seen in many scientiﬁc and engineering domains that use unstructured meshes or sparse matrices. The performance of an irregular application is very dependent upon the dataset. This paper poses the following question: ”given an unstructured mesh or a graph, what method(s) can be used to sample it, such that the execution on the resulting sampled dataset can accurately reﬂect performance characteristics on the full dataset”. Our ﬁrst insight is that developing a universal sampling approach for all sparse matrices is unpractical. According to the non-zero distribution of the sparse matrix, we propose two novel sampling strategies: Stride Average sampling and Random Tile sampling, which are suitable for uniform and skewed sparse matrices respectively. To help categorize a sparse matrix as uniform or skewed, we introduce clustering coeﬃcient as an important feature which can be propagated into the decision tree model. We also adapt Random Node Neighbor sampling approach for eﬃcient estimation of clustering coeﬃcient. We apply our unstructured dataset characterization approach to modeling the performance for SIMD irregular applications, where the sampled dataset obtained is used to predict cache miss rate and SIMD utilization ratio. We also build analytical models to estimate overheads incurred by load imbalance among threads. With knowledge of these factors, we adapt a code skeleton framework SKOPE to capture the workload behaviors and aggregate performance statistics for execution time prediction.

Zhu, Maohua et al. “Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs”. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO ’52. Columbus, OH, USA: ACM, 2019, pp. 359–371. isbn: 978-1-4503-6938-1. doi: 10.1145/3352460.3358269.

Abstract: Deep neural networks have become the compelling solution for the applications such as image classiﬁcation, object detection, speech recognition, and machine translation. However, the great success comes at the cost of excessive computation due to the over-provisioned parameter space. To improve the computation eﬃciency of neural networks, many pruning techniques have been proposed to reduce the amount of multiply-accumulate (MAC) operations, which results in high sparsity in the networks.
Unfortunately, the sparse neural networks often run slower than their dense counterparts on modern GPUs due to their poor device utilization rate. In particular, as the sophisticated hardware primitives (e.g., Tensor Core) have been deployed to boost the performance of dense matrix multiplication by an order of magnitude, the performance of sparse neural networks lags behind signiﬁcantly.
In this work, we propose an algorithm and hardware co-design methodology to accelerate the sparse neural networks. A novel pruning algorithm is devised to improve the workload balance and reduce the decoding overhead of the sparse neural networks. Meanwhile, new instructions and micro-architecture optimization are proposed in Tensor Core to adapt to the structurally sparse neural networks. Our experimental results show that the pruning algorithm can achieve 63% performance gain with model accuracy sustained. Furthermore, the hardware optimization gives an additional 58% performance gain with negligible area overhead.

Nicholas I. M. Gould, Tyrone Rees and Jennifer A. Scott. “Convergence and evaluation-complexity analysis of a regularized tensor-Newton method for solving nonlinear least-squares problems”. In: Computational Optimization and Applications. doi: 10.1007/s10589-019-00064-2.

Abstract: Given a twice-continuously diﬀerentiable vector-valued function r(x), a local minimizer of ∥r(x)∥2 is sought. We propose and analyse tensor-Newton methods, in which r(x) is replaced locally by its second-order Taylor approximation. Convergence is controlled by regularization of various orders. We establish global convergence to a ﬁrst-order critical point of ∥r(x)∥2, and provide function evaluation bounds that agree with the best-known bounds for methods using second derivatives. Numerical experiments comparing tensor-Newton methods with regularized Gauss–Newton and Newton methods demonstrate the practical performance of the newly proposed method.

This year has been particularly exciting for the fields of numerical linear algebra and parallel computing. I managed to add 209 entries to my bibliography, which means I added a new paper about every 1.7 days! To me, this is a big step up compared to last year’s 56 papers.