# A selection of papers on numerical linear algebra and parallel computing published in 2018

About two years ago, I setup a Google Scholar alert to stay up-to-date with the literature on the topics I discussed in my PhD dissertation. I took the habit of recording the most interesting papers on Jabref, together with their abstract.

Here is a list of such papers for 2018 (up until today):

Matching entries: 0
settings...
 Abdelfattah A, Haidar A, Tomov S and Dongarra J (2018), "Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-14. [Abstract] [BibTeX] [DOI] Abstract: Graphics Processing Units (GPUs) are widely used in accelerating dense linear solvers. The matrix factorizations, which dominate the runtime for these solvers, are often designed using a hybrid scheme, where GPUs perform trailing matrix updates, while the CPUs perform the panel factorizations. Consequently, hybrid solutions require high-end CPUs and optimized CPU software in order to deliver high performance. Furthermore, they lack the energy efficiency inherent for GPUs due to the use of less energy-efficient CPUs, as well as CPU-GPU communications. BibTeX: @article{Abdelfattah2018, author = {Abdelfattah, A. and Haidar, A. and Tomov, S. and Dongarra, J.}, title = {Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2018}, pages = {1--14}, doi = {10.1109/TPDS.2018.2842785} }  Abdelfattah A, Haidar A, Tomov S and Dongarra J (2018), "Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization", In IEEE High Performance Extreme Computing Conference. Waltham, MA, September, 2018. IEEE. [Abstract] [BibTeX] Abstract: This paper introduces several frameworks for the design and implementation of high performance GPU kernels that target batch workloads with irregular sizes. Such workloads are ubiquitous in many scientific applications, including sparse direct solvers, astrophysics, and quantum chemistry. The paper addresses two main categories of frameworks, taking the Cholesky factorization as a case study. The first uses host-side kernel launches, and the second uses device-side launches. Within each category, different design options are introduced, with an emphasis on the advantages and the disadvantages of each approach. Our best performing design outperforms the state-of-the-art CPU implementation, scoring up to 4.7× speedup in double precision on a Pascal P100 GPU. BibTeX: @conference{Abdelfattah2018a, author = {Abdelfattah, Ahmad and Haidar, Azzam and Tomov, Stanimire and Dongarra, Jack}, title = {Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization}, booktitle = {IEEE High Performance Extreme Computing Conference}, publisher = {IEEE}, year = {2018} }  Abe K (2018), "On Convergence Speed of Parallel Variants of BiCGSTAB for Solving Linear Equations", In Methods and Applications for Modeling and Simulation of Complex Systems. Singapore , pp. 401-413. Springer Singapore. [Abstract] [BibTeX] [DOI] Abstract: A number of hybrid Bi-Conjugate Gradient (Bi-CG) methods such as the Bi-CG STABilized (BiCGSTAB) method have been developed for solving linear equations. BiCGSTAB has been most often used for efficiently solving the linear equations, but we have sometimes seen the convergence behavior with a long stagnation phase. In such cases, it is important to have Bi-CG coefficients that are as accurate as possible, and the stabilization strategy for improving the accuracy of the Bi-CG coefficients has been proposed. In present petascale high-performance computing hardware, the main bottleneck of Krylov subspace methods for efficient parallelization is the inner products which require a global reduction. The parallel variants of BiCGSTAB such as communication avoiding and pipelined BiCGSTAB reducing the number of global communication phases and hiding the communication latency have been proposed. However, the numerical stability, specifically, the convergence speed of the parallel variants of BiCGSTAB has not previously been clarified on problems with situations where the convergence is slow (strongly affected by rounding errors). In this paper, therefore, we examine the convergence speed between the standard BiCGSTAB and the parallel variants, and the effectiveness of the stabilization strategy by numerical experiments on the problems where the convergence has a long stagnation phase. BibTeX: @inproceedings{Abe2018, author = {Abe, Kuniyoshi}, editor = {Li, Liang and Hasegawa, Kyoko and Tanaka, Satoshi}, title = {On Convergence Speed of Parallel Variants of BiCGSTAB for Solving Linear Equations}, booktitle = {Methods and Applications for Modeling and Simulation of Complex Systems}, publisher = {Springer Singapore}, year = {2018}, pages = {401--413}, doi = {10.1007/978-981-13-2853-4_31} }  Abubaker NFT, Akbudak K and Aykanat C (2018), "Spatiotemporal Graph and Hypergraph Partitioning Models for Sparse Matrix-Vector Multiplication on Many-Core Architectures", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. [Abstract] [BibTeX] [DOI] Abstract: There exist graph/hypergraph partitioning-based row/column reordering methods for encoding either spatial or temporal locality separately for sparse matrix-vector multiplication (SpMV) operations. Spatial and temporal hypergraph models in these methods are extended to encapsulate both spatial and temporal localities based on cut/uncut net categorization obtained from vertex partitioning. These extensions of spatial and temporal hypergraph models encode the spatial locality primarily and the temporal locality secondarily, and vice-versa, respectively. However, the literature lacks models that simultaneously encode both spatial and temporal localities utilizing only vertex partitioning for further improving the performance of SpMV on shared-memory architectures. In order to fill this gap, we propose a novel spatiotemporal hypergraph model that leads to a one-phase spatiotemporal reordering method which encodes both types of locality simultaneously. We also propose a framework for spatiotemporal methods which encodes both types of locality in two dependent phases and two separate phases. The validity of the proposed spatiotemporal models and methods are tested on a wide range of sparse matrices and the experiments are performed on both a 60-core Intel Xeon Phi processor and a Xeon processor. Results show the validity of the methods via almost doubling the Gflop/s performance through enhancing data locality in parallel SpMV operations. BibTeX: @article{Abubaker2018, author = {Abubaker, N. F. T. and Akbudak, K. and Aykanat, C.}, title = {Spatiotemporal Graph and Hypergraph Partitioning Models for Sparse Matrix-Vector Multiplication on Many-Core Architectures}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2018}, pages = {1--1}, doi = {10.1109/TPDS.2018.2864729} }  Acer S, Selvitopi O and Aykanat C (2018), "Optimizing nonzero-based sparse matrix partitioning models via reducing latency", Journal of Parallel and Distributed Computing. [Abstract] [BibTeX] [DOI] [URL] Abstract: For the parallelization of sparse matrix--vector multiplication (SpMV) on distributed memory systems, nonzero-based fine-grain and medium-grain partitioning models attain the lowest communication volume and computational imbalance among all partitioning models. This usually comes, however, at the expense of high message count, i.e., high latency overhead. This work addresses this shortcoming by proposing new fine-grain and medium-grain models that are able to minimize communication volume and message count in a single partitioning phase. The new models utilize message nets in order to encapsulate the minimization of total message count. We further fine-tune these models by proposing delayed addition and thresholding for message nets in order to establish a trade-off between the conflicting objectives of minimizing communication volume and message count. The experiments on an extensive dataset of nearly one thousand matrices show that the proposed models improve the total message count of the original nonzero-based models by up to 27% on the average, which is reflected on the parallel runtime of SpMV as an average reduction of 15% on 512 processors. BibTeX: @article{Acer2018, author = {Acer, Seher and Selvitopi, Oguz and Aykanat, Cevdet}, title = {Optimizing nonzero-based sparse matrix partitioning models via reducing latency}, journal = {Journal of Parallel and Distributed Computing}, year = {2018}, url = {http://www.sciencedirect.com/science/article/pii/S0743731518305860}, doi = {10.1016/j.jpdc.2018.08.005} }  Aliaga JI, Dufrechou E, Ezzatti P and Quintana-Ortíz ES (2018), "An efficient GPU version of the preconditioned GMRES method", The Journal of Supercomputing., October, 2018. [Abstract] [BibTeX] [DOI] Abstract: In a large number of scientific applications, the solution of sparse linear systems is the stage that concentrates most of the computational effort. This situation has motivated the study and development of several iterative solvers, among which preconditioned Krylov subspace methods occupy a place of privilege. In a previous effort, we developed a GPU-aware version of the GMRES method included in ILUPACK, a package of solvers distinguished by its inverse-based multilevel ILU preconditioner. In this work, we study the performance of our previous proposal and integrate several enhancements in order to mitigate its principal bottlenecks. The numerical evaluation shows that our novel proposal can reach important run-time reductions. BibTeX: @article{Aliaga2018, author = {Aliaga, José I. and Dufrechou, Ernesto and Ezzatti, Pablo and Quintana-Ortíz, Enrique S.}, title = {An efficient GPU version of the preconditioned GMRES method}, journal = {The Journal of Supercomputing}, year = {2018}, doi = {10.1007/s11227-018-2658-1} }  Aliaga JI, Barreda M and Castaño A (2018), "Energy-aware Strategies for Task-parallel Sparse Linear System Solvers", Concurrency and Computation Practice and Experience. [Abstract] [BibTeX] [DOI] Abstract: We present some energy-aware strategies to improve the energy efficiency of a task-parallel preconditioned Conjugate Gradient (PCG) iterative solver on a Haswell-EP Intel Xeon. These techniques leverage the power-saving states of the processor, promoting the hardware into a more energy-efficient C-state and modifying the CPU frequency (P-states of the processors) of some operations of the PCG. We demonstrate that the application of these strategies during the main operations of the iterative solver can reduce its energy consumption considerably. BibTeX: @article{Aliaga2018a, author = {Aliaga, José I. and Barreda, María and Castaño, Asunción}, title = {Energy-aware Strategies for Task-parallel Sparse Linear System Solvers}, journal = {Concurrency and Computation Practice and Experience}, year = {2018}, doi = {10.1002/cpe.4633} }  Anzt H, Chow E and Dongarra J (2018), "ParILUT -- A New Parallel Threshold ILU Factorization", SIAM Journal on Scientific Computing. Vol. 40(4), pp. C503-C519. [Abstract] [BibTeX] [DOI] Abstract: We propose a parallel algorithm for computing a threshold incomplete LU (ILU) factorization. The main idea is to interleave a parallel fixed-point iteration that approximates an incomplete factorization for a given sparsity pattern with a procedure that adjusts the pattern. We describe and test a strategy for identifying nonzeros to be added and nonzeros to be removed from the sparsity pattern. The resulting pattern may be different and more effective than that of existing threshold ILU algorithms. Also in contrast to other parallel threshold ILU algorithms, much of the new algorithm has fine-grained parallelism. BibTeX: @article{Anzt2018, author = {Anzt, Hartwig and Chow, Edmond and Dongarra, Jack}, title = {ParILUT -- A New Parallel Threshold ILU Factorization}, journal = {SIAM Journal on Scientific Computing}, year = {2018}, volume = {40}, number = {4}, pages = {C503-C519}, doi = {10.1137/16M1079506} }  Anzt H, Huckle TK, Bräckle J and Dongarra J (2018), "Incomplete Sparse Approximate Inverses for Parallel Preconditioning", Parallel Computing. Vol. 71, pp. 1-22. [Abstract] [BibTeX] [DOI] [URL] Abstract: In this paper, we propose a new preconditioning method that can be seen as a generalization of block-Jacobi methods, or as a simplification of the sparse approximate inverse (SAI) preconditioners. The Incomplete Sparse Approximate Inverses (ISAI) is in particular efficient in the solution of sparse triangular linear systems of equations. Those arise, for example, in the context of incomplete factorization preconditioning. ISAI preconditioners can be generated via an algorithm providing fine-grained parallelism, which makes them attractive for hardware with a high concurrency level. In a study covering a large number of matrices, we identify the ISAI preconditioner as an attractive alternative to exact triangular solves in the context of incomplete factorization preconditioning. BibTeX: @article{Anzt2018a, author = {Anzt, Hartwig and Huckle, Thomas K. and Bräckle, Jürgen and Dongarra, Jack}, title = {Incomplete Sparse Approximate Inverses for Parallel Preconditioning}, journal = {Parallel Computing}, year = {2018}, volume = {71}, pages = {1--22}, url = {http://www.sciencedirect.com/science/article/pii/S016781911730176X}, doi = {10.1016/j.parco.2017.10.003} }  Benatia A, Ji W, Wang Y and Shi F (2018), "BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU", ACM Transactions on Architecture and Code Optimization (TACO). New York, NY, USA, September, 2018. Vol. 15(3), pp. 29:1-29:27. ACM. [Abstract] [BibTeX] [DOI] Abstract: The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats were proposed to improve this kernel on the recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for the SpMV kernel on GPU. Indeed, serious performance degradation of an order of magnitude can be observed without a careful selection of the sparse format to use. To address this problem, we propose in this article BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. To do so, BestSF relies on a cost-sensitive classification system trained using Weighted Support Vector Machines (WSVMs) to predict the best sparse format for each input sparse matrix. Our experimental results on two different NVIDIA GPU architectures using a large number of real-world sparse matrices show that BestSF achieved a noticeable overall performance improvement over using a single sparse format. While BestSF is trained to select the best sparse format in terms of performance (GFLOPS), our further experimental investigations revealed that using BestSF also led, in most of the test cases, to the best energy efficiency (MFLOPS/W). To prove its practical effectiveness, we also evaluate the performance and energy efficiency improvement achieved when using BestSF as a building block in a GPU-based Preconditioned Conjugate Gradient (PCG) iterative solver. BibTeX: @article{Benatia2018, author = {Benatia, Akrem and Ji, Weixing and Wang, Yizhuo and Shi, Feng}, title = {BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU}, journal = {ACM Transactions on Architecture and Code Optimization (TACO)}, publisher = {ACM}, year = {2018}, volume = {15}, number = {3}, pages = {29:1--29:27}, doi = {10.1145/3226228} }  Bernaschi M, D'Ambra P and Pasquini D (2018), "AMG based on compatible weighted matching", Parallel Computing. [Abstract] [BibTeX] Abstract: We describe main issues and design principles of an efficient implementation, tailored to recent generations of Nvidia Graphics Processing Units (GPUs), of an Algebraic MultiGrid (AMG) preconditioner previously proposed by one of the authors and already available in the open-source package BootCMatch: Bootstrap algebraic multigrid based on Compatible weighted Matching for standard CPU. The AMG method relies on a new approach for coarsening sparse symmetric positive definite (s.p.d.) matrices, named coarsening based on compatible weighted matching. It exploits maximum weight matching in the adjacency graph of the sparse matrix, driven by the principle of compatible relaxation, providing a suitable aggregation of unknowns which goes beyond the limits of the usual heuristics applied in the current methods. We adopt an approximate solution of the maximum weight matching problem, based on a recently proposed parallel algorithm, referred as the Suitor algorithm, and show that it allow us to obtain good quality coarse matrices for our AMG on GPUs. We exploit inherent parallelism of modern GPUs in all the kernels involving sparse matrix computations both for the setup of the preconditioner and for its application in a Krylov solver, outperforming preconditioners available in Nvidia AmgX library. We report results about a large set of linear systems arising from discretization of scalar and vector partial differential equations (PDEs). BibTeX: @article{Bernaschi2018, author = {Bernaschi, Massimo and D'Ambra, Pasqua and Pasquini, Dario}, title = {AMG based on compatible weighted matching}, journal = {Parallel Computing}, year = {2018} }  Biswas R, Oliker L and Shan H (2003), "Parallel computing strategies for irregular algorithms". Thesis at: LBNL. (53115) [Abstract] [BibTeX] Abstract: Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computing platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms. BibTeX: @techreport{Biswas2003, author = {Biswas, Rupak and Oliker, Leonid and Shan, Hongzhang}, title = {Parallel computing strategies for irregular algorithms}, school = {LBNL}, year = {2003}, number = {53115} }  Booth JD and Bolet G (2018), "Javelin: A Scalable Implementation for Sparse Incomplete LU Factorization", December, 2018. [Abstract] [BibTeX] Abstract: In this work, we present a new scalable incomplete LU factorization framework called Javelin to be used as a preconditioner for solving sparse linear systems with iterative methods. Javelin allows for improved parallel factorization on shared-memory many-core systems, while packaging the coefficient matrix into a format that allows for high performance sparse matrix-vector multiplication and sparse triangular solves with minimal overheads. The framework achieves these goals by using a collection of traditional permutations, point-to-point thread synchronizations, tasking, and segmented prefix scans in a conventional compressed sparse row format. Using these changes, traditional fill-in and drop tolerance methods can be used, while still being able to have observed speedups of up to  42× on 68 Intel Knights Landing cores and  12× on 14 Intel Haswell cores. BibTeX: @online{Booth2018, author = {Booth, Joshua Dennis and Bolet, Gregory}, title = {Javelin: A Scalable Implementation for Sparse Incomplete LU Factorization}, year = {2018} }  Bromberger M, Hoffmann M and Rehrmann R (2018), "Do Iterative Solvers Benefit from Approximate Computing? An Evaluation Study Considering Orthogonal Approximation Methods", In Architecture of Computing Systems. Cham , pp. 297-310. Springer International Publishing. [Abstract] [BibTeX] Abstract: Employing algorithms of scientific computing often comes in hand with finding a trade-off between accuracy and performance. Novel parallel hardware and algorithms only slightly improve these issues due to the increasing size of the problems. While high accuracy is inevitable for most problems, there are parts in scientific computing that allow us to introduce approximation. Therefore, in this paper we give answers to the following questions: (1) Can we exploit different approximate computing strategies in scientific computing? (2) Is there a strategy to combine approaches? To answer these questions, we apply different approximation strategies to a widely used iterative solver for linear systems of equations. We show the advantages and the limits of each strategy and a way to configure a combination of strategies according to a given relative error. Combining orthogonal strategies as an overall concept gives us significant opportunities to increase the performance. BibTeX: @inproceedings{Bromberger2018, author = {Bromberger, Michael and Hoffmann, Markus and Rehrmann, Robin}, editor = {Berekovic, Mladen and Buchty, Rainer and Hamann, Heiko and Koch, Dirk and Pionteck, Thilo}, title = {Do Iterative Solvers Benefit from Approximate Computing? An Evaluation Study Considering Orthogonal Approximation Methods}, booktitle = {Architecture of Computing Systems}, publisher = {Springer International Publishing}, year = {2018}, pages = {297--310} }  Busato F, Green O, Bombieri N and Bader DA (2018), "Hornet: An Efficient Data Structure for Dynamic Sparse Graphs and Matrices on GPUs", In Proceedings of the 2018 IEEE High Performance extreme Computing Conference (HPEC)., September, 2018. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Sparse data computations are ubiquitous in science and engineering. Unlike their dense data counterparts, sparse data computations have less locality and more irregularity in their execution, making them significantly more challenging to parallelize and optimize. Many of the existing formats for sparse data representations on parallel architectures are restricted to static data problems, while those for dynamic data suffer from inefficiency both in terms of performance and memory footprint. This work presents Hornet, a novel data representation that targets dynamic data problems. Hornet is scalable with the input size, and does not require any data re-allocation or re-initialization during the data evolution. We show a Hornet implementation for GPU architectures and compare it to the most widely used static and dynamic data structures. BibTeX: @inproceedings{Busato2018, author = {Federico Busato and Oded Green and Nicola Bombieri and David A. Bader}, title = {Hornet: An Efficient Data Structure for Dynamic Sparse Graphs and Matrices on GPUs}, booktitle = {Proceedings of the 2018 IEEE High Performance extreme Computing Conference (HPEC)}, publisher = {IEEE}, year = {2018}, doi = {10.1109/hpec.2018.8547541} }  Buttari A (2018), "Scalability of parallel sparse direct solvers: methods, memory and performance". Thesis at: Toulouse INP., September, 2018. [Abstract] [BibTeX] [URL] Abstract: The fast and accurate solution of large size sparse systems of linear equations is at the heart of numerical applications from a very broad range of domains including structural mechanics, fluid dynamics, geophysics, medical imaging, chemistry. Among the most commonly used techniques, direct methods, based on the factorization of the system matrix, are generally appreciated for their numerical robustness and ease of use. These advantages, however, come at the price of a considerable operations count and memory footprint. The work presented in this thesis is concerned with improving the scalability of sparse direct solvers, intended as the ability to solve problems of larger and larger size. More precisely, our work aims at developing solvers which are scalable in performance, memory consumption and complexity. We address performance scalability, that is the ability to reduce the execution time as more computational resources are available, introducing algorithms that improve parallelism by reducing communications and synchronizations. We discuss the use of novel parallel programming paradigms and tools to achieve their implementation in an efficient and portable way on modern, heterogeneous supercomputers. We present methods that make sparse direct solvers memory-scalable, that is, capable of taking advantage of parallelism without increasing the overall memory footprint. Finally we show how it is possible to use data sparsity to achieve an asymptotic reduction of the cost of such methods. The presented algorithms have been implemented in the freely distributed MUMPS and qr_mumps solver packages and their effectiveness assessed on real life problems from academic and industrial applications. BibTeX: @techreport{Buttari2018, author = {Buttari, Alfredo}, title = {Scalability of parallel sparse direct solvers: methods, memory and performance}, school = {Toulouse INP}, year = {2018}, url = {https://hal.archives-ouvertes.fr/tel-01913033} }  Carson E, Rozložník M, Strakoš Z, Tichý P and Tůma M (2018), "The Numerical Stability Analysis of Pipelined Conjugate Gradient Methods: Historical Context and Methodology", SIAM Journal on Scientific Computing. Vol. 40(5), pp. A3549-A3580. [Abstract] [BibTeX] [DOI] Abstract: Algebraic solvers based on preconditioned Krylov subspace methods are among the most powerful tools for large-scale numerical computations in applied mathematics, sciences, technology, as well as in emerging applications in social sciences. As the name suggests, Krylov subspace methods can be viewed as a sequence of projections onto nested subspaces of increasing dimension. They are therefore by their nature implemented as synchronized recurrences. This is the fundamental obstacle to efficient parallel implementation. Standard approaches to overcoming this obstacle described in the literature involve reducing the number of global synchronization points and increasing parallelism in performing arithmetic operations within individual iterations. One such approach, employed by the so-called pipelined Krylov subspace methods, involves overlapping the global communication needed for computing inner products with local arithmetic computations. Inexact computations in Krylov subspace methods, due to either floating point roundoff error or intentional action motivated by savings in computing time or energy consumption, have two basic effects, namely, slowing down convergence and limiting attainable accuracy. Although the methodologies for their investigation are different, these phenomena are closely related and cannot be separated from one another. The study of mathematical properties of Krylov subspace methods, in both the cases of exact and inexact computations, is a very active area of research and many issues in the analytic theory of Krylov subspace methods remain open. Numerical stability issues have been studied since the formulation of the conjugate gradient method in the middle of the last century, with many remarkable results achieved since then. Recently, the issues of attainable accuracy and delayed convergence caused by inexact computations became of interest in relation to pipelined conjugate gradient methods and their generalizations. In this contribution we recall the related early results and developments in synchronization-reducing conjugate gradient methods, identify the main factors determining possible numerical instabilities, and present a methodology for the analysis and understanding of pipelined conjugate gradient methods. We derive an expression for the residual gap that applies to any conjugate gradient method variant that uses a particular auxiliary vector in updating the residual, including pipelined conjugate gradient methods, and show how this result can be used to perform a full-scale analysis for a particular implementation. The paper concludes with a brief perspective on Krylov subspace methods in the forthcoming exascale era. BibTeX: @article{Carson2018, author = {Carson, E. and Rozložník, M. and Strakoš, Z. and Tichý, P. and Tůma, M.}, title = {The Numerical Stability Analysis of Pipelined Conjugate Gradient Methods: Historical Context and Methodology}, journal = {SIAM Journal on Scientific Computing}, year = {2018}, volume = {40}, number = {5}, pages = {A3549-A3580}, doi = {10.1137/16M1103361} }  Cartis C, Gould NIM and Toint PL (2018), "Sharp worst-case evaluation complexity bounds for arbitrary-order nonconvex optimization with inexpensive constraints" [Abstract] [BibTeX] Abstract: We provide sharp worst-case evaluation complexity bounds for non convex minimization problems with general inexpensive constraints, i.e. problems where the cost of evaluating/enforcing of the (possibly nonconvex or even disconnected) constraints, if any, is negligible compared to that of evaluating the objective function. These bounds unify, extend or improve all known upper and lower complexity bounds for unconstrained and convexly-constrained problems. It is shown that, given an accuracy level 𝜖, a degree of highest available Lipschitz continuous derivatives p and a desired optimality order q between one and p, a conceptual regularization algorithm requires no more than O(-\frac{p+1}{p-q+1} evaluations of the objective function and its derivatives to compute a suitably approximate q-th order minimizer. With an appropriate choice of the regularization, a similar result also holds if the p-th derivative is merely Holder rather than Lipschitz continuous. We provide an example that shows that the above complexity bound is sharp for unconstrained and a wide class of constrained problems; we also give reasons for the optimality of regularization methods from a worst-case complexity point of view, within a large class of algorithms that use the same derivative information. BibTeX: @article{Cartis2018, author = {Cartis, Coralia and Gould, Nick I. M. and Toint, Philippe L.}, title = {Sharp worst-case evaluation complexity bounds for arbitrary-order nonconvex optimization with inexpensive constraints}, year = {2018} }  Cayrols S, Duff I and Lopez F (2018), "Parallelization of the solve phase in a task-based Cholesky solver using a sequential task flow model". Thesis at: Science & Technology Facilities Council, UK. [Abstract] [BibTeX] Abstract: We describe the parallelization of the solve phase in the sparse Cholesky solver SpLLT [Duff, Hogg, and Lopez. Numerical Algebra, Control and Optimization. Volume 8, 235-237, 2018] when using a sequential task flow (STF) model. In the context of direct methods, the solution of a sparse linear system is achieved through three main phases: the analyse, the factorization and the solve phases. In the last two phases which involve numerical computation, the factorization corresponds to the most computationally costly phase, and it is therefore crucial to parallelize this phase in order to reduce the time-to-solution on modern architectures. As a consequence, the solve phase is often not as optimized as the factorization in state-of-the-art solvers and opportunities for parallelism are often not exploited in this phase. However, in some applications, the time spent in the solve phase is comparable or even greater than the time for the factorization and the user could dramatically benefit from a faster solve routine. This is the case, for example, for a CG solver using a block Jacobi preconditioner. The diagonal blocks are factorized once only but their factors are used to solve subsystems at each CG iteration. In this study we design and implement a parallel version of a task-based solve routine for an OpenMP version of the SpLLT solver. We show that we can obtain good scalability on a multicore architecture enabling a dramatic reduction of the overall time-to-solution in some applications. BibTeX: @report{Cayrols2018, author = {Cayrols, Sébastien and Duff, Iain and Lopez, Florent}, title = {Parallelization of the solve phase in a task-based Cholesky solver using a sequential task flow model}, school = {Science & Technology Facilities Council, UK}, year = {2018} }  Cheshmi K, Kamil S, Strout MM and Dehnavi MM (2018), "ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism", In Proceedings of The International Conference for High Performance Computing, Networking, Storage, and Analysis. Dallas, TX, US [Abstract] [BibTeX] Abstract: ParSy is a framework that generates parallel code for sparse matrix computations. It uses a novel inspection strategy along with code transformations to generate parallel code for shared memory processors that is optimized for locality and load balance. Code generated by existing automatic parallelism approaches for sparse algorithms can suffer from load imbalance and excessive synchronization, resulting in performance that does not scale well on multi-core systems. We propose a novel task coarsening strategy that creates well-balanced tasks that can execute in parallel. ParSy-generated code outperforms existing highly-optimized sparse matrix codes such as the Cholesky factorization on multi-core processors with speed-ups of 2.8× and 3.1× over the MKL Pardiso and PaStiX libraries respectively. BibTeX: @inproceedings{Cheshmi2018, author = {Cheshmi, Kazem and Kamil, Shoaib and Strout, Michelle Mills and Dehnavi, Maryam Mehri}, title = {ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism}, booktitle = {Proceedings of The International Conference for High Performance Computing, Networking, Storage, and Analysis}, year = {2018} }  Chow E, Anzt H, Scott J and Dongarra J (2018), "Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning", Journal of Parallel and Distributed Computing. Vol. 119, pp. 219-230. [Abstract] [BibTeX] [DOI] [URL] Abstract: When using incomplete factorization preconditioners with an iterative method to solve large sparse linear systems, each application of the preconditioner involves solving two sparse triangular systems. These triangular systems are challenging to solve efficiently on computers with high levels of concurrency. On such computers, it has recently been proposed to use Jacobi iterations, which are highly parallel, to approximately solve the triangular systems from incomplete factorizations. The effectiveness of this approach, however, is problem-dependent: the Jacobi iterations may not always converge quickly enough for all problems. Thus, as a necessary and important step to evaluate this approach, we experimentally test the approach on a large number of realistic symmetric positive definite problems. We also show that by using block Jacobi iterations, we can extend the range of problems for which such an approach can be effective. For block Jacobi iterations, it is essential for the blocking to be cognizant of the matrix structure. BibTeX: @article{Chow2018, author = {Chow, Edmond and Anzt, Hartwig and Scott, Jennifer and Dongarra, Jack}, title = {Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning}, journal = {Journal of Parallel and Distributed Computing}, year = {2018}, volume = {119}, pages = {219--230}, url = {http://www.sciencedirect.com/science/article/pii/S0743731518303034}, doi = {10.1016/j.jpdc.2018.04.017} }  Coutinho D, Xavier-de-Souza S and Aloise D (2018), "A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming" [Abstract] [BibTeX] Abstract: We present a shared-memory parallel implementation of the Simplex tableau algorithm for dense large-scale Linear Programming (LP) problems. We present the general scheme and explain each parallelization step of the standard simplex algorithm, emphasizing important solutions for solving performance bottlenecks. We analyzed the speedup and the parallel efficiency for the proposed implementation relative to the standard Simplex algorithm using a shared-memory system with 64 processing cores. The experiments were performed for several different problems, with up to 8192 variables and constraints, in their primal and dual formulations. The results show that the performance is mostly much better when we use the formulation with more variables than inequality constraints. Also, they show that the parallelization strategies applied to avoid bottlenecks caused the implementation to scale well with the problem size and the core count up to a certain limit of problem size. Further analysis showed that this was an effect of resource limitation. Even though, our implementation was able to reach speedups in the order of 19×. BibTeX: @online{Coutinho2018, author = {Coutinho, Demetrios and Xavier-de-Souza, Samuel and Aloise, Daniel}, title = {A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming}, year = {2018} }  Dahiya Y, Konomis D and Woodruff DP (2018), "An Empirical Evaluation of Sketching for Numerical Linear Algebra", In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, NY, USA , pp. 1292-1300. ACM. [Abstract] [BibTeX] [DOI] Abstract: Over the last ten years, tremendous speedups for problems in randomized numerical linear algebra such as low rank approximation and regression have been made possible via the technique of randomized data dimensionality reduction, also known as sketching. In theory, such algorithms have led to optimal input sparsity time algorithms for a wide array of problems. While several scattered implementations of such methods exist, the goal of this work is to provide a comprehensive comparison of such methods to alternative approaches. We investigate least squares regression, iteratively reweighted least squares, logistic regression, robust regression with Huber and Bisquare loss functions, leverage score computation, Frobenius norm low rank approximation, and entrywise _1-low rank approximation. We give various implementation techniques to speed up several of these algorithms, and the resulting implementations demonstrate the tradeoffs of such techniques in practice. BibTeX: @inproceedings{Dahiya2018, author = {Dahiya, Yogesh and Konomis, Dimitris and Woodruff, David P.}, title = {An Empirical Evaluation of Sketching for Numerical Linear Algebra}, booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining}, publisher = {ACM}, year = {2018}, pages = {1292--1300}, doi = {10.1145/3219819.3220098} }  Davis TA (2018), "Graph algorithms via SuiteSparse:GraphBLAS:triangle counting and K-truss", In Proceedings of the IEEE High Performance Extreme Computing Conference. [Abstract] [BibTeX] Abstract: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent tocomputations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. To illustrate GraphBLAS, two graph algorithms are constructed in GraphBLAS and compared with efficient implementations without GraphBLAS: triangle counting and constructing the k-truss of a graph. BibTeX: @inproceedings{Davis2018a, author = {Davis, Timothy A.}, title = {Graph algorithms via SuiteSparse:GraphBLAS:triangle counting and K-truss}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2018} }  Dufrechou E and Ezzatti P (2018), "Solving Sparse Triangular Linear Systems in Modern GPUs: A Synchronization-Free Algorithm", In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing. Cambridge, UK , pp. 196-203. [Abstract] [BibTeX] [DOI] Abstract: Sparse triangular linear systems are ubiquitous in a wide range of science and engineering fields, and represent one of the most important building blocks of Sparse Numerical Lineal Algebra methods. For this reason, their parallel solution has been subject of exhaustive study, and efficient implementations of this kernel can be found for almost every hardware platform. However, the strong data dependencies that serialize a great deal of the execution and the load imbalance inherent to the triangular structure poses serious difficulties for its parallel performance, specially in the context of massively- parallel processors such as GPUs. To this day, the most widespread GPU implementation of this kernel is the one distributed in NVIDIA CUSPARSE library, which relies on a preprocessing stage to determine the parallel execution schedule. Although the solution phase is highly efficient, this strategy pays the cost of constant synchronizations with the CPU. In this work, we present a synchronization-free GPU al- gorithm to solve sparse triangular linear systems for the CSR format. The experimental evaluation shows performance improvements over CUSPARSE and a recently proposed synchronization-free method for the CSC matrix format. BibTeX: @inproceedings{Dufrechou2018, author = {Dufrechou, Ernesto and Ezzatti, Pablo}, title = {Solving Sparse Triangular Linear Systems in Modern GPUs: A Synchronization-Free Algorithm}, booktitle = {Proceedings of the 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing}, year = {2018}, pages = {196--203}, doi = {10.1109/PDP2018.2018.00034} }  Dufrechou E and Ezzatti P (2018), "A New GPU Algorithm to Compute a Level Set-Based Analysis for the Parallel Solution of Sparse Triangular Systems", In Proceedings of the IEEE International Parallel and Distributed Processing Symposium., May, 2018. , pp. 920-929. [Abstract] [BibTeX] [DOI] Abstract: A myriad of problems in science and engineering, involve the solution of sparse triangular linear systems. They arise frequently as part of direct and iterative solvers for linear systems and eigenvalue problems, and hence can be considered as a key building block of sparse numerical linear algebra. This is why, since the early days, their parallel solution has been exhaustively studied, and efficient implementations of this kernel can be found for almost every hardware platform. In the GPU context, the most widespread implementation of this kernel is the one distributed in NVIDIA CUSPARSE library, which relies on a preprocessing stage to aggregate the unknowns of the triangular system into level sets. This determines an execution schedule for the solution of the system, where the level sets have to be processed sequentially while the unknowns that belong to one level set can be solved in parallel. One of the disadvantages of the CUSPARSE implementation is that this preprocessing stage is often extremely slow in comparison to the runtime of the solving phase. In this work, we present a parallel GPU algorithm that is able to compute the same level sets as CU S PARSE but takes significantly less runtime. Our experiments on a set of matrices from the SuiteSparse collection show acceleration factors of up to 44×. Additionally, we provide a routine capable of solving a triangular linear system on the same pass used to calculate the level sets, yielding important performance benefits. BibTeX: @inproceedings{Dufrechou2018a, author = {Dufrechou, E. and Ezzatti, P.}, title = {A New GPU Algorithm to Compute a Level Set-Based Analysis for the Parallel Solution of Sparse Triangular Systems}, booktitle = {Proceedings of the IEEE International Parallel and Distributed Processing Symposium}, year = {2018}, pages = {920--929}, doi = {10.1109/IPDPS.2018.00101} }  Dziekonski A and Mrozowski M (2018), "A GPU Solver for Sparse Generalized Eigenvalue Problems with Symmetric Complex-Valued Matrices Obtained Using Higher-Order FEM", IEEE Access. [Abstract] [BibTeX] [DOI] Abstract: The paper discusses a fast implementation of the stabilized locally optimal block preconditioned conjugate gradient (sLOBPCG) method, using a hierarchical multilevel preconditioner to solve non-Hermitian sparse generalized eigenvalue problems with large symmetric complex-valued matrices obtained using the higher-order finite-element method (FEM), applied to the analysis of a microwave resonator. The resonant frequencies of the low-order modes are the eigenvalues of the smallest real part of a complex symmetric (though non-Hermitian) matrix pencil. These type of pencils arise in the FEM analysis of resonant cavities loaded with a lossy material. To accelerate the computations, graphics processing units (GPU, NVIDIA Pascal P100) were used. Single and dual-GPU variants are considered and a GPU-memory-saving implementation is proposed. An efficient sliced ELLR-T sparse matrix storage format was used and operations were performed on blocks of vectors for best performance on a GPU. As a result, significant speedups (exceeding a factor of six in some computational scenarios) were achieved over the reference parallel implementation using a multicore central processing unit (CPU, Intel Xeon E5-2680 v3, twelve cores). These results indicate that the solution of generalized eigenproblems needs much more GPU memory than iterative techniques when solving a sparse system of equations, and also requires a second GPU to store some data structures in order to reduce the footprint, even for a moderately large systems. BibTeX: @article{Dziekonski2018, author = {Dziekonski, A. and Mrozowski, M.}, title = {A GPU Solver for Sparse Generalized Eigenvalue Problems with Symmetric Complex-Valued Matrices Obtained Using Higher-Order FEM}, journal = {IEEE Access}, year = {2018}, doi = {10.1109/access.2018.2871219} }  Ekström S-E and Garoni C (2018), "A matrix-less and parallel interpolation--extrapolation algorithm for computing the eigenvalues of preconditioned banded symmetric Toeplitz matrices", Numerical Algorithms.. Thesis at: Uppsala University, Division of Scientific Computing. [Abstract] [BibTeX] [DOI] Abstract: In the past few years, Bogoya, Bottcher, Grudsky, and Maximenko obtained the precise asymptotic expansion for the eigenvalues of a Toeplitz matrix T_n(f), under suitable assumptions on the generating function f, as the matrix size n goes to infinity. On the basis of several numerical experiments, it was conjectured by Serra-Capizzano that a completely analogous expansion also holds for the eigenvalues of the preconditioned Toeplitz matrix T_n(u)^-1T_n(v), provided f = v/u is monotone and further conditions on u and v are satisfied. Based on this expansion, we here propose and analyze an interpolation--extrapolation algorithm for computing the eigenvalues of T_n(u)^-1T_n(v). The algorithm is suited for parallel implementation and it may be called “matrix-less” as it does not need to store the entries of the matrix. We illustrate the performance of the algorithm through numerical experiments and we also present its generalization to the case where f = v/u is non-monotone. BibTeX: @article{Ekstrom2018, author = {Ekström, Sven-Erik and Garoni, Carlo}, title = {A matrix-less and parallel interpolation--extrapolation algorithm for computing the eigenvalues of preconditioned banded symmetric Toeplitz matrices}, journal = {Numerical Algorithms}, school = {Uppsala University, Division of Scientific Computing}, year = {2018}, doi = {10.1007/s11075-018-0508-0} }  Franceschini A, Ferronato M, Janna C and Magri VAP (2018), "Recent advancements in preconditioning techniques for large size linear systems suited for High Performance Computing", In Seminari Padovani di Analisi Numerica 2018. Vol. 11, pp. 11-22. [Abstract] [BibTeX] Abstract: The numerical simulations of real-world engineering problems create models with several millions or even billions of degrees of freedom. Most of these simulations are centered on the solution of systems of non-linear equations, that, once linearized, become a sequence of linear systems, whose solution is often the most time-demanding task. Thus, in order to increase the capability of modeling larger cases, it is of paramount importance to exploit the resources of High Performance Computing architectures. In this framework, the development of new algorithms to accelerate the solution of linear systems for many-core architectures is a really active research field. Our main focus is algebraic preconditioning and, among the various options, we elect to develop approximate inverses for symmetric and positive definite (SPD) linear systems [22], both as stand-alone preconditioner or smoother for AMG techniques. This choice is mainly supported by the almost perfect parallelism that intrinsically characterizes these algorithms. As basic kernel, the Factorized Sparse Approximate Inverse (FSAI) developed in its adaptive form by Janna and Ferronato [18] is selected. Recent developments are i) a robust multilevel approach for SPD problems based on FSAI preconditioning, which eliminates the chance of algorithmic breakdowns independently of the preconditioner sparsity [14] and ii) a novel AMG approach featuring the adaptive FSAI method as a flexible smoother as well as new approaches to adaptively compute the prolongation operator. In this latter work, a new technique to build the prolongation is also presented. BibTeX: @inproceedings{Franceschini2018, author = {Franceschini, Andrea and Ferronato, Massimiliano and Janna, Carlo and Magri, Victor A. P.}, title = {Recent advancements in preconditioning techniques for large size linear systems suited for High Performance Computing}, booktitle = {Seminari Padovani di Analisi Numerica 2018}, year = {2018}, volume = {11}, pages = {11--22} }  Franchetti F, Moura JMF, Padua DA and Dongarra J (2018), "From High-Level Specification to High-Performance Code", Proceedings of the IEEE. Vol. 106, pp. 1875-1878. [Abstract] [BibTeX] Abstract: Computer architectures and systems are becoming ever more powerful but increasingly more complex. With the end of frequency scaling (about 2004) and the era of multicores/manycores/accelerators, it is exceedingly hard to extract the promised performance, in particular, at a reasonable energy budget. Only highly trained and educated experts can hope to conquer this barrier that, if not appropriately dealt with, can translate into multiple orders of magnitude of underutilization of computer systems when programmed by less specialized programmers or domain scientists. To overcome this challenge, the last ten years have seen a flurry of activity to automate the design and generation of highly efficient implementations for these multicore/ manycore architectures, and to translate high level descriptions of programs into high performance and power efficiency BibTeX: @article{Franchetti2018, author = {Franchetti, Franz and Moura, José M. F. and Padua, David A. and Dongarra, Jack}, title = {From High-Level Specification to High-Performance Code}, journal = {Proceedings of the IEEE}, year = {2018}, volume = {106}, pages = {1875--1878} }  Fukaya T, Kannan R, Nakatsukasa Y, Yamamoto Y and Yanagisawa Y (2018), "Shifted CholeskyQR for computing the QR factorization of ill-conditioned matrices" [Abstract] [BibTeX] Abstract: The Cholesky QR algorithm is an efficient communication-minimizing algorithm for computing the QR factorization of a tall-skinny matrix. Unfortunately it has the inherent numerical instability and breakdown when the matrix is ill-conditioned. A recent work establishes that the instability can be cured by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix runs to completion, which means it does not always work for matrices X with _2(X) ge u^-12 where u is the unit roundoff. In this work we extend the applicability to _2(X)=𝒪(u^-1) by introducing a shift to the computed Gram matrix so as to guarantee the Cholesky factorization R^T R = A^T A+sI succeeds numerically. We show that the computed A R^-1 has reduced condition number le u^-12*, for which CholeskyQR2 safely computes the QR factorization, yielding a computed Q of orthogonality mid Q^T Q - I_2 and residual mid A-QRmid _F/mid A_F both 𝒪(u). Thus we obtain the required QR factorization by essentially running Cholesky QR thrice. We extensively analyze the resulting algorithm shiftedCholeskyQR to reveal its excellent numerical stability. shiftedCholeskyQR is also highly parallelizable, and applicable and effective also when working in an oblique inner product space. We illustrate our findings through experiments, in which we achieve significant (up to ×40) speedup over alternative methods. BibTeX: @online{Fukaya2018, author = {Fukaya, Takeshi and Kannan, Ramaseshan and Nakatsukasa, Yuji and Yamamoto, Yusaku and Yanagisawa, Yuka}, title = {Shifted CholeskyQR for computing the QR factorization of ill-conditioned matrices}, year = {2018} }  Goddard A and Wathen A (2018), "A note on parallel preconditioning for all-at-once evolutionary PDEs", Electronic Transactions on Numerical Analysis. Vol. XX [Abstract] [BibTeX] Abstract: McDonald, Pestana and Wathen (SIAM J. Sci. Comput. 40 (2), pp. A2012--A1033, 2018) present a method for preconditioning of time-dependent PDEs via approximation by a nearby time-periodic problem, that is, they employ circulant-related matrices as preconditioners for the non-symmetric block Toeplitz matrices which arise from an all-at-once formulation. They suggest that such an approach might be efficiently implemented in parallel. In this short article, we present parallel numerical results for their preconditioner which exhibit strong scaling. We also extend their preconditioner via a Neumann series approach, which also allows for efficient parallel execution. Our simple implementation (in C++ and MPI) is available at the Git repository PARALAAOMPI. BibTeX: @article{Goddard2018, author = {Goddard, Anthony and Wathen, Andy}, title = {A note on parallel preconditioning for all-at-once evolutionary PDEs}, journal = {Electronic Transactions on Numerical Analysis}, year = {2018}, volume = {XX} }  Haidar A, Tomov S, Dongarra J and Higham NJ (2018), "Harnessing a GPU's Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers", In Proceedings of NVIDIA's GPU Technology Conference. [Abstract] [BibTeX] Abstract: Recent in-hardware GPU acceleration of half precision arithmetic (FP16) -- motivated by various machine learning (ML) and artificial intelligence (AI) applications -- has reinvigorated a great interest in the mixed-precision iterative refinement technique. The technique is based on use of low precision arithmetic to accelerate the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP64 accuracy. While being a well known technique, its successful modification, software development, and adjustment to match architecture specifics, is challenging. For current manycore GPUs the challenges range from efficient parallelization to scaling, and using the FP16 arithmetic. Here, we address these challenges by showing how to algorithmically modify, develop high-performance implementations, and in general, how to use the FP16 arithmetic to significantly accelerate, as well as make more energy efficient, FP64-precision Ax = b solvers. One can reproduce our results as the developments will be made available through the MAGMA library. We quantify in practice the performance, and limitations of the approach stressing on the use of the Volta V100 Tensor Cores that provide additional FP16 performance boost BibTeX: @inproceedings{Haidar2018, author = {Haidar, Azzam and Tomov, Stanimire and Dongarra, Jack and Higham, Nicholas J.}, title = {Harnessing a GPU's Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers}, booktitle = {Proceedings of NVIDIA's GPU Technology Conference}, year = {2018} }  Hong C, Sukumaran-Rajam A, Bandyopadhyay B, Kim J, Kurt SE, Nisa I, Sabhlok S, Çatalyürek ÜV, Parthasarathy S and Sadayappan P (2018), "Efficient Sparse-matrix Multi-vector Product on GPUs", In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. Tempe, Arizona , pp. 66-79. ACM. [Abstract] [BibTeX] [DOI] Abstract: Sparse Matrix-Vector (SpMV) and Sparse Matrix-Multivector (SpMM) products are key kernels for computational science and data science. While GPUs offer significantly higher peak performance and memory bandwidth than multicore CPUs, achieving high performance on sparse computations on GPUs is very challenging. A tremendous amount of recent research has focused on various GPU implementations of the SpMV kernel. But the multi-vector SpMM kernel has received much less attention. In this paper, we present an in-depth analysis to contrast SpMV and SpMM, and develop a new sparse-matrix representation and computation approach suited to achieving high data-movement efficiency and effective GPU parallelization of SpMM. Experimental evaluation using the entire SuiteSparse matrix suite demonstrates significant performance improvement over existing SpMM implementations from vendor libraries. BibTeX: @inproceedings{Hong2018, author = {Hong, Changwan and Sukumaran-Rajam, Aravind and Bandyopadhyay, Bortik and Kim, Jinsung and Kurt, Süreyya Emre and Nisa, Israt and Sabhlok, Shivani and Çatalyürek, Ümit V. and Parthasarathy, Srinivasan and Sadayappan, P.}, title = {Efficient Sparse-matrix Multi-vector Product on GPUs}, booktitle = {Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing}, publisher = {ACM}, year = {2018}, pages = {66--79}, doi = {10.1145/3208040.3208062} }  Jacquelin M, Ng EG and Peyton BW (2018), "Fast and effective reordering of columns within supernodes using partition refinement", In Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. , pp. 76-86. [Abstract] [BibTeX] [DOI] Abstract: In this paper, we consider the problem of computing a triangular factorization of a sparse symmetric matrix using Gaussian elimination. We assume that the sparse matrix has been permuted using a fill-reducing ordering algorithm. When the matrix is symmetric positive definite, the sparsity structure of the triangular factor can be determined once the fill-reducing ordering has been computed. Thus, an efficient numerical factorization scheme can be designed so that only the nonzero entries are stored and operated on. On modern architectures, the positions of the nonzero entries in the triangular factor play a big role in determining the efficiency. It is desirable to have dense blocks in the factor so that the computation can be cast in terms of level-3 BLAS as much as possible. On architectures with GPUs, for example, it is also desirable for these dense blocks to be as large as possible in order to reduce the times to transfer data between the main CPU and the GPUs. We address the problem of locally refining the ordering so that the number of dense blocks is reduced and the sizes of these dense blocks are increased in the triangular factor. BibTeX: @inbook{Jacquelin2018, author = {Jacquelin, Mathias and Ng, Esmond G. and Peyton, Barry W.}, title = {Fast and effective reordering of columns within supernodes using partition refinement}, booktitle = {Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing}, year = {2018}, pages = {76--86}, doi = {10.1137/1.9781611975215.8} }  Jaiganesh J and Burtscher M (2018), "A High-performance Connected Components Implementation for GPUs", In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. New York, NY, USA , pp. 92-104. ACM. [Abstract] [BibTeX] [DOI] Abstract: Computing connected components is an important graph algorithm that is used, for example, in medicine, image processing, and biochemistry. This paper presents a fast connected-components implementation for GPUs called ECL-CC. It builds upon the best features of prior algorithms and augments them with GPU-specific optimizations. For example, it incorporates a parallelism-friendly version of pointer jumping to speed up union-find operations and uses several compute kernels to exploit the multiple levels of hardware parallelism. The resulting CUDA code is asynchronous and lock free, employs load balancing, visits each edge exactly once, and only processes edges in one direction. It is 1.8 times faster on average than the fastest prior GPU implementation running on a Titan X and faster on most of the eighteen real-world and synthetic graphs we tested. BibTeX: @inproceedings{Jaiganesh2018, author = {Jaiganesh, Jayadharini and Burtscher, Martin}, title = {A High-performance Connected Components Implementation for GPUs}, booktitle = {Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing}, publisher = {ACM}, year = {2018}, pages = {92--104}, doi = {10.1145/3208040.3208041} }  Jia Z, Maggioni M, Staiger B and Scarpazza DP (2018), "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking", April, 2018. [Abstract] [BibTeX] Abstract: Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on the novel NVIDIA GPUs, independent researchers have resorted to microbenchmarks-based dissection and discovery. This has led to a prolific line of publications that shed light on instruction encoding, and memory hierarchy's geometry and features at each level. Namely, research that describes the performance and behavior of the Kepler, Maxwell and Pascal architectures. In this technical report, we continue this line of research by presenting the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly. Additionally, we compare quantitatively our Volta findings against its predecessors, Kepler, Maxwell and Pascal. BibTeX: @article{Jia2018, author = {Jia, Zhe and Maggioni, Marco and Staiger, Benjamin and Scarpazza, Daniele P.}, title = {Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking}, year = {2018} }  Jun S, Wright A, Zhang S, Xu S and Arvind (2018), "GraFBoost: Using Accelerated Flash Storage for External Graph Analytics", In ACM/IEEE 45th Annual International Symposium on Computer Architecture., June, 2018. , pp. 411-424. [Abstract] [BibTeX] [DOI] Abstract: We describe GraFBoost, a flash-based architecture with hardware acceleration for external analytics of multi-terabyte graphs. We compare the performance of GraFBoost with 1 GB of DRAM against various state-of-the-art graph analytics software including FlashGraph, running on a 32-thread Xeon server with 128 GB of DRAM. We demonstrate that despite the relatively small amount of DRAM, GraFBoost achieves high performance with very large graphs no other system can handle, and rivals the performance of the fastest software platforms on sizes of graphs that existing platforms can handle. Unlike in-memory and semi-external systems, GraFBoost uses a constant amount of memory for all problems, and its performance decreases very slowly as graph sizes increase, allowing GraFBoost to scale to much larger problems than possible with existing systems while using much less resources on a single-node system. The key component of GraFBoost is the sort-reduce accelerator, which implements a novel method to sequentialize fine-grained random accesses to flash storage. The sort-reduce accelerator logs random update requests and then uses hardware-accelerated external sorting with interleaved reduction functions. GraFBoost also stores newly updated vertex values generated in each superstep of the algorithm lazily with the old vertex values to further reduce I/O traffic. We evaluate the performance of GraFBoost for PageRank, breadth-first search and betweenness centrality on our FPGA-based prototype (Xilinx VC707 with 1 GB DRAM and 1 TB flash) and compare it to other graph processing systems including a pure software implementation of GrapFBoost. BibTeX: @inproceedings{Jun2018, author = {Jun, S. and Wright, A. and Zhang, S. and Xu, S. and Arvind}, title = {GraFBoost: Using Accelerated Flash Storage for External Graph Analytics}, booktitle = {ACM/IEEE 45th Annual International Symposium on Computer Architecture}, year = {2018}, pages = {411--424}, doi = {10.1109/ISCA.2018.00042} }  Kaya O, Kannan R and Ballard G (2018), "Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization". Thesis at: INRIA Bordeaux Sud-Ouest. [Abstract] [BibTeX] Abstract: Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and blind source separation in microscopy. In this paper, we focus on scaling algorithms for NMF to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. In the case of machine learning workflow, the computations after SpMM must deal with dense matrices, as Sparse-Dense matrix multiplication will result in a dense matrix. Hence, the partitioning strategy considering only SpMM will result in a huge imbalance in the overall workflow especially on computations after SpMM and in this specific case of NMF on non-negative least squares computations. Towards this, we consider two previous works developed for related problems, one that uses a fine-grained partitioning strategy using a point-to-point communication pattern and on that uses a checkerboard partitioning strategy using a collective-based communication pattern. We show that a combination of the previous approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. From the experiments, we could see that our proposed algorithm communicates atleast 4x less than the collective and achieves upto 100× speed up over the baseline FAUN on real world datasets. Our algorithm was experimented in two different super computing platforms and we could scale up to 32000 processors on Bluegene/Q. BibTeX: @techreport{Kaya2018, author = {Kaya, Oguz and Kannan, Ramakrishnan and Ballard, Grey}, title = {Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization}, school = {INRIA Bordeaux Sud-Ouest}, year = {2018} }  Kirmani S, Sun H and Raghavan P (2018), "A Scalability and Sensitivity Study of Parallel Geometric Algorithms for Graph Partitioning" [Abstract] [BibTeX] [DOI] Abstract: Graph partitioning arises in many computational simulation workloads, including those that involve finite difference or finite element methods, where partitioning enables efficient parallel processing of the entire simulation. We focus on parallel geometric algorithms for partitioning large graphs whose vertices are associated with coordinates in two-or three-dimensional space on multi-core processors. Compared with other types of partitioning algorithms, geometric schemes generally show better scalability on a large number of processors or cores. This paper studies the scalability and sensitivity of two parallel algorithms, namely, recursive coordinate bisection (denoted by pRCB) and geometric mesh partitioning (denoted by pGMP), in terms of their robustness to several key factors that affect the partition quality, including coordinate perturbation, approximate embedding, mesh quality and graph planarity. Our results indicate that the quality of a partition as measured by the size of the edge separator (or cutsize) remains consistently better for pGMP compared to pRCB. On average for our test suite, relative to pRCB, pGMP yields 25% smaller cutsizes on the original embedding, and across all perturbations cutsizes that are smaller by at least 8% and by as much as 50%. Not surprisingly, higher quality cuts are obtained at the expense of longer execution times; on a single core, pGMP has an average execution time that is almost 10 times slower than that of pRCB, but it scales better and catches up at 32-cores to be slower by less than 20%. With the current trends in core counts that continue to increase per chip, these results suggest that pGMP presents an attractive solution if a modest number of cores can be deployed to reduce execution times while providing high quality partitions. BibTeX: @article{Kirmani2018, author = {Kirmani, Shad and Sun, Hongyang and Raghavan, Padma}, title = {A Scalability and Sensitivity Study of Parallel Geometric Algorithms for Graph Partitioning}, year = {2018}, doi = {10.13140/RG.2.2.32020.96644} }  Knigge TE and Bisseling RH (2018), "An improved exact algorithm and an NP-completeness proof for sparse matrix bipartitioning" [Abstract] [BibTeX] Abstract: We formulate the sparse matrix bipartitioning problem of minimizing the communication volume in parallel sparse matrix-vector multiplication. We prove its NP-completeness in the perfectly balanced case, where both parts of the partitioned matrix must have an equal number of nonzeros, by reduction from the graph bisection problem. We present an improved exact branch-and-bound algorithm which finds the minimum communication volume for a given maximum allowed imbalance. The algorithm is based on a maximum-flow bound and a pack- ing bound, which extend previous matching and packing bounds. We implemented the algorithm in a new program called MP (Matrix Partitioner), which solved 839 matrices from the SuiteSparse collection to optimality, each within 24 hours of CPU-time. Furthermore, MP solved the difficult problem of the matrix cage6 in about 3 days. The new program is about 13.8 times faster than the previous program MondriaanOpt. BibTeX: @online{Knigge2018, author = {Knigge, Timon E. and Bisseling, Rob H.}, title = {An improved exact algorithm and an NP-completeness proof for sparse matrix bipartitioning}, year = {2018} }  Kurzak J, Gates M, Yamazaki I, Charara A, YarKhan A, Finney J, Ragghianti G, Luszczek P and Dongarra J (2018), "SLATE Working Note 8: Linear Systems Performance Report". Thesis at: Innovative Computing Laboratory, University of Tennessee., September, 2018. (ICL-UT-XX-XX) [Abstract] [BibTeX] Abstract: Software for Linear Algebra Targeting Exascale (SLATE) is being developed as part of the Exascale Computing Project (ECP), which is a collaborative effort between two US Department of Energy (DOE) organizations, the Office of Science and the National Nuclear Security Administration (NNSA). The purpose of SLATE is to serve as a replacement for ScaLAPACK for the upcoming pre-exascale and exascale DOE machines. SLATE will accomplish this objective by leveraging recent progress in parallel programming models and by strongly focusing on supporting hardware accelerators. This report focuses on the set of SLATE routines that solve linear systems of equations. Specifically, initial performance numbers are reported, alongside ScaLAPACK performance numbers, on the SummitDev machine at the Oak Ridge Leadership Computing Facility (OLCF). More details about the design of the SLATE software infrastructure can be found in the report by Kurzak et al. BibTeX: @techreport{Kurzak2018, author = {Kurzak, Jakub and Gates, Mark and Yamazaki, Ichitaro and Charara, Ali and YarKhan, Asim and Finney, Jamie and Ragghianti, Gerald and Luszczek, Piotr and Dongarra, Jack}, title = {SLATE Working Note 8: Linear Systems Performance Report}, school = {Innovative Computing Laboratory, University of Tennessee}, year = {2018}, number = {ICL-UT-XX-XX}, note = {revision 09-2018} }  Li R, Xi Y, Erlandson L and Saad Y (2018), "The Eigenvalues Slicing Library (EVSL): Algorithms, Implementation, and Software" [Abstract] [BibTeX] Abstract: This paper describes a software package called EVSL (for EigenValues Slicing Library) for solving large sparse real symmetric standard and generalized eigenvalue problems. As its name indicates, the package exploits spectrum slicing, a strategy that consists of dividing the spectrum into a number of subintervals and extracting eigenpairs from each subinterval independently. In order to enable such a strategy, the methods implemented in EVSL rely on a quick calculation of the spectral density of a given matrix, or a matrix pair. What distinguishes EVSL from other currently available packages is that EVSL relies entirely on filtering techniques. Polynomial and rational filtering are both implemented and are coupled with Krylov subspace methods and the subspace iteration algorithm. On the implementation side, the package offers interfaces for various scenarios including matrix-free modes, whereby the user can supply his/her own functions to perform matrix-vector operations or to solve sparse linear systems. The paper describes the algorithms in EVSL, provides details on their implementations, and discusses performance issues for the various methods. BibTeX: @online{Li2018, author = {Li, Ruipeng and Xi, Yuanzhe and Erlandson, Lucas and Saad, Yousef}, title = {The Eigenvalues Slicing Library (EVSL): Algorithms, Implementation, and Software}, year = {2018} }  Loncaric C, Ernst MD and Torlak E (2018), "Generalized Data Structure Synthesis", In Proceedings of the 40th International Conference on Software Engineering. New York, NY, USA , pp. 958-968. ACM. [Abstract] [BibTeX] [DOI] Abstract: Data structure synthesis is the task of generating data structure implementations from high-level specifications. Recent work in this area has shown potential to save programmer time and reduce the risk of defects. Existing techniques focus on data structures for manipulating subsets of a single collection, but real-world programs often track multiple related collections and aggregate properties such as sums, counts, minimums, and maximums.\This paper shows how to synthesize data structures that track subsets and aggregations of multiple related collections. Our technique decomposes the synthesis task into alternating steps of query synthesis and incrementalization. The query synthesis step implements pure operations over the data structure state by leveraging existing enumerative synthesis techniques, specialized to the data structures domain. The incrementalization step implements imperative state modifications by re-framing them as fresh queries that determine what to change, coupled with a small amount of code to apply the change. As an added benefit of this approach over previous work, the synthesized data structure is optimized for not only the queries in the specification but also the required update operations. We have evaluated our approach in four large case studies, demonstrating that these extensions are broadly applicable. BibTeX: @inproceedings{Loncaric2018, author = {Loncaric, Calvin and Ernst, Michael D. and Torlak, Emina}, title = {Generalized Data Structure Synthesis}, booktitle = {Proceedings of the 40th International Conference on Software Engineering}, publisher = {ACM}, year = {2018}, pages = {958--968}, doi = {10.1145/3180155.3180211} }  Miyata T (2018), "On Correction-Based Iterative Methods for Eigenvalue Problems", IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. Vol. E101.A(10), pp. 1668-1675. [Abstract] [BibTeX] [DOI] Abstract: The Jacobi-Davidson method and the Riccati method for eigenvalue problems are studied. In the methods, one has to solve a nonlinear equation called the correction equation per iteration, and the difference between the methods comes from how to solve the equation. In the Jacobi-Davidson/Riccati method the correction equation is solved with/without linearization. In the literature, avoiding the linearization is known as an improvement to get a better solution of the equation and bring the faster convergence. In fact, the Riccati method showed superior convergence behavior for some problems. Nevertheless the advantage of the Riccati method is still unclear, because the correction equation is solved not exactly but with low accuracy. In this paper, we analyzed the approximate solution of the correction equation and clarified the point that the Riccati method is specialized for computing particular solutions of eigenvalue problems. The result suggests that the two methods should be selectively used depending on target solutions. Our analysis was verified by numerical experiments. BibTeX: @article{Miyata2018, author = {Miyata, Takafumi}, title = {On Correction-Based Iterative Methods for Eigenvalue Problems}, journal = {IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences}, year = {2018}, volume = {E101.A}, number = {10}, pages = {1668--1675}, doi = {10.1587/transfun.E101.A.1668} }  Mohammad H, Waziri MY and Santos SA (2018), "A brief survey of methods for solving nonlinear least-squares problems", Numerical Algebra, Control & Optimization. Vol. 9(2155-3289_2019_1_1), pp. 1. [Abstract] [BibTeX] [DOI] [URL] Abstract: In this paper, we present a brief survey of methods for solving nonlinear least-squares problems. We pay specific attention to methods that take into account the special structure of the problems. Most of the methods discussed belong to the quasi-Newton family (i.e. the structured quasi-Newton methods (SQN)). Our survey comprises some of the traditional and modern developed methods for nonlinear least-squares problems. At the end, we suggest a few topics for further research. BibTeX: @article{Mohammad2018, author = {Mohammad, Hassan and Waziri, Mohammed Yusuf and Santos, Sandra Augusta}, title = {A brief survey of methods for solving nonlinear least-squares problems}, journal = {Numerical Algebra, Control & Optimization}, year = {2018}, volume = {9}, number = {2155-3289_2019_1_1}, pages = {1}, url = {http://aimsciences.org//article/id/7e25fb2d-b50c-46dd-9be5-23deee2b4242}, doi = {10.3934/naco.2019001} }  Mohammadi MS, Cheshmi K, Gopalakrishnan G, Hall MW, Dehnavi MM, Venkat A, Yuki T and Strout MM (2018), "Sparse Matrix Code Dependence Analysis Simplification at Compile Time" [Abstract] [BibTeX] Abstract: Analyzing array-based computations to determine data dependences is useful for many applications including automatic parallelization, race detection, computation and communication overlap, verification, and shape analysis. For sparse matrix codes, array data dependence analysis is made more difficult by the use of index arrays that make it possible to store only the nonzero entries of the matrix (e.g., in A[B[i]], B is an index array). Here, dependence analysis is often stymied by such indirect array accesses due to the values of the index array not being available at compile time. Consequently, many dependences cannot be proven unsatisfiable or determined until runtime. Nonetheless, index arrays in sparse matrix codes often have properties such as monotonicity of index array elements that can be exploited to reduce the amount of runtime analysis needed. In this paper, we contribute a formulation of array data dependence analysis that includes encoding index array properties as universally quantified constraints. This makes it possible to leverage existing SMT solvers to determine whether such dependences are unsatisfiable and significantly reduces the number of dependences that require runtime analysis in a set of eight sparse matrix kernels. Another contribution is an algorithm for simplifying the remaining satisfiable data dependences by discovering equalities and/or subset relationships. These simplifications are essential to make a runtime-inspection-based approach feasible. BibTeX: @online{Mohammadi2018, author = {Mohammadi, Mahdi Soltan and Cheshmi, Kazem and Gopalakrishnan, Ganesh and Hall, Mary W. and Dehnavi, Maryam Mehri and Venkat, Anand and Yuki, Tomofumi and Strout, Michelle Mills}, title = {Sparse Matrix Code Dependence Analysis Simplification at Compile Time}, year = {2018} }  Nagasaka Y, Matsuoka S, Azad A and Buluç A (2018), "High-performance sparse matrix-matrix products on Intel KNL and multicore architectures", In Proceedings of the 47th International Conference on Parallel Processing (Workshops). [Abstract] [BibTeX] Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. BibTeX: @inproceedings{Nagasaka2018, author = {Nagasaka, Yusuke and Matsuoka, Satoshi and Azad, Ariful and Buluç, Aydın}, title = {High-performance sparse matrix-matrix products on Intel KNL and multicore architectures}, booktitle = {Proceedings of the 47th International Conference on Parallel Processing (Workshops)}, year = {2018} }  Neumann C and Stein O (2018), "Generating feasible points for mixed-integer convex optimization problems by inner parallel cuts", Optimization Online. [Abstract] [BibTeX] Abstract: In this article we introduce an inner parallel cutting plane method (IPCP) to compute good feasible points for mixed-integer convex optimization problems. The method iteratively generates polyhedral outer approximations of an enlarged inner parallel set (EIPS) of the continuously relaxed feasible set. This EIPS possesses the crucial property that any rounding of any of its elements is feasible for the original problem. The outer approximations are refined in each iteration by using modified Kelley cutting planes, which are defined via rounded optimal points of linear opti- mization problems (LPs).\ We show that the method either computes a feasible point or certifies that the EIPS is empty. Moreover, we provide bounds on the objective value of the generated feasible point. As there exist consistent problems which possess an empty EIPS, the IPCP is not guaranteed to find a feasible point for the latter. Yet, the crucial advantage of the method lies in the complexity of each iteration: While other approaches need to solve a mixed-integer linear optimization problem, the IPCP only needs to solve an LP, which can be carried out efficiently. Our computational study indicates that the IPCP is able to quickly find feasible points for many practical applications. It further demonstrates that the objective values of the computed feasible points are generally of good quality and sometimes not easily obtainable by other methods. BibTeX: @article{Neumann2018, author = {Neumann, Christoph and Stein, Oliver}, title = {Generating feasible points for mixed-integer convex optimization problems by inner parallel cuts}, journal = {Optimization Online}, year = {2018}, note = {Preprint ID 2018-11-6947} }  Rong H (2018), "Expressing Sparse Matrix Computations for Productive Performance on Spatial Architectures" [Abstract] [BibTeX] Abstract: This paper addresses spatial programming of sparse matrix computations for productive performance. The challenge is how to express an irregular computation and its optimizations in a regular way. A sparse matrix has (non-zero) values and a structure. In this paper, we propose to classify the implementations of a computation on a sparse matrix into two categories: (1) structure-driven, or top-down, approach, which traverses the structure with given row and column indices and locates the corresponding values, and (2) values-driven, or bottom-up, approach, which loads and processes the values in parallel streams, and decodes the structure for the values' corresponding row and column indices. On a spatial architecture like FPGAs, the values-driven approach is the norm. We show how to express a sparse matrix computation and its optimizations for a values-driven implementation. A compiler automatically synthesizes a code to decode the structure. In this way, programmers focus on optimizing the processing of the values, using familiar optimizations for dense matrices, while leaving the complex, irregular structure traversal to an automatic compiler. We also attempt to regularize the optimizations of the reduction for a dynamic number of values, which is common in a sparse matrix computation. BibTeX: @online{Rong2018, author = {Rong, Hongbo}, title = {Expressing Sparse Matrix Computations for Productive Performance on Spatial Architectures}, year = {2018} }  Schlag S, Schulz C, Seemaier D and Strash D (2018), "Scalable Edge Partitioning" [Abstract] [BibTeX] Abstract: Edge-centric distributed computations have appeared as a recent technique to improve the shortcomings of think-like-a-vertex algorithms on large scale-free networks. In order to increase parallelism on this model, edge partitioning - partitioning edges into roughly equally sized blocks - has emerged as an alternative to traditional (node-based) graph partitioning. In this work, we give a distributed memory parallel algorithm to compute high-quality edge partitions in a scalable way. Our algorithm scales to networks with billions of edges, and runs efficiently on thousands of PEs. Our technique is based on a fast parallelization of split graph construction and a use of advanced node partitioning algorithms. Our extensive experiments show that our algorithm has high quality on large real-world networks and large hyperbolic random graphs, which have a power law degree distribution and are therefore specifically targeted by edge partitioning BibTeX: @online{Schlag2018, author = {Schlag, Sebastian and Schulz, Christian and Seemaier, Daniel and Strash, Darren}, title = {Scalable Edge Partitioning}, year = {2018} }  Simpson T, Dimosthenis Pasadakis ad Kourounis D, Fujita K, Yamaguchi T, Tsuyoshi I and Schenk O (2018), "Balanced Graph Partition Refinement using the Graph p-Laplacian", In Proceedings of the ACM Platform for Advanced Scientific Computing Conference., June, 2018. [Abstract] [BibTeX] [DOI] Abstract: A continuous formulation of the optimal 2-way graph partitioning based on the p-norm minimization of the graph Laplacian Rayleigh quotient is presented, which provides a sharp approximation to the balanced graph partitioning problem, the optimality of which is known to be NP-hard. The minimization is initialized from a cut provided by a state-of-the-art multilevel recursive bisection algorithm, and then a continuation approach reduces the p-norm from a 2-norm towards a 1-norm, employing for each value of p a feasibility-preserving steepest-descent method that converges on the p-Laplacian eigenvector. A filter favors iterates advancing towards minimum edgecut and partition load imbalance. The complexity of the suggested approach is linear in graph edges. The simplicity of the steepest-descent algorithm renders the overall approach highly scalable and efficient in parallel distributed architectures. Parallel implementation of recursive bisection on multi-core CPUs and GPUs are presented for large-scale graphs with up to 1.9 billion tetrahedra. The suggested approach exhibits improvements of up to 52.8% over METIS for graphs originating from triangular Delaunay meshes, 34.7% over METIS and 21.9% over KaHIP for power network graphs, 40.8% over METIS and 20.6% over KaHIP for sparse matrix graphs, and finally 93.2% over METIS for graphs emerging from social networks. BibTeX: @inproceedings{Simpson2018, author = {Simpson, Toby and Dimosthenis, Pasadakis ad Kourounis, Drosos and Fujita, Kohei and Yamaguchi, Takuma and Tsuyoshi, Ichimura and Schenk, Olaf}, title = {Balanced Graph Partition Refinement using the Graph p-Laplacian}, booktitle = {Proceedings of the ACM Platform for Advanced Scientific Computing Conference}, year = {2018}, doi = {10.1145/3218176.3218232} }  Solovyev A, Baranowski MS, Briggs I, Jacobsen C, Rakamariundefined Z and Gopalakrishnan G (2018), "Rigorous Estimation of Floating-Point Round-Off Errors with Symbolic Taylor Expansions", ACM Transaction Programming Language Systems. New York, NY, USA, December, 2018. Vol. 41(1) Association for Computing Machinery. [Abstract] [BibTeX] [DOI] [URL] Abstract: Rigorous estimation of maximum floating-point round-off errors is an important capability central to many formal verification tools. Unfortunately, available techniques for this task often provide overestimates. Also, there are no available rigorous approaches that handle transcendental functions. We have developed a new approach called Symbolic Taylor Expansionsthat avoids this difficulty, and implemented a new tool called FPTaylor embodying this approach. Key to our approach is the use of rigorous global optimization, instead of the more familiar interval arithmetic, affine arithmetic, and/or SMT solvers. In addition to providing far tighter upper bounds of round-off error in a vast majority of cases, FPTaylor also emits analysis certificates in the form of HOL Light proofs. We release FPTaylor along with our benchmarks for evaluation. BibTeX: @article{Solovyev2018, author = {Solovyev, Alexey and Baranowski, Marek S. and Briggs, Ian and Jacobsen, Charles and Rakamariundefined, Zvonimir and Gopalakrishnan, Ganesh}, title = {Rigorous Estimation of Floating-Point Round-Off Errors with Symbolic Taylor Expansions}, journal = {ACM Transaction Programming Language Systems}, publisher = {Association for Computing Machinery}, year = {2018}, volume = {41}, number = {1}, url = {https://doi.org/10.1145/3230733}, doi = {10.1145/3230733} }  Stoltzfus L, Emani M, Lin P-H and Liao C (2018), "Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling", In Proceedings of the Workshop on Memory Centric High Performance Computing. New York, NY, USA , pp. 45-49. ACM. [Abstract] [BibTeX] [DOI] Abstract: Modern supercomputers often use Graphic Processing Units (or GPUs) to meet the ever-growing demands for high performance computing. GPUs typically have a complex memory architecture with various types of memories and caches, such as global memory, shared memory, constant memory, and texture memory. The placement of data on these memories has a tremendous impact on the performance of the HPC applications and identifying the optimal placement location is non-trivial.\ In this paper, we propose a machine learning-based approach to build a classifier to determine the best class of GPU memory that will minimize GPU kernel execution time. This approach utilizes a set of performance counters obtained from profiling runs along with hardware features to generate the trained model. We evaluate our approach on several generations of NVIDIA GPUs, including Kepler, Maxwell, Pascal, and Volta on a set of benchmarks. The results show that the trained model achieves prediction accuracy over 90% and given a global version, the classifier can accurately determine which data placement variant would yield the best performance. BibTeX: @inproceedings{Stoltzfus2018, author = {Stoltzfus, Larisa and Emani, Murali and Lin, Pei-Hung and Liao, Chunhua}, title = {Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling}, booktitle = {Proceedings of the Workshop on Memory Centric High Performance Computing}, publisher = {ACM}, year = {2018}, pages = {45--49}, doi = {10.1145/3286475.3286482} }  Su X, Liao X, Jiang H, Yang C and Xue J (2018), "SCP: Shared Cache Partitioning for High-Performance GEMM", ACM Transactions on Architecture and Code Optimization. New York, NY, USA, October, 2018. Vol. 15(4), pp. 43:1-43:21. ACM. [Abstract] [BibTeX] [DOI] Abstract: GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before they are used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recent used (LRU) caches, which are popular in traditional high-performance processors used in high-end servers and supercomputers. However, the market has recently witnessed a new diversity in processor design, resulting in high-performance processors equipped with shared caches with non-LRU replacement policies. This poses a challenge to the development of high-performance GEMM in a multithreaded context. As several threads try to load data into a shared cache simultaneously, interthread cache conflicts will increase significantly. We present a Shared Cache Partitioning (SCP) method to eliminate interthread cache conflicts in the GEMM routines, by partitioning a shared cache into physically disjoint sets and assigning different sets to different threads. We have implemented SCP in the OpenBLAS library and evaluated it on Phytium 2000+, a 64-core AArch64 processor with private LRU L1 caches and shared pseudo-random L2 caches (per four-core cluster). Our evaluation shows that SCP has effectively reduced the conflict misses in both L1 and L2 caches in a highly optimized GEMM implementation, resulting in an improvement of its performance by 2.75% to 6.91%. BibTeX: @article{Su2018, author = {Su, Xing and Liao, Xiangke and Jiang, Hao and Yang, Canqun and Xue, Jingling}, title = {SCP: Shared Cache Partitioning for High-Performance GEMM}, journal = {ACM Transactions on Architecture and Code Optimization}, publisher = {ACM}, year = {2018}, volume = {15}, number = {4}, pages = {43:1--43:21}, doi = {10.1145/3274654} }  Sun X, Wei K, Lai L, Tsai S and Wu C (2018), "Optimizing Sparse Matrix-Vector Multiplication on GPUs via Index Compression", In Proceedings of the 3rd IEEE Advanced Information Technology, Electronic and Automation Control Conference., October, 2018. , pp. 598-602. [Abstract] [BibTeX] [DOI] Abstract: Sparse matrix-vector multiplication (SpMV) as one of the most significant scientific kernels has been widely used in many scientific disciplines. In practical applications, large-scale spare matrices are usually used for calculation. During these years, Graphic Processing Unit (GPU) has become a powerful platform for high-performance computing, and optimizingSpMV on GPU based systems for efficient performance is the principal interest in many researches. In this paper, we proposed a new method to optimize SpMV on GPUs via index compression. Our index compression method can reduce the index value of the access space. The memory space for recording each column index is significantly reduced from two bytes to one byte, which outperforms the previous work on access performance. The main contributions we make are as follows: (1) Only one byte for each column index is required, which can significantly reduce the working set of the column index and further improve the cache hit ration. (2) Our method can be applied to any kind of matrices, while the previous work can only apply to subset of the matrices. Computational experiments on problems according to the previous work reveal that the best performance improvement ration for ours is up to about 1.5. BibTeX: @inproceedings{Sun2018, author = {Sun, X. and Wei, K. and Lai, L. and Tsai, S. and Wu, C.}, title = {Optimizing Sparse Matrix-Vector Multiplication on GPUs via Index Compression}, booktitle = {Proceedings of the 3rd IEEE Advanced Information Technology, Electronic and Automation Control Conference}, year = {2018}, pages = {598--602}, doi = {10.1109/IAEAC.2018.8577693} }  Tan G, Liu J and Li J (2018), "Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture", ACM Trans. Math. Softw.. New York, NY, USA, August, 2018. Vol. 44(4), pp. 46:1-46:25. ACM. [Abstract] [BibTeX] [DOI] Abstract: Sparse matrix vector multiplication (SpMV) is an important computational kernel in traditional high-performance computing and emerging data-intensive applications. Previous SpMV libraries are optimized by either application-specific or architecture-specific approaches but present difficulties for use in real applications. In this work, we develop an auto-tuning system (SMATER) to bridge the gap between specific optimizations and general-purpose use. SMATER provides programmers a unified interface based on the compressed sparse row (CSR) sparse matrix format by implicitly choosing the best format and fastest implementation for any input sparse matrix during runtime. SMATER leverages a machine-learning model and retargetable back-end library to quickly predict the optimal combination. Performance parameters are extracted from 2,386 matrices in the SuiteSparse matrix collection. The experiments show that SMATER achieves good performance (up to 10 times that of the Intel Math Kernel Library (MKL) on Intel E5-2680 v3) while being portable on state-of-the-art x86 multicore processors, NVIDIA GPUs, and Intel Xeon Phi accelerators. Compared with the Intel MKL library, SMATER runs faster by more than 2.5 times on average. We further demonstrate its adaptivity in an algebraic multigrid solver from the Hypre library and report greater than 20% performance improvement. BibTeX: @article{Tan2018, author = {Tan, Guangming and Liu, Junhong and Li, Jiajia}, title = {Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture}, journal = {ACM Trans. Math. Softw.}, publisher = {ACM}, year = {2018}, volume = {44}, number = {4}, pages = {46:1--46:25}, doi = {10.1145/3218823} }  Tavernier J, Simm J, Meerbergen K and Moreau Y (2018), "Multilevel preconditioning for Ridge Regression" [Abstract] [BibTeX] Abstract: Solving linear systems is often the computational bottleneck in real-life problems. Iterative solvers are the only option due to the complexity of direct algorithms or because the system matrix is not explicitly known. Here, we develop a multilevel preconditioner for regularized least squares linear systems involving a feature or data matrix. Variants of this linear system may appear in machine learning applications, such as ridge regression, logistic regression, support vector machines and matrix factorization with side information. We use clustering algorithms to create coarser levels that preserve the principal components of the covariance or Gram matrix. These coarser levels approximate the dominant eigenvectors and are used to build a multilevel preconditioner accelerating the Conjugate Gradient method. We observed speed-ups for artificial and real-life data. For a specific data set, we achieved speed-up up to a factor 100. BibTeX: @online{Tavernier2018, author = {Tavernier, Joris and Simm, Jaak and Meerbergen, Karl and Moreau, Yves}, title = {Multilevel preconditioning for Ridge Regression}, year = {2018} }  Wang X, Liu W, Xue W and Wu L (2018), "swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures", In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA , pp. 338-353. ACM. [Abstract] [BibTeX] [DOI] Abstract: Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work.\In this paper, we propose a novel data layout called Sparse Level Tile to make all data reuse under control, and design a Producer-Consumer pairing method to make any inter-level synchronization only happen in very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building-block of the current world fastest supercomputer Sunway Taihulight. The experimental results of testing all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and the best speedup of 38.5 over parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor in 1856 matrices and the latest methods on a K80 GPU in 1672 matrices, respectively. BibTeX: @inproceedings{Wang2018, author = {Wang, Xinliang and Liu, Weifeng and Xue, Wei and Wu, Li}, title = {swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures}, booktitle = {Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2018}, pages = {338--353}, doi = {10.1145/3178487.3178513} }  Wang X, Xu P, Xue W, Ao Y, Yang C, Fu H, Gan L, Yang G and Zheng W (2018), "A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010", In Proceedings of the 47th International Conference on Parallel Processing. New York, NY, USA , pp. 53:1-53:11. ACM. [Abstract] [BibTeX] [DOI] Abstract: The sparse triangular solver (SpTRSV) is one of the most essential kernels in many scientific and engineering applications. Efficiently parallelizing the SpTRSV on modern many-core architectures is considerably difficult due to inherent dependency of computation and discontinuous memory accesses. Achieving high performance of SpTRSV is even more challenging for SW26010, the new-generation customized heterogeneous many-core processor equipped in the top-rank Sunway TaihuLight supercomputer. Owing to regular sparse pattern, structured-grid triangular problems show much different computing characteristics with general ones as well as new opportunities to algorithm design on many-core architectures, which ever lacks attention. In this work, we focus on how to design and implement fast SpTRSV for structured-grid problems on SW26010. A generalized algorithm framework of parallel SpTRSV is proposed for best utilization of the features and flexibilities of SW26010 many-core architecture according to the fine-grained Producer-Consumer model. Moreover, a novel parallel structured-grid SpTRSV is presented by using direct data transfers across registers of the computing elements of SW26010. Experiments on four typical structured-grid triangular problems with different problem sizes demonstrate that our SpTRSV can achieve an average momory bandwidth utilization of 79.7% according to the stream benchmark, which leads to a speedup of 17.7 over serial version on SW26010. Furthermore, experiments with real world sparse linear problems show that our proposed SpTRSV can achieve superior preconditioning performance over the Intel Xeon E5-2670 v3 CPU and Intel Xeon Phi 7210 KNL over DDR4 memory. BibTeX: @inproceedings{Wang2018a, author = {Wang, Xinliang and Xu, Ping and Xue, Wei and Ao, Yulong and Yang, Chao and Fu, Haohuan and Gan, Lin and Yang, Guangwen and Zheng, Weimin}, title = {A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010}, booktitle = {Proceedings of the 47th International Conference on Parallel Processing}, publisher = {ACM}, year = {2018}, pages = {53:1--53:11}, doi = {10.1145/3225058.3225071} }  Yang C (2018), "Linear Algebra is the Right Way to Think About Graphs", In Proceedings of the 2018 ACM/IEEE Supercomputing Conference. [Abstract] [BibTeX] Abstract: Graph algorithms are challenging to implement on new accelerators such as GPUs. To address this problem, GraphBLAS is an innovative on-going effort by the graph analytics community to formulate graph algorithms as sparse linear algebra, so that they can be expressed in a performant, succinct and in a backend-agnostic manner. Initial research efforts in implementing GraphBLAS on GPUs for graph processing and analytics have been promising, but challenges such as feature-incompleteness and poor performance still exist compared to their vertex-centric (“think like a vertex”) graph framework counterparts. For our thesis, we propose a multi-language graph framework aiming to simplify the development of graph algorithms, which 1) provides a multi-language GraphBLAS interface for the end-users to express, develop, and refine graph algorithms more succinctly than existing distributed graph frameworks; 2) abstracts away from the end-users performance tuning decisions; 3) utilizes the advantages of existing low-level GPU computing primitives to maintain high performance. BibTeX: @inproceedings{Yang2018, author = {Yang, Carl}, title = {Linear Algebra is the Right Way to Think About Graphs}, booktitle = {Proceedings of the 2018 ACM/IEEE Supercomputing Conference}, year = {2018} }  Yang C, Buluç A and Owens JD (2018), "Design Principles for Sparse Matrix Multiplication on the GPU" [Abstract] [BibTeX] Abstract: We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients -- (i) merge-based load-balancing and (ii) row-major coalesced memory access -- we demonstrate a 3.6x peak speedup and a 23.5% geomean speedup over state-of-the-art SpMM implementations on real-world datasets. BibTeX: @online{Yang2018b, author = {Yang, Carl and Buluç, Aydın and Owens, John D.}, title = {Design Principles for Sparse Matrix Multiplication on the GPU}, year = {2018} }  Yang C, Buluç A and Owens JD (2018), "Implementing Push-Pull Efficiently in GraphBLAS" Vol. abs/1804.03327 [Abstract] [BibTeX] Abstract: We factor Beamer's push-pull, also known as direction-optimized breadth-first-search (DOBFS) into 3 separable optimizations, and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We show that these graph algorithm optimizations, which together constitute DOBFS, can be neatly and separably described using linear algebra and can be expressed in the GraphBLAS linear-algebra-based framework. We provide experimental evidence that with these optimizations, a DOBFS expressed in a linear-algebra-based graph framework attains competitive performance with state-of-the-art graph frameworks on the GPU and on a multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph. BibTeX: @article{Yang2018d, author = {Yang, Carl and Buluç, Aydın and Owens, John D.}, title = {Implementing Push-Pull Efficiently in GraphBLAS}, year = {2018}, volume = {abs/1804.03327} }  Zhang Y, Yang M, Baghdadi R, Kamil S, Shun J and Amarasinghe S (2018), "GraphIt: A High-performance Graph DSL", Proceedings of the ACM Conference on Programming Languages. New York, NY, USA, October, 2018. Vol. 2, pp. 121:1-121:30. ACM. [Abstract] [BibTeX] [DOI] Abstract: The performance bottlenecks of graph applications depend not only on the algorithm and the underlying hardware, but also on the size and structure of the input graph. As a result, programmers must try different combinations of a large set of techniques, which make tradeoffs among locality, work-efficiency, and parallelism, to develop the best implementation for a specific algorithm and type of graph. Existing graph frameworks and domain specific languages (DSLs) lack flexibility, supporting only a limited set of optimizations.\ This paper introduces GraphIt, a new DSL for graph computations that generates fast implementations for algorithms with different performance characteristics running on graphs with different sizes and structures. GraphIt separates what is computed (algorithm) from how it is computed (schedule). Programmers specify the algorithm using an algorithm language, and performance optimizations are specified using a separate scheduling language. The algorithm language simplifies expressing the algorithms, while exposing opportunities for optimizations. We formulate graph optimizations, including edge traversal direction, data layout, parallelization, cache, NUMA, and kernel fusion optimizations, as tradeoffs among locality, parallelism, and work-efficiency. The scheduling language enables programmers to easily search through this complicated tradeoff space by composing together a large set of edge traversal, vertex data layout, and program structure optimizations. The separation of algorithm and schedule also enables us to build an autotuner on top of GraphIt to automatically find high-performance schedules. The compiler uses a new scheduling representation, the graph iteration space, to model, compose, and ensure the validity of the large number of optimizations. We evaluate GraphIt's performance with seven algorithms on graphs with different structures and sizes. GraphIt outperforms the next fastest of six state-of-the-art shared-memory frameworks (Ligra, Green-Marl, GraphMat, Galois, Gemini, and Grazelle) on 24 out of 32 experiments by up to 4.8×, and is never more than 43% slower than the fastest framework on the other experiments. GraphIt also reduces the lines of code by up to an order of magnitude compared to the next fastest framework. BibTeX: @article{Zhang2018, author = {Zhang, Yunming and Yang, Mengjiao and Baghdadi, Riyadh and Kamil, Shoaib and Shun, Julian and Amarasinghe, Saman}, title = {GraphIt: A High-performance Graph DSL}, journal = {Proceedings of the ACM Conference on Programming Languages}, publisher = {ACM}, year = {2018}, volume = {2}, pages = {121:1--121:30}, doi = {10.1145/3276491} }