# A selection of papers on numerical linear algebra and parallel computing published in 2019

This year has been particularly exciting for the fields of numerical linear algebra and parallel computing. I added 267 entries to my bibliography, which works out to a new paper roughly every 1.4 days (365/267) on average! This is a big step up from last year’s 56 papers.

This year, I slightly changed my base Google Scholar query to

`algorithm sparse method parallel performance matrix numerical optimization graph`

and I supplemented the query's results with interesting publications I came across.

## Some general trends

• Machine/deep learning is still (by far) the most popular application in the fields of numerical linear algebra and parallel computing.
• Several papers discuss BLAS-level operations on sparse matrices and show that the community is still looking for good ways to tailor data structures to applications. Parallel graph algorithms are often used in this context to optimise sparse representations.
• Machine learning is also being used to select the best data structure to use for a given application. I find this quite interesting, although I think there's value in understanding why some data structures are more suited than others in specific cases.
• Mixed-precision algorithms are gaining traction quite quickly, especially on GPUs. Also, the revised IEEE 754 standard for floating-point arithmetic (IEEE 754-2019) was released this year. I'm looking forward to seeing the impact this will have on GPU architectures and the robustness of numerical software.
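To make the mixed-precision trend concrete, here is a minimal sketch of iterative refinement, the idea several of the papers listed below build on: solve cheaply in low precision, then correct the solution using residuals computed in higher precision. This is an illustration under stated assumptions, not the method of any particular paper. The function names (`to_f32`, `solve_low_precision`, `refine`) are hypothetical, single precision is simulated by rounding through Python's `struct` module, the matrix is small and dense, and the low-precision elimination is simply redone for every correction instead of reusing a stored factorization.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding every result to float32."""
    n = len(A)
    # Augmented matrix [A | b], stored in simulated single precision.
    M = [[to_f32(v) for v in row] + [to_f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = to_f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_f32(M[i][j] - to_f32(f * M[k][j]))
    # Back substitution, still in simulated single precision.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_f32(s - to_f32(M[i][j] * x[j]))
        x[i] = to_f32(s / M[i][i])
    return x


def refine(A, b, iterations=5):
    """Iterative refinement: low-precision solves, double-precision residuals."""
    x = solve_low_precision(A, b)
    history = []  # infinity norm of the residual before each correction
    for _ in range(iterations):
        # Residual r = b - A x, computed in double precision.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]
        history.append(max(abs(v) for v in r))
        # Correction from a cheap low-precision solve, accumulated in double.
        d = solve_low_precision(A, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x, history


if __name__ == "__main__":
    A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.5], [2.0, 0.5, 5.0]]
    b = [1.0, 2.0, 3.0]
    x, history = refine(A, b)
    print("solution:", x)
    print("residual norms:", history)
```

For a well-conditioned matrix like the one in the usage example, the residual norms should shrink quickly over the first few corrections. A practical implementation would factorize once in low precision and reuse the factors for every correction, which is where the speedup comes from.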

Abdelfattah A, Tomov S and Dongarra J (2019), "Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed-Precision Solvers on GPUs". Technical report, Innovative Computing Laboratory, University of Tennessee.

Abstract: The use of low-precision computations is popular in accelerating machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end graphics processing units (GPUs), now support native 16-bit floating-point arithmetic (i.e., half-precision). While half precision provides a natural 2×/4× speedup against the performance of single/double precisions, respectively, modern GPUs are equipped with hardware accelerators that further boost the FP16 performance. These accelerators, known as tensor cores (TCs), have a theoretical peak performance that is 8×/16× faster than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of TCs outside AI applications. This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the GPU's TC units. Unlike similar efforts that have discussed accelerating Ax = b in real FP16 arithmetic, this paper focuses on complex FP16 precisions. The developed solution uses a “half-complex” precision to accelerate the solution of Ax = b while maintaining complex FP32 precision accuracy. The proposed solver requires the development of a high-performance mixed-precision matrix multiplication (CGEMM-FP16) that accepts half-complex inputs, and uses the TCs' full-precision products and FP32 accumulations for the computation. We discuss two designs and their performance. Similar to the way fast GEMMs power the performance of LAPACK, the mixed-precision CGEMM-FP16 can enable the development of mixed-precision LAPACK algorithms. We illustrate this by integrating both CGEMM-FP16s into the development of mixed-precision LU factorizations of complex matrices.
Finally, an iterative refinement solver is used to deliver complex FP32 accuracy using a preconditioned GMRES solver. Our experiments, conducted on V100 GPUs, show that the mixed-precision solver can be up to 2.5× faster than a full single-complex precision solver.

BibTeX: @techreport{Abdelfattah2019, author = {Abdelfattah, Ahmad and Tomov, Stanimire and Dongarra, Jack}, title = {Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed-Precision Solvers on GPUs}, school = {Innovative Computing Laboratory, University of Tennessee}, year = {2019} }

Acer S, Yaşar A, Rajamanickam S, Wolf M and Çatalyürek ÜV (2019), "Scalable Triangle Counting on Distributed-Memory Systems", In Proceedings of the IEEE High Performance Extreme Computing Conference, September 2019, pp. 1-5.

Abstract: Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the “Static Graph Challenge”. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate a speed up of 2.7× on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph when the graph is not replicated.

BibTeX: @inproceedings{Acer2019, author = {Acer, S. and Yaşar, A. and Rajamanickam, S. and Wolf, M. and Çatalyürek, Ü. V.}, title = {Scalable Triangle Counting on Distributed-Memory Systems}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2019}, pages = {1--5}, doi = {10.1109/HPEC.2019.8916302} }

Adoni HWY, Nahhal T, Krichen M, Aghezzaf B and Elbyed A (2019), "A survey of current challenges in partitioning and processing of graph-structured data in parallel and distributed systems", Distributed and Parallel Databases, November 2019. Springer Science and Business Media LLC.

Abstract: One of the concepts that attracts attention since entering of big data era is the graph-structured data. Suitable frameworks to handle such data would face several constraints, especially scalability, partitioning challenges, processing complexity and hardware configurations. Unfortunately, although several works deal with big data issues, there is a lack of literature review concerning the challenges related to query answering on large-scale graph data. In this survey paper, we review current problems related to the partitioning and processing of graph-structured data. We discuss existing graph processing systems and provide some insights to know how to choose the right system for parallel and distributed processing of large-scale graph data. Finally, we survey current open challenges in this field.

BibTeX: @article{Adoni2019, author = {Adoni, Hamilton Wilfried Yves and Nahhal, Tarik and Krichen, Moez and Aghezzaf, Brahim and Elbyed, Abdeltif}, title = {A survey of current challenges in partitioning and processing of graph-structured data in parallel and distributed systems}, journal = {Distributed and Parallel Databases}, publisher = {Springer Science and Business Media LLC}, year = {2019}, doi = {10.1007/s10619-019-07276-9} }

Afibuzzaman M, Rabbi F, Özkaya MY, Aktulga HM and Çatalyürek UV (2019), "DeepSparse: A Task-Parallel Framework for SparseSolvers on Deep Memory Architectures", In 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2019, pp. 373-382.

Abstract: Data movement is an important bottleneck against efficiency and energy consumption in large-scale sparse matrix computations that are commonly used in linear solvers, eigensolvers and graph analytics. We introduce a novel task-parallel sparse solver framework, named DeepSparse, which adopts a fully integrated task-parallel approach. DeepSparse framework differs from existing work in that it adopts a holistic approach that targets all computational steps in a sparse solver rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). We present the implementation details of DeepSparse and demonstrate its merit in two popular eigensolvers, LOBPCG and Lanczos algorithms. We observe that DeepSparse achieves 2× - 16× fewer cache misses across different cache layers (L1, L2 and L3) over implementations of the same solvers based on optimized library function calls. We also achieve 2× - 3.9× improvement in execution time when using DeepSparse over the same library versions.

BibTeX: @inproceedings{Afibuzzaman2019, author = {M. Afibuzzaman and F. Rabbi and M. Y. Özkaya and H. M. Aktulga and U. V. 
Çatalyürek}, title = {DeepSparse: A Task-Parallel Framework for SparseSolvers on Deep Memory Architectures}, booktitle = {2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)}, year = {2019}, pages = {373-382}, doi = {10.1109/HiPC.2019.00052} }

Agarwal A, Peng J and Milenkovic O (2019), "Online Convex Matrix Factorization with Representative Regions", In Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver, CA.

Abstract: Matrix factorization (MF) is a versatile learning method that has found wide applications in various data-driven disciplines. Still, many MF algorithms do not adequately scale with the size of available datasets and/or lack interpretability. To improve the computational efficiency of the method, an online (streaming) MF algorithm was proposed in [1]. To enable data interpretability, a constrained version of MF, termed convex MF, was introduced in [2]. In the latter work, the basis vectors are required to lie in the convex hull of the data samples, thereby ensuring that every basis can be interpreted as a weighted combination of data samples. No current algorithmic solutions for online convex MF are known as it is challenging to find adequate convex bases without having access to the complete dataset. We address both problems by proposing the first online convex MF algorithm that maintains a collection of constant-size sets of representative data samples needed for interpreting each of the basis [2] and has the same almost sure convergence guarantees as the online learning algorithm of [1]. Our proof techniques combine random coordinate descent algorithms with specialized quasi-martingale convergence analysis. Experiments on synthetic and real world datasets show significant computational savings of the proposed online convex MF method compared to classical convex MF.
Since the proposed method maintains small representative sets of data samples needed for convex interpretations, it is related to a body of work in theoretical computer science, pertaining to generating point sets [3], and in computer vision, pertaining to archetypal analysis [4]. Nevertheless, it differs from these lines of work both in terms of the objective and algorithmic implementations.

BibTeX: @inproceedings{Agarwal2019, author = {Agarwal, Abhishek and Peng, Jianhao and Milenkovic, Olgica}, title = {Online Convex Matrix Factorization with Representative Regions}, booktitle = {Proceedings of the 33rd Conference on Neural Information Processing Systems}, year = {2019} }

Agullo E, Giraud L and Poirel L (2019), "Robust preconditioners via generalized eigenproblems for hybrid sparse linear solvers", SIAM Journal on Matrix Analysis and Applications.

Abstract: The solution of large sparse linear systems is one of the most time consuming kernels in many numerical simulations. The domain decomposition community has developed many efficient and robust methods in the last decades. While many of these solvers fall into the abstract Schwarz (aS) framework, their robustness has originally been demonstrated on a case-by-case basis. In this paper, we propose a bound for the condition number of all deflated aS methods provided that the coarse grid consists of the assembly of local components that contain the kernel of some local operators. We show that classical results from the literature on particular instances of aS methods can be retrieved from this bound. We then show that such a coarse grid correction can be explicitly obtained algebraically via generalized eigenproblems, leading to a condition number independent of the number of domains.
This result can be readily applied to retrieve or improve the bounds previously obtained via generalized eigenproblems in the particular cases of Neumann-Neumann (NN), Additive Schwarz (AS) and optimized Robin but also generalizes them when applied with approximate local solvers. Interestingly, the proposed methodology turns out to be a comparison of the considered particular aS method with generalized versions of both NN and AS for tackling the lower and upper part of the spectrum, respectively. We furthermore show that the application of the considered grid corrections in an additive fashion is robust in the AS case although it is not robust for aS methods in general. In particular, the proposed framework allows for ensuring the robustness of the AS method applied on the Schur complement (AS/S), either with deflation or additively, and with the freedom of relying on an approximate local Schur complement. Numerical experiments illustrate these statements.

BibTeX: @article{Agullo2019, author = {Agullo, Emmanuel and Giraud, Luc and Poirel, Louis}, title = {Robust preconditioners via generalized eigenproblems for hybrid sparse linear solvers}, journal = {SIAM Journal on Matrix Analysis and Applications}, year = {2019}, url = {https://hal.inria.fr/hal-02074474/document} }

Ahmad K, Sundar H and Hall M (2019), "Data-driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs", ACM Transactions on Architecture and Code Optimization. New York, NY, USA, December 2019. Vol. 16(4), pp. 51:1-51:24. ACM.

Abstract: We optimize Sparse Matrix Vector multiplication (SpMV) using a mixed precision strategy (MpSpMV) for Nvidia V100 GPUs. The approach has three benefits: (1) It reduces computation time, (2) it reduces the size of the input matrix and therefore reduces data movement, and (3) it provides an opportunity for increased parallelism.
MpSpMV's decision to lower to single precision is data driven, based on individual nonzero values of the sparse matrix. On all real-valued matrices from the Sparse Matrix Collection, we obtain a maximum speedup of 2.61× and average speedup of 1.06× over double precision, while maintaining higher accuracy compared to single precision.

BibTeX: @article{Ahmad2019, author = {Ahmad, Khalid and Sundar, Hari and Hall, Mary}, title = {Data-driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs}, journal = {ACM Transactions on Architecture and Code Optimization}, publisher = {ACM}, year = {2019}, volume = {16}, number = {4}, pages = {51:1--51:24}, url = {http://doi.acm.org/10.1145/3371275}, doi = {10.1145/3371275} }

Alappat C, Hager G, Schenk O, Thies J, Basermann A, Bishop AR, Fehske H and Wellein G (2019), "A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication".

Abstract: The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead.
We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software is applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.

BibTeX: @article{Alappat2019, author = {Alappat, Christie and Hager, Georg and Schenk, Olaf and Thies, Jonas and Basermann, Achim and Bishop, Alan R. and Fehske, Holger and Wellein, Gerhard}, title = {A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication}, year = {2019} }

Aliaga JI, Dufrechou E, Ezzatti P and Quintana-Ortí ES (2019), "Accelerating the task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations", Parallel Computing.

Abstract: ILUPACK is a valuable tool for the solution of sparse linear systems via iterative Krylov subspace-based methods. Its relevance for the solution of real problems has motivated several efforts to enhance its performance on parallel machines. In this work we focus on exploiting the task-level parallelism derived from the structure of the BiCG method, in addition to the data-level parallelism of the internal matrix computations, with the goal of boosting the performance of a GPU (graphics processing unit) implementation of this solver. First, we revisit the use of dual-GPU systems to execute independent stages of the BiCG concurrently on both accelerators, while leveraging the extra memory space to improve the data access patterns.
In addition, we extend our ideas to compute the BiCG method efficiently in multicore platforms with a single GPU. In this line, we study the possibilities offered by hybrid CPU-GPU computations, as well as a novel synchronization-free sparse triangular linear solver. The experimental results with the new solvers show important acceleration factors with respect to the previous data-parallel CPU and GPU versions.

BibTeX: @article{Aliaga2019, author = {Aliaga, José I. and Dufrechou, Ernesto and Ezzatti, Pablo and Quintana-Ortí, Enrique S.}, title = {Accelerating the task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations}, journal = {Parallel Computing}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S0167819118301777}, doi = {10.1016/j.parco.2019.02.005} }

Anastos M, Lamaison A, Steiner R and Szabó T (2019), "Majority Colorings of Sparse Digraphs", November 2019.

Abstract: A majority coloring of a directed graph is a vertex-coloring in which every vertex has the same color as at most half of its out-neighbors. Kreutzer, Oum, Seymour, van der Zypen and Wood proved that every digraph has a majority 4-coloring and conjectured that every digraph admits a majority 3-coloring. We verify this conjecture for digraphs with chromatic number at most 6 or dichromatic number at most 3. We obtain analogous results for list coloring: We show that every digraph with list chromatic number at most 6 or list dichromatic number at most 3 is majority 3-choosable. We deduce that digraphs with maximum out-degree at most 4 or maximum degree at most 7 are majority 3-choosable. On the way to these results we investigate digraphs admitting a majority 2-coloring. We show that every digraph without odd directed cycles is majority 2-choosable. We answer an open question posed by Kreutzer et al. negatively, by showing that deciding whether a given digraph is majority 2-colorable is NP-complete.
Finally we deal with a fractional relaxation of majority coloring proposed by Kreutzer et al. and show that every digraph has a fractional majority 3.9602-coloring. We show that every digraph with minimum out-degree Ω((1/𝜖)^2 ln(1/𝜖)) has a fractional majority (2+𝜖)-coloring.

BibTeX: @article{Anastos2019, author = {Anastos, Michael and Lamaison, Ander and Steiner, Raphael and Szabó, Tibor}, title = {Majority Colorings of Sparse Digraphs}, year = {2019} }

Anzalone E, Capra M, Peloso R, Martina M and Masera G (2019), "Low-power Hardware Accelerator for Sparse Matrix Convolution in Deep Neural Network".

Abstract: Deep Neural Networks (DNN) have reached an outstanding accuracy in the past years, often going beyond human abilities. Nowadays, DNNs are widely used in many Artificial Intelligence (AI) applications such as computer vision, natural language processing and autonomous driving. However, these incredible performance come at a high computational cost, requiring complex hardware platforms. Therefore, the need for dedicated hardware accelerators able to drastically speed up the execution by preserving a low-power attitude arise. This paper presents innovative techniques able to tackle matrix sparsity in convolutional DNNs due to non-linear activation functions. Developed architectures allow to skip unnecessary operations, like zero multiplications, without sacrificing accuracy or throughput and improving the energy efficiency. Such improvement could enhance the performance of embedded limited-budget battery applications, where cost-effective hardware, accuracy and duration are critical to expanding the deployment of AI.

BibTeX: @inproceedings{Anzalone2019, author = {Anzalone, E. and Capra, M. and Peloso, R. and Martina, M. 
and Masera, G.}, title = {Low-power Hardware Accelerator for Sparse Matrix Convolution in Deep Neural Network}, year = {2019} }

Anzt H, Flegar G, Grützmacher T and Quintana-Ortí ES (2019), "Toward a modular precision ecosystem for high-performance computing", The International Journal of High Performance Computing Applications.

Abstract: With the memory bandwidth of current computer architectures being significantly slower than the (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power in today's high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a “modular precision ecosystem” that allows for more flexibility in terms of customized data access. For memory-bounded scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm -- two applications that are popular in the communities and at the same characteristic representatives for the field of numerical linear algebra and data analytics, respectively.

BibTeX: @article{Anzt2019, author = {Anzt, Hartwig and Flegar, Goran and Grützmacher, Thomas and Quintana-Ortí, Enrique S.}, title = {Toward a modular precision ecosystem for high-performance computing}, journal = {The International Journal of High Performance Computing Applications}, year = {2019}, doi = {10.1177/1094342019846547} }

Anzt H, Chen Y-C, Cojean T, Dongarra J, Flegar G, Nayak P, Quintana-Ortí ES, Tsai YM and Wang W (2019), "Towards Continuous Benchmarking: An Automated Performance Evaluation Framework for High Performance Software", In Proceedings of the Platform for Advanced Scientific Computing Conference. New York, NY, USA, pp. 1-11. ACM.

Abstract: We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository, and visualizes it in a web browser.
GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo's SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics.

BibTeX: @inproceedings{Anzt2019a, author = {Anzt, Hartwig and Chen, Yen-Chen and Cojean, Terry and Dongarra, Jack and Flegar, Goran and Nayak, Pratik and Quintana-Ortí, Enrique S. and Tsai, Yuhsiang M. and Wang, Weichung}, title = {Towards Continuous Benchmarking: An Automated Performance Evaluation Framework for High Performance Software}, booktitle = {Proceedings of the Platform for Advanced Scientific Computing Conference}, publisher = {ACM}, year = {2019}, pages = {1--11}, doi = {10.1145/3324989.3325719} }

Anzt H and Flegar G (2019), "Are we Doing the Right Thing? A Critical Analysis of the Academic HPC Community", In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops, May 2019. IEEE.

Abstract: Like in any other research field, academically surviving in the High Performance Computing (HPC) community generally requires to publish papers, in the best case many of them and in high-ranked journals or at top-tier conferences. As a result, the number of scientific papers published each year in this relatively small community easily outnumbers what a single researcher can read. At the same time, many of the proposed and analyzed strategies, algorithms, and hardware-optimized implementations never make it beyond the prototype stage, as they are abandoned once they served the single purpose of yielding (another) publication.
In a time and field where high-quality manpower is a scarce resource, this is extremely inefficient. In this position paper we promote a radical paradigm shift towards accepting high-quality software patches to community software packages as legitimate conference contributions. In consequence, the reputation and appointability of researchers is no longer based on the classical scientific metrics, but on the quality and documentation of open source software contributions -- effectively improving and accelerating the collaborative development of community software.

BibTeX: @inproceedings{Anzt2019b, author = {Anzt, Hartwig and Flegar, Goran}, title = {Are we Doing the Right Thing? A Critical Analysis of the Academic HPC Community}, booktitle = {2019 IEEE International Parallel and Distributed Processing Symposium Workshops}, publisher = {IEEE}, year = {2019}, doi = {10.1109/ipdpsw.2019.00122} }

Anzt H, Dongarra J, Flegar G, Higham NJ and Quintana-Orti ES (2019), "Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers", Concurrency and Computation: Practice and Experience. Vol. 31(6), pp. e4460.

Abstract: We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem’s data in memory.
Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.

BibTeX: @article{Anzt2019c, author = {Anzt, Hartwig and Dongarra, Jack and Flegar, Goran and Higham, Nicholas J. and Quintana-Orti, Enrique S.}, title = {Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers}, journal = {Concurrency and Computation: Practice and Experience}, year = {2019}, volume = {31}, number = {6}, pages = {e4460}, doi = {10.1002/cpe.4460} }

Anzt H, Cojean T and Kühn E (2019), "Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility", Proceedings of Applied Mathematics and Mechanics. Vol. 19(1).

Abstract: In this position paper we argue for implementing an alternative peer review process for scientific computing contributions that promotes high quality scientific software developments as fully-recognized conference submission. The idea is based on leveraging the code reviewers' feedback on scientific software contributions to community software developments as a third-party review involvement. Providing open access to this technical review would complement the scientific review of the contribution, efficiently reduce the workload of the undisclosed reviewers, improve the algorithm implementation quality and software sustainability, and ensure full reproducibility of the reported results. Using this process creates incentives to publish scientific algorithms in open source software – instead of designing prototype algorithms with the unique purpose of publishing a paper.
In addition, the comments and suggestions of the community being archived in the versioning control systems ensure that also community reviewers are receiving credit for the review contributions – unlike reviewers in the traditional peer review process. Finally, it reflects the particularity of the scientific computing community using conferences rather than journals as the main publication venue.

BibTeX: @inproceedings{Anzt2019d, author = {Anzt, Hartwig and Cojean, Terry and Kühn, Eileen}, title = {Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility}, journal = {Proceedings of Applied Mathematics and Mechanics}, year = {2019}, volume = {19}, number = {1}, doi = {10.1002/pamm.201900490} }

Apers S and de Wolf R (2019), "Quantum Speedup for Graph Sparsification, Cut Approximation and Laplacian Solving", November 2019.

Abstract: Graph sparsification underlies a large number of algorithms, ranging from approximation algorithms for cut problems to solvers for linear systems in the graph Laplacian. In its strongest form, "spectral sparsification" reduces the number of edges to near-linear in the number of nodes, while approximately preserving the cut and spectral structure of the graph. The breakthrough work by Benczúr and Karger (STOC'96) and Spielman and Teng (STOC'04) showed that sparsification can be done optimally in time near-linear in the number of edges of the original graph. In this work we show that quantum algorithms allow to speed up spectral sparsification, and thereby many of the derived algorithms. Given adjacency-list access to a weighted graph with n nodes and m edges, our algorithm outputs an 𝜖-spectral sparsifier in time Õ(√(mn)/𝜖). We prove that this is tight up to polylog-factors.
The algorithm builds on a string of existing results, most notably sparsification algorithms by Spielman and Srivastava (STOC'08) and Koutis and Xu (TOPC'16), a spanner construction by Thorup and Zwick (STOC'01), a single-source shortest-paths quantum algorithm by Dürr et al. (ICALP'04) and an efficient k-wise independent hash construction by Christiani, Pagh and Thorup (STOC'15). Combining our sparsification algorithm with existing classical algorithms yields the first quantum speedup, roughly from Õ(m) to Õ(√(mn)), for approximating the max cut, min cut, min st-cut, sparsest cut and balanced separator of a graph. Combining our algorithm with a classical Laplacian solver, we demonstrate a similar speedup for Laplacian solving, for approximating effective resistances, cover times and eigenvalues of the Laplacian, and for spectral clustering. BibTeX: @article{Apers2019, author = {Apers, Simon and de Wolf, Ronald}, title = {Quantum Speedup for Graph Sparsification, Cut Approximation and Laplacian Solving}, year = {2019} } Argueta A and Chiang D (2019), "Accelerating Sparse Matrix Operations in Neural Networks on Graphics Processing Units", In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, IT, pp. 6215-6224. [Abstract] [BibTeX] Abstract: Graphics Processing Units (GPUs) are commonly used to train and evaluate neural networks efficiently. While previous work in deep learning has focused on accelerating operations on dense matrices/tensors on GPUs, fewer efforts have concentrated on operations involving sparse data structures. Operations using sparse structures are common in natural language models at the input and output layers, because these models operate on sequences over discrete alphabets. 
We present two new GPU algorithms: one at the input layer, for multiplying a matrix by a few-hot vector (generalizing the more common operation of multiplication by a one-hot vector) and one at the output layer, for a fused softmax and top-N selection (commonly used in beam search). Our methods achieve speedups over state-of-the-art parallel GPU baselines of up to 7× and 50×, respectively. We also illustrate how our methods scale on different GPU architectures. BibTeX: @inproceedings{Argueta2019, author = {Argueta, Arturo and Chiang, David}, title = {Accelerating Sparse Matrix Operations in Neural Networks on Graphics Processing Units}, booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, year = {2019}, pages = {6215--6224} } Arjevani Y, Carmon Y, Duchi JC, Foster DJ, Srebro N and Woodworth B (2019), "Lower Bounds for Non-Convex Stochastic Optimization", December, 2019. [Abstract] [BibTeX] Abstract: We lower bound the complexity of finding 𝜖-stationary points (with gradient norm at most 𝜖) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least 𝜖^{-4} queries to find an 𝜖-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of 𝜖^{-3} queries, establishing the optimality of recently proposed variance reduction techniques. BibTeX: @article{Arjevani2019, author = {Arjevani, Yossi and Carmon, Yair and Duchi, John C. and Foster, Dylan J. 
and Srebro, Nathan and Woodworth, Blake}, title = {Lower Bounds for Non-Convex Stochastic Optimization}, year = {2019} } Artemov AG (2019), "Sparse approximate matrix multiplication in a fully recursive distributed task-parallel framework" [Abstract] [BibTeX] Abstract: In this paper we consider parallel implementations of approximate multiplication of large matrices with exponential decay of elements. Such matrices arise in computations related to electronic structure calculations and some other fields of science. Commonly, sparsity is introduced by truncation of input matrices. In turn, the sparse approximate multiplication algorithm [M. Challacombe and N. Bock, arXiv preprint 1011.3534, 2010] performs truncation of sub-matrix products. We consider these two methods and their combination, i.e. truncation of both input matrices and sub-matrix products. Implementations done using the Chunks and Tasks programming model and library [E. H. Rubensson and E. Rudberg, Parallel Comput., 40:328-343, 2014] are presented and discussed. The absolute error asymptotic behavior is derived. A comparison between the three methods in terms of performance is done on a model problem. The algorithms are also applied to matrices coming from large chemical systems with ≈10^6 atoms. BibTeX: @online{Artemov2019, author = {Artemov, Anton G.}, title = {Sparse approximate matrix multiplication in a fully recursive distributed task-parallel framework}, year = {2019} } Augustine T, Sarma J, Pouchet L-N and Rodríguez G (2019), "Generating Piecewise-regular Code from Irregular Structures", In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, NY, USA, pp. 625-639. ACM. [Abstract] [BibTeX] [DOI] Abstract: Irregular data structures, as exemplified with sparse matrices, have proved to be essential in modern computing. Numerous sparse formats have been investigated to improve the overall performance of Sparse Matrix-Vector multiply (SpMV). 
But in this work we propose instead to take a fundamentally different approach: to automatically build sets of regular sub-computations by mining for regular sub-regions in the irregular data structure. Our approach leads to code that is specialized to the sparsity structure of the input matrix, but which does not need anymore any indirection array, thereby improving SIMD vectorizability. We particularly focus on small sparse structures (below 10M nonzeros), and demonstrate substantial performance improvements and compaction capabilities compared to a classical CSR implementation and Intel MKL IE's SpMV implementation, evaluating on 200+ different matrices from the SuiteSparse repository. BibTeX: @inproceedings{Augustine2019, author = {Augustine, Travis and Sarma, Janarthanan and Pouchet, Louis-Noël and Rodríguez, Gabriel}, title = {Generating Piecewise-regular Code from Irregular Structures}, booktitle = {Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation}, publisher = {ACM}, year = {2019}, pages = {625--639}, doi = {10.1145/3314221.3314615} } Ayala A, Tomov S, Luo X, Shaiek H, Haidar A, Bosilca G and Dongarra J (2019), "Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation", In Proceedings of the Workshop on Exascale MPI (ExaMPI). Denver, CO [Abstract] [BibTeX] Abstract: Most applications targeting exascale, such as those part of the Exascale Computing Project (ECP), are designed for heterogeneous architectures and rely on the Message Passing Interface (MPI) as their underlying parallel programming model. In this paper we analyze the limitations of collective MPI communication for the computation of fast Fourier transforms (FFTs), which are relied on heavily for large-scale particle simulations. We present experiments made at one of the largest heterogeneous platforms, the Summit supercomputer at ORNL. 
We discuss communication models from state-of-the-art FFT libraries, and propose a new FFT library, named HEFFTE (Highly Efficient FFTs for Exascale), which supports heterogeneous architectures and yields considerable speedups compared with CPU libraries, while maintaining good weak as well as strong scalability. BibTeX: @inproceedings{Ayala19, author = {Ayala, Alan and Tomov, Stanimire and Luo, Xi and Shaiek, Hejer and Haidar, Azzam and Bosilca, George and Dongarra, Jack}, title = {Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation}, booktitle = {Proceedings of the Workshop on Exascale MPI (ExaMPI)}, year = {2019} } Ayall T, Duan H and Liu C (2019), "Edge Property Based Stream Order Reduce the Performance of Stream Edge Graph Partition", Journal of Physics: Conference Series., November, 2019. Vol. 1395, pp. 1-8. IOP Publishing. [Abstract] [BibTeX] [DOI] Abstract: The graph data exist everywhere in various disciplines such as in social network, biological network, chemical compound, and computer vision, etc. Currently, the size of a graph data have dramatically increased; all disciplines have extracted knowledge from a graph by partitioning and distributing a graph into different clusters node using the distributed graph processing system or graph database, however the graph partition has reduced the performance of those system. Even if the stream edge graph partition has shown better partition quality than vertex graph partition for skew degree distribution of a graph and support big graph partition, the stream order has affected the quality of stream edge graph partition. In this study, we propose two edge properties based stream order models such as TFB (Tree edges First then Backward edges follow), BFT (Backward edges First then Tree edges follow) and study the effect of stream order on stream edge graph partition algorithms. 
The result shows that TFB and BFT models significantly affect the quality of stream edge partition except hashing and all algorithms show best quality of partition on Random order than other order such as TFB, BFT, BFS (Breadth First Search), and DFS (Depth First Search). BibTeX: @article{Ayall2019, author = {Ayall, T. and Duan, H. and Liu, C.}, title = {Edge Property Based Stream Order Reduce the Performance of Stream Edge Graph Partition}, journal = {Journal of Physics: Conference Series}, publisher = {IOP Publishing}, year = {2019}, volume = {1395}, pages = {1--8}, doi = {10.1088/1742-6596/1395/1/012010} } Balaji V and Lucia B (2019), "Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing", In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. Phoenix, AZ, USA, June, 2019. [Abstract] [BibTeX] Abstract: Performance of single-machine, shared memory graph processing is affected by expensive atomic updates and poor cache locality. Data duplication, a popular approach to eliminate atomic updates by creating thread-local copies of shared data, incurs extreme memory overheads due to the large sizes of typical input graphs. Even memory-efficient duplication strategies that exploit the power-law structure common to many graphs (by duplicating only the highly-connected "hub" vertices) suffer from overheads for having to dynamically identify the hub vertices. Degree Sorting, a popular graph reordering technique that re-assigns hub vertices consecutive IDs in a bid to improve spatial locality, is effective for single-threaded graph applications but suffers from increased false sharing in parallel executions. The main insight of this work is that the combination of data duplication and Degree Sorting eliminates the overheads of each optimization. 
Degree Sorting improves the efficiency of data duplication by assigning hub vertices consecutive IDs which enables easy identification of the hub vertices. Additionally, duplicating the hub vertex data eliminates false sharing in Degree Sorting since each thread updates its local copy of the hub vertex data. We evaluate this mutually-enabling combination of power-law-specific data duplication and Degree Sorting in a system called RADAR. RADAR improves performance by eliminating atomic updates for hub vertices and improving the cache locality of graph applications, providing speedups of up to 166x (1.88x on average) across different graph applications and input graphs. BibTeX: @inproceedings{Balaji2019, author = {Balaji, Vignesh and Lucia, Brandon}, title = {Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing}, booktitle = {Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing}, year = {2019} } Ballard G, Demmel J, Dumitriu I and Rusciano A (2019), "A Generalized Randomized Rank-Revealing Factorization" [Abstract] [BibTeX] Abstract: We introduce a Generalized Randomized QR-decomposition that may be applied to arbitrary products of matrices and their inverses, without needing to explicitly compute the products or inverses. This factorization is a critical part of a communication-optimal spectral divide-and-conquer algorithm for the nonsymmetric eigenvalue problem. In this paper, we establish that this randomized QR-factorization satisfies the strong rank-revealing properties. We also formally prove its stability, making it suitable in applications. Finally, we present numerical experiments which demonstrate that our theoretical bounds capture the empirical behavior of the factorization. 
BibTeX: @article{Ballard2019, author = {Ballard, Grey and Demmel, James and Dumitriu, Ioana and Rusciano, Alexander}, title = {A Generalized Randomized Rank-Revealing Factorization}, year = {2019} } Barkalov K and Lebedev I (2019), "Parallel Global Optimization for Non-convex Mixed-Integer Problems", In Proceedings of the 5th Russian Supercomputing Days Conference. , pp. 98-109. Springer International Publishing. [Abstract] [BibTeX] Abstract: The paper considers the mixed-integer global optimization problems. A novel parallel algorithm for solving the problems of this class based on the index algorithm for solving the continuous global optimization problems has been proposed. The comparison of this algorithm with known analogs demonstrates the efficiency of the developed approach. The proposed algorithm allows an efficient parallelization including the employment of the graphics accelerators. The results of performed numerical experiments (solving a series of 100 multiextremal mixed-integer problems) confirm a good speedup of the algorithm with the use of GPU. BibTeX: @inproceedings{Barkalov, author = {Barkalov, Konstantin and Lebedev, Ilya}, editor = {Voevodin, Vladimir and Sobolev, Sergey}, title = {Parallel Global Optimization for Non-convex Mixed-Integer Problems}, booktitle = {Proceedings of the 5th Russian Supercomputing Days Conference}, publisher = {Springer International Publishing}, year = {2019}, pages = {98--109} } Barratt S and Boyd S (2019), "Least Squares Auto-Tuning" [Abstract] [BibTeX] Abstract: Least squares is by far the simplest and most commonly applied computational method in many fields. In almost all applications, the least squares objective is rarely the true objective. We account for this discrepancy by parametrizing the least squares problem and automatically adjusting these parameters using an optimization algorithm. We apply our method, which we call least squares auto-tuning, to data fitting. 
BibTeX: @online{Barratt2019, author = {Barratt, Shane and Boyd, Stephen}, title = {Least Squares Auto-Tuning}, year = {2019} } Behnezhad S, Brandt S, Derakhshan M, Fischer M, Hajiaghayi M, Karp RM and Uitto J (2019), "Massively Parallel Computation of Matching and MIS in Sparse Graphs", In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing. New York, NY, USA, pp. 481-490. ACM. [Abstract] [BibTeX] [DOI] Abstract: The Massively Parallel Computation (MPC) model serves as a common abstraction of many modern large-scale parallel computation frameworks and has recently gained a lot of importance, especially in the context of classic graph problems. In this work, we mainly consider maximal matching and maximal independent set problems in the MPC model. These problems are known to admit efficient MPC algorithms if the space available per machine is near-linear in the number n of nodes. This is not only often significantly more than what we can afford, but also allows for easy if not trivial solutions for sparse graphs -- which are common in real-world large-scale graphs. We are, therefore, interested in the low-memory MPC model, where the space per machine is restricted to be strongly sublinear, that is, n^δ for any constant 0 < δ < 1. We parametrize our algorithms by the arboricity λ of the input graph. Our key ingredient is a degree reduction technique that reduces these problems in graphs with arboricity λ to the corresponding problems in graphs with maximum degree poly(λ, log n) in O(log^2 log n) rounds, giving rise to O(√(log λ) ⋅ log log λ + log^2 log n)-round algorithms. Our result is particularly interesting for graphs with poly log n arboricity as for such graphs, we get O(log^2 log n)-round algorithms. 
This covers most natural families of sparse graphs and almost exponentially improves over previous algorithms that all required log^{O(1)} n rounds in this regime of MPC. Finally, our maximal matching algorithm can be employed to obtain a (1+𝜖)-approximate maximum cardinality matching, a (2+𝜖)-approximate maximum weighted matching, as well as a 2-approximate minimum vertex cover in essentially the same number of rounds. BibTeX: @inproceedings{Behnezhad2019, author = {Behnezhad, Soheil and Brandt, Sebastian and Derakhshan, Mahsa and Fischer, Manuela and Hajiaghayi, MohammadTaghi and Karp, Richard M. and Uitto, Jara}, title = {Massively Parallel Computation of Matching and MIS in Sparse Graphs}, booktitle = {Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing}, publisher = {ACM}, year = {2019}, pages = {481--490}, doi = {10.1145/3293611.3331609} } Khalifa DB, Martel M and Adjé A (2019), "POP: A Tuning Assistant for Mixed-Precision Floating-Point Computations", In Proceedings of the 7th International Workshop on Formal Techniques for Safety-Critical Systems. [Abstract] [BibTeX] Abstract: In this article, we describe a static program analysis to determine the lowest floating-point precisions on inputs and intermediate results that guarantees a desired accuracy of the output values. A common practice used by developers without advanced training in computer arithmetic consists in using the highest precision available in hardware (double precision on most CPUs) which can be exorbitant in terms of energy consumption, memory traffic, and bandwidth capacity. To overcome this difficulty, we propose a new precision tuning tool for the floating-point programs integrating a static forward and backward analysis, done by abstract interpretation. Next, our analysis will be expressed as a set of linear constraints easily checked by an SMT solver. 
BibTeX: @inproceedings{BenKhalifa2019, author = {Dorra Ben Khalifa and Matthieu Martel and Assalé Adjé}, title = {POP: A Tuning Assistant for Mixed-Precision Floating-Point Computations}, booktitle = {Proceedings of the 7th International Workshop on Formal Techniques for Safety-Critical Systems}, year = {2019} } Bento J and Ioannidis S (2019), "A family of tractable graph metrics", Applied Network Science., November, 2019. Vol. 4(1), pp. 107. [Abstract] [BibTeX] [DOI] Abstract: Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical distance and the Chartrand-Kubiki-Shultz distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We define a broad family of graph distances, that includes both the chemical and the Chartrand-Kubiki-Shultz distances, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability. 
BibTeX: @article{Bento2019, author = {Bento, José and Ioannidis, Stratis}, title = {A family of tractable graph metrics}, journal = {Applied Network Science}, year = {2019}, volume = {4}, number = {1}, pages = {107}, doi = {10.1007/s41109-019-0219-z} } Bernaschi M, Carrozzo M, Franceschini A and Janna C (2019), "A Dynamic Pattern Factored Sparse Approximate Inverse Preconditioner on Graphics Processing Units", SIAM Journal on Scientific Computing. Vol. 41(3), pp. C139-C160. [Abstract] [BibTeX] [DOI] Abstract: One of the most time-consuming tasks in the procedures for the numerical study of PDEs is the solution to linear systems of equations. To that purpose, iterative solvers are viewed as a promising alternative to direct methods on high-performance computers since, in theory, they are almost perfectly parallelizable. Their main drawback is the need of finding a suitable preconditioner to accelerate convergence. The factorized sparse approximate inverse (FSAI), mainly in its adaptive form, has proven to be an effective parallel preconditioner for several problems. In the present work, we report about two novel ideas to dynamically compute, on graphics processing units (GPUs), the FSAI sparsity pattern, which is the main task in its setup. The first approach, borrowed from the CPU implementation, uses a global array as a nonzero indicator, whereas the second one relies on a merge-sort procedure of multiple arrays. We will show that the second approach requires significantly less memory and overcomes issues related to the limited global memory available on GPUs. Numerical tests prove that the GPU implementation of FSAI allows for an average speed-up of 7.5 over a parallel CPU implementation. Moreover, we will show that the preconditioner computation is still feasible using single precision arithmetic with a further 20% reduction of the setup cost. Finally, the strong scalability of the overall approach is shown in a multi-GPU setting. 
BibTeX: @article{Bernaschi2019, author = {Bernaschi, M. and Carrozzo, M. and Franceschini, A. and Janna, C.}, title = {A Dynamic Pattern Factored Sparse Approximate Inverse Preconditioner on Graphics Processing Units}, journal = {SIAM Journal on Scientific Computing}, year = {2019}, volume = {41}, number = {3}, pages = {C139--C160}, doi = {10.1137/18M1197461} } Berry JW, Butcher N, Çatalyürek ÜV, Hammond SD, Kogge P, Lin P, Olivier SL, Phillips CA, Rajamanickam S, Slota GM, Voskuilen GR, Yaşar A and Young JS (2019), "Multi-Level Memory Algorithmics for Large, Sparse Problems". Thesis at: Sandia National Laboratories. (SAND2019-13871) [Abstract] [BibTeX] [URL] Abstract: In this report, we abstract eleven papers published during the project and describe preliminary unpublished results that warrant follow-up work. The topic is multi-level memory algorithmics, or how to effectively use multiple layers of main memory. Modern compute nodes all have this feature in some form. BibTeX: @techreport{Berry2019, author = {Jonathan W. Berry and Neil Butcher and Ümit V. Çatalyürek and Simon D. Hammond and Peter Kogge and Paul Lin and Stephen L. Olivier and Cynthia A. Phillips and Siva Rajamanickam and George M. Slota and Gwen R. Voskuilen and Abdurrahman Yaşar and Jeffrey S. Young}, title = {Multi-Level Memory Algorithmics for Large, Sparse Problems}, school = {Sandia National Laboratories}, year = {2019}, number = {SAND2019--13871}, url = {https://www.osti.gov/servlets/purl/1574408} } Bertaccini D and Durastante F (2019), "Computing function of large matrices by a preconditioned rational Krylov method", Numerical Mathematics and Advanced Applications. [Abstract] [BibTeX] Abstract: Rational Krylov methods are a powerful alternative for computing the product of a function of a large matrix times a given vector. 
However, the creation of the underlying rational subspaces requires solving sequences of large linear systems, a delicate task that can require intensive computational resources and should be monitored to avoid the creation of subspaces different from those required whenever, e.g., the underlying matrices are ill-conditioned. We propose the use of robust preconditioned iterative techniques to speed up the underlying process. We also discuss briefly how the inexact solution of these linear systems can affect the computed subspace. A preliminary test approximating a fractional power of the Laplacian matrix is included. BibTeX: @article{Bertaccini2019, author = {D. Bertaccini and F. Durastante}, title = {Computing function of large matrices by a preconditioned rational Krylov method}, journal = {Numerical Mathematics and Advanced Applications}, year = {2019} } Bertsimas D and Stellato B (2019), "Online Mixed-Integer Optimization in Milliseconds" [Abstract] [BibTeX] Abstract: We propose a method to solve online mixed-integer optimization (MIO) problems at very high speed using machine learning. By exploiting the repetitive nature of online optimization, we are able to greatly speed up the solution time. Our approach encodes the optimal solution into a small amount of information denoted as strategy using the Voice of Optimization framework proposed in [BS18]. In this way the core part of the optimization algorithm becomes a multiclass classification problem which can be solved very quickly. In this work we extend that framework to real-time and high-speed applications focusing on parametric mixed-integer quadratic optimization (MIQO). We propose an extremely fast online optimization algorithm consisting of a feedforward neural network (NN) evaluation and a linear system solution where the matrix has already been factorized. Therefore, this online approach does not require any solver nor iterative algorithm. 
We show the speed of the proposed method both in terms of total computations required and measured execution time. We estimate the number of floating point operations (flops) required to completely recover the optimal solution as a function of the problem dimensions. Compared to state-of-the-art MIO routines, the online running time of our method is very predictable and can be lower than a single matrix factorization time. We benchmark our method against the state-of-the-art solver Gurobi obtaining from two to three orders of magnitude speedups on benchmarks with real-world data. BibTeX: @article{Bertsimas2019, author = {Bertsimas, Dimitris and Stellato, Bartolomeo}, title = {Online Mixed-Integer Optimization in Milliseconds}, year = {2019} } Besta M, Weber S, Gianinazzi L, Gerstenberger R, Ivanov A, Oltchik Y and Hoefler T (2019), "Slim Graph: Practical Lossy Graph Compression for Approximate Graph Processing, Storage, and Analytics", In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA, pp. 35:1-35:25. ACM. [Abstract] [BibTeX] [DOI] Abstract: We propose Slim Graph: the first programming model and framework for practical lossy graph compression that facilitates high-performance approximate graph processing, storage, and analytics. Slim Graph enables the developer to express numerous compression schemes using small and programmable compression kernels that can access and modify local parts of input graphs. Such kernels are executed in parallel by the underlying engine, isolating developers from complexities of parallel programming. Our kernels implement novel graph compression schemes that preserve numerous graph properties, for example connected components, minimum spanning trees, or graph spectra. Finally, Slim Graph uses statistical divergences and other metrics to analyze the accuracy of lossy graph compression. 
We illustrate both theoretically and empirically that Slim Graph accelerates numerous graph algorithms, reduces storage used by graph datasets, and ensures high accuracy of results. Slim Graph may become the common ground for developing, executing, and analyzing emerging lossy graph compression schemes. BibTeX: @inproceedings{Besta2019, author = {Besta, Maciej and Weber, Simon and Gianinazzi, Lukas and Gerstenberger, Robert and Ivanov, Andrey and Oltchik, Yishai and Hoefler, Torsten}, title = {Slim Graph: Practical Lossy Graph Compression for Approximate Graph Processing, Storage, and Analytics}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {ACM}, year = {2019}, pages = {35:1--35:25}, doi = {10.1145/3295500.3356182} } Blanchard P, Higham DJ and Higham NJ (2019), "Accurate Computation of the Log-Sum-Exp and Softmax Functions", September, 2019. [Abstract] [BibTeX] Abstract: Evaluating the log-sum-exp function or the softmax function is a key step in many modern data science algorithms, notably in inference and classification. Because of the exponentials that these functions contain, the evaluation is prone to overflow and underflow, especially in low precision arithmetic. Software implementations commonly use alternative formulas that avoid overflow and reduce the chance of harmful underflow, employing a shift or another rewriting. Although mathematically equivalent, these variants behave differently in floating-point arithmetic. We give rounding error analyses of different evaluation algorithms and interpret the error bounds using condition numbers for the functions. We conclude, based on the analysis and numerical experiments, that the shifted formulas are of similar accuracy to the unshifted ones and that the shifted softmax formula is typically more accurate than a division-free variant. BibTeX: @article{Blanchard2019, author = {Blanchard, Pierre and Higham, Desmond J. 
and Higham, Nicholas J.}, title = {Accurate Computation of the Log-Sum-Exp and Softmax Functions}, year = {2019} } Blanchard P, Higham NJ, Lopez F, Mary T and Pranesh S (2019), "Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores" [Abstract] [BibTeX] [URL] Abstract: Block low-rank (BLR) matrices possess a blockwise low-rank property that can be exploited to reduce the complexity of numerical linear algebra algorithms. The impact of these low-rank approximations on the numerical stability of the algorithms in floating-point arithmetic has not previously been analyzed. We present rounding error analysis for the solution of a linear system by LU factorization of BLR matrices. Assuming that a stable pivoting scheme is used, we prove backward stability: the relative backward error is bounded by a modest constant times 𝜖, where the low-rank threshold 𝜖 is the parameter controlling the accuracy of the blockwise low-rank approximations. In addition to this key result, our analysis offers three new insights into the numerical behavior of BLR algorithms. First, we compare the use of a global or local low-rank threshold and find that a global one should be preferred. Second, we show that performing intermediate recompressions during the factorization can significantly reduce its cost without compromising numerical stability. Third, we consider different BLR factorization variants and determine the compress--factor--update (CFU) variant to be the best. Tests on a wide range of matrices from various real-life applications show that the predictions from the analysis are realized in practice. BibTeX: @article{Blanchard2019a, author = {Blanchard, Pierre and Higham, Nicholas J. 
and Lopez, Florent and Mary, Theo and Pranesh, Srikara}, title = {Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores}, year = {2019}, url = {http://eprints.maths.manchester.ac.uk/2727/1/paper.pdf} } Blaß T and Philippsen M (2019), "Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU", In Proceedings of the 12th Workshop on General Purpose Processing Using GPUs. New York, NY, USA, pp. 22-31. ACM. [Abstract] [BibTeX] [DOI] Abstract: GPUs seem to be ideal for algorithms that work in parallel. A number of ways to represent graphs in GPU memory are known. But so far there are no guidelines to select the representation that is likely to result in the best performance. This comprehensive study investigates for CUDA-capable GPUs how different graph representations influence the performance of highly optimized graph processing algorithms that traverse the graphs without modifying them. We evaluate three different graph exchange formats and how efficiently they can be imported into eight graph data structures. We use ten state-of-the-art benchmarks that employ different traversal patterns. We evaluate them on 19 input graphs with different characteristics. The measurements show that there is not a single best data structure; the runtime performance can vary up to a factor of 2 between two representations. The main contribution is a set of rules that helps in picking the best-performing graph representation for a given situation. 
BibTeX: @inproceedings{Blass2019, author = {Blaß, Thorsten and Philippsen, Michael}, title = {Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU}, booktitle = {Proceedings of the 12th Workshop on General Purpose Processing Using GPUs}, publisher = {ACM}, year = {2019}, pages = {22--31}, doi = {10.1145/3300053.3319416} } Bollhöfer M, Schenk O, Janalík R, Hamm S and Gullapalli K (2019), "State-of-The-Art Sparse Direct Solvers" [Abstract] [BibTeX] Abstract: In this chapter we will give an insight into modern sparse elimination methods. These are driven by a preprocessing phase based on combinatorial algorithms which improve diagonal dominance, reduce fill-in and improve concurrency to allow for parallel treatment. Moreover, these methods detect dense submatrices which can be handled by dense matrix kernels based on multi-threaded level-3 BLAS. We will demonstrate for problems arising from circuit simulation how the improvements of recent years have advanced direct solution methods significantly. BibTeX: @inbook{Bollhofer2019, author = {Bollhöfer, Matthias and Schenk, Olaf and Janalík, Radim and Hamm, Steve and Gullapalli, Kiran}, title = {State-of-The-Art Sparse Direct Solvers}, year = {2019} } Bomze IM, Rinaldi F and Zeffiro D (2019), "Active set complexity of the Away-step Frank-Wolfe Algorithm", December, 2019. [Abstract] [BibTeX] Abstract: In this paper, we study active set identification results for the away-step Frank-Wolfe algorithm in different settings. We first prove a local identification property that we apply, in combination with a convergence hypothesis, to get an active set identification result. We then prove, in the nonconvex case, a novel O(1/k) convergence rate result and active set identification for different stepsizes (under suitable assumptions on the set of stationary points). By exploiting those results, we also give explicit active set complexity bounds for both strongly convex and nonconvex objectives. 
While we initially consider the probability simplex as feasible set, in the appendix we show how to adapt some of our results to generic polytopes. BibTeX: @article{Bomze2019, author = {Immanuel M. Bomze and Francesco Rinaldi and Damiano Zeffiro}, title = {Active set complexity of the Away-step Frank-Wolfe Algorithm}, year = {2019} } Bonami P, Salvagnin D and Tramontani A (2019), "Implementing Automatic Benders Decomposition in a Modern MIP Solver" [Abstract] [BibTeX] Abstract: We describe the automatic Benders decomposition implemented in the commercial solver IBM CPLEX. We propose several improvements to the state-of-the-art along two lines: making a numerically robust method able to deal with the general case and improving the efficiency of the method on models amenable to decomposition. For the former, we deal with: unboundedness, failures in generating cuts and scaling of the artificial variable representing the objective. For the latter, we propose a new technique to handle so-called generalized bound constraints and we use different types of normalization conditions in the Cut Generating LPs. We present computational experiments aimed at assessing the importance of the various enhancements. In particular, on our test bed of models amenable to a decomposition, our implementation is approximately 5 times faster than CPLEX default branch-and-cut. A remarkable result is that, on the same test bed, default branch-and-cut is faster than a Benders decomposition that doesn't implement our improvements. BibTeX: @article{Bonami2019, author = {Pierre Bonami and Domenico Salvagnin and Andrea Tramontani}, title = {Implementing Automatic Benders Decomposition in a Modern MIP Solver}, year = {2019} } Boulmier A, Raynaud F, Abdennadher N and Chopard B (2019), "On the Benefits of Anticipating Load Imbalance for Performance Optimization of Parallel Applications", September, 2019. 
[Abstract] [BibTeX] Abstract: In parallel iterative applications, computational efficiency is essential for addressing large problems. Load imbalance is one of the major performance degradation factors of parallel applications. Therefore, distributing, cleverly, and as evenly as possible, the workload among processing elements (PE) maximizes application performance. So far, the standard load balancing method consists in distributing the workload evenly between PEs and, when load imbalance appears, redistributing the extra load from overloaded PEs to underloaded PEs. However, this does not anticipate the load imbalance growth that may continue during the next iterations. In this paper, we present a first step toward a novel philosophy of load balancing that unloads the PEs that will be overloaded in the near future to let the application rebalance itself via its own dynamics. Herein, we present a formal definition of our new approach using a simple mathematical model and discuss its advantages compared to the standard load balancing method. In addition to the theoretical study, we apply our method to an application that reproduces the computation of a fluid model with non-uniform erosion. The performance validates the benefit of anticipating load imbalance. We observed up to 16% performance improvement compared to the standard load balancing method. BibTeX: @article{Boulmier2019, author = {Boulmier, Anthony and Raynaud, Franck and Abdennadher, Nabil and Chopard, Bastien}, title = {On the Benefits of Anticipating Load Imbalance for Performance Optimization of Parallel Applications}, year = {2019} } Brock B, Chen Y, Yan J, Owens J, Buluç A and Yelick K (2019), "RDMA vs. RPC for Implementing Distributed Data Structures", October, 2019. [Abstract] [BibTeX] Abstract: Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. 
In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures. BibTeX: @article{Brock2019, author = {Brock, Benjamin and Chen, Yuxin and Yan, Jiakun and Owens, John and Buluç, Aydın and Yelick, Katherine}, title = {RDMA vs. RPC for Implementing Distributed Data Structures}, year = {2019} } Burkhardt P (2019), "Optimal algebraic Breadth-First Search for sparse graphs" [Abstract] [BibTeX] Abstract: There has been a rise in the popularity of algebraic methods for graph algorithms given the development of the GraphBLAS library and other sparse matrix methods. These are useful in practice because many graph algorithms are amenable to sparse matrix multiplication. 
An exemplar for these approaches is Breadth-First Search (BFS). The algebraic BFS algorithm is simply a recursion of matrix-vector multiplications with the n × n adjacency matrix. Despite many redundant operations over nonzeros that ultimately lead to suboptimal performance, the algebraic BFS is appealing for practical implementations because it is simple and embarrassingly parallel. By using highly tuned matrix libraries it can be faster in practice than the theoretically optimal combinatorial algorithm. Therefore an optimal algebraic BFS should be of keen interest especially if it is easily integrated with existing matrix methods. Current methods, notably in the GraphBLAS, use a Sparse Matrix Sparse Vector (SpMSpV) multiplication in which the input vector is kept in a sparse representation in each step of the BFS. But simply applying SpMSpV in BFS does not lead to optimal runtime. Each nonzero in the vector must be masked in subsequent steps. This has been an area of recent research in GraphBLAS and other libraries. While in theory these masking methods are asymptotically optimal on sparse graphs, many add work that leads to suboptimal runtime. We give a new optimal, algebraic BFS for sparse graphs that is also a constant factor faster than theoretically optimal SpMSpV methods. We show how to eliminate redundant operations so an element in the adjacency matrix is operated upon no more than once, thus taking O(m) operations for a graph with O(m) edges. Our method multiplies progressively smaller submatrices of the adjacency matrix at each step. The matrix remains unchanged, rather we are masking the rows and columns in the matrix that correspond to previously visited vertices. The input vector in each step is also effectively masked so it is a sparse vector. Thus our method multiplies a sparse submatrix by a sparse vector in decreasing size each step. 
Our sequential algebraic BFS algorithm takes O(m) algebraic operations on a sparse graph as opposed to O(mn) operations of other sparse matrix approaches. Our analysis closes a gap in the literature. BibTeX: @article{Burkhardt2019, author = {Burkhardt, Paul}, title = {Optimal algebraic Breadth-First Search for sparse graphs}, year = {2019} } Buttari A, Orban D, Ruiz D and Titley-Peloquin D (2019), "A Tridiagonalization Method for Symmetric Saddle-Point Systems", SIAM Journal on Scientific Computing. Vol. 41(5), pp. S409-S432. [Abstract] [BibTeX] [DOI] Abstract: We propose an iterative method for the solution of symmetric saddle-point systems that exploits the orthogonal tridiagonalization method of Saunders, Simon, and Yip (1988). By contrast with methods based on the Golub and Kahan (1965) bidiagonalization process, our method takes advantage of two initial vectors and splits the system into the sum of a least-squares and a least-norm problem. Our method typically requires fewer operator-vector products than MINRES, yet performs a comparable amount of work per iteration and has comparable storage requirements. BibTeX: @article{Buttari2019, author = {Buttari, Alfredo and Orban, Dominique and Ruiz, Daniel and Titley-Peloquin, David}, title = {A Tridiagonalization Method for Symmetric Saddle-Point Systems}, journal = {SIAM Journal on Scientific Computing}, year = {2019}, volume = {41}, number = {5}, pages = {S409--S432}, doi = {10.1137/18M1194900} } Buttari A, Hauberg S and Kodsi C (2019), "Parallel QR factorization of block-tridiagonal matrices". Thesis at: Institut de recherche en informatique de Toulouse (IRIT). [Abstract] [BibTeX] Abstract: In this work, we deal with the QR factorization of block-tridiagonal matrices, where the blocks are dense and rectangular. This work is motivated by a novel method for computing geodesics over Riemannian manifolds. If blocks are reduced sequentially along the diagonal, only limited parallelism is available. 
We propose a matrix permutation approach based on the Nested Dissection method which improves parallelism at the cost of additional computations and storage. We provide a detailed analysis of the approach showing that this extra cost is bounded. Finally, we present an implementation for shared memory systems relying on task parallelism and the use of a runtime system. Experimental results support the conclusions of our analysis and show that the proposed approach leads to good performance and scalability. BibTeX: @techreport{Buttari2019a, author = {Buttari, Alfredo and Hauberg, Søren and Kodsi, Costy}, title = {Parallel QR factorization of block-tridiagonal matrices}, school = {Institut de recherche en informatique de Toulouse (IRIT)}, year = {2019} } Caliciotti A, Fasano G, Potra F and Roma M (2019), "Issues on the use of a modified Bunch and Kaufman decomposition for large scale Newton's equation" [Abstract] [BibTeX] [URL] Abstract: In this work, we deal with Truncated Newton methods for solving large scale (possibly nonconvex) unconstrained optimization problems. In particular, we consider the use of a modified Bunch and Kaufman factorization for solving the Newton equation, at each (outer) iteration of the method. The Bunch and Kaufman factorization of a tridiagonal matrix is an effective and stable matrix decomposition, which is well exploited in the widely adopted SYMMBK [2, 5, 6, 19, 20] routine. It can be used to provide conjugate directions, both in the case of 1 × 1 and 2 × 2 pivoting steps. The main drawback is that the resulting solution of Newton's equation might not be gradient-related, in the case the objective function is nonconvex. Here we first focus on some theoretical properties, in order to ensure that at each iteration of the Truncated Newton method, the search direction obtained by using an adapted Bunch and Kaufman factorization is gradient-related. 
This allows one to perform a standard Armijo-type linesearch procedure, using a bounded descent direction. Furthermore, the results of an extended numerical experience using large scale CUTEst problems are reported, showing the reliability and the efficiency of the proposed approach, both on convex and nonconvex problems. BibTeX: @article{Caliciotti2019, author = {Andrea Caliciotti and Giovanni Fasano and Florian Potra and Massimo Roma}, title = {Issues on the use of a modified Bunch and Kaufman decomposition for large scale Newton's equation}, year = {2019}, url = {https://arca.unive.it/retrieve/handle/10278/3729359/205899/CFPR-final2019_accepted.pdf} } Camacho J, Smilde AK, Saccenti E and Westerhuis JA (2019), "All Sparse PCA Models Are Wrong, But Some Are Useful. Part I: Computation of Scores, Residuals and Explained Variance" [Abstract] [BibTeX] Abstract: Sparse Principal Component Analysis (sPCA) is a popular matrix factorization approach based on Principal Component Analysis (PCA) that combines variance maximization and sparsity with the ultimate goal of improving data interpretation. When moving from PCA to sPCA, there are a number of implications that the practitioner needs to be aware of. A relevant one is that scores and loadings in sPCA may not be orthogonal. For this reason, the traditional way of computing scores, residuals and variance explained that is used in the classical PCA cannot directly be applied to sPCA models. This also affects how sPCA components should be visualized. In this paper we illustrate this problem both theoretically and numerically using simulations for several state-of-the-art sPCA algorithms, and provide proper computation of the different elements mentioned. We show that sPCA approaches present disparate and limited performance when modeling noise-free, sparse data. In a follow-up paper, we discuss the theoretical properties that lead to this problem. BibTeX: @article{Camacho2019, author = {Camacho, J. and Smilde, A. K. 
and Saccenti, E. and Westerhuis, J. A.}, title = {All Sparse PCA Models Are Wrong, But Some Are Useful. Part I: Computation of Scores, Residuals and Explained Variance}, year = {2019} } Cao Q, Pei Y, Herauldt T, Akbudak K, Mikhalev A, Bosilca G, Ltaief H, Keyes D and Dongarra J (2019), "Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools", In Proceedings of the 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools. November, 2019. IEEE. [Abstract] [BibTeX] [DOI] Abstract: This paper highlights the necessary development of new instrumentation tools within the PaRSEC task-based runtime system to leverage the performance of low-rank matrix computations. In particular, the tile low-rank (TLR) Cholesky factorization represents one of the most critical matrix operations toward solving challenging large-scale scientific applications. The challenge resides in the heterogeneous arithmetic intensity of the various computational kernels, which stresses PaRSEC's dynamic engine when orchestrating the task executions at runtime. Such irregular workload imposes the deployment of new scheduling heuristics to privilege the critical path, while exposing task parallelism to maximize hardware occupancy. To measure the effectiveness of PaRSEC's engine and its various scheduling strategies for tackling such workloads, it becomes paramount to implement adequate performance analysis and profiling tools tailored to fine-grained and heterogeneous task execution. This permits us not only to provide insights from PaRSEC, but also to identify potential applications' performance bottlenecks. These instrumentation tools may actually foster synergism between applications and PaRSEC developers for productivity as well as high-performance computing purposes. 
We demonstrate the benefits of these amenable tools, while assessing the performance of TLR Cholesky factorization from data distribution, communication-reducing and synchronization-reducing perspectives. This tool-assisted performance analysis results in three major contributions: a new hybrid data distribution, a new hierarchical TLR Cholesky algorithm, and a new performance model for tuning the tile size. The new TLR Cholesky factorization achieves an 8X performance speedup over existing implementations on massively parallel supercomputers, toward solving large-scale 3D climate and weather prediction applications. BibTeX: @inproceedings{Cao2019, author = {Quinglei Cao and Yu Pei and Thomas Herauldt and Kadir Akbudak and Aleksandr Mikhalev and George Bosilca and Hatem Ltaief and David Keyes and Jack Dongarra}, title = {Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools}, booktitle = {Proceedings of the 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools}, publisher = {IEEE}, year = {2019}, doi = {10.1109/protools49597.2019.00009} } Carson EC (2019), "An Adaptive s-step Conjugate Gradient Algorithm with Dynamic Basis Updating", August, 2019. [Abstract] [BibTeX] Abstract: The adaptive s-step CG algorithm is a solver for sparse, symmetric positive definite linear systems designed to reduce the synchronization cost per iteration while still achieving a user-specified accuracy requirement. In this work, we improve the adaptive s-step conjugate gradient algorithm by use of iteratively updated estimates of the largest and smallest Ritz values, which give approximations of the largest and smallest eigenvalues of A, using a technique due to Meurant and Tichý [G. Meurant and P. Tichý, Numer. Algs. (2018), pp. 1--32]. 
The Ritz value estimates are used to dynamically update parameters for constructing Newton or Chebyshev polynomials so that the conditioning of the s-step bases can be continuously improved throughout the iterations. These estimates are also used to automatically set a variable related to the ratio of the sizes of the error and residual, which was previously treated as an input parameter. We show through numerical experiments that in many cases the new algorithm improves upon the previous adaptive s-step approach both in terms of numerical behavior and reduction in number of synchronizations. BibTeX: @article{Carson2019, author = {Carson, Erin C.}, title = {An Adaptive s-step Conjugate Gradient Algorithm with Dynamic Basis Updating}, year = {2019} } Chang Y-J, Fischer M, Ghaffari M, Uitto J and Zheng Y (2019), "The Complexity of (Δ+1) Coloring in Congested Clique, Massively Parallel Computation, and Centralized Local Computation", In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing. New York, NY, USA, pp. 471-480. ACM. [Abstract] [BibTeX] [DOI] Abstract: In this paper, we present new randomized algorithms that improve the complexity of the classic (Δ+1)-coloring problem, and its generalization (Δ+1)-list-coloring, in three well-studied models of distributed, parallel, and centralized computation: Distributed Congested Clique: We present an O(1)-round randomized algorithm for (Δ+1)-list-coloring in the congested clique model of distributed computing. This settles the asymptotic complexity of this problem. It moreover improves upon the O(log* Δ)-round randomized algorithms of Parter and Su [DISC'18] and O((log log Δ) · log* Δ)-round randomized algorithm of Parter [ICALP'18]. Massively Parallel Computation: We present a randomized (Δ+1)-list-coloring algorithm with round complexity O(√(log log n)) in the Massively Parallel Computation (MPC) model with strongly sublinear memory per machine. 
This algorithm uses a memory of O(n^α) per machine, for any desirable constant α > 0, and a total memory of Õ(m), where m is the number of edges in the graph. Notably, this is the first coloring algorithm with sublogarithmic round complexity, in the sublinear memory regime of MPC. For the quasilinear memory regime of MPC, an O(1)-round algorithm was given very recently by Assadi et al. [SODA'19]. Centralized Local Computation: We show that (Δ+1)-list-coloring can be solved by a randomized algorithm with query complexity Δ^O(1) · O(log n), in the centralized local computation model. The previous state-of-the-art results for (Δ+1)-list-coloring in the centralized local computation model are based on simulation of known LOCAL algorithms. The deterministic O(√Δ poly log Δ + log* n)-round LOCAL algorithm of Fraigniaud et al. [FOCS'16] can be implemented in the centralized local computation model with query complexity Δ^O(√Δ poly log Δ) · O(log* n); the randomized O(log* Δ) + 2^O(√(log log n))-round LOCAL algorithm of Chang et al. [STOC'18] can be implemented in the centralized local computation model with query complexity Δ^O(log* Δ) · O(log n). BibTeX: @inproceedings{Chang2019, author = {Chang, Yi-Jun and Fischer, Manuela and Ghaffari, Mohsen and Uitto, Jara and Zheng, Yufan}, title = {The Complexity of (Δ+1) Coloring in Congested Clique, Massively Parallel Computation, and Centralized Local Computation}, booktitle = {Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing}, publisher = {ACM}, year = {2019}, pages = {471--480}, doi = {10.1145/3293611.3331607} } Charara A, Keyes D and Ltaief H (2019), "Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs", ACM Transactions on Mathematical Software. New York, NY, USA, May, 2019. Vol. 45(2), pp. 15:1-15:28. ACM. 
[Abstract] [BibTeX] [DOI] Abstract: Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing threads synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations. BibTeX: @article{Charara2019, author = {Charara, Ali and Keyes, David and Ltaief, Hatem}, title = {Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs}, journal = {ACM Transactions on Mathematical Software}, publisher = {ACM}, year = {2019}, volume = {45}, number = {2}, pages = {15:1--15:28}, doi = {10.1145/3267101} } Chehade A and Shi Z (2019), "The Sparse Reverse of Principal Component Analysis for Fast Low-Rank Matrix Completion", October, 2019. [Abstract] [BibTeX] Abstract: Matrix completion constantly receives tremendous attention from many research fields. It is commonly applied for recommender systems such as movie ratings, computer vision such as image reconstruction or completion, multi-task learning such as collaboratively modeling time-series trends of multiple sensors, and many other applications. 
Matrix completion techniques are usually computationally exhaustive and/or fail to capture the heterogeneity in the data. For example, images usually contain a heterogeneous set of objects, and thus it is a challenging task to reconstruct images with high levels of missing data. In this paper, we propose the sparse reverse of principal component analysis for matrix completion. The proposed approach maintains smoothness across the matrix, produces accurate estimates of the missing data, converges iteratively, and is computationally tractable with a controllable upper bound on the number of iterations until convergence. The accuracy of the proposed technique is validated on natural images, movie ratings, and multisensor data. It is also compared with common benchmark methods used for matrix completion. BibTeX: @article{Chehade2019, author = {Chehade, Abdallah and Shi, Zunya}, title = {The Sparse Reverse of Principal Component Analysis for Fast Low-Rank Matrix Completion}, year = {2019} } Chen L and Luo H (2019), "First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow", December, 2019. [Abstract] [BibTeX] Abstract: A novel dynamical inertial Newton system, which is called Hessian-driven Nesterov accelerated gradient (H-NAG) flow, is proposed. Convergence of the continuous trajectory is established via a tailored Lyapunov function, and new first-order accelerated optimization methods are proposed from ODE solvers. It is shown that (semi-)implicit schemes can always achieve linear rate and explicit schemes have the optimal (accelerated) rates for convex and strongly convex objectives. In particular, Nesterov's optimal method is recovered from an explicit scheme for our H-NAG flow. Furthermore, accelerated splitting algorithms for composite optimization problems are also developed. 
BibTeX: @article{Chen2019, author = {Long Chen and Hao Luo}, title = {First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow}, year = {2019} } Cheng W and Dai Y-H (2019), "An active set Newton-CG method for ℓ_1 optimization", Applied and Computational Harmonic Analysis. [Abstract] [BibTeX] [DOI] [URL] Abstract: In this paper, we investigate the active set identification technique of ISTA and provide some good properties. An active set Newton-CG method is then proposed for ℓ_1 optimization. Under appropriate conditions, we show that the proposed method is globally convergent with some nonmonotone line search. The numerical comparisons with several state-of-the-art methods demonstrate the efficiency of the proposed method. BibTeX: @article{Cheng2019, author = {Cheng, Wanyou and Dai, Yu-Hong}, title = {An active set Newton-CG method for ℓ_1 optimization}, journal = {Applied and Computational Harmonic Analysis}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S1063520318300459}, doi = {10.1016/j.acha.2019.08.005} } Choi B, Christlieb A and Wang Y (2019), "Multiscale High-Dimensional Sparse Fourier Algorithms for Noisy Data" [Abstract] [BibTeX] Abstract: We develop an efficient and robust high-dimensional sparse Fourier algorithm for noisy samples. Earlier in the paper Multi-dimensional sublinear sparse Fourier algorithm (2016) [3], an efficient sparse Fourier algorithm with O(ds log s) average-case runtime and O(ds) sampling complexity under certain assumptions was developed for signals that are s-sparse and bandlimited in the d-dimensional Fourier domain, i.e. there are at most s energetic frequencies and they are in [-N/2, N/2)^d ∩ Z^d. However, in practice the measurements of signals often contain noise, and in some cases may only be nearly sparse in the sense that they are well approximated by the best s Fourier modes. 
In this paper, we propose a multiscale sparse Fourier algorithm for noisy samples that proves to be both robust against noise and efficient. BibTeX: @article{Choi2019, author = {Choi, Bosu and Christlieb, Andrew and Wang, Yang}, title = {Multiscale High-Dimensional Sparse Fourier Algorithms for Noisy Data}, year = {2019} } Choi D, Jang J-G and Kang U (2019), "S3CMTF: Fast, accurate, and scalable method for incomplete coupled matrix-tensor factorization", PloS one. Vol. 14(6) [Abstract] [BibTeX] [DOI] Abstract: How can we extract hidden relations from a tensor and a matrix data simultaneously in a fast, accurate, and scalable way? Coupled matrix-tensor factorization (CMTF) is an important tool for this purpose. Designing an accurate and efficient CMTF method has become more crucial as the size and dimension of real-world data are growing explosively. However, existing methods for CMTF suffer from lack of accuracy, slow running time, and limited scalability. In this paper, we propose S3CMTF, a fast, accurate, and scalable CMTF method. In contrast to previous methods which do not handle large sparse tensors and are not parallelizable, S3CMTF provides parallel sparse CMTF by carefully deriving gradient update rules. S3CMTF asynchronously updates partial gradients without expensive locking. We show that our method is guaranteed to converge to a quality solution theoretically and empirically. S3CMTF further boosts the performance by carefully storing intermediate computation and reusing them. We theoretically and empirically show that S3CMTF is the fastest, outperforming existing methods. Experimental results show that S3CMTF is up to 930× faster than existing methods while providing the best accuracy. S3CMTF shows linear scalability on the number of data entries and the number of cores. In addition, we apply S3CMTF to Yelp rating tensor data coupled with 3 additional matrices to discover interesting patterns. 
BibTeX: @article{Choi2019a, author = {Choi, Dongjin and Jang, Jun-Gi and Kang, U.}, title = {S3CMTF: Fast, accurate, and scalable method for incomplete coupled matrix-tensor factorization}, journal = {PloS one}, year = {2019}, volume = {14}, number = {6}, doi = {10.1371/journal.pone.0217316} } Chung J, Chung M, Slagel JT and Tenorio L (2019), "Sampled Limited Memory Methods for Massive Linear Inverse Problems", December, 2019. [Abstract] [BibTeX] Abstract: In many modern imaging applications the desire to reconstruct high resolution images, coupled with the abundance of data from acquisition using ultra-fast detectors, have led to new challenges in image reconstruction. A main challenge is that the resulting linear inverse problems are massive. The size of the forward model matrix exceeds the storage capabilities of computer memory, or the observational dataset is enormous and not available all at once. Row-action methods that iterate over samples of rows can be used to approximate the solution while avoiding memory and data availability constraints. However, their overall convergence can be slow. In this paper, we introduce a sampled limited memory row-action method for linear least squares problems, where an approximation of the global curvature of the underlying least squares problem is used to speed up the initial convergence and to improve the accuracy of iterates. We show that this limited memory method is a generalization of the damped block Kaczmarz method, and we prove linear convergence of the expectation of the iterates and of the error norm up to a convergence horizon. Numerical experiments demonstrate the benefits of these sampled limited memory row-action methods for massive 2D and 3D inverse problems in tomography applications. BibTeX: @article{Chung2019, author = {Julianne Chung and Matthias Chung and J. 
Tanner Slagel and Luis Tenorio}, title = {Sampled Limited Memory Methods for Massive Linear Inverse Problems}, year = {2019} } Cifuentes D and Moitra A (2019), "Polynomial time guarantees for the Burer-Monteiro method", December, 2019. [Abstract] [BibTeX] Abstract: The Burer-Monteiro method is one of the most widely used techniques for solving large-scale semidefinite programs (SDP). The basic idea is to solve a nonconvex program in Y, where Y is an n × p matrix such that X = Y Y^T. In this paper, we show that this method can solve SDPs in polynomial time in a smoothed analysis setting. More precisely, we consider an SDP whose domain satisfies some compactness and smoothness assumptions, and slightly perturb the cost matrix and the constraints. We show that if p ≳ √(2(1+η)m), where m is the number of constraints and η > 0 is any fixed constant, then the Burer-Monteiro method can solve SDPs to any desired accuracy in polynomial time, in the setting of smoothed analysis. Our bound on p approaches the celebrated Barvinok-Pataki bound in the limit as η goes to zero, beneath which it is known that the nonconvex program can be suboptimal. Previous analyses were unable to give polynomial time guarantees for the Burer-Monteiro method, since they either assumed that the criticality conditions are satisfied exactly, or ignored the nontrivial problem of computing an approximately feasible solution. We address the first problem through a novel connection with tubular neighborhoods of algebraic varieties. For the feasibility problem we consider a least squares formulation, and provide the first guarantees that do not rely on the restricted isometry property. BibTeX: @article{Cifuentes2019, author = {Cifuentes, Diego and Moitra, Ankur}, title = {Polynomial time guarantees for the Burer-Monteiro method}, year = {2019} } Cook W (2019), "Computing in Combinatorial Optimization", In Lecture Notes in Computer Science. Vol. 10000 Springer, Cham. 
[Abstract] [BibTeX] [DOI] Abstract: Research in combinatorial optimization successfully combines diverse ideas drawn from computer science, mathematics, and operations research. We give a tour of this work, focusing on the early development of the subject and the central role played by linear programming. The paper concludes with a short wish list of future research directions. BibTeX: @inbook{Cook2019, author = {Cook, William}, editor = {Steffen B., Woeginger G.}, title = {Computing in Combinatorial Optimization}, booktitle = {Lecture Notes in Computer Science}, publisher = {Springer, Cham.}, year = {2019}, volume = {10000}, doi = {10.1007/978-3-319-91908-9_3} } Cools S, Cornelis J, Ghysels P and Vanroose W (2019), "Improving strong scaling of the Conjugate Gradient method for solving large linear systems using global reduction pipelining", In Proceedings of the 2019 EuroMPI conference. [Abstract] [BibTeX] Abstract: This paper presents performance results comparing MPI-based implementations of the popular Conjugate Gradient (CG) method and several of its communication hiding (or “pipelined”) variants. Pipelined CG methods are designed to efficiently solve SPD linear systems on massively parallel distributed memory hardware, and typically display significantly improved strong scaling compared to classic CG. This increase in parallel performance is achieved by overlapping the global reduction phase (MPI_Iallreduce) required to compute the inner products in each iteration by (chiefly local) computational work such as the matrix-vector product as well as other global communication. This work includes a brief introduction to the deep pipelined CG method for readers that may be unfamiliar with the specifics of the method. A brief overview of implementation details provides the practical tools required for implementation of the algorithm. 
Subsequently, easily reproducible strong scaling results on the US Department of Energy (DoE) NERSC machine “Cori” (Phase I -- Haswell nodes) on up to 1024 nodes with 16 MPI ranks per node are presented using an implementation of p(l)-CG that is available in the open source PETSc library. Observations on the staggering and overlap of the asynchronous, non-blocking global communication phases with communication and computational kernels are drawn from the experiments. BibTeX: @inproceedings{Cools2019, author = {Cools, Siegfried and Cornelis, Jeffrey and Ghysels, Pieter and Vanroose, Wim}, title = {Improving strong scaling of the Conjugate Gradient method for solving large linear systems using global reduction pipelining}, booktitle = {Proceedings of the 2019 EuroMPI conference}, year = {2019} } Croci M, Giles MB and Farrell PE (2019), "Multilevel quasi Monte Carlo methods for elliptic PDEs with random field coefficients via fast white noise sampling" [Abstract] [BibTeX] Abstract: When solving partial differential equations with random fields as coefficients the efficient sampling of random field realisations can be challenging. In this paper we focus on the fast sampling of Gaussian fields using quasi-random points in a finite element and multilevel quasi Monte Carlo (MLQMC) setting. Our method uses the SPDE approach combined with a new fast (ML)QMC algorithm for white noise sampling. We express white noise as a wavelet series expansion that we divide in two parts. The first part is sampled using quasi-random points and contains a finite number of terms in order of decaying importance to ensure good QMC convergence. The second part is a correction term which is sampled using standard pseudo-random numbers. We show how the sampling of both terms can be performed in linear time and memory complexity in the number of mesh cells via a supermesh construction, yielding an overall linear cost. 
Furthermore, our technique can be used to enforce the MLQMC coupling even in the case of non-nested mesh hierarchies. We demonstrate the efficacy of our method with numerical experiments. BibTeX: @article{Croci2019, author = {Croci, M. and Giles, M. B. and Farrell, P. E.}, title = {Multilevel quasi Monte Carlo methods for elliptic PDEs with random field coefficients via fast white noise sampling}, year = {2019} } Crowley C, Rodriguez JI, Weiker J and Zoromski J (2019), "Regeneration graphs for polynomial system solving", December, 2019. [Abstract] [BibTeX] Abstract: Regeneration is a popular method for describing the solution set of a system of polynomial equations. In this paper we introduce regeneration graphs to solve polynomial systems. This translates the problem of solving a polynomial system to that of traversing a directed acyclic graph. Previous regeneration algorithms can be viewed in our context as breadth first traversal, and we formulate a depth first alternative which is useful in many applications because it quickly produces a subset of the solutions and is not "all or nothing". BibTeX: @article{Crowley2019, author = {Crowley, Colin and Rodriguez, Jose Israel and Weiker, Jacob and Zoromski, Jacob}, title = {Regeneration graphs for polynomial system solving}, year = {2019} } Çuğu İ and Manguoğlu M (2019), "A parallel multithreaded sparse triangular linear system solver", Computers & Mathematics with Applications. [Abstract] [BibTeX] [DOI] [URL] Abstract: We propose a parallel sparse triangular linear system solver based on the Spike algorithm. Sparse triangular systems are required to be solved in many applications. Often, they are a bottleneck due to their inherently sequential nature. Furthermore, typically many successive systems with the same coefficient matrix and with different right hand side vectors are required to be solved. The proposed solver decouples the problem at the cost of extra arithmetic operations as in the banded case. 
Compared to the banded case, there are extra savings due to the sparsity of the triangular coefficient matrix. We show the parallel performance of the proposed solver against the state-of-the-art parallel sparse triangular solver in Intel's Math Kernel Library (MKL) on a multicore architecture. We also show the effect of various sparse matrix reordering schemes. Numerical results show that the proposed solver outperforms MKL's solver in ∼80% of cases by a factor of 2.47, on average. BibTeX: @article{Cugu2019, author = {Çuğu, İlke and Manguoğlu, Murat}, title = {A parallel multithreaded sparse triangular linear system solver}, journal = {Computers & Mathematics with Applications}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S0898122119304602}, doi = {10.1016/j.camwa.2019.09.012} } Curtis FE, Robinson DP, Royer C and Wright SJ (2019), "Trust-Region Newton-CG with Strong Second-Order Complexity Guarantees for Nonconvex Optimization", December, 2019. [Abstract] [BibTeX] Abstract: Worst-case complexity guarantees for nonconvex optimization algorithms have been a topic of growing interest. Multiple frameworks that achieve the best known complexity bounds among a broad class of first- and second-order strategies have been proposed. These methods have often been designed primarily with complexity guarantees in mind and, as a result, represent a departure from the algorithms that have proved to be the most effective in practice. In this paper, we consider trust-region Newton methods, one of the most popular classes of algorithms for solving nonconvex optimization problems. 
By introducing slight modifications to the original scheme, we obtain two methods---one based on exact subproblem solves and one exploiting inexact subproblem solves as in the popular "trust-region Newton-Conjugate-Gradient" (Newton-CG) method---with iteration and operation complexity bounds that match the best known bounds for the aforementioned class of first- and second-order methods. The resulting Newton-CG method also retains the attractive practical behavior of classical trust-region Newton-CG, which we demonstrate with numerical comparisons on a standard benchmark test set. BibTeX: @article{Curtis2019, author = {Curtis, Frank E. and Robinson, Daniel P. and Royer, Clément and Wright, Stephen J.}, title = {Trust-Region Newton-CG with Strong Second-Order Complexity Guarantees for Nonconvex Optimization}, year = {2019} } Das S, Demmel J, Fountoulakis K, Grigori L and Mahoney MW (2019), "Parallel and Communication Avoiding Least Angle Regression" [Abstract] [BibTeX] Abstract: We are interested in parallelizing the Least Angle Regression (LARS) algorithm for fitting linear regression models to high-dimensional data. We consider two parallel and communication avoiding versions of the basic LARS algorithm. The two algorithms apply to data that have different layout patterns (one is appropriate for row-partitioned data, and the other is appropriate for column-partitioned data), and they have different asymptotic costs and practical performance. The first is bLARS, a block version of LARS algorithm, where we update b columns at each iteration. Assuming that the data are row-partitioned, bLARS reduces the number of arithmetic operations, latency, and bandwidth by a factor of b. The second is Tournament-bLARS (T-bLARS), a tournament version of LARS, in which case processors compete, by running several LARS computations in parallel, to choose b new columns to be added into the solution. 
Assuming that the data are column-partitioned, T-bLARS reduces latency by a factor of b. Similarly to LARS, our proposed methods generate a sequence of linear models. We present extensive numerical experiments that illustrate speed-ups up to 25× compared to LARS. BibTeX: @article{Das2019, author = {Das, Swapnil and Demmel, James and Fountoulakis, Kimon and Grigori, Laura and Mahoney, Michael. W.}, title = {Parallel and Communication Avoiding Least Angle Regression}, year = {2019} } Davis TA, Aznaveh M and Kolodziej S (2019), "Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via SuiteSparse:GraphBLAS", In Proceedings of the 23rd IEEE Conference on High Performance Extreme Computing. Waltham, MA, USA [Abstract] [BibTeX] Abstract: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The sequential performance of the GraphBLAS solution is 3× to 5× faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10× to 15× speedup on a 20-core Intel processor, 17× on an IBM Power8 system, and 20× on a Power9 system, for the largest problems. 
Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm. BibTeX: @inproceedings{Davis2019, author = {Davis, Timothy A. and Aznaveh, Mohsen and Kolodziej, Scott}, title = {Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via SuiteSparse:GraphBLAS}, booktitle = {Proceedings of the 23rd IEEE Conference on High Performance Extreme Computing}, year = {2019} } Davydov D and Kronbichler M (2019), "Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors", July, 2019. [Abstract] [BibTeX] Abstract: Traditional solution approaches for problems in quantum mechanics scale as O(M^3), where M is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling O(M). One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a roofline model. 
Our matrix-free implementation of finite element operators with sparse multivectors achieves the performance of 157 GFlop/s on Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic finite element bases. BibTeX: @article{Davydov2019, author = {Davydov, Denis and Kronbichler, Martin}, title = {Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors}, year = {2019} } Deakin TJ, McIntosh-Smith SN, Price J, Poenaru A, Atkinson PR, Popa C and Salmon J (2019), "Performance Portability across Diverse Computer Architectures", In 2019 IEEE International Workshop on Performance, Portability and Productivity in HPC. United States, September, 2019. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] Abstract: Previous studies into performance portability have typically analysed a single application (and its various implementations) in isolation. In this study we explore the wider landscape of performance portability by considering a number of applications from across the space of dwarfs, written in multiple parallel programming models, and across a diverse set of architectures. We apply rigorous performance portability metrics, as defined by Pennycook et al [1]. We believe this is the broadest and most rigorous performance portability study to date, representing a far reaching exploration of the state of performance portability that is achievable today. We will present a summary of the performance portability of each application and programming model across our diverse range of twelve computer architectures, including six different server CPUs from five different vendors, five different GPUs from two different vendors, and one vector architecture. 
We will conclude with an analysis of the performance portability of key programming models in general, across different application spaces as well across differing architectures, allowing us to comment on more general performance portability principles. BibTeX: @inproceedings{Deakin2019, author = {Deakin, Tom J. and McIntosh-Smith, Simon N. and Price, James and Poenaru, Andrei and Atkinson, Patrick R. and Popa, Codrin and Salmon, Justin}, title = {Performance Portability across Diverse Computer Architectures}, booktitle = {2019 IEEE International Workshop on Performance, Portability and Productivity in HPC}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2019} } Demirci GV and Aykanat C (2019), "Scaling sparse matrix-matrix multiplication in the accumulo database", Distributed and Parallel Databases., January, 2019. [Abstract] [BibTeX] [DOI] Abstract: We propose and implement a sparse matrix-matrix multiplication (SpGEMM) algorithm running on top of Accumulo's iterator framework which enables high performance distributed parallelism. The proposed algorithm provides write-locality while ingesting the output matrix back to database via utilizing row-by-row parallel SpGEMM. The proposed solution also alleviates scanning of input matrices multiple times by making use of Accumulo's batch scanning capability which is used for accessing multiple ranges of key-value pairs in parallel. Even though the use of batch-scanning introduces some latency overheads, these overheads are alleviated by the proposed solution and by using node-level parallelism structures. We also propose a matrix partitioning scheme which reduces the total communication volume and provides a balance of workload among servers. The results of extensive experiments performed on both real-world and synthetic sparse matrices show that the proposed algorithm scales significantly better than the outer-product parallel SpGEMM algorithm available in the Graphulo library. 
By applying the proposed matrix partitioning, the performance of the proposed algorithm is further improved considerably. BibTeX: @article{Demirci2019, author = {Demirci, Gunduz Vehbi and Aykanat, Cevdet}, title = {Scaling sparse matrix-matrix multiplication in the accumulo database}, journal = {Distributed and Parallel Databases}, year = {2019}, doi = {10.1007/s10619-019-07257-y} } Demmel J, Grigori L and Rusciano A (2019), "An improved analysis and unified perspective on deterministic and randomized low rank matrix approximations", October, 2019. [Abstract] [BibTeX] Abstract: We introduce a Generalized LU-Factorization (GLU) for low-rank matrix approximation. We relate this to past approaches and extensively analyze its approximation properties. The established deterministic guarantees are combined with sketching ensembles satisfying Johnson-Lindenstrauss properties to present complete bounds. Particularly good performance is shown for the sub-sampled randomized Hadamard transform (SRHT) ensemble. Moreover, the factorization is shown to unify and generalize many past algorithms. It also helps to explain the effect of sketching on the growth factor during Gaussian Elimination. BibTeX: @article{Demmel2019, author = {Demmel, James and Grigori, Laura and Rusciano, Alexander}, title = {An improved analysis and unified perspective on deterministic and randomized low rank matrix approximations}, year = {2019} } Gonzaga De Oliveira SL and Abreu AAAM (2019), "An Experimental Analysis of Three Pseudo-peripheral Vertex Finders in conjunction with the Reverse Cuthill-McKee Method for Bandwidth Reduction", TEMA (São Carlos)., 12, 2019. Vol. 20, pp. 497-507. scielo. [Abstract] [BibTeX] [DOI] [URL] Abstract: The need to determine pseudoperipheral vertices arises from several graph-theoretical approaches for ordering sparse matrix equations. 
The results of two algorithms for finding such vertices, namely, the George-Liu and Kaveh-Bondarabady algorithms, are evaluated in this work along with a variant of the Kaveh-Bondarabady algorithm. The results suggest that the well-known George-Liu algorithm dominates the other two pseudoperipheral vertex finders mainly when considering the computational times of the algorithms. BibTeX: @article{DeOliveira2019, author = {Gonzaga De Oliveira, S. L. and Abreu, A. A. A. M.}, title = {An Experimental Analysis of Three Pseudo-peripheral Vertex Finders in conjunction with the Reverse Cuthill-McKee Method for Bandwidth Reduction}, journal = {TEMA (São Carlos)}, publisher = {scielo}, year = {2019}, volume = {20}, pages = {497--507}, url = {http://www.scielo.br/scielo.php?script=sci_arttext&pid=S2179-84512019000300497&nrm=iso}, doi = {10.5540/tema.2019.020.03.0497} } Dhulipala L, McGuffey C, Kang H, Gu Y, Blelloch GE, Gibbons PB and Shun J (2019), "Semi-Asymmetric Parallel Graph Algorithms for NVRAMs", October, 2019. [Abstract] [BibTeX] Abstract: Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory-density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved efficiently at a modest cost on a single machine. However, a significant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be significantly more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. 
Similar to the popular semi-external and semi-streaming models for graph analytics, the approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. We present a model, the Parallel Semi-Asymmetric Model (PSAM), to analyze algorithms in the setting, and run experiments on a 48-core NVRAM system to validate the effectiveness of these algorithms. To this end, we study over a dozen graph problems. We develop parallel algorithms for each that are efficient, often work-optimal, in the model. Experimentally, we run all of the algorithms on the largest publicly-available graph and show that our PSAM algorithms outperform the fastest prior algorithms designed for DRAM or NVRAM. We also show that our algorithms running on NVRAM nearly match the fastest prior algorithms running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM. BibTeX: @article{Dhulipala2019, author = {Dhulipala, Laxman and McGuffey, Charlie and Kang, Hongbo and Gu, Yan and Blelloch, Guy E. and Gibbons, Phillip B. and Shun, Julian}, title = {Semi-Asymmetric Parallel Graph Algorithms for NVRAMs}, year = {2019} } Doikov N and Nesterov Y (2019), "Contracting Proximal Methods for Smooth Convex Optimization", CORE Discussion Papers ; 2019/27 (2019) 24 pages http://hdl.handle.net/2078.1/223949., December, 2019. [Abstract] [BibTeX] Abstract: In this paper, we propose new accelerated methods for smooth Convex Optimization, called Contracting Proximal Methods. At every step of these methods, we need to minimize a contracted version of the objective function augmented by a regularization term in the form of Bregman divergence. We provide global convergence analysis for a general scheme admitting inexactness in solving the auxiliary subproblem. 
In the case of using for this purpose high-order Tensor Methods, we demonstrate an acceleration effect for both convex and uniformly convex composite objective function. Thus, our construction explains acceleration for methods of any order starting from one. The augmentation of the number of calls of oracle due to computing the contracted proximal steps, is limited by the logarithmic factor in the worst-case complexity bound. BibTeX: @article{Doikov2019, author = {Nikita Doikov and Yurii Nesterov}, title = {Contracting Proximal Methods for Smooth Convex Optimization}, journal = {CORE Discussion Papers ; 2019/27 (2019) 24 pages http://hdl.handle.net/2078.1/223949}, year = {2019} } Dong X, Liu L, Zhao P, Li G, Li J, Wang X and Feng X (2019), "Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity", In Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques., September, 2019. , pp. 178-191. [Abstract] [BibTeX] [DOI] Abstract: Deep neural networks have been employed in a broad range of applications, including face detection, natural language processing, and autonomous driving. Yet, the neural networks with the capability to tackle real-world problems are intrinsically expensive in computation, hindering the usage of these models. Sparsity in the input data of neural networks provides an optimizing opportunity. However, harnessing the potential performance improvement on modern CPU faces challenges raised by sparse computations of the neural network, such as cache-unfriendly memory accesses and efficient sparse kernel implementation. In this paper, we propose Acorns, a framework to accelerate deep neural networks with input sparsity. In Acorns, sparse input data is organized into our designed sparse data layout, which allows memory-friendly access for kernels in neural networks and opens the door for many performance-critical optimizations. 
Upon that, Acorns generates efficient sparse kernels for operators in neural networks from kernel templates, which combine directions that express specific optimizing transformations to be performed, and straightforward code that describes the computation. Comprehensive evaluations demonstrate Acorns can outperform state-of-the-art baselines by significant speedups. On the real-world detection task in autonomous driving, Acorns demonstrates 1.8-22.6× performance improvement over baselines. Specifically, the generated programs achieve 1.8-2.4× speedups over Intel MKL-DNN, 3.0-8.8× speedups over TensorFlow, and 11.1-13.2× speedups over Intel MKL-Sparse. BibTeX: @inproceedings{Dong2019, author = {Dong, X. and Liu, L. and Zhao, P. and Li, G. and Li, J. and Wang, X. and Feng, X.}, title = {Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity}, booktitle = {Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques}, year = {2019}, pages = {178--191}, doi = {10.1109/PACT.2019.00022} } Donnat C and Holmes S (2019), "Convex Hierarchical Clustering for Graph-Structured Data", November, 2019. [Abstract] [BibTeX] Abstract: Convex clustering is a recent stable alternative to hierarchical clustering. It formulates the recovery of progressively coalescing clusters as a regularized convex problem. While convex clustering was originally designed for handling Euclidean distances between data points, in a growing number of applications, the data is directly characterized by a similarity matrix or weighted graph. In this paper, we extend the robust hierarchical clustering approach to these broader classes of similarities. Having defined an appropriate convex objective, the crux of this adaptation lies in our ability to provide: (a) an efficient recovery of the regularization path and (b) an empirical demonstration of the use of our method. 
We address the first challenge through a proximal dual algorithm, for which we characterize both the theoretical efficiency as well as the empirical performance on a set of experiments. Finally, we highlight the potential of our method by showing its application to several real-life datasets, thus providing a natural extension to the current scope of applications of convex clustering. BibTeX: @article{Donnat2019, author = {Donnat, Claire and Holmes, Susan}, title = {Convex Hierarchical Clustering for Graph-Structured Data}, year = {2019} } Duff I, Hogg J and Lopez F (2019), "A new sparse symmetric indefinite solver using A Posteriori Threshold Pivoting". Thesis at: Science & Technology Facilities Council, UK. (RAL-TR-2018-008) [Abstract] [BibTeX] Abstract: The factorization of sparse symmetric indefinite systems is particularly challenging since pivoting is required to maintain stability of the factorization. Pivoting techniques generally offer limited parallelism and are associated with significant data movement hindering the scalability of these methods. Variants of the Threshold Partial Pivoting (TPP) algorithm for example have been often used because of its numerical robustness but standard implementations exhibit poor parallel performance. On the other hand, some methods trade stability for performance on parallel architectures such as the Supernode Bunch-Kaufman (SBK) used in the PARDISO solver. In this case, however, the factors obtained might not be used to accurately compute the solution of the system. For this reason we have designed a task-based LDL^T factorization algorithm based on a new pivoting strategy called A Posteriori Threshold Pivoting (APTP) that is much more suitable for modern multicore architectures and has the same numerical robustness as the TPP strategy. We implemented our algorithm in a new version of the SPRAL Sparse Symmetric Indefinite Direct Solver (SSIDS) which initially supported GPU-only factorization. 
We have used OpenMP 4 task features to implement a multifrontal algorithm with dense factorizations using the novel APTP, and we show that it performs favourably compared to the state-of-the-art solvers HSL_MA86, HSL_MA97 and PARDISO both in terms of performance on a multicore machine and in terms of numerical robustness. Finally we show that this new solver is able to make use of GPU devices for accelerating the factorization on heterogeneous architectures. BibTeX: @report{Duff2019, author = {Iain Duff, Jonathan Hogg and Lopez, Florent}, title = {A new sparse symmetric indefinite solver using A Posteriori Threshold Pivoting}, school = {Science & Technology Facilities Council, UK}, year = {2019}, number = {RAL-TR-2018-008} } Dufrechou E, Ezzatti P and Quintana-Orti ES (2019), "Automatic Selection of Sparse Triangular Linear System Solvers on GPUs through Machine Learning Techniques", In Proceeding of the 31st International Symposium on Computer Architecture and High Performance Computing., October, 2019. , pp. 41-47. [Abstract] [BibTeX] [DOI] Abstract: The solution of sparse triangular linear systems is often the most time-consuming stage of preconditioned iterative methods to solve general sparse linear systems, where it has to be applied several times for the same sparse matrix. For this reason, its computational performance has a strong impact on a wide range of scientific and engineering applications, which has motivated the study of its efficient execution on massively parallel platforms. In this sense, several methods have been proposed to tackle this operation on graphics processing units (GPUs), which can be classified under either the level-set or the self-scheduling paradigms. The results obtained from the experimental evaluation of the different methods suggest that both paradigms perform well for certain problems but poorly for others. 
Additionally, the relation between the properties of the linear systems and the performance of the different solvers is not evident a priori. In this context, techniques that inexpensively predict which solver is best for a particular linear system can lead to important runtime reductions. Our approach leverages machine learning techniques to select the best sparse triangular solver for a given linear system, with focus on the case where a small number of triangular systems has to be solved for the same matrix. We study the performance of several methods using different features derived from the sparse matrices, obtaining models with more than 80% accuracy and acceptable prediction speed. These results are an important advance towards the automatic selection of the best GPU solver for a given sparse triangular linear system, and the characterization of the performance of these kernels. BibTeX: @inproceedings{Dufrechou2019, author = {Dufrechou, E. and Ezzatti, P. and Quintana-Orti, E. S.}, title = {Automatic Selection of Sparse Triangular Linear System Solvers on GPUs through Machine Learning Techniques}, booktitle = {Proceeding of the 31st International Symposium on Computer Architecture and High Performance Computing}, year = {2019}, pages = {41--47}, doi = {10.1109/SBAC-PAD.2019.00020} } Dumitrasc A, Leleux P and Rüde U (2019), "Block partitioning of sparse rectangular matrices", Proceedings of Applied Mathematics and Mechanics. Vol. 19(1) [Abstract] [BibTeX] [DOI] Abstract: We present a means of reordering large, sparse rectangular matrices such that their nonzeros are closer to the diagonal. This enables a block-partitioning which is useful in parallel contexts. We use the Reverse Cuthill-McKee (RCM) algorithm on the adjacency matrix of the associated bipartite graph. The resulting, reordered matrix has a block bidiagonal structure. 
BibTeX: @inproceedings{Dumitrasc2019, author = {Dumitrasc, Andrei and Leleux, Philippe and Rüde, Ulrich}, title = {Block partitioning of sparse rectangular matrices}, journal = {Proceedings of Applied Mathematics and Mechanics}, year = {2019}, volume = {19}, number = {1}, doi = {10.1002/pamm.201900287} } Dvurechensky P, Gasnikov A, Ostroukhov P, Uribe CA and Ivanova A (2019), "Near-optimal tensor methods for minimizing the gradient norm of convex function", December, 2019. [Abstract] [BibTeX] Abstract: Motivated by convex problems with linear constraints and, in particular, by entropy-regularized optimal transport, we consider the problem of finding ε-approximate stationary points, i.e. points with the norm of the objective gradient less than ε, of convex functions with Lipschitz p-th order derivatives. Lower complexity bounds for this problem were recently proposed in [Grapiglia and Nesterov, arXiv:1907.07053]. However, the methods presented in the same paper do not have optimal complexity bounds. We propose two optimal up to logarithmic factors methods with complexity bounds Õ(ε^{-2(p+1)/(3p+1)}) and Õ(ε^{-2/(3p+1)}) with respect to the initial objective residual and the distance between the starting point and solution respectively. BibTeX: @article{Dvurechensky2019, author = {Dvurechensky, Pavel and Gasnikov, Alexander and Ostroukhov, Petr and Uribe, César A. and Ivanova, Anastasiya}, title = {Near-optimal tensor methods for minimizing the gradient norm of convex function}, year = {2019} } Elafrou A, Goumas G and Koziris N (2019), "Conflict-free Symmetric Sparse Matrix-vector Multiplication on Multicore Architectures", In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA , pp. 48:1-48:15. ACM. 
[Abstract] [BibTeX] [DOI] Abstract: Exploiting the numeric symmetry in sparse matrices to reduce their memory footprint is very tempting for optimizing the memory-bound Sparse Matrix-Vector Multiplication (SpMV) kernel. Despite being very beneficial for serial computation, storing the upper or lower triangular part of the matrix introduces race conditions in the updates to the output vector in a parallel execution. Previous work has suggested using local, per-thread vectors to circumvent this problem, introducing a work-inefficient reduction step that limits the scalability of SpMV. In this paper, we address this issue with Conflict-Free Symmetric (CFS) SpMV, an optimization strategy that organizes the parallel computation into phases of conflict-free execution. We identify such phases through graph coloring and propose heuristics to improve the coloring quality for SpMV in terms of load balancing and locality to the input and output vectors. We evaluate our approach on two multicore shared-memory systems and demonstrate improved performance over the state-of-the-art. BibTeX: @inproceedings{Elafrou2019, author = {Elafrou, Athena and Goumas, Georgios and Koziris, Nectarios}, title = {Conflict-free Symmetric Sparse Matrix-vector Multiplication on Multicore Architectures}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {ACM}, year = {2019}, pages = {48:1--48:15}, doi = {10.1145/3295500.3356148} } Elgohary A, Boehm M, Haas PJ, Reiss FR and Reinwald B (2019), "Compressed Linear Algebra for Declarative Large-scale Machine Learning", Communications of the ACM. New York, NY, USA, April, 2019. Vol. 62(5), pp. 83-91. ACM. [Abstract] [BibTeX] [DOI] Abstract: Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. 
Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compression ratios and fast decompression for block-wise uncompressed operations. Therefore, we introduce Compressed Linear Algebra (CLA) for lossless matrix compression. CLA encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show good compression ratios and operations performance close to the uncompressed case, which enables fitting larger datasets into available memory. We thereby obtain significant end-to-end performance improvements. BibTeX: @article{Elgohary2019, author = {Elgohary, Ahmed and Boehm, Matthias and Haas, Peter J. and Reiss, Frederick R. and Reinwald, Berthold}, title = {Compressed Linear Algebra for Declarative Large-scale Machine Learning}, journal = {Communications of the ACM}, publisher = {ACM}, year = {2019}, volume = {62}, number = {5}, pages = {83--91}, doi = {10.1145/3318221} } Ellis M, Guidi G, Buluç A, Oliker L and Yelick K (2019), "diBELLA: Distributed Long Read to Long Read Alignment", In Proceedings of the 48th International Conference on Parallel Processing. ACM Press. [Abstract] [BibTeX] [DOI] Abstract: We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from “third generation” long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. 
Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies. BibTeX: @inproceedings{Ellis2019, author = {Marquita Ellis and Giulia Guidi and Aydın Buluç and Leonid Oliker and Katherine Yelick}, title = {diBELLA: Distributed Long Read to Long Read Alignment}, booktitle = {Proceedings of the 48th International Conference on Parallel Processing}, publisher = {ACM Press}, year = {2019}, doi = {10.1145/3337821.3337919} } Elsen E, Dukhan M, Gale T and Simonyan K (2019), "Fast Sparse ConvNets", November, 2019. [Abstract] [BibTeX] Abstract: Historically, the pursuit of efficient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include: the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. 
In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly, which we open-source for the benefit of the community as part of the XNNPACK library. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EfficientNet architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by 1.3-2.4× -- equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures. BibTeX: @article{Elsen2019, author = {Elsen, Erich and Dukhan, Marat and Gale, Trevor and Simonyan, Karen}, title = {Fast Sparse ConvNets}, year = {2019} } Enayati S and Özaltın OY (2019), "Optimal Influenza Vaccine Distribution with Equity", European Journal of Operational Research., November, 2019. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: This paper is concerned with the optimal influenza vaccine distribution in a heterogeneous population consisting of multiple subgroups. We employ a compartmental model for influenza transmission and formulate a mathematical program to minimize the number of vaccine doses distributed to effectively extinguish an emerging outbreak in its early stages. 
We propose an equity constraint to help public health authorities consider fairness when making vaccine distribution decisions. We develop an exact solution approach that generates a vaccine distribution policy with a solution quality guarantee. We perform sensitivity analyses on key epidemic parameters in order to illustrate the application of the proposed model. We then analyze the scalability of the solution approach for a population consisting of subgroups based on geographic location and age. We finally demonstrate the proposed model's ability to consider vaccine coverage inequity and discuss a derivative-free optimization approach, as an alternative solution method which can consider various different objective functions and constraints. Our results indicate that consideration of group-specific transmission dynamics is paramount to the optimal distribution of influenza vaccines. BibTeX: @article{Enayati2019, author = {Enayati, Shakiba and Özaltın, Osman Y.}, title = {Optimal Influenza Vaccine Distribution with Equity}, journal = {European Journal of Operational Research}, publisher = {Elsevier BV}, year = {2019}, doi = {10.1016/j.ejor.2019.11.025} } Ernst D, Hager G, Thies J and Wellein G (2019), "Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs" [Abstract] [BibTeX] Abstract: General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall and skinny matrices, which are much taller than wide. Nvidia's current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. 
A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU. BibTeX: @article{Ernst2019, author = {Ernst, Dominik and Hager, Georg and Thies, Jonas and Wellein, Gerhard}, title = {Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs}, year = {2019} } Fagnon V, Kacem I, Lucarelli G and Simon B (2019), "Scheduling on Hybrid Platforms: Improved Approximability Window", December, 2019. [Abstract] [BibTeX] Abstract: Modern platforms are using accelerators in conjunction with standard processing units in order to reduce the running time of specific operations, such as matrix operations, and improve their performance. Scheduling on such hybrid platforms is a challenging problem since the algorithms used for the case of homogeneous resources do not adapt well. In this paper we consider the problem of scheduling a set of tasks subject to precedence constraints on hybrid platforms, composed of two types of processing units. We propose a (3+2√2)-approximation algorithm and a conditional lower bound of 3 on the approximation ratio. These results improve upon the 6-approximation algorithm proposed by Kedad-Sidhoum et al. as well as the lower bound of 2 due to Svensson for identical machines. Our algorithm is inspired by the former one and distinguishes the allocation and the scheduling phases. However, we propose a different allocation procedure which, although less efficient for the allocation sub-problem, leads to an improved approximation ratio for the whole scheduling problem. This approximation ratio actually decreases when the number of processing units of each type is close and matches the conditional lower bound when they are equal. 
BibTeX: @article{Fagnon2019, author = {Vincent Fagnon and Imed Kacem and Giorgio Lucarelli and Bertrand Simon}, title = {Scheduling on Hybrid Platforms: Improved Approximability Window}, year = {2019} } Falco A (2019), "Bridging the Gap Between H-Matrices and Sparse Direct Methods for the Solution of Large Linear Systems". Thesis at: Université de Bordeaux. [Abstract] [BibTeX] Abstract: Many physical phenomena may be studied through modeling and numerical simulations, commonplace in scientific applications. To be tractable on a computer, appropriate discretization techniques must be considered, which often lead to a set of linear equations whose features depend on the discretization techniques. Among them, the Finite Element Method usually leads to sparse linear systems whereas the Boundary Element Method leads to dense linear systems. The size of the resulting linear systems depends on the domain where the studied physical phenomenon develops and tends to become larger and larger as the performance of the computer facilities increases. For the sake of numerical robustness, the solution techniques based on the factorization of the matrix associated with the linear system are the methods of choice when affordable. In that respect, hierarchical methods based on low-rank compression have allowed a drastic reduction of the computational requirements for the solution of dense linear systems over the last two decades. For sparse linear systems, their application remains a challenge which has been studied by both the community of hierarchical matrices and the community of sparse matrices. On the one hand, the first step taken by the community of hierarchical matrices most often takes advantage of the sparsity of the problem through the use of nested dissection. While this approach benefits from the hierarchical structure, it is not, however, as efficient as sparse solvers regarding the exploitation of zeros and the structural separation of zeros from non-zeros. 
On the other hand, sparse factorization is organized so as to lead to a sequence of smaller dense operations, enticing sparse solvers to use this property and exploit compression techniques from hierarchical methods in order to reduce the computational cost of these elementary operations. Nonetheless, the globally hierarchical structure may be lost if the compression of hierarchical methods is used only locally on dense submatrices. We here review the main techniques that have been employed by both those communities, trying to highlight their common properties and their respective limits with a special emphasis on studies that have aimed to bridge the gap between them. With these observations in mind, we propose a class of hierarchical algorithms based on the symbolic analysis of the structure of the factors of a sparse matrix. These algorithms rely on symbolic information to cluster and construct a hierarchical structure coherent with the non-zero pattern of the matrix. Moreover, the resulting hierarchical matrix relies on low-rank compression for the reduction of the memory consumption of large submatrices as well as the time to solution of the solver. We also compare multiple ordering techniques based on geometrical or topological properties. Finally, we open the discussion to a coupling between the Finite Element Method and the Boundary Element Method in a unified computational framework. BibTeX: @phdthesis{Falco2019, author = {Falco, Aurélien}, title = {Bridging the Gap Between H-Matrices and Sparse Direct Methods for the Solution of Large Linear Systems}, school = {Université de Bordeaux}, year = {2019} } Févotte F and Lathuilière B (2019), "Debugging and optimization of HPC programs with the Verrou tool", In Proceedings of the International Workshop on Software Correctness for HPC Applications. United States, 9, 2019. Institute of Electrical and Electronics Engineers (IEEE). 
[Abstract] [BibTeX] Abstract: The analysis of Floating-Point-related issues in HPC codes is becoming a topic of major interest: parallel computing and code optimization often break the reproducibility of numerical results across machines, compilers and even executions of the same program. This paper presents how the Verrou tool can help during all stages of the Floating-Point analysis of HPC codes: diagnostic, debugging and optimization. Recent developments of Verrou are presented, along with examples illustrating the interest of these new features for industrial codes such as code_aster. More specifically, the Verrou arithmetic back-ends now allow analyzing or emulating mixed-precision programs. Interlibm, an interposition layer for the mathematical library, is introduced to mitigate long-standing issues with algorithms from the libm. Finally, debugging algorithms are extended in order to produce useful information as soon as it is available. All these features are available in released version 2.1.0 and upcoming version 2.2.0. BibTeX: @inproceedings{Fevotte2019, author = {François Févotte and Bruno Lathuilière}, title = {Debugging and optimization of HPC programs with the Verrou tool}, booktitle = {Proceedings of the International Workshop on Software Correctness for HPC Applications}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2019} } de Fine Licht J, Kwasniewski G and Hoefler T (2019), "Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis", Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays., December, 2019. [Abstract] [BibTeX] [DOI] Abstract: Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. 
Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, which allows us to support arbitrary data types and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms. BibTeX: @article{FineLicht2019, author = {Johannes de Fine Licht and Grzegorz Kwasniewski and Torsten Hoefler}, title = {Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis}, journal = {Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays}, year = {2019}, doi = {10.1145/3373087.3375296} } Fuchs A and Wentzlaff D (2019), "The Accelerator Wall: Limits of Chip Specialization", In Proceedings of the 25th IEEE International Symposium on High-Performance Computer Architecture. 
[Abstract] [BibTeX] Abstract: Specializing chips using hardware accelerators has become the prime means to alleviate the gap between the growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefit of chip specialization stems from optimizing a computational problem within a given chip's transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns, ultimately hitting an accelerator wall. In this work, we tackle the question of what the limits of future accelerators and chip specialization are. We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using datasheets of thousands of chips. We identify key concepts used in chip specialization, and explore case studies to understand how specialization has progressed over time in different applications and chip platforms (e.g., GPUs, FPGAs, ASICs). Utilizing these insights, we build a model which projects forward to see what future gains can and cannot be enabled from chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them. BibTeX: @inproceedings{Fuchs2019, author = {Fuchs, Adi and Wentzlaff, David}, title = {The Accelerator Wall: Limits of Chip Specialization}, booktitle = {Proceedings of the 25th IEEE International Symposium on High-Performance Computer Architecture}, year = {2019} } Fujiki D, Chatterjee N, Lee D and O'Connor M (2019), "Near-memory Data Transformation for Efficient Sparse Matrix Multi-vector Multiplication", In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA , pp. 55:1-55:17. ACM. 
[Abstract] [BibTeX] [DOI] Abstract: Efficient manipulation of sparse matrices is critical to a wide range of HPC applications. Increasingly, GPUs are used to accelerate these sparse matrix operations. We study one common operation, Sparse Matrix Multi-Vector Multiplication (SpMM), and evaluate the impact of the sparsity, distribution of non-zero elements, and tile-traversal strategies on GPU implementations. Using these insights, we determine that operating on these sparse matrices in a Densified Compressed Sparse Row (DCSR) format is well-suited to the parallel warp-synchronous execution model of the GPU processing elements. Preprocessing or storing the sparse matrix in the DCSR format, however, often requires significantly more memory storage than conventional Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) formats. Given that SpMM kernels are often bottlenecked on DRAM bandwidth, the increase in DRAM traffic to access the larger DCSR formatted data structure can result in a slowdown for many matrices. We propose a near-memory transform engine to dynamically create DCSR formatted tiles for the GPU processing elements from the CSC formatted matrix in memory. This work enhances a GPU's last-level cache/memory controller unit to act as an efficient translator between the compute-optimized representation of data and its corresponding storage/bandwidth-optimized format to accelerate sparse workloads. Our approach achieves 2.26× better performance on average compared to the vendor supplied optimized library for sparse matrix operations, cuSPARSE. 
BibTeX: @inproceedings{Fujiki2019, author = {Fujiki, Daichi and Chatterjee, Niladrish and Lee, Donghyuk and O'Connor, Mike}, title = {Near-memory Data Transformation for Efficient Sparse Matrix Multi-vector Multiplication}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {ACM}, year = {2019}, pages = {55:1--55:17}, doi = {10.1145/3295500.3356154} } Garstka M, Cannon M and Goulart P (2019), "COSMO: A conic operator splitting method for large convex problems", In 2019 18th European Control Conference (ECC)., June, 2019. IEEE. [Abstract] [BibTeX] [DOI] Abstract: This paper describes the Conic Operator Splitting Method (COSMO), an operator splitting algorithm for convex optimisation problems with quadratic objective function and conic constraints. At each step the algorithm alternates between solving a quasi-definite linear system with a constant coefficient matrix and a projection onto convex sets. The solver is able to exploit chordal sparsity in the problem data and to detect infeasible problems. The low per-iteration computational cost makes the method particularly efficient for large problems, e.g. semidefinite programs in portfolio optimisation, graph theory, and robust control. Our Julia implementation is open-source, extensible, integrated into the Julia optimisation ecosystem and performs well on a variety of large convex problem classes. 
BibTeX: @inproceedings{Garstka2019, author = {Garstka, Michael and Cannon, Mark and Goulart, Paul}, title = {COSMO: A conic operator splitting method for large convex problems}, booktitle = {2019 18th European Control Conference (ECC)}, publisher = {IEEE}, year = {2019}, doi = {10.23919/ecc.2019.8796161} } Gates M, Kurzak J, Charara A, YarKhan A and Dongarra J (2019), "SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library", In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA , pp. 26:1-26:18. ACM. [Abstract] [BibTeX] [DOI] Abstract: The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components. 
BibTeX: @inproceedings{Gates2019, author = {Gates, Mark and Kurzak, Jakub and Charara, Ali and YarKhan, Asim and Dongarra, Jack}, title = {SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {ACM}, year = {2019}, pages = {26:1--26:18}, doi = {10.1145/3295500.3356223} } Georgieva I, Harizanov S and Hofreither C (2019), "Iterative Low-rank Approximation Solvers for the Extension Method for Fractional Diffusion". Thesis at: Johann Radon Institute for Computational and Applied Mathematics (RICAM). (RICAM-Report 2019-14) [Abstract] [BibTeX] Abstract: We consider the numerical method for fractional diffusion problems which is based on an extension to a mixed boundary value problem for a local operator in a higher dimensional space. We observe that, when this problem is discretized using tensor product spaces as is commonly done, the solution can be very well approximated by low-rank tensors. This motivates us to apply iterative low-rank approximation algorithms in order to efficiently solve this extended problem. In particular, we employ a recently proposed greedy Tucker approximation method as well as a more classical greedy rank one update method. Throughout, all objects of interest are kept in suitable low-rank approximations, which dramatically reduces the required amount of memory compared to the full formulation of the extended problem. Our approach can be used for general, non-structured space discretizations. If the space discretization itself has tensor product structure, we can further decompose the problem in order to deal with even lower dimensional objects. We also note that the approach can be directly applied to higher-order discretizations both in space and the extended variable. In several numerical examples, we demonstrate the convergence behaviour of the proposed methods. 
In particular, the Tucker approximation approach requires only a few iterations in order to reach the discretization error in all tested settings. BibTeX: @techreport{Georgieva2019, author = {Georgieva, Irina and Harizanov, Stanislav and Hofreither, Clemens}, title = {Iterative Low-rank Approximation Solvers for the Extension Method for Fractional Diffusion}, school = {Johann Radon Institute for Computational and Applied Mathematics (RICAM)}, year = {2019}, number = {RICAM-Report 2019-14} } Giles MB, Jentzen A and Welti T (2019), "Generalised multilevel Picard approximations", November, 2019. [Abstract] [BibTeX] Abstract: It is one of the most challenging problems in applied mathematics to approximatively solve high-dimensional partial differential equations (PDEs). In particular, most of the numerical approximation schemes studied in the scientific literature suffer under the curse of dimensionality in the sense that the number of computational operations needed to compute an approximation with an error of size at most ε > 0 grows at least exponentially in the PDE dimension d ∊ ℕ or in the reciprocal of ε. Recently, so-called full-history recursive multilevel Picard (MLP) approximation methods have been introduced to tackle the problem of approximately solving high-dimensional PDEs. MLP approximation methods currently are, to the best of our knowledge, the only methods for parabolic semi-linear PDEs with general time horizons and general initial conditions for which there is a rigorous proof that they are indeed able to beat the curse of dimensionality. The main purpose of this work is to investigate MLP approximation methods in more depth, to reveal more clearly how these methods can overcome the curse of dimensionality, and to propose a generalised class of MLP approximation schemes, which covers previously analysed MLP approximation schemes as special cases. 
In particular, we develop an abstract framework in which this class of generalised MLP approximations can be formulated and analysed and, thereafter, apply this abstract framework to derive a computational complexity result for suitable MLP approximations for semi-linear heat equations. These resulting MLP approximations for semi-linear heat equations essentially are generalisations of previously introduced MLP approximations for semi-linear heat equations. BibTeX: @article{Giles2019, author = {Giles, Michael B. and Jentzen, Arnulf and Welti, Timo}, title = {Generalised multilevel Picard approximations}, year = {2019} } Gleixner A and Steffy DE (2019), "Linear Programming using Limited-Precision Oracles". Thesis at: Zuse Institute Berlin. (19-57) [Abstract] [BibTeX] Abstract: Since the elimination algorithm of Fourier and Motzkin, many different methods have been developed for solving linear programs. When analyzing the time complexity of LP algorithms, it is typically either assumed that calculations are performed exactly and bounds are derived on the number of elementary arithmetic operations necessary, or the cost of all arithmetic operations is considered through a bit-complexity analysis. Yet in practice, implementations typically use limited-precision arithmetic. In this paper we introduce the idea of a limited-precision LP oracle and study how such an oracle could be used within a larger framework to compute exact precision solutions to LPs. Under mild assumptions, it is shown that a polynomial number of calls to such an oracle and a polynomial number of bit operations, is sufficient to compute an exact solution to an LP. This work provides a foundation for understanding and analyzing the behavior of the methods that are currently most effective in practice for solving LPs exactly. BibTeX: @techreport{Gleixner2019, author = {Ambros Gleixner and Daniel E. 
Steffy}, title = {Linear Programming using Limited-Precision Oracles}, school = {Zuse Institute Berlin}, year = {2019}, number = {19--57} } Goldenberg S, Stathopoulos A and Romero E (2019), "A Golub--Kahan Davidson Method for Accurately Computing a Few Singular Triplets of Large Sparse Matrices", SIAM Journal on Scientific Computing. Vol. 41(4), pp. A2172-A2192. [Abstract] [BibTeX] [DOI] Abstract: Obtaining high accuracy singular triplets for large sparse matrices is a significant challenge, especially when searching for the smallest triplets. Due to the difficulty and size of these problems, efficient methods must function iteratively, with preconditioners, and under strict memory constraints. In this research, we present a Golub--Kahan Davidson method (GKD), which satisfies these requirements and includes features such as soft-locking with orthogonality guarantees, an inner correction equation similar to Jacobi--Davidson, locally optimal +k restarting, and the ability to find real zero singular values in both square and rectangular matrices. Additionally, our method achieves full accuracy while avoiding the augmented matrix, which often converges slowly for the smallest triplets due to the difficulty of interior eigenvalue problems. We describe our method in detail, including implementation issues that arise. Our experimental results confirm the efficiency and stability of our method over the current implementation of PHSVDS in the PRIMME software package. BibTeX: @article{Goldenberg2019, author = {Goldenberg, S. and Stathopoulos, A. 
and Romero, E.}, title = {A Golub--Kahan Davidson Method for Accurately Computing a Few Singular Triplets of Large Sparse Matrices}, journal = {SIAM Journal on Scientific Computing}, year = {2019}, volume = {41}, number = {4}, pages = {A2172-A2192}, doi = {10.1137/18M1222004} } Goncalves M, Lamb I, Brum RM and Azambuja JR (2019), "Evaluating the Impact of Accuracy Relaxation in the Reliability of GPU Register Files", In Proceedings of the 26th IEEE International Conference on Electronics, Circuits and Systems, November 2019, pp. 205-208. [Abstract] [BibTeX] [DOI] Abstract: Thanks to their high computing power, GPUs have joined application domains where reliability is a major concern. Faults on electronic components are mainly caused by energized particles, which may cause malfunction and make them produce incorrect output. Errors may be unacceptable for most applications but a small margin of error can be considered safe in some cases. This work uses an approximate computing perspective to analyze the influence of application accuracy relaxation in GPU register files reliability. We perform a fault injection campaign in a Kepler GPU to identify registers' vulnerability and the impact on resulting data. Results show an average increase in register file reliability of 71.6% for 1% of application accuracy relaxation. BibTeX: @inproceedings{Goncalves2019, author = {Goncalves, M. and Lamb, I. and Brum, R. M. and Azambuja, J. R.}, title = {Evaluating the Impact of Accuracy Relaxation in the Reliability of GPU Register Files}, booktitle = {Proceedings of the 26th IEEE International Conference on Electronics, Circuits and Systems}, year = {2019}, pages = {205--208}, doi = {10.1109/ICECS46596.2019.8964908} } Gondimalla A, Chesnut N, Thottethodi M and Vijaykumar TN (2019), "SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks", In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, USA, pp. 151-165. ACM. 
[Abstract] [BibTeX] [DOI] Abstract: Convolutional neural networks (CNNs) are emerging as powerful tools for image processing. Recent machine learning work has reduced CNNs' compute and data volumes by exploiting the naturally-occurring and actively-transformed zeros in the feature maps and filters. While previous semi-sparse architectures exploit one-sided sparsity either in the feature maps or the filters, but not both, a recent fully-sparse architecture, called Sparse CNN (SCNN), exploits two-sided sparsity to improve performance and energy over dense architectures. However, sparse vector-vector dot product, a key primitive in sparse CNNs, would be inefficient using the representation adopted by SCNN. The dot product requires finding and accessing non-zero elements in matching positions in the two sparse vectors -- an inner join using the position as the key with a single value field. SCNN avoids the inner join by performing a Cartesian product capturing the relevant multiplications. However, SCNN's approach incurs several considerable overheads and is not applicable to non-unit-stride convolutions. Further, exploiting reuse in sparse CNNs fundamentally causes systematic load imbalance not addressed by SCNN. We propose SparTen which achieves efficient inner join by providing support for native two-sided sparse execution and memory storage. To tackle load imbalance, SparTen employs a software scheme, called greedy balancing, which groups filters by density via two variants, a software-only one which uses whole-filter density and a software-hardware hybrid which uses finer-grain density. Our simulations show that, on average, SparTen performs 4.7×, 1.8×, and 3× better than a dense architecture, one-sided sparse architecture, and SCNN, respectively. An FPGA implementation shows that SparTen performs 4.3× and 1.9× better than a dense architecture and a one-sided sparse architecture, respectively. 
BibTeX: @inproceedings{Gondimalla2019, author = {Gondimalla, Ashish and Chesnut, Noah and Thottethodi, Mithuna and Vijaykumar, T. N.}, title = {SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks}, booktitle = {Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, publisher = {ACM}, year = {2019}, pages = {151--165}, doi = {10.1145/3352460.3358291} } Gong Z, Ji H, Fletcher C, Hughes C and Torrellas J (2019), "SparseTrain: Leveraging Dynamic Sparsity in Training DNNs on General-Purpose SIMD Processors", November, 2019. [Abstract] [BibTeX] Abstract: Our community has greatly improved the efficiency of deep learning applications, including by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. We propose a scheme to leverage dynamic sparsity during training. In particular, we exploit zeros introduced by the ReLU activation function to both feature maps and their gradients. This is challenging because the sparsity degree is moderate and the locations of zeros change over time. We also rely purely on software. We identify zeros in a dense data representation without transforming the data and perform conventional vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our method significantly outperforms a highly-optimized dense direct convolution on several popular deep neural networks. At realistic sparsity, we speed up the training of the non-initial convolutional layers in VGG16, ResNet-34, ResNet-50, and Fixup ResNet-50 by 2.19×, 1.37×, 1.31×, and 1.51× respectively on an Intel Skylake-X CPU. 
BibTeX: @article{Gong2019, author = {Gong, Zhangxiaowen and Ji, Houxiang and Fletcher, Christopher and Hughes, Christopher and Torrellas, Josep}, title = {SparseTrain: Leveraging Dynamic Sparsity in Training DNNs on General-Purpose SIMD Processors}, year = {2019} } Gould NIM, Rees T and Scott JA (2019), "Convergence and evaluation-complexity analysis of a regularized tensor-Newton method for solving nonlinear least-squares problems", Computational Optimization and Applications. [Abstract] [BibTeX] [DOI] Abstract: Given a twice-continuously differentiable vector-valued function r(x), a local minimizer of $\|r(x)\|_2$ is sought. We propose and analyse tensor-Newton methods, in which r(x) is replaced locally by its second-order Taylor approximation. Convergence is controlled by regularization of various orders. We establish global convergence to a first-order critical point of $\|r(x)\|_2$, and provide function evaluation bounds that agree with the best-known bounds for methods using second derivatives. Numerical experiments comparing tensor-Newton methods with regularized Gauss--Newton and Newton methods demonstrate the practical performance of the newly proposed method. BibTeX: @article{Gould2019, author = {Gould, Nicholas I. M. and Rees, Tyrone and Scott, Jennifer A.}, title = {Convergence and evaluation-complexity analysis of a regularized tensor-Newton method for solving nonlinear least-squares problems}, journal = {Computational Optimization and Applications}, year = {2019}, doi = {10.1007/s10589-019-00064-2} } Grapiglia GN and Nesterov Y (2019), "Tensor methods for minimizing convex functions with Hölder continuous higher-order derivatives". Thesis at: Center for Operations Research and Economics, University of Louvain. [Abstract] [BibTeX] Abstract: In this paper we study p-order methods for unconstrained minimization of convex functions that are p-times differentiable ($p \ge 2$) with ν-Hölder continuous pth derivatives. 
We propose tensor schemes with and without acceleration. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for reducing the functional residual below a given $\epsilon \in (0, 1)$. Assuming that ν is known, we obtain an improved complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated scheme. For the case in which ν is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$. A lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ is also obtained for this problem class. BibTeX: @techreport{Grapiglia2019, author = {Geovani Nunes Grapiglia and Yurii Nesterov}, title = {Tensor methods for minimizing convex functions with Hölder continuous higher-order derivatives}, school = {Center for Operations Research and Economics, University of Louvain}, year = {2019} } Gratton S, Royer CW, Vicente LN and Zhang Z (2019), "Direct search based on probabilistic feasible descent for bound and linearly constrained problems", Computational Optimization and Applications, January 2019. Vol. 72(3), pp. 525-559. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Direct search is a methodology for derivative-free optimization whose iterations are characterized by evaluating the objective function using a set of polling directions. In deterministic direct search applied to smooth objectives, these directions must somehow conform to the geometry of the feasible region, and typically consist of positive generators of approximate tangent cones (which then renders the corresponding methods globally convergent in the linearly constrained case). One knows however from the unconstrained case that randomly generating the polling directions leads to better complexity bounds as well as to gains in numerical efficiency, and it becomes then natural to consider random generation also in the presence of constraints. 
In this paper, we study a class of direct-search methods based on sufficient decrease for solving smooth linearly constrained problems where the polling directions are randomly generated (in approximate tangent cones). The random polling directions must satisfy probabilistic feasible descent, a concept which reduces to probabilistic descent in the absence of constraints. Such a property is instrumental in establishing almost-sure global convergence and worst-case complexity bounds with overwhelming probability. Numerical results show that the randomization of the polling directions can be beneficial over standard approaches with deterministic guarantees, as it is suggested by the respective worst-case complexity bounds. BibTeX: @article{Gratton2019, author = {S. Gratton and C. W. Royer and L. N. Vicente and Z. Zhang}, title = {Direct search based on probabilistic feasible descent for bound and linearly constrained problems}, journal = {Computational Optimization and Applications}, publisher = {Springer Science and Business Media LLC}, year = {2019}, volume = {72}, number = {3}, pages = {525--559}, doi = {10.1007/s10589-019-00062-4} } Grützmacher T, Cojean T, Flegar G, Göbel F and Anzt H (2019), "A customized precision format based on mantissa segmentation for accelerating sparse linear algebra", Concurrency and Computation: Practice and Experience. [Abstract] [BibTeX] [DOI] Abstract: In this work, we pursue the idea of radically decoupling the floating point format used for arithmetic operations from the format used to store the data in memory. We complement this idea with a customized precision memory format derived by splitting the mantissa (significand) of standard IEEE formats into segments, such that values can be accessed faster if lower accuracy is acceptable. 
Combined with precision-aware algorithms that dynamically adapt the data access accuracy to the numerical requirements, the customized precision memory format can render attractive runtime savings without impacting the memory footprint of the data or the accuracy of the final result. In an experimental analysis using the adaptive precision Jacobi method on diagonalizable test problems, we assess the benefits of the mantissa-segmenting customized precision format on recent multi- and manycore architectures. BibTeX: @article{Grutzmacher2019, author = {Grützmacher, Thomas and Cojean, Terry and Flegar, Goran and Göbel, Fritz and Anzt, Hartwig}, title = {A customized precision format based on mantissa segmentation for accelerating sparse linear algebra}, journal = {Concurrency and Computation: Practice and Experience}, year = {2019}, doi = {10.1002/cpe.5418} } Gu R, Beata P and Becchi M (2019), "Characterizing the Performance/Accuracy Tradeoff of High-Precision Applications via Auto-tuning", In Proceedings of the 2019 IEEE International Symposium on Workload Characterization, November 2019. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Many scientific applications (e.g., molecular dynamics, climate modeling and astrophysical simulations) rely on floating-point arithmetic. Floating-point representation is by definition a finite approximation of real numbers, and thus it can lead to inaccuracy and reproducibility issues. To overcome these issues, existing work has proposed high-precision floating-point libraries to be used in scientific simulations, but they come at the cost of significant additional execution time. In this work we analyze performance and accuracy effects from tuning down groups of variables and operations guided by compile-time considerations. The goal of our tuning approach is to convert existing floating-point programs to mixed precision while balancing accuracy and performance. 
To this end, the tuner starts by maximizing accuracy through the use of a high-precision library and then achieves performance gains under a given error bound by incrementally tuning down groups of variables and operations from higher to lower precision (e.g., double precision). The approach provides input-data independence in its results by defining tuning strategies based on loop structures and the investigation of floating-point computation patterns. In addition, it has a smaller search space than exhaustive or bitonic search algorithms, leading to a significant reduction in tuning time, especially on larger, long-running applications. We tested our tuning on a computational fluid dynamics (CFD) application. BibTeX: @inproceedings{Gu2019, author = {Ruidong Gu and Paul Beata and Michela Becchi}, title = {Characterizing the Performance/Accuracy Tradeoff of High-Precision Applications via Auto-tuning}, booktitle = {Proceedings of the 2019 IEEE International Symposium on Workload Characterization}, publisher = {IEEE}, year = {2019}, doi = {10.1109/iiswc47752.2019.9042137} } Han J, Xiong K, Nie F and Li X (2019), "Structured Graph Reconstruction for Scalable Clustering", IEEE Transactions on Knowledge and Data Engineering. [Abstract] [BibTeX] [DOI] Abstract: Spectral clustering is a quite simple but effective method for solving the graph clustering problem. It first embeds the original data into a lower dimensional space with spectral analysis, and then relies on an algorithm to obtain the final cluster labels. Since it involves eigen-decomposition of the graph Laplacian matrix for spectral embedding, spectral clustering suffers from high computational cost as data grow in scale. It is also limited by the performance of post-processing algorithm such as kmeans. To address these two issues, in this paper, we propose a novel approach denoted by Orthogonal and Nonnegative Graph Reconstruction (ONGR) for large scale clustering. 
The two constraints are served as the structure constraint under which the graph reconstructed by the indicator matrix is structured. The proposed method mainly needs to perform economical singular value decomposition for small size matrix thus it scales linearly with the data size. Moreover, the interpretability of the indicator matrix is offered due to the nonnegative constraint. Therefore, the final cluster labels can be directly obtained without post-processing. Extensive experiments show the effectiveness of the proposed method. BibTeX: @article{Han2019, author = {Han, J. and Xiong, K. and Nie, F. and Li, X.}, title = {Structured Graph Reconstruction for Scalable Clustering}, journal = {IEEE Transactions on Knowledge and Data Engineering}, year = {2019}, doi = {10.1109/TKDE.2019.2948850} } Han R, You Y and Demmel J (2019), "Auto-Precision Scaling for Distributed Deep Learning", November, 2019. [Abstract] [BibTeX] Abstract: In recent years, large-batch optimization is becoming the key of distributed deep learning. However, large-batch optimization is hard. Straightforwardly porting the code often leads to a significant loss in testing accuracy. As some researchers suggested that large batch optimization leads to a low generalization performance, and they further conjectured that large-batch training needs a higher floating-point precision to achieve a higher generalization performance. To solve this problem, we conduct an open study in this paper. Our target is to find the number of bits that large-batch training needs. To do so, we need a system for customized precision study. However, state-of-the-art systems have some limitations that lower the efficiency of developers and researchers. To solve this problem, we design and implement our own system CPD: A High Performance System for Customized-Precision Distributed DL. In our experiments, our application often loses accuracy if we use a very-low precision (e.g. 8 bits or 4 bits). 
To solve this problem, we proposed the APS (Auto-Precision-Scaling) algorithm, which is a layer-wise adaptive scheme for gradients shifting. With APS, we are able to make the large-batch training converge with only 4 bits. BibTeX: @article{Han2019a, author = {Han, Ruobing and You, Yang and Demmel, James}, title = {Auto-Precision Scaling for Distributed Deep Learning}, year = {2019} } Hannah RR (2019), "Fundamental Results on Asynchronous Parallel Optimization Algorithms". Thesis at: UCLA. [Abstract] [BibTeX] Abstract: In this thesis, we present a body of work on the performance and convergence properties of asynchronous-parallel algorithms completed over the course of my doctorate degree (Hannah, Feng, and Wotao Yin 2018; Hannah and Wotao Yin 2017b; T. Sun, Hannah, and Wotao Yin 2017; Hannah and Wotao Yin 2017a). Asynchronous algorithms eliminate the costly synchronization penalty of traditional synchronous-parallel algorithms. They do this by having computing nodes utilize the most recently available information to compute updates. However, it's not immediately clear whether the trade-off of eliminating synchronization penalty at the cost of using outdated information is favorable. We first give a comprehensive theoretical justification of the performance advantages of asynchronous algorithms, which we summarize as "Faster Iterations, Same Quality" (Hannah and Wotao Yin 2017a). Under a well-justified model, we show that asynchronous algorithms complete "Faster Iterations". Using renewal theory, we demonstrate how network delays, heterogeneous sub-problem difficulty and computing power greatly hinder synchronous algorithms, but have no impact on their asynchronous counterparts. We next prove the first exact convergence rate results for a variety of synchronous algorithms including synchronous ARock and synchronous randomized block coordinate descent (sync-RBCD). 
This allows us to make a fair comparison between these algorithms and their asynchronous counterparts. Finally we show that a variety of asynchronous algorithms have a convergence rate that essentially matches the previously derived exact rates for synchronous counterparts so long as the delays are not too large. Hence asynchronous algorithms complete faster iterations that are of the "Same Quality" as synchronous algorithms. Therefore we conclude that a wide variety of asynchronous algorithms will always outcompete their synchronous counterparts if the delays are not too large, and especially at scale. Next we present the first asynchronous Nesterov-accelerated algorithm that attains a speedup: A2BCD (Hannah, Feng, and Wotao Yin 2018). We first prove that A2BCD attains NU_ACDM's complexity to highest order. NU_ACDM is a state-of-the-art accelerated coordinate descent algorithm (Allen-Zhu, Qu, et al. 2016). Then we show that A2BCD and NU_ACDM both have optimal complexity. Hence because A2BCD has faster iterations, and optimal complexity, it should be the fastest coordinate descent algorithm. We verify this with numerical experiments comparing A2BCD with NU_ACDM. We find that A2BCD is up to 4-5× faster than NU_ACDM, and hence conclude that our algorithm is the current fastest coordinate descent algorithm that exists. Finally we derive a second-order ODE, which is the continuous-time limit of A2BCD. The ODE analysis motivates and clarifies our proof strategy. Lastly, we present earlier foundational work that comprises the basis of the technical innovations that made the previous results possible (Hannah and Wotao Yin 2017b). We show that ARock and its many special cases may converge even under unbounded delays (both stochastic and deterministic). These results sidestep longstanding impossibility results derived in the 1980s by making slightly stronger assumptions. 
They were also an early demonstration of the power of meticulous Lyapunov-function construction techniques pioneered in this body of work. BibTeX: @phdthesis{Hannah2019, author = {Hannah, Robert Rafaeil}, title = {Fundamental Results on Asynchronous Parallel Optimization Algorithms}, school = {UCLA}, year = {2019} } Harvey D and van der Hoeven J (2019), "Integer multiplication in time O(n log n)". Thesis at: HAL archives. [Abstract] [BibTeX] [URL] Abstract: We present an algorithm that computes the product of two n-bit integers in O(n log n) bit operations. BibTeX: @techreport{Harvey2019, author = {Harvey, David and van der Hoeven, Joris}, title = {Integer multiplication in time O(n log n)}, school = {HAL archives}, year = {2019}, url = {https://hal.archives-ouvertes.fr/hal-02070778} } Abdullahi Hassan A, Cardellini V, D'Ambra P, di Serafino D and Filippone S (2019), "Efficient Algebraic Multigrid Preconditioners on Clusters of GPUs", Parallel Processing Letters. Vol. 29(01), pp. 1950001. [Abstract] [BibTeX] [DOI] Abstract: Many scientific applications require the solution of large and sparse linear systems of equations using Krylov subspace methods; in this case, the choice of an effective preconditioner may be crucial for the convergence of the Krylov solver. Algebraic MultiGrid (AMG) methods are widely used as preconditioners, because of their optimal computational cost and their algorithmic scalability. The wide availability of GPUs, now found in many of the fastest supercomputers, poses the problem of implementing efficiently these methods on high-throughput processors. In this work we focus on the application phase of AMG preconditioners, and in particular on the choice and implementation of smoothers and coarsest-level solvers capable of exploiting the computational power of clusters of GPUs. We consider block-Jacobi smoothers using sparse approximate inverses in the solve phase associated with the local blocks. 
The choice of approximate inverses instead of sparse matrix factorizations is driven by the large amount of parallelism exposed by the matrix-vector product as compared to the solution of large triangular systems on GPUs. The selected smoothers and solvers are implemented within the AMG preconditioning framework provided by the MLD2P4 library, using suitable sparse matrix data structures from the PSBLAS library. Their behaviour is illustrated in terms of execution speed and scalability, on a test case concerning groundwater modelling, provided by the Jülich Supercomputing Center within the Horizon 2020 Project EoCoE. BibTeX: @article{Hassan2019, author = {Abdullahi Hassan, Ambra and Cardellini, Valeria and D'Ambra, Pasqua and di Serafino, Daniela and Filippone, Salvatore}, title = {Efficient Algebraic Multigrid Preconditioners on Clusters of GPUs}, journal = {Parallel Processing Letters}, year = {2019}, volume = {29}, number = {01}, pages = {1950001}, doi = {10.1142/S0129626419500014} } Hegde K, Asghari-Moghaddam H, Pellauer M, Crago N, Jaleel A, Solomonik E, Emer J and Fletcher CW (2019), "ExTensor: An Accelerator for Sparse Tensor Algebra", In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, USA , pp. 319-333. ACM. [BibTeX] [DOI] BibTeX: @inproceedings{Hegde2019, author = {Hegde, Kartik and Asghari-Moghaddam, Hadi and Pellauer, Michael and Crago, Neal and Jaleel, Aamer and Solomonik, Edgar and Emer, Joel and Fletcher, Christopher W.}, title = {ExTensor: An Accelerator for Sparse Tensor Algebra}, booktitle = {Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, publisher = {ACM}, year = {2019}, pages = {319--333}, doi = {10.1145/3352460.3358275} } Hein E, Eswar S, Yaşar A, Li J, Young JS, Conte TM, Çatalyürek ÜV, Vuduc R, Riedy J and Uçar B (2019), "Programming Strategies for Irregular Algorithms on the Emu Chick", ACM Transactions on Parallel Computing. 
[Abstract] [BibTeX] Abstract: The Emu Chick prototype implements migratory memory-side processing in a novel hardware system. Rather than transferring large amounts of data across the system interconnect, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each remote memory read. Previous work has characterized the performance of the Chick prototype in terms of memory bandwidth and programming differences from more typical, non-migratory platforms, but there has not yet been an analysis of algorithms on this system. This work evaluates irregular algorithms that could benefit from the lightweight, memory-side processing of the Chick and demonstrates techniques and optimization strategies for achieving performance in sparse matrix-vector multiply operation (SpMV), breadth-first search (BFS), and graph alignment across up to eight distributed nodes encompassing 64 nodelets in the Chick system. We also define and justify relative metrics to compare prototype FPGA-based hardware with established ASIC architectures. The Chick currently supports up to 68x scaling for graph alignment, 80 MTEPS for BFS on balanced graphs, and 50% of measured STREAM bandwidth for SpMV. BibTeX: @article{Hein2019, author = {Hein, Eric and Eswar, Srinivas and Yaşar, Abdurrahman and Li, Jiajia and Young, Jeffrey S. and Conte, Thomas M. and Çatalyürek, Ümit V. and Vuduc, Rich and Riedy, Jason and Uçar, Bora}, title = {Programming Strategies for Irregular Algorithms on the Emu Chick}, journal = {ACM Transactions on Parallel Computing}, year = {2019} } Helal AE, Aji AM, Chu ML, Beckmann BM and Feng W-c (2019), "Adaptive Task Aggregation for High-Performance Sparse Solvers on GPUs", In The 28th International Conference on Parallel Architectures and Compilation. [Abstract] [BibTeX] Abstract: Sparse solvers are heavily used in computational fluid dynamics (CFD), computer-aided design (CAD), and other important application domains. 
These solvers remain challenging to execute on massively parallel architectures, due to the sequential dependencies between the fine-grained application tasks. In particular, parallel sparse solvers typically suffer from substantial scheduling and dependency-management overheads relative to the compute operations. We propose adaptive task aggregation (ATA) to efficiently execute such irregular computations on GPU architectures via hierarchical dependency management and low-latency task scheduling. On a gamut of representative problems with different data-dependency structures, ATA significantly outperforms existing GPU task-execution approaches, achieving a geometric mean speedup of 2.2× to 3.7× across different sparse kernels (with speedups of up to two orders of magnitude). BibTeX: @inproceedings{Helal2019, author = {Helal, Ahmed E. and Aji, Ashwin M. and Chu, Michael L. and Beckmann, Bradford M. and Feng, Wu-chun}, title = {Adaptive Task Aggregation for High-Performance Sparse Solvers on GPUs}, booktitle = {The 28th International Conference on Parallel Architectures and Compilation}, year = {2019} } Hernandez TM, Beeumen RV, Caprio MA and Yang C (2019), "A greedy algorithm for computing eigenvalues of a symmetric matrix", November, 2019. [Abstract] [BibTeX] Abstract: We present a greedy algorithm for computing selected eigenpairs of a large sparse matrix H that can exploit localization features of the eigenvector. When the eigenvector to be computed is localized, meaning only a small number of its components have large magnitudes, the proposed algorithm identifies the location of these components in a greedy manner, and obtains approximations to the desired eigenpairs of H by computing eigenpairs of a submatrix extracted from the corresponding rows and columns of H. 
Even when the eigenvector is not completely localized, the approximate eigenvectors obtained by the greedy algorithm can be used as good starting guesses to accelerate the convergence of an iterative eigensolver applied to H. We discuss a few possibilities for selecting important rows and columns of H and techniques for constructing good initial guesses for an iterative eigensolver using the approximate eigenvectors returned from the greedy algorithm. We demonstrate the effectiveness of this approach with examples from nuclear quantum many-body calculations and many-body localization studies of quantum spin chains. BibTeX: @article{Hernandez2019, author = {Hernandez, Taylor M. and Beeumen, Roel Van and Caprio, Mark A. and Yang, Chao}, title = {A greedy algorithm for computing eigenvalues of a symmetric matrix}, year = {2019} } Herrmann J, Özkaya M, Uçar B, Kaya K and Çatalyürek Ü (2019), "Multilevel Algorithms for Acyclic Partitioning of Directed Acyclic Graphs", SIAM Journal on Scientific Computing. Vol. 41(4), pp. A2117-A2145. [Abstract] [BibTeX] [DOI] Abstract: We investigate the problem of partitioning the vertices of a directed acyclic graph into a given number of parts. The objective function is to minimize the number or the total weight of the edges having end points in different parts, which is also known as the edge cut. The standard load balancing constraint of having an equitable partition of the vertices among the parts should be met. Furthermore, the partition is required to be acyclic; i.e., the interpart edges between the vertices from different parts should preserve an acyclic dependency structure among the parts. In this work, we adopt the multilevel approach with coarsening, initial partitioning, and refinement phases for acyclic partitioning of directed acyclic graphs. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. 
To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. The quality of the computed acyclic partitions is assessed by computing the edge cut. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset consisting of (i) graphs coming from an application and (ii) some others corresponding to matrices from a public collection. We report significant improvements compared to the current state of the art. BibTeX: @article{Herrmann2019, author = {Herrmann, J. and Özkaya, M. and Uçar, B. and Kaya, K. and Çatalyürek, Ü.}, title = {Multilevel Algorithms for Acyclic Partitioning of Directed Acyclic Graphs}, journal = {SIAM Journal on Scientific Computing}, year = {2019}, volume = {41}, number = {4}, pages = {A2117--A2145}, doi = {10.1137/18M1176865} } Higham N and Mary T (2019), "Solving Block Low-Rank Linear Systems by LU Factorization is Numerically Stable" [Abstract] [BibTeX] [URL] Abstract: Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. 
For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, while achieving almost the same performance. BibTeX: @article{Higham2019, author = {Higham, Nicholas and Mary, Theo}, title = {Solving Block Low-Rank Linear Systems by LU Factorization is Numerically Stable}, year = {2019}, url = {http://eprints.maths.manchester.ac.uk/2733/1/paper.pdf} } Higham CF and Higham DJ (2019), "Deep Learning: An Introduction for Applied Mathematicians", SIAM Review., January, 2019. Vol. 61(3), pp. 860-891. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: Multilayered artificial neural networks are becoming a pervasive tool in a host of application fields. At the heart of this deep learning revolution are familiar concepts from applied and computational mathematics, notably from calculus, approximation theory, optimization, and linear algebra. This article provides a very brief introduction to the basic ideas that underlie deep learning from an applied mathematics perspective. Our target audience includes postgraduate and final year undergraduate students in mathematics who are keen to learn about the area. The article may also be useful for instructors in mathematics who wish to enliven their classes with references to the application of deep learning techniques. We focus on three fundamental questions: What is a deep neural network? How is a network trained? What is the stochastic gradient method? We illustrate the ideas with a short MATLAB code that sets up and trains a network. We also demonstrate the use of state-of-the-art software on a large scale image classification problem. We finish with references to the current literature. BibTeX: @article{Higham2019a, author = {Higham, Catherine F. 
and Higham, Desmond J.}, title = {Deep Learning: An Introduction for Applied Mathematicians}, journal = {SIAM Review}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2019}, volume = {61}, number = {3}, pages = {860--891}, doi = {10.1137/18m1165748} } Higham N and Pranesh S (2019), "Exploiting Lower Precision Arithmetic in Solving Symmetric Positive Definite Linear Systems and Least Squares Problems" [Abstract] [BibTeX] [URL] Abstract: What is the fastest way to solve a linear system Ax= b in arithmetic of a given precision when A is symmetric positive definite and otherwise unstructured? The usual answer is by Cholesky factorization, assuming that A can be factorized. We develop an algorithm that can be faster, given an arithmetic of precision lower than the working precision as well as (optionally) one of higher precision. The arithmetics might, for example, be of precisions half, single, and double; half and double, possibly with quadruple; or single and double, possibly with quadruple. We compute a Cholesky factorization at the lower precision and use the factors as preconditioners in GMRES-based iterative refinement. To avoid breakdown of the factorization we shift the matrix by a small multiple of its diagonal. We explain why this is preferable to the common approach of shifting by a multiple of the identity matrix. We also incorporate scaling in order to avoid overflow and reduce the chance of underflow when working in IEEE half precision arithmetic. We extend the algorithm to solve a linear least squares problem with a well conditioned coefficient matrix by forming and solving the normal equations. In both algorithms most of the work is done at low precision provided that iterative refinement and the inner iterative solver converge quickly. We explain why replacing GMRES by the conjugate gradient method causes convergence guarantees to be lost, but show that this change has little effect on convergence in practice. 
Our numerical experiments confirm the potential of the new algorithms to provide faster solutions in environments that support multiple precisions of arithmetic. BibTeX: @article{Higham2019b, author = {Higham, Nicholas and Pranesh, Srikara}, title = {Exploiting Lower Precision Arithmetic in Solving Symmetric Positive Definite Linear Systems and Least Squares Problems}, year = {2019}, url = {http://eprints.maths.manchester.ac.uk/2736/1/paper.pdf} } Hoemmen M, Badwaik J, Brucher M, Iliopoulos A (N) and Michopoulos J (2019), "Historical lessons for C++ linear algebra library standardization". Thesis at: ISO C++ standards meeting (Kona). (P1417R0) [BibTeX] [URL] BibTeX: @techreport{Hoemmen2019, author = {Hoemmen, Mark and Badwaik, Jayesh and Brucher, Matthieu and Iliopoulos, Athanasios (Nasos) and Michopoulos, John}, title = {Historical lessons for C++ linear algebra library standardization}, school = {ISO C++ standards meeting (Kona)}, year = {2019}, number = {P1417R0}, url = {http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1417r0.pdf} } Hong C, Sukumaran-Rajam A, Nisa I, Singh K and Sadayappan P (2019), "Adaptive Sparse Tiling for Sparse Matrix Multiplication", In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. New York, NY, USA , pp. 300-314. ACM. [Abstract] [BibTeX] [DOI] Abstract: Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern of sparse matrix multiplication makes it challenging to use tiling to enhance data reuse. In this paper, we devise an adaptive tiling strategy and apply it to enhance the performance of two primitives: SpMM (product of sparse matrix and dense matrix) and SDDMM (sampled dense-dense matrix multiplication). 
In contrast to studies that have resorted to non-standard sparse-matrix representations to enhance performance, we use the standard Compressed Sparse Row (CSR) representation, within which intra-row reordering is performed to enable adaptive tiling. Experimental evaluation using an extensive set of matrices from the Sparse Suite collection demonstrates significant performance improvement over currently available state-of-the-art alternatives. BibTeX: @inproceedings{Hong2019, author = {Hong, Changwan and Sukumaran-Rajam, Aravind and Nisa, Israt and Singh, Kunal and Sadayappan, P.}, title = {Adaptive Sparse Tiling for Sparse Matrix Multiplication}, booktitle = {Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2019}, pages = {300--314}, doi = {10.1145/3293883.3295712} } Hossain S and Mahmud MS (2019), "On Computing with Diagonally Structured Matrices", In Proceedings of the IEEE High Performance Extreme Computing Conference., September, 2019. , pp. 1-6. [Abstract] [BibTeX] [DOI] Abstract: We present a storage scheme for storing matrices by diagonals and algorithms for performing matrix-matrix and matrix-vector multiplication by diagonals. Matrix elements are accessed with stride-1 and involve no indirect referencing. Access to the transposed matrix requires no additional effort. The proposed storage scheme handles dense matrices and matrices with special structure e.g., banded, triangular, symmetric in a uniform manner. Test results from preliminary numerical experiments with an OpenMP implementation of our method are encouraging. BibTeX: @inproceedings{Hossain2019, author = {Hossain, S. and Mahmud, M. 
S.}, title = {On Computing with Diagonally Structured Matrices}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2019}, pages = {1--6}, doi = {10.1109/HPEC.2019.8916325} } Hu Y, Li T-M, Anderson L, Ragan-Kelley J and Durand F (2019), "Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures", ACM Transactions on Graphics. Vol. 38(6) [Abstract] [BibTeX] Abstract: 3D visual computing data are often spatially sparse. To exploit such sparsity, people have developed hierarchical sparse data structures, such as multilevel sparse voxel grids, particles, and 3D hash tables. However, developing and using these high-performance sparse data structures is challenging, due to their intrinsic complexity and overhead. We propose Taichi, a new data-oriented programming language for efficiently authoring, accessing, and maintaining such data structures. The language offers a high-level, data structure-agnostic interface for writing computation code. The user independently specifies the data structure. We provide several elementary components with different sparsity properties that can be arbitrarily composed to create a wide range of multi-level sparse data structures. This decoupling of data structures from computation makes it easy to experiment with different data structures without changing computation code, and allows users to write computation as if they are working with a dense array. Our compiler then uses the semantics of the data structure and index analysis to automatically optimize for locality, remove redundant operations for coherent accesses, maintain sparsity and memory allocations, and generate efficient parallel and vectorized instructions for CPUs and GPUs. Our approach yields competitive performance on common computational kernels such as stencil applications, neighbor lookups, and particle scattering. 
We demonstrate our language by implementing simulation, rendering, and vision tasks including a material point method simulation, finite element analysis, a multigrid Poisson solver for pressure projection, volumetric path tracing, and 3D convolution on sparse grids. Our computation-data structure decoupling allows us to quickly experiment with different data arrangements, and to develop high-performance data structures tailored for specific computational tasks. With 1/10th as many lines of code, we achieve 4.55× higher performance on average, compared to hand-optimized reference implementations. BibTeX: @article{Hu2019, author = {Hu, Yuanming and Li, Tzu-Mao and Anderson, Luke and Ragan-Kelley, Jonathan and Durand, Frédo}, title = {Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures}, journal = {ACM Transactions on Graphics}, year = {2019}, volume = {38}, number = {6} } Huang R (2019), "Novel Computational Methods for Eigenvalue Problems". Thesis at: Michigan Technological University. [Abstract] [BibTeX] [URL] Abstract: This dissertation focuses on novel computational methods for eigenvalue problems. In Chapter 1, preliminaries of functional analysis related to eigenvalue problems are presented. Some classical methods for matrix eigenvalue problems are discussed. Several PDE eigenvalue problems are covered. The chapter is concluded with a summary of the contributions. In Chapter 2, a novel recursive contour integral method (RIM) for matrix eigenvalue problem is proposed. This method can effectively find all eigenvalues in a region on the complex plane with no a priori spectrum information. Regions that contain eigenvalues are subdivided and tested recursively until the size of region reaches specified precision. The method is robust, which is demonstrated using various examples. In Chapter 3, we propose an improved version of RIM for non-Hermitian eigenvalue problems, called RIM-M. 
By incorporating Cayley transformation and Arnoldi's method, the main computation cost of solving linear systems is reduced significantly. The numerical experiments demonstrate that RIM-M gains significant speed-up over RIM. In Chapter 4, we propose a multilevel spectral indicator method (SIM-M) to address the memory requirement for large sparse matrices. We modify the indicator of RIM-M such that it requires much less memory. Matrices from University of Florida Sparse Matrix Collection are tested, suggesting that a parallel version of SIM-M has the potential to be efficient. In Chapter 5, we develop a novel method to solve the elliptic PDE eigenvalue problem. We construct a multi-wavelet basis with Riesz stability in $H_0^1(\Omega)$. By incorporating multi-grid discretization scheme and sparse grids, the method retains the optimal convergence rate for the smallest eigenvalue with much less computational cost. BibTeX: @phdthesis{Huang2019, author = {Ruihao Huang}, title = {Novel Computational Methods for Eigenvalue Problems}, school = {Michigan Technological University}, year = {2019}, url = {https://digitalcommons.mtu.edu/cgi/viewcontent.cgi?article=2090&context=etdr} } Huckle TK (2019), "Accelerated Jacobi iterations for bidiagonal and sparse triangular matrices" [Abstract] [BibTeX] [URL] Abstract: In many applications a sparse linear system of equations Ax = b has to be solved. For applying iterative solvers like preconditioned conjugate gradient (pcg) or GMRES, effective preconditioners are necessary, e.g. Jacobi, Gauss-Seidel, or incomplete LU factorization (ILU). Often, effective preconditioners are given via sparse triangular matrices L, that have to be solved in every iteration step. Recent work by Edmond Chow introduced an easy to parallelize fixed-point iteration for computing approximations to (I)LU factorizations. Therefore, the aching handicap in parallel solution methods for sparse matrices is the solving of sparse triangular systems, e.g. 
bidiagonal matrices. In a parallel environment direct solvers can take only restricted advantage of parallelism. Therefore, in this paper we develop a fast iterative solution method for sparse triangular matrices. In contrast to direct solvers for triangular matrices L like graph-based methods, sparse factorization methods, or Sherman-Morrison-Woodbury, here we want to consider stationary Jacobi iterations. In its original form the Jacobi iteration for ill-conditioned matrices can lead to very slow convergence. Therefore, we introduce different acceleration tools like preconditioning (block Jacobi and Incomplete Sparse Approximate Inverse ISAI), and a recursive acceleration of the Jacobi method. Here the Neumann series is replaced by the Euler expansion (see [4, 19, 8]). This is derived by a recursive computation of the Neumann series using powers of the initial Jacobi iteration matrix. The goal is to shift the major part of the operations from cheap but numerous iteration steps to better parallelizable cheap and sparse matrix-matrix products reducing the number of necessary iterations considerably, e.g. to less than $\log_2(n)$ for an n × n matrix. BibTeX: @online{Huckle2019, author = {Huckle, Thomas K.}, title = {Accelerated Jacobi iterations for bidiagonal and sparse triangular matrices}, year = {2019}, url = {https://www5.in.tum.de/persons/huckle/it_triang.pdf} } Idreos S, Dayan N, Qin W, Akmanalp M, Hilgard S, Ross A, Lennon J, Jain V, Gupta H, Li D and Zhu Z (2019), "Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn", In Biennial Conference on Innovative Data Systems Research. [Abstract] [BibTeX] Abstract: We introduce the concept of design continuums for the data layout of key-value stores. A design continuum unifies major distinct data structure designs under the same model. 
The critical insight and potential long-term impact is that such unifying models 1) render what we consider up to now as fundamentally different data structures to be seen as "views" of the very same overall design space, and 2) allow "seeing" new data structure designs with performance properties that are not feasible by existing designs. The core intuition behind the construction of design continuums is that all data structures arise from the very same set of fundamental design principles, i.e., a small set of data layout design concepts out of which we can synthesize any design that exists in the literature as well as new ones. We show how to construct, evaluate, and expand, design continuums and we also present the first continuum that unifies major data structure designs, i.e., B+Tree, BeTree, LSM-tree, and LSH-Table. The practical benefit of a design continuum is that it creates a fast inference engine for the design of data structures. For example, we can near instantly predict how a specific design change in the underlying storage of a data system would affect performance, or reversely what would be the optimal data structure (from a given set of designs) given workload characteristics and a memory budget. In turn, these properties allow us to envision a new class of self-designing key-value stores with a substantially improved ability to adapt to workload and hardware changes by transitioning between drastically different data structure designs to assume a diverse set of performance properties at will. 
BibTeX: @inproceedings{Idreos2019, author = {Idreos, Stratos and Dayan, Niv and Qin, Wilson and Akmanalp, Mali and Hilgard, Sophie and Ross, Andrew and Lennon, James and Jain, Varun and Gupta, Harshita and Li, David and Zhu, Zichen}, title = {Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn}, booktitle = {Biennial Conference on Innovative Data Systems Research}, year = {2019} } Ioannidis EI, Cheimarios N, Spyropoulos AN and Boudouvis AG (2019), "On the performance of various parallel GMRES implementations on CPU and GPU clusters" [BibTeX] BibTeX: @article{Ioannidis2019, author = {Ioannidis, E. I. and Cheimarios, N. and Spyropoulos, A. N. and Boudouvis, A. G.}, title = {On the performance of various parallel GMRES implementations on CPU and GPU clusters}, year = {2019} } Iwashita T, Li S and Fukaya T (2019), "Hierarchical Block Multi-Color Ordering: A New Parallel Ordering Method for Vectorization and Parallelization of the Sparse Triangular Solver in the ICCG Method" [Abstract] [BibTeX] Abstract: In this paper, we propose a new parallel ordering method to vectorize and parallelize the sparse triangular solver, which is called hierarchical block multi-color ordering. In this method, the parallel forward and backward substitutions can be vectorized while preserving the advantages of block multi-color ordering, that is, fast convergence and fewer thread synchronizations. To evaluate the proposed method in a parallel ICCG (Incomplete Cholesky Conjugate Gradient) solver, numerical tests were conducted using five test matrices on three types of computational nodes. The numerical results indicate that the proposed method outperforms the conventional block and nodal multi-color ordering methods in 13 out of 15 test cases, which confirms the effectiveness of the method. 
BibTeX: @article{Iwashita2019, author = {Iwashita, Takeshi and Li, Senxi and Fukaya, Takeshi}, title = {Hierarchical Block Multi-Color Ordering: A New Parallel Ordering Method for Vectorization and Parallelization of the Sparse Triangular Solver in the ICCG Method}, year = {2019} } Jagode H, Danalis A, Anzt H and Dongarra J (2019), "PAPI software-defined events for in-depth performance analysis", The International Journal of High Performance Computing Applications. [Abstract] [BibTeX] [DOI] Abstract: The methodology and standardization layer provided by the Performance Application Programming Interface (PAPI) has played a vital role in application profiling for almost two decades. It has enabled sophisticated performance analysis tool designers and performance-conscious scientists to gain insights into their applications by simply instrumenting their code using a handful of PAPI functions that “just work” across different hardware components. In the past, PAPI development had focused primarily on hardware-specific performance metrics. However, the rapidly increasing complexity of software infrastructure poses new measurement and analysis challenges for the developers of large-scale applications. In particular, acquiring information regarding the behavior of libraries and runtimes—used by scientific applications—requires low-level binary instrumentation, or APIs specific to each library and runtime. No uniform API for monitoring events that originate from inside the software stack has emerged. In this article, we present our efforts to extend PAPI's role so that it becomes the de facto standard for exposing performance-critical events, which we refer to as software-defined events (SDEs), from different software layers. Upgrading PAPI with SDEs enables monitoring of both types of performance events—hardware- and software-related events—in a uniform way, through the same consistent PAPI. The goal of this article is threefold. 
First, we motivate the need for SDEs and describe our design decisions regarding the functionality we offer through PAPI's new SDE interface. Second, we illustrate how SDEs can be utilized by different software packages, specifically, by showcasing their use in the numerical linear algebra library MAGMA-Sparse, the tensor algebra library TAMM that is part of the NWChem suite, and the compiler-based performance analysis tool Byfl. Third, we provide a performance analysis of the overhead that results from monitoring SDEs and discuss the trade-offs between overhead and functionality. BibTeX: @article{Jagode2019, author = {Jagode, Heike and Danalis, Anthony and Anzt, Hartwig and Dongarra, Jack}, title = {PAPI software-defined events for in-depth performance analysis}, journal = {The International Journal of High Performance Computing Applications}, year = {2019}, doi = {10.1177/1094342019846287} } Jakovetic D, Bajovic D, Xavier J and Moura JMF (2019), "Primal-dual optimization methods for large-scale and distributed data analytics", December, 2019. [Abstract] [BibTeX] Abstract: The augmented Lagrangian method (ALM) is a classical optimization tool that solves a given "difficult" (constrained) problem via finding solutions of a sequence of "easier" (often unconstrained) sub-problems with respect to the original (primal) variable, wherein constraints satisfaction is controlled via the so-called dual variables. ALM is highly flexible with respect to how primal sub-problems can be solved, giving rise to a plethora of different primal-dual methods. The powerful ALM mechanism has recently proved to be very successful in various large scale and distributed applications. In addition, several significant advances have appeared, primarily on precise complexity results with respect to computational and communication costs in the presence of inexact updates and design and analysis of novel optimal methods for distributed consensus optimization. 
We provide a tutorial-style introduction to ALM and its analysis via control-theoretic tools, survey recent results, and provide novel insights in the context of two emerging applications: federated learning and distributed energy trading. BibTeX: @article{Jakovetic2019, author = {Dusan Jakovetic and Dragana Bajovic and Joao Xavier and Jose M. F. Moura}, title = {Primal-dual optimization methods for large-scale and distributed data analytics}, year = {2019} } Jia Z, Maggioni M, Smith J and Scarpazza DP (2019), "Dissecting the NVidia Turing T4 GPU via Microbenchmarking", March, 2019. [Abstract] [BibTeX] Abstract: In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August 2018 of Turing, NVidia's latest architecture, pressed us to update our study. In this report, we examine Turing and compare it quantitatively against previous NVidia GPU generations. Specifically, we study the T4 GPU: a low-power board aiming at inference applications. We describe its improvements against its inference-oriented predecessor: the P4 GPU based on the Pascal architecture. Both T4 and P4 GPUs achieve significantly higher frequency-per-Watt figures than their full-size counterparts. We study the performance of the T4's TensorCores, finding a much higher throughput on low-precision operands than on the P4 GPU. We reveal that Turing introduces new instructions that express matrix math more succinctly. We map Turing's instruction space, finding the same encoding as Volta, and additional instructions. 
We reveal that the Turing TU104 chip has the same memory hierarchy depth as the Volta GV100; cache levels sizes on the TU104 are frequently twice as large as those found on the Pascal GP104. We benchmark each constituent of the T4 memory hierarchy and find substantial overall performance improvements over its P4 predecessor. We studied how clock throttling affects compute-intensive workloads that hit power or thermal limits. Many of our findings are novel, published here for the first time. All of them can guide high-performance software developers get closer to the GPU's peak performance. BibTeX: @article{Jia2019, author = {Jia, Zhe and Maggioni, Marco and Smith, Jeffrey and Scarpazza, Daniele Paolo}, title = {Dissecting the NVidia Turing T4 GPU via Microbenchmarking}, year = {2019} } Jiang Y, Kouzoupis D, Yin H, Diehl M and Houska B (2019), "Decentralized Optimization over Tree Graphs", October, 2019. [Abstract] [BibTeX] Abstract: This paper presents a decentralized algorithm for non-convex optimization over tree-structured networks. We assume that each node of this network can solve small-scale optimization problems and communicate approximate value functions with its neighbors based on a novel multi-sweep communication protocol. In contrast to existing parallelizable optimization algorithms for non-convex optimization the nodes of the network are neither synchronized nor assign any central entity. None of the nodes needs to know the whole topology of the network, but all nodes know that the network is tree-structured. We discuss conditions under which locally quadratic convergence rates can be achieved. The method is illustrated by running the decentralized asynchronous multi-sweep protocol on a radial AC power network case study. 
BibTeX: @article{Jiang2019, author = {Jiang, Yuning and Kouzoupis, Dimitris and Yin, Haoyu and Diehl, Moritz and Houska, Boris}, title = {Decentralized Optimization over Tree Graphs}, year = {2019} } Kanellopoulos K, Vijaykumar N, Giannoula C, Azizi R, Koppula S, Ghiasi NM, Shahroodi T, Luna JG and Mutlu O (2019), "SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations", In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, USA , pp. 600-614. ACM. [Abstract] [BibTeX] [DOI] Abstract: Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruction overhead and expensive pointer-chasing operations to discover the positions of the non-zero elements. In this paper, we identify the discovery of the positions (i.e., indexing) of non-zero elements as a key bottleneck in sparse matrix-based workloads, which greatly reduces the benefits of compression. We propose SMASH, a hardware-software cooperative mechanism that enables highly-efficient indexing and storage of sparse matrices. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit sparsity in data. To this end, we devise a novel software encoding based on a hierarchy of bitmaps. This encoding can be used to efficiently compress any sparse matrix, regardless of the extent and structure of sparsity. At the same time, the bitmap encoding can be directly interpreted by the hardware. We design a lightweight hardware unit, the Bitmap Management Unit (BMU), that buffers and scans the bitmap hierarchy to perform highly-efficient indexing of sparse matrices. 
SMASH exposes an expressive and rich ISA to communicate with the BMU, which enables its use in accelerating any sparse matrix computation. We demonstrate the benefits of SMASH on four use cases that include sparse matrix kernels and graph analytics applications. Our evaluations show that SMASH provides average performance improvements of 38% for Sparse Matrix Vector Multiplication and 44% for Sparse Matrix Matrix Multiplication, over a state-of-the-art CSR implementation, on a wide variety of matrices with different characteristics. SMASH incurs a very modest hardware area overhead of up to 0.076% of an out-of-order CPU core. BibTeX: @inproceedings{Kanellopoulos2019, author = {Kanellopoulos, Konstantinos and Vijaykumar, Nandita and Giannoula, Christina and Azizi, Roknoddin and Koppula, Skanda and Ghiasi, Nika Mansouri and Shahroodi, Taha and Luna, Juan Gomez and Mutlu, Onur}, title = {SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations}, booktitle = {Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, publisher = {ACM}, year = {2019}, pages = {600--614}, doi = {10.1145/3352460.3358286} } Kawaguchi K and Pack Kaelbling L (2019), "Every Local Minimum is a Global Minimum of an Induced Model", April, 2019. [Abstract] [BibTeX] Abstract: For non-convex optimization in machine learning, this paper proves that every local minimum achieves the global optimality of the perturbable gradient basis model at any differentiable point. As a result, non-convex machine learning is theoretically as supported as convex machine learning with a hand-crafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the hand-crafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. 
Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this paper improves or complements several state-of-the-art theoretical results in the literature with a simple and unified proof technique. BibTeX: @online{Kawaguchi2019, author = {Kawaguchi, Kenji and Pack Kaelbling, Leslie}, title = {Every Local Minimum is a Global Minimum of an Induced Model}, year = {2019} } Kim H, Hong S, Park J and Han H (2019), "Static code transformations for thread-dense memory accesses in GPU computing", Concurrency and Computation: Practice and Experience., October, 2019. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Due to the GPU's complex memory system and massive thread-level parallelism, application programmers often have difficulty optimizing GPU programs. An essential approach to memory optimization is to utilize low-latency on-chip memory to avoid high latency of off-chip memory accesses. Shared memory is an on-chip memory, which is explicitly managed by programmers. Shared memory has a read/write latency similar to that of the L1 cache, but poor data management can degrade performance. In this paper, we present a static code transformation that preloads dataset in GPU's shared memory. Our static analysis primarily targets global memory requests with high thread-density for preloading in shared memory. The thread-dense memory access pattern is a pattern in which many threads efficiently manage the address space of shared memory, as well as reuse the same data in a thread block. We limit the usage of shared memory so that thread-level parallelism remains at the same level when selecting datasets for preloading. Finally, our source-to-source compiler allows to preload selected datasets in shared memory by transforming non-optimized GPU kernel code. 
Our methods achieve 1.26× and 1.62× speedups on average (geometric mean), respectively with GTX980 and P100 GPUs. BibTeX: @article{Kim2019, author = {Kim, Hyunjun and Hong, Sungin and Park, Jeonghwan and Han, Hwansoo}, title = {Static code transformations for thread-dense memory accesses in GPU computing}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2019}, doi = {10.1002/cpe.5512} } Kiran U, Sanfui S, Ratnakar SK, Gautam SS and Sharma D (2019), "Comparative Analysis of GPU-Based Solver Libraries for a Sparse Linear System of Equations", In Advances in Computational Methods in Manufacturing. Singapore , pp. 889-897. Springer Singapore. [Abstract] [BibTeX] Abstract: In this paper, a comparison of GPU-based linear solver libraries for the solution of sparse positive-definite matrices is presented. These large sparse matrices arise in a number of computational disciplines seeking a solution for partial differential equations. The solution of these matrices is often a time-consuming process that can be reduced by parallel computing. Since the development of GPU for general-purpose computing, a number of numerical solver libraries have evolved that can accelerate the solution procedure. The performance of three solver libraries has been evaluated in this paper for five different test matrices. These test matrices have been taken from different application domains with different sparsity patterns. Results demonstrate a higher speedup from the iterative solver over the direct solver on GPU and also over a multithreaded CPU implementation. BibTeX: @inproceedings{Kiran2019, author = {Kiran, Utpal and Sanfui, Subhajit and Ratnakar, Shashi Kant and Gautam, Sachin Singh and Sharma, Deepak}, editor = {Narayanan, R. Ganesh and Joshi, Shrikrishna N. 
and Dixit, Uday Shanker}, title = {Comparative Analysis of GPU-Based Solver Libraries for a Sparse Linear System of Equations}, booktitle = {Advances in Computational Methods in Manufacturing}, publisher = {Springer Singapore}, year = {2019}, pages = {889--897} } Klowckiewicz B and Darve E (2019), "Sparse hierarchical preconditioners using piecewise smooth approximations of eigenvectors" [Abstract] [BibTeX] Abstract: When solving linear systems arising from PDE discretizations, iterative methods (such as Conjugate Gradient, GMRES, or MINRES) are often the only practical choice. To converge in a small number of iterations, however, they have to be coupled with an efficient preconditioner. The efficiency of the preconditioner depends largely on its accuracy on the eigenvectors corresponding to small eigenvalues, and unfortunately, black-box methods typically cannot guarantee sufficient accuracy on these eigenvectors. Thus, constructing the preconditioner becomes a very problem-dependent task. We describe a hierarchical approximate factorization approach which addresses this issue by focusing on improving the accuracy on smooth eigenvectors (such eigenvectors typically correspond to the small eigenvalues). The improved accuracy is achieved by preserving the action of the factorized matrix on piecewise polynomial functions of the PDE domain. Based on the factorization, we propose a family of sparse preconditioners with O (n) or O (n log n) construction complexities. Our methods exhibit the optimal O (n) solution times in benchmarks run on large elliptic problems of different types, arising for example in flow or mechanical simulations. In the case of the linear elasticity equation the preconditioners are exact on the near-kernel rigid body modes. 
BibTeX: @article{Klowckiewicz2019, author = {Klowckiewicz, Bazyl and Darve, Eric}, title = {Sparse hierarchical preconditioners using piecewise smooth approximations of eigenvectors}, year = {2019} } Kong F (2019), "A parallel monolithic multilevel Schwarz preconditioner for the neutron transport criticality calculations with a nonlinear diffusion acceleration method" [Abstract] [BibTeX] Abstract: The multigroup neutron transport criticality calculations using modern supercomputers have been widely employed in a nuclear reactor analysis for studying whether or not a system is self-sustaining. However, the design and development of an efficient parallel algorithm for the transport criticality calculations is a challenging task especially when the number of processor cores is large and the unstructured mesh is adopted since both the compute time and the memory usage need to be taken into consideration. In this paper, we study a monolithic multilevel Schwarz preconditioner for the transport criticality calculations using the nonlinear diffusion acceleration (NDA). In NDA, the linear systems of equations arising from the discretizations of the nonlinear diffusion equations and the transport equations need to be efficiently solved. To achieve this goal, we propose a monolithically coupled approach equipped with several important ingredients; e.g., subspace-based coarsening, aggressive coarsening and strength matrix thresholding. The proposed monolithic multilevel method is capable of efficiently handling the linear systems of equations for both the transport system and the diffusion system. In the multilevel method, the construction of coarse spaces is nontrivial and expensive. We propose a subspace-based coarsening algorithm to resolve this issue by exploring the matrix structures of the transport equations and the nonlinear diffusion equations. 
We numerically demonstrate that the monolithic multilevel preconditioner with the subspace-based coarsening algorithm is twice as fast as that equipped with a full space based coarsening approach on thousands of processor cores for an unstructured mesh neutron transport problem with billions of unknowns. BibTeX: @article{Kong2019, author = {Kong, Fande}, title = {A parallel monolithic multilevel Schwarz preconditioner for the neutron transport criticality calculations with a nonlinear diffusion acceleration method}, year = {2019} } Kong F (2019), "Parallel memory-efficient all-at-once algorithms for the sparse matrix triple products in multigrid methods", The International Journal of High Performance Computing Applications. [BibTeX] BibTeX: @article{Kong2019a, author = {Kong, Fande}, title = {Parallel memory-efficient all-at-once algorithms for the sparse matrix triple products in multigrid methods}, journal = {The International Journal of High Performance Computing Applications}, year = {2019} } Kong Q, Jing Y-F, Huang T-Z and An H-B (2019), "Acceleration of the Scheduled Relaxation Jacobi method: promising strategies for solving large, sparse linear systems", Journal of Computational Physics. , pp. 108862. [Abstract] [BibTeX] [DOI] [URL] Abstract: The main aim of this paper is to develop two algorithms based on the Scheduled Relaxation Jacobi (SRJ) method [J. Comput. Phys., 274 (2014), pp. 695-708] for solving problems arising from the finite-difference discretization of elliptic partial differential equations on large grids. These two algorithms are the Alternating Anderson-Scheduled Relaxation Jacobi (AASRJ) method by utilizing Anderson mixing after each SRJ iteration cycle and the Minimal Residual Scheduled Relaxation Jacobi (MRSRJ) method by minimizing residual after each SRJ iteration cycle, respectively. Through numerical experiments, we show that AASRJ is competitive with the optimal version of the SRJ method [J. Comput. Phys., 332 (2017), pp. 
446-460] in most problems we considered here, and MRSRJ outperforms SRJ in all cases. The properties of AASRJ and MRSRJ are demonstrated. Both of them are promising strategies for solving large, sparse linear systems while maintaining the simplicity of the Jacobi method. BibTeX: @article{Kong2019b, author = {Kong, Qian and Jing, Yan-Fei and Huang, Ting-Zhu and An, Heng-Bin}, title = {Acceleration of the Scheduled Relaxation Jacobi method: promising strategies for solving large, sparse linear systems}, journal = {Journal of Computational Physics}, year = {2019}, pages = {108862}, url = {http://www.sciencedirect.com/science/article/pii/S0021999119305467}, doi = {10.1016/j.jcp.2019.108862} } Konnov I, Kukovec J and Tran T-H (2019), "TLA+ Model Checking Made Symbolic", Proceedings of the ACM on Programming Languages. New York, NY, USA, 10, 2019. Vol. 3(123), pp. 1-30. ACM. [Abstract] [BibTeX] [DOI] Abstract: TLA+ is a language for formal specification of all kinds of computer systems. System designers use this language to specify concurrent, distributed, and fault-tolerant protocols, which are traditionally presented in pseudo-code. TLA+ is extremely concise yet expressive: The language primitives include Booleans, integers, functions, tuples, records, sequences, and sets thereof, which can be also nested. This is probably why the only model checker for TLA+ (called TLC) relies on explicit enumeration of values and states. In this paper, we present APALACHE -- a first symbolic model checker for TLA+. Like TLC, it assumes that all specification parameters are fixed and all states are finite structures. Unlike TLC, APALACHE translates the underlying transition relation into quantifier-free SMT constraints, which allows us to exploit the power of SMT solvers. Designing this translation is the central challenge that we address in this paper. Our experiments show that APALACHE outperforms TLC on examples with large state spaces. 
BibTeX: @article{Konnov2019, author = {Konnov, Igor and Kukovec, Jure and Tran, Thanh-Hai}, title = {TLA+ Model Checking Made Symbolic}, journal = {Proceedings of the ACM on Programming Languages}, publisher = {ACM}, year = {2019}, volume = {3}, number = {123}, pages = {1--30}, doi = {10.1145/3360549} } Koutis I and Le H (2019), "Spectral Modification of Graphs for Improved Spectral Clustering", In Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver, CA [Abstract] [BibTeX] Abstract: Spectral clustering algorithms provide approximate solutions to hard optimization problems that formulate graph partitioning in terms of the graph conductance. It is well understood that the quality of these approximate solutions is negatively affected by a possibly significant gap between the conductance and the second eigenvalue of the graph. In this paper we show that for any graph G, there exists a ‘spectral maximizer' graph H which is cut-similar to G, but has eigenvalues that are near the theoretical limit implied by the cut structure of G. Applying then spectral clustering on H has the potential to produce improved cuts that also exist in G due to the cut similarity. This leads to the second contribution of this work: we describe a practical spectral modification algorithm that raises the eigenvalues of the input graph, while preserving its cuts. Combined with spectral clustering on the modified graph, this yields demonstrably improved cuts. BibTeX: @inproceedings{Koutis2019, author = {Koutis, Ioannis and Le, Huong}, title = {Spectral Modification of Graphs for Improved Spectral Clustering}, booktitle = {Proceedings of the 33rd Conference on Neural Information Processing Systems}, year = {2019} } Kouyialis G, Wang X and Misener R (2019), "Symmetry Detection for Quadratic Optimization Using Binary Layered Graphs", Processes., November, 2019. Vol. 7(11), pp. 838. MDPI AG. 
[Abstract] [BibTeX] [DOI] Abstract: Symmetry in mathematical optimization may create multiple, equivalent solutions. In nonconvex optimization, symmetry can negatively affect algorithm performance, e.g., of branch-and-bound when symmetry induces many equivalent branches. This paper develops detection methods for symmetry groups in quadratically-constrained quadratic optimization problems. Representing the optimization problem with adjacency matrices, we use graph theory to transform the adjacency matrices into binary layered graphs. We enter the binary layered graphs into the software package nauty that generates important symmetric properties of the original problem. Symmetry pattern knowledge motivates a discretization pattern that we use to reduce computation time for an approximation of the point packing problem. This paper highlights the importance of detecting and classifying symmetry and shows that knowledge of this symmetry enables quick approximation of a highly symmetric optimization problem. BibTeX: @article{Kouyialis2019, author = {Kouyialis, Georgia and Wang, Xiaoyu and Misener, Ruth}, title = {Symmetry Detection for Quadratic Optimization Using Binary Layered Graphs}, journal = {Processes}, publisher = {MDPI AG}, year = {2019}, volume = {7}, number = {11}, pages = {838}, doi = {10.3390/pr7110838} } Kuppannagari SR, Rajat R, Kannan R, Dasu A and Prasanna VK (2019), "IP Cores for Graph Kernels on FPGAs", In Proceedings of the IEEE High Performance Extreme Computing Conference., September, 2019. , pp. 1-7. [Abstract] [BibTeX] [DOI] Abstract: Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for performing large scale graph analytics has led to widespread adoption of dedicated hardware accelerators such as FPGA for this purpose. In this work, we develop IP cores for several key graph kernels. 
Our IP cores use graph processing over partitions (GPOP) programming paradigm to perform computations over graph partitions. Partitioning the input graph into nonoverlapping partitions improves on-chip data reuse. Additional optimizations to exploit intra and interpartition parallelism and to reduce external memory accesses are also discussed. We generate FPGA designs for general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). We target a platform consisting of large external DDR4 memory to store the graph data and Intel Stratix FPGA to accelerate the processing. Experimental results show that our accelerators sustain a high throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR, SSSP and WCC, respectively. Compared with several highly-optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, 16.4× speedup for PR, 3.5× speedup for SSSP, and 35.1× speedup for WCC, and compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× speedup for SpMV, 1.64× speedup for PR, and 1.8× speedup for WCC, respectively. We develop a performance model for our GPOP paradigm. We then perform performance predictions of our designs assuming the graph is stored in HBM2 instead of DRAM. We further discuss extensions to our optimizations to improve the throughput. BibTeX: @inproceedings{Kuppannagari2019, author = {Kuppannagari, S. R. and Rajat, R. and Kannan, R. and Dasu, A. and Prasanna, V. 
K.}, title = {IP Cores for Graph Kernels on FPGAs}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2019}, pages = {1--7}, doi = {10.1109/HPEC.2019.8916363} } Kurzak J, Tsai YM, Gates M, Abdelfattah A and Dongarra J (2019), "Massively Parallel Automated Software Tuning", In Proceedings of the 48th International Conference on Parallel Processing. New York, NY, USA , pp. 92:1-92:10. ACM. [Abstract] [BibTeX] [DOI] Abstract: This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined. BibTeX: @inproceedings{Kurzak2019, author = {Kurzak, Jakub and Tsai, Yaohung M. and Gates, Mark and Abdelfattah, Ahmad and Dongarra, Jack}, title = {Massively Parallel Automated Software Tuning}, booktitle = {Proceedings of the 48th International Conference on Parallel Processing}, publisher = {ACM}, year = {2019}, pages = {92:1--92:10}, doi = {10.1145/3337821.3337908} } Laberge G, Shirzad S, Diehl P, Kaiser H, Prudhomme S and Lemoine AS (2019), "Scheduling Optimization of Parallel Linear Algebra Algorithms Using Supervised Learning", In Proceedings of the 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments., 11, 2019. , pp. 31-43. 
[Abstract] [BibTeX] [DOI] Abstract: Linear algebra algorithms are used widely in a variety of domains, e.g machine learning, numerical physics and video games graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop are assigned to a thread as one task. In this paper, we study the applications of supervised learning models to predict the chunk-size which yields maximum performance on multiple parallel linear algebra operations using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and tests sets by measuring performance of the application with different chunk-sizes for multiple linear algebra operations; vector-addition, matrix-vector-multiplication, matrix-matrix addition and matrix-matrix-multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision tree based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations. BibTeX: @inproceedings{Laberge2020, author = {Laberge, G. and Shirzad, S. and Diehl, P. and Kaiser, H. and Prudhomme, S. and Lemoine, A. 
S.}, title = {Scheduling Optimization of Parallel Linear Algebra Algorithms Using Supervised Learning}, booktitle = {Proceedings of the 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments}, year = {2019}, pages = {31--43}, doi = {10.1109/MLHPC49564.2019.00009} } Lagravière J, Langguth J, Prugger M, Einkemmer L, Ha PH and Cai X (2019), "Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC", Scientific Programming. Vol. 2019 [Abstract] [BibTeX] [DOI] Abstract: The Unified Parallel C (UPC) programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory subsystems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper, we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. 
These performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh. BibTeX: @article{Lagraviere2019, author = {Lagravière, Jèrèmie and Langguth, Johannes and Prugger, Martina and Einkemmer, Lukas and Ha, Phuong Hoai and Cai, Xing}, title = {Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC}, journal = {Scientific Programming}, year = {2019}, volume = {2019}, doi = {10.1155/2019/6825728} } Lee C-L, Chao C-T, Lee J-K, Huang C-W and Hung M-Y (2019), "Sparse-Matrix Compression Primitives with OpenCL Framework to Support Halide", In Proceedings of the International Workshop on OpenCL. New York, NY, USA , pp. 1-2. ACM. [Abstract] [BibTeX] [DOI] Abstract: Halide and OpenCL now play important roles for heterogeneous multi-core computing. OpenCL provides vendor-level support and Halide provides domain-specific support such as vision processing and AI model (TVM Halide IR). Halide also provides flexible scheduling for applications on target machines. OpenCL plays a supporting role for Halide environments. In this work, we investigate the research issues in supporting sparse computation with Halide and their corresponding OpenCL support. We present sparse matrix compression primitives on Halide for sparse matrix matrix (SpMM) multiplication with OpenCL framework. Halide is a programming language designed to process image and array from numerous algorithms and scheduling primitives to achieve state-of-art performance including SIMD and heterogeneous computation. 
This paper proposed the implementation of sparse matrix compression for Halide scheduling primitives including COO, CSR, and hybrid CSR. The design of experiments includes Halide primitives for sparse matrix compression and matrix computations. The experimental result of computation with compressing matrix shows the performance are improved by up to 85% compared to the baseline without compression. BibTeX: @inproceedings{Lee2019, author = {Lee, Chao-Lin and Chao, Chen-Ting and Lee, Jenq-Kuen and Huang, Chung-Wen and Hung, Ming-Yu}, title = {Sparse-Matrix Compression Primitives with OpenCL Framework to Support Halide}, booktitle = {Proceedings of the International Workshop on OpenCL}, publisher = {ACM}, year = {2019}, pages = {1--2}, doi = {10.1145/3318170.3318179} } Lee D, Oh J and Yu H (2019), "OCam: Out-of-core coordinate descent algorithm for matrix completion", Information Sciences. [Abstract] [BibTeX] [DOI] [URL] Abstract: Recently, there are increasing reports that most datasets can be actually stored in disks of a single off-the-shelf workstation, and utilizing out-of-core methods is much cheaper and even faster than using a distributed system. For these reasons, out-of-core methods have been actively developed for machine learning and graph processing. The goal of this paper is to develop an efficient out-of-core matrix completion method based on coordinate descent approach. Coordinate descent-based matrix completion (CD-MC) has two strong benefits over other approaches: 1) it does not involve heavy computation such as matrix inversion and 2) it does not have step-size hyper-parameters, which reduces the effort for hyper-parameter tuning. Existing solutions for CD-MC have been developed and analyzed for in-memory setting and they do not take disk-I/O into account. Thus, we propose OCam, a novel out-of-core coordinate descent algorithm for matrix completion. 
Our evaluation results and cost analyses provide sound evidences supporting the following benefits of OCam: (1) Scalability -- OCam is a truly scalable out-of-core method and thus decomposes a matrix larger than the size of memory, (2) Efficiency -- OCam is super fast. OCam is up to 10× faster than the state-of-the-art out-of-core method, and up to 4.1× faster than a competing distributed method when using eight machines. The source code of OCam will be available for reproducibility. BibTeX: @article{Lee2019a, author = {Lee, Dongha and Oh, Jinoh and Yu, Hwanjo}, title = {OCam: Out-of-core coordinate descent algorithm for matrix completion}, journal = {Information Sciences}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S0020025519309284}, doi = {10.1016/j.ins.2019.09.077} } le Gorrec L, Mouysset S, Duff IS, Knight PA and Ruiz D (2019), "Uncovering Hidden Block Structure for Clustering", In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery. [Abstract] [BibTeX] Abstract: We present a multistage procedure to cluster directed and undirected weighted graphs by finding the block structure of their adjacency matrices. A central part of the process is to scale the adjacency matrix into a doubly-stochastic form, which permits detection of the whole matrix block structure with minimal spectral information (theoretically a single pair of singular vectors suffices). We present the different stages of our method, namely the impact of the doubly-stochastic scaling on singular vectors, detection of the block structure by means of these vectors, and details such as cluster refinement and a stopping criterion. Then we test the algorithm's effectiveness by using it on two unsupervised classification tasks: community detection in networks and shape detection in clouds of points in two dimensions. 
By comparing results of our approach with those of widely used algorithms designed for specific purposes, we observe that our method is competitive (for community detection) if not superior (for shape detection) in comparison with existing methods. BibTeX: @inproceedings{LeGorrec2019, author = {le Gorrec, Luce and Mouysset, Sandrine and Duff, Iain S. and Knight, Philip A. and Ruiz, Daniel}, title = {Uncovering Hidden Block Structure for Clustering}, booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery}, year = {2019} } Lei D, Du M, Chen H, Li Z and Wu Y (2019), "Distributed Parallel Sparse Multinomial Logistic Regression", IEEE Access. Vol. 7, pp. 55496-55508. [Abstract] [BibTeX] [DOI] Abstract: Sparse Multinomial Logistic Regression (SMLR) is widely used in the field of image classification, multi-class object recognition, and so on, because it has the function of embedding feature selection during classification. However, it cannot meet the time and memory requirements for processing large-scale data. We have reinvestigated the classification accuracy and running efficiency of the algorithm for solving SMLR problems using the Alternating Direction Method of Multipliers (ADMM), which is called fast SMLR (FSMLR) algorithm in this paper. By reformulating the optimization problem of FSMLR, we transform the serial convex optimization problem to the distributed convex optimization problem, i.e., global consensus problem and sharing problem. Based on the distributed optimization problem, we propose two distribute parallel SMLR algorithms, sample partitioning-based distributed SMLR (SP-SMLR), and feature partitioning-based distributed SMLR (FP-SMLR), for a large-scale sample and large-scale feature datasets in big data scenario, respectively. The experimental results show that the FSMLR algorithm has higher accuracy than the original SMLR algorithm. 
The big data experiments show that our distributed parallel SMLR algorithms can scale for massive samples and large-scale features, with high precision. In a word, our proposed serial and distribute SMLR algorithms outperform the state-of-the-art algorithms. BibTeX: @article{Lei2019, author = {Lei, D. and Du, M. and Chen, H. and Li, Z. and Wu, Y.}, title = {Distributed Parallel Sparse Multinomial Logistic Regression}, journal = {IEEE Access}, year = {2019}, volume = {7}, pages = {55496--55508}, doi = {10.1109/ACCESS.2019.2913280} } Li Y, Xie P, Chen X, Liu J, Yang B, Li S, Gong C, Gan X and Xu H (2019), "VBSF: a new storage format for SIMD sparse matrix--vector multiplication on modern processors", The Journal of Supercomputing., April, 2019. [Abstract] [BibTeX] [DOI] Abstract: Sparse matrix--vector multiplication (SpMV) is one of the most indispensable kernels of solving problems in numerous applications, but its performance of SpMV is limited by the need for frequent memory access. Modern processors exploit data-level parallelism to improve the performance using single-instruction multiple data (SIMD). In order to take full advantage of SIMD acceleration technology, a new storage format called Variable Blocked-σ-SIMD Format (VBSF) is proposed in this paper to change the irregular nature of traditional matrix storage formats. This format combines the adjacent nonzero elements into variable size blocks to ensure that SpMV can be computed with SIMD vector units. We compare the VBSF-based SpMV with traditional storage formats using 15 matrices as a benchmark suite on three computing platforms (FT2000, Intel Xeon E5 and Intel Silver) with different SIMD length. For the matrices in the benchmark suite, the VBSF obtains great performance improvement on three platforms, respectively, and it proves to have better storage efficiency compared with other storage formats. 
BibTeX: @article{Li2019, author = {Li, Yishui and Xie, Peizhen and Chen, Xinhai and Liu, Jie and Yang, Bo and Li, Shengguo and Gong, Chunye and Gan, Xinbiao and Xu, Han}, title = {VBSF: a new storage format for SIMD sparse matrix--vector multiplication on modern processors}, journal = {The Journal of Supercomputing}, year = {2019}, doi = {10.1007/s11227-019-02835-4} } Li J, Uçar B, Çatalyürek ÜV, Sun J, Barker K and Vuduc R (2019), "Efficient and Effective Sparse Tensor Reordering", In Proceedings of the ACM International Conference on Supercomputing. New York, NY, USA , pp. 227-237. ACM. [Abstract] [BibTeX] [DOI] Abstract: This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations with it, and proposes two reordering algorithms for this problem, which we call BFS-MCS and Lexi-Order. The BFS-MCS method is a Breadth First Search (BFS)-like heuristic approach based on the maximum cardinality search family; Lexi-Order is an extension of doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when storing the tensor in three previously proposed sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for HiCOO format to improve load balance. On modern multicore CPUs, we show Lexi-Order obtains up to 4.14× speedup on sequential HiCOO-Mttkrp and 11.88× speedup on its parallel counterpart. The performance of COO- and CSF-based Mttkrps also improves. Our two reordering methods are more effective than state-of-the-art approaches. The code is released as part of Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI. BibTeX: @inproceedings{Li2019a, author = {Li, Jiajia and Uçar, Bora and Çatalyürek, Ümit V. 
and Sun, Jimeng and Barker, Kevin and Vuduc, Richard}, title = {Efficient and Effective Sparse Tensor Reordering}, booktitle = {Proceedings of the ACM International Conference on Supercomputing}, publisher = {ACM}, year = {2019}, pages = {227--237}, doi = {10.1145/3330345.3330366} } Li M, Hawrylak P and Hale J (2019), "Combining OpenCL and MPI to Support Heterogeneous Computing on a Cluster", In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning). New York, NY, USA , pp. 5:1-5:6. ACM. [Abstract] [BibTeX] [DOI] Abstract: This paper presents an implementation of a heterogeneous programming model which combines Open Computing Language (OpenCL) and Message Passing Interface (MPI). The model is applied to solving a Markov decision process (MDP) with value iteration method. The performance test is conducted on a high performance computing cluster. At peak performance, the model is able to achieve a 57× speedup over a serial implementation. For an extremely large input MDP, which has 1,000,000 states, the obtained speedup is still over 12×, showing that this heterogeneous programming model can solve MDPs more efficiently than the serial solver does. BibTeX: @inproceedings{Li2019b, author = {Li, Ming and Hawrylak, Peter and Hale, John}, title = {Combining OpenCL and MPI to Support Heterogeneous Computing on a Cluster}, booktitle = {Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning)}, publisher = {ACM}, year = {2019}, pages = {5:1--5:6}, doi = {10.1145/3332186.3333059} } Li M, Liu Y, Yang H, Luan Z, Gan L, Yang G and Qian D (2019), "Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture", IEEE Transactions on Parallel and Distributed Systems. 
[Abstract] [BibTeX] [DOI] Abstract: To improve the performance of Sparse Cholesky factorization, existing research divides the adjacent columns of the sparse matrix with the same nonzero patterns into supernodes for parallelization. However, due to the various structures of sparse matrices, the computation of the generated supernodes varies significantly, and is thus hard to optimize when computed by dense matrix kernels. Therefore, how to efficiently map sparse Cholesky factorization to the emerging architectures, such as Sunway many-core processor, remains an active research direction. In this paper, we propose swCholesky, which is a highly optimized implementation of sparse Cholesky factorization on Sunway processor. Specifically, we design three kernel task queues and a dense matrix library to dynamically adapt to the kernel characteristics and architecture features. In addition, we propose an auto-tuning mechanism to search for the optimal settings of the important parameters in swCholesky. Our experiments show that swCholesky achieves better performance than state-of-the-art implementations. BibTeX: @article{Li2019c, author = {Li, M. and Liu, Y. and Yang, H. and Luan, Z. and Gan, L. and Yang, G. and Qian, D.}, title = {Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2019}, doi = {10.1109/TPDS.2019.2953852} } Liew D, Cadar C, Donaldson AF and Stinnett JR (2019), "Just fuzz it: solving floating-point constraints using coverage-guided fuzzing", In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM Press. [Abstract] [BibTeX] [DOI] Abstract: We investigate the use of coverage-guided fuzzing as a means of proving satisfiability of SMT formulas over finite variable domains, with specific application to floating-point constraints. 
We show how an SMT formula can be encoded as a program containing a location that is reachable if and only if the program's input corresponds to a satisfying assignment to the formula. A coverage-guided fuzzer can then be used to search for an input that reaches the location, yielding a satisfying assignment. We have implemented this idea in a tool, Just Fuzz-it Solver (JFS), and we present a large experimental evaluation showing that JFS is both competitive with and complementary to state-of-the-art SMT solvers with respect to solving floating-point constraints, and that the coverage-guided approach of JFS provides significant benefit over naive fuzzing in the floating-point domain. Applied in a portfolio manner, the JFS approach thus has the potential to complement traditional SMT solvers for program analysis tasks that involve reasoning about floating-point constraints. BibTeX: @inproceedings{Liew2019, author = {Daniel Liew and Cristian Cadar and Alastair F. Donaldson and J. Ryan Stinnett}, title = {Just fuzz it: solving floating-point constraints using coverage-guided fuzzing}, booktitle = {Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, publisher = {ACM Press}, year = {2019}, doi = {10.1145/3338906.3338921} } Lin T and Jordan MI (2019), "A Control-Theoretic Perspective on Optimal High-Order Optimization", December, 2019. [Abstract] [BibTeX] Abstract: In this paper, we provide a control-theoretic perspective on optimal tensor optimization algorithms for minimizing a convex function in a finite-dimensional Euclidean space. Given a function Φ : ℝ^d → ℝ that is convex and twice-continuously differentiable, we study an ordinary differential equation (ODE) that is governed by the gradient operator ∇Φ and a positive control parameter λ(t) that tends to infinity as t → +∞. 
The tuning of λ(·) is achieved via a closed-loop control law based on the algebraic equation [λ(t)]^p ‖∇Φ(x(t))‖^{p-1} = θ for a given θ > 0. We prove the existence and uniqueness of a local solution to this closed-loop ODE by the Banach fixed-point theorem. We then present a Lyapunov function that allows us to establish the existence and uniqueness of a global solution and analyze the convergence properties of trajectories. The rate of convergence is 𝒪(t^{-(3p+1)/2}) in terms of objective gap and 𝒪(t^{-3p}) in terms of squared gradient norm. We present two frameworks for implicit time discretization of the ODE, one of which generalizes the large-step A-HPE framework of [Monteiro2013], and the other of which leads to a new p-th order tensor algorithm. A highlight of our analysis is that we show that all of the p-th order optimal tensor algorithms in this paper minimize the squared gradient norm at a rate of 𝒪(k^{-3p}). BibTeX: @article{Lin2019, author = {Tianyi Lin and Michael I. Jordan}, title = {A Control-Theoretic Perspective on Optimal High-Order Optimization}, year = {2019} } Lin L and Wu X (2019), "Numerical solution of large scale Hartree-Fock-Bogoliubov equations", December, 2019. [Abstract] [BibTeX] Abstract: The Hartree-Fock-Bogoliubov (HFB) theory is the starting point for treating superconducting systems. However, the computational cost for solving large scale HFB equations can be much larger than that of the Hartree-Fock equations, particularly when the Hamiltonian matrix is sparse, and the number of electrons N is relatively small compared to the matrix size N_b. We first provide a concise and relatively self-contained review of the HFB theory for general finite sized quantum systems, with special focus on the treatment of spin symmetries from a linear algebra perspective. We then demonstrate that the pole expansion and selected inversion (PEXSI) method can be particularly well suited for solving large scale HFB equations. 
For a Hubbard-type Hamiltonian, the cost of PEXSI is at most O(N_b^2) for both gapped and gapless systems, which can be significantly faster than the standard cubic scaling diagonalization methods. We show that PEXSI can solve a two-dimensional Hubbard-Hofstadter model with N_b up to 2.88× 10^6, and the wall clock time is less than 100 s using 17280 CPU cores. This enables the simulation of physical systems under experimentally realizable magnetic fields, which cannot be otherwise simulated with smaller systems. BibTeX: @article{Lin2019a, author = {Lin Lin and Xiaojie Wu}, title = {Numerical solution of large scale Hartree-Fock-Bogoliubov equations}, year = {2019} } Liu H, Tian Y, Zong H, Ma Q, Wang MY and Zhang L (2019), "Fully parallel level set method for large-scale structural topology optimization", Computers & Structures. Vol. 221, pp. 13-27. [Abstract] [BibTeX] [DOI] [URL] Abstract: To realize large-scale or high-resolution structural topology optimization design, a fully parallel parameterized level set method with compactly supported radial basis functions (CSRBFs) is developed based on both the uniform and non-uniform structured meshes. In this work, the whole computation process is parallelized, including mesh generation, sensitivity analysis, calculation and assembly of the element stiffness matrices, solving of the structural state equation, parameterization and updating of the level set function, and output of the computational results during the optimization iterations. In addition, some typical numerical examples, in which the calculation scale is up to 7 million 8-node hexahedral elements, are carried out for verifying the effectiveness of the proposed method. Finally, the computing time is also analyzed in detail. 
It is found that: (1) In the optimized structures, the thin sheet-like components gradually replace the truss-like ones when refining the mesh, (2) the parameterization process of the level set function will become fast as long as the non-uniformity of mesh is not very high and the supported radius of CSRBF is small enough, and (3) more than 80% of the total computing time is always consumed for solving the structural state equation during the finite element analysis (FEA). BibTeX: @article{Liu2019, author = {Liu, Hui and Tian, Ye and Zong, Hongming and Ma, Qingping and Wang, Michael Yu and Zhang, Liang}, title = {Fully parallel level set method for large-scale structural topology optimization}, journal = {Computers & Structures}, year = {2019}, volume = {221}, pages = {13--27}, url = {http://www.sciencedirect.com/science/article/pii/S0045794918316511}, doi = {10.1016/j.compstruc.2019.05.010} } Liu C, Yang H, Liu X, Luan Z and Qian D (2019), "Intelligent-Unrolling: Exploiting Regular Patterns in Irregular Applications", October, 2019. [Abstract] [BibTeX] Abstract: Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorization codes. However, such patterns in irregular applications are unknown until runtime due to the input dependence. Thus, either compiler's static optimization or profile-guided optimization based on specific inputs cannot predict the patterns for any common input, which leads to suboptimal code generation. To address this challenge, we develop Intelligent-Unroll, a framework to automatically optimize irregular applications with vectorization. Intelligent-Unroll allows the users to depict the computation task using code seed with the memory access and computation patterns represented in feature table and information-code tree, and generates highly efficient codes. 
Furthermore, Intelligent-Unroll employs several novel optimization techniques to optimize reduction operations and gather/scatter instructions. We evaluate Intelligent-Unroll with sparse matrix-vector multiplication (SpMV) and graph applications. Experimental results show that Intelligent-Unroll is able to generate more efficient vectorized code compared to the state-of-the-art implementations. BibTeX: @article{Liu2019a, author = {Liu, Changxi and Yang, Hailong and Liu, Xu and Luan, Zhongzhi and Qian, Depei}, title = {Intelligent-Unrolling: Exploiting Regular Patterns in Irregular Applications}, year = {2019} } Ma S, Liu Z, Chen S, Huang L, Guo Y, Wang Z and Zhang M (2019), "Coordinated DMA: Improving the DRAM Access Efficiency for Matrix Multiplication", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. [Abstract] [BibTeX] [DOI] Abstract: High performance implementation of matrix multiplication is essential for scientific computing. Memory access is quite likely to be the bottleneck of matrix multiplication. The widely used GotoBLAS GEMM implementation divides the integral matrix into several partitions to be assigned to different cores for parallelization. Traditionally, each core deploys a DMA transfer to access its own partition in the DRAM memory. However, deploying an independent DMA transfer for each core cannot efficiently exploit the inter-core locality. Also, multiple concurrent DMA transfers interfere with each other, further reducing the DRAM access efficiency. We observe that the same row of neighboring partitions is in the same DRAM page, which means that there is significant locality inherent in the address layout. We propose the coordinated DMA to efficiently exploit the locality. It invokes one transfer to serve all cores and moves data in a row-major manner to improve the DRAM access efficiency. 
Compared with a baseline design, the coordinated DMA improves the bandwidth by 84.8% and reduces DRAM energy consumption by 43.1% for micro-benchmarks. It achieves higher performance for the GEMM and Linpack benchmark. With much less hardware costs, the coordinated DMA significantly outperforms an out-of-order memory controller. BibTeX: @article{Ma2019, author = {Ma, S. and Liu, Z. and Chen, S. and Huang, L. and Guo, Y. and Wang, Z. and Zhang, M.}, title = {Coordinated DMA: Improving the DRAM Access Efficiency for Matrix Multiplication}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2019}, pages = {1--1}, doi = {10.1109/TPDS.2019.2906891} } Macintosh HJ, Banks JE and Kelson NA (2019), "Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs", International Journal of Reconfigurable Computing., October, 2019. Vol. 2019, pp. 1-13. Hindawi Limited. [Abstract] [BibTeX] [DOI] Abstract: Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator's compute performance. 
The truncated SPIKE FPGA solver is developed first for optimising OpenCL device kernel performance, global memory bandwidth, and interleaved host to device memory transactions. The FPGA OpenCL kernel code is then refactored and optimised to best exploit the underlying architecture of the CPU and GPU. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150%, and 170% faster, respectively, than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU + FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall estimated energy efficiency. BibTeX: @article{Macintosh2019, author = {Macintosh, Hamish J. and Banks, Jasmine E. and Kelson, Neil A.}, title = {Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs}, journal = {International Journal of Reconfigurable Computing}, publisher = {Hindawi Limited}, year = {2019}, volume = {2019}, pages = {1--13}, doi = {10.1155/2019/3679839} } Malitsky Y and Mishchenko K (2019), "Adaptive Gradient Descent without Descent", Proceedings of the 37th International Conference on Machine Learning., October, 2019. [Abstract] [BibTeX] Abstract: We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature. No need for functional values, no line search, no information about the function except for the gradients. 
By following these rules, you get a method adaptive to the local geometry, with convergence guarantees depending only on the smoothness in a neighborhood of a solution. Given that the problem is convex, our method converges even if the global smoothness constant is infinity. As an illustration, it can minimize an arbitrary twice continuously differentiable convex function. We examine its performance on a range of convex and nonconvex problems, including logistic regression and matrix factorization. BibTeX: @article{Malitsky2019, author = {Yura Malitsky and Konstantin Mishchenko}, title = {Adaptive Gradient Descent without Descent}, journal = {Proceedings of the 37th International Conference on Machine Learning}, year = {2019} } Mamooler P (2019), "The domain decomposition method of Bank and Jimack as an optimized Schwarz method". Thesis at: University of Geneva. [Abstract] [BibTeX] Abstract: The aim of this thesis is to introduce the Bank-Jimack domain decomposition method and study its convergence behavior. We are interested in understanding what the precise contribution of the outer coarse mesh is to the convergence behavior of the domain decomposition method proposed by Bank and Jimack. We show for a two subdomain decomposition that the outer coarse mesh can be interpreted as computing an approximation to the optimal transmission condition represented by the Dirichlet to Neumann map, and thus the method of Bank and Jimack can be viewed as an optimized Schwarz method, i.e. a Schwarz method that uses Robin or higher order transmission conditions instead of the classical Dirichlet ones. 
BibTeX: @phdthesis{Mamooler2019, author = {Mamooler, Parisa}, title = {The domain decomposition method of Bank and Jimack as an optimized Schwarz method}, school = {University of Geneva}, year = {2019} } Marques SMVN, Medeiros TS, Rossi FD, Luizelli MC, Girardi AG, Beck ACS and Lorenzon AF (2019), "The Impact of Turbo Frequency on the Energy, Performance, and Aging of Parallel Applications", In Proceedings of the IFIP/IEEE 27th International Conference on Very Large Scale Integration., October, 2019. , pp. 149-154. [Abstract] [BibTeX] [DOI] Abstract: Technologies that improve the performance of parallel applications by increasing the nominal operating frequency of processors respecting a given TDP (Thermal Design Power) have been widely used. However, they may impact on other non-functional requirements in different ways (e.g. increasing energy consumption or aging). Therefore, considering the huge number of configurations available, represented by the range of all possible combinations among different parallel applications, amount of threads, dynamic voltage and frequency scaling (DVFS) governors, boosting technologies and simultaneous multithreading (SMT), selecting the one that offers the best tradeoff for a non-functional requirement is extremely challenging for software designers. Given that, in this work we assess the impact of changing these configurations on the energy consumption, performance, and aging of parallel applications on a turbo-compliant processor. Results show that there is no single configuration that would provide the best solution for all nonfunctional requirements at once. For instance, we demonstrate that the configuration that offers the best performance is the same one that has the worst impact on aging, accelerating it by up to 1.75 times. 
With our experiments, we provide guidelines for the developer when it comes to tuning performance using turbo boosting to save as much energy as possible and increase the lifespan of the hardware components. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. BibTeX: @inproceedings{Marques2019, author = {Marques, S. M. V. N. and Medeiros, T. S. and Rossi, F. D. and Luizelli, M. C. and Girardi, A. G. and Beck, A. C. S. and Lorenzon, A. F.}, title = {The Impact of Turbo Frequency on the Energy, Performance, and Aging of Parallel Applications}, booktitle = {Proceedings of the IFIP/IEEE 27th International Conference on Very Large Scale Integration}, year = {2019}, pages = {149--154}, doi = {10.1109/VLSI-SoC.2019.8920389} } Massias M, Vaiter S, Gramfort A and Salmon J (2019), "Dual Extrapolation for Sparse Generalized Linear Models" [Abstract] [BibTeX] Abstract: Generalized Linear Models (GLM) form a wide class of regression and classification models, where prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity inducing regularizations have proven to be useful while offering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets diminish the size of the optimization problem at hand, either by progressively removing variables, or by solving a growing sequence of smaller problems. For both techniques, significant variables are identified thanks to convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit a Vector AutoRegressive (VAR) behavior after sign identification, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. 
Exploiting this regularity, one can construct dual points that offer tighter certificates of optimality, enhancing the performance of screening rules and helping to design competitive working set algorithms. BibTeX: @article{Massias2019, author = {Massias, Mathurin and Vaiter, Samuel and Gramfort, Alexandre and Salmon, Joseph}, title = {Dual Extrapolation for Sparse Generalized Linear Models}, year = {2019} } Mattson T, Davis TA, Kumar M, Buluc A, McMillan S, Moreira J and Yang C (2019), "LAGraph: A Community Effort to Collect Graph Algorithms Built on Top of the GraphBLAS", In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops., May, 2019. , pp. 276-284. [Abstract] [BibTeX] [DOI] Abstract: In 2013, we released a position paper to launch a community effort to define a common set of building blocks for constructing graph algorithms in the language of linear algebra. This led to the GraphBLAS. We released a specification for the C programming language binding to the GraphBLAS in 2017. Since that release, multiple libraries that conform to the GraphBLAS C specification have been produced. In this position paper, we launch the next phase of this ongoing community effort: a project to assemble a set of high level graph algorithms built on top of the GraphBLAS. While many of these algorithms are well-known with high quality implementations available, they have not been assembled in one place and integrated with the GraphBLAS. We call this project the LAGraph graph algorithms project and with this position paper, we put out a call for collaborators to join us. While the initial goal is to just assemble these algorithms into a single framework, the long term goal is a library of production-worthy code, with the LAGraph library serving as an open source repository of verified graph algorithms that use the GraphBLAS. BibTeX: @inproceedings{Mattson2019, author = {Mattson, T. and Davis, T. A. and Kumar, M. and Buluc, A. and McMillan, S. 
and Moreira, J. and Yang, C.}, title = {LAGraph: A Community Effort to Collect Graph Algorithms Built on Top of the GraphBLAS}, booktitle = {2019 IEEE International Parallel and Distributed Processing Symposium Workshops}, year = {2019}, pages = {276--284}, doi = {10.1109/IPDPSW.2019.00053} } Mendoza H, Klein A, Feurer M, Springenberg JT, Urban M, Burkart M, Dippel M, Lindauer M and Hutter F (2019), "Towards Automatically-Tuned Deep Neural Networks", In Automated Machine Learning: Methods, Systems, Challenges. Cham , pp. 135-149. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: Recent advances in AutoML have led to automated tools that can compete with machine learning experts on supervised learning tasks. In this work, we present two versions of Auto-Net, which provide automatically-tuned deep neural networks without any human intervention. The first version, Auto-Net 1.0, builds upon ideas from the competition-winning system Auto-sklearn by using the Bayesian Optimization method SMAC and uses Lasagne as the underlying deep learning (DL) library. The more recent Auto-Net 2.0 builds upon a recent combination of Bayesian Optimization and HyperBand, called BOHB, and uses PyTorch as DL library. To the best of our knowledge, Auto-Net 1.0 was the first automatically-tuned neural network to win competition datasets against human experts (as part of the first AutoML challenge). Further empirical results show that ensembling Auto-Net 1.0 with Auto-sklearn can perform better than either approach alone, and that Auto-Net 2.0 can perform better yet. 
BibTeX: @inbook{Mendoza2019, author = {Mendoza, Hector and Klein, Aaron and Feurer, Matthias and Springenberg, Jost Tobias and Urban, Matthias and Burkart, Michael and Dippel, Maximilian and Lindauer, Marius and Hutter, Frank}, editor = {Hutter, Frank and Kotthoff, Lars and Vanschoren, Joaquin}, title = {Towards Automatically-Tuned Deep Neural Networks}, booktitle = {Automated Machine Learning: Methods, Systems, Challenges}, publisher = {Springer International Publishing}, year = {2019}, pages = {135--149}, doi = {10.1007/978-3-030-05318-5_7} } Meng K, Li J, Tan G and Sun N (2019), "A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs", In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. New York, NY, USA , pp. 201-213. ACM. [Abstract] [BibTeX] [DOI] Abstract: This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of algorithmic patterns that allow for the configurable assembly of variants of the algorithm. The fast transition of Gswitch is based on a machine learning model trained using 644 real graphs. Moreover, Gswitch provides a simple programming interface that conceals low-level tuning details from the user. We evaluate Gswitch on typical graph algorithms (BFS, CC, PR, SSSP, and BC) using Nvidia Kepler and Pascal GPUs. The results show that Gswitch runs up to 10× faster than the best configuration of the state-of-the-art programmable GPU-based graph processing libraries on 10 representative graphs. Gswitch outperforms Gunrock in 92.4% of cases on 644 graphs, which is the largest dataset evaluation reported to date. 
BibTeX: @inproceedings{Meng2019, author = {Meng, Ke and Li, Jiajia and Tan, Guangming and Sun, Ninghui}, title = {A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs}, booktitle = {Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2019}, pages = {201--213}, doi = {10.1145/3293883.3295716} } Milroy DJ, Baker AH, Dennis JM, Gettelman A and Hammerling DM (2019), "Investigating the Impact of Mixed Precision on Correctness for a Large Climate Code", In Proceedings of the Third International Workshop on Software Correctness for HPC Applications. [Abstract] [BibTeX] Abstract: Earth system models (ESMs) are computationally expensive and represent many complex processes on a wide range of scales from molecular to global. Certain ESM computations require high precision while others, such as atmospheric microphysics (e.g., precipitation) which are approximated by bulk properties, should not. As such, atmospheric microphysics models are prime candidates for conversion to single precision, which affords distinct computational and memory advantages over typical double-precision numbers. However, care must be taken as indiscriminate type casting to single precision can result in numerical instability and divergent output when applied naively. In this work we relate our experiences attempting to improve the performance of the Morrison-Gettelman microphysics package (MG2) in a popular ESM by modifying it to compute in single precision without sacrificing correctness. We find that modification of the entire MG2 package to compute with single-precision floats achieves a respectable performance increase but does not appear to be correct in terms of maintaining consistency with double-precision MG2. On the other hand, narrowing the scope of our conversion to a couple of expensive subprograms yields more satisfying results in terms of correctness but with negligible overall performance improvement. 
We evaluate correctness with both an objective statistical tool and traditional approaches more familiar to climate scientists. While we are still working toward our ultimate goal of improving the performance of MG2 without negatively affecting model output, we believe that our experiences may be helpful to other groups pursuing similar goals. BibTeX: @inproceedings{Milroy2019, author = {Milroy, D. J. and Baker, A. H. and Dennis, J. M. and Gettelman, A. and Hammerling, D. M.}, title = {Investigating the Impact of Mixed Precision on Correctness for a Large Climate Code}, booktitle = {Proceedings of the Third International Workshop on Software Correctness for HPC Applications}, year = {2019} } Mniszewski SM (2019), "Graph Partitioning As Quadratic Unconstrained Binary Optimization (QUBO) on Spiking Neuromorphic Hardware", In Proceedings of the International Conference on Neuromorphic Systems. New York, NY, USA , pp. 4:1-4:5. ACM. [Abstract] [BibTeX] [DOI] Abstract: In this work, graph partitioning (GP) is explored using quadratic unconstrained binary optimization (QUBO) on the IBM TrueNorth spiking neuromorphic architecture. GP splits a graph into similar-sized parts while minimizing the number of cut edges between parts. Classical approaches to GP rely on heuristics and approximation algorithms. The GP QUBO formulation was inspired by previous work using the D-Wave quantum annealer. This approach is not limited to graph algorithms, but is applicable to solving a spectrum of NP-hard optimization problems. A classical pseudo simulated annealing metaheuristic is used to solve the QUBO. Implementation on the IBM TrueNorth using a spiking framework is described. Results as converged high-energy solutions are shown to be "good enough" or optimal for partitioning a graph into 2 parts. 
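To make the QUBO view of graph partitioning in the abstract above concrete, here is a minimal Python sketch. It is not the paper's TrueNorth implementation: the objective (number of cut edges plus a balance penalty weighted by a hypothetical `alpha`), the helper names `build_qubo`, `energy`, `brute_force` and `cut_size`, and the toy graph are all assumptions made for illustration, and the exhaustive search merely stands in for an annealing-style solver.

```python
from itertools import product


def build_qubo(n, edges, alpha=1.0):
    """Build an upper-triangular QUBO {(i, j): weight}, i <= j, for 2-way partitioning.

    Assumed objective (x_i in {0, 1} selects the part of node i):
        sum over edges (x_i + x_j - 2 x_i x_j)  +  alpha * (sum_i x_i - n/2)^2
    The constant alpha * n^2 / 4 is dropped because it does not change the minimiser.
    """
    Q = {}

    def add(i, j, w):
        key = (min(i, j), max(i, j))
        Q[key] = Q.get(key, 0.0) + w

    # Cut term: an edge contributes 1 exactly when its endpoints differ.
    for i, j in edges:
        add(i, i, 1.0)
        add(j, j, 1.0)
        add(i, j, -2.0)

    # Balance term, expanded using x_i^2 = x_i for binary variables.
    for i in range(n):
        add(i, i, alpha * (1 - n))
        for j in range(i + 1, n):
            add(i, j, 2.0 * alpha)
    return Q


def energy(Q, x):
    """Evaluate the QUBO energy x^T Q x for a binary assignment x."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())


def brute_force(n, Q):
    """Exhaustively search all 2^n assignments (only sensible for tiny graphs)."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))


def cut_size(edges, x):
    """Count edges whose endpoints land in different parts."""
    return sum(1 for i, j in edges if x[i] != x[j])


# Toy graph: two triangles joined by the single edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
N = 6

Q = build_qubo(N, EDGES, alpha=1.0)
best = brute_force(N, Q)
print("assignment:", best, "cut edges:", cut_size(EDGES, best))
```

On this toy graph the minimiser separates the two triangles and cuts only the bridging edge. Larger `alpha` values favour balance over a small cut, so the weight would need tuning for less symmetric graphs.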
BibTeX: @inproceedings{Mniszewski2019, author = {Mniszewski, Susan M.}, title = {Graph Partitioning As Quadratic Unconstrained Binary Optimization (QUBO) on Spiking Neuromorphic Hardware}, booktitle = {Proceedings of the International Conference on Neuromorphic Systems}, publisher = {ACM}, year = {2019}, pages = {4:1--4:5}, doi = {10.1145/3354265.3354269} } Mohammadi MS, Yuki T, Cheshmi K, Davis EC, Hall M, Dehnavi MM, Nandy P, Olschanowsky C, Venkat A and Strout MM (2019), "Sparse Computation Data Dependence Simplification for Efficient Compiler-generated Inspectors", In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, NY, USA , pp. 594-609. ACM. [Abstract] [BibTeX] [DOI] Abstract: This paper presents a combined compile-time and runtime loop-carried dependence analysis of sparse matrix codes and evaluates its performance in the context of wavefront parallelism. Sparse computations incorporate indirect memory accesses such as x[col[j]] whose memory locations cannot be determined until runtime. The key contributions of this paper are two compile-time techniques for significantly reducing the overhead of runtime dependence testing: (1) identifying new equality constraints that result in more efficient runtime inspectors, and (2) identifying subset relations between dependence constraints such that one dependence test subsumes another one that is therefore eliminated. New equality constraints discovery is enabled by taking advantage of domain-specific knowledge about index arrays, such as col[j]. These simplifications lead to automatically-generated inspectors that make it practical to parallelize such computations. We analyze our simplification methods for a collection of seven sparse computations. The evaluation shows our methods reduce the complexity of the runtime inspectors significantly. 
Experimental results for a collection of five large matrices show parallel speedups ranging from 2x to more than 8x running on an 8-core CPU. BibTeX: @inproceedings{Mohammadi2019, author = {Mohammadi, Mahdi Soltan and Yuki, Tomofumi and Cheshmi, Kazem and Davis, Eddie C. and Hall, Mary and Dehnavi, Maryam Mehri and Nandy, Payal and Olschanowsky, Catherine and Venkat, Anand and Strout, Michelle Mills}, title = {Sparse Computation Data Dependence Simplification for Efficient Compiler-generated Inspectors}, booktitle = {Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation}, publisher = {ACM}, year = {2019}, pages = {594--609}, doi = {10.1145/3314221.3314646} } Montagne E and Surós R (2019), "Systolic Sparse Matrix Vector Multiply in the Age of TPUs and Accelerators", In 2019 Spring Simulation Conference (SpringSim)., April, 2019. , pp. 1-10. [Abstract] [BibTeX] [DOI] Abstract: Tensor Processing Units have brought back systolic arrays as a computational alternative to high performance computing. Recently Google presented a Tensor Processing Unit for handling matrix multiplication using systolic arrays. This unit is designed for dense matrices only. As they stated, sparse architectural support was omitted momentarily but they will focus on sparsity in future designs. We propose a systolic array to compute the Sparse Matrix Vector product in T2(n) ≈ ⌈nnz/2⌉ + 2n + 2 using 2n + 2 processing elements. The systolic array we propose also uses accumulators to collect the partial results of the resulting vector and supports adapting tiling. BibTeX: @inproceedings{Montagne2019, author = {Montagne, E. 
and Surós, R.}, title = {Systolic Sparse Matrix Vector Multiply in the Age of TPUs and Accelerators}, booktitle = {2019 Spring Simulation Conference (SpringSim)}, year = {2019}, pages = {1--10}, doi = {10.23919/SpringSim.2019.8732860} } Montoison A and Orban D (2019), "BiLQ: An Iterative Method for Nonsymmetric Linear Systems with a Quasi-Minimum Error Property", October, 2019. [Abstract] [BibTeX] [DOI] Abstract: We introduce an iterative method named BiLQ for solving general square linear systems Ax = b based on the Lanczos biorthogonalization process defined by least-norm subproblems, and that is a natural companion to BiCG and QMR. Whereas the BiCG (Fletcher, 1976), CGS (Sonneveld, 1989) and BiCGSTAB (van der Vorst, 1992) iterates may not exist when the tridiagonal projection of A is singular, BiLQ is reliable on compatible systems even if A is ill-conditioned or rank deficient. As in the symmetric case, the BiCG residual is often smaller than the BiLQ residual and, when the BiCG iterate exists, an inexpensive transfer from the BiLQ iterate is possible. Although the Euclidean norm of the BiLQ error is usually not monotonic, it is monotonic in a different norm that depends on the Lanczos vectors. We establish a similar property for the QMR (Freund and Nachtigal, 1991) residual. BiLQ combines with QMR to take advantage of two initial vectors and solve a system and an adjoint system simultaneously at a cost similar to that of applying either method. We derive an analogous combination of USYMLQ and USYMQR based on the orthogonal tridiagonalization process (Saunders, Simon, and Yip, 1988). The resulting combinations, named BiLQR and TriLQR, may be used to estimate integral functionals involving the solution of a primal and an adjoint system. We compare BiLQR and TriLQR with MINRES-QLP on a related augmented system, which performs a comparable amount of work and requires comparable storage. 
In our experiments, BiLQR terminates earlier than TriLQR and MINRES-QLP in terms of residual and error of the primal and adjoint systems. BibTeX: @article{Montoison2019, author = {Montoison, Alexis and Orban, Dominique}, title = {BiLQ: An Iterative Method for Nonsymmetric Linear Systems with a Quasi-Minimum Error Property}, year = {2019}, doi = {10.13140/RG.2.2.18287.59042} } Osama M, Truong M, Yang C, Buluç A and Owens JD (2019), "Graph Coloring on the GPU". Thesis at: UC Davis: College of Engineering. [Abstract] [BibTeX] [URL] Abstract: We design and implement parallel graph coloring algorithms on the GPU using two different abstractions: one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2× speed-up, a geomean speed-up of 1.3× and produces 1.6× more colors over previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby's algorithm produces 1.9× fewer colors than the previous state-of-the-art parallel implementation at the cost of 3× extra runtime, and 1.014× fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6×. BibTeX: @techreport{Muhammad2019, author = {Osama, Muhammad and Truong, Minh and Yang, Carl and Buluç, Aydın and Owens, John D.}, title = {Graph Coloring on the GPU}, school = {UC Davis: College of Engineering}, year = {2019}, url = {https://escholarship.org/uc/item/6kp4p18t} } Mukkara A, Beckmann N and Sanchez D (2019), "PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates", The 52nd Annual IEEE/ACM International Symposium on Microarchitecture., In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 
Columbus, OH, USA, October, 2019. , pp. 14. [Abstract] [BibTeX] Abstract: Many applications perform frequent scatter update operations to large data structures. For example, in push-style graph algorithms, processing each vertex requires updating the data of all its neighbors. Neighbors are often scattered over the whole graph, so these scatter updates have poor spatial and temporal locality. In current systems, scatter updates suffer high synchronization costs and high memory traffic. These drawbacks make push-style execution unattractive, and, when algorithms allow it, programmers gravitate towards pull-style implementations based on gather reads instead. \ We present PHI, a push cache hierarchy that makes scatter updates synchronization- and bandwidth-efficient. PHI adds support for pushing sparse, commutative updates from cores towards main memory. PHI adds simple compute logic at each cache level to buffer and coalesce these commutative updates throughout the hierarchy. This avoids synchronization, exploits temporal locality, and produces a load balanced execution. Moreover, PHI exploits spatial locality by selectively deferring updates with poor spatial locality, batching them to achieve sequential main memory transfers. \ PHI is the first system to leverage both the temporal and spatial locality benefits of commutative scatter updates, some of which do not apply to gather reads. As a result, PHI not only makes push algorithms efficient, but makes them consistently faster than pull ones. We evaluate PHI on graph algorithms and other sparse applications processing large inputs. PHI improves performance by 4.7× on average (and by up to 11×), and reduces memory traffic by 2× (and by up to 5×). 
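The push/pull distinction in the abstract above can be illustrated with a small sequential sketch. The function names and the adjacency-list layout are assumptions made for illustration; this shows only why commutative scatter updates and gather reads compute the same result, not how PHI's cache hierarchy works.

```python
def push_step(out_edges, contrib):
    """Scatter: every source vertex adds its contribution to each out-neighbor.
    In a parallel setting, these += updates are the ones that need synchronization."""
    acc = [0.0] * len(out_edges)
    for u, nbrs in enumerate(out_edges):
        for v in nbrs:
            acc[v] += contrib[u]
    return acc

def pull_step(out_edges, contrib):
    """Gather: every destination vertex sums the contributions of its in-neighbors.
    Each vertex writes only its own entry, so no synchronization is needed."""
    n = len(out_edges)
    in_edges = [[] for _ in range(n)]
    for u, nbrs in enumerate(out_edges):
        for v in nbrs:
            in_edges[v].append(u)          # transpose the adjacency lists
    return [sum(contrib[u] for u in in_edges[v]) for v in range(n)]

# Tiny directed graph: vertex u points at the vertices in out_edges[u].
out_edges = [[1, 2], [2], [0], [0, 2]]
contrib = [1.0, 2.0, 4.0, 8.0]
pushed = push_step(out_edges, contrib)
pulled = pull_step(out_edges, contrib)
```

Because addition is commutative, the order in which the scatter updates arrive does not matter, which is the property that allows updates to be buffered and coalesced before they reach memory.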
BibTeX: @inproceedings{Mukkara2019, author = {Mukkara, Anurag and Beckmann, Nathan and Sanchez, Daniel}, title = {PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates}, booktitle = {Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, journal = {The 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, year = {2019}, pages = {14} } Muro R, Fujii A and Tanaka T (2019), "Acceleration of Symmetric Sparse Matrix-Vector Product Using Improved Hierarchical Diagonal Blocking Format", In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. New York, NY, USA , pp. 63-70. ACM. [Abstract] [BibTeX] [DOI] Abstract: In a previous study, Guy et al. proposed sparse matrix-vector product (SpMV) acceleration using the Hierarchical Diagonal Blocking (HDB) format, which recursively repeats partitioning, reordering, and blocking on a symmetric sparse matrix. The HDB format stores a sparse matrix hierarchically using a tree structure. Each node of the tree stores small sparse matrices in CSR format.\ In the present study, we examined two problems with the HDB format and provided a solution for each problem.\ First, SpMV using the HDB format has a partial dependency among hierarchies. The problem with the HDB format is that the parallelism of computation decreases as the hierarchy of nodes gets closer to the root. Thus, we propose cutting the dependency using work vectors to solve this problem.\ Second, each node of the conventional HDB format is stored in Compressed Sparse Row (CSR) format. Block Compressed Sparse Row (BSR) format is often faster than CSR format in SpMV performance. 
Thus, we also evaluated the effectiveness of our proposed method with work vectors for the BSR-HDB format.\ In addition, we compare the performance of the general formats (CSR format, BSR format) using the Intel Math Kernel Library (MKL), the conventional HDB format, and the expanded HDB format on 22 sparse matrices from various fields. The results showed that SpMV performance was highest with our expanded HDB format for 19 of the sparse matrices, which was 1.99 times faster than the CSR format. BibTeX: @inproceedings{Muro2019, author = {Muro, Ryo and Fujii, Akihiro and Tanaka, Teruo}, title = {Acceleration of Symmetric Sparse Matrix-Vector Product Using Improved Hierarchical Diagonal Blocking Format}, booktitle = {Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region}, publisher = {ACM}, year = {2019}, pages = {63--70}, doi = {10.1145/3293320.3293332} } Mutlu BO, Kestor G, Cristal A, Unsal O and Krishnamoorthy S (2019), "Ground-Truth Prediction to Accelerate Soft-Error Impact Analysis for Iterative Methods", In Proceedings of the 26th IEEE International Conference on High Performance Computing, Data, and Analytics., December, 2019. , pp. 333-344. [Abstract] [BibTeX] [DOI] Abstract: Understanding the impact of soft errors on applications can be expensive. Often, it requires an extensive error injection campaign involving numerous runs of the full application in the presence of errors. In this paper, we present a novel approach to arriving at the ground truth (the true impact of an error on the final output) for iterative methods by observing a small number of iterations to learn deviations between normal and error-impacted execution. We develop a machine learning based predictor for three iterative methods to generate ground-truth results without running them to completion for every error injected. 
We demonstrate that this approach achieves greater accuracy than alternative prediction strategies, including three existing soft error detection strategies. We demonstrate the effectiveness of the ground truth prediction model in evaluating vulnerability and the effectiveness of soft error detection strategies in the context of iterative methods. BibTeX: @inproceedings{Mutlu2019, author = {B. O. Mutlu and G. Kestor and A. Cristal and O. Unsal and S. Krishnamoorthy}, title = {Ground-Truth Prediction to Accelerate Soft-Error Impact Analysis for Iterative Methods}, booktitle = {Proceedings of the 26th IEEE International Conference on High Performance Computing, Data, and Analytics}, year = {2019}, pages = {333-344}, doi = {10.1109/HiPC.2019.00048} } Nagasaka Y, Matsuoka S, Azad A and Buluç A (2019), "Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors", Parallel Computing., August, 2019. , pp. 102545. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. 
Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. We build the performance model for hash-table and heap-based algorithms, which supports the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6×. BibTeX: @article{Nagasaka2019, author = {Nagasaka, Yusuke and Matsuoka, Satoshi and Azad, Ariful and Buluç, Aydın}, title = {Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors}, journal = {Parallel Computing}, publisher = {Elsevier BV}, year = {2019}, pages = {102545}, doi = {10.1016/j.parco.2019.102545} } Nataf F (2019), "Adaptive Domain Decomposition method for Saddle Point problem in Matrix Form" [Abstract] [BibTeX] Abstract: We introduce an adaptive domain decomposition (DD) method for solving saddle point problems defined as a block two by two matrix. The algorithm does not require any knowledge of the constrained space. We assume that all sub matrices are sparse and that the diagonal blocks are the sum of positive semi definite matrices. The latter assumption enables the design of adaptive coarse space for DD methods. 
BibTeX: @article{Nataf2019, author = {Nataf, F.}, title = {Adaptive Domain Decomposition method for Saddle Point problem in Matrix Form}, year = {2019} } Nelson L, Bornholt J, Gu R, Baumann A, Torlak E and Wang X (2019), "Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval", In Proceedings of the 27th ACM Symposium on Operating Systems Principles. New York, NY, USA , pp. 225-242. ACM. [Abstract] [BibTeX] [DOI] [URL] Abstract: This paper presents Serval, a framework for developing automated verifiers for systems software. Serval provides an extensible infrastructure for creating verifiers by lifting interpreters under symbolic evaluation, and a systematic approach to identifying and repairing verification performance bottlenecks using symbolic profiling and optimizations.\ Using Serval, we build automated verifiers for the RISC-V, x86--32, LLVM, and BPF instruction sets. We report our experience of retrofitting CertiKOS and Komodo, two systems previously verified using Coq and Dafny, respectively, for automated verification using Serval, and discuss trade-offs of different verification methodologies. In addition, we apply Serval to the Keystone security monitor and the BPF compilers in the Linux kernel, and uncover 18 new bugs through verification, all confirmed and fixed by developers. 
BibTeX: @inproceedings{Nelson2019, author = {Nelson, Luke and Bornholt, James and Gu, Ronghui and Baumann, Andrew and Torlak, Emina and Wang, Xi}, title = {Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles}, publisher = {ACM}, year = {2019}, pages = {225--242}, url = {http://doi.acm.org/10.1145/3341301.3359641}, doi = {10.1145/3341301.3359641} } Nie Q and Malik S (2019), "SpFlow: Memory-Driven Data Flow Optimization for Sparse Matrix-Matrix Multiplication", In Proceedings of the IEEE International Symposium on Circuits and Systems., May, 2019. , pp. 1-5. [Abstract] [BibTeX] [DOI] Abstract: To improve the performance of sparse matrix-matrix multiplication (SpMM) running on a specialized architecture, orchestrating a data flow that maximizes data reuse in local memory is critical but challenging due to the irregular non-zero element locations and the wide range of sparsity. In this work, we proposed SpFlow, a memory-driven data flow optimization framework for SpMM. SpFlow can realize 54X fewer DRAM accesses and 97X fewer SRAM accesses on average than a GPU running the cuSPARSE kernel. And in comparison with a state-of-the-art accelerator, the performance can be improved by 3X, and SRAM accesses reduced by 5X on average. BibTeX: @inproceedings{Nie2019, author = {Nie, Q. and Malik, S.}, title = {SpFlow: Memory-Driven Data Flow Optimization for Sparse Matrix-Matrix Multiplication}, booktitle = {Proceedings of the IEEE International Symposium on Circuits and Systems}, year = {2019}, pages = {1--5}, doi = {10.1109/ISCAS.2019.8702111} } Nie J, Zhang C, Zou D, Xia F, Lu L, Wang X and Zhao F (2019), "Adaptive Sparse Matrix-Vector Multiplication on CPU-GPU Heterogeneous Architecture", In Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference. ACM Press. 
[Abstract] [BibTeX] [DOI] Abstract: SpMV is the core algorithm in solving sparse linear equations, which is widely used in many research and engineering application fields. The GPU is the most common coprocessor in the high-performance computing domain, and has already proven its practical value to researchers in accelerating various algorithms. A lot of related work has been carried out to optimize parallel SpMV on CPU-GPU platforms, which mainly focuses on reducing the computing overhead on the GPU, including branch divergence and cache misses, while little attention has been paid to the overall efficiency of the heterogeneous platform. In this paper, we describe the design and implementation of an adaptive sparse matrix-vector multiplication (SpMV) on CPU-GPU heterogeneous architecture. We propose a dynamic task scheduling framework for CPU-GPU platforms to improve the utilization of both CPU and GPU. A double buffering scheme is also presented to hide the data transfer overhead between CPU and GPU. Two deeply optimized SpMV kernels are deployed for CPU and GPU respectively. The evaluation on typical sparse matrices indicates that the proposed algorithm obtains both a significant performance increase and adaptability to different types of sparse matrices. BibTeX: @inproceedings{Nie2019a, author = {Nie, Jing and Zhang, Chunlei and Zou, Dan and Xia, Fei and Lu, Lina and Wang, Xiang and Zhao, Fei}, title = {Adaptive Sparse Matrix-Vector Multiplication on CPU-GPU Heterogeneous Architecture}, booktitle = {Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference}, publisher = {ACM Press}, year = {2019}, doi = {10.1145/3341069.3341072} } Nisa I, Li J, Sukumaran-Rajam A, Vuduc R and Sadayappan P (2019), "Load-Balanced Sparse MTTKRP on GPUs", In Proceedings of the 2019 International Parallel and Distributed Processing Symposium. 
[Abstract] [BibTeX] Abstract: Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that of utilizing the much greater degree of parallelism in a load-balanced fashion for irregular computations like sparse MTTKRP. To address this issue, we develop a new storage-efficient representation for tensors that enables high-performance, load-balanced execution of MTTKRP on GPUs. A GPU implementation of sparse MTTKRP using the new sparse tensor representation is shown to outperform all currently known parallel sparse CPU and GPU MTTKRP implementations. BibTeX: @inproceedings{Nisa2019, author = {Nisa, Israt and Li, Jiajia and Sukumaran-Rajam, Aravind and Vuduc, Richard and Sadayappan, P.}, title = {Load-Balanced Sparse MTTKRP on GPUs}, booktitle = {Proceedings of the 2019 International Parallel and Distributed Processing Symposium}, year = {2019} } Nisa IJ (2019), "Architecture-aware Algorithm Design of Sparse Tensor/Matrix Primitives for GPUs". Thesis at: Ohio State University. [Abstract] [BibTeX] Abstract: Sparse matrix/tensor operations have been a common computational motif in a wide spectrum of domains - numerical linear algebra, graph analytics, machine learning, health-care, etc. Sparse kernels play a key role in numerous machine learning algorithms and the rising popularity of this domain increases the significance of the primitives like SpMV (Sparse Matrix-Vector Multiplication), SDDMM (Sampled Dense-Dense Matrix Multiplication), MF/TF(Sparse Matrix/Tensor Factorization), etc. 
These primitives are data-parallel and highly suitable for GPU-like architectures that provide massive parallelism. Real-world matrices and tensors are large-scale and have millions of data points, which is sufficient to utilize all the cores of a GPU. Yet, a data parallel algorithm can become the bottleneck of an application and perform well below the upper bound of the roofline model. Some common reasons are frequent irregular global memory access, low data reuse, and imbalanced work distribution. However, efficient utilization of the GPU memory hierarchy, reduced thread communication, increased data locality, and an even workload distribution can provide ample opportunities for significant performance improvement. The challenge lies in utilizing the techniques across applications and achieving even performance in spite of the irregularity of the input matrices or tensors. In this work, we systematically identify the performance bottlenecks of the important sparse algorithms and provide optimized and high performing solutions.\ At the beginning of this dissertation, we explore the application of cost-effective ML techniques in solving the format selection and performance modeling problem in the SpMV domain. By identifying a small set of sparse matrix features to use in training the ML models, we are able to select the best storage format, and predict the execution time of an SpMV kernel as well. Next, we optimize the SDDMM kernel, which is a key bottleneck in factor analysis and topic modeling algorithms like ALS, LDA, GaP, etc. The performance constraints are addressed by exploiting data reuse and increasing parallelism using virtual warping, multi-level tiling and effective use of on-chip memory (shared memory), etc. The remaining work concerns the optimization of factorization techniques for sparse matrices and tensors on GPUs. 
For matrix factorization, we optimize the cyclic coordinate descent (CCD++), which is the state-of-the-art factorization method. An efficient GPU implementation is devised by using kernel fusion, tiling and binning. Next, we extend the optimization of the factorization problem to higher order data - tensor. MTTKRP (Matricized Tensor Times Khatri-Rao Products) is a key bottleneck of one of the most common tensor factorization techniques - CPD (CANDECOMP/PARAFAC decomposition). We develop new storage-efficient representations, B-CSF and HB-CSF, for tensors that enable high-performance and load-balanced execution of MTTKRP on GPUs. However, for a tensor with d modes, CPD requires a sequence of d tensor computations. To guarantee efficient memory access with respect to different modes, many storage formats store d distinct representations despite d-fold space overhead. Hence, we devise MM-CSF, a compact mixed-mode representation where better performance is achieved compared to existing solutions while utilizing a small fraction of the space. BibTeX: @phdthesis{Nisa2019a, author = {Nisa, Israt J.}, title = {Architecture-aware Algorithm Design of Sparse Tensor/Matrix Primitives for GPUs}, school = {Ohio State University}, year = {2019} } Nisa I, Li J, Sukumaran-Rajam A, Rawat PS, Krishnamoorthy S and Sadayappan P (2019), "An Efficient Mixed-mode Representation of Sparse Tensors", In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA , pp. 49:1-49:25. ACM. [Abstract] [BibTeX] [DOI] Abstract: The Compressed Sparse Fiber (CSF) representation for sparse tensors is a generalization of the Compressed Sparse Row (CSR) format for sparse matrices. For a tensor with d modes, typical tensor methods such as CANDECOMP/PARAFAC decomposition (CPD) require a sequence of d tensor computations, where efficient memory access with respect to different modes is required for each of them. 
The straightforward solution is to use d distinct representations of the tensor, with each one being efficient for one of the d computations. However, a d-fold space overhead is often unacceptable in practice, especially with memory-constrained GPUs. In this paper, we present a mixed-mode tensor representation that partitions the tensor's nonzero elements into disjoint sections, each of which is compressed to create fibers along a different mode. Experimental results demonstrate that better performance can be achieved while utilizing only a small fraction of the space required to keep d distinct CSF representations. BibTeX: @inproceedings{Nisa2019b, author = {Nisa, Israt and Li, Jiajia and Sukumaran-Rajam, Aravind and Rawat, Prasant Singh and Krishnamoorthy, Sriram and Sadayappan, P.}, title = {An Efficient Mixed-mode Representation of Sparse Tensors}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, publisher = {ACM}, year = {2019}, pages = {49:1--49:25}, doi = {10.1145/3295500.3356216} } Nowak I, Muts P and Hendrix EMT (2019), "Multi-Tree Decomposition Methods for Large-Scale Mixed Integer Nonlinear Optimization", In Large Scale Optimization in Supply Chains and Smart Manufacturing: Theory and Applications. Cham , pp. 27-58. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: Most industrial optimization problems are sparse and can be formulated as block-separable mixed-integer nonlinear programming (MINLP) problems, defined by linking low-dimensional sub-problems by (linear) coupling constraints. Decomposition methods solve a block-separable MINLP by alternately solving master problems and sub-problems. In practice, decomposition methods are sometimes the only possibility to compute high-quality solutions of large-scale optimization problems. However, efficient implementations may require expert knowledge and problem-specific features. 
Recently, there is renewed interest in making these methods accessible to general users by developing generic decomposition frameworks and modelling support. The focus of this chapter is on so-called multi-tree decomposition methods, which iteratively approximate the feasible area without using a single (global) branch-and-bound tree, i.e. branch-and-bound is only used for solving sub-problems. After an introduction, we first describe outer approximation (OA) decomposition methods, including the adaptive, multivariate partitioning (AMP) algorithm and the novel decomposition-based outer approximation (DECOA) algorithm. This is followed by a description of multi-tree methods using a reduced master problem for solving large-scale industrial optimization problems. The first method to be described applies parallel column generation (CG) and iterative fixing for solving nonconvex transport optimization problems with several hundred millions of variables and constraints. The second method is based on a novel approach combining CG and compact outer approximation. The last methodology to be discussed is the general Benders decomposition method for globally solving large nonconvex stochastic programs using a reduced mixed-integer programming (MIP) master problem. BibTeX: @inbook{Nowak2019, author = {Nowak, Ivo and Muts, Pavlo and Hendrix, Eligius M. T.}, editor = {Velásquez-Bermúdez, J. and Khakifirooz, M. 
and Fathi, M.}, title = {Multi-Tree Decomposition Methods for Large-Scale Mixed Integer Nonlinear Optimization}, booktitle = {Large Scale Optimization in Supply Chains and Smart Manufacturing: Theory and Applications}, publisher = {Springer International Publishing}, year = {2019}, pages = {27--58}, doi = {10.1007/978-3-030-22788-3_2} } Nurminen JK, Halvari T, Harviainen J, Mylläri J, Röyskö A, Silvennoinen J and Mikkonen T (2019), "Software Framework for Data Fault Injection to Test Machine Learning Systems", In Proceedings of the 30th International Symposium on Software Reliability Engineering. [Abstract] [BibTeX] Abstract: Data-intensive systems are sensitive to the quality of data. Data often has problems due to faulty sensors or network problems, for instance. In this work, we develop a software framework to emulate faults in data and use it to study how machine learning (ML) systems work when the data has problems. We aim for flexibility: users can use predefined or their own dedicated fault models. Likewise, different kinds of data (e.g. text, time series, video) can be used and the system under test can vary from a single ML model to a complicated software system. Our goal is to show how data faults can be emulated and how that can be used in the study and development of ML solutions. BibTeX: @inproceedings{Nurminen2019, author = {Jukka K. Nurminen and Tuomas Halvari and Juha Harviainen and Juha Mylläri and Antti Röyskö and Juuso Silvennoinen and Tommi Mikkonen}, title = {Software Framework for Data Fault Injection to Test Machine Learning Systems}, booktitle = {Proceedings of the 30th International Symposium on Software Reliability Engineering}, year = {2019} } Ozaslan IK, Pilanci M and Arikan O (2019), "Regularized Momentum Iterative Hessian Sketch for Large Scale Linear System of Equations", December, 2019. 
[Abstract] [BibTeX] Abstract: In this article, Momentum Iterative Hessian Sketch (M-IHS) techniques, a group of solvers for large scale linear Least Squares (LS) problems, are proposed and analyzed in detail. The proposed techniques are obtained by incorporating the Heavy Ball Acceleration into the Iterative Hessian Sketch algorithm and they provide significant improvements over the randomized preconditioning techniques. Through the error analyses of the M-IHS variants, lower bounds on the sketch size for various randomized distributions to converge at a pre-determined rate with a constant probability are established. The bounds present the best results in the current literature for obtaining a solution approximation and they suggest that the sketch size can be chosen proportional to the statistical dimension of the regularized problem regardless of the size of the coefficient matrix. The statistical dimension is always smaller than the rank and it gets smaller as the regularization parameter increases. By using approximate solvers along with the iterations, the M-IHS variants are capable of avoiding all matrix decompositions and inversions, which is one of the main advantages over alternative solvers such as Blendenpik and LSRN. Similar to the Chebyshev Semi-iterations, the M-IHS variants do not use any inner products and eliminate the corresponding synchronization steps in hierarchical or distributed memory systems, yet the M-IHS converges faster than the Chebyshev Semi-iteration based solvers. BibTeX: @article{Ozaslan2019, author = {Ozaslan, Ibrahim Kurban and Pilanci, Mert and Arikan, Orhan}, title = {Regularized Momentum Iterative Hessian Sketch for Large Scale Linear System of Equations}, year = {2019} } Pandey S, Li XS, Buluç A, Xu J and Liu H (2019), "H-INDEX: Hash-Indexing for Parallel Triangle Counting on GPUs", In Proceedings of the IEEE High Performance Extreme Computing Conference., September, 2019. , pp. 1-7. 
[Abstract] [BibTeX] [DOI] Abstract: Triangle counting is a graph algorithm that calculates the number of triangles involving each vertex in a graph. Briefly, a triangle encompasses three vertices from a graph, where every vertex possesses at least one incidental edge to the other two vertices from the triangle. Consequently, list intersection, which identifies the incidental edges, becomes the core algorithm for triangle counting. In the meantime, attracted by the enormous parallel computing potential of Graphics Processing Units (GPUs), numerous efforts have been devoted to deploying triangle counting algorithms on GPUs. While state-of-the-art intersection algorithms, such as merge-path and binary-search, perform well on traditional multi-core CPU systems, deploying them on massively parallel GPUs turns out to be challenging. In particular, merge-path based approach experiences the hardship of evenly distributing the workload across vast GPU threads and irregular memory accesses. Binary-search based approach often suffers from the potential problem of high time complexity. Furthermore, both approaches require sorted neighbor lists from the input graphs, which involves nontrivial preprocessing overhead. To this end, we introduce H-INDEX, a hash-indexing assisted triangle counting algorithm that overcomes all the aforementioned shortcomings. Notably, H-INDEX achieves 141.399 billion TEPS computing rate on a Protein K-mer V2a graph with 64 GPUs. To the best of our knowledge, this is the first work that advances triangle counting beyond the 100 billion TEPS rate. BibTeX: @inproceedings{Pandey2019, author = {Pandey, S. and Li, X. S. and Buluç, A. and Xu, J. 
and Liu, H.}, title = {H-INDEX: Hash-Indexing for Parallel Triangle Counting on GPUs}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2019}, pages = {1--7}, doi = {10.1109/HPEC.2019.8916492} } Peng S and Tan SXD (2019), "GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation" [Abstract] [BibTeX] Abstract: In this article, we propose a new GPU-based sparse LU factorization method, called GLU3.0, which solves the aforementioned problems. First, it introduces a much more efficient double-U dependency detection algorithm to make the detection much simpler. Second, we observe that the potential parallelism is different as the matrix factorization goes on. We then develop three different modes of GPU kernel to adapt to different stages to accommodate the computing task changes in the factorization. As a result, the new GLU can dynamically allocate GPU blocks and warps based on the number of columns in a level to better balance the computing demands and resources during the LU factorization process. Experimental results on circuit matrices from University of Florida Sparse Matrix Collection (UFL) show that the GLU3.0 can deliver 2-3 orders of magnitude speedup over GLU2.0 for the data dependency detection. Furthermore, GLU3.0 achieves 13.0× (arithmetic mean) and 6.7× (geometric mean) speedup over GLU2.0 and 7.1× (arithmetic mean) and 4.8× (geometric mean) over the recently proposed enhanced GLU2.0 sparse LU solver on the same set of circuit matrices. BibTeX: @article{Peng2019, author = {Peng, Shaoyi and Tan, Sheldon X. D.}, title = {GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation}, year = {2019} } Piccinotti D, Ramalli E, Parravicini A, Brondolin R and Santambrogio M (2019), "Solving write conflicts in GPU-accelerated graph computation: A PageRank case-study", In Proceedings of the IEEE 5th International forum on Research and Technology for Society and Industry., September, 2019. 
, pp. 144-148. [Abstract] [BibTeX] [DOI] Abstract: Graph ranking algorithms, such as PageRank, are widely used in a number of real-world applications like web search. As the size of the graphs on which these algorithms are applied gets bigger and bigger, it becomes necessary to devise powerful and flexible techniques to accelerate and parallelize the computation, both at software and hardware level. Leveraging GPUs is a promising direction due to their highly parallel computing capabilities, but execution time is often hampered by write conflicts. In this paper, we present a solution to handle write conflicts in GPU computations exploiting high level of parallelism, and show how this technique can effectively be used to accelerate the computation of PageRank by a factor of 5×, with respect to a baseline in which conflicts are not handled. Our solution is implemented at software level, and doesn't require specific hardware resources. BibTeX: @inproceedings{Piccinotti2019, author = {Piccinotti, D. and Ramalli, E. and Parravicini, A. and Brondolin, R. and Santambrogio, M.}, title = {Solving write conflicts in GPU-accelerated graph computation: A PageRank case-study}, booktitle = {Proceedings of the IEEE 5th International forum on Research and Technology for Society and Industry}, year = {2019}, pages = {144--148}, doi = {10.1109/RTSI.2019.8895572} } Quirynen R and Cairano SD (2019), "PRESAS: Block-Structured Preconditioning of Iterative Solvers within a Primal Active-Set Method for fast MPC", December, 2019. [Abstract] [BibTeX] Abstract: Model predictive control (MPC) for linear dynamical systems requires solving an optimal control structured quadratic program (QP) at each sampling instant. This paper proposes a primal active-set strategy (PRESAS) for the efficient solution of such block-sparse QPs, based on a preconditioned iterative solver to compute the search direction in each iteration. 
Rank-one factorization updates of the preconditioner result in a per-iteration computational complexity of 𝒪(N m^2), where m denotes the number of state and control variables and N the number of control intervals. Three different block-structured preconditioning techniques are presented and their numerical properties are studied further. In addition, an augmented Lagrangian based implementation is proposed to avoid a costly initialization procedure to find a primal feasible starting point. Based on a standalone C code implementation, we illustrate the computational performance of PRESAS against current state of the art QP solvers for multiple linear and nonlinear MPC case studies. We also show that the solver is real-time feasible on a dSPACE MicroAutoBox-II rapid prototyping unit for vehicle control applications, and numerical reliability is illustrated based on experimental results from a testbench of small-scale autonomous vehicles. BibTeX: @article{Quirynen2019, author = {Quirynen, Rien and Cairano, Stefano Di}, title = {PRESAS: Block-Structured Preconditioning of Iterative Solvers within a Primal Active-Set Method for fast MPC}, year = {2019} } Rais HMD, Abed SA and Watada J (2019), "Computational Comparison of Major Proposed Methods for Graph Partitioning Problem", Journal of Advanced Computational Intelligence and Intelligent Informatics. Vol. 23(1), pp. 5-17. [Abstract] [BibTeX] [DOI] Abstract: k-way graph partitioning is an NP-complete problem, which is applied to various tasks such as route planning, image segmentation, community detection, and high-performance computing. The approximate methods constitute a useful solution for these types of problems. Thus, many research studies have focused on developing meta-heuristic algorithms to tackle the graph partitioning problem. Local search is one of the earliest methods that has been applied efficiently to this type of problem. 
Recent studies have explored various types of local search methods and have improved them such that they can be used with the partitioning process. Moreover, local search methods are widely integrated with population-based approaches, to provide the best diversification and intensification for the problem space. This study emphasizes the local search approaches, as well as their combination with other graph partitioning approaches. At present, none of the surveys in the literature has focused on this class of state of the art approaches in much detail. In this study, the vital parts of these approaches including neighborhood structure, acceptance criterion, and the ways of combining them with other approaches, are highlighted. Additionally, we provide an experimental comparison that shows the variance in the performance of the reviewed methods. Hence, this study clarifies these methods to show their advantages and limitations for the targeted problem, and thus can aid in the direction of research flow towards the area of graph partitioning. BibTeX: @article{Rais2019, author = {Rais, Helmi M. D. and Abed, Saad Adnan and Watada, Junzo}, title = {Computational Comparison of Major Proposed Methods for Graph Partitioning Problem}, journal = {Journal of Advanced Computational Intelligence and Intelligent Informatics}, year = {2019}, volume = {23}, number = {1}, pages = {5--17}, doi = {10.20965/jaciii.2019.p0005} } Ramesh C (2019), "Hardware-Software Co-Design Accelerators for Sparse BLAS". Thesis at: Centre for Nano Science and Engineering (CeNSE), Indian Institute of Science, Bangalore. [Abstract] [BibTeX] Abstract: Sparse Basic Linear Algebra Subroutines (Sparse BLAS) is an important library. Sparse BLAS includes three levels of subroutines: Level 1, Level 2 and Level 3 Sparse BLAS routines. Level 1 Sparse BLAS routines do computations over sparse vector and sparse/dense vector. Level 2 deals with sparse matrix and vector operations. 
Level 3 deals with sparse matrix and dense matrix operations. The computations of these Sparse BLAS routines on General Purpose Processors (GPPs) not only suffer from low utilization of hardware resources but also take more compute time than the workload due to poor data locality of sparse vector/matrix storage formats. In the literature, tremendous efforts have been put into software to improve the performance of these Sparse BLAS routines on GPPs. GPPs are best suited to applications with high data locality, whereas Sparse BLAS routines operate on applications with less data locality; hence, GPP performance is poor. Various Custom Function Units (Hardware Accelerators) are proposed in the literature and are proved to be more efficient than software that tried to accelerate Sparse BLAS subroutines. Though existing hardware accelerators improved the Sparse BLAS performance compared to software Sparse BLAS routines, there is still a lot of scope to improve these accelerators. This thesis describes both the existing software and hardware software co-designs (HW/SW co-design) and identifies the limitations of these existing solutions. We propose a new sparse data representation called Sawtooth Compressed Row Storage (SCRS) and corresponding SpMV and SpMM algorithms. SCRS based SpMV and SpMM are performing better than existing software solutions. Even though SCRS based SpMV and SpMM algorithms perform better than existing solutions, they still could not reach theoretical peak performance. The knowledge gained from the study of limitations of these existing solutions including the proposed SCRS based SpMV and SpMM is used to propose new HW/SW co-designs. Software accelerators are limited by the hardware properties of GPPs and GPUs themselves; hence, we propose HW/SW co-designs to accelerate a few basic Sparse BLAS operations (SpVV and SpMV). Our proposed Parallel Sparse BLAS HW/SW co-design achieves near theoretical peak performance with reasonable hardware resources. 
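For readers unfamiliar with the Level 2 operation this abstract refers to, the following is a minimal sketch of SpMV on the standard Compressed Row Storage (CRS) layout. It is not the thesis's SCRS format, whose details are not given here, and the function and argument names (`csr_spmv`, `row_ptr`, `col_idx`, `values`) are illustrative assumptions. The indirect read `x[col_idx[k]]` is where the poor data locality mentioned above comes from.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in standard CRS form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i,
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    (Hypothetical helper for illustration only.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x: consecutive k may touch distant entries.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each row is independent, which is why SpMV parallelises naturally over rows; the difficulty the abstract describes lies in memory access rather than arithmetic.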
BibTeX: @phdthesis{Ramesh2019, author = {Ramesh, Chinthala}, title = {Hardware-Software Co-Design Accelerators for Sparse BLAS}, school = {Centre for Nano Science and Engineering (CeNSE), Indian Institute of Science, Bangalore}, year = {2019} } Regev S and Saunders MA (2019), "SSAI: A Symmetric Sparse Approximate Inverse Preconditioner for the Conjugate Gradient Method". Thesis at: Stanford University. [Abstract] [BibTeX] Abstract: We propose a method for solving a Hermitian positive definite linear system Ax = b, where A is an explicit sparse matrix (real or complex). A sparse approximate right inverse M is computed and replaced by M = (M + M^H)/2, which is used as a left-right preconditioner in a modified version of the preconditioned conjugate gradient (PCG) method. M is formed column by column and can therefore be computed in parallel. PCG requires only matrix-vector multiplications with A and M (not solving a linear system with the preconditioner), and so too can be carried out in parallel. We compare it with incomplete Cholesky factorization (the gold standard for PCG) and with MATLAB's backslash operator (sparse Cholesky) on matrices from various applications. BibTeX: @techreport{Regev2019, author = {Regev, Shaked and Saunders, Michael A.}, title = {SSAI: A Symmetric Sparse Approximate Inverse Preconditioner for the Conjugate Gradient Method}, school = {Stanford University}, year = {2019} } Reguly IZ, Mudalige GR, Giles MB and Maheswaran S (2019), "Improving resilience of scientific software through a domain-specific approach", Journal of Parallel and Distributed Computing. [Abstract] [BibTeX] [DOI] [URL] Abstract: In this paper we present research on improving the resilience of the execution of scientific software, an increasingly important concern in High Performance Computing (HPC). 
We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library to carry out checkpointing automatically, without the intervention of the user. The target applications are a hydrodynamics benchmark application from the Mantevo Suite, CloverLeaf 3D, the sparse linear solver proxy application TeaLeaf, and the OpenSBLI compressible Navier--Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL's Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than the reference in all cases. BibTeX: @article{Reguly2019, author = {Reguly, I. Z. and Mudalige, G. R. and Giles, M. B. and Maheswaran, S.}, title = {Improving resilience of scientific software through a domain-specific approach}, journal = {Journal of Parallel and Distributed Computing}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S0743731519300917}, doi = {10.1016/j.jpdc.2019.01.015} } Regunta SC, Tondomker SH and Kothapalli K (2019), "BRICS -- Efficient Techniques for Estimating the Farness-Centrality in Parallel", In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops., May, 2019. , pp. 645-654. 
[Abstract] [BibTeX] [DOI] Abstract: In this paper, we study scalable parallel algorithms for estimating the farness-centrality value of the nodes in a given undirected and connected graph. Our algorithms consider approaches that are more suitable for sparse graphs. To this end, we propose four optimization techniques based on removing redundant nodes, removing identical nodes, removing chain nodes, and making use of decomposition based on the biconnected components of the input graph. We test our techniques on a collection of real-world graphs for the time taken and the average error percentage. We further analyze the applicability of our techniques on various classes of real-world graphs. We suggest why certain techniques work better on certain classes of graphs. BibTeX: @inproceedings{Regunta2019, author = {Regunta, S. C. and Tondomker, S. H. and Kothapalli, K.}, title = {BRICS -- Efficient Techniques for Estimating the Farness-Centrality in Parallel}, booktitle = {2019 IEEE International Parallel and Distributed Processing Symposium Workshops}, year = {2019}, pages = {645--654}, doi = {10.1109/IPDPSW.2019.00110} } Ren Y, Meng L and Zhang J (2019), "Scalable Heterogeneous Social Network Alignment through Synergistic Graph Partition", December, 2019. [Abstract] [BibTeX] Abstract: Social network alignment has been an important research problem for social network analysis in recent years. With the identified shared users across networks, it will provide researchers with the opportunity to achieve a more comprehensive understanding of users' social activities both within and across networks. Social network alignment is a very difficult problem. Besides the challenges introduced by the network heterogeneity, the network alignment problem can be reduced to a combinatorial optimization problem with an extremely large search space. The learning effectiveness and efficiency of existing alignment models will be degraded significantly as the network size increases. 
In this paper, we will focus on studying the scalable heterogeneous social network alignment problem, and propose to address it with a novel two-stage network alignment model, namely Scalable Heterogeneous Network Alignment (SHNA). Based on a group of intra- and inter-network meta diagrams, SHNA first partitions the social networks into a group of sub-networks synergistically. Via the partially known anchor links, SHNA will extract the partitioned sub-network correspondence relationships. Instead of aligning the complete input network, SHNA proposes to identify the anchor links between the matched sub-network pairs, while those between the unmatched sub-networks will be pruned to effectively shrink the search space. Extensive experiments have been done to compare SHNA with the state-of-the-art baseline methods on a real-world aligned social networks dataset. The experimental results have demonstrated both the effectiveness and efficiency of the model in addressing the problem. BibTeX: @article{Ren2019, author = {Yuxiang Ren and Lin Meng and Jiawei Zhang}, title = {Scalable Heterogeneous Social Network Alignment through Synergistic Graph Partition}, year = {2019} } Ribizel T and Anzt H (2019), "Approximate and Exact Selection on GPUs", In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops., May, 2019. , pp. 471-478. [Abstract] [BibTeX] [DOI] Abstract: We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for - and exploiting the characteristics of - "pleasant" data distributions. 
At the same time, as the SampleSelect does not work on the actual values but the ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. Additionally to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy. BibTeX: @inproceedings{Ribizel2019, author = {Ribizel, T. and Anzt, H.}, title = {Approximate and Exact Selection on GPUs}, booktitle = {2019 IEEE International Parallel and Distributed Processing Symposium Workshops}, year = {2019}, pages = {471--478}, doi = {10.1109/IPDPSW.2019.00088} } Ribizel T and Anzt H (2019), "Parallel Selection on GPUs", Parallel Computing. , pp. 102588. [Abstract] [BibTeX] [DOI] [URL] Abstract: We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always leveraging the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – “pleasant” data distributions. At the same time, as the proposed SampleSelect algorithm does not work on the actual element values but on the element ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy. 
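The general idea of sample-based selection described in this abstract can be sketched sequentially as follows. This is only a rough illustration under stated assumptions, not the authors' GPU implementation: the function name `sample_select`, the parameters `num_buckets` and `base_case`, and the oversampling factor of 4 are all hypothetical choices. The sketch draws a random sample, picks splitters from it, distributes elements into buckets using comparisons only, and continues in the single bucket that contains the requested rank.

```python
import bisect
import random


def sample_select(data, k, num_buckets=8, base_case=32, seed=0):
    """Return the k-th smallest element (0-based) of data.

    Illustrative sequential sketch of sample-based bucket selection.
    """
    if not 0 <= k < len(data):
        raise IndexError("rank k out of range")
    rng = random.Random(seed)
    items = list(data)
    while len(items) > base_case:
        # Draw a small random sample and take evenly spaced splitters from it.
        sample = sorted(rng.choice(items) for _ in range(4 * num_buckets))
        step = len(sample) // num_buckets
        splitters = sorted(set(sample[step::step]))
        # Bucket index = number of splitters <= v, so buckets are ordered.
        buckets = [[] for _ in range(len(splitters) + 1)]
        for v in items:
            buckets[bisect.bisect_right(splitters, v)].append(v)
        # Walk the bucket sizes to find the bucket holding rank k.
        for bucket in buckets:
            if k < len(bucket):
                break
            k -= len(bucket)
        if len(bucket) == len(items):
            break  # no progress (e.g. all elements equal): sort what is left
        items = bucket
    return sorted(items)[k]


rng = random.Random(1)
numbers = [rng.randrange(200) for _ in range(1000)]
median = sample_select(numbers, len(numbers) // 2)
```

Because only comparisons against splitters are used, the bucket sizes depend on the ranks of the elements rather than their values, which is the property the abstract credits for robustness to adversarial distributions.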
BibTeX: @article{Ribizel2019a, author = {Ribizel, Tobias and Anzt, Hartwig}, title = {Parallel Selection on GPUs}, journal = {Parallel Computing}, year = {2019}, pages = {102588}, url = {http://www.sciencedirect.com/science/article/pii/S0167819119301796}, doi = {10.1016/j.parco.2019.102588} } Sadi F, Sweeney J, Low TM, Hoe JC, Pileggi L and Franchetti F (2019), "Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-way Merge Parallelization", In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, USA , pp. 347-358. ACM. [Abstract] [BibTeX] [DOI] Abstract: The importance of Sparse Matrix dense Vector multiplication (SpMV) operation in graph analytics and numerous scientific applications has led to development of custom accelerators that are intended to overcome the difficulties of sparse data operations on general purpose architectures. However, efficient SpMV operation on large problem (i.e. working set exceeds on-chip storage) is severely constrained due to strong dependence on limited amount of fast random access memory to scale. Additionally, unstructured matrix with high sparsity pose difficulties as most solutions rely on exploitation of data locality. This work presents an algorithm co-optimized scalable hardware architecture that can efficiently operate on very large ( billion nodes) and/or highly sparse (avg. degree <10) graphs with significantly less on-chip fast memory than existing solutions. A novel parallelization methodology for implementing large and high throughput multi-way merge network is the key enabler of this high performance SpMV accelerator. Additionally, a data compression scheme to reduce off-chip traffic and special computation for nodes with exceptionally large number of edges, commonly found in power-law graphs, are presented. This accelerator is demonstrated with 16-nm fabricated ASIC and Stratix 10 FPGA platforms. 
Experimental results show more than an order of magnitude improvement over current custom hardware solutions and more than two orders of magnitude improvement over commercial off-the-shelf (COTS) architectures for both performance and energy efficiency. BibTeX: @inproceedings{Sadi2019, author = {Sadi, Fazle and Sweeney, Joe and Low, Tze Meng and Hoe, James C. and Pileggi, Larry and Franchetti, Franz}, title = {Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-way Merge Parallelization}, booktitle = {Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, publisher = {ACM}, year = {2019}, pages = {347--358}, doi = {10.1145/3352460.3358330} } Sahasrabudhe D, Phipps ET, Rajamanickam S and Berzins M (2019), "A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures", In Proceedings of the 6th Workshop on Accelerator Programming Using Directives. Denver, Colorado , pp. 1-23. ACM. [Abstract] [BibTeX] Abstract: As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (SIMD) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. 
The new SIMD primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectorized baseline, and to evaluate the new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across cpus and gpus without any performance degradation being observed experimentally. BibTeX: @inproceedings{Sahasrabudhe2019, author = {Sahasrabudhe, Damodar and Phipps, Eric T. and Rajamanickam, Sivasankaran and Berzins, Martin}, title = {A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures}, booktitle = {Proceedings of the 6th Workshop on Accelerator Programming Using Directives}, publisher = {ACM}, year = {2019}, pages = {1--23} } Sao P, Li XS and Vuduc R (2019), "A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems", Journal of Parallel and Distributed Computing. [Abstract] [BibTeX] [DOI] [URL] Abstract: We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O(log n) and latency by a factor of O(log n). For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O(n^1/3) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. 
Our new 3D code achieves empirical speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24× for planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration. BibTeX: @article{Sao2019, author = {Sao, Piyush and Li, Xiaoye S. and Vuduc, Richard}, title = {A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems}, journal = {Journal of Parallel and Distributed Computing}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S0743731518305197}, doi = {10.1016/j.jpdc.2019.03.004} } Sao P, Kannan R, Li XS and Vuduc R (2019), "A Communication-avoiding 3D Sparse Triangular Solver", In Proceedings of the ACM International Conference on Supercomputing. New York, NY, USA , pp. 127-137. ACM. [Abstract] [BibTeX] [DOI] Abstract: We present a novel distributed memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n^1/4) and O(n^1/6) for problems arising from the finite element discretizations of 2D "planar" and 3D "non-planar" PDEs, respectively. 
We implement our algorithm for use in SuperLU_DIST3D, using a hybrid MPI+OpenMP programming model. Our 3D triangular solve algorithm, when run on 12k cores of Cray XC30, outperforms the current state-of-the-art 2D algorithm by 7.2× for planar and 2.7× for the non-planar sparse matrices, respectively. BibTeX: @inproceedings{Sao2019a, author = {Sao, Piyush and Kannan, Ramakrishnan and Li, Xiaoye Sherry and Vuduc, Richard}, title = {A Communication-avoiding 3D Sparse Triangular Solver}, booktitle = {Proceedings of the ACM International Conference on Supercomputing}, publisher = {ACM}, year = {2019}, pages = {127--137}, doi = {10.1145/3330345.3330357} } Scott J and Tůma M (2019), "Sparse stretching for solving sparse-dense linear least-squares problems", SIAM Journal on Scientific Computing. [Abstract] [BibTeX] Abstract: Large-scale linear least-squares problems arise in a wide range of practical applications. In some cases, the system matrix contains a small number of dense rows. These make the problem significantly harder to solve because their presence limits the direct applicability of sparse matrix techniques. In particular, the normal matrix is (close to) dense, so that forming it is impractical. One way to help overcome the dense row problem is to employ matrix stretching. Stretching is a sparse matrix technique that improves sparsity by making the least-squares problem larger. We show that standard stretching can still result in the normal matrix for the stretched problem having an unacceptably large amount of fill. This motivates us to propose a new sparse stretching strategy that performs the stretching so as to limit the fill in the normal matrix and its Cholesky factor. Numerical examples from real problems are used to illustrate the potential gains. BibTeX: @article{Scott2019, author = {Scott, J. 
and Tůma, M.}, title = {Sparse stretching for solving sparse-dense linear least-squares problems}, journal = {SIAM Journal on Scientific Computing}, year = {2019} } di Serafino D and Orban D (2019), "Constraint-Preconditioned Krylov Solvers for Regularized Saddle-Point Systems", October, 2019. [Abstract] [BibTeX] [DOI] Abstract: We consider the iterative solution of regularized saddle-point systems. When the leading block is symmetric and positive semi-definite on an appropriate subspace, Dollar, Gould, Schilders, and Wathen (2006) describe how to apply the conjugate gradient (CG) method coupled with a constraint preconditioner, a choice that has proved to be effective in optimization applications. We investigate the design of constraint-preconditioned variants of other Krylov methods for regularized systems by focusing on the underlying basis-generation process. We build upon principles laid out by Gould, Orban, and Rees (2014) to provide general guidelines that allow us to specialize any Krylov method to regularized saddle-point systems. In particular, we obtain constraint-preconditioned variants of Lanczos and Arnoldi-based methods, including the Lanczos version of CG, MINRES, SYMMLQ, GMRES(m) and DQGMRES. We also provide MATLAB implementations in hopes that they are useful as a basis for the development of more sophisticated software. Finally, we illustrate the numerical behavior of constraint-preconditioned Krylov solvers using symmetric and nonsymmetric systems arising from constrained optimization. BibTeX: @article{Serafino2019, author = {di Serafino, Daniela and Orban, Dominique}, title = {Constraint-Preconditioned Krylov Solvers for Regularized Saddle-Point Systems}, year = {2019}, doi = {10.5281/zenodo.3473542} } Shaiek H, Tomov S, Ayala A, Haidar A and Dongarra J (2019), "GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems", EuroMPI'19 Posters, Zurich, Switzerland.. Thesis at: ICL., September, 2019. 
(icl-ut-19-06) [Abstract] [BibTeX] Abstract: Fast Fourier transforms (FFTs) are used in applications ranging from molecular dynamics and spectrum estimation to machine learning, fast convolution and correlation, signal modulation, wireless multimedia applications, and others. However, FFTs are memory bound, and therefore, to accelerate them, it is crucial to avoid and optimize the FFT communications. To this end, we present a 3-D FFT design for distributed graphics processing unit (GPU) systems that: (1) efficiently uses GPUs' high bandwidth, (2) reduces global communications algorithmically, when possible, and (3) employs GPUDirect technologies as well as MPI optimizations in the development of high-performance FFTs for large-scale GPU-accelerated systems. We show that these developments and optimizations lead to very good strong scalability and a performance that is close to 90% of the theoretical peak. BibTeX: @techreport{Shaiek2019, author = {Shaiek, Hejer and Tomov, Stanimire and Ayala, Alan and Haidar, Azzam and Dongarra, Jack}, title = {GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems}, journal = {EuroMPI'19 Posters, Zurich, Switzerland}, school = {ICL}, year = {2019}, number = {icl-ut-19-06} } Sharma M, Palkar P and Mahajan A (2019), "Linearization and Parallelization Schemes for Convex MINLPs" [Abstract] [BibTeX] Abstract: We present parallelization and linearization schemes for improving state-of-the-art algorithms for convex mixed-integer nonlinear programs (MINLPs). Recently, shared-memory multicore computing systems are quite prevalent and one can effectively harness the available computing resources. On this front, we first present parallel extensions of the tree based algorithms in the MINLP solver MINOTAUR, the nonlinear branch-and-bound (NLP-BB) and the LP/NLP based branch-and-bound (QG). Next, we deploy several linearization techniques to obtain tighter relaxations in QG. 
Adding cuts only at integer feasible points in the branch-and-cut framework of QG may lead to large trees. First, we describe methods based on primal and dual information at the nodes to detect appropriate conditions for adding more inequalities. Next, we describe methods to generate tight inequalities by exploiting specific structures in the nonlinear constraint functions using iterative solution of LP and NLP relaxations in some neighbourhood of the root NLP solution. Third, we present line search based methods for finding good points for generating linear inequalities when this structure is missing. We benchmark our improvised algorithms with two parallel extensions of outer-approximation (OA) algorithm implemented in MINOTAUR that deploy a multi-threaded mixed-integer linear programming (MILP) solver. The first extension executes the MILPs within OA in parallel. Recently, some MILP solvers allow the calling programs to pass problem specific information like cuts, solutions etc. during their tree-search through callback functions. Using this (lazy cuts callback) functionality, we implement a second variant, which can also be viewed as a parallel QG algorithm primarily managed by the MILP solver (LSTOA). We present extensive computational experiments to show the encouraging individual and combined effects of parallelization and linearization techniques on the tree-based algorithms and analyze their performance alongside MILP based algorithms. BibTeX: @article{Sharma2019, author = {Meenarli Sharma and Prashant Palkar and Ashutosh Mahajan}, title = {Linearization and Parallelization Schemes for Convex MINLPs}, year = {2019} } Shi Y (2019), "Efficient Tensor Operations via Compression and Parallel Computation". Thesis at: University of California, Irvine. [Abstract] [BibTeX] [URL] Abstract: Linear algebra is the foundation of machine learning, especially for handling big data. We want to extract useful information that can represent the behavior of the data. 
For data with underlying known structures, it is straightforward to apply algorithms that maintain that structure. For instance, singular value decomposition (SVD) is one way to approximate low-rank matrices. The generalized SVD, tensor decomposition, is the crux of model estimation for tensors. However, not all data has a trivial structure. Multi-modality data that contains information from different sources can be complex and hard to extract the structure. A data-independent randomized algorithm, such as sketching, is the solution for this case. Under both scenarios, the information extraction process may be statistically challenging as the problems are non-convex optimization problems. More importantly, the large size and the high-dimensionality of the data have been significant obstacles in discovering hidden variables and summarizing them. Thus, how to improve high-dimensional data computation efficiency is vitally important. This thesis contains the theoretical analysis for learning the underlying information from high-dimensional structured or non-structured data via tensor operations such as tensor decomposition and tensor sketching. It is easy to consider tensors as multi-dimensional vectors or matrices and apply vector/matrix-based algorithms to find the solution. However, these methods omit multi-dimensionality of the data and can be computationally less efficient than considering the tensor as a whole. We show the superiority of our approximation algorithms over these methods from computation and memory efficiency points of view. This thesis also discusses optimizing tensor operation computations from the high-performance computing aspect. Conventional methods treat tensors as flattened matrices or vectors. Operations between tensors may require lots of permutations and reshapes. We propose new tensor algebra computation routines that avoid the preprocessing as much as possible. The value of this approach and its applications are recognized by NVIDIA. 
The proposed interface exists in the CUBLAS 8.0. BibTeX: @phdthesis{Shi2019, author = {Shi, Yang}, title = {Efficient Tensor Operations via Compression and Parallel Computation}, school = {University of California, Irvine}, year = {2019}, url = {https://escholarship.org/uc/item/2wm4k3sn} } Shi Z and Eryilmaz A (2019), "Cubic Regularized ADMM with Convergence to a Local Minimum in Non-convex Optimization", In Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing. [Abstract] [BibTeX] Abstract: How to escape saddle points is a critical issue in nonconvex optimization. Previous methods on this issue mainly assume that the objective function is Hessian-Lipschitz, which leaves a gap for applications using non-Hessian-Lipschitz functions. In this paper, we propose Cubic Regularized Alternating Direction Method of Multipliers (CR-ADMM) to escape saddle points of separable non-convex functions containing a non-Hessian-Lipschitz component. By carefully choosing a parameter, we prove that CR-ADMM converges to a local minimum of the original function with a rate of O(1/T^{1/3}) in time horizon T, which is faster than gradient-based methods. We also show that when one or more steps of CR-ADMM are not solved exactly, CR-ADMM can converge to a neighborhood of the local minimum. Through the experiments of matrix factorization problems, CR-ADMM is shown to have a faster rate and a lower optimality gap compared with other gradient-based methods. Our approach can also find applications in other scenarios where regularized non-convex cost minimization is performed, such as parameter optimization of deep neural networks. 
BibTeX: @inproceedings{Shi2019a, author = {Shi, Zai and Eryilmaz, Atilla}, title = {Cubic Regularized ADMM with Convergence to a Local Minimum in Non-convex Optimization}, booktitle = {Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing}, year = {2019} } Sid-Lakhdar WM, Aznaveh MM, Li XS and Demmel JW (2019), "Multitask and Transfer Learning for Autotuning Exascale Applications", August, 2019. [Abstract] [BibTeX] Abstract: Multitask learning and transfer learning have proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We aim at deriving methods following these paradigms for use in autotuning, where the goal is to find the optimal performance parameters of an application treated as a black-box function. We show comparative results with state-of-the-art autotuning techniques. For instance, we observe an average 1.5× improvement of the application runtime compared to the OpenTuner and HpBandSter autotuners. We explain how our approaches can be more suitable than some state-of-the-art autotuners for the tuning of any application in general and of expensive exascale applications in particular. BibTeX: @article{Sid-Lakhdar2019, author = {Sid-Lakhdar, Wissam M. and Aznaveh, Mohsen Mahmoudi and Li, Xiaoye S. and Demmel, James W.}, title = {Multitask and Transfer Learning for Autotuning Exascale Applications}, year = {2019} } Silvestri F and Vella F (2019), "A Computational Model for Tensor Core Units", August, 2019. [Abstract] [BibTeX] Abstract: To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. 
In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for linear algebra problems, including dense and sparse matrix multiplication, FFT, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model. BibTeX: @article{Silvestri2019, author = {Silvestri, Francesco and Vella, Flavio}, title = {A Computational Model for Tensor Core Units}, year = {2019} } Sivkov I, Lazzaro A and Hutter J (2019), "DBCSR: A Library for Dense Matrix Multiplications on Distributed GPU-Accelerated Systems", October, 2019. [Abstract] [BibTeX] Abstract: Most, if not all, modern scientific simulation packages utilize matrix algebra operations. Among the operations of linear algebra, one of the most important kernels is the multiplication of matrices, dense and sparse. Examples of application of such a kernel are in electronic structure calculations, machine learning, data mining, graph processing, and digital signal processing. Several optimized libraries exist that can achieve high-performance on distributed systems. Only a few of them target distributed GPU-accelerated systems. In most of the cases, these libraries are provided and optimized by system vendors for their specific computer systems. In this paper, we present the DBCSR library (Distributed Block Compressed Sparse Row) for the distributed dense matrix-matrix multiplications. Although the library is specifically designed for block-sparse matrix-matrix multiplications, we optimized it for the dense case on GPU-accelerated systems. We show that the DBCSR outperforms the multiplication of matrices of different sizes and shapes provided by a vendor optimized GPU version of the ScaLAPACK library up to 2.5× (1.4× on average). 
BibTeX: @article{Sivkov2019, author = {Sivkov, Ilia and Lazzaro, Alfio and Hutter, Juerg}, title = {DBCSR: A Library for Dense Matrix Multiplications on Distributed GPU-Accelerated Systems}, year = {2019} } Sivkov I, Seewald P, Lazzaro A and Hutter J (2019), "DBCSR: A Blocked Sparse Tensor Algebra Library", October, 2019. [Abstract] [BibTeX] Abstract: Advanced algorithms for large-scale electronic structure calculations are mostly based on processing multi-dimensional sparse data. Examples are sparse matrix-matrix multiplications in linear-scaling Kohn-Sham calculations or the efficient determination of the exact exchange energy. When going beyond mean field approaches, e.g. for Moller-Plesset perturbation theory, RPA and Coupled-Cluster methods, or the GW methods, it becomes necessary to manipulate higher-order sparse tensors. Very similar problems are also encountered in other domains, like signal processing, data mining, computer vision, and machine learning. With the idea that most of the tensor operations can be mapped to matrices, we have implemented sparse tensor algebra functionalities within the framework of the sparse matrix linear algebra library DBCSR (Distributed Block Compressed Sparse Row). DBCSR has been specifically designed to efficiently perform blocked-sparse matrix operations, so it becomes natural to extend its functionality to include tensor operations. We describe the newly developed tensor interface and algorithms. In particular, we introduce the tensor contraction based on a fast rectangular sparse matrix multiplication algorithm. BibTeX: @article{Sivkov2019a, author = {Sivkov, Ilia and Seewald, Patrick and Lazzaro, Alfio and Hutter, Juerg}, title = {DBCSR: A Blocked Sparse Tensor Algebra Library}, year = {2019} } Slak J and Kosec G (2019), "Medusa: A C++ Library for solving PDEs using Strong Form Mesh-Free methods", December, 2019. 
[Abstract] [BibTeX] Abstract: Medusa, a novel library for implementation of strong form mesh-free methods, is described. We identify and present common parts and patterns among many such methods reported in the literature, such as node positioning, stencil selection and stencil weight computation. Many different algorithms exist for each part and the possible combinations offer a plethora of possibilities for improvements of solution procedures that are far from fully understood. As a consequence there are still many unanswered questions in the mesh-free community resulting in vivid ongoing research in the field. Medusa implements the core mesh-free elements as independent blocks, which offers users great flexibility in experimenting with the method they are developing, as well as easily comparing it with other existing methods. The paper describes the chosen abstractions and their usage, illustrates aspects of the philosophy and design, offers some execution time benchmarks and demonstrates the application of the library on cases from linear elasticity and fluid flow in irregular 2D and 3D domains. BibTeX: @article{Slak2019, author = {Jure Slak and Gregor Kosec}, title = {Medusa: A C++ Library for solving PDEs using Strong Form Mesh-Free methods}, year = {2019} } Song L and Vicente LN (2019), "Modeling Hessian-vector products in nonlinear optimization: New Hessian-free methods", December, 2019. [Abstract] [BibTeX] Abstract: In this paper, we suggest two ways of calculating interpolation models for unconstrained smooth nonlinear optimization when Hessian-vector products are available. The main idea is to interpolate the objective function using a quadratic on a set of points around the current one and concurrently using the curvature information from products of the Hessian times appropriate vectors, possibly defined by the interpolating points. 
These enriched interpolating conditions form then an affine space of model Hessians or model Newton directions, from which a particular one can be computed once an equilibrium or least secant principle is defined. A first approach consists of recovering the Hessian matrix satisfying the enriched interpolating conditions, from which then a Newton direction model can be computed. In a second approach we pose the recovery problem directly in the Newton direction. These techniques can lead to a significant reduction in the overall number of Hessian-vector products when compared to the inexact or truncated Newton method, although simple implementations may pay a cost in linear algebra or number of function evaluations. BibTeX: @article{Song2019, author = {Lili Song and Luis Nunes Vicente}, title = {Modeling Hessian-vector products in nonlinear optimization: New Hessian-free methods}, year = {2019} } Steck D and Kanzow C (2019), "Regularization of Limited Memory Quasi-Newton Methods for Large-Scale Nonconvex Minimization", November, 2019. [Abstract] [BibTeX] Abstract: This paper deals with the unconstrained optimization of smooth objective functions. It presents a class of regularized quasi-Newton methods whose globalization turns out to be more efficient than standard line search or trust-region strategies. The focus is therefore on the solution of large-scale problems using limited memory quasi-Newton techniques. Global convergence of the regularization methods is shown under mild assumptions. The details of the regularized limited memory quasi-Newton updates are discussed including their compact representations. Numerical results using all large-scale test problems from the CUTEst collection indicate that the regularization method outperforms the standard line search limited memory BFGS method. 
BibTeX: @article{Steck2019, author = {Steck, Daniel and Kanzow, Christian}, title = {Regularization of Limited Memory Quasi-Newton Methods for Large-Scale Nonconvex Minimization}, year = {2019} } Stoll M (2019), "A literature survey of matrix methods for data science", December, 2019. [Abstract] [BibTeX] Abstract: Efficient numerical linear algebra is a core ingredient in many applications across almost all scientific and industrial disciplines. With this survey we want to illustrate that numerical linear algebra has played and is playing a crucial role in enabling and improving data science computations with many new developments being fueled by the availability of data and computing resources. BibTeX: @article{Stoll2019, author = {Martin Stoll}, title = {A literature survey of matrix methods for data science}, year = {2019} } Stramondo G, Ciobanu CB, Laat C and Varbanescu AL (2019), "Designing and building application-centric parallel memories", Concurrency and Computation: Practice and Experience., August, 2019. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Memory bandwidth is a critical performance factor for many applications and architectures. Intuitively, a parallel memory could be a good solution for any bandwidth-limited application, yet building application-centric custom parallel memories remains a challenge. In this work, we present a comprehensive approach to tackle this challenge and demonstrate how to systematically design and implement application-centric parallel memories. Specifically, our approach (1) analyzes the application memory access traces to extract parallel accesses, (2) configures our parallel memory for maximum performance, and (3) builds the actual application-centric memory system. We further provide a simple performance prediction model for the constructed memory system. We evaluate our approach with two sets of experiments. 
First, we demonstrate how our parallel memories provide performance benefits for a broad range of memory access patterns. Second, we prove the feasibility of our approach and validate our performance model by implementing and benchmarking the designed parallel memories using FPGA hardware and a sparse version of the STREAM benchmark. BibTeX: @article{Stramondo2019, author = {Stramondo, Giulio and Ciobanu, Cătălin Bogdan and Laat, Cees and Varbanescu, Ana Lucia}, title = {Designing and building application-centric parallel memories}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2019}, doi = {10.1002/cpe.5485} } Sun R (2019), "Optimization for deep learning: theory and algorithms", December, 2019. [Abstract] [BibTeX] Abstract: When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis. BibTeX: @article{Sun2019, author = {Ruoyu Sun}, title = {Optimization for deep learning: theory and algorithms}, year = {2019} } Tang Y-H, Selvitopi O, Popovici D and Buluç A (2019), "A High-Throughput Solver for Marginalized Graph Kernels on GPU", October, 2019. [Abstract] [BibTeX] Abstract: We present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. 
The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales. BibTeX: @article{Tang2019, author = {Tang, Yu-Hang and Selvitopi, Oguz and Popovici, Doru and Buluç, Aydın}, title = {A High-Throughput Solver for Marginalized Graph Kernels on GPU}, year = {2019} } Tasseff B, Coffrin C, Wächter A and Laird C (2019), "Exploring Benefits of Linear Solver Parallelism on Modern Nonlinear Optimization Applications", September, 2019. [Abstract] [BibTeX] Abstract: The advent of efficient interior point optimization methods has enabled the tractable solution of large-scale linear and nonlinear programming (NLP) problems. 
A prominent example of such a method is seen in Ipopt, a widely-used, open-source nonlinear optimization solver. Algorithmically, Ipopt depends on the use of a sparse symmetric indefinite linear system solver, which is heavily employed within the optimization of barrier subproblems. As such, the performance and reliability of Ipopt is dependent on the properties of the selected linear solver. Inspired by a trend in mathematical programming toward solving larger and more challenging NLPs, this work explores two core questions: first, how does the scalability of available linear solvers, many of which exhibit shared-memory parallelism, impact Ipopt performance; and second, does the best linear solver vary across NLP problem classes, including nonlinear network problems and problems constrained by partial differential equations? To better understand these properties, this paper first describes available open- and closed-source, serial and parallel linear solvers and the fundamental differences among them. Second, it introduces the coupling of a new open-source linear solver capable of heterogeneous parallelism over multi-core central processing units and graphics processing units. Third, it compares linear solvers using a variety of mathematical programming problems, including standard test problems for linear and nonlinear optimization, optimal power flow benchmarks, and scalable two- and three-dimensional partial differential equation and optimal control problems. Finally, linear solver recommendations are provided to maximize Ipopt performance across different application domains. 
BibTeX: @article{Tasseff2019, author = {Tasseff, Byron and Coffrin, Carleton and Wächter, Andreas and Laird, Carl}, title = {Exploring Benefits of Linear Solver Parallelism on Modern Nonlinear Optimization Applications}, year = {2019} } Thien D, Zorn B, Panchenka P and Tatlock Z (2019), "Toward Multi-Precision, Multi-Format Numerics", Proceedings of the Third International Workshop on Software Correctness for HPC Applications. [Abstract] [BibTeX] Abstract: Recent research has provided new, domain-specific number systems that accelerate modern workloads. Using these number systems effectively requires analyzing subtle multi-format, multi-precision (MPMF) code. Ideally, recent programming tools that automate numerical analysis tasks could help make MPMF programs both accurate and fast. However, three key challenges must be addressed: existing automated tools are difficult to compose due to subtle incompatibilities; there is no “gold standard” for correct MPMF execution; and no methodology exists for generalizing existing, IEEE-754-specialized tools to support MPMF. In this paper we report on recent work towards mitigating these related challenges. First, we extend the FPBench standard to support multi-precision, multi-format (MPMF) applications. Second, we present Titanic, a tool which provides reference results for arbitrary MPMF computations. Third, we describe our experience adapting an existing numerical tool to support MPMF programs. BibTeX: @article{Thien2019, author = {David Thien and Bill Zorn and Pavel Panchenka and Zachary Tatlock}, title = {Toward Multi-Precision, Multi-Format Numerics}, journal = {Proceedings of the Third International Workshop on Software Correctness for HPC Applications}, year = {2019} } Thuerck D (2019), "Stretching Jacobi: Two-Stage Pivoting in Block-Based Factorization", In Proceedings of the 9th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms., 11, 2019. , pp. 51-58. 
[Abstract] [BibTeX] [DOI] Abstract: Solving numerically tough matrices often requires full pivoting and aggressive iterative refinement even with modern direct solvers. Iterative solvers and preconditioners, preferred in parallel computing, often cannot keep up, especially novel, massively-parallel fixed-point methods. We show that even for tough, indefinite matrices, these methods can be an alternative by (a) using a blocked version and (b) introducing a data structure and algorithms for two level, global pivoting. Our approach allows register-based pivoting for high performance, batched CUDA kernels and, for the first time on GPUs, also flexible permutations on the block level. Our experiments show that these modifications help to mitigate the irregular computation stemming from pivoting. Our implementation generates fixed-point style preconditioners that can keep up with traditional, more accurate and static preconditioners - even for tough, indefinite systems. BibTeX: @inproceedings{Thuerck2019, author = {D. Thuerck}, title = {Stretching Jacobi: Two-Stage Pivoting in Block-Based Factorization}, booktitle = {Proceedings of the 9th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms}, year = {2019}, pages = {51--58}, doi = {10.1109/IA349570.2019.00014} } Tomov S, Haidar A, Ayala A, Shaiek H and Dongarra J (2019), "FFT-ECP Implementation Optimizations and Features Phase". Thesis at: Innovative Computing Laboratory, University of Tennessee., September, 2019. (FFT-ECP ST-MS-10-1440) [Abstract] [BibTeX] [URL] Abstract: The goal of this milestone was the implementation optimizations and features phase of 3-D FFTs in the FFT-ECP project. The target architectures are large-scale distributed GPU-accelerated platforms. In this milestone we describe the implementation optimizations, features, and performance of the 3-D FFTs that we developed for heterogeneous systems with GPUs. 
Specifically, this milestone delivered on the following sub-tasks: (1) Extend FFT-ECP to support various precisions, including real, and investigate the feasibility of mixed precision FFT solvers; (2) Develop support for flexible data layouts and enable the new library to handle data conversion/communication on the backend in an optimized and dynamic adaptive fashion based on the communication cost model analyzed in the previous milestone; (3) Optimize the distributed 3-D FFT-ECP solver to enable multiple FFTs per MPI process (with accelerators) and multiple GPUs per node. A main part of this milestone was the performance optimizations and the additions of features that targeted ECP applications need. The artifacts delivered include the performance optimizations and features added to the solvers, and a tuned FFT-ECP software, freely available on the FFT-ECP's Git repository hosted on Bitbucket, https://bitbucket.org/icl/heffte/. This is the first software release under the FFT-ECP project. Released is a new FFT library, called heFFTe version 0.1 (Highly Efficient FFTs for Exascale). See also the FFT-ECP website, http://icl.utk.edu/fft/ for more details on the FFT-ECP project. BibTeX: @techreport{Tomov2019, author = {Tomov, Stanimire and Haidar, Azzam and Ayala, Alan and Shaiek, Hejer and Dongarra, Jack}, title = {FFT-ECP Implementation Optimizations and Features Phase}, school = {Innovative Computing Laboratory, University of Tennessee}, year = {2019}, number = {FFT-ECP ST-MS-10-1440}, note = {revision 09-2019}, url = {https://www.icl.utk.edu/files/publications/2019/icl-utk-1263-2019.pdf} } Uçar B (2019), "Partitioning, matching, and ordering: Combinatorial scientific computing with matrices and tensors". Thesis at: École Normale Supérieure de Lyon. [Abstract] [BibTeX] Abstract: This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. 
The general research area is called combinatorial scientific computing (CSC). In CSC, the contributions have practical and theoretical flavor. For all problems discussed in this document, we have the design, analysis, and implementation of algorithms along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to the original papers for the omitted proofs. A similar approach is taken for presenting the experiments. While most results for observing theoretical findings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analysis). BibTeX: @techreport{Ucar2019, author = {Uçar, Bora}, title = {Partitioning, matching, and ordering: Combinatorial scientific computing with matrices and tensors}, school = {École Normale Supérieure de Lyon}, year = {2019} } Usman S, Mehmood R, Katib I, Albeshri A and Altowaijri SM (2019), "ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines", Mobile Networks and Applications. [Abstract] [BibTeX] [DOI] Abstract: SpMV is a vital computing operation of many scientific, engineering, economic and social applications, increasingly being used to develop timely intelligence for the design and management of smart societies. Several factors affect the performance of SpMV computations, such as matrix characteristics, storage formats, software and hardware platforms. The complexity of the computer systems is on the rise with the increasing number of cores per processor, different levels of caches, processors per node and high speed interconnect. There is an ever-growing need for new optimization techniques and efficient ways of exploiting parallelism. 
In this paper, we propose ZAKI, a data-driven, machine-learning approach and tool, to predict the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine. The aim herein is to allow application scientists to automatically obtain the best configuration, and hence the best performance, for the execution of SpMV computations. We train and test the tool using nearly 2000 real world matrices obtained from 45 application domains including computational fluid dynamics (CFD), computer vision, and robotics. The tool uses three machine learning methods, decision trees, random forest, gradient boosting, and is evaluated in depth. A discussion on the applicability of our proposed tool to energy efficiency optimization of SpMV computations is given. This is the first work where the sparsity structure of matrices have been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods. BibTeX: @article{Usman2019, author = {Usman, Sardar and Mehmood, Rashid and Katib, Iyad and Albeshri, Aiiad and Altowaijri, Saleh M.}, title = {ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines}, journal = {Mobile Networks and Applications}, year = {2019}, doi = {10.1007/s11036-019-01318-3} } Vlaski S and Sayed AH (2019), "Polynomial Escape-Time from Saddle Points in Distributed Non-Convex Optimization", In Proceedings of the 8th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing., 12, 2019. , pp. 171-175. [Abstract] [BibTeX] [DOI] Abstract: The diffusion strategy for distributed learning from streaming data employs local stochastic gradient updates along with exchange of iterates over neighborhoods. 
In this work we establish that agents cluster around a network centroid in the mean-fourth sense and proceed to study the dynamics of this point. We establish expected descent in non-convex environments in the large-gradient regime and introduce a short-term model to examine the dynamics over finite-time horizons. Using this model, we establish that the diffusion strategy is able to escape from strict saddle-points in O(1/μ) iterations, where μ denotes the step-size; it is also able to return approximately second-order stationary points in a polynomial number of iterations. Relative to prior works on the polynomial escape from saddle-points, most of which focus on centralized perturbed or stochastic gradient descent, our approach requires less restrictive conditions on the gradient noise process. BibTeX: @inproceedings{Vlaski2019, author = {S. Vlaski and A. H. Sayed}, title = {Polynomial Escape-Time from Saddle Points in Distributed Non-Convex Optimization}, booktitle = {Proceedings of the 8th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing}, year = {2019}, pages = {171-175}, doi = {10.1109/CAMSAP45676.2019.9022458} } Wang J, Huang Z, Kong L, Xiao J, Wang P, Zhang L and Li C (2019), "Performance of Training Sparse Deep Neural Networks on GPUs", In Proceedings of the IEEE High Performance Extreme Computing Conference., September, 2019. , pp. 1-5. [Abstract] [BibTeX] [DOI] Abstract: Deep neural networks have revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to fast store and train them. Over the past few decades, researchers have explored the prospect of sparse DNNs before, during, and after training by pruning edges from the underlying topology. After the above operation, the generated neural network is known as a sparse neural network. 
More recent works have demonstrated the remarkable results that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. Although existing methods ease the situation that high demand for computation resources severely hinders the deployment of large-scale DNNs in resource-constrained devices, DNNs can be trained at a faster speed and lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs and accelerate the training of DNNs. FSSL can: (1) learn a compact structure from large sparse DNN to reduce computation cost; (2) obtain a hardware-friendly to accelerate the DNNs evaluation efficiently. Experimental results of the training time and the compression rate show that superior performance and efficiency than the Matlab example code. These speedups are about twice speedups of non-structured sparsity. BibTeX: @inproceedings{Wang2019, author = {Wang, J. and Huang, Z. and Kong, L. and Xiao, J. and Wang, P. and Zhang, L. and Li, C.}, title = {Performance of Training Sparse Deep Neural Networks on GPUs}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2019}, pages = {1--5}, doi = {10.1109/HPEC.2019.8916506} } Wang J, Magron V and Lasserre J-B (2019), "TSSOS: A Moment-SOS hierarchy that exploits term sparsity", December, 2019. [Abstract] [BibTeX] Abstract: This paper is concerned with polynomial optimization problems. We show how to exploit term (or monomial) sparsity of the input polynomials to obtain a new converging hierarchy of semidefinite programming relaxations. The novelty (and distinguishing feature) of such relaxations is to involve block-diagonal matrices obtained in an iterative procedure performing completion of the connected components of certain adjacency graphs. The graphs are related to the terms arising in the original data and not to the links between variables. 
Our theoretical framework is then applied to compute lower bounds for polynomial optimization problems either randomly generated or coming from the networked systems literature. BibTeX: @article{Wang2019a, author = {Jie Wang and Victor Magron and Jean-Bernard Lasserre}, title = {TSSOS: A Moment-SOS hierarchy that exploits term sparsity}, year = {2019} } Wang A and Gounaris CE (2019), "On Tackling Reverse Convex Constraints for Non-overlapping of Circles" [Abstract] [BibTeX] Abstract: We study the circle-circle non-overlapping constraints, a form of reverse convex constraints that often arise in optimization models for cutting and packing applications. The feasible region induced by the intersection of circle-circle non-overlapping constraints is highly non-convex, and standard approaches to construct convex relaxations for spatial branch-and-bound global optimization of such models typically yield unsatisfactory loose relaxations. Consequently, solving such non-convex models to guaranteed optimality remains extremely challenging even for the state-of-the-art codes. In this paper, we apply a purpose-built branching scheme on non-overlapping constraints and utilize strengthened intersection cuts and various feasibility-based tightening techniques to further tighten the model relaxation. We embed these techniques into a branch-and-bound code and test them on two variants of circle packing problems. Our computational studies on a suite of 75 benchmark instances yielded, for the first time in the open literature, a total of 54 provably optimal solutions, and it was demonstrated to be competitive over the use of the state-of-the-art general-purpose global optimization solvers. BibTeX: @article{Wang2019b, author = {Akang Wang and Chrysanthos E. 
Gounaris}, title = {On Tackling Reverse Convex Constraints for Non-overlapping of Circles}, year = {2019} } Wei X, Zhang R, Liu Y, Yue H and Tan J (2019), "Evaluating the Soft Error Resilience of Instructions for GPU Applications", In Proceedings of the IEEE International Conference on Computational Science and Engineering and IEEE International Conference on Embedded and Ubiquitous Computing., 8, 2019. , pp. 459-464. [Abstract] [BibTeX] [DOI] Abstract: Graphics Processing Units (GPUs) are widely used in a range of High Performance Computing fields because of their high parallelism. As technology scales down, GPUs become more susceptible to soft errors, which dramatically impact the output quality of applications. Silent Data Corruption (SDC) is one of the most concerning reliability issues and requires efficient protection mechanisms to eliminate it. Software-directed instruction replication has been a flexible technique to solve SDCs. However, this method requires a trade-off between reliability and overhead. To this end, it is imperative to explore the SDC criticality of the instructions. In this paper, we carry out fine-grained analysis on instruction error behavior of 11 benchmarks, while previous work focused on the error resilience of the entire application. Combining the error resilience of instructions with the dynamic data flow of applications, we find potential protection opportunities for the instructions. BibTeX: @inproceedings{Wei2019, author = {X. Wei and R. Zhang and Y. Liu and H. Yue and J. 
Tan}, title = {Evaluating the Soft Error Resilience of Instructions for GPU Applications}, booktitle = {Proceedings of the IEEE International Conference on Computational Science and Engineering and IEEE International Conference on Embedded and Ubiquitous Computing}, year = {2019}, pages = {459--464}, doi = {10.1109/CSE/EUC.2019.00091} } Williams-Young DB, Beckman PG and Yang C (2019), "A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation", August, 2019. [Abstract] [BibTeX] Abstract: The central importance of large scale eigenvalue problems in scientific computation necessitates the development of massively parallel algorithms for their solution. Recent advances in dense numerical linear algebra have enabled the routine treatment of eigenvalue problems with dimensions on the order of hundreds of thousands on the world's largest supercomputers. In cases where dense treatments are not feasible, Krylov subspace methods offer an attractive alternative due to the fact that they do not require storage of the problem matrices. However, demonstration of scalability of either of these classes of eigenvalue algorithms on computing architectures capable of expressing excessive parallelism is non-trivial due to communication requirements and serial bottlenecks, respectively. In this work, we introduce the SISLICE method: a parallel shift-invert algorithm for the solution of the symmetric self-consistent field (SCF) eigenvalue problem. The SISLICE method drastically reduces the communication requirement of current parallel shift-invert eigenvalue algorithms through various shift selection and migration techniques based on density of states estimation and k-means clustering, respectively. This work demonstrates the robustness and parallel performance of the SISLICE method on a representative set of SCF eigenvalue problems and outlines research directions which will be explored in future work. 
BibTeX: @article{Williams-Young2019, author = {Williams-Young, David B. and Beckman, Paul G. and Yang, Chao}, title = {A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation}, year = {2019} } Winter M, Mlakar D, Zayer R, Seidel H-P and Steinberger M (2019), "Adaptive Sparse Matrix-matrix Multiplication on the GPU", In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. New York, NY, USA , pp. 68-81. ACM. [Abstract] [BibTeX] [DOI] Abstract: In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse matrix-vector multiplication (SpMV). While both are equally important, this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM. In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations, our approach guarantees a homogeneous, load-balanced access pattern to the first input matrix and improves memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes the time efficient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern guarantees bit-stable results during repetitive execution, a property missing from other approaches. Evaluation on an extensive sparse matrix benchmark suggests that our approach is the fastest SpGEMM implementation for highly sparse matrices (80% of the set). When bit-stable results are sought, our approach is the fastest across the entire test set. 
BibTeX: @inproceedings{Winter2019, author = {Winter, Martin and Mlakar, Daniel and Zayer, Rhaleb and Seidel, Hans-Peter and Steinberger, Markus}, title = {Adaptive Sparse Matrix-matrix Multiplication on the GPU}, booktitle = {Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2019}, pages = {68--81}, doi = {10.1145/3293883.3295701} } Witzig J, Berthold T and Heinz S (2019), "Computational Aspects of Infeasibility Analysis in Mixed Integer Programming". Thesis at: Zuse Institute Berlin. (19-54) [Abstract] [BibTeX] Abstract: The analysis of infeasible subproblems plays an important role in solving mixed integer programs (MIPs) and is implemented in most major MIP solvers. There are two fundamentally different concepts to generate valid global constraints from infeasible subproblems. The first is to analyze the sequence of implications, obtained by domain propagation, that led to infeasibility. The result of this analysis is one or more sets of contradicting variable bounds from which so-called conflict constraints can be generated. This concept is called conflict graph analysis and has its origin in solving satisfiability problems and is similarly used in constraint programming. The second concept is to analyze infeasible linear programming (LP) relaxations. Every ray of the dual LP provides a set of multipliers that can be used to generate a single new globally valid linear constraint. This method is called dual proof analysis. The main contribution of this paper is twofold. Firstly, we present three enhancements of dual proof analysis: presolving via variable cancellation, strengthening by applying mixed integer rounding functions, and a filtering mechanism. Further, we provide an intense computational study evaluating the impact of every presented component regarding dual proof analysis. 
Secondly, this paper presents the first integrated approach to use both conflict graph and dual proof analysis simultaneously within a single MIP solution process. All experiments are carried out on general MIP instances from the standard public test set MIPLIB 2017; the presented algorithms have been implemented within the non-commercial MIP solver SCIP and the commercial MIP solver FICO Xpress. BibTeX: @techreport{Witzig2019, author = {Jakob Witzig and Timo Berthold and Stefan Heinz}, title = {Computational Aspects of Infeasibility Analysis in Mixed Integer Programming}, school = {Zuse Institute Berlin}, year = {2019}, number = {19--54} } Witzig J and Berthold T (2019), "Conflict-Free Learning for Mixed Integer Programming". Thesis at: Zuse Institute Berlin. (19-59) [Abstract] [BibTeX] Abstract: Conflict learning plays an important role in solving mixed integer programs (MIPs) and is implemented in most major MIP solvers. A major step for MIP conflict learning is to aggregate the LP relaxation of an infeasible subproblem to a single globally valid constraint, the dual proof, that proves infeasibility within the local bounds. Among others, one way of learning is to add these constraints to the problem formulation for the remainder of the search. We suggest to not restrict this procedure to infeasible subproblems, but to also use global proof constraints from subproblems that are not (yet) infeasible, but can be expected to be pruned soon. As a special case, we also consider learning from integer feasible LP solutions. First experiments of this conflict-free learning strategy show promising results on the MIPLIB2017 benchmark set. 
BibTeX: @techreport{Witzig2019a, author = {Jakob Witzig and Timo Berthold}, title = {Conflict-Free Learning for Mixed Integer Programming}, school = {Zuse Institute Berlin}, year = {2019}, number = {19--59} } Wu R (2019), "Dynamic Scheduling Strategy for Block Parallel Cholesky Factorization Based on Activity on Edge Network", IEEE Access. Vol. 7, pp. 66317-66324. [Abstract] [BibTeX] [DOI] Abstract: The efficient development of system software and design applications in parallel architecture is a notable challenge considering various aspects, such as load balancing, memory spaces, communication, and synchronization. This paper presents a block parallel Cholesky factorization algorithm for a multicore system, which is developed based on activity on edge network. First, the basic block computing tasks and their dependencies are taken as vertices and edges, respectively, and a directed acyclic graph corresponding to the specific block parallel Cholesky factorization is generated. Next, each edge of the directed acyclic graph is assigned to a weight equal to the processing time of the initial vertex of the edge, and the directed acyclic graph becomes an activity on edge network with only one starting and one ending vertex. Finally, a queuing algorithm is designed for the basic block computing tasks according to the edge activity on edge network, and a dynamic scheduling strategy is developed for block parallel Cholesky factorization. The results of the experiments concerning the parallel execution time of the algorithm in multicore systems with different configurations demonstrate that the proposed algorithm has notable advantages compared with the traditional static scheduling algorithm, and it exhibits satisfactory load balancing, parallelism, and scalability capacities. 
BibTeX: @article{Wu2019, author = {Wu, Rongteng}, title = {Dynamic Scheduling Strategy for Block Parallel Cholesky Factorization Based on Activity on Edge Network}, journal = {IEEE Access}, year = {2019}, volume = {7}, pages = {66317--66324}, doi = {10.1109/ACCESS.2019.2917714} } Xiao G, Li K, Chen Y, He W, Zomaya A and Li T (2019), "CASpMV: A Customized and Accelerative SpMV Framework for the Sunway TaihuLight", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. [Abstract] [BibTeX] [DOI] Abstract: The Sunway TaihuLight, equipped with 10 million cores, is currently the world's third fastest supercomputer. SpMV is one of core algorithms in many high-performance computing applications. This paper implements a fine-grained design for generic parallel SpMV based on the special Sunway architecture and finds three main performance limitations, i.e., storage limitation, load imbalance, and huge overhead of irregular memory accesses. To address these problems, this paper introduces a customized and accelerative framework for SpMV (CASpMV) on the Sunway. The CASpMV customizes an auto-tuning four-way partition scheme for SpMV based on the proposed statistical model, which describes the sparse matrix structure characteristics, to make it better fit in with the computing architecture and memory hierarchy of the Sunway. Moreover, the CASpMV provides an accelerative method and customized optimizations to avoid irregular memory accesses and further improve its performance on the Sunway. Our CASpMV achieves a performance improvement that ranges from 588.05% to 2118.62% over the generic parallel SpMV on a CG (which corresponds to an MPI process) of the Sunway on average and has good scalability on multiple CGs. The performance comparisons of the CASpMV with state-of-the-art methods on the Sunway indicate that the sparsity and irregularity of data structures have less impact on CASpMV. BibTeX: @article{Xiao2019, author = {Xiao, G. and Li, K. and Chen, Y. and He, W. 
and Zomaya, A. and Li, T.}, title = {CASpMV: A Customized and Accelerative SpMV Framework for the Sunway TaihuLight}, journal = {IEEE Transactions on Parallel and Distributed Systems}, year = {2019}, pages = {1--1}, doi = {10.1109/TPDS.2019.2907537} } Xie Z, Tan G, Liu W and Sun N (2019), "IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication", In Proceedings of the 33rd ACM Conference on Supercomputing. Phoenix, AZ, USA [Abstract] [BibTeX] [URL] Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a sparse kernel that is used in a number of scientific applications. Although several SpGEMM algorithms have been proposed, almost all of them are restricted to the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. The particular format and algorithm that yield the best performance for SpGEMM also remain undetermined. In this work, we conduct a prospective study on format-specific parallel SpGEMM algorithms, and analyze their pros and cons. We then propose IA-SpGEMM, an input-aware auto-tuning framework for SpGEMM, that provides a unified programming interface in the CSR format and automatically determines the best format and algorithm for arbitrary sparse matrices. For this purpose, we set up an algorithm set and design a deep learning model called MatNet that is trained by over 2,700 matrices from the SuiteSparse Matrix Collection to quickly and accurately predict the best solution by using sparse features and density representations. We evaluate our framework on CPUs and a GPU, and the results show that IA-SpGEMM is on average 3.27× and 13.17× faster than MKL on an Intel and an AMD platform, respectively, and is 2.23× faster than cuSPARSE on an NVIDIA GPU. 
BibTeX: @inproceedings{Xie2019, author = {Xie, Zhen and Tan, Guangming and Liu, Weifeng and Sun, Ninghui}, title = {IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication}, booktitle = {Proceedings of the 33rd ACM Conference on Supercomputing}, year = {2019}, url = {https://folk.idi.ntnu.no/weifengl/papers/spgemm_xie_ics19.pdf} } Xie J and Liang Y (2019), "SPART: Optimizing CNNs by Utilizing Both Sparsity of Weights and Feature Maps", In Advanced Parallel Processing Technologies. Cham , pp. 71-85. Springer International Publishing. [Abstract] [BibTeX] Abstract: Intense convolution computation and great memory requirement in CNNs constrain their wider deployments and applications. Although both the weights and feature maps in CNNs can be sparse, directly mapping sparse convolution to spGEMM in HPC domain fails to improve the actual performance. Besides, existing sparse formats like CSR are not suitable for encoding the sparse feature maps because convolution operates across rows. BibTeX: @inproceedings{Xie2019a, author = {Xie, Jiaming and Liang, Yun}, editor = {Yew, Pen-Chung and Stenström, Per and Wu, Junjie and Gong, Xiaoli and Li, Tao}, title = {SPART: Optimizing CNNs by Utilizing Both Sparsity of Weights and Feature Maps}, booktitle = {Advanced Parallel Processing Technologies}, publisher = {Springer International Publishing}, year = {2019}, pages = {71--85} } Xu Z, Chen X, Shen J, Zhang Y, Chen C and Yang C (2019), "GARDENIA: A Graph Processing Benchmark Suite for Next-Generation Accelerators", ACM Journal on Emerging Technologies in Computing Systems. New York, NY, USA, January, 2019. Vol. 15(1), pp. 9:1-9:13. ACM. [Abstract] [BibTeX] [DOI] Abstract: This article presents the Graph Algorithm Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular graph algorithms on massively parallel accelerators. 
Applications with limited control and data irregularity are the main focus of existing generic benchmarks for accelerators, while available graph processing benchmarks do not apply state-of-the-art algorithms and/or optimization techniques. GARDENIA includes emerging graph processing workloads from graph analytics, sparse linear algebra, and machine-learning domains, which mimic massively multithreaded commercial programs running on modern large-scale datacenters. Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior, which is quite different from structured workloads and straightforward-implemented graph benchmarks. BibTeX: @article{Xu2019, author = {Xu, Zhen and Chen, Xuhao and Shen, Jie and Zhang, Yang and Chen, Cheng and Yang, Canqun}, title = {GARDENIA: A Graph Processing Benchmark Suite for Next-Generation Accelerators}, journal = {ACM Journal on Emerging Technologies in Computing Systems}, publisher = {ACM}, year = {2019}, volume = {15}, number = {1}, pages = {9:1--9:13}, doi = {10.1145/3283450} } Xu W, Fan S, Wang T and Zhou Y (2019), "Blocking and sparsity for optimization of convolution calculation algorithm on GPUs", September, 2019. [Abstract] [BibTeX] Abstract: Convolution neural network (CNN) plays a paramount role in machine learning, which has made significant contributions, such as medical image classification, natural language processing, and recommender system. The success convolution neural network achieved excellent performance with fast execution time. The convolution operation dominates the total operation time of a convolution neural network. In this paper, we propose a novel convolution method for Graphic Processing Units (GPUs), which reduces the convolution operation time and improves the execution speed by approximately 2× over the state-of-the-art convolution algorithm. 
Our work is based on the observation that the sparsity of the input feature map of the convolution operation is relatively large, and the zero values of the feature map are redundant for the convolution result. Therefore, we skip the zero value calculation and improve the speed by compressing the feature map. Besides, the shape of the feature map for the deep network is small, and the number of threads is limited. Therefore, for a limited number of threads, it is necessary to reduce the amount of calculation to increase the calculation speed. Our algorithm has a good effect on the convolution operation of the feature map of the deep network with large sparsity and small size. In this work, our contributions can be summarized as follows: 1) A novel storage format for high-sparsity feature maps. 2) A novel convolution algorithm based on block compression and shared memory is proposed. 3) A feature map data-set for convolution algorithm optimization. 4) We performed a single-layer convolution comparison experiment with CuDNN for different models, achieving at best a 3.5× speedup. We also implemented the algorithm on the VGG-19 model, which can achieve 1.3×∼2.9× speedup in deep convolution operation, and the entire network can achieve 2.3× speedup. BibTeX: @article{Xu2019a, author = {Xu, Weizhi and Fan, Shengyu and Wang, Tiantian and Zhou, Yufeng}, title = {Blocking and sparsity for optimization of convolution calculation algorithm on GPUs}, year = {2019} } Yamamoto Y (2019), "High-Performance Algorithms for Numerical Linear Algebra", In The Art of High Performance Computing for Computational Science, Vol. 1: Techniques of Speedup and Parallelization for General Purposes. Singapore , pp. 113-136. Springer Singapore. [Abstract] [BibTeX] [DOI] Abstract: Matrix computations lie at the heart of many scientific computations. 
While sophisticated algorithms have been established for various numerical linear algebra problems such as the solution of linear simultaneous equations and eigenvalue problems, they require considerable modification with the advent of exaFLOPS-scale supercomputers, which are expected to have a huge number of computing cores, deep memory hierarchy, and increased probability of hardware errors. In this chapter, we discuss projected hardware characteristics of exaFLOPS machines and summarize the challenges to be faced by numerical linear algebra algorithms in the near future. Based on these preparations, we present a brief survey of recent research efforts in the field of numerical linear algebra targeted at meeting these challenges. BibTeX: @inbook{Yamamoto2019, author = {Yamamoto, Yusaku}, editor = {Geshi, Masaaki}, title = {High-Performance Algorithms for Numerical Linear Algebra}, booktitle = {The Art of High Performance Computing for Computational Science, Vol. 1: Techniques of Speedup and Parallelization for General Purposes}, publisher = {Springer Singapore}, year = {2019}, pages = {113--136}, doi = {10.1007/978-981-13-6194-4_7} } Yang CY (2019), "High-Performance Linear Algebra-based Graph Framework on the GPU". Thesis at: University of California, Davis. [Abstract] [BibTeX] [URL] Abstract: High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs, because of three challenges: (1) difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic ratio. To address these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based in sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. 
Initial research efforts in implementing GraphBLAS on GPUs have been promising, but performance still trails by an order of magnitude compared to state-of-the-art graph frameworks using the traditional graph-centric approach of describing operations on vertices or edges. This dissertation examines the performance challenges of a linear algebra-based approach to building graph frameworks and describes new design principles for overcoming these bottlenecks. Among the new design principles is making exploiting input sparsity a first-class citizen in the framework. This is an especially important optimization, because it allows users to write graph algorithms without specifying certain implementation details thus permitting the software backend to choose the optimal implementation based on the input sparsity. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. We examine when it is profitable to exploit this output sparsity to reduce computational complexity. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in the thesis have been implemented in GraphBLAST, an open-source high-performance graph framework on GPU developed as part of this dissertation. It is notable for being the first graph framework based in linear algebra to get comparable or faster performance compared to the traditional, vertex-centric backends. The benefits of design principles described in this thesis have been shown to be important for single GPU, and it will grow in importance when it serves as a building block for distributed implementation in the future and as a single GPU backend for higher-level languages such as Python. 
A graph framework based in linear algebra not only improves the performance of existing graph algorithms, but also helps in quickly prototyping new algorithms. BibTeX: @phdthesis{Yang2019, author = {Yang, Carl Y.}, title = {High-Performance Linear Algebra-based Graph Framework on the GPU}, school = {University of California, Davis}, year = {2019}, url = {https://escholarship.org/uc/item/37j8j27d} } Yang C, Buluç A and Owens JD (2019), "GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU" [Abstract] [BibTeX] Abstract: High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs, because of three challenges: (1) difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based in sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first open-source linear algebra-based graph framework on GPU targeting high-performance computing. 
The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model. BibTeX: @article{Yang2019a, author = {Yang, Carl and Buluç, Aydın and Owens, John D.}, title = {GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU}, year = {2019} } Yang T (2019), "Advancing Non-Convex and Constrained Learning: Challenges and Opportunities", AI Matters. Vol. 5, pp. 29-39. [BibTeX] [DOI] BibTeX: @article{Yang2019b, author = {Yang, Tianbao}, title = {Advancing Non-Convex and Constrained Learning: Challenges and Opportunities}, journal = {AI Matters}, year = {2019}, volume = {5}, pages = {29--39}, doi = {10.1145/3362077.3362085} } Yao Z, Gholami A, Keutzer K and Mahoney M (2019), "PyHessian: Neural Networks Through the Lens of the Hessian", December, 2019. [Abstract] [BibTeX] Abstract: We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. This framework is developed in Pytorch, and it enables distributed-memory execution on cloud or supercomputer systems. PyHessian enables fast computations of the top Hessian eigenvalue, the Hessian trace, and the full Hessian eigenvalue density. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we apply PyHessian to analyze the effect of residual connections and Batch Normalization layers on the smoothness of the loss landscape. 
One recent claim, based on simpler first-order analysis, is that residual connections and batch normalization make the loss landscape "smoother", thus making it easier for Stochastic Gradient Descent to converge to a good solution. We perform an extensive analysis by measuring directly the Hessian spectrum using PyHessian. This analysis leads to finer-scale insight, demonstrating that while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that batch normalization layers do not necessarily make the loss landscape smoother, especially for shallow networks. Instead, the claimed smoother loss landscape only becomes evident for deep neural networks. We perform extensive experiments on four residual networks (ResNet20/32/38/56) on Cifar-10/100 dataset. We have open-sourced our PyHessian framework for Hessian spectrum computation. BibTeX: @article{Yao2019, author = {Zhewei Yao and Amir Gholami and Kurt Keutzer and Michael Mahoney}, title = {PyHessian: Neural Networks Through the Lens of the Hessian}, year = {2019} } Yaşar A and Çatalyürek ÜV (2019), "Heuristics for Symmetric Rectilinear Matrix Partitioning", September, 2019. [Abstract] [BibTeX] Abstract: Partitioning sparse matrices and graphs is a common and important problem in many scientific and graph analytics applications. In this work, we are concerned with a spatial partitioning called rectilinear partitioning (also known as generalized block distribution) of sparse matrices, which is needed for tiled (or blocked) execution of sparse matrix and graph analytics kernels. More specifically, in this work, we address the problem of symmetric rectilinear partitioning of square matrices. By symmetric, we mean having the same partition on rows and columns of the matrix, yielding a special tiling where the diagonal tiles (blocks) will be squares. 
We propose five heuristics to solve two different variants of this problem, and present a thorough experimental evaluation showing the effectiveness of the proposed algorithms. BibTeX: @article{Yasar2019, author = {Yaşar, Abdurrahman and Çatalyürek, Ümit V.}, title = {Heuristics for Symmetric Rectilinear Matrix Partitioning}, year = {2019} } Yaşar A, Rajamanickam S, Berry JW, Wolf MM, Young J and Çatalyürek ÜV (2019), "Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments", In Proceedings of the IEEE High Performance Extreme Computing Conference., September, 2019. , pp. 1-4. [Abstract] [BibTeX] [DOI] Abstract: Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adapting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adapt this task-based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8× speed up over past year's graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 seconds) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7× and 1.2× improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs. BibTeX: @inproceedings{Yasar2019a, author = {Yaşar, Abdurrahman and Rajamanickam, Sivasankaran and Berry, Jonathan W. and Wolf, Michael M. 
and Young, Jeffrey and Çatalyürek, Ümit V.}, title = {Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, year = {2019}, pages = {1--4}, doi = {10.1109/HPEC.2019.8916492} } Ye H, Wang S, Zhang Z and Zhang T (2019), "Fast Generalized Matrix Regression with Applications in Machine Learning", December, 2019. [Abstract] [BibTeX] Abstract: Fast matrix algorithms have become the fundamental tools of machine learning in big data era. The generalized matrix regression problem is widely used in the matrix approximation such as CUR decomposition, kernel matrix approximation, and stream singular value decomposition (SVD), etc. In this paper, we propose a fast generalized matrix regression algorithm (Fast GMR) which utilizes sketching technique to solve the GMR problem efficiently. Given error parameter $0 < \epsilon < 1$, the Fast GMR algorithm can achieve an $O(1+\epsilon)$ relative error with the sketching sizes being of order $O(\epsilon^{-1/2})$ for a large group of GMR problems. We apply the Fast GMR algorithm to the symmetric positive definite matrix approximation and single pass singular value decomposition and they achieve a better performance than conventional algorithms. Our empirical study also validates the effectiveness and efficiency of our proposed algorithms. BibTeX: @article{Ye2019, author = {Haishan Ye and Shusen Wang and Zhihua Zhang and Tong Zhang}, title = {Fast Generalized Matrix Regression with Applications in Machine Learning}, year = {2019} } Yin D (2019), "Towards More Scalable and Robust Machine Learning". Thesis at: Department of Electrical Engineering and Computer Sciences, University of California at Berkeley. 
[Abstract] [BibTeX] [URL] Abstract: For many data-intensive real-world applications, such as recognizing objects from images, detecting spam emails, and recommending items on retail websites, the most successful current approaches involve learning rich prediction rules from large datasets. There are many challenges in these machine learning tasks. For example, as the size of the datasets and the complexity of these prediction rules increase, there is a significant challenge in designing scalable methods that can effectively exploit the availability of distributed computing units. As another example, in many machine learning applications, there can be data corruptions, communication errors, and even adversarial attacks during training and test. Therefore, to build reliable machine learning models, we also have to tackle the challenge of robustness in machine learning. In this dissertation, we study several topics on the scalability and robustness in large-scale learning, with a focus on establishing solid theoretical foundations for these problems, and demonstrate recent progress towards the ambitious goal of building more scalable and robust machine learning models. We start with the speedup saturation problem in distributed stochastic gradient descent (SGD) algorithms with large mini-batches. We introduce the notion of gradient diversity, a metric of the dissimilarity between concurrent gradient updates, and show its key role in the convergence and generalization performance of mini-batch SGD. We then move forward to Byzantine distributed learning, a topic that involves both scalability and robustness in distributed learning. In the Byzantine setting that we consider, a fraction of distributed worker machines can have arbitrary or even adversarial behavior. We design statistically and computationally efficient algorithms to defend against Byzantine failures in distributed optimization with convex and non-convex objectives. 
Lastly, we discuss the adversarial example phenomenon. We provide theoretical analysis of the adversarially robust generalization properties of machine learning models through the lens of Rademacher complexity. BibTeX: @phdthesis{Yin2019, author = {Dong Yin}, title = {Towards More Scalable and Robust Machine Learning}, school = {Department of Electrical Engineering and Computer Sciences, University of California at Berkeley}, year = {2019}, url = {https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-175.pdf} } Yu T and Liu M (2019), "A Memory Efficient Clique Enumeration Method for Sparse Graphs with a Parallel Implementation", Parallel Computing. [Abstract] [BibTeX] [DOI] [URL] Abstract: Maximal clique enumeration (MCE) is a widely studied problem that plays a crucial role in structure mining of undirected graphs. The increasing scale of real-world graphs has brought the challenges of high memory cost and high CPU workload to the problem. In this paper, we propose a memory efficient method named CMC-bit for MCE on sparse graphs. It reduces the memory cost via minimizing the candidate cliques and representing them by the data structure bitset. It generates an appropriate order for the vertex set according to two optimized principles to reduce the CPU cost. We further design a partition-based CMC-bit algorithm with a one-side extending strategy to solve the memory-limited problem. We parallelize the CMC-bit algorithm based on MapReduce with a range-based partition strategy to make an optimal trade-off between the shuffling workload of graph decomposition and load balance in the Reduce phase. We conduct extensive experiments on 30 real-world datasets. The results demonstrate that both the CMC-bit algorithm and its parallel implementation significantly outperform the respective state-of-the-art algorithms in speed. 
We also show that the parallel CMC-bit algorithm achieves good performance on the scalability with respect to both the reducer number and the CPU number. BibTeX: @article{Yu2019, author = {Yu, Ting and Liu, Mengchi}, title = {A Memory Efficient Clique Enumeration Method for Sparse Graphs with a Parallel Implementation}, journal = {Parallel Computing}, year = {2019}, url = {http://www.sciencedirect.com/science/article/pii/S0167819118301297}, doi = {10.1016/j.parco.2019.05.005} } Yuan G, Shen L and Zheng W-S (2019), "A Block Decomposition Algorithm for Sparse Optimization" [Abstract] [BibTeX] Abstract: Sparse optimization is a central problem in machine learning and computer vision. However, this problem is inherently NP-hard and thus difficult to solve in general. Combinatorial search methods find the global optimal solution but are confined to small-sized problems, while coordinate descent methods are efficient but often suffer from poor local minima. This paper considers a new block decomposition algorithm that combines the effectiveness of combinatorial search methods and the efficiency of coordinate descent methods. Specifically, we consider a random strategy or/and a greedy strategy to select a subset of coordinates as the working set, and then perform a global combinatorial search over the working set based on the original objective function. We show that our method finds stronger stationary points than Amir Beck et al.'s coordinate-wise optimization method. In addition, we establish the global convergence and convergence rate of our block decomposition algorithm. Our experiments on solving sparse regularized and sparsity constrained least squares optimization problems demonstrate that our method achieves state-of-the-art performance in terms of accuracy. 
BibTeX: @article{Yuan2019, author = {Yuan, Ganzhao and Shen, Li and Zheng, Wei-Shi}, title = {A Block Decomposition Algorithm for Sparse Optimization}, year = {2019} } Zhang K, Liu J, Zhang J and Wang J (2019), "Greedy Orthogonal Pivoting Algorithm for Non-negative Matrix Factorization", In Proceedings of the 36th International Conference on Machine Learning. [Abstract] [BibTeX] [URL] Abstract: Non-negative matrix factorization is a powerful tool for learning useful representations in the data and has been widely applied in many problems such as data mining and signal processing. Orthogonal NMF, which can further improve the locality of decomposition, has drawn considerable interest in clustering problems. However, imposing simultaneous non-negative and orthogonal structure can be difficult, and so existing algorithms can only solve it approximately. To address this challenge, we propose an innovative procedure called Greedy Orthogonal Pivoting Algorithm (GOPA). The GOPA method fully exploits the sparsity of non-negative orthogonal solutions to break the global problem into a series of local optimizations, in which an adaptive subset of coordinates are updated in a greedy, closed-form manner. The biggest advantage of GOPA is that it promotes exact orthogonality and provides solid empirical evidence that stronger orthogonality does contribute favorably to better clustering performance. On the other hand, we have designed randomized and batch-mode version of GOPA, which can further reduce the computational cost and improve accuracy, making it suitable for large data. 
BibTeX: @inproceedings{Zhang2019, author = {Zhang, Kai and Liu, Jun and Zhang, Jie and Wang, Jun}, title = {Greedy Orthogonal Pivoting Algorithm for Non-negative Matrix Factorization}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, year = {2019}, url = {http://proceedings.mlr.press/v97/zhang19r/zhang19r.pdf} } Zhang Z, Wu X, Zhang N, Zhang S and Solomonik E (2019), "Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels", October, 2019. [Abstract] [BibTeX] Abstract: Tensor computations are increasingly prevalent numerical techniques in data science. However, innovation and deployment of methods on large sparse tensor datasets are made challenging by the difficulty of efficient implementation thereof. We provide a Python extension to the Cyclops tensor algebra library, which fully automates the management of distributed-memory parallelism and sparsity for NumPy-style operations on multidimensional arrays. We showcase this functionality with novel high-level implementations of three algorithms for the tensor completion problem: alternating least squares (ALS) with an implicit conjugate gradient method, stochastic gradient descent (SGD), and coordinate descent (CCD++). To make possible tensor completion for very sparse tensors, we introduce a new multi-tensor routine that is asymptotically more efficient than pairwise tensor contraction for key components of the tensor completion methods. Further, we add support for hypersparse matrix representations to Cyclops. We provide microbenchmarking results on the Stampede2 supercomputer to demonstrate the efficiency of this functionality. Finally, we study the accuracy and performance of the tensor completion methods for a synthetic tensor with 10 billion nonzeros and the Netflix dataset. 
BibTeX: @article{Zhang2019a, author = {Zhang, Zecheng and Wu, Xiaoxiao and Zhang, Naijing and Zhang, Siyuan and Solomonik, Edgar}, title = {Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels}, year = {2019} } Zhang Y, Azad A and Hu Z (2019), "FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence", October, 2019. [Abstract] [BibTeX] Abstract: This paper presents a new distributed-memory algorithm called FastSV for finding connected components in an undirected graph. Our algorithm simplifies the classic Shiloach-Vishkin algorithm and employs several novel and efficient hooking strategies for faster convergence. We map different steps of FastSV to linear algebraic operations and implement them with the help of scalable graph libraries. FastSV uses sparse operations to avoid redundant work and optimized MPI communication to avoid bottlenecks. The resultant algorithm shows high-performance and scalability as it can find the connected components of a hyperlink graph with over 134B edges in 30 seconds using 262K cores on a Cray XC40 supercomputer. FastSV outperforms the state-of-the-art algorithm by an average speedup of 2.21× (max 4.27×) on a variety of real-world graphs. BibTeX: @article{Zhang2019b, author = {Zhang, Yongzhe and Azad, Ariful and Hu, Zhenjiang}, title = {FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence}, year = {2019} } Zhang Y, Brahmakshatriya A, Chen X, Dhulipala L, Kamil S, Amarasinghe S and Shun J (2019), "PriorityGraph: A Unified Programming Model for Optimizing Ordered Graph Algorithms", November, 2019. [Abstract] [BibTeX] Abstract: Many graph problems can be solved using ordered parallel graph algorithms that achieve significant speedup over their unordered counterparts by reducing redundant work. This paper introduces PriorityGraph, a new programming framework that simplifies writing high-performance parallel ordered graph algorithms. 
The key to PriorityGraph is a novel priority-based abstraction that enables vertices to be processed in a dynamic order while hiding low-level implementation details from the user. PriorityGraph is implemented as a language and compiler extension to GraphIt. The PriorityGraph compiler extension leverages new program analyses, transformations, and code generation to produce fast implementations of ordered parallel graph algorithms. We also introduce bucket fusion, a new performance optimization that fuses together different rounds of ordered algorithms to reduce synchronization overhead, resulting in 1.2× to 3× speedup over the fastest existing ordered algorithm implementations on road networks with large diameters. PriorityGraph achieves up to 3× speedup on six ordered graph algorithms over state-of-the-art frameworks and hand-optimized implementations (Julienne, Galois, and GAPBS) that support ordered algorithms. BibTeX: @article{Zhang2019c, author = {Zhang, Yunming and Brahmakshatriya, Ajay and Chen, Xinyi and Dhulipala, Laxman and Kamil, Shoaib and Amarasinghe, Saman and Shun, Julian}, title = {PriorityGraph: A Unified Programming Model for Optimizing Ordered Graph Algorithms}, year = {2019} } Zhang H, Constantinescu EM and Smith BF (2019), "PETSc TSAdjoint: a discrete adjoint ODE solver for first-order and second-order sensitivity analysis", December, 2019. [Abstract] [BibTeX] Abstract: We present a new software system PETSc TSAdjoint for first-order and second-order adjoint sensitivity analysis of time-dependent nonlinear differential equations. The derivative calculation in PETSc TSAdjoint is essentially a high-level algorithmic differentiation process. The adjoint models are derived by differentiating the timestepping algorithms and implemented based on the parallel infrastructure in PETSc. Full differentiation of the library code including MPI routines thus is avoided, and users do not need to derive their own adjoint models for their specific applications. 
PETSc TSAdjoint can compute the first-order derivative, that is, the gradient of a scalar functional, and the Hessian-vector product that carries second-order derivative information, while requiring minimal input (a few callbacks) from the users. Optimal checkpointing schemes are employed by the adjoint model in a manner that is transparent to users. Usability, efficiency, and scalability are demonstrated through examples from a variety of applications. BibTeX: @article{Zhang2019d, author = {Hong Zhang and Emil M. Constantinescu and Barry F. Smith}, title = {PETSc TSAdjoint: a discrete adjoint ODE solver for first-order and second-order sensitivity analysis}, year = {2019} } Zhang J, Hong M and Zhang S (2019), "On Lower Iteration Complexity Bounds for the Saddle Point Problems", December, 2019. [Abstract] [BibTeX] Abstract: In this paper, we study the lower iteration complexity bounds for finding the saddle point of a strongly convex and strongly concave saddle point problem: $\min_x \max_y F(x,y)$. We restrict the classes of algorithms in our investigation to be either pure first-order methods or methods using proximal mappings. The existing lower bound result for this type of problems is obtained via the framework of strongly monotone variational inequality problems, which corresponds to the case where the gradient Lipschitz constants ($L_x$, $L_y$ and $L_{xy}$) and strong convexity/concavity constants ($\mu_x$ and $\mu_y$) are uniform with respect to variables x and y. However, specific to the min-max saddle point problem these parameters are naturally different. Therefore, one is led to finding the best possible lower iteration complexity bounds, specific to the min-max saddle point models. In this paper we present the following results. For the class of pure first-order algorithms, our lower iteration complexity bound is $\Omega\left(\sqrt{\frac{L_x}{\mu_x}+\frac{L_{xy}^2}{\mu_x\mu_y}+\frac{L_y}{\mu_y}}\cdot\ln\left(\frac{1}{\epsilon}\right)\right)$, where the term $\frac{L_{xy}^2}{\mu_x\mu_y}$ explains how the coupling influences the iteration complexity. 
Under several special parameter regimes, this lower bound has been achieved by corresponding optimal algorithms. However, whether or not the bound under the general parameter regime is optimal remains open. Additionally, for the special case of bilinear coupling problems, given the availability of certain proximal operators, a lower bound of $\Omega\left(\sqrt{\frac{L_{xy}^2}{\mu_x\mu_y}+1}\cdot\ln\left(\frac{1}{\epsilon}\right)\right)$ is established in this paper, and optimal algorithms have already been developed in the literature. BibTeX: @article{Zhang2019e, author = {Junyu Zhang and Mingyi Hong and Shuzhong Zhang}, title = {On Lower Iteration Complexity Bounds for the Saddle Point Problems}, year = {2019} } Zhang T (2019), "Sparse Optimization on General Atomic Sets: Greedy and Forward-Backward Algorithms", December, 2019. [Abstract] [BibTeX] Abstract: We consider the problem of sparse atomic optimization, where the notion of "sparsity" is generalized to meaning some linear combination of few atoms. The definition of atomic set is very broad; popular examples include the standard basis, low-rank matrices, overcomplete dictionaries, permutation matrices, orthogonal matrices, etc. The model of sparse atomic optimization therefore includes problems coming from many fields, including statistics, signal processing, machine learning, computer vision and so on. Specifically, we consider the problem of maximizing a restricted strongly convex (or concave), smooth function restricted to a sparse linear combination of atoms. We extend recent work that establish linear convergence rates of greedy algorithms on restricted strongly concave, smooth functions on sparse vectors to the realm of general atomic sets, where the convergence rate involves a novel quantity: the "sparse atomic condition number". 
This leads to the strongest known multiplicative approximation guarantees for various flavors of greedy algorithms for sparse atomic optimization; in particular, we show that in many settings of interest the greedy algorithm can attain strong approximation guarantees while maintaining sparsity. Furthermore, we introduce a scheme for forward-backward algorithms that achieves the same approximation guarantees. Secondly, we define an alternate notion of weak submodularity, which we show is tightly related to the more familiar version that has been used to prove earlier linear convergence rates. We prove analogous multiplicative approximation guarantees using this alternate weak submodularity, and establish its distinct identity and applications. BibTeX: @article{Zhang2019f, author = {Thomas Zhang}, title = {Sparse Optimization on General Atomic Sets: Greedy and Forward-Backward Algorithms}, year = {2019} } Zhao K, Su J, Yu JX and Zhang H (2019), "SQL-G: Efficient Graph Analytics by SQL", IEEE Transactions on Knowledge and Data Engineering. [Abstract] [BibTeX] [DOI] Abstract: Querying graphs and conducting graph analytics become important in data processing since there are many real applications dealing with massive graphs, such as online social networks, Semantic Web, knowledge graphs, etc. Over the years, many distributed graph processing systems have been developed to support graph analytics using various programming models, and many graph querying languages have been proposed. A natural question that arises is how to integrate graph data and traditional non-graph data in a distributed system for users to conduct analytics. There are two issues. One issue is related to expressiveness on how to specify graph analytics as well as data analytics by a querying language. The other issue is related to efficiency on how to process analytics in a distributed system. For the first issue, SQL is a best candidate, since SQL is a well-accepted language for data processing. 
We concentrate on SQL for graph analytics. Our early work shows that graph analytics can be supported by SQL in a way from "semiring + while" to "relational algebra + while" via the enhanced recursive SQL queries. In this paper, we focus on the second issue on how to process such enhanced recursive SQL queries based on the GAS (Gather-Apply-Scatter) model under which efficient graph processing systems can be developed. To demonstrate the efficiency, we implemented a system by tightly coupling Spark SQL and GraphX on Spark which is one of the most popular in-memory data-flow processing platforms. First, we enhance Spark SQL by adding the capability of supporting the enhanced recursive SQL queries for graph analytics. In this regard, graph analytics can be processed using a distributed SQL engine alone. Second, we further propose new transformation rules to optimize/translate the operations for recursive SQL queries to the operations by GraphX. In this regard, graph analytics by SQL can be processed in a similar way as done by a distributed graph processing system using the APIs provided by the system. We conduct extensive performance studies to test graph analytics using large real graphs. We show that our approach can achieve similar or even higher efficiency, in comparison to the built-in graph algorithms in the existing graph processing systems. BibTeX: @article{Zhao2019, author = {Zhao, K. and Su, J. and Yu, J. X. and Zhang, H.}, title = {SQL-G: Efficient Graph Analytics by SQL}, journal = {IEEE Transactions on Knowledge and Data Engineering}, year = {2019}, doi = {10.1109/TKDE.2019.2950620} } Zheng H, Vong S and Liu L (2019), "A direct preconditioned modulus-based iteration method for solving nonlinear complementarity problems of H-matrices", Applied Mathematics and Computation. Vol. 353, pp. 396-405. 
[Abstract] [BibTeX] [DOI] [URL] Abstract: In this paper, we establish a direct preconditioned modulus-based iteration method for solving a class of nonlinear complementarity problems with the system matrix being an H-matrix. The convergence theorems of the proposed method are given, which generalize and improve the existing ones. Numerical examples show that the proposed method is efficient. BibTeX: @article{Zheng2019, author = {Zheng, Hua and Vong, Seakweng and Liu, Ling}, title = {A direct preconditioned modulus-based iteration method for solving nonlinear complementarity problems of H-matrices}, journal = {Applied Mathematics and Computation}, year = {2019}, volume = {353}, pages = {396--405}, url = {http://www.sciencedirect.com/science/article/pii/S0096300319301134}, doi = {10.1016/j.amc.2019.02.015} } Zheng T, Zhang Z and Cheng X (2019), "SilverChunk: An Efficient In-Memory Parallel Graph Processing System", In Database and Expert Systems Applications. , pp. 222-236. Springer International Publishing. [Abstract] [BibTeX] Abstract: One of the main constructs of graph processing is the two-level nested loop structure. Parallelizing nested loops is notoriously unfriendly to both CPU and memory access when dealing with real graph data due to its skewed distribution. To address this problem, we present SilverChunk, a high performance graph processing system. SilverChunk builds edge chunks of equal size from original graphs and unfolds nested loops statically in pull-based executions (VR-Chunk) and dynamically in push-based executions (D-Chunk). VR-Chunk slices the entire graph into several chunks. A virtual vertex is generated pointing to the first half of each sliced edge list so that no edge list lives in more than one chunk. D-Chunk builds its chunk list via binary searching over the prefix degree sum array of the active vertices. Each chunk has a local buffer for conflict-free maintenance of the next frontier. 
By changing the units of scheduling from edges to chunks, SilverChunk achieves better CPU and memory utilization. SilverChunk provides a high level programming interface combined with multiple optimization techniques to help developing efficient graph processing applications. Our evaluation results reveal that SilverChunk outperforms state-of-the-art shared-memory graph processing systems by up to 44, including Gemini, Grazelle, etc. Moreover, it has lower memory overheads and nearly zero pre-processing time. BibTeX: @inproceedings{Zheng2019a, author = {Zheng, Tianqi and Zhang, Zhibin and Cheng, Xueqi}, editor = {Hartmann, Sven and Küng, Josef and Chakravarthy, Sharma and Anderst-Kotsis, Gabriele and Tjoa, A Min and Khalil, Ismail}, title = {SilverChunk: An Efficient In-Memory Parallel Graph Processing System}, booktitle = {Database and Expert Systems Applications}, publisher = {Springer International Publishing}, year = {2019}, pages = {222--236} } Zhou G, Feng Y, Bo R and Zhang T (2019), "GPU-accelerated sparse matrices parallel inversion algorithm for large-scale power systems", International Journal of Electrical Power & Energy Systems. Vol. 111, pp. 34-43. [Abstract] [BibTeX] [DOI] [URL] Abstract: State-of-the-art Graphics Processing Unit (GPU) has superior performances on floating-point calculation and memory bandwidth, and therefore has great potential in many computationally intensive power system applications, one of which is the inversion of large-scale sparse matrix. It is a fundamental component for many power system analyses which requires to solve massive number of forward and backward substitution (F&B) subtasks and seems to be a good GPU-accelerated candidate application. 
By means of solving multiple F&B subtasks concurrently and a series of performance tunings in compliance with GPU's architectures, we successfully develop a batch F&B algorithm on GPUs, which not only extracts the intra-level and intra-level parallelisms inside single F&B subtask but also explores a more regular parallelism among massive F&B subtasks, called inter-task parallelism. Case study on a 9241-dimension case shows that the proposed batch F&B solver consumes 2.92 μs per forward substitution (FS) subtask when the batch size is equal to 3072, achieving 65 times speedup relative to KLU library. On this basis, the complete design process of a GPU-based inversion algorithm is proposed. By offloading the tremendous computational burden to GPU, the inversion of 9241-dimension case consumes only 97 ms, which can achieve 8.1 times speedup relative to the 12-core CPU inversion solver based on KLU library. The proposed batch F&B solver is practically very promising in many other power system applications requiring solving massive F&B subtasks, such as probabilistic power flow analysis. BibTeX: @article{Zhou2019, author = {Zhou, Gan and Feng, Yanjun and Bo, Rui and Zhang, Tao}, title = {GPU-accelerated sparse matrices parallel inversion algorithm for large-scale power systems}, journal = {International Journal of Electrical Power & Energy Systems}, year = {2019}, volume = {111}, pages = {34--43}, url = {http://www.sciencedirect.com/science/article/pii/S0142061518325109}, doi = {10.1016/j.ijepes.2019.03.074} } Zhu M, Zhang T, Gu Z and Xie Y (2019), "Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs", In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, USA , pp. 359-371. ACM. 
[Abstract] [BibTeX] [DOI]

Abstract: Deep neural networks have become a compelling solution for applications such as image classification, object detection, speech recognition, and machine translation. However, this success comes at the cost of excessive computation due to the over-provisioned parameter space. To improve the computational efficiency of neural networks, many pruning techniques have been proposed to reduce the number of multiply-accumulate (MAC) operations, which results in high sparsity in the networks. Unfortunately, sparse neural networks often run slower than their dense counterparts on modern GPUs due to their poor device utilization rate. In particular, as sophisticated hardware primitives (e.g., Tensor Core) have been deployed to boost the performance of dense matrix multiplication by an order of magnitude, the performance of sparse neural networks lags behind significantly. In this work, we propose an algorithm and hardware co-design methodology to accelerate sparse neural networks. A novel pruning algorithm is devised to improve the workload balance and reduce the decoding overhead of sparse neural networks. Meanwhile, new instructions and micro-architecture optimizations are proposed in Tensor Core to adapt to structurally sparse neural networks. Our experimental results show that the pruning algorithm can achieve a 63% performance gain with model accuracy sustained. Furthermore, the hardware optimization gives an additional 58% performance gain with negligible area overhead.
BibTeX: @inproceedings{Zhu2019, author = {Zhu, Maohua and Zhang, Tao and Gu, Zhenyu and Xie, Yuan}, title = {Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs}, booktitle = {Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture}, publisher = {ACM}, year = {2019}, pages = {359--371}, doi = {10.1145/3352460.3358269} }

Zhu G, Jiang P and Agrawal G (2019), "A Methodology for Characterizing Sparse Datasets and Its Application to SIMD Performance Prediction", In Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, September 2019, pp. 445-456. [Abstract] [BibTeX] [DOI]

Abstract: Irregular computations are commonly seen in many scientific and engineering domains that use unstructured meshes or sparse matrices. The performance of an irregular application is very dependent upon the dataset. This paper poses the following question: "given an unstructured mesh or a graph, what method(s) can be used to sample it, such that the execution on the resulting sampled dataset can accurately reflect performance characteristics on the full dataset". Our first insight is that developing a universal sampling approach for all sparse matrices is impractical. According to the non-zero distribution of the sparse matrix, we propose two novel sampling strategies: Stride Average sampling and Random Tile sampling, which are suitable for uniform and skewed sparse matrices respectively. To help categorize a sparse matrix as uniform or skewed, we introduce the clustering coefficient as an important feature which can be propagated into the decision tree model. We also adapt the Random Node Neighbor sampling approach for efficient estimation of the clustering coefficient.
We apply our unstructured dataset characterization approach to modeling the performance of irregular SIMD applications, where the sampled dataset obtained is used to predict the cache miss rate and SIMD utilization ratio. We also build analytical models to estimate the overheads incurred by load imbalance among threads. With knowledge of these factors, we adapt the code skeleton framework SKOPE to capture the workload behaviors and aggregate performance statistics for execution time prediction.

BibTeX: @inproceedings{Zhu2019a, author = {Zhu, G. and Jiang, P. and Agrawal, G.}, title = {A Methodology for Characterizing Sparse Datasets and Its Application to SIMD Performance Prediction}, booktitle = {Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques}, year = {2019}, pages = {445--456}, doi = {10.1109/PACT.2019.00042} }