# Sparsely thinking of 2021

#### A selection of research on numerical analysis, mathematical optimisation, and hardware accelerators published in the past year

A wordcloud from the publications of 2021.

The year 2021 is gone, which means it’s time to take a look at which publications were captured by my Google Scholar queries.

I moved on from the queries I discussed in my previous post, which turned out to be quite noisy. At the moment I have the following:

• a query for applications, specifically sustainability-oriented uses of optimisation and numerical algorithms: parallel numerical sparse matrix optimization graph sustainability environment pollution economics algorithm model
• a query for the more theoretical stuff: (sparse linear algebra numerical) OR (parallel numerical optimization)
• queries for new publications by a set of people: Hartwig Anzt, Stephen Boyd, Aydın Buluç, Coralia Cartis, Ümit V. Çatalyürek, George Constantinides, Tim Davis, James W. Demmel, Jack Dongarra, Jaroslav Fowkes, Carla P. Gomes, Nick Gould, Nicholas J. Higham, Eric C. Kerrigan, Ignacio Laguna, Ruth Misener, Yurii Nesterov, Dominique Orban, Gabriel Peyré, Tyrone Rees, Michael Saunders, Jennifer Scott, Stanimire Tomov. Unfortunately I can’t have queries for all the people I would like to follow!

Since I have been tuning the queries quite a bit throughout the year, I have likely missed some interesting work. If you’re reading this post and think I have missed something, please leave a comment and help me out.

Last year was the fourth year of my bibliography project, so I thought it would be good to look back at how its content has evolved over time. Last year, I started experimenting with a bit of data science, but this time I’d like to get real. I’ll discuss how the project has changed in terms of size and type of publications, how the topics have changed, and my considerations about this project and the literature in the field(s).

### Number and type of publications throughout the years

Last year I collected fewer publications compared to the previous year, probably because my queries have changed a couple of times. To be honest, when you almost reach an average of a new publication every day, fewer publications isn’t a bad thing…

Although journal articles are still the most frequent type of publication, I found it interesting that an increasing number of PhD dissertations are ending up in the bibliography. However, I’m not sure what this means for the field, considering that some of these dissertations are oriented more towards machine learning or data science than towards numerical analysis. Is this a continuation of the trend we’ve seen in the past years? Can numerical methods now be considered part of data science?

It was good to see that some of these dissertations – for example, Rogelio Long’s – cover topics that are very close to mine.

### Topics throughout the years

I carried out some topic modeling on the bibliography using the gensim package. I extracted the data from titles and abstracts, and used nltk to carry out part of the data cleaning. I added a Jupyter notebook with the code in the repository.
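As a rough, dependency-free illustration of the cleaning that happens before topic modelling, the sketch below tokenises titles, drops stop words, and builds bag-of-words counts. The stop-word list and the helper names `tokenize` and `bag_of_words` are hypothetical stand-ins: the actual notebook uses nltk and gensim, and its pipeline may well differ.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; nltk ships a much larger one).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

def bag_of_words(documents):
    """Return one token-count mapping per document."""
    return [Counter(tokenize(doc)) for doc in documents]

# Two titles taken from the tables in this post.
titles = [
    "Optimal algebraic Breadth-First Search for sparse graphs",
    "Mixed-integer nonlinear optimization",
]
bows = bag_of_words(titles)
```

Each `Counter` plays the role of the sparse word-count vector that a topic model consumes; a real pipeline would also map tokens to integer ids and probably lemmatise them.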

The analysis found 4 topics. Let’s try to give an interpretation of what they represent.

#### Topic 0: algorithms for graph partitioning and dimensionality reduction

One way to understand what is covered by this topic could be to look at the most frequent terms in the papers, which are represented by the following wordcloud.

This figure tells us that words like graph, model, and analysis are very frequent in the publications with very high topic content. Unfortunately, this isn’t very helpful, because the same terms are frequent in other publications too. To help overcome this problem, the pyLDAvis package allows us to delve deeper into the topic and visualise the most relevant terms, by tuning a parameter called $\lambda$: this tells us that words like rank, vertex, approximation, and framework are the most characteristic of this topic.
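To make the role of $\lambda$ a bit more concrete, here is a small sketch of the relevance score from the LDAvis paper by Sievert and Shirley, which, to my understanding, is what pyLDAvis uses to rank terms: $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The probabilities below are made-up numbers for illustration only, and `relevance` and `rank_terms` are hypothetical helper names, not part of pyLDAvis.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term to a topic for a weight lam in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: (p(term | topic), p(term) over the whole corpus).
terms = {
    "graph": (0.050, 0.045),   # frequent everywhere
    "vertex": (0.020, 0.004),  # rarer overall, concentrated in this topic
}

def rank_terms(terms, lam):
    """Sort term names by decreasing relevance."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

by_frequency = rank_terms(terms, lam=1.0)  # ranks purely by p(term | topic)
by_lift = rank_terms(terms, lam=0.0)       # ranks purely by p(term | topic) / p(term)
```

With $\lambda = 1$ the ubiquitous term wins, while lowering $\lambda$ promotes terms that are concentrated in the topic, which is the effect described above.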

The list of the top 5 publications by topic content shows what kind of publications are in this set.

| Key | Title |
|---|---|
| Burkhardt2019 | Optimal algebraic Breadth-First Search for sparse graphs |
| Karypis1999 | A fast and high quality multilevel scheme for partitioning irregular graphs |
| Tian2021 | A High-Performance Sparse Tensor Algebra Compiler in Multi-Level IR |
| Ayall2019 | Edge Property Based Stream Order Reduce the Performance of Stream Edge Graph Partition |
| Giles2019 | Generalised multilevel Picard approximations |
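A list like the one above can be produced by sorting the documents by their share of a given topic. The sketch below assumes each publication comes with a list of per-topic shares; the keys and numbers are invented for illustration, and `top_publications` is a hypothetical helper rather than the code in the notebook.

```python
def top_publications(doc_topics, topic, k=5):
    """Return the k keys with the largest share of the given topic."""
    ranked = sorted(doc_topics.items(), key=lambda item: item[1][topic], reverse=True)
    return [key for key, _ in ranked[:k]]

# Hypothetical topic shares (topics 0 to 3) for placeholder bibliography keys.
doc_topics = {
    "PaperA": [0.97, 0.01, 0.01, 0.01],
    "PaperB": [0.02, 0.95, 0.02, 0.01],
    "PaperC": [0.90, 0.02, 0.06, 0.02],
    "PaperD": [0.05, 0.02, 0.08, 0.85],
}
top_two = top_publications(doc_topics, topic=0, k=2)
```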

#### Topic 1: algorithms for mathematical optimisation

Here is the wordcloud for this topic.

The pyLDAvis package helps us again: words like precision, optimization, solution, and convex characterise this topic. The top 5 publications by topic content are the following.

| Key | Title |
|---|---|
| Gao2020c | Distributed Quasi-Newton Derivative-Free Optimization Method for Optimization Problems with Multiple Local Optima |
| Alghunaim2020 | On the Performance and Linear Convergence of Decentralized Primal-Dual Methods |
| Belotti2013 | Mixed-integer nonlinear optimization |
| Nowak2019 | Multi-Tree Decomposition Methods for Large-Scale Mixed Integer Nonlinear Optimization |
| Andrei2020 | Nonlinear Conjugate Gradient Methods for Unconstrained Optimization |

#### Topic 2: parallelisation of algorithms

Maybe confusingly, this is a very general topic: words like parallel, model, node, and message are characteristic of this set of publications, according to the pyLDAvis package. Here are the top 5 publications by topic content.

| Key | Title |
|---|---|
| SoltanMohammadi2020 | Automatic Sparse Computation Parallelization By Utilizing Domain-Specific Knowledge In Data Dependence Analysis |
| Ahrens2020b | Load Plus Communication Balancing in Contiguous Partitions for Distributed Sparse Matrices: Linear-Time Algorithms |
| Muro2019 | Acceleration of Symmetric Sparse Matrix-Vector Product Using Improved Hierarchical Diagonal Blocking Format |
| Chang2019 | The Complexity of $\Delta +1$ Coloring in Congested Clique, Massively Parallel Computation, and Centralized Local Computation |
| Ahrens2020a | On Optimal Partitioning For Sparse Matrices In Variable Block Row Format |

Here, instead, is the wordcloud for this topic.

#### Topic 3: hardware architectures

The last topic is the one that most reflects my background in electronic engineering, because it is oriented towards the hardware and low-level memory optimisations.

Again, just looking at the most frequent terms in this topic – and therefore the wordcloud – is misleading. Let’s look at the top 5 publications (by content) in this category instead.

| Key | Title |
|---|---|
| Ramesh2019 | Hardware-Software Co-Design Accelerators for Sparse BLAS |
| Choo2014 | Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads |
| Candel2017 | Accurately modeling the on-chip and off-chip GPU memory subsystem |
| LiD2015 | Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations |
| Kim2019 | Static code transformations for thread-dense memory accesses in GPU computing |

The pyLDAvis package tells us words like GPU, memory, NVIDIA, and AMD are most characteristic of this topic.

#### Topic content by year

Let’s first summarise my interpretation of the topics found by gensim in the following table:

| Topic number | My interpretation |
|---|---|
| 0 | Graph partitioning and dimensionality reduction |
| 1 | Mathematical optimisation |
| 2 | Parallelisation of algorithms |
| 3 | Hardware architectures |

The following figure shows how the topic content has evolved through the years in the bibliography project.

We immediately notice that, while the content of topics 0 and 2 has been mostly stable, the hardware architectures content has been declining in the past couple of years. This is something I’ve noticed while adding publications throughout the year, and it’s mostly due to my interests shifting slightly more towards both software and theory. At the same time, the mathematical optimisation content has been increasing quite steeply. Why is this happening?
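One plausible way to obtain a per-year trend like this is to average the per-document topic shares within each publication year. The sketch below does exactly that under this assumption, which may not match how the figure was actually computed; the records are invented numbers and `mean_topic_share_by_year` is a hypothetical helper.

```python
from collections import defaultdict

def mean_topic_share_by_year(records):
    """Average the per-document topic shares within each publication year."""
    sums = {}
    counts = defaultdict(int)
    for year, shares in records:
        if year not in sums:
            sums[year] = [0.0] * len(shares)
        for i, share in enumerate(shares):
            sums[year][i] += share
        counts[year] += 1
    return {
        year: [total / counts[year] for total in totals]
        for year, totals in sorted(sums.items())
    }

# Hypothetical (year, topic shares) pairs, one per publication.
records = [
    (2020, [0.5, 0.25, 0.25, 0.0]),
    (2020, [0.0, 0.25, 0.25, 0.5]),
    (2021, [0.25, 0.5, 0.25, 0.0]),
]
trend = mean_topic_share_by_year(records)
```

Plotting one line per topic from `trend` would give a chart of the same kind as the figure; averaging, rather than summing, keeps years with many publications from dominating.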

### My informal considerations

Here are some personal observations about the data I gathered, and this year’s publications in general.

• Publications about machine learning methods tend to be associated with the mathematical optimisation topic, which is why this topic has been growing so much in the past few years.
• Publications about applications of machine learning or data science show up as having a large content of parallelisation of algorithms. I think this is because the Latent Dirichlet Allocation (LDA) is perplexed by them. This also explains why the keywords for topic 2 are so vague.
• I expect papers like Huang’s CoSA: Scheduling by Constrained Optimization for Spatial Accelerators will become very common. This paper casts the scheduling of deep learning workloads on spatial accelerators as a conventional mathematical optimisation problem.
• I think we can safely say machine learning is now simply considered a broad category of mathematical optimisation algorithms, sitting next to conventional mathematical optimisation algorithms like – just to name one – the simplex method. This also explains why so many researchers are working towards establishing solid mathematical foundations for machine learning algorithms.
• The interpenetration of machine learning and conventional mathematical optimisation is, however, occurring in both directions: more and more aspects of machine learning are being adapted to conventional mathematical optimisation, starting from the core elements of linear algebra. A practical example of this is the work of Nick Higham’s team on mixed-precision algorithms.
• The COVID-19 pandemic and COP26 have motivated researchers to write a lot about applications. Keywords like sustainability have exploded so much in the past year that I had to remove a query of mine because of the high noise.
• Since machine learning and data science require accurate tuning based on the data at hand, I expect to see an increasing number of publications about exposing the structure of the data to lower-level algorithms and primitives.
• I’m still seeing papers about optimising sparse matrix-vector multiplications! It’s interesting that such a low-level primitive still has something to offer.
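For readers less familiar with the primitive in the last point, here is a minimal, unoptimised sketch of a sparse matrix-vector product using the common compressed sparse row (CSR) layout. It is a plain reference loop written for illustration, not a method from any of the papers; the irregular access `x[col_indices[k]]` is the part that makes the kernel hard to run fast.

```python
def csr_spmv(values, col_indices, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_indices[k]]
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_indices = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_indices, row_ptr, [1.0, 2.0, 3.0])
```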

### Finally, this year’s list

 Abdelfattah A, Anzt H, Boman EG, Carson E, Cojean T, Dongarra J, Gates M, Grützmacher T, Higham NJ, Li S, Lindquist N, Liu Y, Loe J, Luszczek P, Nayak P, Pranesh S, Rajamanickam S, Ribizel T, Smith B, Swirydowicz K, Thomas S, Tomov S, Tsai YM, Yamazaki I and Yang UM (2021), "A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic", The International Journal of High Performance Computing Applications. [Abstract] [BibTeX] Abstract: Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the Machine Learning community and their demand for high compute power in low precision formats. Also the server-line products are increasingly featuring low-precision special function units, such as the NVIDIA tensor cores in ORNL's Summit supercomputer providing more than an order of magnitude higher performance than what is available in IEEE double precision. At the same time, the gap between the compute power on the one hand and the memory bandwidth on the other hand keeps increasing, making data access and communication prohibitively expensive compared to arithmetic operations. To start the multiprecision focus effort, we survey the numerical linear algebra community and summarize all existing multiprecision knowledge, expertise, and software capabilities in this landscape analysis report. We also include current efforts and preliminary results that may not yet be considered "mature technology," but have the potential to grow into production quality within the multiprecision focus effort. As we expect the reader to be familiar with the basics of numerical linear algebra, we refrain from providing a detailed background on the algorithms themselves but focus on how mixed- and multiprecision technology can help improving the performance of these methods and present highlights of application significantly outperforming the traditional fixed precision methods. 
BibTeX: @article{Abdelfattah2021, author = {Ahmad Abdelfattah and Hartwig Anzt and Erik G. Boman and Erin Carson and Terry Cojean and Jack Dongarra and Mark Gates and Thomas Grützmacher and Nicholas J. Higham and Sherry Li and Neil Lindquist and Yang Liu and Jennifer Loe and Piotr Luszczek and Pratik Nayak and Sri Pranesh and Siva Rajamanickam and Tobias Ribizel and Barry Smith and Kasia Swirydowicz and Stephen Thomas and Stanimire Tomov and Yaohung M. Tsai and Ichitaro Yamazaki and Urike Meier Yang}, title = {A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic}, journal = {The International Journal of High Performance Computing Applications}, year = {2021} }  Abdelfattah A, Barra V, Beams N, Bleile R, Brown J, Camier J-S, Carson R, Chalmers N, Dobrev V, Dudouit Y, Fischer P, Karakus A, Kerkemeier S, Kolev T, Lan Y-H, Merzari E, Min M, Phillips M, Rathnayake T, Rieben R, Stitt T, Tomboulides A, Tomov S, Tomov V, Vargas A, Warburton T and Weiss K (2021), "GPU Algorithms for Efficient Exascale Discretizations", September, 2021. [Abstract] [BibTeX] Abstract: In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems. 
BibTeX: @article{Abdelfattah2021a, author = {Ahmad Abdelfattah and Valeria Barra and Natalie Beams and Ryan Bleile and Jed Brown and Jean-Sylvain Camier and Robert Carson and Noel Chalmers and Veselin Dobrev and Yohann Dudouit and Paul Fischer and Ali Karakus and Stefan Kerkemeier and Tzanio Kolev and Yu-Hsiang Lan and Elia Merzari and Misun Min and Malachi Phillips and Thilina Rathnayake and Robert Rieben and Thomas Stitt and Ananias Tomboulides and Stanimire Tomov and Vladimir Tomov and Arturo Vargas and Tim Warburton and Kenneth Weiss}, title = {GPU Algorithms for Efficient Exascale Discretizations}, year = {2021} }  Abdelfattah A, Anzt H, Ayala A, Boman E, Carson E, Cayrols S, T.Cojean, Dongarra J, Falgout R, Gates M, Gruetzmacher T, N.Higham, Kruger S, Li X, Lindquist N, Liu Y, Loe J, Luszczek P, P.Nayak, Osei-Kuffuor D, Pranesh S, Rajamanickam S, Ribizel T, B.Smith, Swirydowicz K, Thomas S, Tomov S, Tsai Y, Yamazaki I and Yang U (2021), "Advances in Mixed PrecisionAlgorithms: 2021 Edition". Thesis at: Lawrence Livermore National Laboratory. [Abstract] [BibTeX] [URL] Abstract: Over the last year, the ECP xSDK-multiprecision effort has made tremendous progress in developing and deploying new mixed precision technology and customizing the algorithms for the hardware deployed in the ECP flagship supercomputers. The effort also has succeeded in creating a cross-laboratory community of scientists interested in mixed precision technology and now working together in deploying this technology for ECP applications. In this report, we highlight some of the most promising and impactful achievements of the last year. 
Among the highlights we present are: Mixed precision IR using a dense LU factorization and achieving a 1.8× speedup on Spock; Results and strategies for mixed precision IR using a sparse LU factorization; A mixed precision eigenvalue solver; Mixed Precision GMRES-IR being deployed in Trilinos, and achieving a speedup of 1.4× over standard GMRES; Compressed Basis (CB) GMRES being deployed in Ginkgo and achieving an average 1.4× speedup over standard GMRES; Preparing hypre for mixed precision execution; Mixed precision sparse approximate inverse preconditioners achieving an average speedup of 1.2×; Detailed description of the memory accessor separating the arithmetic precision from the memory precision, and enabling memory-bound low precision BLAS 1/2 operations to increase the accuracy by using high precision in the computations without degrading the performance. We emphasize that many of the highlights presented here have also been submitted to peer-reviewed journals or established conferences, and are under peer-review or have already been published. BibTeX: @techreport{Abdelfattah2021b, author = {A. Abdelfattah and H. Anzt and A. Ayala and E. Boman and E. Carson and S. Cayrols and T.Cojean and J. Dongarra and R. Falgout and M. Gates and T. Gruetzmacher and N.Higham and S. Kruger and X. Li and N. Lindquist and Y. Liu and J. Loe and P. Luszczek and P.Nayak and D. Osei-Kuffuor and S. Pranesh and S. Rajamanickam and T. Ribizel and B.Smith and K. Swirydowicz and S. Thomas and S. Tomov and Y. Tsai and I. Yamazaki and U.M. Yang}, title = {Advances in Mixed PrecisionAlgorithms: 2021 Edition}, school = {Lawrence Livermore National Laboratory}, year = {2021}, url = {https://www.osti.gov/servlets/purl/1814677} }  Acer S, Boman EG, Glusa CA and Rajamanickam S (2021), "Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems", Parallel Computing., April, 2021. , pp. 102769. Elsevier BV. 
[Abstract] [BibTeX] [DOI] Abstract: Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory-parallel, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. In Sphynx, we allow using different preconditioners and exploit their unique advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on the GPU performance. We perform those evaluations on two distinct classes of graphs: regular (such as meshes, matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. The experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains a partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner. BibTeX: @article{Acer2021, author = {Seher Acer and Erik G. Boman and Christian A. 
Glusa and Sivasankaran Rajamanickam}, title = {Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems}, journal = {Parallel Computing}, publisher = {Elsevier BV}, year = {2021}, pages = {102769}, doi = {10.1016/j.parco.2021.102769} }  Acer S, Azad A, Boman EG, Buluç A, Devine KD, Ferdous S, Gawande N, Ghosh S, Halappanavar M, Kalyanaraman A, Khan A, Minutoli M, Pothen A, Rajamanickam S, Selvitopi O, Tallent NR and Tumeo A (2021), "EXAGRAPH: Graph and combinatorial methods for enabling exascale applications", The International Journal of High Performance Computing Applications., September, 2021. , pp. 109434202110292. SAGE Publications. [Abstract] [BibTeX] [DOI] Abstract: Combinatorial algorithms in general and graph algorithms in particular play a critical enabling role in numerous scientific applications. However, the irregular memory access nature of these algorithms makes them one of the hardest algorithmic kernels to implement on parallel systems. With tens of billions of hardware threads and deep memory hierarchies, the exascale computing systems in particular pose extreme challenges in scaling graph algorithms. The codesign center on combinatorial algorithms, ExaGraph, was established to design and develop methods and techniques for efficient implementation of key combinatorial (graph) algorithms chosen from a diverse set of exascale applications. Algebraic and combinatorial methods have a complementary role in the advancement of computational science and engineering, including playing an enabling role on each other. In this paper, we survey the algorithmic and software development activities performed under the auspices of ExaGraph from both a combinatorial and an algebraic perspective. In particular, we detail our recent efforts in porting the algorithms to manycore accelerator (GPU) architectures. 
We also provide a brief survey of the applications that have benefited from the scalable implementations of different combinatorial algorithms to enable scientific discovery at scale. We believe that several applications will benefit from the algorithmic and software tools developed by the ExaGraph team. BibTeX: @article{Acer2021a, author = {Seher Acer and Ariful Azad and Erik G Boman and Aydın Buluç and Karen D. Devine and SM Ferdous and Nitin Gawande and Sayan Ghosh and Mahantesh Halappanavar and Ananth Kalyanaraman and Arif Khan and Marco Minutoli and Alex Pothen and Sivasankaran Rajamanickam and Oguz Selvitopi and Nathan R Tallent and Antonino Tumeo}, title = {EXAGRAPH: Graph and combinatorial methods for enabling exascale applications}, journal = {The International Journal of High Performance Computing Applications}, publisher = {SAGE Publications}, year = {2021}, pages = {109434202110292}, doi = {10.1177/10943420211029299} }  Adil M, Tavakkol S and Madani R (2021), "Rapid Convergence of First-Order Numerical Algorithms via Adaptive Conditioning", March, 2021. [Abstract] [BibTeX] Abstract: This paper is an attempt to remedy the problem of slow convergence for first-order numerical algorithms by proposing an adaptive conditioning heuristic. First, we propose a parallelizable numerical algorithm that is capable of solving large-scale conic optimization problems on distributed platforms such as graphics processing unit with orders-of-magnitude time improvement. Proof of global convergence is provided for the proposed algorithm. We argue that on the contrary to common belief, the condition number of the data matrix is not a reliable predictor of convergence speed. In light of this observation, an adaptive conditioning heuristic is proposed which enables higher accuracy compared to other first-order numerical algorithms. 
Numerical experiments on a wide range of large-scale linear programming and second-order cone programming problems demonstrate the scalability and computational advantages of the proposed algorithm compared to commercial and open-source state-of-the-art solvers. BibTeX: @article{Adil2021, author = {Muhammad Adil and Sasan Tavakkol and Ramtin Madani}, title = {Rapid Convergence of First-Order Numerical Algorithms via Adaptive Conditioning}, year = {2021} }  Adjé A, Khalifa DB and Martel M (2021), "Fast and Efficient Bit-Level Precision Tuning", March, 2021. [Abstract] [BibTeX] Abstract: In this article, we introduce a new technique for precision tuning. This problem consists of finding the least data types for numerical values such that the result of the computation satisfies some accuracy requirement. State of the art techniques for precision tuning use a try and fail approach. They change the data types of some variables of the program and evaluate the accuracy of the result. Depending on what is obtained, they change more or less data types and repeat the process. Our technique is radically different. Based on semantic equations, we generate an Integer Linear Problem (ILP) from the program source code. Basically, this is done by reasoning on the most significant bit and the number of significant bits of the values which are integer quantities. The integer solution to this problem, computed in polynomial time by a (real) linear programming solver, gives the optimal data types at the bit level. A finer set of semantic equations is also proposed which does not reduce directly to an ILP problem. So we use policy iteration to find the solution. Both techniques have been implemented and we show that our results encompass the results of state of the art tools. 
BibTeX: @article{Adje2021, author = {Assalé Adjé and Dorra Ben Khalifa and Matthieu Martel}, title = {Fast and Efficient Bit-Level Precision Tuning}, year = {2021} }  Advani R and O'Hagan S (2021), "Efficient Algorithms for Constructing an Interpolative Decomposition", May, 2021. [Abstract] [BibTeX] Abstract: Low-rank approximations are essential in modern data science. The interpolative decomposition provides one such approximation. Its distinguishing feature is that it reuses columns from the original matrix. This enables it to preserve matrix properties such as sparsity and non-negativity. It also helps save space in memory. In this work, we introduce two optimized algorithms to construct an interpolative decomposition along with numerical evidence that they outperform the current state of the art. BibTeX: @article{Advani2021, author = {Rishi Advani and Sean O'Hagan}, title = {Efficient Algorithms for Constructing an Interpolative Decomposition}, year = {2021} }  Afibuzzaman M (2021), "Optimization of Large Scale Iterative Eigensolvers". Thesis at: Michigan State University. [Abstract] [BibTeX] Abstract: Sparse matrix computations, in the form of solvers for systems of linear equations, eigenvalue problem or matrix factorizations constitute the main kernel in problems from fields as diverse as computational fluid dynamics, quantum many body problems, machine learning and graph analytics. Iterative eigensolvers have been preferred over the regular method because the regular method not being feasible with industrial sized matrices. Although dense linear algebra libraries like BLAS, LAPACK, SCALAPACK are well established and some vendor optimized implementation like mkl from Intel or Cray Libsci exist, it is not the same case for sparse linear algebra which is lagging far behind. The main reason behind slow progress in the standardization of sparse linear algebra or library development is the different forms and properties depending on the application area. 
It is worsened for deep memory hierarchies of modern architectures due to low arithmetic intensities and memory bound computations. Minimization of data movement and fast access to the matrix are critical in this case. Since the current technology is driven by deep memory architectures where we get the increased capacity at the expense of increased latency and decreased bandwidth when we go further from the processors. The key to achieve high performance in sparse matrix computations in deep memory hierarchy is to minimize data movement across layers of the memory and overlap data movement with computations. My thesis work contributes towards addressing the algorithmic challenges and developing a computational infrastructure to achieve high performance in scientific applications for both shared memory and distributed memory architectures. For this purpose, I started working on optimizing a blocked eigensolver and optimized specific computational kernels which uses a new storage format. Using this optimization as a building block, we introduce a shared memory task parallel framework focusing on optimizing the entire solvers rather than a specific kernel. Before extending this shared memory implementation to a distributed memory architecture, I simulated the communication pattern and overheads of a large scale distributed memory application and then I introduce the communication tasks in the framework to overlap communication and computation. Additionally, I also tried to find a custom scheduler for the tasks using a graph partitioner. To get acquainted with high performance computing and parallel libraries, I started my PhD journey with optimizing a DFT code named Sky3D where I used dense matrix libraries. Despite there might not be any single solution for this problem, I tried to find an optimized solution. 
Though the large distributed memory application MFDn is kind of the driver project of the thesis, but the framework we developed is not confined to MFDn only, rather it can be used for other scientific applications too. The output of this thesis is the task parallel HPC infrastructure that we envisioned for both shared and distributed memory architectures BibTeX: @phdthesis{Afibuzzaman2021, author = {Afibuzzaman, Md}, title = {Optimization of Large Scale Iterative Eigensolvers}, school = {Michigan State University}, year = {2021} }  Aggarwal I, Kashi A, Nayak P, Balos CJ, Woodward CS and Anzt H (2021), "Batched Sparse Iterative Solvers for Computational Chemistry Simulations on GPUs", In Proceedings of the 202112th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems., November, 2021. IEEE. [BibTeX] [DOI] BibTeX: @inproceedings{Aggarwal2021, author = {Isha Aggarwal and Aditya Kashi and Pratik Nayak and Cody J. Balos and Carol S. Woodward and Hartwig Anzt}, title = {Batched Sparse Iterative Solvers for Computational Chemistry Simulations on GPUs}, booktitle = {Proceedings of the 202112th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems}, publisher = {IEEE}, year = {2021}, doi = {10.1109/scala54577.2021.00010} }  Ahmad N, Ylmaz B and Unat D (2021), "A Split Execution Model for SpTRSV", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Sparse Triangular Solve (SpTRSV) is an important and extensively used kernel in scientific computing. Parallelism within SpTRSV depends upon matrix sparsity pattern and, in many cases, is non-uniform from one computational step to the next. 
In cases where the SpTRSV computational steps have contrasting parallelism characteristics some steps are more parallel, others more sequential in nature, the performance of an SpTRSV algorithm may be limited by the contrasting parallelism characteristics. In this work, we propose a split-execution model for SpTRSV to automatically divide SpTRSV computation into two sub-SpTRSV systems and an SpMV, such that one of the sub-SpTRSVs has more parallelism than the other. Each sub-SpTRSV is then computed by using a different SpTRSV algorithm and possibly executes on a different platform (CPU or GPU). By analyzing the SpTRSV Directed Acyclic Graph (DAG) and matrix sparsity features, we use a heuristics-based approach to (i) automatically determine suitability of an SpTRSV for split-execution, (ii) find the appropriate split-point, and (iii) execute SpTRSV in a split fashion using two SpTRSV algorithms while managing any required inter-platform communication. Experimental evaluation of the execution model on two CPU-GPU machines with matrix dataset of 327 matrices from the SuiteSparse Matrix Collection shows that our approach correctly selects the fastest SpTRSV method (split or unsplit) for 88% of matrices on the Intel Xeon Gold (6148) + NVIDIA Tesla V100 and 83% on the Intel Core I7 + NVIDIA G1080 Ti platform achieving speedups in the range of 1.01 10 and 1.03 6.36, respectively. BibTeX: @article{Ahmad2021, author = {Najeeb Ahmad and Buse Ylmaz and Didem Unat}, title = {A Split Execution Model for SpTRSV}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tpds.2021.3074501} }  Ahmadi A, Manganiello F, Khademi A and Smith M (2021), "A Parallel Jacobi-Embedded Gauss-Seidel Method", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). 
[Abstract] [BibTeX] [DOI] Abstract: A broad range of scientific simulations involve solving large-scale computationally expensive linear systems of equations. Iterative techniques are typically preferred over direct methods when it comes to large systems due to their lower memory requirements and shorter execution times. Gauss-Seidel (GS) is an iterative method for solving linear systems that are either strictly diagonally dominant or symmetric positive definite. This technique is an improved version of Jacobi and typically converges in fewer iterations. However, the sequential nature of this algorithm complicates the parallel extraction. In fact, most parallel derivatives of GS rely on the sparsity pattern of the coefficient matrix and require matrix reordering or domain decomposition. In this paper, we introduce a new algorithm that exploits the convergence property of GS and adapts the parallel structure of Jacobi. The proposed method works for both dense and sparse systems and is straightforward to implement. We have thoroughly examined the performance of our method on multicore and many-core architectures. Experimental results demonstrate the superior performance of proposed algorithm compared with GS and Jacobi. Additionally, performance comparison with built-in Krylov solvers in MATLAB showed that in terms of time per iteration, Krylov methods perform faster on CPUs, but our approach is significantly better when executed on GPUs. 
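The trade-off this abstract refers to can be seen in the two classical sweeps. The sketch below is not the method of the paper; it only contrasts textbook Jacobi (every update reads the previous iterate, so all rows can be updated in parallel) with textbook Gauss-Seidel (each update reuses values already computed in the same sweep, which is sequential but usually needs fewer iterations). The 3×3 system, the tolerance, and the helper names are illustrative assumptions.

```python
def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every update reads only the previous iterate x."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: row i reuses entries updated earlier in this sweep."""
    n = len(b)
    x = list(x)  # work on a copy so the caller's iterate is untouched
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def solve(sweep, A, b, tol=1e-10, max_iter=1000):
    """Repeat a sweep until two successive iterates differ by less than tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x_new = sweep(A, b, x)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    return x, max_iter


# Hypothetical strictly diagonally dominant system with exact solution [1, 2, 3].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

x_jacobi, iters_jacobi = solve(jacobi_sweep, A, b)
x_gs, iters_gs = solve(gauss_seidel_sweep, A, b)
print("Jacobi iterations:", iters_jacobi)
print("Gauss-Seidel iterations:", iters_gs)
```

On this small system Gauss-Seidel converges in fewer sweeps than Jacobi, while only the Jacobi sweep is trivially parallel across rows.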
BibTeX: @article{Ahmadi2021, author = {Afshin Ahmadi and Felice Manganiello and Amin Khademi and Melissa Smith}, title = {A Parallel Jacobi-Embedded Gauss-Seidel Method}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tpds.2021.3052091} }  Ahn DH, Baker AH, Bentley M, Briggs I, Gopalakrishnan G, Hammerling DM, Laguna I, Lee GL, Milroy DJ and Vertenstein M (2021), "Keeping science on keel when software moves", Communications of the ACM., January, 2021. Vol. 64(2), pp. 66-74. Association for Computing Machinery (ACM). [Abstract] [BibTeX] [DOI] Abstract: An approach to reproducibility problems related to porting software across machines and compilers. BibTeX: @article{Ahn2021, author = {Dong H. Ahn and Allison H. Baker and Michael Bentley and Ian Briggs and Ganesh Gopalakrishnan and Dorit M. Hammerling and Ignacio Laguna and Gregory L. Lee and Daniel J. Milroy and Mariana Vertenstein}, title = {Keeping science on keel when software moves}, journal = {Communications of the ACM}, publisher = {Association for Computing Machinery (ACM)}, year = {2021}, volume = {64}, number = {2}, pages = {66--74}, doi = {10.1145/3382037} }  Ahookhosh M and Nesterov Y (2021), "High-order methods beyond the classical complexity bounds, I: inexact high-order proximal-point methods", July, 2021. [Abstract] [BibTeX] Abstract: In this paper, we introduce a Bi-level OPTimization (BiOPT) framework for minimizing the sum of two convex functions, where both can be nonsmooth. The BiOPT framework involves two levels of methodologies. At the upper level of BiOPT, we first regularize the objective by a (p+1)th-order proximal term and then develop the generic inexact high-order proximal-point scheme and its acceleration using the standard estimation sequence technique. 
At the lower level, we solve the corresponding pth-order proximal auxiliary problem inexactly either by one iteration of the pth-order tensor method or by a lower-order non-Euclidean composite gradient scheme with the complexity 𝒪(log(1/ε)), for the accuracy parameter ε > 0. Ultimately, if the accelerated proximal-point method is applied at the upper level, and the auxiliary problem is handled by a non-Euclidean composite gradient scheme, then we end up with a 2q-order method with the convergence rate 𝒪(k^{-(p+1)}), for q = ⌊p/2⌋, where k is the iteration counter. BibTeX: @article{Ahookhosh2021, author = {Masoud Ahookhosh and Yurii Nesterov}, title = {High-order methods beyond the classical complexity bounds, I: inexact high-order proximal-point methods}, year = {2021} }  Aksoy S, Young S, Firoz J, Gioiosa R, Raugas M and Escobedo J (2021), "SpectralFly: Ramanujan Graphs as Flexible and Efficient Interconnection Networks", April, 2021. [Abstract] [BibTeX] Abstract: In recent years, graph theoretic considerations have become increasingly important in the design of HPC interconnection topologies. One approach is to seek optimal or near-optimal families of graphs with respect to a particular graph theoretic property, such as diameter. In this work, we consider topologies which optimize the spectral gap. In particular, we study a novel HPC topology, SpectralFly, designed around the Ramanujan graph construction of Lubotzky, Phillips, and Sarnak (LPS). We show combinatorial properties, such as diameter, bisection bandwidth, average path length, and resilience to link failure, of SpectralFly topologies are better than, or comparable to, similarly constrained DragonFly, SlimFly, and BundleFly topologies. 
Additionally, we simulate the performance of SpectralFly topologies on a representative sample of physics-inspired HPC workloads using the Structure Simulation Toolkit Macroscale Element Library simulator and demonstrate considerable benefit to using the LPS construction as the basis of the SpectralFly topology. BibTeX: @article{Aksoy2021, author = {Sinan Aksoy and Stephen Young and Jesun Firoz and Roberto Gioiosa and Mark Raugas and Juan Escobedo}, title = {SpectralFly: Ramanujan Graphs as Flexible and Efficient Interconnection Networks}, year = {2021} }  Alanne K and Sierla S (2021), "An overview of machine learning applications for smart buildings", October, 2021. , pp. 103445. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The efficiency, flexibility, and resilience of building-integrated energy systems are challenged by unpredicted changes in operational environments due to climate change and its consequences. On the other hand, the rapid evolution of artificial intelligence (AI) and machine learning (ML) has equipped buildings with an ability to learn. A lot of research has been dedicated to specific machine learning applications for specific phases of a building's life-cycle. The reviews commonly take a specific, technological perspective without a vision for the integration of smart technologies at the level of the whole system. Especially, there is a lack of discussion on the roles of autonomous AI agents and training environments for boosting the learning process in complex and abruptly changing operational environments. This review article discusses the learning ability of buildings with a system-level perspective and presents an overview of autonomous machine learning applications that make independent decisions for building energy management. We conclude that the buildings’ adaptability to unpredicted changes can be enhanced at the system level through AI-initiated learning processes and by using digital twins as training environments. 
The greatest potential for energy efficiency improvement is achieved by integrating adaptability solutions at the timescales of HVAC control and electricity market participation. BibTeX: @article{Alanne2021, author = {Kari Alanne and Seppo Sierla}, title = {An overview of machine learning applications for smart buildings}, publisher = {Elsevier BV}, year = {2021}, pages = {103445}, doi = {10.1016/j.scs.2021.103445} }  Aldbaissy R, Hecht F, Mansour G, Sayah T and Tournier PH (2021), "Scalable domain decomposition preconditioner for Navier–Stokes equations coupled with the heat equation", International Journal of Computer Mathematics., May, 2021. , pp. 1-16. Informa UK Limited. [Abstract] [BibTeX] [DOI] Abstract: In this article, we study the thermal instability that appears from time to time while printing using a 3D printer. To solve the semi-discretized problem at each time-step, we use a scalable parallel algorithm based on a two-level Optimized Restricted Additive Schwarz (ORAS) domain decomposition preconditioner for GMRES. Parallel scalability tests are conducted with comparison against the parallel direct solver MUMPS and the one-level Schwarz method, which show lack of robustness for larger number of processors. 2D numerical tests illustrate that the number of iterations to reach GMRES convergence depends on the state of the physical simulation during time, and that the second level of preconditioning is needed to achieve robustness. 
BibTeX: @article{Aldbaissy2021, author = {Rim Aldbaissy and Frédéric Hecht and Gihane Mansour and Toni Sayah and Pierre Henri Tournier}, title = {Scalable domain decomposition preconditioner for Navier–Stokes equations coupled with the heat equation}, journal = {International Journal of Computer Mathematics}, publisher = {Informa UK Limited}, year = {2021}, pages = {1--16}, doi = {10.1080/00207160.2021.1925888} }  Aldinucci M, Agosta G, Andreini A, Ardagna CA, Bartolini A, Cilardo A, Cosenza B, Danelutto M, Esposito R, Fornaciari W, Giorgi R, Lengani D, Montella R, Olivieri M and Saponara S (2021), "The Italian research on HPC key technologies across EuroHPC", In Proceedings of the 2021 ACM International Conference on Computing Frontier. , pp. 1-7. [Abstract] [BibTeX] [DOI] Abstract: High-Performance Computing (HPC) is one of the strategic priorities for research and innovation worldwide due to its relevance for industrial and scientific applications. We envision HPC as composed of three pillars: infrastructures, applications, and key technologies and tools. While infrastructures are by construction centralized in large-scale HPC centers, and applications are generally within the purview of domain-specific organizations, key technologies fall in an intermediate case where coordination is needed, but design and development are often decentralized. A large group of Italian researchers has started a dedicated laboratory within the National Interuniversity Consortium for Informatics (CINI) to address this challenge. The laboratory, albeit young, has managed to succeed in its first attempts to propose a coordinated approach to HPC research within the EuroHPC Joint Undertaking, participating in the calls 2019-20 to five successful proposals for an aggregate total cost of 95M Euro. 
In this paper, we outline the working group's scope and goals and provide an overview of the five funded projects, which become fully operational in March 2021, and cover a selection of key technologies provided by the working group partners, highlighting their usage development within the projects. BibTeX: @inproceedings{Aldinucci2021, author = {Aldinucci, Marco and Agosta, Giovanni and Andreini, Antonio and Ardagna, Claudio A. and Bartolini, Andrea and Cilardo, Alessandro and Cosenza, Biagio and Danelutto, Marco and Esposito, Roberto and Fornaciari, William and Giorgi, Roberto and Lengani, Davide and Montella, Raffaele and Olivieri, Mauro and Saponara, Sergio}, title = {The Italian research on HPC key technologies across EuroHPC}, booktitle = {Proceedings of the 2021 ACM International Conference on Computing Frontier}, year = {2021}, pages = {1--7}, doi = {10.1145/3457388.3458508} }  Alhaddad S, Förstner J, Groth S, Grünewald D, Grynko Y, Hannig F, Kenter T, Pfreundt F-J, Plessl C, Schotte M, Steinke T, Teich J, Weiser M and Wende F (2021), "HighPerMeshes – A Domain-Specific Language for Numerical Algorithms on Unstructured Grids", In Euro-Par 2020: Parallel Processing Workshops. , pp. 185-196. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: Solving partial differential equations on unstructured grids is a cornerstone of engineering and scientific computing. Nowadays, heterogeneous parallel platforms with CPUs, GPUs, and FPGAs enable energy-efficient and computationally demanding simulations. We developed the HighPerMeshes C++-embedded Domain-Specific Language (DSL) for bridging the abstraction gap between the mathematical and algorithmic formulation of mesh-based algorithms for PDE problems on the one hand and an increasing number of heterogeneous platforms with their different parallel programming and runtime models on the other hand. 
Thus, the HighPerMeshes DSL aims at higher productivity in the code development process for multiple target platforms. We introduce the concepts as well as the basic structure of the HighPerMeshes DSL, and demonstrate its usage with three examples, a Poisson and monodomain problem, respectively, solved by the continuous finite element method, and the discontinuous Galerkin method for Maxwell’s equation. The mapping of the abstract algorithmic description onto parallel hardware, including distributed memory compute clusters, is presented. Finally, the achievable performance and scalability are demonstrated for a typical example problem on a multi-core CPU cluster. BibTeX: @incollection{Alhaddad2021, author = {Samer Alhaddad and Jens Förstner and Stefan Groth and Daniel Grünewald and Yevgen Grynko and Frank Hannig and Tobias Kenter and Franz-Josef Pfreundt and Christian Plessl and Merlind Schotte and Thomas Steinke and Jürgen Teich and Martin Weiser and Florian Wende}, title = {HighPerMeshes – A Domain-Specific Language for Numerical Algorithms on Unstructured Grids}, booktitle = {Euro-Par 2020: Parallel Processing Workshops}, publisher = {Springer International Publishing}, year = {2021}, pages = {185--196}, doi = {10.1007/978-3-030-71593-9_15} }  Aliaga JI, Anzt H, Quintana-Ortí ES, Tomás AE and Tsai YM (2021), "Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs", In Euro-Par 2020: Parallel Processing Workshops. , pp. 83-95. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: We contribute to the optimization of the sparse matrix-vector product on graphics processing units by introducing a variant of the coordinate sparse matrix layout that compresses the integer representation of the matrix indices. In addition, we employ a look-ahead table to avoid the storage of repeated numerical values in the sparse matrix, yielding a more compact data representation that is easier to maintain in the cache. 
Our evaluation on the two most recent generations of NVIDIA GPUs, the V100 and the A100 architectures, shows considerable performance improvements over the kernels for the sparse matrix-vector product in cuSPARSE (CUDA 11.0.167). BibTeX: @incollection{Aliaga2021, author = {José Ignacio Aliaga and Hartwig Anzt and Enrique S. Quintana-Ortí and Andrés E. Tomás and Yuhsiang M. Tsai}, title = {Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs}, booktitle = {Euro-Par 2020: Parallel Processing Workshops}, publisher = {Springer International Publishing}, year = {2021}, pages = {83--95}, doi = {10.1007/978-3-030-71593-9_7} }  Aliaga JI, Anzt H, Grützmacher T, Quintana-Ortí ES and Tomás AE (2021), "Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units", Concurrency and Computation: Practice and Experience., July, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: We contribute to the optimization of the sparse matrix-vector product by introducing a variant of the coordinate sparse matrix format that balances the workload distribution and compresses both the indexing arrays and the numerical information. Our approach is multi-platform, in the sense that the realizations for (general-purpose) multicore processors as well as graphics accelerators (GPUs) are built upon common principles, but differ in the implementation details, which are adapted to avoid thread divergence in the GPU case or maximize compression element-wise (i.e., for each matrix entry) for multicore architectures. Our evaluation on the two last generations of NVIDIA GPUs as well as Intel and AMD processors demonstrate the benefits of the new kernels when compared with the optimized implementations of the sparse matrix-vector product in NVIDIA's cuSPARSE and Intel's MKL, respectively. BibTeX: @article{Aliaga2021a, author = {José I. Aliaga and Hartwig Anzt and Thomas Grützmacher and Enrique S. 
Quintana-Ortí and Andrés E. Tomás}, title = {Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6515} }  Aljundi AA, Akyıldız TA and Kaya K (2021), "Boosting Graph Embedding on a Single GPU", October, 2021. [Abstract] [BibTeX] Abstract: Graphs are ubiquitous, and they can model unique characteristics and complex relations of real-life systems. Although using machine learning (ML) on graphs is promising, their raw representation is not suitable for ML algorithms. Graph embedding represents each node of a graph as a d-dimensional vector which is more suitable for ML tasks. However, the embedding process is expensive, and CPU-based tools do not scale to real-world graphs. In this work, we present GOSH, a GPU-based tool for embedding large-scale graphs with minimum hardware constraints. GOSH employs a novel graph coarsening algorithm to enhance the impact of updates and minimize the work for embedding. It also incorporates a decomposition schema that enables any arbitrarily large graph to be embedded with a single GPU. As a result, GOSH sets a new state-of-the-art in link prediction both in accuracy and speed, and delivers high-quality embeddings for node classification at a fraction of the time compared to the state-of-the-art. For instance, it can embed a graph with over 65 million vertices and 1.8 billion edges in less than 30 minutes on a single GPU. 
BibTeX: @article{Aljundi2021, author = {Amro Alabsi Aljundi and Taha Atahan Akyıldız and Kamer Kaya}, title = {Boosting Graph Embedding on a Single GPU}, year = {2021} }  Al-Mohy A, Higham NJ and Liu X (2021), "Arbitrary Precision Algorithms for Computing the Matrix Cosine and its Fréchet Derivative" [Abstract] [BibTeX] Abstract: Existing algorithms for computing the matrix cosine are tightly coupled to a specific precision of floating-point arithmetic for optimal efficiency so they do not conveniently extend to an arbitrary precision environment. We develop an algorithm for computing the matrix cosine that takes the unit roundoff of the working precision as input, and so works in an arbitrary precision. The algorithm employs a Taylor approximation with scaling and recovering and it can be used with a Schur decomposition or in a decomposition-free manner. We also derive a framework for computing the Fréchet derivative, construct an efficient evaluation scheme for computing the cosine and its Fréchet derivative simultaneously in arbitrary precision, and show how this scheme can be extended to compute the matrix sine, cosine, and their Fréchet derivatives all together. Numerical experiments show that the new algorithms behave in a forward stable way over a wide range of precisions. The transformation-free version of the algorithm for computing the cosine is competitive in accuracy with the state-of-the-art algorithms in double precision and surpasses existing alternatives in both speed and accuracy in working precisions higher than double. BibTeX: @article{AlMohy2021, author = {Al-Mohy, Awad and Higham, Nicholas J. 
and Liu,Xiaobo}, title = {Arbitrary Precision Algorithms for Computing the Matrix Cosine and its Fréchet Derivative}, year = {2021} }  Alperen A, Afibuzzaman M, Rabbi F, Ozkaya MY, Catalyurek U and Aktulga HM (2021), "An Evaluation of Task-Parallel Frameworks for Sparse Solvers on Multicore and Manycore CPU Architectures", In 50th International Conference on Parallel Processing., August, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: Recently, several task-parallel programming models have emerged to address the high synchronization and load imbalance issues as well as data movement overheads in modern shared memory architectures. OpenMP, the most commonly used shared memory parallel programming model, has added task execution support with dataflow dependencies. HPX and Regent are two more recent runtime systems that also support the dataflow execution model and extend it to distributed memory environments. We focus on parallelization of sparse matrix computations on shared memory architectures. We evaluate the OpenMP, HPX and Regent runtime systems in terms of performance and ease of implementation, and compare them against the traditional BSP model for two popular eigensolvers, Lanczos and LOBPCG. We give a general outline in regards to achieving parallelism using these runtime systems, and present a heuristic for tuning their performance to balance tasking overheads with the degree of parallelism that can be exposed. We then demonstrate their merits on two architectures, Intel Broadwell (a multicore processor) and AMD EPYC (a modern manycore processor). We observe that these frameworks achieve up to 13.7 × fewer cache misses over an efficient BSP implementation across L1, L2 and L3 cache layers. They also obtain up to 9.9 × improvement in execution time over the same BSP implementation. BibTeX: @inproceedings{Alperen2021, author = {Abdullah Alperen and Md Afibuzzaman and Fazlay Rabbi and M. 
Yusuf Ozkaya and Umit Catalyurek and Hasan Metin Aktulga}, title = {An Evaluation of Task-Parallel Frameworks for Sparse Solvers on Multicore and Manycore CPU Architectures}, booktitle = {50th International Conference on Parallel Processing}, publisher = {ACM}, year = {2021}, doi = {10.1145/3472456.3472476} }  Amestoy P, Boiteau O, Buttari A, Gerest M, Jézéquel F, L’Excellent J-Y and Mary T (2021), "Mixed Precision Low Rank Approximations and their Application to Block Low Rank LU Factorization" [Abstract] [BibTeX] [URL] Abstract: We introduce a novel approach to exploit mixed precision arithmetic for low-rank approximations. Our approach is based on the observation that singular vectors associated with small singular values can be stored in lower precisions while preserving high accuracy overall. We provide an explicit criterion to determine which level of precision is needed for each singular vector. We apply this approach to block low-rank (BLR) matrices, most of whose off-diagonal blocks have low rank. We propose a new BLR LU factorization algorithm that exploits the mixed precision representation of the blocks. We carry out the rounding error analysis of this algorithm and prove that the use of mixed precision arithmetic does not compromise the numerical stability of BLR LU factorization. Moreover our analysis determines which level of precision is needed for each floating-point operation (flop), and therefore guides us towards an implementation that is both robust and efficient. We evaluate the potential of this new algorithm on a range of matrices coming from real-life problems in industrial and academic applications. We show that a large fraction of the entries in the LU factors and flops to perform the BLR LU factorization can be safely switched to lower precisions, leading to significant reductions of the storage and flop costs, of up to a factor three using fp64, fp32, and bfloat16 arithmetics. 
BibTeX: @techreport{Amestoy2021, author = {Patrick Amestoy and Olivier Boiteau and Alfredo Buttari and Matthieu Gerest and Fabienne Jézéquel and Jean-Yves L’Excellent and Théo Mary}, title = {Mixed Precision Low Rank Approximations and their Application to Block Low Rank LU Factorization}, year = {2021}, url = {https://hal.archives-ouvertes.fr/hal-03251738/document} }  Aminifard Z, Hosseini A and Babaie-Kafaki S (2022), "Modified conjugate gradient method for solving sparse recovery problem with nonconvex penalty", Signal Processing., April, 2022. Vol. 193, pp. 108424. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Sparse recovery is a strategy for effectively reconstructing a signal by obtaining sparse solutions of underdetermined linear systems. As an important feature of a signal, sparsity is often measured by the ℓ1-norm. However, the approximate solutions obtained via ℓ1-norm regularization mostly underestimate the original signal. To overcome this defect, here we employ a class of nonconvex penalty functions proposed by Selesnick and Farshchian which preserve convexity of the cost function under certain conditions. To solve the problem, we suggest a nonmonotone modification of the generalized shrinkage conjugate gradient method proposed by Esmaeili et al., based on a modified secant equation. We establish global convergence of the method with standard suppositions. Numerical tests are made to shed light on performance of the proposed method. BibTeX: @article{Aminifard2022, author = {Zohre Aminifard and Alireza Hosseini and Saman Babaie-Kafaki}, title = {Modified conjugate gradient method for solving sparse recovery problem with nonconvex penalty}, journal = {Signal Processing}, publisher = {Elsevier BV}, year = {2022}, volume = {193}, pages = {108424}, doi = {10.1016/j.sigpro.2021.108424} }  Anagnostidis S, Lucchi A and Diouane Y (2021), "Direct-Search for a Class of Stochastic Min-Max Problems", February, 2021. 
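The observation in this abstract, that singular vectors attached to small singular values tolerate lower precision, can be tried out numerically. The sketch below is an illustration only and does not reproduce the criterion or the BLR algorithm of the report. It assumes NumPy is available, builds a synthetic matrix with singular values 2^-i, and uses an assumed rule of thumb: a singular triplet goes to fp32 when `sigma_i * u_low <= eps * ||A||`, where `u_low` is the fp32 unit roundoff and `eps` is a hypothetical target accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 60, 1e-10                      # illustrative size and target accuracy
u_low = np.finfo(np.float32).eps / 2    # unit roundoff of fp32 (2**-24)

# Synthetic test matrix A = U diag(sigma) V^T with rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n, dtype=np.float64)
A = (U * sigma) @ V.T
norm_A = sigma[0]                       # spectral norm of A by construction

# Truncate at accuracy eps, then split the kept triplets by the assumed rule.
keep = sigma > eps * norm_A
low = keep & (sigma * u_low <= eps * norm_A)   # stored in fp32
high = keep & ~low                             # stored in fp64


def rounded(M, dtype):
    """Simulate storage in a lower precision, then promote back to fp64."""
    return M.astype(dtype).astype(np.float64)


A_fp64 = (U[:, keep] * sigma[keep]) @ V[:, keep].T
A_mixed = (U[:, high] * sigma[high]) @ V[:, high].T \
    + (rounded(U[:, low], np.float32) * sigma[low]) @ rounded(V[:, low], np.float32).T

err_fp64 = np.linalg.norm(A - A_fp64, 2) / norm_A
err_mixed = np.linalg.norm(A - A_mixed, 2) / norm_A
print("vectors in fp64:", int(high.sum()), "| vectors in fp32:", int(low.sum()))
print("relative error, all fp64:", err_fp64)
print("relative error, mixed   :", err_mixed)
```

In this synthetic setting most of the retained singular vectors end up in fp32, while the relative error of the mixed representation stays within a small multiple of the target accuracy.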
[Abstract] [BibTeX] Abstract: Recent applications in machine learning have renewed the interest of the community in min-max optimization problems. While gradient-based optimization methods are widely used to solve such problems, there are however many scenarios where these techniques are not well-suited, or even not applicable when the gradient is not accessible. We investigate the use of direct-search methods that belong to a class of derivative-free techniques that only access the objective function through an oracle. In this work, we design a novel algorithm in the context of min-max saddle point games where one sequentially updates the min and the max player. We prove convergence of this algorithm under mild assumptions, where the objective of the max-player satisfies the Polyak-Łojasiewicz (PL) condition, while the min-player is characterized by a nonconvex objective. Our method only assumes dynamically adjusted accurate estimates of the oracle with a fixed probability. To the best of our knowledge, our analysis is the first one to address the convergence of a direct-search method for min-max objectives in a stochastic setting. BibTeX: @article{Anagnostidis2021, author = {Sotiris Anagnostidis and Aurelien Lucchi and Youssef Diouane}, title = {Direct-Search for a Class of Stochastic Min-Max Problems}, year = {2021} }  Anagreh M, Vainikko E and Laud P (2021), "Parallel Privacy-Preserving Shortest Paths by Radius-Stepping", In Proceedings of the 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing., March, 2021. , pp. 276-280. [Abstract] [BibTeX] [DOI] Abstract: The radius-stepping algorithm is an efficient, parallelizable algorithm for finding the shortest paths in graphs. It addresses the problem with the Δ-Stepping algorithm, which has no known theoretical bounds for general graphs. In this paper, we describe a parallel privacy-preserving method for finding Single-Source Shortest Paths (SSSP). 
Our optimized method is based on the Radius-Stepping algorithm. The method is implemented on top of the Secure Multiparty Computation (SMC) Sharemind platform. We have re-shaped the radius-stepping algorithm to work on vectors representing the graph in a SIMD manner, in order to enable a fast execution using the secret-sharing based SMC protocol set of Sharemind. The results of the real implementation show an efficient method that reduced the execution time hundreds of times in comparison with a standard case of the privacy-preserving radius-stepping and Δ-Stepping algorithms. BibTeX: @inproceedings{Anagreh2021, author = {Anagreh, Mohammad and Vainikko, Eero and Laud, Peeter}, title = {Parallel Privacy-Preserving Shortest Paths by Radius-Stepping}, booktitle = {Proceedings of the 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing}, year = {2021}, pages = {276--280}, doi = {10.1109/PDP52278.2021.00051} }  Aravkin AY, Baraldi R and Orban D (2021), "A Proximal Quasi-Newton Trust-Region Method for Nonsmooth Regularized Optimization", March, 2021. [Abstract] [BibTeX] [DOI] Abstract: We develop a trust-region method for minimizing the sum of a smooth term f and a nonsmooth term h, both of which can be nonconvex. Each iteration of our method minimizes a possibly nonconvex model of f + h in a trust region. The model coincides with f + h in value and subdifferential at the center. We establish global convergence to a first-order stationary point when f satisfies a smoothness condition that holds, in particular, when it has Lipschitz-continuous gradient, and h is proper and lower semi-continuous. The model of h is required to be proper, lower-semi-continuous and prox-bounded. Under these weak assumptions, we establish a worst-case O(1/ε²) iteration complexity bound that matches the best known complexity bound of standard trust-region methods for smooth optimization. 
We detail a special instance in which we use a limited-memory quasi-Newton model of f and compute a step with the proximal gradient method, resulting in a practical proximal quasi-Newton method. We establish similar convergence properties and complexity bound for a quadratic regularization variant, and provide an interpretation as a proximal gradient method with adaptive step size for nonconvex problems. We describe our Julia implementations and report numerical results on inverse problems from sparse optimization and signal processing. Our trust-region algorithm exhibits promising performance and compares favorably with linesearch proximal quasi-Newton methods based on convex models. BibTeX: @article{Aravkin2021, author = {Aleksandr Y. Aravkin and Robert Baraldi and Dominique Orban}, title = {A Proximal Quasi-Newton Trust-Region Method for Nonsmooth Regularized Optimization}, year = {2021}, doi = {10.13140/RG.2.2.18509.15845} }  Arnstrom D, Bemporad A and Axehill D (2021), "A Linear Programming Method Based on Proximal-Point Iterations with Applications to Multi-Parametric Programming", IEEE Control Systems Letters. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: We propose a linear programming method that is based on active-set changes and proximal-point iterations. The method solves a sequence of least-distance problems using a warm-started quadratic programming solver that can reuse internal matrix factorizations from the previously solved least-distance problem. We show that the proposed method terminates in a finite number of iterations and that it outperforms state-of-the-art LP solvers in scenarios where an extensive number of small/medium scale LPs need to be solved rapidly, occurring in, for example, multi-parametric programming algorithms. 
In particular, we show how the proposed method can accelerate operations such as redundancy removal, computation of Chebyshev centers and solving linear feasibility problems. BibTeX: @article{Arnstrom2021, author = {Daniel Arnstrom and Alberto Bemporad and Daniel Axehill}, title = {A Linear Programming Method Based on Proximal-Point Iterations with Applications to Multi-Parametric Programming}, journal = {IEEE Control Systems Letters}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/lcsys.2021.3138218} }  Augustino B, Nannicini G, Terlaky T and Zuluaga LF (2021), "Quantum Interior Point Methods for Semidefinite Optimization", December, 2021. [Abstract] [BibTeX] Abstract: We present two quantum interior point methods for semidefinite optimization problems, building on recent advances in quantum linear system algorithms. The first scheme, more similar to a classical solution algorithm, computes an inexact search direction and is not guaranteed to explore only feasible points; the second scheme uses a nullspace representation of the Newton linear system to ensure feasibility even with inexact search directions. The second is a novel scheme that might seem impractical in the classical world, but it is well-suited for a hybrid quantum-classical setting. We show that both schemes converge to an optimal solution of the semidefinite optimization problem under standard assumptions. By comparing the theoretical performance of classical and quantum interior point methods with respect to various input parameters, we show that our second scheme obtains a speedup over classical algorithms in terms of the dimension of the problem n, but has worse dependence on other numerical parameters. BibTeX: @article{Augustino2021, author = {Brandon Augustino and Giacomo Nannicini and Tamás Terlaky and Luis F. 
Zuluaga}, title = {Quantum Interior Point Methods for Semidefinite Optimization}, year = {2021} }  Awan MG, Hofmeyr S, Egan R, Ding N, Buluc A, Deslippe J, Oliker L and Yelick K (2021), "Accelerating large scale de novo metagenome assembly using GPUs", November, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: Metagenomic workflows involve studying uncultured microorganisms directly from the environment. These environmental samples when processed by modern sequencing machines yield large and complex datasets that exceed the capabilities of metagenomic software. The increasing sizes and complexities of datasets make a strong case for exascale-capable metagenome assemblers. However, the underlying algorithmic motifs are not well suited for GPUs. This poses a challenge since the majority of next-generation supercomputers will rely primarily on GPUs for computation. In this paper we present the first of its kind GPU-accelerated implementation of the local assembly approach that is an integral part of a widely used large-scale metagenome assembler, MetaHipMer. Local assembly uses algorithms that induce random memory accesses and non-deterministic workloads, which make GPU offloading a challenging task. Our GPU implementation outperforms the CPU version by about 7x and boosts the performance of MetaHipMer by 42% when running on 64 Summit nodes. BibTeX: @inproceedings{Awan2021, author = {Muaaz Gul Awan and Steven Hofmeyr and Rob Egan and Nan Ding and Aydin Buluc and Jack Deslippe and Leonid Oliker and Katherine Yelick}, title = {Accelerating large scale de novo metagenome assembly using GPUs}, publisher = {ACM}, year = {2021}, doi = {10.1145/3458817.3476212} }  Ayala A, Tomov S, Luszczek P, Cayrols S, Ragghianti G and Dongarra J (2021), "Interim Report on Benchmarking FFT Libraries on High Performance Systems". Thesis at: ICL Innovative Computing Laboratory. 
[Abstract] [BibTeX] [URL] Abstract: The Fast Fourier Transform (FFT) is used in many applications such as molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and many wireless multimedia applications. FFTs are also heavily used in ECP applications, such as EXAALT, Copa, ExaSky-HACC, ExaWind, WarpX, and many others. As these applications’ accuracy and speed depend on the performance of the FFTs, we designed an FFT benchmark to measure performance and scalability of currently available FFT packages and present the results from a pre-Exascale platform. Our benchmarking also stresses the overall capacity of system interconnect; thus, it may be considered as an indicator of the bisection bandwidth, communication contention noise, and the software overheads in MPI collectives that are of interest to many other ECP applications and libraries. This FFT benchmarking project aims to show the strengths and weaknesses of multiple FFT libraries and to indicate what can be done to improve their performance. In particular, we believe that the benchmarking results could help design and implement a fast and robust FFT library for 2D and 3D inputs, while targeting large-scale heterogeneous systems with multicore processors and hardware accelerators that are co-designed in tandem with ECP applications. Our work involves studying and analyzing state-of-the-art FFT software both from vendors and available as open-source codes to better understand their performance. 
BibTeX: @techreport{Ayala2021, author = {Alan Ayala and Stanimire Tomov and Piotr Luszczek and Sebastien Cayrols and Gerald Ragghianti and Jack Dongarra}, title = {Interim Report on Benchmarking FFT Libraries on High Performance Systems}, school = {ICL Innovative Computing Laboratory}, year = {2021}, url = {https://www.icl.utk.edu/files/publications/2021/icl-utk-1492-2021.pdf} }  Ayala A, Tomov S, Stoyanov M and Dongarra J (2021), "Scalability Issues in FFT Computation", In Lecture Notes in Computer Science. , pp. 279-287. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: The fast Fourier transform (FFT) is one of the most important tools in mathematics, and it is widely required by several applications of science and engineering. State-of-the-art parallel implementations of the FFT algorithm, based on Cooley-Tukey developments, are known to be communication-bound, which causes critical issues when scaling the computational and architectural capabilities. In this paper, we study the main performance bottleneck of FFT computations on hybrid CPU and GPU systems at large-scale. We provide numerical simulations and potential acceleration techniques that can be easily integrated into FFT distributed libraries. We present different experiments on performance scalability and runtime analysis on the world’s most powerful supercomputers today: Summit, using up to 6,144 NVIDIA V100 GPUs, and Fugaku, using more than one million Fujitsu A64FX cores. 
BibTeX: @incollection{Ayala2021a, author = {Alan Ayala and Stanimire Tomov and Miroslav Stoyanov and Jack Dongarra}, title = {Scalability Issues in FFT Computation}, booktitle = {Lecture Notes in Computer Science}, publisher = {Springer International Publishing}, year = {2021}, pages = {279--287}, doi = {10.1007/978-3-030-86359-3_21} }  Ayala A, Tomov S, Stoyanov M, Haidar A and Dongarra J (2021), "Accelerating Multi - Process Communication for Parallel 3-D FFT", In Proceedings of the 2021 Workshop on Exascale MPI., November, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Today largest and most powerful supercomputers in the world are built on heterogeneous platforms; and using the combined power of multi-core CPUs and GPUs, has had a great impact accelerating large-scale applications. However, on these architectures, parallel algorithms, such as the Fast Fourier Transform (FFT), encounter that inter-processor communication become a bottleneck and limits their scalability. In this paper, we present techniques for speeding up multi-process communication cost during the computation of FFTs, considering hybrid network connections as those expected on upcoming exascale machines. Among our techniques, we present algorithmic tuning, making use of phase diagrams; parametric tuning, using different FFT settings; and MPI distribution tuning based on FFT size and computational resources available. We present several experiments obtained on Summit supercomputer at Oak Ridge National Laboratory, using up to 40,960 IBM Power9 cores and 6,144 NVIDIA V-100 GPUs. 
BibTeX: @inproceedings{Ayala2021b, author = {Alan Ayala and Stan Tomov and Miroslav Stoyanov and Azzam Haidar and Jack Dongarra}, title = {Accelerating Multi - Process Communication for Parallel 3-D FFT}, booktitle = {Proceedings of the 2021 Workshop on Exascale MPI}, publisher = {IEEE}, year = {2021}, doi = {10.1109/exampi54564.2021.00011} }  Azad A, Selvitopi O, Hussain MT, Gilbert JR and Buluc A (2021), "Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems", June, 2021. [Abstract] [BibTeX] Abstract: Combinatorial algorithms such as those that arise in graph analysis, modeling of discrete systems, bioinformatics, and chemistry, are often hard to parallelize. The Combinatorial BLAS library implements key computational primitives for rapid development of combinatorial algorithms in distributed-memory systems. During the decade since its first introduction, the Combinatorial BLAS library has evolved and expanded significantly. This paper details many of the key technical features of Combinatorial BLAS version 2.0, such as communication avoidance, hierarchical parallelism via in-node multithreading, accelerator support via GPU kernels, generalized semiring support, implementations of key data structures and functions, and scalable distributed I/O operations for human-readable files. Our paper also presents several rules of thumb for choosing the right data structures and functions in Combinatorial BLAS 2.0, under various common application scenarios. BibTeX: @article{Azad2021, author = {Ariful Azad and Oguz Selvitopi and Md Taufique Hussain and John R. Gilbert and Aydin Buluc}, title = {Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems}, year = {2021} }  Bäckström K, Walulya I, Papatriantafilou M and Tsigas P (2021), "Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence", February, 2021. 
[Abstract] [BibTeX] Abstract: Stochastic gradient descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous parallel shared-memory SGD (AsyncSGD), including synchronization-free algorithms, e.g. HOGWILD!, have received interest in certain contexts, due to reduced overhead compared to synchronous parallelization. Despite that they induce staleness and inconsistency, they have shown speedup for problems satisfying smooth, strongly convex targets, and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is however a gap in current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We focus on the impact of consistency-preserving non-blocking synchronization in SGD convergence, and in sensitivity to hyper-parameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for training multilayer perceptrons (MLP) and convolutional neural networks (CNN). We observe the crucial impact of contention, staleness and consistency and show how Leashed-SGD provides significant improvements in stability as well as wall-clock time to convergence (from 20-80% up to 4x improvements) compared to the standard lock-based AsyncSGD algorithm and HOGWILD!, while reducing the overall memory footprint. 
BibTeX: @article{Baeckstroem2021, author = {Karl Bäckström and Ivan Walulya and Marina Papatriantafilou and Philippas Tsigas}, title = {Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence}, year = {2021} }  Baek D, Hwang S, Heo T, Kim D and Huh J (2021), "InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-aware Inner Product Processing", In Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques. [Abstract] [BibTeX] Abstract: Sparse matrix multiplication is one of the key computational kernels in large-scale data analytics. However, a naive implementation suffers from the overheads of irregular memory accesses due to the representation of sparsity. To mitigate the memory access overheads, recent accelerator designs advocated the outer product processing which minimizes input accesses but generates intermediate products to be merged to the final output matrix. Using real-world sparse matrices, this study first identifies the memory bloating problem of the outer product designs due to the unpredictable intermediate products. Such an unpredictable increase in memory requirement during computation can limit the applicability of accelerators. To address the memory bloating problem, this study revisits an alternative inner product approach, and proposes a new accelerator design called InnerSP. This study shows that nonzero element distributions in real-world sparse matrices have a certain level of locality. Using a smart caching scheme designed for inner product, the locality is effectively exploited with a modest on-chip cache. However, the row-wise inner product relies on on-chip aggregation of intermediate products. Due to uneven sparsity per row, overflows or underflows of the on-chip storage for aggregation can occur. To maximize the parallelism while avoiding costly overflows, the proposed accelerator uses pre-scanning for row splitting and merging. 
The simulation results show that the performance of InnerSP can exceed or be similar to those of the prior outer product approaches without any memory bloating problem. BibTeX: @inproceedings{Baek2021, author = {Daehyeon Baek and Soojin Hwang and Taekyung Heo and Daehoon Kim and Jaehyuk Huh}, title = {InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-aware Inner Product Processing}, booktitle = {Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques}, year = {2021} }  Bai H, Gan X, Xu T, Jia M, Tan W, Chen J and Zhang Y (2021), "VPC: Pruning connected components using vector-based path compression for Graph500", CCF Transactions on High Performance Computing., July, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Graphs are an effective approach for data representation and organization, and graph analysis is a promising killer application for AI systems. However, recently emerging extremely large graphs (consisting of trillions of vertices and edges) exceed the capacity of any small-/medium-scale clusters and thus necessitate the adoption of supercomputers for efficient graph processing. Graph500 is the de facto standard for benchmarking supercomputers’ graph processing performance, and connected component (CC) is an important basic algorithm for Graph500’s BFS and SSSP tests. However, current CC algorithms are inefficient on supercomputers and fast CC is expensive and challenging. In this paper, we propose VPC, an efficient method that prunes connected components using vector-based path compression. It includes the following innovations: (i) The data structure of the traversal algorithm is customized with the two-dimensional adjacency vector. (ii) The vector-based path compression is proposed for the union-find algorithm. (iii) Parallel VPC is proposed customized with Tianhe. 
Experimental results validate that the two-dimensional adjacency vector has better performance than other data structures and the vector-based path compression is used in the realization of the union-find algorithm. When the scale is 26, the performance of our algorithm is 1.38×, 1.69× and 2.57× that of other algorithms. The union-find algorithm is used for connected components, and the performance of the algorithm is 5.14× and 5.01× that of BFS and DFS respectively. BibTeX: @article{Bai2021, author = {Hao Bai and Xinbiao Gan and Tianjing Xu and Menghan Jia and Wen Tan and Juan Chen and Yiming Zhang}, title = {VPC: Pruning connected components using vector-based path compression for Graph500}, journal = {CCF Transactions on High Performance Computing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s42514-021-00070-z} }  Bai Y, Chen D and Gomes CP (2021), "CLR-DRNets: Curriculum Learning with Restarts to Solve Visual Combinatorial Games" Schloss Dagstuhl - Leibniz-Zentrum für Informatik. [Abstract] [BibTeX] [DOI] Abstract: We introduce a curriculum learning framework for challenging tasks that require a combination of pattern recognition and combinatorial reasoning, such as single-player visual combinatorial games. Our work harnesses Deep Reasoning Nets (DRNets) [4], a framework that combines deep learning with constraint reasoning for unsupervised pattern demixing. We propose CLR-DRNets (pronounced Clear-DRNets), a curriculum-learning-with-restarts framework to boost the performance of DRNets. CLR-DRNets incrementally increase the difficulty of the training instances and use restarts, a new model selection method that selects multiple models from the same training trajectory to learn a set of diverse heuristics and apply them at inference time. An enhanced reasoning module is also proposed for CLR-DRNets to improve the ability of reasoning and generalize to unseen instances. 
We consider Visual Sudoku, i.e., Sudoku with hand-written digits or letters, and Visual Mixed Sudoku, a substantially more challenging task that requires the demixing and completion of two overlapping Visual Sudokus. We propose an enhanced reasoning module for the DRNets framework for encoding these visual games. We show how CLR-DRNets considerably outperform DRNets and other approaches on these visual combinatorial games. BibTeX: @inproceedings{Bai2021a, author = {Bai, Yiwei and Chen, Di and Gomes, Carla P.}, title = {CLR-DRNets: Curriculum Learning with Restarts to Solve Visual Combinatorial Games}, publisher = {Schloss Dagstuhl - Leibniz-Zentrum für Informatik}, year = {2021}, doi = {10.4230/LIPICS.CP.2021.17} }  Bai J, Kong S and Gomes CP (2021), "Gaussian Mixture Variational Autoencoder with Contrastive Learning for Multi-Label Classification", December, 2021. [Abstract] [BibTeX] Abstract: Multi-label classification (MLC) is a prediction task where each sample can have more than one label. We propose a novel contrastive learning boosted multi-label prediction model based on a Gaussian mixture variational autoencoder (C-GMVAE), which learns a multimodal prior space and employs a contrastive loss. Many existing methods introduce extra complex neural modules to capture the label correlations, in addition to the prediction modules. We found that by using contrastive learning in the supervised setting, we can exploit label information effectively, and learn meaningful feature and label embeddings capturing both the label correlations and predictive power, without extra neural modules. Our method also adopts the idea of learning and aligning latent spaces for both features and labels. C-GMVAE imposes a Gaussian mixture structure on the latent space, to alleviate posterior collapse and over-regularization issues, in contrast to previous works based on a unimodal prior. 
C-GMVAE outperforms existing methods on multiple public datasets and can often match other models' full performance with only 50% of the training data. Furthermore, we show that the learnt embeddings provide insights into the interpretation of label-label interactions. BibTeX: @article{Bai2021b, author = {Junwen Bai and Shufeng Kong and Carla P. Gomes}, title = {Gaussian Mixture Variational Autoencoder with Contrastive Learning for Multi-Label Classification}, year = {2021} }  Balın MF, Sancak K and Catalyurek Ü (2021), "MG-GCN: Scalable Multi-GPU GCN Training Framework", October, 2021. [Abstract] [BibTeX] Abstract: Full batch training of Graph Convolutional Network (GCN) models is not feasible on a single GPU for large graphs containing tens of millions of vertices or more. Recent work has shown that, for the graphs used in the machine learning community, communication becomes a bottleneck and scaling is blocked outside of the single machine regime. Thus, we propose MG-GCN, a multi-GPU GCN training framework taking advantage of the high-speed communication links between the GPUs present in multi-GPU systems. MG-GCN employs multiple High-Performance Computing optimizations, including efficient re-use of memory buffers to reduce the memory footprint of training GNN models, as well as communication and computation overlap. These optimizations enable execution on larger datasets, that generally do not fit into memory of a single GPU in state-of-the-art implementations. Furthermore, they contribute to achieve superior speedup compared to the state-of-the-art. For example, MG-GCN achieves super-linear speedup with respect to DGL, on the Reddit graph on both DGX-1 (V100) and DGX-A100. BibTeX: @article{Balin2021, author = {Muhammed Fatih Balın and Kaan Sancak and Ümit V. 
Catalyurek}, title = {MG-GCN: Scalable Multi-GPU GCN Training Framework}, year = {2021} }  Banerjee A, Shah P, Nandani S, Tyagi S, Kumar S and Chaudhury B (2021), "An Empirical Investigation of OpenMP Based Implementation of Simplex Algorithm", In OpenMP: Enabling Massive Node-Level Parallelism. , pp. 96-110. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: This paper presents a shared-memory based parallel implementation of the standard simplex algorithm. The simplex algorithm is a popular technique for linear programming used to solve minimization and maximization problems that are subject to linear constraints. The simplex algorithm reduces the optimization problem to a series of iterative matrix operations. In this paper we perform an empirical analysis of our algorithm and also study the impact of the density of the underlying matrix on the overall performance. We observed a maximum speedup of 10.2 at 16 threads and also demonstrated that our proposed parallel algorithm scales well over a range of matrix densities. We also make an important observation that the effect of increasing the number of constraints is more significant than the effect of varying the number of variables. BibTeX: @incollection{Banerjee2021, author = {Arkaprabha Banerjee and Pratvi Shah and Shivani Nandani and Shantanu Tyagi and Sidharth Kumar and Bhaskar Chaudhury}, title = {An Empirical Investigation of OpenMP Based Implementation of Simplex Algorithm}, booktitle = {OpenMP: Enabling Massive Node-Level Parallelism}, publisher = {Springer International Publishing}, year = {2021}, pages = {96--110}, doi = {10.1007/978-3-030-85262-7_7} }  Bao W-B and Miao S-X (2021), "A splitting iterative method and preconditioner for complex symmetric linear system via real equivalent form", Advanced Studies: Euro-Tbilisi Mathematical Journal. 
[Abstract] [BibTeX] [DOI] [URL] Abstract: In this paper, a splitting iterative method and the corresponding preconditioner are studied for solving a class of complex symmetric linear systems via real equivalent forms. The unconditional convergence theory of the new iterative method is established, and the eigenvalue distribution of the corresponding preconditioned matrix is analyzed. Numerical experiments are given to verify our theoretical results and illustrate effectiveness of the proposed iterative method and the corresponding splitting preconditioner. BibTeX: @article{Bao2021, author = {Wen-Bin Bao and Shu-Xin Miao}, title = {A splitting iterative method and preconditioner for complex symmetric linear system via real equivalent form}, journal = {Advanced Studies: Euro-Tbilisi Mathematical Journal}, year = {2021}, url = {https://projecteuclid.org/journals/advanced-studies-euro-tbilisi-mathematical-journal/volume-14/issue-4/A-splitting-iterative-method-and-preconditioner-for-complex-symmetric-linear/10.3251/asetmj/1932200821.short}, doi = {10.3251/asetmj/1932200821} }  Barboni R, Peyré G and Vialard F-X (2021), "Global convergence of ResNets: From finite to infinite width using linear parameterization" [Abstract] [BibTeX] [URL] Abstract: Overparameterization is a key factor in the absence of convexity to explain global convergence of gradient descent (GD) for neural networks. Beside the well studied lazy regime, infinite width (mean field) analysis has been developed for shallow networks, using on convex optimization technics. To bridge the gap between the lazy and mean field regimes, we study Residual Networks (ResNets) in which the residual block has linear parameterization while still being nonlinear. Such ResNets admit both infinite depth and width limits, encoding residual blocks in a Reproducing Kernel Hilbert Space (RKHS). In this limit, we prove a local Polyak-Lojasiewicz inequality. 
Thus, every critical point is a global minimizer and a local convergence result of GD holds, retrieving the lazy regime. In contrast with other mean-field studies, it applies to both parametric and non-parametric cases under an expressivity condition on the residuals. Our analysis leads to a practical and quantified recipe: starting from a universal RKHS, Random Fourier Features are applied to obtain a finite dimensional parameterization satisfying with high-probability our expressivity condition. BibTeX: @article{Barboni2021, author = {Raphaël Barboni and Gabriel Peyré and François-Xavier Vialard}, title = {Global convergence of ResNets: From finite to infinite width using linear parameterization}, year = {2021}, url = {https://hal.archives-ouvertes.fr/hal-03473699/document} }  Bellavia S and Gurioli G (2021), "Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy", Optimization., February, 2021. , pp. 1-35. Informa UK Limited. [Abstract] [BibTeX] [DOI] Abstract: We here adapt an extended version of the adaptive cubic regularization method with dynamic inexact Hessian information for nonconvex optimization in Bellavia et al. [Adaptive cubic regularization methods with dynamic inexact hessian information and applications to finite-sum minimization. IMA Journal of Numerical Analysis. 2021;41(1):764–799] to the stochastic optimization setting. While exact function evaluations are still considered, this novel variant inherits the innovative use of adaptive accuracy requirements for Hessian approximations introduced in the just quoted paper and additionally employs inexact computations of the gradient. 
Without restrictions on the variance of the errors, we assume that these approximations are available within a sufficiently large, but fixed, probability and we extend, in the spirit of Cartis and Scheinberg [Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math Program Ser A. 2018;159(2):337–375], the deterministic analysis of the framework to its stochastic counterpart, showing that the expected number of iterations to reach a first-order stationary point matches the well-known worst-case optimal complexity. This is, in fact, still given by $O(\epsilon^{-3/2})$, with respect to the first-order $\epsilon$ tolerance. Finally, numerical tests on nonconvex finite-sum minimization confirm that using inexact first- and second-order derivatives can be beneficial in terms of the computational savings. BibTeX: @article{Bellavia2021, author = {Stefania Bellavia and Gianmarco Gurioli}, title = {Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy}, journal = {Optimization}, publisher = {Informa UK Limited}, year = {2021}, pages = {1--35}, doi = {10.1080/02331934.2021.1892104} }  Bellavia S, Gurioli G, Morini B and Toint PL (2021), "Quadratic and Cubic Regularisation Methods with Inexact function and Random Derivatives for Finite-Sum Minimisation", March, 2021. [Abstract] [BibTeX] Abstract: This paper focuses on regularisation methods using models up to the third order to search for up to second-order critical points of a finite-sum minimisation problem. The variant presented belongs to the framework of [3]: it employs random models with accuracy guaranteed with a sufficiently large prefixed probability and deterministic inexact function evaluations within a prescribed level of accuracy. 
Without assuming unbiased estimators, the expected number of iterations is $\mathcal{O}(\epsilon_1^{-2})$ or $\mathcal{O}(\epsilon_1^{-3/2})$ when searching for a first-order critical point using a second or third order model, respectively, and of $\mathcal{O}(\max[\epsilon_1^{-3/2}, \epsilon_2^{-3}])$ when seeking for second-order critical points with a third order model, in which $\epsilon_j$, $j \in \{1,2\}$, is the $j$th-order tolerance. These results match the worst-case optimal complexity for the deterministic counterpart of the method. Preliminary numerical tests for first-order optimality in the context of nonconvex binary classification in imaging, with and without Artificial Neural Networks (ANNs), are presented and discussed. BibTeX: @article{Bellavia2021a, author = {Stefania Bellavia and Gianmarco Gurioli and Benedetta Morini and Philippe L. Toint}, title = {Quadratic and Cubic Regularisation Methods with Inexact function and Random Derivatives for Finite-Sum Minimisation}, year = {2021} }  Bemporad A and Cimini G (2021), "Variable Elimination in Model Predictive Control Based on K-SVD and QR Factorization" [Abstract] [BibTeX] Abstract: For linearly constrained least-squares problems that depend on a vector of parameters, this paper proposes techniques for reducing the number of involved optimization variables. After first eliminating equality constraints in a numerically robust way by QR factorization, we propose a technique based on singular value decomposition (SVD) and unsupervised learning, that we call K-SVD, and neural classifiers to automatically partition the set of parameter vectors in K nonlinear regions in which the original problem is approximated by using a smaller set of variables. For the special case of parametric constrained least-squares problems that arise from model predictive control (MPC) formulations, we propose a novel and very efficient QR factorization method for eliminating equality constraints. 
Together with SVD or K-SVD, the method provides a numerically robust alternative to standard condensing and move blocking, and to other complexity reduction methods for MPC based on basis functions. We show the good performance of the proposed techniques in numerical tests and in a problem of linearized MPC of a nonlinear benchmark process. BibTeX: @article{Bemporad2021, author = {A. Bemporad and G. Cimini}, title = {Variable Elimination in Model Predictive Control Based on K-SVD and QR Factorization}, year = {2021} }  Berahmand K, Mohammadi M, Faroughi A and Mohammadiani RP (2021), "A novel method of spectral clustering in attributed networks by constructing parameter-free affinity matrix", Cluster Computing., November, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: The most basic and significant issue in complex network analysis is community detection, which is a branch of machine learning. Most current community detection approaches only consider a network's topology structures, which loses the potential to use node attribute information. In attributed networks, both topological structure and node attributes are important features for community detection. In recent years, the spectral clustering algorithm has received much interest as one of the best performing algorithms in the subcategory of dimensionality reduction. This algorithm applies the eigenvalues of the affinity matrix to map data to low-dimensional space. In the present paper, a new version of the spectral cluster, named Attributed Spectral Clustering (ASC), is applied to attributed graphs so that the identified communities have structural cohesiveness and attribute homogeneity. Since the performance of spectral clustering heavily depends on the goodness of the affinity matrix, the ASC algorithm will use the Topological and Attribute Random Walk Affinity Matrix (TARWAM) as a new affinity matrix to calculate the similarity between nodes. 
TARWAM utilizes the biased random walk to integrate network topology and attribute information. It can improve the similarity degree among the pairs of nodes in the same density region of the attributed network, without the need for parameter tuning. The proposed approach has been compared to other primary and new attributed graph clustering algorithms based on synthetic and real datasets. The experimental results show that the proposed approach is more effective and accurate compared to other state-of-the-art attributed graph clustering techniques. BibTeX: @article{Berahmand2021, author = {Kamal Berahmand and Mehrnoush Mohammadi and Azadeh Faroughi and Rojiar Pir Mohammadiani}, title = {A novel method of spectral clustering in attributed networks by constructing parameter-free affinity matrix}, journal = {Cluster Computing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10586-021-03430-0} }  Berger G, Freire M, Marini R, Dufrechou E and Ezzatti P (2021), "Unleashing the performance of bmSparse for the sparse matrix multiplication in GPUs", In Proceedings of the 2021 12th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems., November, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: The evolution of data science and machine learning has increased the applicability of the sparse matrix multiplication (SPGEMM) kernel. Unlike more well-known operations such as the SPMV, in the SPGEMM the nonzero pattern of the result is determined by the interaction between the nonzero patterns of the inputs, which imposes serious challenges to the development of high-performance implementations for accelerators. Recent efforts in this subject aim to mitigate this irregularity through the use of block-based sparse storage formats, obtaining promising results on accelerators such as GPUs. 
In this work we study the format bmSparse [1] and propose optimizations to attack the principal bottlenecks of the original SPGEMM implementation for Nvidia GPUs. We evaluate the proposal using nine sparse matrices of different sizes, showing remarkable speedups with respect to CUSPARSE's CSR variant. BibTeX: @inproceedings{Berger2021, author = {Gonzalo Berger and Manuel Freire and Renzo Marini and Ernesto Dufrechou and Pablo Ezzatti}, title = {Unleashing the performance of bmSparse for the sparse matrix multiplication in GPUs}, booktitle = {Proceedings of the 2021 12th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems}, publisher = {IEEE}, year = {2021}, doi = {10.1109/scala54577.2021.00008} }  Berger-Vergiat L, Kelley B, Rajamanickam S, Hu J, Swirydowicz K, Mullowney P, Thomas S and Yamazaki I (2021), "Two-Stage Gauss--Seidel Preconditioners and Smoothers for Krylov Solvers on a GPU cluster", April, 2021. [Abstract] [BibTeX] Abstract: Gauss-Seidel (GS) relaxation is often employed as a preconditioner for a Krylov solver or as a smoother for Algebraic Multigrid (AMG). However, the requisite sparse triangular solve is difficult to parallelize on many-core architectures such as graphics processing units (GPUs). In the present study, the performance of the traditional GS relaxation based on a triangular solve is compared with two-stage variants, replacing the direct triangular solve with a fixed number of inner Jacobi-Richardson (JR) iterations. When a small number of inner iterations is sufficient to maintain the Krylov convergence rate, the two-stage GS (GS2) often outperforms the traditional algorithm on many-core architectures. We also compare GS2 with JR. When they perform the same number of flops for SpMV (e.g. three JR sweeps compared to two GS sweeps with one inner JR sweep), the GS2 iterations, and the Krylov solver preconditioned with GS2, may converge faster than the JR iterations. Moreover, for some problems (e.g. 
elasticity), it was found that JR may diverge with a damping factor of one, whereas two-stage GS may improve the convergence with more inner iterations. Finally, to study the performance of the two-stage smoother and preconditioner for a practical problem, (e.g. using tuned damping factors), these were applied to incompressible fluid flow simulations on GPUs. BibTeX: @article{BergerVergiat2021, author = {Luc Berger-Vergiat and Brian Kelley and Sivasankaran Rajamanickam and Jonathan Hu and Katarzyna Swirydowicz and Paul Mullowney and Stephen Thomas and Ichitaro Yamazaki}, title = {Two-Stage Gauss--Seidel Preconditioners and Smoothers for Krylov Solvers on a GPU cluster}, year = {2021} }  Berner J, Grohs P, Kutyniok G and Petersen P (2021), "The Modern Mathematics of Deep Learning", May, 2021. [Abstract] [BibTeX] Abstract: We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail. 
BibTeX: @article{Berner2021, author = {Julius Berner and Philipp Grohs and Gitta Kutyniok and Philipp Petersen}, title = {The Modern Mathematics of Deep Learning}, year = {2021} }  Bertsimas D, Cory-Wright R and Johnson NAG (2021), "Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach", September, 2021. [Abstract] [BibTeX] Abstract: We study the Sparse Plus Low Rank decomposition problem (SLR), which is the problem of decomposing a corrupted data matrix D into a sparse matrix Y containing the perturbations plus a low rank matrix X. SLR is a fundamental problem in Operations Research and Machine Learning arising in many applications such as data compression, latent semantic indexing, collaborative filtering and medical imaging. We introduce a novel formulation for SLR that directly models the underlying discreteness of the problem. For this formulation, we develop an alternating minimization heuristic to compute high quality solutions and a novel semidefinite relaxation that provides meaningful bounds for the solutions returned by our heuristic. We further develop a custom branch and bound routine that leverages our heuristic and convex relaxation that solves small instances of SLR to certifiable near-optimality. Our heuristic can scale to n=10000 in hours, our relaxation can scale to n=200 in hours, and our branch and bound algorithm can scale to n=25 in minutes. Our numerical results demonstrate that our approach outperforms existing state-of-the-art approaches in terms of the MSE of the low rank matrix and that of the sparse matrix. BibTeX: @article{Bertsimas2021, author = {Dimitris Bertsimas and Ryan Cory-Wright and Nicholas A. G. Johnson}, title = {Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach}, year = {2021} }  Besse C, Duboscq R and Coz SL (2021), "Numerical Simulations on Nonlinear Quantum Graphs with the GraFiDi Library", March, 2021. 
[Abstract] [BibTeX] Abstract: Nonlinear quantum graphs are metric graphs equipped with a nonlinear Schrödinger equation. Whereas in the last ten years they have known considerable developments on the theoretical side, their study from the numerical point of view remains in its early stages. The goal of this paper is to present the Grafidi library, a Python library which has been developed with the numerical simulation of nonlinear Schrödinger equations on graphs in mind. We will show how, with the help of the Grafidi library, one can implement the popular normalized gradient flow and nonlinear conjugate gradient flow methods to compute ground states of a nonlinear quantum graph. We will also simulate the dynamics of the nonlinear Schrödinger equation with a Crank-Nicolson relaxation scheme and a Strang splitting scheme. Finally, in a series of numerical experiments on various types of graphs, we will compare the outcome of our numerical calculations for ground states with the existing theoretical results, thereby illustrating the versatility and efficiency of our implementations in the framework of the Grafidi library. BibTeX: @article{Besse2021, author = {Christophe Besse and Romain Duboscq and Stefan Le Coz}, title = {Numerical Simulations on Nonlinear Quantum Graphs with the GraFiDi Library}, year = {2021} }  Beuzeville T, Boudier P, Buttari A, Gratton S, Mary T and Pralet S (2021), "Adversarial attacks via backward error analysis" [Abstract] [BibTeX] [URL] Abstract: Backward error (BE) analysis was developed and popularized by James Wilkinson in the 1950s and 1960s, with origins in the works of Neumann and Goldstine (1947) and Turing (1948). It is a fundamental notion used in numerical linear algebra software, both as a theoretical and a practical tool for the rounding error analysis of numerical algorithms. 
Broadly speaking the backward error quantifies, in terms of perturbation of input data, by how much the output of an algorithm fails to be equal to an expected quantity. For a given computed solution y, this amounts to computing the norm of the smallest perturbation Δ x of the input data x such that y is an exact solution of a perturbed system: f (x + Δ x) = y. Up to now, BE analysis has been applied to numerous linear algebra problems, always with the objective of quantifying the robustness of algebraic processes with respect to rounding errors stemming from finite precision computations. While deep neural networks (DNN) have achieved an unprecedented success in numerous machine learning tasks in various domains, their robustness to adversarial attacks, rounding errors, or quantization processes has raised considerable concerns from the machine learning community. In this work, we generalize BE analysis to DNN. This enables us to obtain closed formulas and a numerical algorithm for computing adversarial attacks. By construction, these attacks are optimal, and thereby smaller, in norm, than perturbations obtained with existing gradient-based approaches. We produce numerical results that support our theoretical findings and illustrate the relevance of our approach on well-known datasets. BibTeX: @article{Beuzeville2021, author = {Théo Beuzeville and Pierre Boudier and Alfredo Buttari and Serge Gratton and Théo Mary and Stéphane Pralet}, title = {Adversarial attacks via backward error analysis}, year = {2021}, url = {https://hal-univ-tlse3.archives-ouvertes.fr/hal-03296180/document} }  Bingham NH and Symons TL (2021), "Gaussian random fields: with and without covariances", November, 2021. [Abstract] [BibTeX] Abstract: We begin with isotropic Gaussian random fields, and show how the Bochner-Godement theorem gives a natural way to describe their covariance structure. 
We continue with a study of Matérn processes on Euclidean space, spheres, manifolds and graphs, using Bessel potentials and stochastic partial differential equations (SPDEs). We then turn from this continuous setting to approximating discrete settings, Gaussian Markov random fields (GMRFs), and the computational advantages they bring in handling large data sets, by exploiting the sparseness properties of the relevant precision (concentration) matrices. BibTeX: @article{Bingham2021, author = {N. H. Bingham and Tasmin L. Symons}, title = {Gaussian random fields: with and without covariances}, year = {2021} }  Biondi E, Barnier G, Clapp RG, Picetti F and Farris S (2021), "An object-oriented optimization framework for large-scale inverse problems", Computers & Geosciences., May, 2021. , pp. 104790. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: We present an object-oriented optimization framework that can be employed to solve small- and large-scale problems based on the concept of vectors and operators. By using such a strategy, we implement different iterative optimization algorithms that can be used in combination with architecture-independent vectors and operators, allowing the minimization of single-machine or cluster-based problems with a unique codebase. We implement a Python library following the described structure with a user-friendly interface that is designed to seamlessly scale to high-performance-computing (HPC) environments. We demonstrate its flexibility and scalability on multiple inverse problems, where convex and non-convex objective functions are optimized with different iterative algorithms. BibTeX: @article{Biondi2021, author = {Ettore Biondi and Guillaume Barnier and Robert G. 
Clapp and Francesco Picetti and Stuart Farris}, title = {An object-oriented optimization framework for large-scale inverse problems}, journal = {Computers & Geosciences}, publisher = {Elsevier BV}, year = {2021}, pages = {104790}, doi = {10.1016/j.cageo.2021.104790} }  Birgin EG and Martínez JM (2021), "Accelerated derivative-free nonlinear least-squares applied to the estimation of Manning coefficients" [Abstract] [BibTeX] [URL] Abstract: A general framework for solving nonlinear least squares problems without the employment of derivatives is proposed in the present paper together with a new general global convergence theory. With the aim to cope with the case in which the number of variables is big (for the standards of derivative-free optimization), two dimension-reduction procedures are introduced. One of them is based on iterative subspace minimization and the other one is based on spline interpolation with variable nodes. Each iteration based on those procedures is followed by an acceleration step inspired in the Sequential Secant Method. The practical motivation for this work is the estimation of parameters in Hydraulic models applied to dam breaking problems. Numerical examples of the application of the new method to those problems are given. BibTeX: @article{Birgin2021, author = {E. G. Birgin and J. M. Martínez}, title = {Accelerated derivative-free nonlinear least-squares applied to the estimation of Manning coefficients}, year = {2021}, url = {https://www.ime.usp.br/ egbirgin/publications/bmsesem.pdf} }  Bjorck J, Gomes CP and Weinberger KQ (2021), "Is High Variance Unavoidable in RL? A Case Study in Continuous Control", October, 2021. [Abstract] [BibTeX] Abstract: Reinforcement learning (RL) experiments have notoriously high variance, and minor details can have disproportionately large effects on measured outcomes. 
This is problematic for creating reproducible research and also serves as an obstacle for real-world applications, where safety and predictability are paramount. In this paper, we investigate causes for this perceived instability. To allow for an in-depth analysis, we focus on a specifically popular setup with high variance -- continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that variance mostly arises early in training as a result of poor "outlier" runs, but that weight initialization and initial exploration are not to blame. We show that one cause for early variance is numerical instability which leads to saturating nonlinearities. We investigate several fixes to this issue and find that one particular method is surprisingly effective and simple -- normalizing penultimate features. Addressing the learning instability allows for larger learning rates, and significantly decreases the variance of outcomes. This demonstrates that the perceived variance in RL is not necessarily inherent to the problem definition and may be addressed through simple architectural modifications. BibTeX: @article{Bjorck2021, author = {Johan Bjorck and Carla P. Gomes and Kilian Q. Weinberger}, title = {Is High Variance Unavoidable in RL? A Case Study in Continuous Control}, year = {2021} }  Bogle I, Boman EG, Devine KD, Rajamanickam S and Slota GM (2021), "Parallel Graph Coloring Algorithms for Distributed GPU Environments", June, 2021. [Abstract] [BibTeX] Abstract: Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but to the best of our knowledge, hybrid MPI+GPU algorithms have been unexplored until this work. We present several MPI+GPU coloring approaches based on the distributed coloring algorithms of Gebremedhin et al. 
and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to compute distance-2 and partial distance-2 colorings, giving the first known distributed, multi-GPU algorithm for these problems. In addition, we propose a novel heuristic to reduce communication for recoloring in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs. BibTeX: @article{Bogle2021, author = {Ian Bogle and Erik G Boman and Karen D Devine and Sivasankaran Rajamanickam and George M Slota}, title = {Parallel Graph Coloring Algorithms for Distributed GPU Environments}, year = {2021} }  Bollhöfer M, Schenk O and Verbosio F (2021), "A high performance level-block approximate LU factorization preconditioner algorithm", Applied Numerical Mathematics., April, 2021. Vol. 162, pp. 265-282. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Many application problems that lead to solving linear systems make use of preconditioned Krylov subspace solvers to compute their solution. Among the most popular preconditioning approaches are incomplete factorization methods either as single-level approaches or within a multilevel framework. We will present a block incomplete factorization that is based on skillfully blocking the system initially and throughout the factorization. Our objective is to develop algebraic block preconditioners for the efficient solution of such systems by Krylov subspace methods. 
We will demonstrate how our level-block approximate algorithm outperforms a single level-block scalar method often by orders of magnitude on modern architectures, thus paving the way for its prospective use inside various multilevel incomplete factorization approaches or other applications where the core part relies on an efficient incomplete factorization algorithm. BibTeX: @article{Bollhoefer2021, author = {Matthias Bollhöfer and Olaf Schenk and Fabio Verbosio}, title = {A high performance level-block approximate LU factorization preconditioner algorithm}, journal = {Applied Numerical Mathematics}, publisher = {Elsevier BV}, year = {2021}, volume = {162}, pages = {265--282}, doi = {10.1016/j.apnum.2020.12.023} }  Bonettini S, Prato M and Rebegoldi S (2021), "A nested primal-dual FISTA-like scheme for composite convex optimization problems" [Abstract] [BibTeX] Abstract: We propose a nested primal–dual algorithm with extrapolation on the primal variable suited for minimizing the sum of two convex functions, one of which is continuously differentiable. The proposed algorithm can be interpreted as an inexact inertial forward–backward algorithm equipped with a prefixed number of inner primal–dual iterations for the proximal evaluation and a “warm–start” strategy for starting the inner loop, and generalizes several nested primal–dual algorithms already available in the literature. By appropriately choosing the inertial parameters, we prove the convergence of the iterates to a saddle point of the problem, and provide an O(1/n) convergence rate on the primal–dual gap evaluated at the corresponding ergodic sequences. Numerical experiments on an image restoration problem show that the combination of the “warm–start” strategy with an appropriate choice of the inertial parameters is strictly required in order to guarantee the convergence to the real minimum point of the objective function. BibTeX: @article{Bonettini2021, author = {S. Bonettini and M. Prato and S. 
Rebegoldi}, title = {A nested primal-dual FISTA-like scheme for composite convex optimization problems}, year = {2021} }  Booth JD and Lane PA (2021), "An adaptive self-scheduling loop scheduler", Concurrency and Computation: Practice and Experience., December, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Many shared-memory parallel irregular applications, such as sparse linear algebra and graph algorithms, depend on efficient loop scheduling (LS) in a fork-join manner despite that the work per loop iteration can greatly vary depending on the application and the input. Because of the importance of LS, many different methods (e.g., workload-aware self-scheduling) and parameters (e.g., chunk size) have been explored to achieve reasonable performance, and many of these methods require expert prior knowledge about the application and input before runtime. This work proposes a new LS method that requires little to no expert knowledge to achieve speedups close to those of tuned LS methods by self-managing chunk size based on a heuristic of throughput and using work-stealing to recover from workload imbalances. This method, named iCh, is implemented into libgomp for testing. It is evaluated against OpenMP's guided, dynamic, and taskloop methods and is evaluated against BinLPT and generic work-stealing on an array of applications that includes: a synthetic benchmark, breadth-first search, K-Means, the molecular dynamics code LavaMD, and sparse matrix-vector multiplication. On a 28 thread Intel system, iCh is the only method to always be one of the top three LS methods. On average across all applications, iCh is within 5.4% of the best method and is even able to outperform other LS methods for breadth-first search and K-Means. 
BibTeX: @article{Booth2021, author = {Joshua Dennis Booth and Phillip Allen Lane}, title = {An adaptive self-scheduling loop scheduler}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6750} }  Boroujeni RA, Schiffl J, Darulova E, Ulbrich M and Ahrendt W (2021), "Deductive Verification of Floating-Point Java Programs in KeY", January, 2021. [Abstract] [BibTeX] Abstract: Deductive verification has been successful in verifying interesting properties of real-world programs. One notable gap is the limited support for floating-point reasoning. This is unfortunate, as floating-point arithmetic is particularly unintuitive to reason about due to rounding as well as the presence of the special values infinity and 'Not a Number' (NaN). In this paper, we present the first floating-point support in a deductive verification tool for the Java programming language. Our support in the KeY verifier handles arithmetic via floating-point decision procedures inside SMT solvers and transcendental functions via axiomatization. We evaluate this integration on new benchmarks, and show that this approach is powerful enough to prove the absence of floating-point special values -- often a prerequisite for further reasoning about numerical computations -- as well as certain functional properties for realistic benchmarks. BibTeX: @article{Boroujeni2021, author = {Rosa Abbasi Boroujeni and Jonas Schiffl and Eva Darulova and Mattias Ulbrich and Wolfgang Ahrendt}, title = {Deductive Verification of Floating-Point Java Programs in KeY}, year = {2021} }  Briceño-Arias LM and Roldán F (2021), "Split-Douglas--Rachford Algorithm for Composite Monotone Inclusions and Split-ADMM", SIAM Journal on Optimization., 1, 2021. Vol. 31(4), pp. 2987-3013. Society for Industrial & Applied Mathematics (SIAM). 
[Abstract] [BibTeX] [DOI] Abstract: In this paper we provide a generalization of the Douglas--Rachford splitting (DRS) and the primal-dual algorithm [L. Condat, J. Optim. Theory Appl., 158 (2013), pp. 460--479; B. C. Vũ, Adv. Comput. Math., 38 (2013), pp. 667--681] for solving monotone inclusions in a real Hilbert space involving a general linear operator. The proposed method allows for primal and dual nonstandard metrics and activates the linear operator separately from the monotone operators appearing in the inclusion. In the simplest case when the linear operator has full range, it reduces to classical DRS. Moreover, the weak convergence of primal-dual sequences to a Kuhn--Tucker point is guaranteed, generalizing the main result in [B. F. Svaiter, SIAM J. Control Optim., 49 (2011), pp. 280--287]. Inspired by [D. Gabay, Applications of the method of multipliers to variational inequalities, in Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, M. Fortin and R. Glowinski, eds., Stud. Math. Appl. 15, North-Holland, Amsterdam, 1983, pp. 299--331], we also derive a new split alternating direction method of multipliers (SADMM) by applying our method to the dual of a convex optimization problem involving a linear operator which can be expressed as the composition of two linear operators. The proposed SADMM activates one linear operator implicitly and the other one explicitly, and we recover ADMM when the latter is set as the identity. Connections and comparisons of our theoretical results with respect to the literature are provided for the main algorithm and SADMM. The flexibility and efficiency of both methods is illustrated via numerical simulations in total variation image restoration and a sparse minimization problem. BibTeX: @article{BriceñoArias2021, author = {Luis M. 
Briceño-Arias and Fernando Roldán}, title = {Split-Douglas--Rachford Algorithm for Composite Monotone Inclusions and Split-ADMM}, journal = {SIAM Journal on Optimization}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, volume = {31}, number = {4}, pages = {2987--3013}, doi = {10.1137/21m1395144} }  Brock B, Buluc A, Mattson TG, McMillan S and Moreira JE (2021), "Introduction to GraphBLAS 2.0", In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops., June, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: The GraphBLAS is a set of basic building blocks for constructing graph algorithms in terms of linear algebra. They are first and foremost defined mathematically with the goal that language bindings will be produced for a wide range of programming languages. We started with the C programming language and over the last four years have produced multiple versions of the GraphBLAS C API specification. In this paper, we describe our next version of the C GraphBLAS specification. It introduces a number of major changes including support for multithreading, import/export functionality, and functions that use the indices of matrix/vector elements. Since some of these changes introduce small backwards compatibility issues, this is a major release we call GraphBLAS 2.0. BibTeX: @inproceedings{Brock2021, author = {Benjamin Brock and Aydin Buluc and Timothy G. Mattson and Scott McMillan and Jose E. Moreira}, title = {Introduction to GraphBLAS 2.0}, booktitle = {Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdpsw52791.2021.00047} }  Brown GE and Narain R (2021), "WRAPD", ACM Transactions on Graphics., August, 2021. Vol. 40(4), pp. 1-14. Association for Computing Machinery (ACM). 
[Abstract] [BibTeX] [DOI] Abstract: Local-global solvers such as ADMM for elastic simulation and geometry optimization struggle to resolve large rotations such as bending and twisting modes, and large distortions in the presence of barrier energies. We propose two improvements to address these challenges. First, we introduce a novel local-global splitting based on the polar decomposition that separates the geometric nonlinearity of rotations from the material nonlinearity of the deformation energy. The resulting ADMM-based algorithm is a combination of an L-BFGS solve in the global step and proximal updates of element stretches in the local step. We also introduce a novel method for dynamic reweighting that is used to adjust element weights at runtime for improved convergence. With both improved rotation handling and element weighting, our algorithm is considerably faster than state-of-the-art approaches for quasi-static simulations. It is also much faster at making early progress in parameterization problems, making it valuable as an initializer to jump-start second-order algorithms. BibTeX: @article{Brown2021, author = {George E. Brown and Rahul Narain}, title = {WRAPD}, journal = {ACM Transactions on Graphics}, publisher = {Association for Computing Machinery (ACM)}, year = {2021}, volume = {40}, number = {4}, pages = {1--14}, doi = {10.1145/3450626.3459942} }  Brun E, Defour D, De Oliveira Castro P, Istoan M, Mancusi D, Petit E and Vaquet A (2021), "A Study of the Effects and Benefits of Custom-Precision Mathematical Libraries for HPC Codes", IEEE Transactions on Emerging Topics in Computing. , pp. 1-1. [Abstract] [BibTeX] [DOI] Abstract: Mathematical libraries are being specifically developed to use fixed-width data-paths on processors and target common floating-point formats like binary32 and binary64. In this article we propose a framework to evaluate the effects of mathematical library calls accuracy in scientific computations. 
First, our tool collects for each call-site of a mathematical function the input-data profile. Then, using a heuristic exploration algorithm, we estimate the minimal required accuracy by rounding the result to lower precisions. The data profile and accuracy measurement per call-site is used to speculatively select the mathematical function implementation with the most appropriate accuracy for a given scenario. We have tested the methodology with the Intel MKL VML library with predefined accuracy levels. We demonstrate the benefits of our approach on two real-world applications: SGP4, a satellite tracking application, and PATMOS, a Monte Carlo neutron transport code. We experiment and discuss its generalization across data-sets, and finally propose a speculative runtime implementation for PATMOS. The experiment provides an insight into the performance improvements that can be achieved by leveraging the control of per-function call-site accuracy-mode execution of the Intel MKL VML library. BibTeX: @article{Brun2021, author = {Brun, Emeric and Defour, David and De Oliveira Castro, Pablo and Istoan, Matei and Mancusi, Davide and Petit, Eric and Vaquet, Alan}, title = {A Study of the Effects and Benefits of Custom-Precision Mathematical Libraries for HPC Codes}, journal = {IEEE Transactions on Emerging Topics in Computing}, year = {2021}, pages = {1-1}, doi = {10.1109/TETC.2021.3070422} }  Brust JJ, Marcia RF, Petra CG and Saunders MA (2021), "Large-scale Optimization with Linear Equality Constraints using Reduced Compact Representation", January, 2021. [Abstract] [BibTeX] Abstract: For optimization problems with sparse linear equality constraints, we observe that the (1,1) block of the inverse KKT matrix remains unchanged when projected onto the nullspace of the constraints. We develop reduced compact representations of the limited-memory BFGS Hessian to compute search directions efficiently. 
Orthogonal projections are implemented by sparse QR factorization or preconditioned LSQR iteration. In numerical experiments two proposed trust-region algorithms improve in computation times, often significantly, compared to previous implementations and compared to IPOPT. BibTeX: @article{Brust2021, author = {Johannes J. Brust and Roummel F. Marcia and Cosmin G. Petra and Michael A. Saunders}, title = {Large-scale Optimization with Linear Equality Constraints using Reduced Compact Representation}, year = {2021} }  Buizza C, Casas CQ, Nadler P, Mack J, Marrone S, Titus Z, Cornec CL, Heylen E, Dur T, Ruiz LB, Heaney C, Lopez JAD, Kumar KSS and Arcucci R (2021), "Data Learning: Integrating Data Assimilation and Machine Learning", Journal of Computational Science., December, 2021. , pp. 101525. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Data Assimilation (DA) is the approximation of the true state of some physical system by combining observations with a dynamic model. DA incorporates observational data into a prediction model to improve forecasted results. These models have increased in sophistication to better fit application requirements and circumvent implementation issues. Nevertheless, these approaches are incapable of fully overcoming their unrealistic assumptions. Machine Learning (ML) shows great capability in approximating nonlinear systems and extracting meaningful features from high–dimensional data. ML algorithms are capable of assisting or replacing traditional forecasting methods. However, the data used during training in any Machine Learning (ML) algorithm include numerical, approximation and round off errors, which are trained into the forecasting model. Integration of ML with DA increases the reliability of prediction by including information with a physical meaning. 
This work provides an introduction to Data Learning, a field that integrates Data Assimilation and Machine Learning to overcome limitations in applying these fields to real-world data. The fundamental equations of DA and ML are presented and developed to show how they can be combined into Data Learning. We present a number of Data Learning methods and results for some test cases, though the equations are general and can easily be applied elsewhere. BibTeX: @article{Buizza2021, author = {Caterina Buizza and César Quilodrán Casas and Philip Nadler and Julian Mack and Stefano Marrone and Zainab Titus and Clémence Le Cornec and Evelyn Heylen and Tolga Dur and Luis Baca Ruiz and Claire Heaney and Julio Amador Díaz Lopez and K S Sesh Kumar and Rossella Arcucci}, title = {Data Learning: Integrating Data Assimilation and Machine Learning}, journal = {Journal of Computational Science}, publisher = {Elsevier BV}, year = {2021}, pages = {101525}, doi = {10.1016/j.jocs.2021.101525} }  Buluc A, Kolda TG, Wild SM, Anitescu M, DeGennaro A, Jakeman J, Kamath C, Kannan R, Lopes ME, Martinsson P-G, Myers K, Nelson J, Restrepo JM, Seshadhri C, Vrabie D, Wohlberg B, Wright SJ, Yang C and Zwart P (2021), "Randomized Algorithms for Scientific Computing (RASC)", April, 2021. [Abstract] [BibTeX] Abstract: Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of that workshop, "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021. BibTeX: @article{Buluc2021, author = {Aydin Buluc and Tamara G. Kolda and Stefan M. 
Wild and Mihai Anitescu and Anthony DeGennaro and John Jakeman and Chandrika Kamath and Ramakrishnan Kannan and Miles E. Lopes and Per-Gunnar Martinsson and Kary Myers and Jelani Nelson and Juan M. Restrepo and C. Seshadhri and Draguna Vrabie and Brendt Wohlberg and Stephen J. Wright and Chao Yang and Peter Zwart}, title = {Randomized Algorithms for Scientific Computing (RASC)}, year = {2021} }  Buluc A and Kizilkale C (2021), "HUNTRESS: a fast heuristic for reconstructing phylogenetic trees of tumor evolution (HUNTRESS) v0.1". [Abstract] [BibTeX] [DOI] Abstract: We introduce HUNTRESS (Histogrammed UNion Tree REconStruction heuriStic), a computational method for tumor phylogeny reconstruction from noisy genotype matrices derived from single-cell sequencing data, whose running time is linear with the number of cells and quadratic with the number of mutations. Provided that the input genotype matrix includes no false positives, each cellular subpopulation is at least a user defined fraction of the total number of cells, and the number of cells are much bigger than the number of mutations considered, HUNTRESS computes the ground truth tumor phylogeny with high probability. On simulated data HUNTRESS is faster than available alternatives with comparable or better accuracy. Additionally, the phylogenies reconstructed by HUNTRESS on two single-cell sequencing data sets agree with the best known evolutionary scenarios for the associated tumors. BibTeX: @misc{Buluc2021a, author = {Buluc, Aydin and Kizilkale, Can}, title = {HUNTRESS: a fast heuristic for reconstructing phylogenetic trees of tumor evolution (HUNTRESS) v0.1}, publisher = {Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)}, year = {2021}, doi = {10.11578/DC.20210416.2} }  Cai S-R and Hwang F-N (2022), "A hybrid-line-and-curve search globalization technique for inexact Newton methods", Applied Numerical Mathematics., March, 2022. Vol. 173, pp. 79-93. Elsevier BV. 
[Abstract] [BibTeX] [DOI] Abstract: The backtracking line search (LS) is one of the most commonly used techniques for enhancing the robustness of Newton-type methods. The Newton method consists of two key steps: search and update. LS tries to find a decreasing-most updated point along with Newton's search direction with an appropriate damping factor from the current approximation. The determination of Newton's search direction relies only on current information. When Newton's search direction is a weak descent direction, the damping factor determined by LS can be unacceptably small, which often happens for the numerical solution of large, sparse systems of equations with strong local nonlinearity. As a result, the solution process falls into the vicious cycle between no update and almost the same search direction. The intermediate solution is trapped within the same region without any progress. This work proposes a new globalization strategy, namely, the hybrid line and curve search (HLCS) technique for Newton-type methods to resolve their potential failure problems when line-search is used. If the classical line search fails, we activate the curve search phase. In that case, we first decompose the solution space into two orthogonal subspaces based on the predicted value obtained from Newton's search direction, referred to “good” and “bad” subspaces. The bad one corresponds to the components causing the violation of the sufficient decrease condition. Next, we project the original predicted value on the good subspace and then perform the nonlinear elimination process to obtain the corrected solution on the bad subspace. Hopefully, the new update can satisfy the sufficient decrease condition to enhance the convergence of inexact Newton. As proof of concept, we present three numerical examples to illustrate the effectiveness of our proposed inexact Newton-HLCS approach. 
BibTeX: @article{Cai2022, author = {Shang-Rong Cai and Feng-Nan Hwang}, title = {A hybrid-line-and-curve search globalization technique for inexact Newton methods}, journal = {Applied Numerical Mathematics}, publisher = {Elsevier BV}, year = {2022}, volume = {173}, pages = {79--93}, doi = {10.1016/j.apnum.2021.11.011} }  Cali S, Detmold W, Korcyl G, Korcyl P and Shanahan P (2021), "Implementation of the conjugate gradient algorithm for heterogeneous systems", November, 2021. [Abstract] [BibTeX] Abstract: Lattice QCD calculations require significant computational effort, with the dominant fraction of resources typically spent in the numerical inversion of the Dirac operator. One of the simplest methods to solve such large and sparse linear systems is the conjugate gradient (CG) approach. In this work we present an implementation of CG that can be executed on different devices, including CPUs, GPUs, and FPGAs. This is achieved by using the SYCL/DPC++ framework, which allows the execution of the same source code on heterogeneous systems. BibTeX: @article{Cali2021, author = {Salvatore Cali and William Detmold and Grzegorz Korcyl and Piotr Korcyl and Phiala Shanahan}, title = {Implementation of the conjugate gradient algorithm for heterogeneous systems}, year = {2021} }  Carson E, Lund K and Rozlosnik M (2021), "The Stability of Block Variants of the Classical Gram-Schmidt" [Abstract] [BibTeX] [URL] Abstract: The block version of classical Gram-Schmidt (BCGS) is often employed to efficiently compute orthogonal bases for Krylov subspace methods and eigenvalue solvers, but a rigorous proof of its stability behavior has not yet been established. It is shown that the usual implementation of BCGS can lose orthogonality at a rate worse than O(𝜖)κ²(X), where 𝜖 is the unit round-off. 
A useful intermediate quantity denoted as the Cholesky residual is given special attention and, along with a block generalization of the Pythagorean theorem, this quantity is used to develop more stable variants of BCGS. These variants are proved to have O(𝜖)κ²(X) loss of orthogonality with relatively relaxed conditions on the intra-block orthogonalization routine. A variety of numerical examples illustrate the theoretical bounds. BibTeX: @article{Carson2021, author = {Erin Carson and Kathryn Lund and Miroslav Rozlosnik}, title = {The Stability of Block Variants of the Classical Gram-Schmidt}, year = {2021}, url = {https://www.math.cas.cz/fichier/preprints/IM_20210124200723_43.pdf} }  Carson E and Gergelits T (2021), "Mixed Precision s-step Lanczos and Conjugate Gradient Algorithms", March, 2021. [Abstract] [BibTeX] Abstract: Compared to the classical Lanczos algorithm, the s-step Lanczos variant has the potential to improve performance by asymptotically decreasing the synchronization cost per iteration. However, this comes at a cost. Despite being mathematically equivalent, the s-step variant is known to behave quite differently in finite precision, with potential for greater loss of accuracy and a decrease in the convergence rate relative to the classical algorithm. It has previously been shown that the errors that occur in the s-step version follow the same structure as the errors in the classical algorithm, but with the addition of an amplification factor that depends on the square of the condition number of the O(s)-dimensional Krylov bases computed in each outer loop. As the condition number of these s-step bases grows (in some cases very quickly) with s, this limits the parameter s that can be chosen and thus limits the performance that can be achieved. In this work we show that if a select few computations in s-step Lanczos are performed in double the working precision, the error terms then depend only linearly on the conditioning of the s-step bases. 
This has the potential for drastically improving the numerical behavior of the algorithm with little impact on per-iteration performance. Our numerical experiments demonstrate the improved numerical behavior possible with the mixed precision approach, and also show that this improved behavior extends to the s-step CG algorithm in mixed precision. BibTeX: @article{Carson2021a, author = {Erin Carson and Tomáš Gergelits}, title = {Mixed Precision s-step Lanczos and Conjugate Gradient Algorithms}, year = {2021} }  Cartis C, Roberts L and Sheridan-Methven O (2021), "Escaping local minima with local derivative-free methods: a numerical investigation", Optimization., February, 2021. , pp. 1-31. Informa UK Limited. [Abstract] [BibTeX] [DOI] Abstract: We investigate the potential of applying a state-of-the-art, local derivative-free solver, Py-BOBYQA to global optimization problems. In particular, we demonstrate the potential of a restarts procedure – as distinct from multistart methods – to allow Py-BOBYQA to escape local minima (where ordinarily it would terminate at the first local minimum found). We also introduce an adaptive variant of restarts which yields improved performance on global optimization problems. As Py-BOBYQA is a model-based trust-region method, we compare largely with other global optimization methods for which (global) models are important, such as Bayesian optimization and response surface methods; we also consider state-of-the-art representative deterministic and stochastic codes, such as DIRECT and CMA-ES. We find numerically that the restarts procedures in Py-BOBYQA are effective at helping it to escape local minima, when compared to using no restarts in Py-BOBYQA. Additionally, we find that Py-BOBYQA with adaptive restarts has comparable performance with global optimization solvers for all accuracy/budget regimes, in both smooth and noisy settings. 
In particular, Py-BOBYQA variants are best performing for smooth and multiplicative noise problems in high-accuracy regimes. As a by-product, some preliminary conclusions can be drawn on the relative performance of the global solvers we have tested with default settings. BibTeX: @article{Cartis2021, author = {Coralia Cartis and Lindon Roberts and Oliver Sheridan-Methven}, title = {Escaping local minima with local derivative-free methods: a numerical investigation}, journal = {Optimization}, publisher = {Informa UK Limited}, year = {2021}, pages = {1--31}, doi = {10.1080/02331934.2021.1883015} }  Cartis C and Roberts L (2021), "Scalable Subspace Methods for Derivative-Free Nonlinear Least-Squares Optimization", February, 2021. [Abstract] [BibTeX] Abstract: We introduce a general framework for large-scale model-based derivative-free optimization based on iterative minimization within random subspaces. We present a probabilistic worst-case complexity analysis for our method, where in particular we prove high-probability bounds on the number of iterations before a given optimality is achieved. This framework is specialized to nonlinear least-squares problems, with a model-based framework based on the Gauss-Newton method. This method achieves scalability by constructing local linear interpolation models to approximate the Jacobian, and computes new steps at each iteration in a subspace with user-determined dimension. We then describe a practical implementation of this framework, which we call DFBGN. We outline efficient techniques for selecting the interpolation points and search subspace, yielding an implementation that has a low per-iteration linear algebra cost (linear in the problem dimension) while also achieving fast objective decrease as measured by evaluations. Extensive numerical results demonstrate that DFBGN has improved scalability, yielding strong performance on large-scale nonlinear least-squares problems. 
BibTeX: @article{Cartis2021a, author = {Coralia Cartis and Lindon Roberts}, title = {Scalable Subspace Methods for Derivative-Free Nonlinear Least-Squares Optimization}, year = {2021} }  Cartis C, Fiala J and Shao Z (2021), "Hashing embeddings of optimal dimension, with applications to linear least squares", May, 2021. [Abstract] [BibTeX] Abstract: The aim of this paper is two-fold: firstly, to present subspace embedding properties for s-hashing sketching matrices, with s ≥ 1, that are optimal in the projection dimension m of the sketch, namely, m = 𝒪(d), where d is the dimension of the subspace. A diverse set of results are presented that address the case when the input matrix has sufficiently low coherence (thus removing the log² d factor dependence in m, in the low-coherence result of Bourgain et al (2015) at the expense of a smaller coherence requirement); how this coherence changes with the number s of column nonzeros (allowing a scaling of s of the coherence bound), or is reduced through suitable transformations (when considering hashed -- instead of subsampled -- coherence reducing transformations such as randomised Hadamard). Secondly, we apply these general hashing sketching results to the special case of Linear Least Squares (LLS), and develop Ski-LLS, a generic software package for these problems, that builds upon and improves the Blendenpik solver on dense input and the (sequential) LSRN performance on sparse problems. In addition to the hashing sketching improvements, we add suitable linear algebra tools for rank-deficient and for sparse problems that lead Ski-LLS to outperform not only sketching-based routines on randomly generated input, but also state of the art direct solver SPQR and iterative code HSL on certain subsets of the sparse Florida matrix collection; namely, on least squares problems that are significantly overdetermined, or moderately sparse, or difficult. 
BibTeX: @article{Cartis2021b, author = {Coralia Cartis and Jan Fiala and Zhen Shao}, title = {Hashing embeddings of optimal dimension, with applications to linear least squares}, year = {2021} }  Cartis C, Massart E and Otemissov A (2021), "Global optimization using random embeddings", July, 2021. [Abstract] [BibTeX] Abstract: We propose a random-subspace algorithmic framework for global optimization of Lipschitz-continuous objectives, and analyse its convergence using novel tools from conic integral geometry. X-REGO randomly projects, in a sequential or simultaneous manner, the high-dimensional original problem into low-dimensional subproblems that can then be solved with any global, or even local, optimization solver. We estimate the probability that the randomly-embedded subproblem shares (approximately) the same global optimum as the original problem. This success probability is then used to show convergence of X-REGO to an approximate global solution of the original problem, under weak assumptions on the problem (having a strictly feasible global solution) and on the solver (guaranteed to find an approximate global solution of the reduced problem with sufficiently high probability). In the particular case of unconstrained objectives with low effective dimension, that only vary over a low-dimensional subspace, we propose an X-REGO variant that explores random subspaces of increasing dimension until finding the effective dimension of the problem, leading to X-REGO globally converging after a finite number of embeddings, proportional to the effective dimension. We show numerically that this variant efficiently finds both the effective dimension and an approximate global minimizer of the original problem. 
BibTeX: @article{Cartis2021c, author = {Coralia Cartis and Estelle Massart and Adilet Otemissov}, title = {Global optimization using random embeddings}, year = {2021} }  Cartis C, Kaouri MH, Lawless AS and Nichols NK (2021), "Convergent least-squares optimisation methods for variational data assimilation", July, 2021. [Abstract] [BibTeX] Abstract: Data assimilation combines prior (or background) information with observations to estimate the initial state of a dynamical system over a given time-window. A common application is in numerical weather prediction where a previous forecast and atmospheric observations are used to obtain the initial conditions for a numerical weather forecast. In four-dimensional variational data assimilation (4D-Var), the problem is formulated as a nonlinear least-squares problem, usually solved using a variant of the classical Gauss-Newton (GN) method. However, we show that GN may not converge if poorly initialised. In particular, we show that this may occur when there is greater uncertainty in the background information compared to the observations, or when a long time-window is used in 4D-Var allowing more observations. The difficulties GN encounters may lead to inaccurate initial state conditions for subsequent forecasts. To overcome this, we apply two convergent GN variants (line search and regularisation) to the long time-window 4D-Var problem and investigate the cases where they locate a more accurate estimate compared to GN within a given budget of computational time and cost. We show that these methods are able to improve the estimate of the initial state, which may lead to a more accurate forecast. BibTeX: @article{Cartis2021d, author = {Coralia Cartis and Maha H. Kaouri and Amos S. Lawless and Nancy K. 
Nichols}, title = {Convergent least-squares optimisation methods for variational data assimilation}, year = {2021} }  Catalyurek UV (2021), "Implementing Performance Portable Graph Algorithms Using Task-Based Execution", In Proceedings of the 2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms., November, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Summary form only given, as follows. The complete presentation was not made available for publication as part of the conference proceedings. Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this talk, I will present our recent graph processing model and framework, PGAbB, for modern shared-memory heterogeneous platforms. PGAbB implements a block-based programming model. This allows a user to express a graph algorithm using functors that operate on an ordered list of blocks (subgraphs). Our framework deploys these computations to all available resources in a heterogeneous architecture. We will demonstrate that one can implement a diverse set of graph algorithms in our framework, and task-based execution enables graph computations even on large graphs that do not fit in GPU device memory. Our experimental results show that PGAbB achieves competitive or superior performance compared to hand-optimized implementations or existing state-of-the-art graph computing frameworks. BibTeX: @inproceedings{Catalyurek2021, author = {Umit V. Catalyurek}, title = {Implementing Performance Portable Graph Algorithms Using Task-Based Execution}, booktitle = {Proceedings of the 2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ia354616.2021.00007} }  Ceccon F and Misener R (2021), "Solving the pooling problem at scale with extensible solver GALINI", May, 2021. 
[Abstract] [BibTeX] Abstract: This paper presents a Python library to model pooling problems, a class of network flow problems with many engineering applications. The library automatically generates a mixed-integer quadratically-constrained quadratic optimization problem from a given network structure. The library additionally uses the network structure to build 1) a convex linear relaxation of the non-convex quadratic program and 2) a mixed-integer linear restriction of the problem. We integrate the pooling network library with galini, an open-source extensible global solver for quadratic optimization. We demonstrate galini's extensible characteristics by using the pooling library to develop two galini plug-ins: 1) a cut generator plug-in that adds valid inequalities in the galini cut loop and 2) a primal heuristic plug-in that uses the mixed-integer linear restriction. We test galini on large scale pooling problems and show that, thanks to the good upper bound provided by the mixed-integer linear restriction and the good lower bounds provided by the convex relaxation, we obtain optimality gaps that are competitive with Gurobi 9.1 on the largest problem instances. BibTeX: @article{Ceccon2021, author = {Francesco Ceccon and Ruth Misener}, title = {Solving the pooling problem at scale with extensible solver GALINI}, year = {2021} }  Chakrabarti K, Gupta N and Chopra N (2022), "Iterative pre-conditioning for expediting the distributed gradient-descent method: The case of linear least-squares problem", Automatica., March, 2022. Vol. 137, pp. 110095. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: This paper considers the multi-agent linear least-squares problem in a server–agent network architecture. The system comprises multiple agents, each with a set of local data points. The agents are connected to a server, and there is no inter-agent communication. The agents’ goal is to compute a linear model that optimally fits the collective data. 
The agents, however, cannot share their data points. In principle, the agents can solve this problem by collaborating with the server using the server–agent network variant of the classical gradient-descent method. However, when the data points are ill-conditioned, the gradient-descent method requires a large number of iterations to converge. We propose an iterative pre-conditioning technique to mitigate the deleterious impact of the data points’ conditioning on the convergence rate of the gradient-descent method. Unlike the conventional pre-conditioning techniques, the pre-conditioner matrix used in our proposed technique evolves iteratively. We show that our proposed algorithm converges linearly with an improved rate of convergence in comparison to both the classical and the accelerated gradient-descent methods. For the special case, when the solution of the least-squares problem is unique, our algorithm converges to the solution superlinearly. Through numerical experiments on benchmark least-squares problems, we validate our theoretical findings, and also demonstrate our algorithm’s improved robustness against process noise. BibTeX: @article{Chakrabarti2022, author = {Kushal Chakrabarti and Nirupam Gupta and Nikhil Chopra}, title = {Iterative pre-conditioning for expediting the distributed gradient-descent method: The case of linear least-squares problem}, journal = {Automatica}, publisher = {Elsevier BV}, year = {2022}, volume = {137}, pages = {110095}, doi = {10.1016/j.automatica.2021.110095} }  Champion C, Champion M, Blazère M and Loubes J-M (2021), "l1-spectral clustering algorithm: a robust spectral clustering using Lasso regularization" [Abstract] [BibTeX] [URL] Abstract: Detecting cluster structure is a fundamental task to understand and visualize functional characteristics of a graph. 
Among the different clustering methods available, spectral clustering is one of the most widely used due to its speed and simplicity, while still being sensitive to perturbations imposed on the graph. This paper presents a robust variant of spectral clustering, called l_1-spectral clustering, based on Lasso regularization and adapted to perturbed graph models. By promoting sparse eigenbases solutions of specific l_1-minimization problems, it detects the hidden natural cluster structure of the graph. The effectiveness and robustness to noise perturbations of the l_1-spectral clustering algorithm is confirmed through a collection of simulated and real biological data. BibTeX: @article{Champion2021, author = {Camille Champion and Magali Champion and Mélanie Blazère and Jean-Michel Loubes}, title = {l1-spectral clustering algorithm: a robust spectral clustering using Lasso regularization}, year = {2021}, url = {https://hal.archives-ouvertes.fr/hal-03095805} }  Chen Y, Li W, Fan R and Liu X (2021), "GPU Optimization for High-Quality Kinetic Fluid Simulation", January, 2021. [Abstract] [BibTeX] Abstract: Fluid simulations are often performed using the incompressible Navier-Stokes equations (INSE), leading to sparse linear systems which are difficult to solve efficiently in parallel. Recently, kinetic methods based on the adaptive-central-moment multiple-relaxation-time (ACM-MRT) model have demonstrated impressive capabilities to simulate both laminar and turbulent flows, with quality matching or surpassing that of state-of-the-art INSE solvers. Furthermore, due to its local formulation, this method presents the opportunity for highly scalable implementations on parallel systems such as GPUs. However, an efficient ACM-MRT-based kinetic solver needs to overcome a number of computational challenges, especially when dealing with complex solids inside the fluid domain. 
In this paper, we present multiple novel GPU optimization techniques to efficiently implement high-quality ACM-MRT-based kinetic fluid simulations in domains containing complex solids. Our techniques include a new communication-efficient data layout, a load-balanced immersed-boundary method, a multi-kernel launch method using a simplified formulation of ACM-MRT calculations to enable greater parallelism, and the integration of these techniques into a parametric cost model to enable automated parameter search to achieve optimal execution performance. We also extended our method to multi-GPU systems to enable large-scale simulations. To demonstrate the state-of-the-art performance and high visual quality of our solver, we present extensive experimental results and comparisons to other solvers. BibTeX: @article{Chen2021, author = {Yixin Chen and Wei Li and Rui Fan and Xiaopei Liu}, title = {GPU Optimization for High-Quality Kinetic Fluid Simulation}, year = {2021} }  Chen P, Wahib M, Wang X, Takizawa S, Hirofuchi T, Ogawa H and Matsuoka S (2021), "Performance Portable Back-projection Algorithms on CPUs: Agnostic Data Locality and Vectorization Optimizations", April, 2021. [Abstract] [BibTeX] [DOI] Abstract: Computed Tomography (CT) is a key 3D imaging technology that fundamentally relies on the compute-intense back-projection operation to generate 3D volumes. GPUs are typically used for back-projection in production CT devices. However, with the rise of power-constrained micro-CT devices, and also the emergence of CPUs comparable in performance to GPUs, back-projection for CPUs could become favorable. Unlike GPUs, extracting parallelism for back-projection algorithms on CPUs is complex given that parallelism and locality are not explicitly defined and controlled by the programmer, as is the case when using CUDA for instance. 
We propose a collection of novel back-projection algorithms that reduce the arithmetic computation, robustly enable vectorization, enforce a regular memory access pattern, and maximize the data locality. We also implement the novel algorithms as efficient back-projection kernels that are performance portable over a wide range of CPUs. Performance evaluation using a variety of CPUs from different vendors and generations demonstrates that our back-projection implementation achieves on average 5.2x speedup over the multi-threaded implementation of the most widely used, and optimized, open library. With a state-of-the-art CPU, we reach performance that rivals top-performing GPUs. BibTeX: @article{Chen2021a, author = {Peng Chen and Mohamed Wahib and Xiao Wang and Shinichiro Takizawa and Takahiro Hirofuchi and Hirotaka Ogawa and Satoshi Matsuoka}, title = {Performance Portable Back-projection Algorithms on CPUs: Agnostic Data Locality and Vectorization Optimizations}, year = {2021}, doi = {10.1145/3447818.3460353} }  Chen J, Schäfer F, Huang J and Desbrun M (2021), "Multiscale Cholesky Preconditioning for Ill-conditioned Problems", ACM Transactions on Graphics. Vol. 40(4) [Abstract] [BibTeX] [DOI] [URL] Abstract: Many computer graphics applications boil down to solving sparse systems of linear equations. While the current arsenal of numerical solvers available in various specialized libraries and for different computer architectures often allow efficient and scalable solutions to image processing, modeling and simulation applications, an increasing number of graphics problems face large-scale and ill-conditioned sparse linear systems -- a numerical challenge which typically chokes both direct factorizations (due to high memory requirements) and iterative solvers (because of slow convergence). 
We propose a novel approach to the efficient preconditioning of such problems which often emerge from the discretization over unstructured meshes of partial differential equations with heterogeneous and anisotropic coefficients. Our numerical approach consists in simply performing a fine-to-coarse ordering and a multiscale sparsity pattern of the degrees of freedom, using which we apply an incomplete Cholesky factorization. By further leveraging supernodes for cache coherence, graph coloring to improve parallelism and partial diagonal shifting to remedy negative pivots, we obtain a preconditioner which, combined with a conjugate gradient solver, far exceeds the performance of existing carefully-engineered libraries for graphics problems involving bad mesh elements and/or high contrast of coefficients. We also back the core concepts behind our simple solver with theoretical foundations linking the recent method of operator-adapted wavelets used in numerical homogenization to the traditional Cholesky factorization of a matrix, providing us with a clear bridge between incomplete Cholesky factorization and multiscale analysis that we leverage numerically. BibTeX: @article{Chen2021b, author = {Jiong Chen and Florian Schäfer and Jin Huang and Mathieu Desbrun}, title = {Multiscale Cholesky Preconditioning for Ill-conditioned Problems}, journal = {ACM Transactions on Graphics}, year = {2021}, volume = {40}, number = {4}, url = {http://www.geometry.caltech.edu/pubs/CSHD21.pdf}, doi = {10.1145/3450626.3459851} }  Chen Y, Özsu MT, Xiao G, Tang Z and Li K (2021), "GSmart: An Efficient SPARQL Query Engine Using Sparse Matrix Algebra -- Full Version", June, 2021. [Abstract] [BibTeX] Abstract: Efficient execution of SPARQL queries over large RDF datasets is a topic of considerable interest due to increased use of RDF to encode data. Most of this work has followed either relational or graph-based approaches. 
In this paper, we propose an alternative query engine, called gSmart, based on matrix algebra. This approach can potentially better exploit the computing power of high-performance heterogeneous architectures that we target. gSmart incorporates: (1) grouped incident edge-based SPARQL query evaluation, in which all unevaluated edges of a vertex are evaluated together using a series of matrix operations to fully utilize query constraints and narrow down the solution space; (2) a graph query planner that determines the order in which vertices in query graphs should be evaluated; (3) memory- and computation-efficient data structures including the light-weight sparse matrix (LSpM) storage for RDF data and the tree-based representation for evaluation results; (4) a multi-stage data partitioner to map the incident edge-based query evaluation into heterogeneous HPC architectures and develop multi-level parallelism; and (5) a parallel executor that uses the fine-grained processing scheme, pre-pruning technique, and tree-pruning technique to lower inter-node communication and enable high throughput. Evaluations of gSmart on a CPU+GPU HPC architecture show execution time speedups of up to 46920.00x compared to the existing SPARQL query engines on a single node machine. Additionally, gSmart on the Tianhe-1A supercomputer achieves a maximum speedup of 6.90x scaling from 2 to 16 CPU+GPU nodes. BibTeX: @article{Chen2021c, author = {Yuedan Chen and M. Tamer Özsu and Guoqing Xiao and Zhuo Tang and Kenli Li}, title = {GSmart: An Efficient SPARQL Query Engine Using Sparse Matrix Algebra -- Full Version}, year = {2021} }  Chen Y and Chung Y-C (2021), "Workload Balancing via Graph Reordering on Multicore Systems", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). 
[Abstract] [BibTeX] [DOI] Abstract: In a shared-memory multicore system, the intrinsic irregular data structure of graphs leads to poor cache utilization, and therefore deteriorates the performance of graph analytics. To address the problem, prior works have proposed a variety of lightweight reordering methods with focus on the optimization of cache locality. However, there is a compromise between cache locality and workload balance. Little insight has been devoted into the issue of workload imbalance for the underlying multicore system, which degrades the effectiveness of parallel graph processing. In this work, a measurement approach is proposed to quantify the imbalance incurred by the concentration of vertices. Inspired by it, we present Cache-aware Reorder (Corder), a lightweight reordering method exploiting the cache hierarchy of multicore systems. At the shared-memory level, Corder promotes even distribution of computation loads amongst multicores. At the private-cache level, Corder facilitates cache efficiency by applying further refinement to local vertex order. Comprehensive performance evaluation of Corder is conducted on various graph applications and datasets. Experimental results show that Corder yields speedup of up to 2.59X and on average 1.45X, which significantly outperforms existing lightweight reordering methods. To identify the root causes of performance boost delivered by Corder, multicore activities are investigated in terms of thread behavior, cache efficiency, and memory utilization. Statistical analysis demonstrates that the issue of imbalanced thread execution time dominates other factors in determining the overall graph processing time. Moreover, Corder achieves remarkable advantages in cross-platform scalability and reordering overhead. 
BibTeX: @article{Chen2021d, author = {Yuang Chen and Yeh-Ching Chung}, title = {Workload Balancing via Graph Reordering on Multicore Systems}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tpds.2021.3105323} }  Chen Y, Brock B, Porumbescu S, Buluç A, Yelick K and Owens JD (2021), "Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations", November, 2021. [Abstract] [BibTeX] Abstract: We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers implicit task-parallel load balancing in addition to data-parallel load balancing, providing users the flexibility to balance between them to achieve optimal performance. Finally, Atos allows users to adapt to different use cases by controlling the kernel strategy and task-parallel granularity. We demonstrate that each of these controls is important in practice. We evaluate and analyze the performance of Atos vs. BSP on three applications: breadth-first search, PageRank, and graph coloring. Atos implementations achieve geomean speedups of 3.44x, 2.1x, and 2.77x and peak speedups of 12.8x, 3.2x, and 9.08x across three case studies, compared to a state-of-the-art BSP GPU implementation. Beyond simply quantifying the speedup, we extensively analyze the reasons behind each speedup. This deeper understanding allows us to derive general guidelines for how to select the optimal Atos configuration for different applications. 
Finally, our analysis provides insights for future dynamic scheduling framework designs. BibTeX: @article{Chen2021e, author = {Yuxin Chen and Benjamin Brock and Serban Porumbescu and Aydın Buluç and Katherine Yelick and John D. Owens}, title = {Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations}, year = {2021} }  Chen F, Cheung G and Zhang X (2021), "Fast Computation of Generalized Eigenvectors for Manifold Graph Embedding", December, 2021. [Abstract] [BibTeX] Abstract: Our goal is to efficiently compute low-dimensional latent coordinates for nodes in an input graph -- known as graph embedding -- for subsequent data processing such as clustering. Focusing on finite graphs that are interpreted as uniform samples on continuous manifolds (called manifold graphs), we leverage existing fast extreme eigenvector computation algorithms for speedy execution. We first pose a generalized eigenvalue problem for sparse matrix pair (A, B), where A = L - μ Q + 𝜖 I is a sum of graph Laplacian L and disconnected two-hop difference matrix Q. Eigenvector v minimizing Rayleigh quotient (v^⊤ A v) / (v^⊤ v) thus minimizes 1-hop neighbor distances while maximizing distances between disconnected 2-hop neighbors, preserving graph structure. Matrix B = diag({b_i}) that defines eigenvector orthogonality is then chosen so that boundary / interior nodes in the sampling domain have the same generalized degrees. K-dimensional latent vectors for the N graph nodes are the first K generalized eigenvectors for (A, B), computed in 𝒪(N) using LOBPCG, where K ≪ N. Experiments show that our embedding is among the fastest in the literature, while producing the best clustering performance for manifold graphs. 
BibTeX: @article{Chen2021f, author = {Fei Chen and Gene Cheung and Xue Zhang}, title = {Fast Computation of Generalized Eigenvectors for Manifold Graph Embedding}, year = {2021} }  Chen J, Davis TA, Lourenco C and Moreno-Centeno E (2021), "Sparse Exact Factorization Update", In Proceedings of the 2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms., November, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: To meet the growing need for extended or exact precision solvers, an efficient framework based on Integer-Preserving Gaussian Elimination (IPGE) has been recently developed which includes dense/sparse LU/Cholesky factorizations and dense LU/Cholesky factorization updates for column and/or row replacement. In this paper, we discuss our on-going work developing the sparse LU/Cholesky column/row-replacement update and the sparse rank-l update/downdate. We first present some basic background for the exact factorization framework based on IPGE. Then we give our proposed algorithms along with some implementation and data-structure details. Finally, we provide some experimental results showcasing the performance of our update algorithms. Specifically, we show that updating these exact factorizations can be typically 10x to 100x faster than (re-)factorizing the matrices from scratch. BibTeX: @inproceedings{Chen2021g, author = {Jinhao Chen and Timothy A. Davis and Christopher Lourenco and Erick Moreno-Centeno}, title = {Sparse Exact Factorization Update}, booktitle = {Proceedings of the 2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ia354616.2021.00012} }  Cheshmi K, Strout MM and Dehnavi MM (2021), "Composing Loop-carried Dependence with Other Loops", November, 2021. [Abstract] [BibTeX] Abstract: Sparse fusion is a compile-time loop transformation and runtime scheduling implemented as a domain-specific code generator. 
Sparse fusion generates efficient parallel code for the combination of two sparse matrix kernels where at least one of the kernels has loop-carried dependencies. Available implementations optimize individual sparse kernels. When optimized separately, the irregular dependence patterns of sparse kernels create synchronization overheads and load imbalance, and their irregular memory access patterns result in inefficient cache usage, which reduces parallel efficiency. Sparse fusion uses a novel inspection strategy with code transformations to generate parallel fused code for sparse kernel combinations that is optimized for data locality and load balance. Code generated by Sparse fusion outperforms the existing implementations ParSy and MKL on average 1.6X and 5.1X respectively and outperforms the LBC and DAGP coarsening strategies applied to a fused data dependence graph on average 5.1X and 7.2X respectively for various kernel combinations. BibTeX: @article{Cheshmi2021, author = {Kazem Cheshmi and Michelle Mills Strout and Maryam Mehri Dehnavi}, title = {Composing Loop-carried Dependence with Other Loops}, year = {2021} }  Chiang N-Y, Peles S, Petra CG, Regev S, Swirydowicz K and Wang JF (2021), "ExaSGD: 2021 Kernel Thrust Activities, WBS 2.2.4.02, Milestone ADSE22-214". Thesis at: Lawrence Livermore National Laboratory. [Abstract] [BibTeX] [URL] Abstract: The Kernel Thrust milestone ADSE22-214 covers the development of device-capable optimization algorithms and solvers technologies required by the ExaSGD project’s software stack in order to solve security-constrained alternating current optimal power flow (SC-ACOPF) problems on emerging exascale architectures. 
To this extent, in FY21 the main objective of the Kernel Thrust was to (i) provide robust optimization solver(s) that run efficiently on hardware accelerator devices (i.e., NVIDIA and AMD GPUs) to perform intra-node computations and (ii) provide coarse-grain parallel optimization capabilities that exploit the decomposition opportunities present in the SC-ACOPF challenge problems to provide exascale-capable solvers. This document presents the developments and contributions done by the Kernels Thrust Team in FY21 toward completion of the two above-mentioned objectives. These contributions progressed along four main development areas: (1) design and development of a memory-distributed optimization solver (HiOp-PriDec) that uses a mathematically sound optimization algorithm and nonblocking MPI directives to achieve efficient coarse-grain parallelism required by exascale machines; (2) completion of the device-only mixed dense-sparse optimization solver (HiOp-MDS) to ensure good device utilization for select optimization subproblems; (3) porting of the sparse optimization kernels to device/GPU via RAJA performance portability layer; (4) benchmarking and development of sparse linear solvers for device computations, relevant to device-capable optimization solvers for general sparse ACOPF analyses. The development of the HiOp-PriDec solver and the underlying algorithm was new in FY21 and ended in a highly parallel C++ MPI-based implementation used on multiple HPC platforms including Summit@ORNL to solve the SC-ACOPF challenge problem. The second task was a continuation of the efforts from FY20 and provided a highly efficient device optimization solver for select ACOPF subproblems of the challenge SC-ACOPF problem. 
The last two tasks are a holistic computational effort to design device computations for solving nonlinear sparse optimization problems (e.g., the ACOPF subproblems of the SC-ACOPF challenge problems) in a way that achieves superior device utilization (e.g., high FLOPs rate). The developments of tasks 1–3 are part of the HiOp suite of optimization solvers and have been released at https://github.com/LLNL/hiop. In a large collaborative effort, teams from multiple labs (LLNL, PNNL, ORNL, and NREL) performed a large-scale demonstration of the ExaSGD software stack, namely the optimization solvers of HiOp interfaced with the modeling front-end ExaGO and the stochastic sampler PowerScenarios. These demonstration efforts solved large-scale instances of the SC-ACOPF challenge problem of medium network sizes and large number of contingencies (10 000) up to 1 920 MPI ranks (480 nodes) on Summit. It is worth mentioning that most of the above-mentioned software stack was also successfully ported to the pre-Frontier hardware (activities formally done under the Software Stack). These efforts will intensify in FY22 with the goal of ensuring good device utilization and good coarse-grain parallel efficiency, as well as providing exascale capability runs for the challenge problem SC-ACOPF. BibTeX: @techreport{Chiang2021, author = {Nai-Yuan Chiang and Slaven Peles and Cosmin G. Petra and Shaked Regev and Katarzyna Swirydowicz and Jingyi Frank Wang}, title = {ExaSGD: 2021 Kernel Thrust Activities, WBS 2.2.4.02, Milestone ADSE22-214}, school = {Lawrence Livermore National Laboratory}, year = {2021}, url = {https://www.osti.gov/biblio/1828670} }  Chien C-H, Fan H, Abdelfattah A, Tsigaridas E, Tomov S and Kimia B (2021), "GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision", December, 2021. [Abstract] [BibTeX] Abstract: Systems of polynomial equations arise frequently in computer vision, especially in multiview geometry problems. 
Traditional methods for solving these systems typically aim to eliminate variables to reach a univariate polynomial, e.g., a tenth-order polynomial for 5-point pose estimation, using clever manipulations, or more generally using Grobner basis, resultants, and elimination templates, leading to successful algorithms for multiview geometry and other problems. However, these methods do not work when the problem is complex and when they do, they face efficiency and stability issues. Homotopy Continuation (HC) can solve more complex problems without the stability issues, and with guarantees of a global solution, but they are known to be slow. In this paper we show that HC can be parallelized on a GPU, showing significant speedups up to 26 times on polynomial benchmarks. We also show that GPU-HC can be generically applied to a range of computer vision problems, including 4-view triangulation and trifocal pose estimation with unknown focal length, which cannot be solved with elimination template but they can be efficiently solved with HC. GPU-HC opens the door to easy formulation and solution of a range of computer vision problems. BibTeX: @article{Chien2021, author = {Chiang-Heng Chien and Hongyi Fan and Ahmad Abdelfattah and Elias Tsigaridas and Stanimire Tomov and Benjamin Kimia}, title = {GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision}, year = {2021} }  Choi Y-G, Lee S and Yu D (2021), "An efficient parallel block coordinate descent algorithm for large-scale precision matrix estimation using graphics processing units", June, 2021. [Abstract] [BibTeX] Abstract: Large-scale sparse precision matrix estimation has attracted wide interest from the statistics community. The convex partial correlation selection method (CONCORD) developed by Khare et al. (2015) has recently been credited with some theoretical properties for estimating sparse precision matrices. 
The CONCORD obtains its solution by a coordinate descent algorithm (CONCORD-CD) based on the convexity of the objective function. However, since a coordinate-wise update in CONCORD-CD is inherently serial, a scale-up is nontrivial. In this paper, we propose a novel parallelization of CONCORD-CD, namely, CONCORD-PCD. CONCORD-PCD partitions the off-diagonal elements into several groups and updates each group simultaneously without harming the computational convergence of CONCORD-CD. We guarantee this by employing the notion of edge coloring in graph theory. Specifically, we establish a nontrivial correspondence between scheduling the updates of the off-diagonal elements in CONCORD-CD and coloring the edges of a complete graph. It turns out that CONCORD-PCD simultaneously updates off-diagonal elements in which the associated edges are colorable with the same color. As a result, the number of steps required for updating off-diagonal elements reduces from p(p-1)/2 to p-1 (for even p) or p (for odd p), where p denotes the number of variables. We prove that the number of such steps is irreducible. In addition, CONCORD-PCD is tailored to single-instruction multiple-data (SIMD) parallelism. A numerical study shows that the SIMD-parallelized PCD algorithm implemented in graphics processing units (GPUs) boosts the CONCORD-CD algorithm multiple times. BibTeX: @article{Choi2021, author = {Young-Geun Choi and Seunghwan Lee and Donghyeon Yu}, title = {An efficient parallel block coordinate descent algorithm for large-scale precision matrix estimation using graphics processing units}, year = {2021} }  Chou S and Amarasinghe S (2021), "Dynamic Sparse Tensor Algebra Compilation", December, 2021. [Abstract] [BibTeX] Abstract: This paper shows how to generate efficient tensor algebra code that computes on dynamic sparse tensors, which have sparsity structures that evolve over time. 
We propose a language for precisely specifying recursive, pointer-based data structures, and we show how this language can express a wide range of dynamic data structures that support efficient modification, such as linked lists, binary search trees, and B-trees. We then describe how, given high-level specifications of such data structures, a compiler can generate code to efficiently iterate over and compute with dynamic sparse tensors that are stored in the aforementioned data structures. Furthermore, we define an abstract interface that captures how nonzeros can be inserted into dynamic data structures, and we show how this abstraction guides a compiler to emit efficient code that store the results of sparse tensor algebra computations in dynamic data structures. We evaluate our technique and find that it generates efficient dynamic sparse tensor algebra kernels. Code that our technique emits to compute the main kernel of the PageRank algorithm is 1.05× as fast as Aspen, a state-of-the-art dynamic graph processing framework. Furthermore, our technique outperforms PAM, a parallel ordered (key-value) maps library, by 7.40× when used to implement element-wise addition of a dynamic sparse matrix to a static sparse matrix. BibTeX: @article{Chou2021, author = {Stephen Chou and Saman Amarasinghe}, title = {Dynamic Sparse Tensor Algebra Compilation}, year = {2021} }  Chou H-H, Maly J and Rauhut H (2021), "More is Less: Inducing Sparsity via Overparameterization", December, 2021. [Abstract] [BibTeX] Abstract: In deep learning it is common to overparameterize the neural networks, that is, to use more parameters than training samples. Quite surprisingly training the neural network via (stochastic) gradient descent leads to models that generalize very well, while classical statistics would suggest overfitting. In order to gain understanding of this implicit bias phenomenon we study the special case of sparse recovery (compressive sensing) which is of interest on its own. 
More precisely, in order to reconstruct a vector from underdetermined linear measurements, we introduce a corresponding overparameterized square loss functional, where the vector to be reconstructed is deeply factorized into several vectors. We show that, under a very mild assumption on the measurement matrix, vanilla gradient flow for the overparameterized loss functional converges to a solution of minimal ℓ1-norm. The latter is well-known to promote sparse solutions. As a by-product, our results significantly improve the sample complexity for compressive sensing in previous works. The theory accurately predicts the recovery rate in numerical experiments. For the proofs, we introduce the concept of solution entropy, which bypasses the obstacles caused by non-convexity and should be of independent interest. BibTeX: @article{Chou2021a, author = {Hung-Hsu Chou and Johannes Maly and Holger Rauhut}, title = {More is Less: Inducing Sparsity via Overparameterization}, year = {2021} }  Chue Hong NP, Katz DS, Barker M, Lamprecht A-L, Martinez C, Psomopoulos FE, Harrow J, Castro LJ, Gruenpeter M, Martinez PA and Honeyman T (2021), "FAIR Principles for Research Software (FAIR4RS Principles)" Research Data Alliance. [Abstract] [BibTeX] [DOI] Abstract: Research software is a fundamental and vital part of research worldwide, yet there remain significant challenges to software productivity, quality, reproducibility, and sustainability. Improving the practice of scholarship is a common goal of the open science, open source software and FAIR (Findable, Accessible, Interoperable and Reusable) communities, but improving the sharing of research software has not yet been a strong focus of the latter. To improve the FAIRness of research software, the FAIR for Research Software (FAIR4RS) Working Group has sought to understand how to apply the FAIR Guiding Principles for scientific data management and stewardship to research software, bringing together existing and new community efforts. 
Many of the FAIR Guiding Principles can be directly applied to research software by treating software and data as similar digital research objects. However, specific characteristics of software -- such as its executability, composite nature, and continuous evolution and versioning -- make it necessary to revise and extend the principles. This document presents the first version of the FAIR Principles for Research Software (FAIR4RS Principles). It is an outcome of the FAIR for Research Software Working Group (FAIR4RS WG). The FAIR for Research Software Working Group is jointly convened as an RDA Working Group, FORCE11 Working Group, and Research Software Alliance (ReSA) Task Force. BibTeX: @article{ChueHong2021, author = {Chue Hong, Neil P. and Katz, Daniel S. and Barker, Michelle and Lamprecht, Anna-Lena and Martinez, Carlos and Psomopoulos, Fotis E. and Harrow, Jen and Castro, Leyla Jael and Gruenpeter, Morane and Martinez, Paula Andrea and Honeyman, Tom}, title = {FAIR Principles for Research Software (FAIR4RS Principles)}, publisher = {Research Data Alliance}, year = {2021}, doi = {10.15497/RDA00065} }  Cipolla S, Donatelli M and Durastante F (2021), "Regularization of inverse problems by an approximate matrix-function technique", Numerical Algorithms., April, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: In this work, we introduce and investigate a class of matrix-free regularization techniques for discrete linear ill-posed problems based on the approximate computation of a special matrix-function. In order to produce a regularized solution, the proposed strategy employs a regular approximation of the Heavyside step function computed into a small Krylov subspace. This particular feature allows our proposal to be independent from the structure of the underlying matrix. 
If on the one hand, the use of the Heavyside step function prevents the amplification of the noise by suitably filtering the responsible components of the spectrum of the discretization matrix, on the other hand, it permits the correct reconstruction of the signal inverting the remaining part of the spectrum. Numerical tests on a gallery of standard benchmark problems are included to prove the efficacy of our approach even for problems affected by a high level of noise. BibTeX: @article{Cipolla2021, author = {Stefano Cipolla and Marco Donatelli and Fabio Durastante}, title = {Regularization of inverse problems by an approximate matrix-function technique}, journal = {Numerical Algorithms}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11075-021-01076-y} }  Coppitters D (2021), "Robust design optimization of hybrid renewable energy systems". Thesis at: Vrije Universiteit Brussel. [Abstract] [BibTeX] Abstract: Driven by carbon-neutrality, the deployment of photovoltaic arrays and wind turbines increases rapidly in the power, heating and mobility sectors. To comply with the needs of each sector, these renewable energy systems are coupled with different energy storage technologies and energy conversion technologies, resulting in a diverse set of hybrid renewable energy systems. Designing such a hybrid renewable energy system requires information on the technical, economic and environmental performance of each component, as well as information on the climate and energy demand. These parameters are likely to vary during the system lifetime (i.e., aleatory uncertainty), and data resources on these variations are usually limited (i.e., epistemic uncertainty). Considering these uncertainties in the design of hybrid renewable energy systems is still an exception rather than the norm. 
However, disregarding uncertainty can result in a drastic mismatch between simulated and actual performance, and thus lead to a kill-by-randomness of the system. In other fields, such as structural mechanics and aerospace engineering, robust design optimization has already resulted in improved product quality, by providing designs that are less sensitive to the random environment. Despite its potential, applying robust design optimization on hybrid renewable energy systems is not yet studied. Therefore, the research question of this thesis reads: What is the added value of robust design optimization to hybrid renewable energy systems? To answer this question, this thesis followed three steps. First, a surrogate-assisted robust design optimization framework has been developed, using state-of-the-art optimization and uncertainty quantification algorithms. Despite being limited to problems with a low stochastic dimension (i.e., less than 15 uncertainties), this framework allows defining robust designs for two-component renewable energy systems, optimized for a single quantity of interest. However, hybrid renewable energy systems are typically multi-component systems, with multiple, cross-field objectives (i.e., technical, economic and environmental objectives). Hence, in the second step of this thesis, the uncertainty quantification algorithm has been modified. This modification made it possible to handle a large stochastic dimension, and thus to define robust designs for complex, multi-component hybrid renewable energy systems in a holistic context. In the third and final step, an imprecise probability method is proposed, to distinguish between epistemic and aleatory uncertainty on a parameter. 
In this new formulation, the robust design is optimized for the irreducible, aleatory uncertainty, and the global sensitivity analysis is reserved for the reducible, epistemic uncertainty. The robust design optimization algorithm has been applied on three specific hybrid renewable energy systems: a photovoltaic-battery-hydrogen system, a renewable-powered hydrogen refueling station and a photovoltaic-battery-heat pump system with thermal storage. The results indicate that the robust designs are characterized by a higher penetration of renewable energy systems and by considering energy storage: Coupling battery storage and hydrogen storage to a grid-connected photovoltaic array reduces the standard deviation of the levelized cost of electricity by 42 %; A photovoltaic-battery-heat pump with thermal storage system reduces the standard deviation of the levelized cost of exergy by 36 %, as opposed to the photovoltaic-battery-gas boiler system; Shifting towards a bus fleet that partly consists of hydrogen-fueled buses (54% of the fleet) reduces the standard deviation of the levelized cost of driving (36 %), the mean of the carbon intensity (46 %) and the standard deviation of the carbon intensity (51 %), at the expense of a limited increase in the mean of the levelized cost of driving (11 %). As a conclusion, robust design optimization provides an added value in the design of hybrid renewable energy systems, the method complies with the computational burden of holistic design expectations, and it is adaptable to more advanced uncertainty characterization techniques. BibTeX: @phdthesis{Coppitters2021, author = {Diederik Coppitters}, title = {Robust design optimization of hybrid renewable energy systems}, school = {Vrije Universiteit Brussel}, year = {2021} }  Coronado-Barrientos E, Antonioletti M and Garcia-Loureiro A (2021), "A new AXT format for an efficient SpMV product using AVX-512 instructions and CUDA", Advances in Engineering Software., June, 2021. Vol. 156, pp. 
102997. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The Sparse Matrix-Vector (SpMV) product is a key operation used in many scientific applications. This work proposes a new sparse matrix storage scheme, the AXT format, that improves the SpMV performance on vector capability platforms. AXT can be adapted to different platforms, improving the storage efficiency for matrices with different sparsity patterns. Intel AVX-512 instructions and CUDA are used to optimise the performances of the four different AXT subvariants. Performance comparisons are made with the Compressed Sparse Row (CSR) and AXC formats on an Intel Xeon Gold 6148 processor and an NVIDIA Tesla V100 Graphics Processing Units using 26 matrices. On the Intel platform the overall AXT performance is 18% and 44.3% higher than the AXC and CSR respectively, reaching speed-up factors of up to x7.33. On the NVIDIA platform the AXT performance is 44% and 8% higher than the AXC and CSR performances respectively, reaching speed-up factors of up to x378.5. BibTeX: @article{CoronadoBarrientos2021, author = {E. Coronado-Barrientos and M. Antonioletti and A. Garcia-Loureiro}, title = {A new AXT format for an efficient SpMV product using AVX-512 instructions and CUDA}, journal = {Advances in Engineering Software}, publisher = {Elsevier BV}, year = {2021}, volume = {156}, pages = {102997}, doi = {10.1016/j.advengsoft.2021.102997} }  Costanzo M, Rucci E, Costi U, Chichizola F and Naiouf M (2021), "Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal", In Communications in Computer and Information Science. , pp. 37-49. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: Today, one of the main challenges for high-performance computing systems is to improve their performance by keeping energy consumption at acceptable levels. 
In this context, a consolidated strategy consists of using accelerators such as GPUs or many-core Intel Xeon Phi processors. In this work, devices of the NVIDIA Pascal and Intel Xeon Phi Knights Landing architectures are described and compared. Selecting the Floyd-Warshall algorithm as a representative case of graph and memory-bound applications, optimized implementations were developed to analyze and compare performance and energy efficiency on both devices. As it was expected, Xeon Phi showed superior when considering double-precision data. However, contrary to what was considered in our preliminary analysis, it was found that the performance and energy efficiency of both devices were comparable using single-precision datatype. BibTeX: @incollection{Costanzo2021, author = {Manuel Costanzo and Enzo Rucci and Ulises Costi and Franco Chichizola and Marcelo Naiouf}, title = {Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal}, booktitle = {Communications in Computer and Information Science}, publisher = {Springer International Publishing}, year = {2021}, pages = {37--49}, doi = {10.1007/978-3-030-75836-3_3} }  Coupette C and Vreeken J (2021), "Graph Similarity Description: How Are These Graphs Similar?" [Abstract] [BibTeX] [DOI] Abstract: How do social networks differ across platforms? How do information networks change over time? Answering questions like these requires us to compare two or more graphs. This task is commonly treated as a measurement problem, but numerical answers give limited insight. Here, we argue that if the goal is to gain understanding, we should treat graph similarity assessment as a description problem instead. We formalize this problem as a model selection task using the Minimum Description Length principle, capturing the similarity of the input graphs in a common model and the differences between them in transformations to individual models. 
To discover good models, we propose Momo, which breaks the problem into two parts and introduces efficient algorithms for each. Through an extensive set of experiments on a wide range of synthetic and real-world graphs, we confirm that Momo works well in practice. BibTeX: @inproceedings{Coupette2021, author = {Corinna Coupette and Jilles Vreeken}, title = {Graph Similarity Description: How Are These Graphs Similar?}, year = {2021}, doi = {10.1145/3447548.3467257} }  Courbet C (2021), "NSan: A Floating-Point Numerical Sanitizer", February, 2021. [Abstract] [BibTeX] [DOI] Abstract: Sanitizers are a relatively recent trend in software engineering. They aim at automatically finding bugs in programs, and they are now commonly available to programmers as part of compiler toolchains. For example, the LLVM project includes out-of-the-box sanitizers to detect thread safety (tsan), memory (asan,msan,lsan), or undefined behaviour (ubsan) bugs. In this article, we present nsan, a new sanitizer for locating and debugging floating-point numerical issues, implemented inside the LLVM sanitizer framework. nsan puts emphasis on practicality. It aims at providing precise, and actionable feedback, in a timely manner. nsan uses compile-time instrumentation to augment each floating-point computation in the program with a higher-precision shadow which is checked for consistency during program execution. This makes nsan between 1 and 4 orders of magnitude faster than existing approaches, which allows running it routinely as part of unit tests, or detecting issues in large production applications. 
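The shadow-value idea in the nsan abstract can be illustrated with a minimal Python sketch. This is not how nsan is implemented (the abstract describes compile-time instrumentation inside LLVM); the `Shadowed` class, the `to_f32` helper, and the `suspicious` check with its tolerance are hypothetical names chosen for illustration. Each value is kept in emulated single precision next to a double-precision shadow, and the two are compared once the computation is done.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


class Shadowed:
    """A single-precision value paired with a double-precision shadow.

    Illustrative only: a real sanitizer instruments compiled code instead.
    """

    def __init__(self, value, shadow=None):
        self.value = to_f32(value)  # working (low) precision
        self.shadow = float(value) if shadow is None else shadow  # higher precision

    def __add__(self, other):
        return Shadowed(self.value + other.value, self.shadow + other.shadow)

    def __sub__(self, other):
        return Shadowed(self.value - other.value, self.shadow - other.shadow)

    def __mul__(self, other):
        return Shadowed(self.value * other.value, self.shadow * other.shadow)


def suspicious(x, rel_tol=1e-4):
    """Flag a result whose working value drifted away from its shadow."""
    scale = max(abs(x.shadow), 1e-300)  # guard against division by zero
    return abs(x.value - x.shadow) / scale > rel_tol


# Absorption followed by cancellation: (1e8 + 1) - 1e8
big, one = Shadowed(1e8), Shadowed(1.0)
result = (big + one) - big
print(result.value, result.shadow, suspicious(result))
```

In single precision the `+ 1` is absorbed, because neighbouring values near 1e8 are 8 apart, so the working result is 0.0 while the shadow holds 1.0 and the check flags the mismatch.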
BibTeX: @article{Courbet2021, author = {Clement Courbet}, title = {NSan: A Floating-Point Numerical Sanitizer}, year = {2021}, doi = {10.1145/3446804.3446848} }  Crainic TG, Feliu JG, Ricciardi N, Semet F and Woensel TV (2021), "Operations Research for Planning and Managing City Logistics Systems" [Abstract] [BibTeX] [URL] Abstract: The chapter presents the Operations Research models and methods to plan and manage City Logistics systems, in particular their supply components. It presents the main planning issues and challenges, and reviews the proposed methodologies. The chapter concludes with a discussion on perspectives for City Logistics and decision-support methodological developments. BibTeX: @inbook{Crainic2021, author = {Teodor Gabriel Crainic and Jesus Gonzalez Feliu and Nicoletta Ricciardi and Frédéric Semet and Tom Van Woensel}, title = {Operations Research for Planning and Managing City Logistics Systems}, year = {2021}, url = {https://hal.inria.fr/hal-03464029/document} }  Cressie N, Sainsbury-Dale M and Zammit-Mangion A (2021), "Basis-Function Models in Spatial Statistics", Annual Review of Statistics and Its Application., 11, 2021. Vol. 9(1) Annual Reviews. [Abstract] [BibTeX] [DOI] Abstract: Spatial statistics is concerned with the analysis of data that have spatial locations associated with them, and those locations are used to model statistical dependence between the data. The spatial data are treated as a single realization from a probability model that encodes the dependence through both fixed effects and random effects, where randomness is manifest in the underlying spatial process and in the noisy, incomplete measurement process. The focus of this review article is on the use of basis functions to provide an extremely flexible and computationally efficient way to model spatial processes that are possibly highly nonstationary. 
Several examples of basis-function models are provided to illustrate how they are used in Gaussian, non-Gaussian, multivariate, and spatio-temporal settings, with applications in geophysics. Our aim is to emphasize the versatility of these spatial-statistical models and to demonstrate that they are now center-stage in a number of application domains. The review concludes with a discussion and illustration of software currently available to fit spatial-basis-function models and implement spatial-statistical prediction. BibTeX: @article{Cressie2021, author = {Noel Cressie and Matthew Sainsbury-Dale and Andrew Zammit-Mangion}, title = {Basis-Function Models in Spatial Statistics}, journal = {Annual Review of Statistics and Its Application}, publisher = {Annual Reviews}, year = {2021}, volume = {9}, number = {1}, doi = {10.1146/annurev-statistics-040120-020733} }  Croci M, Fasi M, Higham NJ, Mary T and Mikaitis M (2021), "Stochastic Rounding: Implementation, Error Analysis, and Applications" [Abstract] [BibTeX] Abstract: Stochastic rounding randomly maps a real number to one of the two nearest values in a finite precision number system. First proposed for use in computer arithmetic in the 1950s, it is attracting renewed interest. If used in floating-point arithmetic in the computation of the inner product of two vectors of length n, it yields an error bounded by √n u with high probability, where u is the unit roundoff, which is not necessarily the case for round to nearest. A particular attraction of stochastic rounding is that, unlike round to nearest, it is immune to the phenomenon of stagnation, whereby a sequence of tiny updates to a relatively large quantity are lost. We survey stochastic rounding, covering its mathematical properties and probabilistic error analysis, its implementation, and its use in applications, including deep learning and the numerical solution of differential equations. 
BibTeX: @article{Croci2021, author = {Croci, Matteo and Fasi, Massimiliano and Higham, Nicholas J. and Mary, Theo and Mikaitis, Mantas}, title = {Stochastic Rounding: Implementation, Error Analysis, and Applications}, year = {2021} }  Cui H, Wang N, Wang Y, Han Q and Xu Y (2021), "An effective SPMV based on block strategy and hybrid compression on GPU", October, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Due to the non-uniformity of the sparse matrix, the calculation of SPMV (sparse matrix vector multiplication) will lead to redundancy in calculation, redundancy in storage, unbalanced load and low GPU utilization. In this study, a new matrix compression method based on CSR and COO is proposed for the above analysis: PBC algorithm. This method considers the load balancing condition in the calculation process of SPMV, and blocks are divided according to the strategy of row main order to ensure the minimum standard deviation between each block, aiming to satisfy the maximum similarity in the number of nonzero elements between each block. This paper preprocesses the original matrix based on block splitting algorithm to meet the conditions of load balancing for each block stored in the form of CSR and COO. Finally, the experimental results show that the time of SPMV preprocessing is within the acceptable range of the algorithm. Compared with the serial code without CSR optimization, the parallel method in this paper has an acceleration ratio of 178x. In addition, compared with the serial code for CSR optimization, the parallel method in this paper has an acceleration ratio of 6x. And a representative matrix compression method is also selected for performing comparative analysis. The experimental results show that the PBC algorithm has a good efficiency improvement compared with the comparison algorithm. 
BibTeX: @article{Cui2021, author = {Huanyu Cui and Nianbin Wang and Yuhua Wang and Qilong Han and Yuezhu Xu}, title = {An effective SPMV based on block strategy and hybrid compression on GPU}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11227-021-04123-6} }  Daas HA, Rees T and Scott J (2021), "Two-level Nyström--Schur preconditioner for sparse symmetric positive definite matrices", January, 2021. [Abstract] [BibTeX] Abstract: Randomized methods are becoming increasingly popular in numerical linear algebra. However, few attempts have been made to use them in developing preconditioners. Our interest lies in solving large-scale sparse symmetric positive definite linear systems of equations where the system matrix is preordered to doubly bordered block diagonal form (for example, using a nested dissection ordering). We investigate the use of randomized methods to construct high quality preconditioners. In particular, we propose a new and efficient approach that employs Nyström's method for computing low rank approximations to develop robust algebraic two-level preconditioners. Construction of the new preconditioners involves iteratively solving a smaller but denser symmetric positive definite Schur complement system with multiple right-hand sides. Numerical experiments on problems coming from a range of application areas demonstrate that this inner system can be solved cheaply using block conjugate gradients and that using a large convergence tolerance to limit the cost does not adversely affect the quality of the resulting Nyström--Schur two-level preconditioner. BibTeX: @article{Daas2021, author = {Hussam Al Daas and Tyrone Rees and Jennifer Scott}, title = {Two-level Nyström--Schur preconditioner for sparse symmetric positive definite matrices}, year = {2021} }  Daas HA, Jolivet P and Scott J (2021), "A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal Equations", July, 2021. 
[Abstract] [BibTeX] Abstract: Solving the normal equations corresponding to large sparse linear least-squares problems is an important and challenging problem. For very large problems, an iterative solver is needed and, in general, a preconditioner is required to achieve good convergence. In recent years, a number of preconditioners have been proposed. These are largely serial and reported results demonstrate that none of the commonly used preconditioners for the normal equations matrix is capable of solving all sparse least-squares problems. Our interest is thus in designing new preconditioners for the normal equations that are efficient, robust, and can be implemented in parallel. Our proposed preconditioners can be constructed efficiently and algebraically without any knowledge of the problem and without any assumption on the least-squares matrix except that it is sparse. We exploit the structure of the symmetric positive definite normal equations matrix and use the concept of algebraic local symmetric positive semi-definite splittings to introduce two-level Schwarz preconditioners for least-squares problems. The condition number of the preconditioned normal equations is shown to be theoretically bounded independently of the number of subdomains in the splitting. This upper bound can be adjusted using a single parameter τ that the user can specify. We discuss how the new preconditioners can be implemented on top of the PETSc library using only 150 lines of Fortran, C, or Python code. Problems arising from practical applications are used to compare the performance of the proposed new preconditioner with that of other preconditioners. 
BibTeX: @article{Daas2021a, author = {Hussam Al Daas and Pierre Jolivet and Jennifer Scott}, title = {A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal Equations}, year = {2021} }  Darbon J and Langlois GP (2021), "Efficient and robust high-dimensional sparse logistic regression via nonlinear primal-dual hybrid gradient algorithms", November, 2021. [Abstract] [BibTeX] Abstract: Logistic regression is a widely used statistical model to describe the relationship between a binary response variable and predictor variables in data sets. It is often used in machine learning to identify important predictor variables. This task, variable selection, typically amounts to fitting a logistic regression model regularized by a convex combination of ℓ_1 and ℓ_2^2 penalties. Since modern big data sets can contain hundreds of thousands to billions of predictor variables, variable selection methods depend on efficient and robust optimization algorithms to perform well. State-of-the-art algorithms for variable selection, however, were not traditionally designed to handle big data sets; they either scale poorly in size or are prone to produce unreliable numerical results. It therefore remains challenging to perform variable selection on big data sets without access to adequate and costly computational resources. In this paper, we propose a nonlinear primal-dual algorithm that addresses these shortcomings. Specifically, we propose an iterative algorithm that provably computes a solution to a logistic regression problem regularized by an elastic net penalty in O(T(m,n) log(1/𝜖)) operations, where 𝜖 ∊ (0,1) denotes the tolerance and T(m,n) denotes the number of arithmetic operations required to perform matrix-vector multiplication on a data set with m samples each comprising n features. 
This result improves on the known complexity bound of O(min(m^2n, mn^2) log(1/𝜖)) for first-order optimization methods such as the classic primal-dual hybrid gradient or forward-backward splitting methods. BibTeX: @article{Darbon2021, author = {Jérôme Darbon and Gabriel P. Langlois}, title = {Efficient and robust high-dimensional sparse logistic regression via nonlinear primal-dual hybrid gradient algorithms}, year = {2021} }  Datta A (2021), "Sparse Cholesky matrices in spatial statistics", February, 2021. [Abstract] [BibTeX] Abstract: Gaussian Processes (GP) is a staple in the toolkit of a spatial statistician. Well-documented computing roadblocks in the analysis of large geospatial datasets using Gaussian Processes have now been successfully mitigated via several recent statistical innovations. Nearest Neighbor Gaussian Processes (NNGP) has emerged as one of the leading candidates for such massive-scale geospatial analysis owing to their empirical success. This article reviews the connection of NNGP to sparse Cholesky factors of the spatial precision (inverse-covariance) matrices. Focus of the review is on these sparse Cholesky matrices which are versatile and have recently found many diverse applications beyond the primary usage of NNGP for fast parameter estimation and prediction in the spatial (generalized) linear models. In particular, we discuss applications of sparse NNGP Cholesky matrices to address multifaceted computational issues in spatial bootstrapping, simulation of large-scale realizations of Gaussian random fields, and extensions to non-parametric mean function estimation of a Gaussian Process using Random Forests. We also review a sparse-Cholesky-based model for areal (geographically-aggregated) data that addresses interpretability issues of existing areal models. Finally, we highlight some yet-to-be-addressed issues of such sparse Cholesky approximations that warrant further research. 
BibTeX: @article{Datta2021, author = {Abhirup Datta}, title = {Sparse Cholesky matrices in spatial statistics}, year = {2021} }  Datta A (2021), "Nearest-neighbor sparse Cholesky matrices in spatial statistics", WIREs Computational Statistics., December, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Gaussian process (GP) is a staple in the toolkit of a spatial statistician. Well-documented computing roadblocks in the analysis of large geospatial datasets using GPs have now largely been mitigated via several recent statistical innovations. Nearest neighbor Gaussian process (NNGP) has emerged as one of the leading candidates for such massive-scale geospatial analysis owing to their empirical success. This article reviews the connection of NNGP to sparse Cholesky factors of the spatial precision (inverse-covariance) matrix. Focus of the review is on these sparse Cholesky matrices which are versatile and have recently found many diverse applications beyond the primary usage of NNGP for fast parameter estimation and prediction in the spatial (generalized) linear models. In particular, we discuss applications of sparse NNGP Cholesky matrices to address multifaceted computational issues in spatial bootstrapping, simulation of large-scale realizations of Gaussian random fields, and extensions to nonparametric mean function estimation of a GP using random forests. We also review a sparse-Cholesky-based model for areal (geographically aggregated) data that addresses long-established interpretability issues of existing areal models. Finally, we highlight some yet-to-be-addressed issues of such sparse Cholesky approximations that warrant further research. 
BibTeX: @article{Datta2021a, author = {Abhirup Datta}, title = {Nearest-neighbor sparse Cholesky matrices in spatial statistics}, journal = {WIREs Computational Statistics}, publisher = {Wiley}, year = {2021}, doi = {10.1002/wics.1574} }  Dauzickaite I, Lawless A, Scott J and van Leeuwen PJ (2021), "Randomised preconditioning for the forcing formulation of weak constraint 4D-Var", March, 2021. Copernicus GmbH. [Abstract] [BibTeX] [DOI] Abstract: There is growing awareness that errors in the model equations cannot be ignored in data assimilation methods such as four-dimensional variational assimilation (4D-Var). If allowed for, more information can be extracted from observations, longer time windows are possible, and the minimization process is easier, at least in principle. Weak constraint 4D-Var estimates the model error and minimizes a series of linear least-squares cost functions using the conjugate gradient (CG) method; minimising each cost function is called an inner loop. CG needs preconditioning to improve its performance. In previous work, limited memory preconditioners (LMPs) have been constructed using approximations of the eigenvalues and eigenvectors of the Hessian in the previous inner loop. If the Hessian changes significantly in consecutive inner loops, the LMP may be of limited usefulness. To circumvent this, we propose using randomised methods for low rank eigenvalue decomposition and use these approximations to cheaply construct LMPs using information from the current inner loop. Three randomised methods are compared. Numerical experiments in idealized systems show that the resulting LMPs perform better than the existing LMPs. Using these methods may allow more efficient and robust implementations of incremental weak constraint 4D-Var. 
BibTeX: @article{Dauzickaite2021, author = {Ieva Dauzickaite and Amos Lawless and Jennifer Scott and Peter Jan van Leeuwen}, title = {Randomised preconditioning for the forcing formulation of weak constraint 4D-Var}, publisher = {Copernicus GmbH}, year = {2021}, doi = {10.5194/egusphere-egu21-4414} }  Daužickaitė I, Lawless AS, Scott JA and van Leeuwen PJ (2021), "On preconditioning the state formulation of incremental weak constraint 4D-Var", May, 2021. [Abstract] [BibTeX] Abstract: Using a high degree of parallelism is essential to perform data assimilation efficiently. The state formulation of the incremental weak constraint four-dimensional variational data assimilation method allows parallel calculations in the time dimension. In this approach, the solution is approximated by minimising a series of quadratic cost functions using the conjugate gradient method. To use this method in practice, effective preconditioning strategies that maintain the potential for parallel calculations are needed. We examine approximations to the control variable transform (CVT) technique when the latter is beneficial. The new strategy employs a randomised singular value decomposition and retains the potential for parallelism in the time domain. Numerical results for the Lorenz 96 model show that this approach accelerates the minimisation in the first few iterations, with better results when CVT performs well. BibTeX: @article{Dauzickaite2021a, author = {Ieva Daužickaitė and Amos S. Lawless and Jennifer A. Scott and Peter Jan van Leeuwen}, title = {On preconditioning the state formulation of incremental weak constraint 4D-Var}, year = {2021} }  Degro A and Lohner R (2021), "Simple Fault-Tolerant Computing for CFD Codes", In AIAA Scitech 2021 Forum., January, 2021. American Institute of Aeronautics and Astronautics. 
[Abstract] [BibTeX] [DOI] Abstract: Fault-tolerant computing options based on the use of restart information stored on and off node and the use of reserve processes have been developed, implemented and tested in a large-scale, production field solver taken from the domain of computational fluid dynamics. The tests conducted to date have shown good results, with recovery rates approaching 100% under realistic node failure scenarios. Even though the computational overhead of the field solver is very low (explicit time-marching and finite differences), the fault tolerant implementation adds a run-time penalty that is only in the range of 6%-12%, depending on the spatial and temporal approximation used. The procedures developed are generally applicable, and could easily be ported to other codes. BibTeX: @inproceedings{Degro2021, author = {Atis Degro and Rainald Lohner}, title = {Simple Fault-Tolerant Computing for CFD Codes}, booktitle = {AIAA Scitech 2021 Forum}, publisher = {American Institute of Aeronautics and Astronautics}, year = {2021}, doi = {10.2514/6.2021-0142} }  Deng X, Liao Z-J and Cai X-C (2021), "A parallel multilevel domain decomposition method for source identification problems governed by elliptic equations", Journal of Computational and Applied Mathematics., February, 2021, pp. 113441. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: In this paper we develop a parallel multilevel domain decomposition method for large-scale source identification problems governed by elliptic equations. A popular approach is to formulate the inverse problem as a PDE-constrained optimization problem. The stationary point satisfies a Karush–Kuhn–Tucker (KKT) system consisting of the state, adjoint and source equations which is rather difficult to solve on parallel computers. 
We propose and study a parallel method that decomposes the optimization problem on the global domain into subproblems on overlapping subdomains, each subdomain is further decomposed to form an additive Schwarz preconditioner for solving these smaller subproblems simultaneously with a preconditioned Krylov subspace method. For each subproblem, the overlapping part of the solution is discarded and the remaining non-overlapping part of the solution is put together to obtain an approximated global solution to the inverse problem. Since all the subproblems are solved independently, the multilevel domain decomposition method has the advantage of higher degree of parallelism. Numerical experiments show that the algorithm is accurate in terms of the reconstruction error and has reasonably good speedup in terms of the total computing time. The efficiency and robustness of the proposed approach on a parallel computer with more than 1,000 processors are reported. BibTeX: @article{Deng2021, author = {Xiaomao Deng and Zi-Ju Liao and Xiao-Chuan Cai}, title = {A parallel multilevel domain decomposition method for source identification problems governed by elliptic equations}, journal = {Journal of Computational and Applied Mathematics}, publisher = {Elsevier BV}, year = {2021}, pages = {113441}, doi = {10.1016/j.cam.2021.113441} }  Dhandhania S, Deodhar A, Pogorelov K, Biswas S and Langguth J (2021), "Explaining the Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection", In Proceedings of the Workshops of the 2021 International Conference on Parallel Processing. [Abstract] [BibTeX] [DOI] Abstract: The performance of sparse matrix-vector multiplication kernels (SpMV) depends on the sparse matrix storage format and the architecture and the memory hierarchy of the target processor. Many sparse matrix storage formats along with corresponding SpMV algorithms have been proposed for improved SpMV performance. 
Given a sparse matrix and a target architecture, supervised Machine Learning techniques automate selecting the best formats. However, existing supervised approaches suffer from several drawbacks. They depend on large representative datasets and are expensive to train. In addition, retraining to incorporate new classes of matrices or different processor architectures is just as costly since new training data must be generated by benchmarking many instances. Furthermore, it is hard to understand the results of many supervised systems. We propose using semi-supervised machine learning techniques for format selection. We highlight the challenges in using the K-Means clustering for the sparse format selection problem and show how to adapt the algorithm to improve its performance. An empirical evaluation of our technique shows that the performance of our proposed semi-supervised learning approach is competitive with supervised methods, in addition to providing flexibility and explainability. BibTeX: @inproceedings{Dhandhania2021, author = {Sunidhi Dhandhania and Akshay Deodhar and Konstantin Pogorelov and Swarnendu Biswas and Johannes Langguth}, title = {Explaining the Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection}, booktitle = {Proceedings of the Workshops of the 2021 International Conference on Parallel Processing}, year = {2021}, doi = {10.1145/3458744.3474049} }  Ding N, Liu Y, Williams S and Li XS (2021), "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", In Proceedings of the SIAM Conference on Applied and Computational Discrete Algorithms. [Abstract] [BibTeX] Abstract: Sparse triangular solve is used in conjunction with Sparse LU for solving sparse linear systems, either as a direct solver or as a preconditioner. As GPUs have become a first-class compute citizen, designing an efficient and scalable SpTRSV on multi-GPU HPC systems is imperative. 
In this paper, we leverage the advantage of GPU-initiated data transfers of NVSHMEM to implement and evaluate a Multi-GPU SpTRSV. We create a novel producer-consumer paradigm to manage the computation and communication in SpTRSV and implement it using two CUDA streams. Our multi-GPU SpTRSV implementation using CUDA streams achieves a 3.7× speedup when using twelve GPUs (two nodes) relative to our implementation on a single GPU, and up to 6.1× compared to cusparse csrsv2() over the range of one to eighteen GPUs. To further explain the observed performance and explore the key features of matrices to estimate the potential performance benefits when using multi-GPU, we extend the critical path model of SpTRSV to GPUs. We demonstrate the ability of our performance model to understand various aspects of performance and performance bottlenecks on multi-GPU and motivate code optimizations. BibTeX: @inproceedings{Ding2021, author = {Nan Ding and Yang Liu and Samuel Williams and Xiaoye S. Li}, title = {A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver}, booktitle = {Proceedings of the SIAM Conference on Applied and Computational Discrete Algorithms}, year = {2021} }  Ding Y, Li P, Xiao Y and Zhang H (2021), "Efficient Dual ADMMs for Sparse Compressive Sensing MRI Reconstruction", November, 2021. [Abstract] [BibTeX] Abstract: Magnetic Resonance Imaging (MRI) is a kind of medical imaging technology used for diagnostic imaging of diseases, but its image quality may suffer from the long acquisition time. The compressive sensing (CS) based strategy may decrease the reconstruction time greatly, but it needs efficient reconstruction algorithms to produce high-quality and reliable images. This paper focuses on the algorithmic improvement for the sparse reconstruction of CS-MRI, especially considering a non-smooth convex minimization problem which is composed of the sum of a total variation regularization term and a ℓ_1-norm term of the wavelet transformation. 
Part of the motivation for targeting the dual problem is that the dual variables lie in a relatively low-dimensional subspace. Instead of solving the primal model as usual, we turn our attention to its associated dual model composed of three variable blocks and two separable non-smooth function blocks. However, the directly extended alternating direction method of multipliers (ADMM) must be avoided because it may be divergent, although it usually performs well numerically. In order to solve the problem, we employ a symmetric Gauss-Seidel (sGS) technique based ADMM. Compared with the directly extended ADMM, this method only needs one additional iteration, but its convergence can be guaranteed theoretically. Besides, we also propose a generalized variant of ADMM because this method has been illustrated to be efficient for solving semidefinite programming in the past few years. Finally, we do extensive experiments on MRI reconstruction using some simulated and real MRI images under different sampling patterns and ratios. The numerical results demonstrate that the proposed algorithms significantly achieve high reconstruction accuracies with fast computational speed. BibTeX: @article{Ding2021a, author = {Yanyun Ding and Peili Li and Yunhai Xiao and Haibin Zhang}, title = {Efficient Dual ADMMs for Sparse Compressive Sensing MRI Reconstruction}, year = {2021} }  Doikov N and Nesterov Y (2021), "Optimization Methods for Fully Composite Problems", March, 2021. [Abstract] [BibTeX] Abstract: In this paper, we propose a new Fully Composite Formulation of convex optimization problems. It includes, as a particular case, the problems with functional constraints, max-type minimization problems, and problems of Composite Minimization, where the objective can have simple nondifferentiable components. We treat all these formulations in a unified way, highlighting the existence of very natural optimization schemes of different order. 
We prove the global convergence rates for our methods under the most general conditions. Assuming that the upper-level component of our objective function is subhomogeneous, we develop efficient modification of the basic Fully Composite first-order and second-order Methods, and propose their accelerated variants. BibTeX: @article{Doikov2021, author = {Nikita Doikov and Yurii Nesterov}, title = {Optimization Methods for Fully Composite Problems}, year = {2021} }  Doikov N and Nesterov Y (2021), "Gradient Regularization of Newton Method with Bregman Distances", December, 2021. [Abstract] [BibTeX] Abstract: In this paper, we propose a first second-order scheme based on arbitrary non-Euclidean norms, incorporated by Bregman distances. They are introduced directly in the Newton iterate with regularization parameter proportional to the square root of the norm of the current gradient. For the basic scheme, as applied to the composite optimization problem, we establish the global convergence rate of the order O(k^-2) both in terms of the functional residual and in the norm of subgradients. Our main assumption on the smooth part of the objective is Lipschitz continuity of its Hessian. For uniformly convex functions of degree three, we justify global linear rate, and for strongly convex function we prove the local superlinear rate of convergence. Our approach can be seen as a relaxation of the Cubic Regularization of the Newton method, which preserves its convergence properties, while the auxiliary subproblem at each iteration is simpler. We equip our method with adaptive line search procedure for choosing the regularization parameter. We propose also an accelerated scheme with convergence rate O(k^-3), where k is the iteration counter. 
BibTeX: @article{Doikov2021a, author = {Nikita Doikov and Yurii Nesterov}, title = {Gradient Regularization of Newton Method with Bregman Distances}, year = {2021} }  Dong Y and Martinsson P-G (2021), "Simpler is better: A comparative study of randomized algorithms for computing the CUR decomposition", April, 2021. [Abstract] [BibTeX] Abstract: The CUR decomposition is a technique for low-rank approximation that selects small subsets of the columns and rows of a given matrix to use as bases for its column and rowspaces. It has recently attracted much interest, as it has several advantages over traditional low rank decompositions based on orthonormal bases. These include the preservation of properties such as sparsity or non-negativity, the ability to interpret data, and reduced storage requirements. The problem of finding the skeleton sets that minimize the norm of the residual error is known to be NP-hard, but classical pivoting schemes such as column pivoted QR work tend to work well in practice. When combined with randomized dimension reduction techniques, classical pivoting based methods become particularly effective, and have proven capable of very rapidly computing approximate CUR decompositions of large, potentially sparse, matrices. Another class of popular algorithms for computing CUR de-compositions are based on drawing the columns and rows randomly from the full index sets, using specialized probability distributions based on leverage scores. Such sampling based techniques are particularly appealing for very large scale problems, and are well supported by theoretical performance guarantees. This manuscript provides a comparative study of the various randomized algorithms for computing CUR decompositions that have recently been proposed. Additionally, it proposes some modifications and simplifications to the existing algorithms that leads to faster execution times. 
BibTeX: @article{Dong2021, author = {Yijun Dong and Per-Gunnar Martinsson}, title = {Simpler is better: A comparative study of randomized algorithms for computing the CUR decomposition}, year = {2021} }  Driggs D, Tang J, Liang J, Davies M and Schönlieb C-B (2021), "A Stochastic Proximal Alternating Minimization for Nonsmooth and Nonconvex Optimization", SIAM Journal on Imaging Sciences., January, 2021. Vol. 14(4), pp. 1932-1970. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: In this work, we introduce a novel stochastic proximal alternating linearized minimization algorithm [J. Bolte, S. Sabach, and M. Teboulle, Math. Program., 146 (2014), pp. 459--494] for solving a class of nonsmooth and nonconvex optimization problems. Large-scale imaging problems are becoming increasingly prevalent due to the advances in data acquisition and computational capabilities. Motivated by the success of stochastic optimization methods, we propose a stochastic variant of proximal alternating linearized minimization. We provide global convergence guarantees, demonstrating that our proposed method with variance-reduced stochastic gradient estimators, such as SAGA [A. Defazio, F. Bach, and S. Lacoste-Julien, Advances in Neural Information Processing Systems, 2014, pp. 1646--1654] and SARAH [L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáĉ, Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 2017, pp. 2613--2621], achieves state-of-the-art oracle complexities. We also demonstrate the efficacy of our algorithm via several numerical examples including sparse nonnegative matrix factorization, sparse principal component analysis, and blind image-deconvolution. 
BibTeX: @article{Driggs2021, author = {Derek Driggs and Junqi Tang and Jingwei Liang and Mike Davies and Carola-Bibiane Schönlieb}, title = {A Stochastic Proximal Alternating Minimization for Nonsmooth and Nonconvex Optimization}, journal = {SIAM Journal on Imaging Sciences}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, volume = {14}, number = {4}, pages = {1932--1970}, doi = {10.1137/20m1387213} }  Dunton AM (2021), "Matrix Methods for Low-Rank Compression in Large-Scale Applications". Thesis at: University of Colorado at Boulder. [Abstract] [BibTeX] Abstract: Modern scientific applications generate and require more data every year, far outpacing storage capabilities. This growing disparity has inspired work in lossless and lossy data compression, which seek to alleviate the overwhelming surge in big data. Lossless compression approaches provide an exact reconstruction of the original data, with the trade-off of a lower compression factor. Lossy compression approaches, on the other hand, achieve larger compression factors than lossless methods at the cost of error in reconstruction. In the interest of reducing the size of data generated in scientific applications, this thesis proposes low-rank matrix approximation-based lossy compression algorithms for reducing the dimensionality of data matrices. Several pass-efficient, memory lean, and fast low-rank approximation methods are proposed for temporal compression of scientific data. These approaches are shown to compress matrices arising in various scientific applications. These low-rank methods are particularly successful in compressing scientific data matrices when a significant fraction of the variance in the data can be captured on a low-dimensional linear subspace; such structure typically arises in diffusion-dominated problems such as low Reynolds number flow simulations. 
On the other hand, in advection- and convection-dominated problems, low-rank matrix compression methods can perform quite poorly. Recent work in deep learning has demonstrated that a class of neural networks called autoencoders can break through this limitation on linear dimensionality reduction methods. Instead of identifying low-dimensional linear subspaces, autoencoders learn nonlinear manifolds which can approximate a data matrix, in many cases requiring far fewer latent dimensions. Generalizing the linear subspace-based approaches developed in the previous chapters of this thesis, the appendix provides an online algorithm for embedding and reconstructing large-scale data matrices on nonlinear manifolds using autoencoders. BibTeX: @phdthesis{Dunton2021, author = {Dunton, Alec Michael}, title = {Matrix Methods for Low-Rank Compression in Large-Scale Applications}, school = {University of Colorado at Boulder}, year = {2021} }  Dussault J-P and Orban D (2021), "Scalable adaptive cubic regularization methods", March, 2021. [Abstract] [BibTeX] [DOI] Abstract: Adaptive cubic regularization (ARC) methods for unconstrained optimization compute steps from linear systems involving a shifted Hessian in the spirit of the Levenberg-Marquardt and trust-region methods. The standard approach consists in performing an iterative search for the shift akin to solving the secular equation in trust-region methods. Such search requires computing the Cholesky factorization of a tentative shifted Hessian at each iteration, which limits the size of problems that can be reasonably considered. We propose a scalable implementation of ARC named ARCqK in which we solve a set of shifted systems concurrently by way of an appropriate modification of the Lanczos formulation of the conjugate gradient (CG) method. At each iteration of ARCqK to solve a problem with n variables, a range of m << n shift parameters is selected. 
The computational overhead in CG beyond the Lanczos process is thirteen scalar operations to update five vectors of length m and two n-vector updates for each value of the shift. The CG variant only requires one Hessian-vector product and one dot product per iteration, independently of the number of shift parameters. Solves corresponding to inadequate shift parameters are interrupted early. All shifted systems are solved inexactly. Such modest cost makes our implementation scalable and appropriate for large-scale problems. We provide a new analysis of the inexact ARC method including its worst case evaluation complexity, global and asymptotic convergence. We describe our implementation and provide preliminary numerical observations that confirm that for problems of size at least 100, our implementation of ARCqK is more efficient than a classic Steihaug-Toint trust region method. Finally, we generalize our convergence results to inexact Hessians and nonlinear least-squares problems. BibTeX: @article{Dussault2021, author = {Jean-Pierre Dussault and Dominique Orban}, title = {Scalable adaptive cubic regularization methods}, year = {2021}, doi = {10.13140/RG.2.2.18142.15680} }  Eftekhari A, Pasadakis D, Bollhöfer M, Scheidegger S and Schenk O (2021), "Block-Enhanced Precision Matrix Estimation for Large-Scale Datasets", Journal of Computational Science., May, 2021. , pp. 101389. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The l_1-regularized Gaussian maximum likelihood method is a common approach for sparse precision matrix estimation, but one that poses a computational challenge for high-dimensional datasets. We present a novel l_1-regularized maximum likelihood method for performant large-scale sparse precision matrix estimation utilizing the block structures in the underlying computations. 
We identify the computational bottlenecks and contribute a block coordinate descent update as well as a block approximate matrix inversion routine, which is then parallelized using a shared-memory scheme. We demonstrate the effectiveness, accuracy, and performance of these algorithms. Our numerical examples and comparative results with various modern open-source packages reveal that these precision matrix estimation methods can accelerate the computation of covariance matrices by two to three orders of magnitude, while keeping memory requirements modest. Furthermore, we conduct large-scale case studies for applications from finance and medicine with several thousand random variables to demonstrate applicability for real-world datasets. BibTeX: @article{Eftekhari2021, author = {Aryan Eftekhari and Dimosthenis Pasadakis and Matthias Bollhöfer and Simon Scheidegger and Olaf Schenk}, title = {Block-Enhanced Precision Matrix Estimation for Large-Scale Datasets}, journal = {Journal of Computational Science}, publisher = {Elsevier BV}, year = {2021}, pages = {101389}, doi = {10.1016/j.jocs.2021.101389} }  Ellis M, Buluç A and Yelick K (2021), "Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics", In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., February, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: This work examines a data-intensive irregular application from genomics, a long-read to long-read alignment problem, which represents a kind of Generalized N-Body problem, one of the "seven giants" of the NRC Big Data motifs [5]. In this problem, computations (genomic alignments) are performed on sparse and data-dependent pairs of inputs, with variable cost computation and variable datum sizes. 
In particular, there is no inherent locality in the pairwise interactions, unlike simulation-based N-Body problems, and the interaction sparsity depends on particular parameters of the input, which can also affect the quality of the output. We examine two extremes to distributed memory parallelization for this problem, bulk-synchrony and asynchrony, with real workloads. Our bulk-synchronous implementation uses collective communication in MPI, while our asynchronous implementation uses cross-node RPCs in UPC++. We show that the asynchronous version effectively hides communication costs, with a memory footprint that is typically much lower than the bulk-synchronous version. Our application, while simple enough to be a kind of proxy for genomics or data analytics applications more broadly, is also part of a real application pipeline. It shows good scaling on real input problems, and at the same time, reveals some of the programming and architectural challenges for scaling this type of data-intensive irregular application. BibTeX: @inproceedings{Ellis2021, author = {Marquita Ellis and Aydın Buluç and Katherine Yelick}, title = {Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics}, booktitle = {Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2021}, doi = {10.1145/3437801.3441580} }  Ellis M, Buluç A and Yelick K (2021), "Scaling Generalized N-Body Problems, A Case Study from Genomics", In Proceedings of the 50th International Conference on Parallel Processing. [Abstract] [BibTeX] Abstract: This work examines a data-intensive irregular application from genomics that represents a type of Generalized N-Body problem, one of the “seven giants” of the NRC Big Data motifs. In this problem, computations (genome alignments) are performed on sparse data-dependent pairs of inputs, with variable cost computation and variable datum sizes. 
Unlike simulation-based N-Body problems, there is no inherent locality in the pairwise interactions, and the interaction sparsity depends on particular parameters of the input, which can also affect the quality of the output. We build on a pre-existing bulk-synchronous implementation, using collective communication in MPI, and implement a new asynchronous one, using cross-node RPCs in UPC++. We establish the intra-node comparability and efficiency of both, scaling from one to all core(s) on node. Then we evaluate the multinode scalability from 1 node to 512 nodes (32,768 cores) of NERSC’s Cray XC40 with Intel Xeon Phi “Knight’s Landing” nodes. With real workloads, we examine the load balance of the irregular computation and communication, and the costs of many small asynchronous messages versus few large aggregated messages, in both latency and overall application memory footprint. While both implementations demonstrate good scaling, the study reveals some of the programming and architectural challenges for scaling this type of data-intensive irregular application, and contributes code that can be used in genomics pipelines or in benchmarking for data analytics more broadly. BibTeX: @inproceedings{Ellis2021a, author = {Marquita Ellis and Aydın Buluç and Katherine Yelick}, title = {Scaling Generalized N-Body Problems, A Case Study from Genomics}, booktitle = {Proceedings of the 50th International Conference on Parallel Processing}, year = {2021} }  Emmenegger N, Kyng R and Zehmakan AN (2021), "On the Oracle Complexity of Higher-Order Smooth Non-Convex Finite-Sum Optimization", March, 2021. [Abstract] [BibTeX] Abstract: We prove lower bounds for higher-order methods in smooth non-convex finite-sum optimization. 
Our contribution is threefold: We first show that a deterministic algorithm cannot profit from the finite-sum structure of the objective, and that simulating a pth-order regularized method on the whole function by constructing exact gradient information is optimal up to constant factors. We further show lower bounds for randomized algorithms and compare them with the best known upper bounds. To address some gaps between the bounds, we propose a new second-order smoothness assumption that can be seen as an analogue of the first-order mean-squared smoothness assumption. We prove that it is sufficient to ensure state-of-the-art convergence guarantees, while allowing for a sharper lower bound. BibTeX: @article{Emmenegger2021, author = {Nicolas Emmenegger and Rasmus Kyng and Ahad N. Zehmakan}, title = {On the Oracle Complexity of Higher-Order Smooth Non-Convex Finite-Sum Optimization}, year = {2021} }  Engelmann A and Faulwasser T (2021), "Decentralized conjugate gradients with finite-step convergence", February, 2021. [Abstract] [BibTeX] Abstract: The decentralized solution of linear systems of equations arises as a subproblem in optimization over networks. Typical examples include the KKT system corresponding to equality constrained quadratic programs in distributed optimization algorithms or in active set methods. This note presents a tailored structure-exploiting decentralized variant of the conjugate gradient method. We show that the decentralized conjugate gradient method exhibits super-linear convergence in a finite number of steps. Finally, we illustrate the algorithm's performance in comparison to the Alternating Direction Method of Multipliers drawing upon examples from sensor fusion. BibTeX: @article{Engelmann2021, author = {Alexander Engelmann and Timm Faulwasser}, title = {Decentralized conjugate gradients with finite-step convergence}, year = {2021} }  Erciyes K (2021), "Algebraic Graph Algorithms", December, 2021. Springer International Publishing. 
[Abstract] [BibTeX] [URL] Abstract: There has been unprecedented growth in the study of graphs, which are discrete structures that have many real-world applications. The design and analysis of algebraic algorithms to solve graph problems have many advantages, such as implementing results from matrix algebra and using the already available matrix code for sequential and parallel processing. Providing Python programming language code for nearly all algorithms, this accessible textbook focuses on practical algebraic graph algorithms using results from matrix algebra rather than algebraic study of graphs. Given the vast theory behind the algebraic nature of graphs, the book strives for an accessible, middle-ground approach by reviewing main algebraic results that are useful in designing practical graph algorithms on the one hand, yet mostly using graph matrices to solve the graph problems. Python is selected for its simplicity, efficiency and rich library routines; and with the code herein, brevity is forsaken for clarity. BibTeX: @book{Erciyes2021, author = {Erciyes, K.}, title = {Algebraic Graph Algorithms}, publisher = {Springer International Publishing}, year = {2021}, url = {https://www.ebook.de/de/product/41866781/k_erciyes_algebraic_graph_algorithms.html} }  Eshragh A, Pietro OD and Saunders MA (2021), "Toeplitz Least Squares Problems, Fast Algorithms and Big Data", December, 2021. [Abstract] [BibTeX] Abstract: In time series analysis, when fitting an autoregressive model, one must solve a Toeplitz ordinary least squares problem numerous times to find an appropriate model, which can severely affect computational times with large data sets. Two recent algorithms (LSAR and Repeated Halving) have applied randomized numerical linear algebra (RandNLA) techniques to fitting an autoregressive model to big time-series data. We investigate and compare the quality of these two approximation algorithms on large-scale synthetic and real-world data. 
While both algorithms display comparable results for synthetic datasets, the LSAR algorithm appears to be more robust when applied to real-world time series data. We conclude that RandNLA is effective in the context of big-data time series. BibTeX: @article{Eshragh2021, author = {Ali Eshragh and Oliver Di Pietro and Michael A. Saunders}, title = {Toeplitz Least Squares Problems, Fast Algorithms and Big Data}, year = {2021} }  Fan J, Bai J, Li Z, Ortiz-Bobea A and Gomes CP (2021), "A GNN-RNN Approach for Harnessing Geospatial and Temporal Information: Application to Crop Yield Prediction", November, 2021. [Abstract] [BibTeX] Abstract: Climate change is posing new challenges to crop-related concerns including food insecurity, supply stability and economic planning. As one of the central challenges, crop yield prediction has become a pressing task in the machine learning field. Despite its importance, the prediction task is exceptionally complicated since crop yields depend on various factors such as weather, land surface, soil quality as well as their interactions. In recent years, machine learning models have been successfully applied in this domain. However, these models either restrict their tasks to a relatively small region, or only study over a single or few years, which makes them hard to generalize spatially and temporally. In this paper, we introduce a novel graph-based recurrent neural network for crop yield prediction, to incorporate both geographical and temporal knowledge in the model, and further boost predictive power. Our method is trained, validated, and tested on over 2000 counties from 41 states in the US mainland, covering years from 1981 to 2019. As far as we know, this is the first machine learning method that embeds geographical knowledge in crop yield prediction and predicts the crop yields at county level nationwide. 
We also laid a solid foundation for the comparison with other machine learning baselines by applying well-known linear models, tree-based models, deep learning methods and comparing their performance. Experiments show that our proposed method consistently outperforms the existing state-of-the-art methods on various metrics, validating the effectiveness of geospatial and temporal information. BibTeX: @article{Fan2021, author = {Joshua Fan and Junwen Bai and Zhiyun Li and Ariel Ortiz-Bobea and Carla P. Gomes}, title = {A GNN-RNN Approach for Harnessing Geospatial and Temporal Information: Application to Crop Yield Prediction}, year = {2021} }  Fang S, Liu Y-J and Xiong X (2021), "Efficient Sparse Hessian-Based Semismooth Newton Algorithms for Dantzig Selector", SIAM Journal on Scientific Computing., January, 2021. Vol. 43(6), pp. A4147-A4171. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: This paper focuses on efficient algorithms for finding the Dantzig selector which was first proposed by Candès and Tao as an effective variable selection technique in the linear regression. This paper first reformulates the Dantzig selector problem as an equivalent convex composite optimization problem and proposes a semismooth Newton augmented Lagrangian (Ssnal) algorithm to solve the equivalent form. This paper also applies a proximal point dual semismooth Newton (PpdSsn) algorithm to solve another equivalent form of the Dantzig selector problem. Comprehensive results on the global convergence and local asymptotic superlinear convergence of the Ssnal and PpdSsn algorithms are characterized under very mild conditions. The computational costs of a semismooth Newton algorithm for solving the subproblems involved in the Ssnal and PpdSsn algorithms can be cheap by fully exploiting the second order sparsity and employing efficient techniques. 
Numerical experiments on the Dantzig selector problem with synthetic and real data sets demonstrate that the Ssnal and PpdSsn algorithms substantially outperform the state-of-the-art first order algorithms even for the required low accuracy, and the proposed algorithms are able to solve the large-scale problems robustly and efficiently to a relatively high accuracy. BibTeX: @article{Fang2021, author = {Sheng Fang and Yong-Jin Liu and Xianzhu Xiong}, title = {Efficient Sparse Hessian-Based Semismooth Newton Algorithms for Dantzig Selector}, journal = {SIAM Journal on Scientific Computing}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, volume = {43}, number = {6}, pages = {A4147--A4171}, doi = {10.1137/20m1364643} }  Fasi M, Higham NJ, Mikaitis M and Pranesh S (2021), "Numerical behavior of NVIDIA tensor cores", PeerJ Computer Science., February, 2021. , pp. e330. PeerJ. [Abstract] [BibTeX] [DOI] Abstract: We explore the floating-point arithmetic implemented in the NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100, Turing T4, and Ampere A100 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial sums are normalized. These aspects are not documented by NVIDIA, and we gain insight by running carefully designed numerical experiments on these hardware units. 
Knowing the answers to these questions is important if one wishes to: (1) accurately simulate NVIDIA tensor cores on conventional hardware; (2) understand the differences between results produced by code that utilizes tensor cores and code that uses only IEEE 754-compliant arithmetic operations; and (3) build custom hardware whose behavior matches that of NVIDIA tensor cores. As part of this work we provide a test suite that can be easily adapted to test newer versions of the NVIDIA tensor cores as well as similar accelerators from other vendors, as they become available. Moreover, we identify a non-monotonicity issue affecting floating point multi-operand adders if the intermediate results are not normalized after each step. BibTeX: @article{Fasi2021, author = {Massimiliano Fasi and Nicholas J. Higham and Mantas Mikaitis and Srikara Pranesh}, title = {Numerical behavior of NVIDIA tensor cores}, journal = {PeerJ Computer Science}, publisher = {PeerJ}, year = {2021}, pages = {e330}, doi = {10.7717/peerj-cs.330} }  Favaro F, Dufrechou E, Ezzatti P and Oliver JP (2021), "Energy-efficient algebra kernels in FPGA for High Performance Computing", 10, 2021. Vol. 21(2), pp. e09. Universidad Nacional de La Plata. [Abstract] [BibTeX] [DOI] Abstract: The dissemination of multi-core architectures and the later irruption of massively parallel devices, led to a revolution in High-Performance Computing (HPC) platforms in the last decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDL) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. 
In the last years, manufacturers started to make big efforts to provide High-Level Synthesis (HLS) tools, in order to allow a greater adoption of FPGAs in the HPC community. Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication GEMM and the sparse matrix-vector multiplication SpMV. Specifically, we compare the behavior of fine-tuned kernels in a multi-core CPU processor and HLS implementations on FPGAs. We perform the experimental evaluation of our implementations on a low-end and a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library in CPU. BibTeX: @article{Favaro2021, author = {Federico Favaro and Ernesto Dufrechou and Pablo Ezzatti and Juan Pablo Oliver}, title = {Energy-efficient algebra kernels in FPGA for High Performance Computing}, publisher = {Universidad Nacional de La Plata}, year = {2021}, volume = {21}, number = {2}, pages = {e09}, doi = {10.24215/16666038.21.e09} }  Fèvre VL, Herault T, Langou J and Robert Y (2021), "A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication", In Euro-Par 2020: Parallel Processing Workshops. , pp. 303-315. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small-size linear system of equations, or by recomputing corrupted coefficients. We show that both approaches can be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. 
We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes. BibTeX: @incollection{Fevre2021, author = {Valentin Le Fèvre and Thomas Herault and Julien Langou and Yves Robert}, title = {A Comparison of Several Fault-Tolerance Methods for the Detection and Correction of Floating-Point Errors in Matrix-Matrix Multiplication}, booktitle = {Euro-Par 2020: Parallel Processing Workshops}, publisher = {Springer International Publishing}, year = {2021}, pages = {303--315}, doi = {10.1007/978-3-030-71593-9_24} }  Filipovič J, Hozzová J, Nezarat A, Oľha J and Petrovič F (2021), "Using hardware performance counters to speed up autotuning convergence on GPUs", February, 2021. [Abstract] [BibTeX] Abstract: Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly-performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware. In this paper, we introduce a novel method for searching tuning spaces. The method takes advantage of collecting hardware performance counters (also known as profiling counters) during empirical tuning. Those counters are used to navigate the searching process towards faster implementations. The method requires the tuning space to be sampled on any GPU. 
It builds a problem-specific model, which can be used during autotuning on various, even previously unseen inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compare our method to the state of the art and show that our method is superior in terms of the number of searching steps and typically outperforms other searches in terms of convergence time. BibTeX: @article{Filipovic2021, author = {Jiří Filipovič and Jana Hozzová and Amin Nezarat and Jaroslav Oľha and Filip Petrovič}, title = {Using hardware performance counters to speed up autotuning convergence on GPUs}, year = {2021} }  Finkel H and Laguna I (2021), "Report of the Workshop on Program Synthesis for Scientific Computing", February, 2021. [Abstract] [BibTeX] Abstract: Program synthesis is an active research field in academia, national labs, and industry. Yet, work directly applicable to scientific computing, while having some impressive successes, has been limited. This report reviews the relevant areas of program synthesis work for scientific computing, discusses successes to date, and outlines opportunities for future work. This report is the result of the Workshop on Program Synthesis for Scientific Computing, which was held virtually on August 4-5 2020 (https://prog-synth-science.github.io/2020/). BibTeX: @article{Finkel2021, author = {Hal Finkel and Ignacio Laguna}, title = {Report of the Workshop on Program Synthesis for Scientific Computing}, year = {2021} }  Fischer M, Riedel O, Lechler A and Verl A (2021), "Arithmetic Coding for Floating-Point Numbers", In Proceedings of the 2021 IEEE Conference on Dependable and Secure Computing. , pp. 1-8. 
[Abstract] [BibTeX] [DOI] Abstract: To enable the usage of standard hardware in safety-critical applications for production systems, new approaches for hardware fault tolerance are required. These approaches must be implemented on software level. As shown in the literature, arithmetic coding is a promising approach, but only supports integer calculations. For complex safety functions, e.g. in robotics, fast floating-point calculations are needed. Therefore, this paper presents a method for direct arithmetic encoding of floating-point calculations with low-performance impact. Moreover, a detailed residual error estimation is given. BibTeX: @inproceedings{Fischer2021, author = {M. Fischer and O. Riedel and A. Lechler and A. Verl}, title = {Arithmetic Coding for Floating-Point Numbers}, booktitle = {Proceedings of the 2021 IEEE Conference on Dependable and Secure Computing}, year = {2021}, pages = {1--8}, doi = {10.1109/DSC49826.2021.9346236} }  Ford KW (2021), "Coordinate Descent Methods for Sparse Optimal Scoring and Its Applications". Thesis at: The University of Alabama. [Abstract] [BibTeX] Abstract: Linear discriminant analysis (LDA) is a popular tool for performing supervised classification in a high-dimensional setting. It seeks to reduce the dimension by projecting the data to a lower dimensional space using a set of optimal discriminant vectors to separate the classes. One formulation of LDA is optimal scoring which uses a sequence of scores to turn the categorical variables into quantitative variables. In this way, optimal scoring creates a generalized linear regression problem from a classification problem. The sparse optimal scoring formulation of LDA uses an elastic-net penalty on the discriminant vectors to induce sparsity and perform feature selection. We propose coordinate descent algorithms for finding optimal discriminant vectors in the sparse optimal scoring formulation of LDA, along with parallel implementations for large-scale problems. 
We then present numerical results illustrating the efficacy of these algorithms in classifying real and simulated data. Finally, we use Sparse Optimal Scoring to analyze and classify visual comprehension of Deaf persons based on EEG data. BibTeX: @phdthesis{Ford2021, author = {Ford, Katie Wood}, title = {Coordinate Descent Methods for Sparse Optimal Scoring and Its Applications}, school = {The University of Alabama}, year = {2021} }  Frison G, Frey J, Messerer F, Zanelli A and Diehl M (2021), "Introducing the quadratically-constrained quadratic programming framework in HPIPM", December, 2021. [Abstract] [BibTeX] Abstract: This paper introduces the quadratically-constrained quadratic programming (QCQP) framework recently added in HPIPM alongside the original quadratic-programming (QP) framework. The aim of the new framework is unchanged, namely providing the building blocks to efficiently and reliably solve (more general classes of) optimal control problems (OCP). The newly introduced QCQP framework provides full feature parity with the original QP framework: three types of QCQPs (dense, optimal control and tree-structured optimal control QCQPs) and interior point method (IPM) solvers as well as (partial) condensing and other pre-processing routines. Leveraging the modular structure of HPIPM, the new QCQP framework builds on the QP building blocks and similarly provides fast and reliable IPM solvers. BibTeX: @article{Frison2021, author = {Gianluca Frison and Jonathan Frey and Florian Messerer and Andrea Zanelli and Moritz Diehl}, title = {Introducing the quadratically-constrained quadratic programming framework in HPIPM}, year = {2021} }  Fujiwara Y, Kanai S, Ida Y, Kumagai A and Ueda N (2021), "Fast Algorithm for Anchor Graph Hashing", Proc. VLDB Endow.. Vol. 14(6), pp. 916-928. [Abstract] [BibTeX] [URL] Abstract: Anchor graph hashing is used in many applications such as cancer detection, web page classification, and drug discovery. 
It computes the hash codes from the eigenvectors of the matrix representing the similarities between data points and anchor points; anchors refer to the points representing the data distribution. In performing an approximate nearest neighbor search, the hash codes of a query data point are determined by identifying its closest anchor points. Anchor graph hashing, however, incurs high computation cost since (1) the computation cost of obtaining the eigenvectors is quadratic to the number of anchor points, and (2) the similarities of the query data point to all the anchor points must be computed. Our proposal, Tridiagonal hashing, increases the efciency of anchor graph hashing because of its two advances: (1) we apply a graph clustering algorithm to compute the eigenvectors from the tridiagonal matrix obtained from the similarities between data points and anchor points, and (2) we detect anchor points closest to the query data point by using a dimensionality reduction approach. Experiments show that our approach is several orders of magnitude faster than the previous approaches. Besides, it yields high search accuracy than the original anchor graph hashing approach. BibTeX: @article{Fujiwara2021, author = {Yasuhiro Fujiwara and Sekitoshi Kanai and Yasutoshi Ida and Atsutoshi Kumagai and Naonori Ueda}, title = {Fast Algorithm for Anchor Graph Hashing}, journal = {Proc. VLDB Endow.}, year = {2021}, volume = {14}, number = {6}, pages = {916--928}, url = {http://www.vldb.org/pvldb/vol14/p916-fujiwara.pdf} }  Gabert K, Pinar A and Çatalyürek ÜV (2021), "A Unifying Framework to Identify Dense Subgraphs on Streams: Graph Nuclei to Hypergraph Cores", In Proceedings of the 14th ACM International Conference on Web Search and Data Mining., March, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: Finding dense regions of graphs is fundamental in graph mining. We focus on the computation of dense hierarchies and regions with graph nuclei---a generalization of k-cores and trusses. 
Static computation of nuclei, namely through variants of 'peeling', are easy to understand and implement. However, many practically important graphs undergo continuous change. Dynamic algorithms, maintaining nucleus computations on dynamic graph streams, are nuanced and require significant effort to port between nuclei, e.g., from k-cores to trusses. We propose a unifying framework to maintain nuclei in dynamic graph streams. First, we show no dynamic algorithm can asymptotically beat re-computation, highlighting the need to experimentally understand variability. Next, we prove equivalence between k-cores on a special hypergraph and nuclei. Our algorithm splits the problem into maintaining the special hypergraph and maintaining k-cores on it. We implement our algorithm and experimentally demonstrate improvements up to 108 x over re-computation. We show algorithmic improvements on k-cores apply to trusses and outperform truss-specific implementations. BibTeX: @inproceedings{Gabert2021, author = {Kasimir Gabert and Ali Pinar and Ümit V. Çatalyürek}, title = {A Unifying Framework to Identify Dense Subgraphs on Streams: Graph Nuclei to Hypergraph Cores}, booktitle = {Proceedings of the 14th ACM International Conference on Web Search and Data Mining}, publisher = {ACM}, year = {2021}, doi = {10.1145/3437963.3441790} }  Gabert K and Catalyurek UV (2021), "PIGO: A Parallel Graph Input/Output Library", In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops., June, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Graph and sparse matrix systems are highly tuned, able to run complex graph analytics in fractions of seconds on billion-edge graphs. For both developers and researchers, the focus has been on computational kernels and not end-to-end runtime. Despite the significant improvements that modern hardware and operating systems have made towards input and output, these can still become application bottlenecks. 
Unfortunately, on high-performance shared-memory graph systems running billion-scale graphs, reading the graph from file systems easily takes over 2000× longer than running the computational kernel. This slowdown causes both a disconnect for end users and a loss of productivity for researchers and developers. We close the gap by providing a simple to use, small, header-only, and dependency-free C++11 library that brings I/O improvements to graph and matrix systems. Using our library, we improve the end-to-end performance for state-of-the-art systems significantly—in many cases by over 40×. BibTeX: @inproceedings{Gabert2021a, author = {Kasimir Gabert and Umit V. Catalyurek}, title = {PIGO: A Parallel Graph Input/Output Library}, booktitle = {Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdpsw52791.2021.00050} }  Gabert K, Pinar A and Catalyurek UV (2021), "Shared-Memory Scalable k-Core Maintenance on Dynamic Graphs and Hypergraphs", In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops., June, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Computing k-cores on graphs is an important graph mining target as it provides an efficient means of identifying a graph’s dense and cohesive regions. Computing k-cores on hypergraphs has seen recent interest, as many datasets naturally produce hypergraphs. Maintaining k-cores as the underlying data changes is important as graphs are large, growing, and continuously modified. In many practical applications, the graph updates are bursty, both with periods of significant activity and periods of relative calm. 
Existing maintenance algorithms fail to handle large bursts, and prior parallel approaches on both graphs and hypergraphs fail to scale as available cores increase. We address these problems by presenting two parallel and scalable fully-dynamic batch algorithms for maintaining k-cores on both graphs and hypergraphs. Both algorithms take advantage of the connection between k-cores and h-indices. One algorithm is well suited for large batches and the other for small. We provide the first algorithms that experimentally demonstrate scalability as the number of threads increase while sustaining high change rates in graphs and hypergraphs. BibTeX: @inproceedings{Gabert2021b, author = {Kasimir Gabert and Ali Pinar and Umit V. Catalyurek}, title = {Shared-Memory Scalable k-Core Maintenance on Dynamic Graphs and Hypergraphs}, booktitle = {Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdpsw52791.2021.00158} }  Gabow HN and Sankowski P (2021), "Algorithms for Weighted Matching Generalizations I: Bipartite Graphs, b-matching, and Unweighted f-factors", SIAM Journal on Computing., January, 2021. Vol. 50(2), pp. 440-486. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: Let G=(V,E) be a weighted graph or multigraph, with f or b a function assigning a nonnegative integer to each vertex. An f-factor is a subgraph whose degree function is f; a perfect b-matching is a b-factor in the graph formed from G by adding an unlimited number of copies of each edge. This two-part paper culminates in an efficient algebraic algorithm to find a maximum f-factor, i.e., f-factor with maximum weight. Along the way it presents simpler special cases of interest. Part II presents the maximum f-factor algorithm and the special case of shortest paths in conservative undirected graphs (negative edges allowed). 
Part I presents these results: An algebraic algorithm for maximum b-matching, i.e., maximum weight b-matching. It is almost identical to its special case b≡ 1, ordinary weighted matching. The time is O(Wb(V)^ω) for W the maximum magnitude of an edge weight, b(V)=∑_{v∈V} b(v), and ω<2.373 the exponent of matrix multiplication. An algebraic algorithm to find an f-factor. The time is O(f(V)^ω) for f(V)=∑_{v∈V} f(v). The specialization of the f-factor algorithm to bipartite graphs and its extension to maximum/minimum bipartite f-factors. This improves the known complexity bounds for vertex capacitated max-flow and min-cost max-flow on a subclass of graphs. Each algorithm is randomized and has two versions achieving the above time bound: For worst-case time the algorithm is correct with high probability. For expected time the algorithm is Las Vegas. BibTeX: @article{Gabow2021, author = {Harold N. Gabow and Piotr Sankowski}, title = {Algorithms for Weighted Matching Generalizations I: Bipartite Graphs, b-matching, and Unweighted f-factors}, journal = {SIAM Journal on Computing}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, volume = {50}, number = {2}, pages = {440--486}, doi = {10.1137/16m1106195} }  Gao Y, Yu X and Zhang H (2021), "Graph clustering using triangle-aware measures in large networks", November, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Graph clustering (also referred to as community detection) is an important topic in network analysis. Although a large amount of literature has been published on the problem, most of them are designed at the level of lower-order structure of networks, e.g., individual vertices and edges, and fail to capture higher-order information of networks. Recently, higher-order units (under the name of motifs) are introduced to graph clustering. 
These methods typically focus on constructing a motif-based hypergraph where higher-order information is preserved, and communities abstracted from the hypergraph usually achieve better accuracy. However, the hypergraph is often fragmented for a sparse network and contains a large number of isolated vertices that will be outliers of the identified community cover. To address the fragmentation problem, we propose an asymmetric triangle enhancement approach for graph clustering, in which a mixture of edges and asymmetric triangles is taken into consideration for cluster measures. We also design an approximation model to speed up the algorithm by estimating the measures. Extensive experiments on real and synthetic networks demonstrate the accuracy and efficiency of the proposed method. BibTeX: @article{Gao2021, author = {Yang Gao and Xiangzhan Yu and Hongli Zhang}, title = {Graph clustering using triangle-aware measures in large networks}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.ins.2021.11.008} }  Gao Y, Li R, Zhou C and Jiang S (2022), "Exploring spatio-temporal correlation and complexity of safety monitoring data by complex networks", Automation in Construction., March, 2022. Vol. 135, pp. 104115. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: With the development of monitoring technologies, collected data has become more massive, precise, and timely, which can be an excellent foundation for more comprehensive assessment. However, existing data-analyzing researches still concentrate mainly on spatial or temporal characteristics separately, neglecting the dependence inside. Actually, the spatio-temporal correlation has been widely studied in other areas. Based on that, this paper proposed an undirected and unweighted data-based complex network as a risk assessment tool to explore the spatio-temporal correlation in safety monitoring data. 
Eigenvectors containing both spatial and temporal characteristic values are the nodes and the degree of correlation determines whether edges exist between each pair of nodes. The good application of the model in a metro construction project verifies the existence and significance of the correlation. This work not only reveals the spatio-temporal correlation of construction characteristics but also provides a new perspective in safety assessment. BibTeX: @article{Gao2022, author = {Yuyue Gao and Rao Li and Cheng Zhou and Shuangnan Jiang}, title = {Exploring spatio-temporal correlation and complexity of safety monitoring data by complex networks}, journal = {Automation in Construction}, publisher = {Elsevier BV}, year = {2022}, volume = {135}, pages = {104115}, doi = {10.1016/j.autcon.2021.104115} }  Gasparini L, Rodrigues JR, Augusto DA, Carvalho LM, Conopoima C, Goldfeld P, Panetta J, Ramirez JP, de Souza M, Figueiredo MO and Leite VM (2021), "Hybrid Parallel Iterative Sparse Linear Solver Framework for Reservoir Geomechanical and Flow Simulation", Journal of Computational Science., February, 2021. , pp. 101330. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: We discuss new developments of a hybrid parallel iterative sparse linear solver framework focused on petroleum reservoir flow and geomechanical simulation. It runs efficiently on several platforms, from desktop workstations to clusters of multicore nodes, with or without multiple GPUs, using a two-tier hierarchical architecture for distributed matrices and vectors. Results show good parallel scalability. Comparisons with a well-established library and a proprietary commercial solver indicate that our solver is competitive with the best available tools. We present results of the solver's application to simulations of real and synthetic reservoir models of up to billions of unknowns, running on CPUs and GPUs on up to 2,000 processes. BibTeX: @article{Gasparini2021, author = {Leonardo Gasparini and José R.P. 
Rodrigues and Douglas A. Augusto and Luiz M. Carvalho and Cesar Conopoima and Paulo Goldfeld and Jairo Panetta and João P. Ramirez and Michael de Souza and Mateus O. Figueiredo and Victor M.D.M. Leite}, title = {Hybrid Parallel Iterative Sparse Linear Solver Framework for Reservoir Geomechanical and Flow Simulation}, journal = {Journal of Computational Science}, publisher = {Elsevier BV}, year = {2021}, pages = {101330}, doi = {10.1016/j.jocs.2021.101330} }  Gatti A, Hu Z, Ghysels P, Ng EG and Smidt T (2021), "Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning", April, 2021. [Abstract] [BibTeX] Abstract: We present a novel method for graph partitioning, based on reinforcement learning and graph convolutional neural networks. The new reinforcement learning based approach is used to refine a given partitioning obtained on a coarser representation of the graph, and the algorithm is applied recursively. The neural network is implemented using graph attention layers, and trained using an advantage actor critic (A2C) agent. We present two variants, one for finding an edge separator that minimizes the normalized cut or quotient cut, and one that finds a small vertex separator. The vertex separators are then used to construct a nested dissection ordering for permuting a sparse matrix so that its triangular factorization will incur less fill-in. The partitioning quality is compared with partitions obtained using METIS and Scotch, and the nested dissection ordering is evaluated in the sparse solver SuperLU. Our results show that the proposed method achieves similar partitioning quality than METIS and Scotch. Furthermore, the method generalizes from one class of graphs to another, and works well on a variety of graphs from the SuiteSparse sparse matrix collection. BibTeX: @article{Gatti2021, author = {Alice Gatti and Zhixiong Hu and Pieter Ghysels and Esmond G. 
Ng and Tess Smidt}, title = {Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning}, year = {2021} }  Gazzola S, Nagy JG and Landman MS (2021), "Iteratively Reweighted FGMRES and FLSQR for Sparse Reconstruction", SIAM Journal on Scientific Computing., February, 2021. , pp. S47-S69. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: This paper presents two new algorithms to compute sparse solutions of large-scale linear discrete ill-posed problems. The proposed approach consists in constructing a sequence of quadratic problems approximating an ℓ_2-ℓ_1 regularization scheme (with additional smoothing to ensure differentiability at the origin) and partially solving each problem in the sequence using flexible Krylov--Tikhonov methods. These algorithms are built upon a new solid theoretical justification that guarantees that the sequence of approximate solutions to each problem in the sequence converges to the solution of the considered modified version of the ℓ_2-ℓ_1 problem. Compared to other traditional methods, the new algorithms have the advantage of building a single (flexible) approximation (Krylov) subspace that encodes regularization through variable “preconditioning” and that is expanded as soon as a new problem in the sequence is defined. Links between the new solvers and other well-established solvers based on augmenting Krylov subspaces are also established. The performance of these algorithms is shown through a variety of numerical examples modeling image deblurring and computed tomography. BibTeX: @article{Gazzola2021, author = {Silvia Gazzola and James G. 
Nagy and Malena Sabaté Landman}, title = {Iteratively Reweighted FGMRES and FLSQR for Sparse Reconstruction}, journal = {SIAM Journal on Scientific Computing}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, pages = {S47--S69}, doi = {10.1137/20m1333948} }  Ghannad A, Orban D and Saunders MA (2021), "Linear systems arising in interior methods for convex optimization: a symmetric formulation with bounded condition number", Optimization Methods and Software., October, 2021. , pp. 1-26. Informa UK Limited. [Abstract] [BibTeX] [DOI] Abstract: We provide eigenvalues bounds for a new formulation of the step equations in interior methods for convex quadratic optimization. The matrix of our formulation, named K_2.5, has bounded condition number, converges to a well-defined limit under strict complementarity, and has the same size as the traditional, ill-conditioned, saddle-point formulation. We evaluate the performance in the context of a Matlab object-oriented implementation of PDCO, an interior-point solver for minimizing a smooth convex function subject to linear constraints. The main benefit of our implementation, named PDCOO, is to separate the logic of the interior-point method from the formulation of the system used to compute a step at each iteration and the method used to solve the system. Thus, PDCOO allows easy addition of a new system formulation and/or solution method for experimentation. Our numerical experiments indicate that the K_2.5 formulation has the same storage requirements as the traditional ill-conditioned saddle-point formulation, and its condition is often more favourable than the unsymmetric block 3 × 3 formulation. BibTeX: @article{Ghannad2021, author = {Alexandre Ghannad and Dominique Orban and Michael A. 
Saunders}, title = {Linear systems arising in interior methods for convex optimization: a symmetric formulation with bounded condition number}, journal = {Optimization Methods and Software}, publisher = {Informa UK Limited}, year = {2021}, pages = {1--26}, doi = {10.1080/10556788.2021.1965599} }  Ghosh A, Mccann MT and Ravishankar S (2021), "Bilevel learning of l1-regularizers with closed-form gradients(BLORC)", November, 2021. [Abstract] [BibTeX] Abstract: We present a method for supervised learning of sparsity-promoting regularizers, a key ingredient in many modern signal reconstruction problems. The parameters of the regularizer are learned to minimize the mean squared error of reconstruction on a training set of ground truth signal and measurement pairs. Training involves solving a challenging bilevel optimization problem with a nonsmooth lower-level objective. We derive an expression for the gradient of the training loss using the implicit closed-form solution of the lower-level variational problem given by its dual problem, and provide an accompanying gradient descent algorithm (dubbed BLORC) to minimize the loss. Our experiments on simple natural images and for denoising 1D signals show that the proposed method can learn meaningful operators and the analytical gradients calculated are faster than standard automatic differentiation methods. While the approach we present is applied to denoising, we believe that it can be adapted to a wide-variety of inverse problems with linear measurement models, thus giving it applicability in a wide range of scenarios. BibTeX: @article{Ghosh2021, author = {Avrajit Ghosh and Michael T. 
Mccann and Saiprasad Ravishankar}, title = {Bilevel learning of l1-regularizers with closed-form gradients(BLORC)}, year = {2021} }  Gilbert MS, Acer S, Boman EG, Madduri K and Rajamanickam S (2021), "Performance-Portable Graph Coarsening for Efficient Multilevel Graph Analysis", In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)., May, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: The multilevel heuristic is an effective strategy for speeding up graph analytics, and graph coarsening is an integral step of multilevel methods. We perform a comprehensive study of multilevel coarsening in this work. We primarily focus on the graphics processing unit (GPU) parallelization of the Heavy Edge Coarsening (HEC) method executed in an iterative setting. We present optimizations for the two phases of coarsening, a fine-to-coarse vertex mapping phase, and a coarse graph construction phase. We also express several other coarsening algorithms using the Kokkos framework and discuss their parallelization. We demonstrate the efficacy of parallelized HEC on an NVIDIA Turing GPU and a 32-core AMD Ryzen processor using multilevel spectral graph partitioning as the primary case study. BibTeX: @inproceedings{Gilbert2021, author = {Michael S. Gilbert and Seher Acer and Erik G. Boman and Kamesh Madduri and Sivasankaran Rajamanickam}, title = {Performance-Portable Graph Coarsening for Efficient Multilevel Graph Analysis}, booktitle = {2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdps49936.2021.00030} }  Gnanasekaran A and Darve E (2021), "Hierarchical Orthogonal Factorization: Sparse Least Squares Problems", February, 2021. [Abstract] [BibTeX] Abstract: In this work, we develop a fast hierarchical solver for solving large, sparse least squares problems. We build upon the algorithm, spaQR (sparsified QR), that was developed by the authors to solve large sparse linear systems. 
Our algorithm is built on top of a Nested Dissection based multifrontal QR approach. We use low-rank approximations on the frontal matrices to sparsify the vertex separators at every level in the elimination tree. Using a two-step sparsification scheme, we reduce the number of columns and maintain the ratio of rows to columns in each front without introducing any additional fill-in. With this improvised scheme, we show that the runtime of the algorithm scales as 𝒪(M log N) and uses 𝒪(M) memory to store the factorization. This is achieved at the expense of a small and controllable approximation error. The end result is an approximate factorization of the matrix stored as a sequence of sparse orthogonal and upper-triangular factors and hence easy to apply/solve with a vector. Finally, we compare the performance of the spaQR algorithm in solving sparse least squares problems with direct multifrontal QR and CGLS iterative method with a standard diagonal preconditioner. BibTeX: @article{Gnanasekaran2021, author = {Abeynaya Gnanasekaran and Eric Darve}, title = {Hierarchical Orthogonal Factorization: Sparse Least Squares Problems}, year = {2021} }  Göbel F, Grützmacher T, Ribizel T and Anzt H (2021), "Mixed Precision Incomplete and Factorized Sparse Approximate Inverse Preconditioning on GPUs", In Euro-Par 2021: Parallel Processing. , pp. 550-564. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: In this work, we present highly efficient mixed precision GPU-implementations of an Incomplete Sparse Approximate Inverse (ISAI) preconditioner for general non-symmetric matrices and a Factorized Sparse Approximate Inverse (FSPAI) preconditioner for symmetric positive definite matrices. 
While working with full double precision in all arithmetic operations, we demonstrate the benefit of decoupling the memory precision and storing the preconditioner in a more compact low precision floating point format to reduce the memory access volume and therefore preconditioner application time. BibTeX: @incollection{Goebel2021, author = {Fritz Göbel and Thomas Grützmacher and Tobias Ribizel and Hartwig Anzt}, title = {Mixed Precision Incomplete and Factorized Sparse Approximate Inverse Preconditioning on GPUs}, booktitle = {Euro-Par 2021: Parallel Processing}, publisher = {Springer International Publishing}, year = {2021}, pages = {550--564}, doi = {10.1007/978-3-030-85665-6_34} }  Gómez C, Mantovani F, Focht E and Casas M (2021), "Efficiently running SpMV on long vector architectures", In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., February, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. Such difficulties make SpMV to frequently experiment non-optimal results when run on long vector ISAs exploiting SIMD parallelism. In this context, the development of new optimizations becomes fundamental to enable high performance SpMV executions on emerging long vector architectures. In this paper, we improve the state-of-the-art SELL-C-σ sparse matrix format by proposing several new optimizations for SpMV. We target aggressive long vector architectures like the NEC Vector Engine. By combining several optimizations, we obtain an average 12% improvement over SELL-C-σ considering a heterogeneous set of 24 matrices. Our optimizations boost performance in long vector architectures since they expose a high degree of SIMD parallelism. 
BibTeX: @inproceedings{Gomez2021, author = {Constantino Gómez and Filippo Mantovani and Erich Focht and Marc Casas}, title = {Efficiently running SpMV on long vector architectures}, booktitle = {Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2021}, doi = {10.1145/3437801.3441592} }  Gopalakrishnan G, Laguna I, Li A, Panchekha P, Rubio-Gonzalez C and Tatlock Z (2021), "Guarding Numerics Amidst Rising Heterogeneity", In Proceedings of the 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications., November, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: New heterogeneous computing platforms-especially GPUs and other accelerators-are being introduced at a brisk pace, motivated by the goals of exploiting parallelism and reducing data movement. Unfortunately, their sheer variety as well as the optimization options supported by them have been observed to alter the computed numerical results to the extent that reproducible results are no longer possible to obtain without extra effort. Our main contribution in this paper is to document the scope and magnitude of this problem which we classify under the heading of numerics. We propose a taxonomy to classify specific problems to be addressed by the community, a few immediately actionable topics as the next steps, and also forums within which to continue discussions. 
BibTeX: @inproceedings{Gopalakrishnan2021, author = {Ganesh Gopalakrishnan and Ignacio Laguna and Ang Li and Pavel Panchekha and Cindy Rubio-Gonzalez and Zachary Tatlock}, title = {Guarding Numerics Amidst Rising Heterogeneity}, booktitle = {Proceedings of the 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications}, publisher = {IEEE}, year = {2021}, doi = {10.1109/correctness54621.2021.00007} }  Gorissen BL (2021), "Interior point methods can exploit structure of convex piecewise linear functions with application in radiation therapy", December, 2021. [Abstract] [BibTeX] Abstract: Auxiliary variables are often used to model a convex piecewise linear function in the framework of linear optimization. This work shows that such variables yield a block diagonal plus low rank structure in the reduced KKT system of the dual problem. We show how the structure can be detected efficiently, and derive the linear algebra formulas for an interior point method which exploit such structure. The structure is detected in 36% of the cases in Netlib. Numerical results on the inverse planning problem in radiation therapy show an order of magnitude speed-up compared to the state-of-the-art interior point solver CPLEX, and considerable improvements in dose distribution compared to current algorithms. BibTeX: @article{Gorissen2021, author = {Bram L. Gorissen}, title = {Interior point methods can exploit structure of convex piecewise linear functions with application in radiation therapy}, year = {2021} }  Goto H, Endo K, Suzuki M, Sakai Y, Kanao T, Hamakawa Y, Hidaka R, Yamasaki M and Tatsumura K (2021), "High-performance combinatorial optimization based on classical mechanics", Science Advances., February, 2021. Vol. 7(6), pp. eabe7953. American Association for the Advancement of Science (AAAS). [Abstract] [BibTeX] [DOI] Abstract: Quickly obtaining optimal solutions of combinatorial optimization problems has tremendous value but is extremely difficult. 
Thus, various kinds of machines specially designed for combinatorial optimization have recently been proposed and developed. Toward the realization of higher-performance machines, here, we propose an algorithm based on classical mechanics, which is obtained by modifying a previously proposed algorithm called simulated bifurcation. Our proposed algorithm allows us to achieve not only high speed by parallel computing but also high solution accuracy for problems with up to one million binary variables. Benchmarking shows that our machine based on the algorithm achieves high performance compared to recently developed machines, including a quantum annealer using a superconducting circuit, a coherent Ising machine using a laser, and digital processors based on various algorithms. Thus, high-performance combinatorial optimization is realized by massively parallel implementations of the proposed algorithm based on classical mechanics. BibTeX: @article{Goto2021, author = {Hayato Goto and Kotaro Endo and Masaru Suzuki and Yoshisato Sakai and Taro Kanao and Yohei Hamakawa and Ryo Hidaka and Masaya Yamasaki and Kosuke Tatsumura}, title = {High-performance combinatorial optimization based on classical mechanics}, journal = {Science Advances}, publisher = {American Association for the Advancement of Science (AAAS)}, year = {2021}, volume = {7}, number = {6}, pages = {eabe7953}, doi = {10.1126/sciadv.abe7953} }  Gould NIM and Toint PL (2021), "An adaptive regularization algorithm for unconstrained optimization with inexact function and derivatives values", November, 2021. [Abstract] [BibTeX] Abstract: An adaptive regularization algorithm for unconstrained nonconvex optimization is proposed that is capable of handling inexact objective-function and derivative values, and also of providing approximate minimizer of arbitrary order. 
In comparison with a similar algorithm proposed in Cartis, Gould, Toint (2021), its distinguishing feature is that it is based on controlling the relative error between the model and objective values. A sharp evaluation complexity bound is derived for the new algorithm. BibTeX: @article{Gould2021, author = {N. I. M. Gould and Ph. L. Toint}, title = {An adaptive regularization algorithm for unconstrained optimization with inexact function and derivatives values}, year = {2021} }  Gowda S, Ma Y, Cheli A, Gwozdz M, Shah VB, Edelman A and Rackauckas C (2021), "High-performance symbolic-numerics via multiple dispatch", May, 2021. [Abstract] [BibTeX] Abstract: As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge of code optimization. Naturally, users need different term types either to have different algebraic properties for them, or to use efficient data structures. To this end, we developed Symbolics.jl, an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs. In this work we detail an underlying abstract term interface which allows for speed without sacrificing generality. We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system without changing the pre-existing term rewriters. We showcase how this can be used to optimize term construction and give a 113x acceleration on general symbolic transformations. Further, we show that such a generic API allows for complementary term-rewriting implementations. We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers. 
We showcase an e-graph ruleset which minimizes the number of CPU cycles during expression evaluation, and demonstrate how it simplifies a real-world reaction-network simulation to halve the runtime. Additionally, we show a reaction-diffusion partial differential equation solver which is able to be automatically converted into symbolic expressions via multiple dispatch tracing, which is subsequently accelerated and parallelized to give a 157x simulation speedup. Together, this presents Symbolics.jl as a next-generation symbolic-numeric computing environment geared towards modeling and simulation. BibTeX: @article{Gowda2021, author = {Shashi Gowda and Yingbo Ma and Alessandro Cheli and Maja Gwozdz and Viral B. Shah and Alan Edelman and Christopher Rackauckas}, title = {High-performance symbolic-numerics via multiple dispatch}, year = {2021} }  Goyens F, Cartis C and Eftekhari A (2021), "Nonlinear matrix recovery using optimization on the Grassmann manifold", September, 2021. [Abstract] [BibTeX] Abstract: We investigate the problem of recovering a partially observed high-rank matrix whose columns obey a nonlinear structure such as a union of subspaces, an algebraic variety or grouped in clusters. The recovery problem is formulated as the rank minimization of a nonlinear feature map applied to the original matrix, which is then further approximated by a constrained non-convex optimization problem involving the Grassmann manifold. We propose two sets of algorithms, one arising from Riemannian optimization and the other as an alternating minimization scheme, both of which include first- and second-order variants. Both sets of algorithms have theoretical guarantees. In particular, for the alternating minimization, we establish global convergence and worst-case complexity bounds. Additionally, using the Kurdyka-Lojasiewicz property, we show that the alternating minimization converges to a unique limit point. 
We provide extensive numerical results for the recovery of union of subspaces and clustering under entry sampling and dense Gaussian sampling. Our methods are competitive with existing approaches and, in particular, high accuracy is achieved in the recovery using Riemannian second-order methods. BibTeX: @article{Goyens2021, author = {Florentin Goyens and Coralia Cartis and Armin Eftekhari}, title = {Nonlinear matrix recovery using optimization on the Grassmann manifold}, year = {2021} }  Grützmacher T, Anzt H and Quintana-Ortí ES (2021), "Using Ginkgo's memory accessor for improving the accuracy of memory-bound low precision BLAS", October, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: The roofline model not only provides a powerful tool to relate an application’s performance with the specific constraints imposed by the target hardware but also offers a graphic representation of the balance between memory access cost and compute throughput. In this work, we present a strategy to break up the tight coupling between the precision format used for arithmetic operations and the storage format employed for memory operations. (At a high level, this idea is equivalent to compressing/decompressing the data in registers before/after invoking store/load memory operations.) In practice, we demonstrate that a “memory accessor” that hides the data compression behind the memory access, can virtually push the bandwidth-induced roofline, yielding higher performance for memory-bound applications using high precision arithmetic that can handle the numerical effects associated with lossy compression. We also demonstrate that memory-bound applications operating on low precision data can increase the accuracy by relying on the memory accessor to perform all arithmetic operations in high precision. 
In particular, we demonstrate that memory-bound BLAS operations (including the sparse matrix-vector product) can be re-engineered with the memory accessor and that the resulting accessor-enabled BLAS routines achieve lower rounding errors while delivering the same performance as the fast low precision BLAS. BibTeX: @article{Gruetzmacher2021, author = {Thomas Grützmacher and Hartwig Anzt and Enrique S. Quintana-Ortí}, title = {Using Ginkgo's memory accessor for improving the accuracy of memory-bound low precision BLAS}, publisher = {Wiley}, year = {2021}, doi = {10.1002/spe.3041} }  Guo J, Liang H, Ai S, Lu C, Hua H and Cao J (2021), "Improved approximate minimum degree ordering method and its application for electrical power network analysis and computation", Tsinghua Science and Technology., August, 2021. Vol. 26(4), pp. 464-474. Tsinghua University Press. [Abstract] [BibTeX] [DOI] Abstract: Electrical power network analysis and computation play an important role in the planning and operation of the power grid, and they are modeled mathematically as differential equations and network algebraic equations. The direct method based on Gaussian elimination theory can obtain analytical results. Two factors affect computing efficiency: the number of nonzero element fillings and the length of elimination tree. This article constructs mapping correspondence between eliminated tree nodes and quotient graph nodes through graph and quotient graph theories. The Approximate Minimum Degree (AMD) of quotient graph nodes and the length of the elimination tree nodes are composed to build an Approximate Minimum Degree and Minimum Length (AMDML) model. The quotient graph node with the minimum degree, which is also the minimum length of elimination tree node, is selected as the next ordering vector. 
Compared with AMD ordering method and other common methods, the proposed method further reduces the length of elimination tree without increasing the number of nonzero fillings; the length was decreased by about 10% compared with the AMD method. A testbed for experiment was built. The efficiency of the proposed method was evaluated based on different sizes of coefficient matrices of power flow cases. BibTeX: @article{Guo2021, author = {Jian Guo and Hong Liang and Songpu Ai and Chao Lu and Haochen Hua and Junwei Cao}, title = {Improved approximate minimum degree ordering method and its application for electrical power network analysis and computation}, journal = {Tsinghua Science and Technology}, publisher = {Tsinghua University Press}, year = {2021}, volume = {26}, number = {4}, pages = {464--474}, doi = {10.26599/tst.2020.9010019} }  Guo Z, Min A, Yang B, Chen J, Li H and Gao J (2021), "A Sparse Oblique-Manifold Nonnegative Matrix Factorization for Hyperspectral Unmixing", IEEE Transactions on Geoscience and Remote Sensing. , pp. 1-13. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Hyperspectral unmixing (HU) has been one of the most significant tasks in hyperspectral image (HSI) processing. In recent years, nonnegative matrix factorization (NMF) has received great attention in the HU due to its simultaneous estimation, flexible modeling, and little requirement on prior information. However, several common NMF algorithms still suffer from high computational complexity, instability, and low convergence rate. Motivated by the matrix manifold theory, this article proposes a new sparse oblique-manifold (OB) NMF method from the perspective of matrix manifold. The critical idea of the proposed method is to regard the abundance matrix as locating on the oblique manifold, which eliminates its constraint of nonnegativity and sum-to-one and incorporates its intrinsic Riemannian geometry. 
Meanwhile, the L_1/2-norm on the Euclidean space can be transformed equivalently into the L_1-norm on oblique manifold. Then, via solving this sparse OBNMF by the Riemannian conjugated gradient (RCG) algorithm and the multiplicative iterative rule, the proposed method not only ensures improvement in the solution accuracy but also leads to a much faster convergence rate. Experimental results from the synthetic and real-world datasets illustrate the effectiveness and efficiency of the proposed method compared with the state-of-the-art NMF methods in HU. BibTeX: @article{Guo2021a, author = {Ziyang Guo and Anyou Min and Bing Yang and Junhong Chen and Hong Li and Junbin Gao}, title = {A Sparse Oblique-Manifold Nonnegative Matrix Factorization for Hyperspectral Unmixing}, journal = {IEEE Transactions on Geoscience and Remote Sensing}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--13}, doi = {10.1109/tgrs.2021.3082255} }  Guthe S and Thuerck D (2021), "Algorithm 1015", ACM Transactions on Mathematical Software., April, 2021. Vol. 47(2), pp. 1-27. Association for Computing Machinery (ACM). [Abstract] [BibTeX] [DOI] Abstract: We present a new algorithm for solving the dense linear (sum) assignment problem and an efficient, parallel implementation that is based on the successive shortest path algorithm. More specifically, we introduce the well-known epsilon scaling approach used in the Auction algorithm to approximate the dual variables of the successive shortest path algorithm prior to solving the assignment problem to limit the complexity of the path search. This improves the runtime by several orders of magnitude for hard-to-solve real-world problems, making the runtime virtually independent of how hard the assignment is to find. In addition, our approach allows for using accelerators and/or external compute resources to calculate individual rows of the cost matrix. 
This enables us to solve problems that are larger than what has been reported in the past, including the ability to efficiently solve problems whose cost matrix exceeds the available systems memory. To our knowledge, this is the first implementation that is able to solve problems with more than one trillion arcs in less than 100 hours on a single machine. BibTeX: @article{Guthe2021, author = {Stefan Guthe and Daniel Thuerck}, title = {Algorithm 1015}, journal = {ACM Transactions on Mathematical Software}, publisher = {Association for Computing Machinery (ACM)}, year = {2021}, volume = {47}, number = {2}, pages = {1--27}, doi = {10.1145/3442348} }  Halsted T, Shorinwa O, Yu J and Schwager M (2021), "A Survey of Distributed Optimization Methods for Multi-Robot Systems", March, 2021. [Abstract] [BibTeX] Abstract: Distributed optimization consists of multiple computation nodes working together to minimize a common objective function through local computation iterations and network-constrained communication steps. In the context of robotics, distributed optimization algorithms can enable multi-robot systems to accomplish tasks in the absence of centralized coordination. We present a general framework for applying distributed optimization as a module in a robotics pipeline. We survey several classes of distributed optimization algorithms and assess their practical suitability for multi-robot applications. We further compare the performance of different classes of algorithms in simulations for three prototypical multi-robot problem scenarios. The Consensus Alternating Direction Method of Multipliers (C-ADMM) emerges as a particularly attractive and versatile distributed optimization method for multi-robot systems. 
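As a rough illustration of the consensus idea behind ADMM-based distributed optimization, here is a minimal sketch. It uses the textbook global-variable consensus form with a central averaging step, which is an assumption made for brevity; the decentralized C-ADMM discussed in the survey exchanges information between neighbouring robots only. The quadratic local costs, the function name `consensus_admm`, and the sample readings are all hypothetical.

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=100):
    """Global-variable consensus ADMM for min_x sum_i 0.5 * (x - a_i)^2.

    Node i only knows its own a_i and keeps a local copy x_i of the decision
    variable; the consensus variable z ties the copies together.
    (Illustrative sketch, not the survey's decentralized C-ADMM.)
    """
    a = np.asarray(a, dtype=float)
    x = np.zeros_like(a)  # local primal copies, one per node
    y = np.zeros_like(a)  # dual variables, one per node
    z = 0.0               # shared consensus value
    for _ in range(iters):
        # Local step: each node minimises its own augmented Lagrangian
        # 0.5*(x - a_i)^2 + y_i*(x - z) + (rho/2)*(x - z)^2 in closed form.
        x = (a - y + rho * z) / (1.0 + rho)
        # Communication step: average the local information.
        z = float(np.mean(x + y / rho))
        # Dual step: penalise remaining disagreement with the consensus.
        y = y + rho * (x - z)
    return x, z

# Hypothetical local measurements held by four nodes.
readings = [1.0, 4.0, 7.0, 10.0]
x, z = consensus_admm(readings)
print("consensus value:", z)
print("local copies:", x)
```

For these quadratic costs the minimiser is the average of the readings, so the consensus value and every local copy should approach it as the iterations proceed.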
BibTeX: @article{Halsted2021, author = {Trevor Halsted and Ola Shorinwa and Javier Yu and Mac Schwager}, title = {A Survey of Distributed Optimization Methods for Multi-Robot Systems}, year = {2021} }  Hamdi-Larbi O, Mehrez I and Dufaud T (2021), "Machine Learning to Design an Auto-tuning System for the Best Compressed Format Detection for Parallel Sparse Computations", Parallel Processing Letters., November, 2021. World Scientific Pub Co Pte Ltd. [Abstract] [BibTeX] [DOI] Abstract: Many applications in scientific computing process very large sparse matrices on parallel architectures. The presented work in this paper is a part of a project where our general aim is to develop an auto-tuner system for the selection of the best matrix compression format in the context of high-performance computing. The target smart system can automatically select the best compression format for a given sparse matrix, a numerical method processing this matrix, a parallel programming model and a target architecture. Hence, this paper describes the design and implementation of the proposed concept. We consider a case study consisting of a numerical method reduced to the sparse matrix vector product (SpMV), some compression formats, the data parallel as a programming model and, a distributed multi-core platform as a target architecture. This study allows extracting a set of important novel metrics and parameters which are relative to the considered programming model. Our metrics are used as input to a machine-learning algorithm to predict the best matrix compression format. An experimental study targeting a distributed multi-core platform and processing random and real-world matrices shows that our system can improve in average up to 7% the accuracy of the machine learning. 
BibTeX: @article{HamdiLarbi2021, author = {Olfa Hamdi-Larbi and Ichrak Mehrez and Thomas Dufaud}, title = {Machine Learning to Design an Auto-tuning System for the Best Compressed Format Detection for Parallel Sparse Computations}, journal = {Parallel Processing Letters}, publisher = {World Scientific Pub Co Pte Ltd}, year = {2021}, doi = {10.1142/s0129626421500195} }  Han R, Si M, Demmel J and You Y (2021), "Dynamic scaling for low-precision learning", In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., February, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: In recent years, distributed deep learning is becoming popular in industry and academia. Although researchers want to use distributed systems for training, it has been reported that the communication cost for synchronizing gradients can be a bottleneck. Using low-precision gradients is a promising technique for reducing the bandwidth requirement. In this work, we propose Auto Precision Scaling (APS), an algorithm that can improve the accuracy when we communicate gradients by low-precision floating-point values. APS can improve the accuracy for all precisions with a trivial communication cost. Our experimental results show that for both image classification and segmentation, applying APS can train the state-of-the-art models by 8-bit floating-point gradients with no or only a tiny accuracy loss (<0.05%). Furthermore, we can avoid any accuracy loss by designing a hybrid-precision technique. Finally, we propose a performance model to evaluate the proposed method. Our experimental results show that APS can get a significant speedup over the state-of-the-art method. To make it available to researchers and developers, we design and implement a high-performance system for customized precision Deep Learning(CPD), which can simulate the training process using an arbitrary low-precision customized floating-point format. 
We integrate CPD into PyTorch and make it open-source to the public. BibTeX: @inproceedings{Han2021, author = {Ruobing Han and Min Si and James Demmel and Yang You}, title = {Dynamic scaling for low-precision learning}, booktitle = {Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2021}, doi = {10.1145/3437801.3441624} }  Hanauer K, Henzinger M and Schulz C (2021), "Recent Advances in Fully Dynamic Graph Algorithms", February, 2021. [Abstract] [BibTeX] Abstract: In recent years, significant advances have been made in the design and analysis of fully dynamic algorithms. However, these theoretical results have received very little attention from the practical perspective. Few of the algorithms are implemented and tested on real datasets, and their practical potential is far from understood. Here, we survey recent engineering and theory results in the area of fully dynamic graph algorithms. BibTeX: @article{Hanauer2021, author = {Kathrin Hanauer and Monika Henzinger and Christian Schulz}, title = {Recent Advances in Fully Dynamic Graph Algorithms}, year = {2021} }  Hao S (2021), "Computer Vision for Environmental and Social Sustainability Policy". Thesis at: Stanford University. [Abstract] [BibTeX] [URL] Abstract: Computer vision (CV) has succeeded in many benchmark datasets but has yet to be widely applied to large-scale spatial and temporal correlated image datasets to address environmental or social issues. 
This thesis demonstrates the challenge, solutions, and policy implications of using computer vision for satellite, aerial, and street view images at scale with four empirical studies: 1) Mapping Vegetation and Classifying the driver of deforestation in Indonesia; 2) Detecting and characterizing high-emission oil and gas facilities in the U.S.; 3) Estimating the prevalence and placement of surveillance cameras in 16 major cities; 4) Mapping the Trend of Crosswalks Visibility Enhancements in the U.S. Transit-Oriented Development (TOD) areas. We conclude with a systematic review of the opportunities and challenges of using computer vision in facility urban and environment planning research. BibTeX: @phdthesis{Hao2021, author = {Sheng Hao}, title = {Computer Vision for Environmental and Social Sustainability Policy}, school = {Stanford University}, year = {2021}, url = {https://www.proquest.com/openview/abba1d278720b5736e962ccda4e99bda/1?pq-origsite=gscholar&cbl=18750&diss=y} }  He X, Hu R and Fang Y-P (2021), "Convergence rate analysis of fast primal-dual methods with scalings for linearly constrained convex optimization problems", March, 2021. [Abstract] [BibTeX] Abstract: We propose a primal-dual algorithm with scaling, linked to the Nesterov's acceleration scheme, for a linear equality constrained convex optimization problem. We also consider two variants of the algorithm: an inexact proximal primal-dual algorithm and an inexact linearized primal-dual algorithm. We prove that these algorithms enjoy fast convergence properties, even faster than 𝒪(1/k^2) under suitable scaling conditions. Finally, we study an inertial primal-dual dynamic with time scaling for a better understanding of accelerated schemes of the proposed algorithms. 
BibTeX: @article{He2021, author = {Xin He and Rong Hu and Ya-Ping Fang}, title = {Convergence rate analysis of fast primal-dual methods with scalings for linearly constrained convex optimization problems}, year = {2021} }  He X, Hu R and Fang Y-P (2021), "Fast convergence of primal-dual dynamics and algorithms with time scaling for linear equality constrained convex optimization problems", March, 2021. [Abstract] [BibTeX] Abstract: We propose a primal-dual dynamic with time scaling for a linear equality constrained convex optimization problem, which consists of a second-order ODE for the primal variable and a first-order ODE for the dual variable. Without assuming strong convexity, we prove its fast convergence property and show that the obtained fast convergence property is preserved under a small perturbation. We also develop an inexact primal-dual algorithm derived by a time discretization, and derive the fast convergence property matching that of the underlying dynamic. Finally, we give numerical experiments to illustrate the validity of the proposed algorithm. BibTeX: @article{He2021a, author = {Xin He and Rong Hu and Ya-Ping Fang}, title = {Fast convergence of primal-dual dynamics and algorithms with time scaling for linear equality constrained convex optimization problems}, year = {2021} }  He X, Hu R and Fang Y-P (2021), "Inertial primal-dual methods for linear equality constrained convex optimization problems", March, 2021. [Abstract] [BibTeX] Abstract: Inspired by a second-order primal-dual dynamical system [Zeng X, Lei J, Chen J. Dynamical primal-dual accelerated method with applications to network optimization. 2019; arXiv:1912.03690], we propose an inertial primal-dual method for the linear equality constrained convex optimization problem. 
When the objective function has a "nonsmooth + smooth" composite structure, we further propose an inexact inertial primal-dual method by linearizing the smooth individual function and solving the subproblem inexactly. Assuming merely convexity, we prove that the proposed methods enjoy 𝒪(1/k^2) convergence rate on ℒ(x_k,λ^*)-ℒ(x^*,λ^*) and 𝒪(1/k) convergence rate on primal feasibility, where ℒ is the Lagrangian function and (x^*,λ^*) is a saddle point of ℒ. Numerical results are reported to demonstrate the validity of the proposed methods. BibTeX: @article{He2021b, author = {Xin He and Rong Hu and Ya-Ping Fang}, title = {Inertial primal-dual methods for linear equality constrained convex optimization problems}, year = {2021} }  He G, Vialle S and Baboulin M (2021), "Parallel and accurate k-means algorithm on CPU-GPU architectures for spectral clustering", Concurrency and Computation: Practice and Experience., September, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: k-Means is a standard algorithm for clustering data. It constitutes generally the final step in a more complex chain of high-quality spectral clustering. However, this chain suffers from lack of scalability when addressing large datasets. This can be overcome by applying also the k-means algorithm as a preprocessing task to reduce the input data instances. We propose parallel optimization techniques for the k-means algorithm on CPU and GPU. Particularly we use a two-step summation method with package processing to handle the effect of rounding errors that may occur during the phase of updating cluster centroids. Our experiments on synthetic and real-world datasets containing millions of instances exhibit a speedup up to 7 for the k-means iteration time on GPU versus 20/40 CPU threads using AVX units, and achieve double-precision accuracy with single-precision computations. 
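To see why summing in packages can help with rounding errors, here is a minimal sketch of one plausible reading of a two-step summation: accumulate small packages in single precision first, then add up the package totals. This is not the authors' implementation; the function names, the package size of 256, and the constant test data are assumptions chosen for illustration.

```python
import numpy as np

def naive_sum_f32(values):
    """Accumulate left to right, rounding to float32 after every addition."""
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + v)
    return acc

def two_step_sum_f32(values, package_size=256):
    """Sum each package in float32, then sum the package totals in float32."""
    total = np.float32(0.0)
    for start in range(0, len(values), package_size):
        package = values[start:start + package_size]
        total = np.float32(total + naive_sum_f32(package))
    return total

# Hypothetical data: many equal float32 contributions, as when many points
# are accumulated into one centroid.
values = np.full(102_400, 0.1, dtype=np.float32)
reference = float(np.sum(values, dtype=np.float64))  # float64 reference sum

err_naive = abs(float(naive_sum_f32(values)) - reference)
err_two_step = abs(float(two_step_sum_f32(values)) - reference)
print("naive float32 error:   ", err_naive)
print("two-step float32 error:", err_two_step)
```

In the naive loop a large accumulator repeatedly absorbs tiny addends, so the rounding error of each addition is large relative to the addend. The two-step variant keeps the operands of each addition closer in magnitude, which is why its error comes out smaller on this data.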
BibTeX: @article{He2021c, author = {Guanlin He and Stephane Vialle and Marc Baboulin}, title = {Parallel and accurate k-means algorithm on CPU-GPU architectures for spectral clustering}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6621} }  Heirman W, Eyerman S, Bois KD and Hur I (2021), "Automatic Sublining for Efficient Sparse Memory Accesses", ACM Transactions on Architecture and Code Optimization., April, 2021. Vol. 18(3), pp. 1-23. Association for Computing Machinery (ACM). [Abstract] [BibTeX] [DOI] Abstract: Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity makes caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses, is a challenging task, with a large potential impact on performance. Not only is performance reduced by treating sparse accesses as regular accesses, not caching accesses that do have locality also negatively impacts performance by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. 
These sparse accesses are dynamically converted into uncached subline accesses, while keeping regular accesses cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms the performance of no subline accesses, manual sublining, and prior work on detecting sparse accesses. BibTeX: @article{Heirman2021, author = {Wim Heirman and Stijn Eyerman and Kristof Du Bois and Ibrahim Hur}, title = {Automatic Sublining for Efficient Sparse Memory Accesses}, journal = {ACM Transactions on Architecture and Code Optimization}, publisher = {Association for Computing Machinery (ACM)}, year = {2021}, volume = {18}, number = {3}, pages = {1--23}, doi = {10.1145/3452141} }  Helal AE, Laukemann J, Checconi F, Tithi JJ, Ranadive T, Petrini F and Choi J (2021), "ALTO: Adaptive Linearized Storage of Sparse Tensors", February, 2021. [Abstract] [BibTeX] Abstract: The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. 
As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8X over the best mode-agnostic format, while delivering a geometric mean compression ratio of more than 4X relative to the best mode-specific format. BibTeX: @article{Helal2021, author = {Ahmed E. Helal and Jan Laukemann and Fabio Checconi and Jesmin Jahan Tithi and Teresa Ranadive and Fabrizio Petrini and Jeewhan Choi}, title = {ALTO: Adaptive Linearized Storage of Sparse Tensors}, year = {2021} }  Herschlag G, Lee S, Vetter J and Randles A (2021), "Analysis of GPU Data Access Patterns on Complex Geometries for the D3Q19 Lattice Boltzmann Algorithm", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: GPU performance of the lattice Boltzmann method (LBM) depends heavily on memory access patterns. When LBM is advanced with GPUs on complex domains, typically, geometric data is accessed indirectly, and lattice data is accessed lexicographically. Although there are a variety of other access patterns, no study has yet examined the relative efficacy between them. Here, we examine a suite of memory access schemes via empirical testing and performance modeling. We find strong evidence that semi-direct is often better suited than the more common indirect addressing: semi-direct methods provide increased computational speed and may reduce memory consumption. 
For the lattice layout, we find that the Collected Structure of Arrays (CSoA) and bundling layouts outperform the common Structure of Array layout; on V100 and P100 devices, CSoA consistently outperforms bundling, however the relationship is more complicated on K40 devices. When compared to state-of-the-art practices, our recommended addressing modifications lead to performance gains between 10-40% across different domains and reduce memory consumption by as much as 17%. We demonstrate that our results hold across multiple GPUs on a leadership class system, and present the first near-optimal strong results for LBM with arterial geometries run on GPUs. BibTeX: @article{Herschlag2021, author = {Gregory Herschlag and Seyong Lee and Jeffrey Vetter and Amanda Randles}, title = {Analysis of GPU Data Access Patterns on Complex Geometries for the D3Q19 Lattice Boltzmann Algorithm}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tpds.2021.3061895} }  Higham NJ and Lettington MC (2021), "Optimizing and Factorizing the Wilson Matrix", February, 2021. , pp. e330. PeerJ. [Abstract] [BibTeX] Abstract: The Wilson matrix, W, is a 4 × 4 unimodular symmetric positive definite matrix of integers that has been used as a test matrix since the 1940s, owing to its mild ill-conditioning. We ask how close W is to being the most ill-conditioned matrix in its class, with or without the requirement of positive definiteness. By exploiting the matrix adjugate and applying various matrix norm bounds from the literature we derive bounds on the condition numbers for the two cases and we compare them with the optimal condition numbers found by exhaustive search. We also investigate the existence of factorizations W = Z^T Z with Z having integer or rational entries. 
Drawing on recent research that links the existence of these factorizations to number-theoretic considerations of quadratic forms, we show that W has an integer factor Z and two rational factors, up to signed permutations. This little 4 × 4 matrix continues to be a useful example on which to apply existing matrix theory as well as being capable of raising challenging questions that lead to new results. BibTeX: @article{Higham2021, author = {Higham, Nicholas J. and Lettington, Matthew C.}, title = {Optimizing and Factorizing the Wilson Matrix}, publisher = {PeerJ}, year = {2021}, pages = {e330} }  Higham NJ, Lettington MC and Schmidt KM (2021), "Integer matrix factorisations, superalgebras and the quadratic form obstruction", March, 2021. [Abstract] [BibTeX] Abstract: We identify and analyse obstructions to factorisation of integer matrices into products N^T N or N^2 of matrices with rational or integer entries. The obstructions arise as quadratic forms with integer coefficients and raise the question of the discrete range of such forms. They are obtained by considering matrix decompositions over a superalgebra. We further obtain a formula for the determinant of a square matrix in terms of adjugates of these matrix decompositions, as well as identifying a co-Latin symmetry space. BibTeX: @article{Higham2021a, author = {Nicholas J. Higham and Matthew C. Lettington and Karl Michael Schmidt}, title = {Integer matrix factorisations, superalgebras and the quadratic form obstruction}, year = {2021} }  Higham NJ (2021), "Numerical Stability of Algorithms at Extreme Scale and Low Precisions" [Abstract] [BibTeX] Abstract: The largest dense linear systems that are being solved today are of order n = 10^7. Single precision arithmetic, which has a unit roundoff u ≈ 10^-8, is widely used in scientific computing, and half precision arithmetic, with u ≈ 10^-4, is increasingly being exploited as it becomes more readily available in hardware. 
Standard rounding error bounds for numerical linear algebra algorithms are proportional to p(n)u, with p growing at least linearly with n. Therefore we are at the stage where these rounding error bounds are not able to guarantee any accuracy or stability in the computed results for some extreme-scale or low-accuracy computations. We explain how rounding error bounds with much smaller constants can be obtained. Blocked algorithms, which break the data into blocks of size b, lead to a reduction in the error constants by a factor b or more. Two architectural features also reduce the error constants: extended precision registers and fused multiply–add operations, either at the scalar level or in mixed precision block form. We also discuss a new probabilistic approach to rounding error analysis that provides error constants that are the square roots of those of the worst-case bounds. Combining these different considerations provides new understanding of the numerical stability of extreme scale and low precision computations in numerical linear algebra. BibTeX: @article{Higham2021b, author = {Higham, Nicholas J.}, title = {Numerical Stability of Algorithms at Extreme Scale and Low Precisions}, year = {2021} }  Higham NJ and Mikaitis M (2021), "Anymatrix: An Extensible MATLAB Matrix Collection" [Abstract] [BibTeX] [URL] Abstract: Anymatrix is a MATLAB toolbox that provides an extensible collection of matrices with the ability to search the collection by matrix properties. Each matrix is implemented as a MATLAB function and the matrices are arranged in groups. Compared with previous collections, Anymatrix offers three novel features. First, it allows a user to share a collection of matrices by putting them in a group, annotating them with properties, and placing the group on a public repository, for example on GitHub; the group can then be incorporated into another user's local Anymatrix installation. 
Second, it provides a tool to search for matrices by their properties, with Boolean expressions supported. Third, it provides organization into sets, which are subsets of matrices from the whole collection appended with notes, which facilitate reproducible experiments. Anymatrix v1.0 comes with 146 built-in matrices organized into 7 groups with 49 recognized properties. The authors continue to extend the collection and welcome contributions from the community. BibTeX: @article{Higham2021c, author = {Higham, Nicholas J. and Mikaitis, Mantas}, title = {Anymatrix: An Extensible MATLAB Matrix Collection}, year = {2021}, url = {http://eprints.maths.manchester.ac.uk/2835/} }  Higham NJ and Mary T (2021), "Mixed Precision Algorithms in Numerical Linear Algebra" [Abstract] [BibTeX] [URL] Abstract: Today's floating-point arithmetic landscape is broader than ever. While scientific computing has traditionally used single precision and double precision floating-point arithmetics, half precision is increasingly available in hardware and quadruple precision is supported in software. Lower precision arithmetic brings increased speed and reduced communication and energy costs, but it produces results of correspondingly low accuracy. Higher precisions are more expensive but can potentially provide great benefits, even if used sparingly. A variety of mixed precision algorithms have been developed that combine the superior performance of lower precisions with the better accuracy of higher precisions. Some of these algorithms aim to provide results of the same quality as algorithms running in a fixed precision but at a much lower cost; others use a little higher precision to improve the accuracy of an algorithm. This survey treats a broad range of mixed precision algorithms in numerical linear algebra, both direct and iterative, for problems including matrix multiplication, matrix factorization, linear systems, least squares, eigenvalue decomposition, and singular value decomposition. 
We identify key algorithmic ideas, such as iterative refinement, adapting the precision to the data, and exploiting mixed precision block fused multiply–add operations. We also describe the possible performance benefits and explain what is known about the numerical stability of the algorithms. This survey should be useful to a wide community of researchers and practitioners who wish to develop or benefit from mixed precision numerical linear algebra algorithms. BibTeX: @article{Higham2021d, author = {Higham, Nicholas J. and Mary, Theo}, title = {Mixed Precision Algorithms in Numerical Linear Algebra}, year = {2021}, url = {http://eprints.maths.manchester.ac.uk/2841/1/paper_eprint.pdf} }  Ho N-M, De Silva H and Wong W-F (2021), "GRAM: A Framework for Dynamically Mixing Precisions in GPU Applications", ACM Transactions on Architecture and Code Optimization., February, 2021. Vol. 18(2), pp. 1-24. Association for Computing Machinery (ACM). [Abstract] [BibTeX] [DOI] Abstract: This article presents GRAM (GPU-based Runtime Adaption for Mixed-precision), a framework for the effective use of mixed precision arithmetic for CUDA programs. Our method provides a fine-grain tradeoff between output error and performance. It can create many variants that satisfy different accuracy requirements by assigning different groups of threads to different precision levels adaptively at runtime. To widen the range of applications that can benefit from its approximation, GRAM comes with an optional half-precision approximate math library. Using GRAM, we can trade off precision for any performance improvement of up to 540%, depending on the application and accuracy requirement. 
BibTeX: @article{Ho2021, author = {Nhut-Minh Ho and Himeshi De Silva and Weng-Fai Wong}, title = {GRAM: A Framework for Dynamically Mixing Precisions in GPU Applications}, journal = {ACM Transactions on Architecture and Code Optimization}, publisher = {Association for Computing Machinery (ACM)}, year = {2021}, volume = {18}, number = {2}, pages = {1--24}, doi = {10.1145/3441830} }  Hu D, Ubaru S, Gittens A, Clarkson KL, Horesh L and Kalantzis V (2021), "Sparse Graph Based Sketching for Fast Numerical Linear Algebra", June, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: In recent years, a variety of randomized constructions of sketching matrices have been devised, that have been used in fast algorithms for numerical linear algebra problems, such as least squares regression, low-rank approximation, and the approximation of leverage scores. A key property of sketching matrices is that of subspace embedding. In this paper, we study sketching matrices that are obtained from bipartite graphs that are sparse, i.e., have left degree s that is small. In particular, we explore two popular classes of sparse graphs, namely, expander graphs and magical graphs. For a given subspace U ⊆ R^n of dimension k, we show that the magical graph with left degree s = 2 yields a (1 ± ε) ℓ2-subspace embedding for U, if the number of right vertices (the sketch size) m = O(k^2/ε^2). The expander graph with s = O(log k/ε) yields a subspace embedding for m = O(k log k/ε^2). We also discuss the construction of sparse sketching matrices with reduced randomness using expanders based on error-correcting codes. Empirical results on various synthetic and real datasets show that these sparse graph sketching matrices work very well in practice. BibTeX: @inproceedings{Hu2021, author = {Dong Hu and Shashanka Ubaru and Alex Gittens and Kenneth L. 
Clarkson and Lior Horesh and Vassilis Kalantzis}, title = {Sparse Graph Based Sketching for Fast Numerical Linear Algebra}, publisher = {IEEE}, year = {2021}, doi = {10.1109/icassp39728.2021.9414030} }  Huang M (2021), "Escaping Saddle Points for Nonsmooth Weakly Convex Functions via Perturbed Proximal Algorithms", February, 2021. [Abstract] [BibTeX] Abstract: We propose perturbed proximal algorithms that can provably escape strict saddles for nonsmooth weakly convex functions. The main results are based on a novel characterization of 𝜖-approximate local minimum for nonsmooth functions, and recent developments on perturbed gradient methods for escaping saddle points for smooth problems. Specifically, we show that under standard assumptions, the perturbed proximal point, perturbed proximal gradient and perturbed proximal linear algorithms find 𝜖-approximate local minimum for nonsmooth weakly convex functions in O(𝜖^-2 log(d)^4) iterations, where d is the dimension of the problem. BibTeX: @article{Huang2021, author = {Minhui Huang}, title = {Escaping Saddle Points for Nonsmooth Weakly Convex Functions via Perturbed Proximal Algorithms}, year = {2021} }  Huang Q, Kang M, Dinh G, Norell T, Kalaiah A, Demmel J, Wawrzynek J and Shao YS (2021), "CoSA: Scheduling by Constrained Optimization for Spatial Accelerators", May, 2021. [Abstract] [BibTeX] Abstract: Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and flexible interconnect. While DNN accelerators can take advantage of data reuse and achieve high peak throughput, they also expose a large number of runtime parameters to the programmers who need to explicitly manage how computation is scheduled both spatially and temporally. 
In fact, different scheduling choices can lead to wide variations in performance and efficiency, motivating the need for a fast and efficient search strategy to navigate the vast scheduling space. To address this challenge, we present CoSA, a constrained-optimization-based approach for scheduling DNN accelerators. As opposed to existing approaches that either rely on designers' heuristics or iterative methods to navigate the search space, CoSA expresses scheduling decisions as a constrained-optimization problem that can be deterministically solved using mathematical optimization techniques. Specifically, CoSA leverages the regularities in DNN operators and hardware to formulate the DNN scheduling space into a mixed-integer programming (MIP) problem with algorithmic and architectural constraints, which can be solved to automatically generate a highly efficient schedule in one shot. We demonstrate that CoSA-generated schedules significantly outperform state-of-the-art approaches by a geometric mean of up to 2.5x across a wide range of DNN networks while improving the time-to-solution by 90x. BibTeX: @article{Huang2021a, author = {Qijing Huang and Minwoo Kang and Grace Dinh and Thomas Norell and Aravind Kalaiah and James Demmel and John Wawrzynek and Yakun Sophia Shao}, title = {CoSA: Scheduling by Constrained Optimization for Spatial Accelerators}, year = {2021} }  Huang S (2021), "Optimization of Block-Based Tensor Decompositions through Sub-Tensor Impact Graphs and Applications to Dynamicity in Data and User Focus". Thesis at: Arizona State University. [Abstract] [BibTeX] Abstract: Tensors are commonly used for representing multi-dimensional data, such as Web graphs, sensor streams, and social networks. 
As a consequence of the increase in the use of tensors, tensor decomposition operations began to form the basis for many data analysis and knowledge discovery tasks, from clustering, trend detection, anomaly detection to correlation analysis [31, 38]. It is well known that Singular Value matrix Decomposition (SVD) [9] is used to extract latent semantics for matrix data. When SVD is applied to tensors, which have more than two modes, it becomes tensor decomposition. The two most popular tensor decomposition algorithms are the Tucker [54] and the CP [19] decompositions. Intuitively, they both generalize SVD to tensors. However, one key problem with tensor decomposition is its computational complexity, which may cause a system bottleneck. Therefore, two phase block-centric CP tensor decomposition (2PCP) was proposed to partition the tensor into small sub-tensors, execute sub-tensor decomposition in parallel and combine the factors from each sub-tensor into final decomposition factors through an iterative refinement process. Consequently, I proposed Sub-tensor Impact Graph (SIG) to account for inaccuracy propagation among sub-tensors and measure the impact of decomposition of sub-tensors on the others' decomposition. Based on SIG, I proposed several optimization strategies to optimize 2PCP’s phase-2 refinement process. Furthermore, I applied SIG and optimization strategies for data focus, data evolution and focus shifting in tensor analysis. Personalized Tensor Decomposition (PTD) is proposed to account for the user's focus given the observations that in many applications, the user may have a focus of interest, i.e., part of the data for which the user needs high accuracy, and beyond this area of focus, accuracy may not be as critical. 
PTD takes as input one or more areas of focus and performs the decomposition in such a way that, when reconstructed, the accuracy of the tensor is boosted for these areas of focus. A related challenge of data evolution in tensor analytics is incremental tensor decomposition, since re-computation of the whole tensor decomposition with each update will cause high computational costs and incur large memory overheads, especially for applications where data evolves over time and the tensor-based analysis results need to be continuously maintained. To avoid re-decomposition, I propose a two-phase block-incremental CP-based tensor decomposition technique, BICP, that efficiently and effectively maintains tensor decomposition results in the presence of dynamically evolving tensor data. I further extend the research focus on user focus shift. User focus may change over time as data is evolving along the time. Although PTD is efficient, re-computation for each user preference update can be a bottleneck for the system. Therefore I propose dynamic evolving user focus tensor decomposition which can smartly reuse the existing decomposition result to improve the efficiency of evolving user focus block decomposition. BibTeX: @phdthesis{Huang2021b, author = {Huang, Shengyu}, title = {Optimization of Block-Based Tensor Decompositions through Sub-Tensor Impact Graphs and Applications to Dynamicity in Data and User Focus}, school = {Arizona State University}, year = {2021} }  Huang J, Huang S and Sun M (2021), "DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition", In Proceedings of CVPR 2021. [Abstract] [BibTeX] Abstract: We propose a novel approach for large-scale nonlinear least squares problems based on deep learning frameworks. Nonlinear least squares are commonly solved with the Levenberg-Marquardt (LM) algorithm for fast convergence. 
We implement a general and efficient LM solver on a deep learning framework by designing a new backward jacobian network to enable automatic sparse jacobian matrix computation. Furthermore, we introduce a stochastic domain decomposition approach that enables batched optimization and preserves convergence for large problems. We evaluate our method by solving bundle adjustment as a fundamental problem. Experiments show that our optimizer significantly outperforms the state-of-the-art solutions and existing deep learning solvers considering quality, efficiency, and memory. Our stochastic domain decomposition enables distributed optimization, consumes little memory and time, and achieves similar quality compared to a global solver. As a result, our solver effectively solves nonlinear least squares on an extremely large scale. Our code will be available based on Pytorch and Mindspore. BibTeX: @inproceedings{Huang2021c, author = {Jingwei Huang and Shan Huang and Mingwei Sun}, title = {DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition}, booktitle = {Proceedings of CVPR 2021}, year = {2021} }  Huang J, Jiao Y, Lu X, Shi Y, Yang Q and Yang Y (2021), "PSNA: A pathwise semismooth Newton algorithm for sparse recovery with optimal local convergence and oracle properties", Signal Processing., December, 2021. , pp. 108432. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: We propose a pathwise semismooth Newton algorithm (PSNA) for sparse recovery in high-dimensional linear models. PSNA is derived from a formulation of the KKT conditions for Lasso and Enet based on Newton derivatives. It solves the semismooth KKT equations efficiently by actively and continuously seeking the support of the regression coefficients along the solution path with warm start. 
At each knot in the path, PSNA converges locally superlinearly for the Enet criterion and achieves the best possible convergence rate for the Lasso criterion, i.e., PSNA converges in just one step at the cost of two matrix-vector multiplication per iteration. Under certain regularity conditions on the design matrix and the minimum magnitude of the nonzero elements of the target regression coefficients, we show that PSNA hits a solution with the same signs as the regression coefficients and achieves a sharp estimation error bound in finite steps with high probability. Extensive simulation studies support our theoretical results and indicate that PSNA is competitive with or outperforms state-of-the-art Lasso solvers in terms of efficiency and accuracy. BibTeX: @article{Huang2021d, author = {Jian Huang and Yuling Jiao and Xiliang Lu and Yueyong Shi and Qinglong Yang and Yuanyuan Yang}, title = {PSNA: A pathwise semismooth Newton algorithm for sparse recovery with optimal local convergence and oracle properties}, journal = {Signal Processing}, publisher = {Elsevier BV}, year = {2021}, pages = {108432}, doi = {10.1016/j.sigpro.2021.108432} }  Huber D, Schreiber M and Schulz M (2021), "Graph-based multi-core higher-order time integration of linear autonomous partial differential equations", Journal of Computational Science., April, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Modern high-performance computing (HPC) systems rely on increasingly complex nodes with a steadily growing number of cores and matching deep memory hierarchies. In order to fully exploit them, algorithms must be explicitly designed to exploit these features. 
In this work we address this challenge for a widely used class of application kernels: polynomial-based time integration of linear autonomous partial differential equations. We build on prior work [1] of a cache-aware, yet sequential solution and provide an innovative way to parallelize it, while addressing cache-awareness across a large number of cores. For this, we introduce a dependency graph driven view of the algorithm and then use both static graph partitioning and dynamic scheduling to efficiently map the execution to the underlying platform. We implement our approach on top of the widely available Intel Threading Building Blocks (TBB) library, although the concepts are programming model agnostic and can apply to any task-driven parallel programming approach. We demonstrate the performance of our approach for a 2-nd, 4-th and 6-th order time integration of the linear advection equation on three different architectures with widely varying memory systems and achieve an up to 60% reduction of wall clock time compared to a conventional, state-of-the-art non-cache-aware approach. BibTeX: @article{Huber2021, author = {Dominik Huber and Martin Schreiber and Martin Schulz}, title = {Graph-based multi-core higher-order time integration of linear autonomous partial differential equations}, journal = {Journal of Computational Science}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.jocs.2021.101349} }  Hult R, Zanon M, Gros S and Falcone P (2021), "A Semi-Distributed Interior Point Algorithm for Optimal Coordination of Automated Vehicles at Intersections", November, 2021. [Abstract] [BibTeX] Abstract: In this paper, we consider the optimal coordination of automated vehicles at intersections under fixed crossing orders. We formulate the problem using direct optimal control and exploit the structure to construct a semi-distributed primal-dual interior-point algorithm to solve it by parallelizing most of the computations. 
Differently from standard distributed optimization algorithms, where the optimization problem is split, in our approach we split the linear algebra steps, such that the algorithm takes the same steps as a fully centralized one, while still performing computations in a distributed fashion. We analyze the communication requirements of the algorithm, and propose an approximation scheme which can significantly reduce the data exchange. We demonstrate the effectiveness of the algorithm in hard but realistic scenarios, which show that the approximation leads to reductions in communicated data of almost 99% of the exact formulation, at the expense of less than 1% suboptimality. BibTeX: @article{Hult2021, author = {Robert Hult and Mario Zanon and Sebastien Gros and Paolo Falcone}, title = {A Semi-Distributed Interior Point Algorithm for Optimal Coordination of Automated Vehicles at Intersections}, year = {2021} }  Hussain MT, Abhishek GS, Buluç A and Azad A (2021), "Parallel Algorithms for Adding a Collection of Sparse Matrices", December, 2021. [Abstract] [BibTeX] Abstract: We develop a family of parallel algorithms for the SpKAdd operation that adds a collection of k sparse matrices. SpKAdd is a much needed operation in many applications including distributed memory sparse matrix-matrix multiplication (SpGEMM), streaming accumulations of graphs, and algorithmic sparsification of the gradient updates in deep learning. While adding two sparse matrices is a common operation in Matlab, Python, Intel MKL, and various GraphBLAS libraries, these implementations do not perform well when adding a large collection of sparse matrices. We develop a series of algorithms using tree merging, heap, sparse accumulator, hash table, and sliding hash table data structures. Among them, hash-based algorithms attain the theoretical lower bounds both on the computational and I/O complexities and perform the best in practice. 
The newly-developed hash SpKAdd makes the computation of a distributed-memory SpGEMM algorithm at least 2x faster than the previous state-of-the-art algorithms. BibTeX: @article{Hussain2021, author = {Md Taufique Hussain and Guttu Sai Abhishek and Aydin Buluç and Ariful Azad}, title = {Parallel Algorithms for Adding a Collection of Sparse Matrices}, year = {2021} }  Hyun C and Lee P-S (2021), "A load balancing algorithm for the parallel automated multilevel substructuring method", Computers & Structures., December, 2021. Vol. 257, pp. 106649. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The objective of this paper is to present a load balancing algorithm for the parallel automated multilevel substructuring (PAMLS) method. In the PAMLS method, load balancing is highly dependent on the computation time for the transformation and back transformation procedures corresponding to substructures. To balance the workload among threads, the proposed algorithm consists of two types of granularity: coarse-grained and fine-grained parallel algorithms. According to the level of substructures, the coarse-grained parallel algorithm splits both the transformation and back transformation procedures and assigns them to threads. Through fine-grained parallelism, more threads are exploited for the transformation of each substructure compared to threads used in the original PAMLS method. Without repartitioning, the proposed algorithm significantly improves the efficiency of the PAMLS method. 
BibTeX: @article{Hyun2021, author = {Cheolgyu Hyun and Phill-Seung Lee}, title = {A load balancing algorithm for the parallel automated multilevel substructuring method}, journal = {Computers & Structures}, publisher = {Elsevier BV}, year = {2021}, volume = {257}, pages = {106649}, doi = {10.1016/j.compstruc.2021.106649} }  Ibrahim AH, Kumam P, Abubakar AB and Adamu A (2021), "Accelerated derivative-free method for nonlinear monotone equations with an application", Numerical Linear Algebra with Applications., November, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: In optimization theory, to speed up the convergence of iterative procedures, many mathematicians often use the inertial extrapolation method. In this article, based on the three-term derivative-free method for solving monotone nonlinear equations with convex constraints [Calcolo, 2016;53(2):133-145], we design an inertial algorithm for finding the solutions of nonlinear equation with monotone and Lipschitz continuous operator. The convergence analysis is established under some mild conditions. Furthermore, numerical experiments are implemented to illustrate the behavior of the new algorithm. The numerical results have shown the effectiveness and fast convergence of the proposed inertial algorithm over the existing algorithm. Moreover, as an application, we extend this method to solve the LASSO problem to decode a sparse signal in compressive sensing. Performance comparisons illustrate the effectiveness and competitiveness of the method. BibTeX: @article{Ibrahim2021, author = {Abdulkarim Hassan Ibrahim and Poom Kumam and Auwal Bala Abubakar and Abubakar Adamu}, title = {Accelerated derivative-free method for nonlinear monotone equations with an application}, journal = {Numerical Linear Algebra with Applications}, publisher = {Wiley}, year = {2021}, doi = {10.1002/nla.2424} }  Ibriga HS and Sun WW (2021), "Covariate-assisted Sparse Tensor Completion", March, 2021. 
[Abstract] [BibTeX] Abstract: We aim to provably complete a sparse and highly-missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising where users' click-through-rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and has many zeros on non-missing entries, which makes the standalone tensor completion method unsatisfactory. Besides the CTR tensor, additional ad features or user characteristics are often available. In this paper, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements on both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and ad covariate matrix, leading to 23% accuracy improvement over the baseline. An important by-product is that ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting. BibTeX: @article{Ibriga2021, author = {Hilda S Ibriga and Will Wei Sun}, title = {Covariate-assisted Sparse Tensor Completion}, year = {2021} }  Il'in VP (2021), "Iterative Preconditioned Methods in Krylov Spaces: Trends of the 21st Century", Computational Mathematics and Mathematical Physics., November, 2021. Vol. 61(11), pp. 1750-1775. Pleiades Publishing Ltd. [Abstract] [BibTeX] [DOI] [URL] Abstract: An analytic review of major problems and new mathematical and technological discoveries in methods for solving SLAEs is given. 
This stage of mathematical modeling is a bottleneck because the amount of the required computational resources grows nonlinearly with the increasing number of degrees of freedom of the problem. It is important that the efficiency and performance of computational methods and technologies significantly depend on how well the specific features of the class of application problems (electromagnetism, fluid dynamics, elasticity and plasticity, multiphase filtering, heat and mass transfer, etc.) are taken into account. The development of Krylov iterative processes is mainly intended for the construction of two-level algorithms with various orthogonal, projective, variational, and spectral properties, including not only polynomial but also rational and harmonic approximation techniques. Additional acceleration of such algorithms is achieved on the basis of deflation and augmenting approaches using various systems of basis vectors. The goal of intensive studies is to construct efficient preconditioning operators on the basis of various principles: new multigrid schemes and parallel domain decomposition methods, multipreconditioning, nested and alternate triangular factorizations, low-rank and other algorithms for approximating inverse matrices, etc. High-performance and scalable parallelization are based on hybrid programming using internode message passing, multithreaded computations, vectorization, and graphics processing units (GPUs). Modern trends in mathematical methods and software are aimed at the creation of an integrated environment designed for a long lifecycle and massive innovations in important applications. BibTeX: @article{IlIn2021, author = {V. P. 
Il'in}, title = {Iterative Preconditioned Methods in Krylov Spaces: Trends of the 21st Century}, journal = {Computational Mathematics and Mathematical Physics}, publisher = {Pleiades Publishing Ltd}, year = {2021}, volume = {61}, number = {11}, pages = {1750--1775}, url = {https://icmmg.nsc.ru/sites/default/files/pubs/commat2111009ilinkor.pdf}, doi = {10.1134/s0965542521110099} }  Inapakurthi RK, Miriyala SS and Mitra K (2021), "Deep Learning Based Dynamic Behavior Modelling and Prediction of Particulate Matter in Air", Chemical Engineering Journal., July, 2021. , pp. 131221. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are utilized to capture the dynamic trends of 15 environmental parameters including particulate matter and pollutants in the atmosphere that cause long-term health hazards. Despite having the capability for capturing the long-term dependencies and nonlinearities in dynamic data, these deep learning based models suffer from overfitting if hyper-parameters are not determined optimally. For this purpose, a novel evolutionary algorithm for neural architecture search balancing the accuracy-complexity trade-off through a multi-objective optimization is proposed. This algorithm not only designs optimal deep-RNNs, but also ensures simultaneous determination of activation function and truncated backpropagation length. Analysis of many-to-one and many-to-many styled RNNs concluded that latter style is more effective. Subsequently it is compared with that of LSTMs to achieve an overall accuracy between 85.612% to 99.56%. To further minimize this error, multi-variate modelling is proposed. However, since it is important to identify the most significant features, which can be considered as inputs to multi-variate deep RNNs, Monte Carlo based Global Sensitivity Analysis is performed. 
It proved the hypothesis with sufficient statistical evidence that pH of rain (whose univariate modelling accuracy was least among all) is affected by methane, carbon monoxide, non-methane hydrocarbons and total hydrocarbons, thus improving the modelling accuracy to 98.97%. These models not only can help policymakers make informed decisions and mitigate climate change, but also the approach can be extended for other time-series modelling related applications due to its generic nature. BibTeX: @article{Inapakurthi2021, author = {Ravi Kiran Inapakurthi and Srinivas Soumitri Miriyala and Kishalay Mitra}, title = {Deep Learning Based Dynamic Behavior Modelling and Prediction of Particulate Matter in Air}, journal = {Chemical Engineering Journal}, publisher = {Elsevier BV}, year = {2021}, pages = {131221}, doi = {10.1016/j.cej.2021.131221} }  Iqbal Z, Nooshabadi S, Yamazaki I, Tomov S and Dongarra J (2021), "Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems", IEEE Access. Vol. 9, pp. 116604-116611. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Convex optimization solvers are widely used in the embedded systems that require sophisticated optimization algorithms including model predictive control (MPC). In this paper, we aim to reduce the online solve time of such convex optimization solvers so as to reduce the total runtime of the algorithm and make it suitable for real-time convex optimization. We exploit the property of the Karush–Kuhn–Tucker (KKT) matrix involved in the solution of the problem that only some parts of the matrix change during the solution iterations of the algorithm. Our results show that the proposed method can effectively reduce the runtime of the solvers. 
BibTeX: @article{Iqbal2021, author = {Zafar Iqbal and Saeid Nooshabadi and Ichitaro Yamazaki and Stanimire Tomov and Jack Dongarra}, title = {Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems}, journal = {IEEE Access}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, volume = {9}, pages = {116604--116611}, doi = {10.1109/access.2021.3106054} }  Jahromi AF and Shams NN (2021), "A new optimized iterative method for solving $M$-matrix linear systems", Applications of Mathematics., November, 2021. , pp. 1-22. Institute of Mathematics, Czech Academy of Sciences. [Abstract] [BibTeX] [DOI] Abstract: In this paper, we present a new iterative method for solving a linear system, whose coefficient matrix is an M-matrix. This method includes four parameters that are obtained by the accelerated overrelaxation (AOR) splitting and using the Taylor approximation. First, under some standard assumptions, we establish the convergence properties of the new method. Then, by minimizing the Frobenius norm of the iteration matrix, we find the optimal parameters. Meanwhile, numerical results on test examples show the efficiency of the new proposed method in contrast with the Hermitian and skew-Hermitian splitting (HSS), AOR methods and a modified version of the AOR (QAOR) iteration. BibTeX: @article{Jahromi2021, author = {Alireza Fakharzadeh Jahromi and Nafiseh Nasseri Shams}, title = {A new optimized iterative method for solving $M$-matrix linear systems}, journal = {Applications of Mathematics}, publisher = {Institute of Mathematics, Czech Academy of Sciences}, year = {2021}, pages = {1--22}, doi = {10.21136/am.2021.0246-20} }  Janalík R (2021), "Node-Level Performance Modeling of Sparse Factorization Solver". Thesis at: Università della Svizzera Italiana. 
[Abstract] [BibTeX] Abstract: Solving large sparse linear systems is at the heart of many application problems arising from scientific and engineering problems. These systems are often solved by direct factorization solvers, especially when the system needs to be solved for multiple right-hand sides or when a high numerical precision is required. Direct solvers are based on matrix factorization, which is then followed by forward and backward substitution to obtain a precise solution. The factorization is the most computationally intensive step, but it has to be computed only once for a given matrix. Then the system is solved with forward and backward substitution for every right-hand side. Performance modeling of algorithms involved in solving these linear systems reveals the computational bottlenecks, which can guide node-level performance optimizations and shows the best performance that can be achieved on given architecture. In this thesis we investigate and analyze the performance of the forward/backward solution process of the PARDISO direct sparse solver and present a detailed performance analysis for its sparse solver kernel. This analysis is based on the Berkeley roofline model, a model that is widely used to predict the upper bound of a code based on processor peak performance and memory bandwidth. We establish a modified roofline model that captures the serial and parallel execution phases which allows us to predict the in-socket scaling over the processor cores. The distinction of serial and parallel execution is important as the amount of the serial fraction depends on the matrix used and can have a significant negative impact on performance. We compared the roofline model with an alternative Erlangen ECM model and provide discussion on usability and modeling capabilities of both models. The model predictions are compared with various measurements for a representative set of sparse matrices on different x86_64 processors. 
The performance analysis and modeling performed in this work are limited to a single node, however, the code considered here is also a building block for the MPI parallel version. Hence, also the distributed memory implementation of PARDISO will profit from any enhancement achieved. BibTeX: @phdthesis{Janalik2021, author = {Radim Janalík}, title = {Node-Level Performance Modeling of Sparse Factorization Solver}, school = {Università della Svizzera Italiana}, year = {2021} }  Jelich C, Karimi M, Kessissoglou N and Marburg S (2021), "Efficient solution of block Toeplitz systems with multiple right-hand sides arising from a periodic boundary element formulation", Engineering Analysis with Boundary Elements., September, 2021. Vol. 130, pp. 135-144. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Block Toeplitz matrices are a special class of matrices that exhibit reduced memory requirements and a reduced complexity of matrix-vector multiplications. We herein present an efficient computational approach to solve a sequence of block Toeplitz systems arising from a block Toeplitz system with multiple right-hand sides. Two different numerical schemes are implemented for the solution of the sequence of block Toeplitz systems based on global and block variants of the generalized minimal residual (GMRES) method. The performance of the schemes is assessed in terms of the wall clock time of the iterative solution process, the number of multiplications with the block Toeplitz system matrix and the peak memory usage. To demonstrate the method, two numerical examples are presented. In the first case study, aeroacoustic prediction of an airfoil in turbulent flow is examined, which requires multiple solutions of the wall pressure field beneath the turbulent boundary layer. 
The fluctuating pressure on the surface of the airfoil is synthesized in terms of uncorrelated wall plane waves, whereby each realization of the wall pressure field is an input to the acoustic solver based on the boundary element method (BEM). The total acoustic response from the airfoil in turbulent flow is then obtained from an ensemble average for the number of realizations considered. The number of realizations to yield a converged solution for the wall pressure field leads to a sequence of block Toeplitz systems. The second case study examines the nonlinear eigenvalue analysis of a sonic crystal barrier composed of locally resonant C-shaped sound-hard scatterers. The periodicity of the sound barrier leads to a block Toeplitz system matrix whereas the nonlinear eigenvalue problem requires the solution of sequences of linear systems. The combined technique to solve the sequences of block Toeplitz systems using the proposed variants of the GMRES is shown to yield a computationally efficient approach for flow noise prediction and nonlinear eigenvalue analysis. BibTeX: @article{Jelich2021, author = {Christopher Jelich and Mahmoud Karimi and Nicole Kessissoglou and Steffen Marburg}, title = {Efficient solution of block Toeplitz systems with multiple right-hand sides arising from a periodic boundary element formulation}, journal = {Engineering Analysis with Boundary Elements}, publisher = {Elsevier BV}, year = {2021}, volume = {130}, pages = {135--144}, doi = {10.1016/j.enganabound.2021.05.003} }  Jensen SM (2021), "Use of Machine Learning in Climate Econometrics". Thesis at: Aarhus University. 
[Abstract] [BibTeX] [URL] Abstract: This dissertation consists of three self-contained chapters on the use of machine learning in climate econometrics and is particularly concerned with how tools and ideas from the fields of econometrics and machine learning can be combined to shed new light on the relationship between macroeconomic activity and carbon dioxide (CO2) emissions. According to the Intergovernmental Panel on Climate Change (IPCC) of the United Nations, CO2 emissions constitute the key driver of climate change and are driven largely by economic and population growth (IPCC, 2014), highlighting the importance of a sound understanding of the relationship between macroeconomic activity and CO2 emissions. BibTeX: @phdthesis{Jensen2021, author = {Sebastian Mathias Jensen}, title = {Use of Machine Learning in Climate Econometrics}, school = {Aarhus University}, year = {2021}, url = {https://pure.au.dk/portal/files/225863920/PhD_dissertation_Sebastian_Mathias_Jensen.pdf} }  Ji Y (2021), "High-Performance Graph Computing and Application in Cybersecurity". Thesis at: The George Washington University. [Abstract] [BibTeX] Abstract: Graph is a natural representation for many real-world applications, such as road map, protein-protein interaction network, and code graph. The graph algorithms can help mine useful knowledge from the corresponding graphs, such as navigation on road map graph and vulnerability detection from code graphs. This dissertation strives to build fast and scalable graph analytics techniques and apply them to cybersecurity applications. Chapter 1 introduces the background of graph connectivity algorithms, graph neural networks, and graphs in cybersecurity applications. Later, it summarizes related works and highlights the challenges and contributions of this dissertation. Chapter 2 introduces iSpan, a fast spanning tree construction method for computing strongly connected component. 
iSpan consists of parallel, relaxed synchronization construction of spanning trees for detecting the large and small SCCs, combined with fast trims for small SCCs. The evaluations show that iSpan is able to significantly outperform current state-of-the-art DFS and BFS-based methods by average 18× and 4×, respectively. Chapter 3 describes Aquila, an adaptive parallel computation framework that covers a wide range of different highly optimized graph connectivity algorithms. Given a graph, Aquila first transforms the query if it can be answered with partial computation. During the computation, Aquila is able to greatly reduce the workload by up to 98%. Furthermore, Aquila identifies the irregular tasks in the connectivity algorithms and applies different parallel strategies for different tasks. As a result, Aquila significantly outperforms existing systems by orders of magnitude. Chapter 4 designs BugGraph, which performs source-binary code similarity detection in two steps. First, BugGraph identifies the compilation provenance of the target binary and compiles the comparing source code to a binary with the same provenance. Second, BugGraph utilizes a new graph triplet-loss network on the attributed control flow graph to produce a similarity ranking. The experiments on four real-world datasets show that BugGraph achieves 90% and 75% true positive rate for syntax equivalent and similar code, respectively, an improvement of 16% and 24% over state-of-the-art methods. Moreover, BugGraph is able to identify 140 vulnerabilities in six commercial firmware. Chapter 5 presents Vestige, a new compilation provenance identification system for binary code. Vestige builds a new representation of the binary code, i.e., attributed function call graph (AFCG), that covers three types of features: idiom features at the instruction level, graphlet features at the function level, and function call graph at the binary level. 
Vestige applies a graph neural network model on the AFCG and generates representative embeddings for provenance identification. The experiment shows that Vestige achieves 96% accuracy on the publicly available datasets of more than 6,000 binaries, which is significantly better than previous works. When applied for binary code vulnerability detection, Vestige can help to improve the top-1 hit rate of three recent code vulnerability detection methods by up to 27%. Most graph neural networks (GNNs) work on the classical attributed graph structure, while we observe that a nested graph structure is a more accurate representation for many practical applications. Observing no existing GNNs can directly learn on such graph structure by synchronizing both the outer and inner graphs, Chapter 6 designs NestedGNN, the first graph neural network for nested graphs. NestedGNN consists of three layers, i.e., inner GNN layers, nested graph layers, and outer GNN layers. We successfully build NestedGNN on top of four different types of traditional GNNs and evaluate with three case studies where NestedGNN is able to significantly improve the performance over traditional GNN models. BibTeX: @phdthesis{Ji2021, author = {Ji, Yuede}, title = {High-Performance Graph Computing and Application in Cybersecurity}, school = {The George Washington University}, year = {2021} }  Jiang Z, Gao W, Tang F, Xiong X, Wang L, Lan C, Luo C, Li H and Zhan J (2021), "HPC AI500: Representative, Repeatable and Simple HPC AI Benchmarking", February, 2021. [Abstract] [BibTeX] Abstract: Recent years witness a trend of applying large-scale distributed deep learning algorithms (HPC AI) in both business and scientific computing areas, whose goal is to speed up the training time to achieve a state-of-the-art quality. The HPC AI benchmarks accelerate the process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges. 
This paper presents a representative, repeatable and simple HPC AI benchmarking methodology. Among the seventeen AI workloads of AIBench Training -- by far the most comprehensive AI Training benchmarks suite -- we choose two representative and repeatable AI workloads. The selected HPC AI benchmarks include both business and scientific computing: Image Classification and Extreme Weather Analytics. To rank HPC AI systems, we present a new metric named Valid FLOPS, emphasizing both throughput performance and a target quality. The specification, source code, datasets, and HPC AI500 ranking numbers are publicly available from https://www.benchcouncil.org/HPCAI500/. BibTeX: @article{Jiang2021, author = {Zihan Jiang and Wanling Gao and Fei Tang and Xingwang Xiong and Lei Wang and Chuanxin Lan and Chunjie Luo and Hongxiao Li and Jianfeng Zhan}, title = {HPC AI500: Representative, Repeatable and Simple HPC AI Benchmarking}, year = {2021} }  Jiang X, Zeng X, Sun J and Chen J (2021), "Distributed proximal gradient algorithm for non-smooth non-convex optimization over time-varying networks", March, 2021. [Abstract] [BibTeX] Abstract: This note studies the distributed non-convex optimization problem with non-smooth regularization, which has wide applications in decentralized learning, estimation and control. The objective function is the sum of different local objective functions, which consist of differentiable (possibly non-convex) cost functions and non-smooth convex functions. This paper presents a distributed proximal gradient algorithm for the non-smooth non-convex optimization problem over time-varying multi-agent networks. Each agent updates local variable estimate by the multi-step consensus operator and the proximal operator. We prove that the generated local variables achieve consensus and converge to the set of critical points with convergence rate O(1/T). Finally, we verify the efficacy of proposed algorithm by numerical simulations. 
BibTeX: @article{Jiang2021a, author = {Xia Jiang and Xianlin Zeng and Jian Sun and Jie Chen}, title = {Distributed proximal gradient algorithm for non-smooth non-convex optimization over time-varying networks}, year = {2021} }  Jiang F and Ma J (2021), "A comprehensive study of macro factors related to traffic fatality rates by XGBoost-based model and GIS techniques", Accident Analysis & Prevention., December, 2021. Vol. 163 Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: With the fast development of economics, road safety is becoming a serious problem. Exploring macro factors is effective to improve road safety. However, the existing studies have some limitations: (1) The existing studies only considered one aspect of macro factors and constructed models based on a few data samples. (2) The methods commonly used cannot address the non-linear relationship or calculate the feature importance. The findings obtained from such models may be limited and biased. To address the limitations, this study proposes a BO-CV-XGBoost framework to explore the macro factors related to traffic fatality rate classes based on a high-dimensional dataset that fully considers the impact of multi-factor interaction with adequate data samples. The proposed framework is applied to a dataset in the US. 453 county-level macro factors are collected from various data sources, covering ten macro aspects, including topography, transportation, etc. The optimized BO-CV-XGBoost model obtains the best classification performance with an AUC of 0.8977 and an accuracy of 85.02%. Compared with other methods, the proposed model has superiority on fatality rate classification. Ten macro factors are identified, including ‘Current-dollar GDP’, ‘highway miles per person’, etc. The ten factors contain four aspects of information, including economics, transportation, education, and medical condition. 
Geographic information system (GIS) techniques are further used for spatial analysis of the identified macro factors. Therefore, targeted and effective measures are accordingly proposed to prevent traffic fatalities and improve road safety. BibTeX: @article{Jiang2021b, author = {Feifeng Jiang and Jun Ma}, title = {A comprehensive study of macro factors related to traffic fatality rates by XGBoost-based model and GIS techniques}, journal = {Accident Analysis & Prevention}, publisher = {Elsevier BV}, year = {2021}, volume = {163}, doi = {10.1016/j.aap.2021.106431} }  Jin B, Peruzzi M and Dunson DB (2021), "Bag of DAGs: Flexible & Scalable Modeling of Spatiotemporal Dependence", December, 2021. [Abstract] [BibTeX] Abstract: We propose a computationally efficient approach to construct a class of nonstationary spatiotemporal processes in point-referenced geostatistical models. Current methods that impose nonstationarity directly on covariance functions of Gaussian processes (GPs) often suffer from computational bottlenecks, causing researchers to choose less appropriate alternatives in many applications. A main contribution of this paper is the development of a well-defined nonstationary process using multiple yet simple directed acyclic graphs (DAGs), which leads to computational efficiency, flexibility, and interpretability. Rather than acting on the covariance functions, we induce nonstationarity via sparse DAGs across domain partitions, whose edges are interpreted as directional correlation patterns in space and time. We account for uncertainty about these patterns by considering local mixtures of DAGs, leading to a "bag of DAGs" approach. We are motivated by spatiotemporal modeling of air pollutants in which a directed edge in DAGs represents a prevailing wind direction causing some associated covariance in the pollutants; for example, an edge for northwest to southeast winds. 
We establish Bayesian hierarchical models embedding the resulting nonstationary process from the bag of DAGs approach and illustrate inferential and performance gains of the methods compared to existing alternatives. We consider a novel application focusing on the analysis of fine particulate matter (PM2.5) in South Korea and the United States. The code for all analyses is publicly available on Github. BibTeX: @article{Jin2021, author = {Bora Jin and Michele Peruzzi and David B. Dunson}, title = {Bag of DAGs: Flexible & Scalable Modeling of Spatiotemporal Dependence}, year = {2021} }  Juez-Gil M, Arnaiz-González Á, Rodríguez JJ, López-Nozal C and García-Osorio C (2021), "Rotation Forest for Big Data", Information Fusion., March, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The Rotation Forest classifier is a successful ensemble method for a wide variety of data mining applications. However, the way in which Rotation Forest transforms the feature space through PCA, although powerful, penalizes training and prediction times, making it unfeasible for Big Data. In this paper, a MapReduce Rotation Forest and its implementation under the Spark framework are presented. The proposed MapReduce Rotation Forest behaves in the same way as the standard Rotation Forest, training the base classifiers on a rotated space, but using a functional implementation of the rotation that enables its execution in Big Data frameworks. Experimental results are obtained using different cloud-based cluster configurations. Bayesian tests are used to validate the method against two ensembles for Big Data: Random Forest and PCARDE classifiers. Our proposal incorporates the parallelization of both the PCA calculation and the tree training, providing a scalable solution that retains the performance of the original Rotation Forest and achieves a competitive execution time (in average, at training, more than 3 times faster than other PCA-based alternatives). 
In addition, extensive experimentation shows that by setting some parameters of the classifier (i.e., bootstrap sample size, number of trees, and number of rotations), the execution time is reduced with no significant loss of performance using a small ensemble. BibTeX: @article{JuezGil2021, author = {Mario Juez-Gil and Álvar Arnaiz-González and Juan J. Rodríguez and Carlos López-Nozal and César García-Osorio}, title = {Rotation Forest for Big Data}, journal = {Information Fusion}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.inffus.2021.03.007} }  Kalantzis V, Xi Y and Horesh L (2021), "Fast randomized non-Hermitian eigensolver based on rational filtering and matrix partitioning", March, 2021. [Abstract] [BibTeX] Abstract: This paper describes a set of rational filtering algorithms to compute a few eigenvalues (and associated eigenvectors) of non-Hermitian matrix pencils. Our interest lies in computing eigenvalues located inside a given disk, and the proposed algorithms approximate these eigenvalues and associated eigenvectors by harmonic Rayleigh-Ritz projections on subspaces built by computing range spaces of rational matrix functions through randomized range finders. These rational matrix functions are designed so that directions associated with non-sought eigenvalues are dampened to (approximately) zero. Variants based on matrix partitionings are introduced to further reduce the overall complexity of the proposed framework. Compared with existing eigenvalue solvers based on rational matrix functions, the proposed technique requires no estimation of the number of eigenvalues located inside the disk. Several theoretical and practical issues are discussed, and the competitiveness of the proposed framework is demonstrated via numerical experiments. 
BibTeX: @article{Kalantzis2021, author = {Vassilis Kalantzis and Yuanzhe Xi and Lior Horesh}, title = {Fast randomized non-Hermitian eigensolver based on rational filtering and matrix partitioning}, year = {2021} }  Kalantzis V, Gupta A, Horesh L, Nowicki T, Squillante MS and Wu CW (2021), "Solving sparse linear systems with approximate inverse preconditioners on analog devices", July, 2021. [Abstract] [BibTeX] Abstract: Sparse linear system solvers are computationally expensive kernels that lie at the heart of numerous applications. This paper proposes a flexible preconditioning framework to substantially reduce the time and energy requirements of this task by utilizing a hybrid architecture that combines conventional digital microprocessors with analog crossbar array accelerators. Our analysis and experiments with a simulator for analog hardware demonstrate that an order of magnitude speedup is readily attainable without much impact on convergence, despite the noise in analog computations. BibTeX: @article{Kalantzis2021a, author = {Vasileios Kalantzis and Anshul Gupta and Lior Horesh and Tomasz Nowicki and Mark S. Squillante and Chai Wah Wu}, title = {Solving sparse linear systems with approximate inverse preconditioners on analog devices}, year = {2021} }  Kalinin KP and Berloff NG (2021), "Large-scale Sustainable Search on Unconventional Computing Hardware", April, 2021. [Abstract] [BibTeX] Abstract: Since the advent of the Internet, quantifying the relative importance of web pages is at the core of search engine methods. According to one algorithm, PageRank, the worldwide web structure is represented by the Google matrix, whose principal eigenvector components assign a numerical value to web pages for their ranking. Finding such a dominant eigenvector on an ever-growing number of web pages becomes a computationally intensive task incompatible with Moore's Law. 
We demonstrate that special-purpose optical machines such as networks of optical parametric oscillators, lasers, and gain-dissipative condensates, may aid in accelerating the reliable reconstruction of principal eigenvectors of real-life web graphs. We discuss the feasibility of simulating the PageRank algorithm on large Google matrices using such unconventional hardware. We offer alternative rankings based on the minimisation of spin Hamiltonians. Our estimates show that special-purpose optical machines may provide dramatic improvements in power consumption over classical computing architectures. BibTeX: @article{Kalinin2021, author = {Kirill P. Kalinin and Natalia G. Berloff}, title = {Large-scale Sustainable Search on Unconventional Computing Hardware}, year = {2021} }  Kang Y, Choi H, Im J, Park S, Shin M, Song C-K and Kim S (2021), "Estimation of surface-level NO2 and O3 concentrations using TROPOMI data and machine learning over East Asia", Environmental Pollution., July, 2021. , pp. 117711. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: In East Asia, air quality has been recognized as an important public health problem. In particular, the surface concentrations of air pollutants are closely related to human life. This study aims to develop models for estimating high spatial resolution surface concentrations of NO2 and O3 from TROPOspheric Monitoring Instrument (TROPOMI) data in East Asia. The machine learning was adopted by fusion of various satellite-based variables, numerical model-based meteorological variables, and land-use variables. Four machine learning approaches—Support Vector Regression (SVR), Random Forest (RF), Extreme Gradient Boost (XGB), and Light Gradient Boosting Machine (LGBM)—were evaluated and compared with Multiple Linear Regression (MLR) as a base statistical method. This study also modeled the NO2 and O3 concentrations over the ocean surface (i.e., land model for scheme 1 and ocean model for scheme 2). 
The estimated surface concentrations were validated through three cross-validation approaches (i.e., random, temporal, and spatial). The results showed that the NO2 model produced R2 of 0.63--0.70 and normalized root-mean-square-error (nRMSE) of 38.3--42.2% and the O3 model resulted in R2 of 0.65--0.78 and nRMSE of 19.6--24.7% for scheme 1. The indirect validation based on the stations near the coastline for scheme 2 showed slight decrease (≈ 0.3--2.4%) in nRMSE when compared to scheme 1. The contributions of input variables to the models were analyzed based on SHapley Additive exPlanations (SHAP) values. The NO2 vertical column density among the TROPOMI-derived variables showed the largest contribution in both the NO2 and O3 models. BibTeX: @article{Kang2021, author = {Yoojin Kang and Hyunyoung Choi and Jungho Im and Seohui Park and Minso Shin and Chang-Keun Song and Sangmin Kim}, title = {Estimation of surface-level NO2 and O3 concentrations using TROPOMI data and machine learning over East Asia}, journal = {Environmental Pollution}, publisher = {Elsevier BV}, year = {2021}, pages = {117711}, doi = {10.1016/j.envpol.2021.117711} }  Kanno Y (2021), "Accelerated proximal gradient method for bi-modulus static elasticity", Optimization and Engineering., March, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Bi-modulus constitutive law assumes that material constants have different values in tension and compression. It is known that finding an equilibrium state of an elastic body consisting of a bi-modulus material is recast as a semidefinite programming problem, which can be solved with a primal-dual interior-point method. As an alternative approach, this paper presents a fast first-order optimization method. Specifically, we propose an accelerated proximal gradient method for solving a minimization problem of the total potential energy. This algorithm is easy to implement, and free from numerical solution of linear equations. 
Numerical experiments demonstrate that the proposed method outperforms the semidefinite programming approach with a standard solver implementing a primal-dual interior-point method. BibTeX: @article{Kanno2021, author = {Yoshihiro Kanno}, title = {Accelerated proximal gradient method for bi-modulus static elasticity}, journal = {Optimization and Engineering}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11081-021-09595-2} }  Kara G and Özturan C (2021), "Parallel network simplex algorithm for the minimum cost flow problem", October, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: In this work, we contribute a parallel implementation of the network simplex algorithm that is used for the solution of minimum cost flow problem. In the network simplex algorithm, finding an entering arc requires searching through many arcs to decide which one should be included in the spanning tree solution on the next iteration. We propose finding the entering arc in parallel as it often takes the majority of the execution time. A usual strategy is to pick the arc violating the optimality the most out of all possible candidates. Scanning all arcs can take quite some time, so it is common to consider only a fixed number of arcs which is referred as the block search pivoting rule. Arc scans can easily be done in parallel to find the best candidate as the calculations are independent of each other. We used shared memory parallelism using OpenMP along with vectorization using AVX instructions. We also tried adjusting block sizes to increase the parallel portion of the algorithm. Our dataset consists of various natural and synthetic graphs with sizes up to a billion arcs. Our experiments show speedups up to four are possible, though they are typically lower. 
BibTeX: @article{Kara2021, author = {Gökçehan Kara and Can Özturan}, title = {Parallel network simplex algorithm for the minimum cost flow problem}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6659} }  Kardos D, Patassy P, Szabó S and Zaválnij B (2021), "Numerical experiments with LP formulations of the maximum clique problem", Central European Journal of Operations Research., September, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: The maximum clique problem calls for determining the size of the largest clique in a given graph. This graph problem affords a number of zero-one linear programming formulations. In this case study we deal with some of these formulations. We consider ways for tightening the formulations. We carry out numerical experiments to see the improvements the tightened formulations provide. BibTeX: @article{Kardos2021, author = {Dóra Kardos and Patrik Patassy and Sándor Szabó and Bogdán Zaválnij}, title = {Numerical experiments with LP formulations of the maximum clique problem}, journal = {Central European Journal of Operations Research}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10100-021-00776-z} }  Karim S and Solomonik E (2021), "Efficient Preconditioners for Interior Point Methods via a new Schur Complementation Strategy", April, 2021. [Abstract] [BibTeX] Abstract: We propose new preconditioned iterative solvers for linear systems arising in primal-dual interior-point methods for convex quadratic programming problems. These preconditioned conjugate gradient methods operate on an implicit Schur complement of the KKT system at each iteration. In contrast to standard approaches, the Schur complement we consider enables the reuse of the factorization of the Hessian of the equality-constraint Lagrangian across all interior point iterations. 
Further, the resulting reduced system admits preconditioners that directly alleviate the ill-conditioning associated with the strict complementarity condition in interior point methods. The two preconditioners we propose also provably reduce the number of unique eigenvalues for the coefficient matrix (CG iteration count). One is efficient when the number of equality constraints is small, while the other is efficient when the number of remaining degrees of freedom is small. Numerical experiments with synthetic problems and problems from the Maros-Mészáros QP collection show that our preconditioned inexact interior point are effective at improving conditioning and reducing cost. Across all test problems for which the direct method is not fastest, our preconditioned methods achieve a reduction in cost by a geometric mean of 1.432 relative to the best alternative preconditioned method for each problem. BibTeX: @article{Karim2021, author = {Samah Karim and Edgar Solomonik}, title = {Efficient Preconditioners for Interior Point Methods via a new Schur Complementation Strategy}, year = {2021} }  Kassab L (2021), "Iterative Matrix Completion and Topic Modeling Using Matrix and Tensor Factorizations". Thesis at: Colorado State University. [Abstract] [BibTeX] Abstract: With the ever-increasing access to data, one of the greatest challenges that remains is how to make sense out of this abundance of information. In this dissertation, we propose three techniques that take into account underlying structure in large-scale data to produce better or more interpretable results for machine learning tasks. One of the challenges that arise when it comes to analyzing large-scale datasets is missing values in data, which could be challenging to handle without efficient methods. We propose adjusting an iteratively reweighted least squares algorithm for low-rank matrix completion to take into account sparsity-based structure in the missing entries. 
We also propose an iterative gradient-projection-based implementation of the algorithm, and present numerical experiments showcasing the performance of the algorithm compared to standard algorithms. Another challenge arises while performing a (semi-)supervised learning task on high-dimensional data. We propose variants of semi-supervised nonnegative matrix factorization models and provide motivation for these models as maximum likelihood estimators. The proposed models simultaneously provide a topic model and a model for classification. We derive training methods using multiplicative updates for each new model, and demonstrate the application of these models to document classification (e.g., 20 Newsgroups dataset). Lastly, although many datasets can be represented as matrices, datasets also often arise as high-dimensional arrays, known as higher-order tensors. We show that nonnegative CANDECOMP/PARAFAC tensor decomposition successfully detects short-lasting topics in temporal text datasets, including news headlines and COVID-19 related tweets, that other popular methods such as Latent Dirichlet Allocation and Nonnegative Matrix Factorization fail to fully detect. BibTeX: @phdthesis{Kassab2021, author = {Lara Kassab}, title = {Iterative Matrix Completion and Topic Modeling Using Matrix and Tensor Factorizations}, school = {Colorado State University}, year = {2021} }  Katayama S and Ohtsuka T (2021), "Structure-Exploiting Newton-Type Method for Optimal Control of Switched Systems", December, 2021. [Abstract] [BibTeX] Abstract: This study proposes an efficient Newton-type method for the optimal control of switched systems under a given mode sequence. A mesh-refinement-based approach is utilized to discretize continuous-time optimal control problems (OCPs) and formulate a nonlinear program (NLP), which guarantees the local convergence of a Newton-type method. 
A dedicated structure-exploiting algorithm (Riccati recursion) is proposed to perform a Newton-type method for the NLP efficiently because its sparsity structure is different from a standard OCP. The proposed method computes each Newton step with linear time-complexity for the total number of discretization grids as the standard Riccati recursion algorithm. Additionally, the computation is always successful if the solution is sufficiently close to a local minimum. Conversely, general quadratic programming (QP) solvers cannot accomplish this because the Hessian matrix is inherently indefinite. Moreover, a modification on the reduced Hessian matrix is proposed using the nature of the Riccati recursion algorithm as the dynamic programming for a QP subproblem to enhance the convergence. A numerical comparison is conducted with off-the-shelf NLP solvers, which demonstrates that the proposed method is up to two orders of magnitude faster. Whole-body optimal control of quadrupedal gaits is also demonstrated and shows that the proposed method can achieve the whole-body model predictive control (MPC) of robotic systems with rigid contacts. BibTeX: @article{Katayama2021, author = {Sotaro Katayama and Toshiyuki Ohtsuka}, title = {Structure-Exploiting Newton-Type Method for Optimal Control of Switched Systems}, year = {2021} }  Kaya K, Yılmaz Y, Yaslan Y, Öğüdücü ŞG and Çıngı F (2021), "Demand forecasting model using hotel clustering findings for hospitality industry", Information Processing & Management., November, 2021. , pp. 102816. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Tourism has become a growing industry day by day with the developing economic conditions and the increasing communication and social interaction ability of the people. Forecasting tourism demand is not only important for tourism operators to maximize their revenues but also important for the formation of economic plans of the countries on a global scale. 
Based on the predictions countries are able to regulate the sectors that benefit economically from tourism locally. Therefore, it is crucial to accurately predict the demand in many weeks advance. In this study, we propose a new demand forecasting model for the hospitality industry that forecasts weekly hotel demand four weeks in advance through Attention-Long Short Term Memory (Attention-LSTM). Unlike most of the existing methods, the proposed method utilizes the time series demand data together with additional features obtained from K-Means Clustering findings such as Top 10 Hotel Features or Hotel Embeddings obtained using Neural Networks (NN). While creating our model, the clustering part was influenced by the fact that travelers choose their accommodation according to certain criteria, and the hotels meeting similar criteria may have similar demands. Therefore, before the clustering part, we also applied methods that would enable us to represent the features of the hotels more properly and we observed that 10-D Embedded Hotel Data representation with NN Embeddings came to the fore. In order to observe the performance of the proposed hotel demand forecasting model we used a real-world dataset provided by a tourism agency in Turkey and the results show that the proposed model achieves less mean absolute error and mean absolute percentage error (at worst % 3 and at most % 29 improvements) compared to the currently used machine learning and deep learning models. 
BibTeX: @article{Kaya2021, author = {Kıymet Kaya and Yaren Yılmaz and Yusuf Yaslan and Şule Gündüz Öğüdücü and Furkan Çıngı}, title = {Demand forecasting model using hotel clustering findings for hospitality industry}, journal = {Information Processing & Management}, publisher = {Elsevier BV}, year = {2021}, pages = {102816}, doi = {10.1016/j.ipm.2021.102816} }  Ke Y and Ma C (2021), "Prediction-correction matrix splitting iteration algorithm for a class of large and sparse linear systems", Applied Numerical Mathematics., July, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: For the large and sparse linear systems, we utilize the efficient splittings of the system matrix and introduce an intermediate variable. The main contribution of this paper is that a prediction-correction matrix splitting iteration algorithm is constructed from the view of numerical optimization to solve the derived equation instead, which is inspired by the idea of adaptive parameter update. The novel algorithm adopts the prediction and correction two-step iteration, which uses information with delay to define the iterations. The global convergence results are established and the algorithm enjoys at least a Q-linear convergence rate under some suitable conditions. Further, a preconditioned version is also presented. Compared with some well-known algorithms, numerical experiments show the efficiency and effectiveness of the new proposal with application to the three-dimensional convection-diffusion equation and the image restoration problems. 
BibTeX: @article{Ke2021, author = {Yifen Ke and Changfeng Ma}, title = {Prediction-correction matrix splitting iteration algorithm for a class of large and sparse linear systems}, journal = {Applied Numerical Mathematics}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.apnum.2021.07.004} }  Kennedy G and Fu Y (2021), "Topology Optimization Benchmark Problems for Assessing the Performance of Optimization Algorithms", In AIAA Scitech 2021 Forum., January, 2021. American Institute of Aeronautics and Astronautics. [Abstract] [BibTeX] [DOI] Abstract: This paper presents a set of benchmark topology optimization problems that are used to evaluate the performance of optimization software. The topology optimization formulations considered in this benchmark set include mass-constrained compliance minimization, stress-constrained mass minimization, mass and stress-constrained compliance minimization, and mass and frequency-constrained compliance minimization. Both structured and unstructured quadrilateral and hexahedral meshes are used with conventional and non-conventional topology optimization design domains. In total, the benchmark set contains 108 2D and 72 3D domain and mesh combinations. In this preliminary work, the performance of the optimizers SNOPT, IPOPT and ParOpt are evaluated. Performance profiles are used to assess the performance of an optimizer across the full benchmark set. We find that SNOPT performs the best overall among all optimizers. Furthermore, a modification of the quasi-Newton Hessian update provides a considerable performance benefit in both objective function value and discreteness measure. 
BibTeX: @inproceedings{Kennedy2021, author = {Graeme Kennedy and Yicong Fu}, title = {Topology Optimization Benchmark Problems for Assessing the Performance of Optimization Algorithms}, booktitle = {AIAA Scitech 2021 Forum}, publisher = {American Institute of Aeronautics and Astronautics}, year = {2021}, doi = {10.2514/6.2021-1357} }  Kepner J, Davis T, Gadepally V, Jananthan H and Milechin L (2021), "Mathematics of Digital Hyperspace", March, 2021. [Abstract] [BibTeX] Abstract: Social media, e-commerce, streaming video, e-mail, cloud documents, web pages, traffic flows, and network packets fill vast digital lakes, rivers, and oceans that we each navigate daily. This digital hyperspace is an amorphous flow of data supported by continuous streams that stretch standard concepts of type and dimension. The unstructured data of digital hyperspace can be elegantly represented, traversed, and transformed via the mathematics of hypergraphs, hypersparse matrices, and associative array algebra. This paper explores a novel mathematical concept, the semilink, that combines pairs of semirings to provide the essential operations for graph analytics, database operations, and machine learning. The GraphBLAS standard currently supports hypergraphs, hypersparse matrices, the mathematics required for semilinks, and seamlessly performs graph, network, and matrix operations. With the addition of key based indices (such as pointers to strings) and semilinks, GraphBLAS can become a richer associative array algebra and be a plug-in replacement for spreadsheets, database tables, and data centric operating systems, enhancing the navigation of unstructured data found in digital hyperspace. 
BibTeX: @article{Kepner2021, author = {Jeremy Kepner and Timothy Davis and Vijay Gadepally and Hayden Jananthan and Lauren Milechin}, title = {Mathematics of Digital Hyperspace}, year = {2021} }  Kepner J, Jones M, Andersen D, Buluc A, Byun C, Claffy K, Davis T, Arcand W, Bernays J, Bestor D, Bergeron W, Gadepally V, Houle M, Hubbell M, Klein A, Meiners C, Milechin L, Mullen J, Pisharody S, Prout A, Reuther A, Rosa A, Samsi S, Stetson D, Tse A, Yee C and Michaleas P (2021), "Spatial Temporal Analysis of 40,000,000,000,000 Internet Darkspace Packets", August, 2021. [Abstract] [BibTeX] Abstract: The Internet has never been more important to our society, and understanding the behavior of the Internet is essential. The Center for Applied Internet Data Analysis (CAIDA) Telescope observes a continuous stream of packets from an unsolicited darkspace representing 1/256 of the Internet. During 2019 and 2020 over 40,000,000,000,000 unique packets were collected representing the largest ever assembled public corpus of Internet traffic. Using the combined resources of the Supercomputing Centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT, the spatial temporal structure of anonymized source-destination pairs from the CAIDA Telescope data has been analyzed with GraphBLAS hierarchical hypersparse matrices. These analyses provide unique insight on this unsolicited Internet darkspace traffic with the discovery of many previously unseen scaling relations. The data show a significant sustained increase in unsolicited traffic corresponding to the start of the COVID19 pandemic, but relatively little change in the underlying scaling relations associated with unique sources, source fan-outs, unique links, destination fan-ins, and unique destinations. This work provides a demonstration of the practical feasibility and benefit of the safe collection and analysis of significant quantities of anonymized Internet traffic. 
BibTeX: @article{Kepner2021a, author = {Jeremy Kepner and Michael Jones and Daniel Andersen and Aydin Buluc and Chansup Byun and K Claffy and Timothy Davis and William Arcand and Jonathan Bernays and David Bestor and William Bergeron and Vijay Gadepally and Micheal Houle and Matthew Hubbell and Anna Klein and Chad Meiners and Lauren Milechin and Julie Mullen and Sandeep Pisharody and Andrew Prout and Albert Reuther and Antonio Rosa and Siddharth Samsi and Doug Stetson and Adam Tse and Charles Yee and Peter Michaleas}, title = {Spatial Temporal Analysis of 40,000,000,000,000 Internet Darkspace Packets}, year = {2021} }  Khodadadi A and Saeidi S (2021), "Discovering the maximum k-clique on social networks using bat optimization algorithm", Computational Social Networks., February, 2021. Vol. 8(1) Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: The k-clique problem is identifying the largest complete subgraph of size k on a network, and it has many applications in Social Network Analysis (SNA), coding theory, geometry, etc. Due to the NP-Complete nature of the problem, the meta-heuristic approaches have raised the interest of the researchers and some algorithms are developed. In this paper, a new algorithm based on the Bat optimization approach is developed for finding the maximum k-clique on a social network to increase the convergence speed and evaluation criteria such as Precision, Recall, and F1-score. The proposed algorithm is simulated in Matlab® software over Dolphin social network and DIMACS dataset for k = 3, 4, 5. The computational results show that the convergence speed on the former dataset is increased in comparison with the Genetic Algorithm (GA) and Ant Colony Optimization (ACO) approaches. Besides, the evaluation criteria are also modified on the latter dataset and the F1-score is obtained as 100% for k = 5. 
BibTeX: @article{Khodadadi2021, author = {Akram Khodadadi and Shahram Saeidi}, title = {Discovering the maximum k-clique on social networks using bat optimization algorithm}, journal = {Computational Social Networks}, publisher = {Springer Science and Business Media LLC}, year = {2021}, volume = {8}, number = {1}, doi = {10.1186/s40649-021-00087-y} }  Khor CS (2021), "Recent Advancements in Commercial Integer Optimization Solvers for Business Intelligence Applications", In E-Business - Higher Education and Intelligence Applications., May, 2021. IntechOpen. [Abstract] [BibTeX] [DOI] Abstract: The chapter focuses on the recent advancements in commercial integer optimization solvers as exemplified by the CPLEX software package particularly but not limited to mixed-integer linear programming (MILP) models applied to business intelligence applications. We provide background on the main underlying algorithmic method of branch-and-cut, which is based on the established optimization solution methods of branch-and-bound and cutting planes. The chapter also covers heuristic-based algorithms, which include preprocessing and probing strategies as well as the more advanced methods of local or neighborhood search for polishing solutions toward enhanced use in practical settings. Emphasis is given to both theory and implementation of the methods available. Other considerations are offered on parallelization, solution pools, and tuning tools, culminating with some concluding remarks on computational performance vis-à-vis business intelligence applications with a view toward perspective for future work in this area. 
BibTeX: @incollection{Khor2021, author = {Cheng Seong Khor}, title = {Recent Advancements in Commercial Integer Optimization Solvers for Business Intelligence Applications}, booktitle = {E-Business - Higher Education and Intelligence Applications}, publisher = {IntechOpen}, year = {2021}, doi = {10.5772/intechopen.93416} }  Khosla M and Anand A (2021), "Revisiting the Auction Algorithm for Weighted Bipartite Perfect Matchings", January, 2021. [Abstract] [BibTeX] Abstract: We study the classical weighted perfect matchings problem for bipartite graphs or sometimes referred to as the assignment problem, i.e., given a weighted bipartite graph G = (U ∪ V, E) with weights w : E → R we are interested to find the maximum matching in G with the minimum/maximum weight. In this work we present a new and arguably simpler analysis of one of the earliest techniques developed for solving the assignment problem, namely the auction algorithm. Using our analysis technique we present tighter and improved bounds on the runtime complexity for finding an approximate minimum weight perfect matching in k-left regular sparse bipartite graphs. BibTeX: @article{Khosla2021, author = {Megha Khosla and Avishek Anand}, title = {Revisiting the Auction Algorithm for Weighted Bipartite Perfect Matchings}, year = {2021} }  Khouja R, Khalil H and Mourrain B (2021), "Riemannian Newton optimization methods for the symmetric tensor approximation problem", Linear Algebra and its Applications., December, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The Symmetric Tensor Approximation problem (STA) consists of approximating a symmetric tensor or a homogeneous polynomial by a linear combination of symmetric rank-1 tensors or powers of linear forms of low symmetric rank. 
We present two new Riemannian Newton-type methods for low rank approximation of symmetric tensor with complex coefficients. The first method uses the parametrization of the set of tensors of rank at most r by weights and unit vectors. Exploiting the properties of the apolar product on homogeneous polynomials combined with efficient tools from complex optimization, we provide an explicit and tractable formulation of the Riemannian gradient and Hessian, leading to Newton iterations with local quadratic convergence. We prove that under some regularity conditions on non-defective tensors in the neighborhood of the initial point, the Newton iteration (completed with a trust-region scheme) is converging to a local minimum. The second method is a Riemannian Gauss--Newton method on the Cartesian product of Veronese manifolds. Explicit orthonormal basis of the tangent space of this Riemannian manifold is described. We deduce the Riemannian gradient and the Gauss--Newton approximation of the Riemannian Hessian. We present a new retraction operator on the Veronese manifold. We analyze the numerical behavior of these methods, with an initial point provided by Simultaneous Matrix Diagonalisation (SMD). Numerical experiments show the good numerical behavior of the two methods in different cases and in comparison with existing state-of-the-art methods. BibTeX: @article{Khouja2021, author = {Rima Khouja and Houssam Khalil and Bernard Mourrain}, title = {Riemannian Newton optimization methods for the symmetric tensor approximation problem}, journal = {Linear Algebra and its Applications}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.laa.2021.12.008} }  Kim Y, Pacaud F, Kim K and Anitescu M (2021), "Leveraging GPU batching for scalable nonlinear programming through massive Lagrangian decomposition", June, 2021. 
[Abstract] [BibTeX] Abstract: We present the implementation of a trust-region Newton algorithm ExaTron for bound-constrained nonlinear programming problems, fully running on multiple GPUs. Without data transfers between CPU and GPU, our implementation has achieved the elimination of a major performance bottleneck under a memory-bound situation, particularly when solving many small problems in batch. We discuss the design principles and implementation details for our kernel function and core operations. Different design choices are justified by numerical experiments. By using the application of distributed control of alternating current optimal power flow, where a large problem is decomposed into many smaller nonlinear programs using a Lagrangian approach, we demonstrate computational performance of ExaTron on the Summit supercomputer at Oak Ridge National Laboratory. Our numerical results show the linear scaling with respect to the batch size and the number of GPUs and more than 35 times speedup on 6 GPUs than on 40 CPUs available on a single node. BibTeX: @article{Kim2021, author = {Youngdae Kim and François Pacaud and Kibaek Kim and Mihai Anitescu}, title = {Leveraging GPU batching for scalable nonlinear programming through massive Lagrangian decomposition}, year = {2021} }  Klein C and Strzodka R (2021), "Tridiagonal GPU Solver with Scaled Partial Pivoting at Maximum Bandwidth", In 50th International Conference on Parallel Processing., August, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: Partial pivoting is the method of choice to ensure stability in matrix factorizations performed on CPUs. For sparse matrices, this has not been implemented on GPUs so far because of problems with data-dependent execution flow. This work incorporates scaled partial pivoting into a tridiagonal GPU solver in such a fashion that despite the data-dependent decisions no SIMD divergence occurs. 
The cost of the computation is completely hidden behind the data movement which itself runs at maximum bandwidth. Therefore, the cost of the tridiagonal GPU solver is no more than the minimally required data movement. For large single precision systems with 2^25 unknowns, speedups of 5 are reported in comparison to the numerically stable tridiagonal solver (gtsv2) of cuSPARSE. The proposed tridiagonal solver is also evaluated as a preconditioner for Krylov solvers of large sparse linear equation systems. As expected it performs best for problems with strong anisotropies. BibTeX: @inproceedings{Klein2021, author = {Christoph Klein and Robert Strzodka}, title = {Tridiagonal GPU Solver with Scaled Partial Pivoting at Maximum Bandwidth}, booktitle = {50th International Conference on Parallel Processing}, publisher = {ACM}, year = {2021}, doi = {10.1145/3472456.3472484} }  Kolev T, Fischer P, Min M, Dongarra J, Brown J, Dobrev V, Warburton T, Tomov S, Shephard MS, Abdelfattah A, Barra V, Beams N, Camier J-S, Chalmers N, Dudouit Y, Karakus A, Karlin I, Kerkemeier S, Lan Y-H, Medina D, Merzari E, Obabko A, Pazner W, Rathnayake T, Smith CW, Spies L, Swirydowicz K, Thompson J, Tomboulides A and Tomov V (2021), "Efficient exascale discretizations: High-order finite element methods", The International Journal of High Performance Computing Applications., June, 2021. , pp. 109434202110208. SAGE Publications. [Abstract] [BibTeX] [DOI] Abstract: Efficient exploitation of exascale architectures requires rethinking of the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra fine-grain parallelism and maximize the ratio of floating point operations to energy intensive data movement. 
One of the few viable approaches to achieve high efficiency in the area of PDE discretizations on unstructured grids is to use matrix-free/partially assembled high-order finite element methods, since these methods can increase the accuracy and/or lower the computational time due to reduced data motion. In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps and discretization libraries and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, high-order data visualization, and list examples of collaborations with several ECP and external applications. 
BibTeX: @article{Kolev2021, author = {Tzanio Kolev and Paul Fischer and Misun Min and Jack Dongarra and Jed Brown and Veselin Dobrev and Tim Warburton and Stanimire Tomov and Mark S Shephard and Ahmad Abdelfattah and Valeria Barra and Natalie Beams and Jean-Sylvain Camier and Noel Chalmers and Yohann Dudouit and Ali Karakus and Ian Karlin and Stefan Kerkemeier and Yu-Hsiang Lan and David Medina and Elia Merzari and Aleksandr Obabko and Will Pazner and Thilina Rathnayake and Cameron W Smith and Lukas Spies and Kasia Swirydowicz and Jeremy Thompson and Ananias Tomboulides and Vladimir Tomov}, title = {Efficient exascale discretizations: High-order finite element methods}, journal = {The International Journal of High Performance Computing Applications}, publisher = {SAGE Publications}, year = {2021}, pages = {109434202110208}, doi = {10.1177/10943420211020803} }  Kong W (2021), "Accelerated Inexact First-Order Methods for Solving Nonconvex Composite Optimization Problems", April, 2021. [Abstract] [BibTeX] Abstract: This thesis focuses on developing and analyzing accelerated and inexact first-order methods for solving or finding stationary points of various nonconvex composite optimization (NCO) problems. Our main tools mainly come from variational and convex analysis, and our key results are in the form of iteration complexity bounds and how these bounds compare to other ones in the literature. BibTeX: @article{Kong2021, author = {Weiwei Kong}, title = {Accelerated Inexact First-Order Methods for Solving Nonconvex Composite Optimization Problems}, year = {2021} }  Konshin I and Terekhov K (2021), "Sparse System Solution Methods for Complex Problems", In Proceedings of the International Conference on Parallel Computing Technologies. , pp. 53-73. [Abstract] [BibTeX] Abstract: Sparse system solution methods (S^3 M) is a collection of interoperable linear solvers and preconditioners organized into a C++ header-only library. 
The current set of methods in the collection span both rather traditional Krylov space acceleration methods and smoothers as well as advanced incomplete factorization methods and rescaling and reordering methods. The methods can be integrated into algebraic multigrid and multi-stage fashion to construct solution strategies for complex linear systems that originate from coupled multi-physics problems. Several examples are considered in this work, that includes Constrained Pressure Residual (CPR) multi-stage strategy for oil & gas problem and Schur complement method for the system obtained with mimetic finite difference discretization for anisotropic diffusion problem. BibTeX: @inproceedings{Konshin2021, author = {Igor Konshin and Kirill Terekhov}, title = {Sparse System Solution Methods for Complex Problems}, booktitle = {Proceedings of the International Conference on Parallel Computing Technologies}, year = {2021}, pages = {53--73} }  Kopczewska K (2021), "Spatial machine learning: new opportunities for regional science", The Annals of Regional Science., December, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: This paper is a methodological guide to using machine learning in the spatial context. It provides an overview of the existing spatial toolbox proposed in the literature: unsupervised learning, which deals with clustering of spatial data, and supervised learning, which displaces classical spatial econometrics. It shows the potential of using this developing methodology, as well as its pitfalls. It catalogues and comments on the usage of spatial clustering methods (for locations and values, both separately and jointly) for mapping, bootstrapping, cross-validation, GWR modelling and density indicators. It provides details of spatial machine learning models, which are combined with spatial data integration, modelling, model fine-tuning and predictions to deal with spatial autocorrelation and big data. 
The paper delineates “already available” and “forthcoming” methods and gives inspiration for transplanting modern quantitative methods from other thematic areas to research in regional science. BibTeX: @article{Kopczewska2021, author = {Katarzyna Kopczewska}, title = {Spatial machine learning: new opportunities for regional science}, journal = {The Annals of Regional Science}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s00168-021-01101-x} }  Korkmaz E, Faverge M, Pichon G and Ramet P (2021), "Deciding Non-Compressible Blocks in Sparse Direct Solvers using Incomplete Factorization". Thesis at: Inria Bordeaux -- Sud Ouest. [Abstract] [BibTeX] [URL] Abstract: Low-rank compression techniques are very promising for reducing memory footprint and execution time on a large spectrum of linear solvers. Sparse direct supernodal approaches are one of these techniques. However, despite providing a very good scalability and reducing the memory footprint, they suffer from an important flops overhead in their unstructured low-rank updates. As a consequence, the execution time is not improved as expected. In this paper, we study a solution to improve low-rank compression techniques in sparse supernodal solvers. The proposed method tackles the overprice of the low-rank updates by identifying the blocks that have poor compression rates. We show that block incomplete LU factorization, thanks to the block fill-in levels, allows to identify most of these non-compressible blocks at low cost. This identification enables to postpone the low-rank compression step to trade small extra memory consumption for a better time to solution. The solution is validated within the PaStiX library with a large set of application matrices. It demonstrates sequential and multi-threaded speedup up to 8.5×, for small memory overhead of less than 1.49× with respect to the original version. 
BibTeX: @techreport{Korkmaz2021, author = {Esragul Korkmaz and Mathieu Faverge and Grégoire Pichon and Pierre Ramet}, title = {Deciding Non-Compressible Blocks in Sparse Direct Solvers using Incomplete Factorization}, school = {Inria Bordeaux -- Sud Ouest}, year = {2021}, url = {https://hal.inria.fr/hal-03152932/file/RR-9396.pdf} }  Kosaian J and Rashmi KV (2021), "Arithmetic-Intensity-Guided Fault Tolerance for Neural Network Inference on GPUs", April, 2021. [Abstract] [BibTeX] Abstract: Neural networks (NNs) are increasingly employed in domains that require high reliability, such as scientific computing and safety-critical systems, as well as in environments more prone to unreliability (e.g., soft errors), such as on spacecraft. As recent work has shown that faults in NN inference can lead to mispredictions and safety hazards, it is critical to impart fault tolerance to NN inference. Algorithm-based fault tolerance (ABFT) is emerging as an appealing approach for efficient fault tolerance in NNs. In this work, we identify new, unexploited opportunities for low-overhead ABFT for NN inference: current inference-optimized GPUs have high compute-to-memory-bandwidth ratios, while many layers of current and emerging NNs have low arithmetic intensity. This leaves many convolutional and fully-connected layers in NNs memory-bandwidth-bound. These layers thus exhibit stalls in computation that could be filled by redundant execution, but that current approaches to ABFT for NN inference cannot exploit. To reduce execution-time overhead for such memory-bandwidth-bound layers, we first investigate thread-level ABFT schemes for inference-optimized GPUs that exploit this fine-grained compute underutilization. 
We then propose intensity-guided ABFT, an adaptive, arithmetic-intensity-guided approach to ABFT that selects the best ABFT scheme for each individual layer between traditional approaches to ABFT, which are suitable for compute-bound layers, and thread-level ABFT, which is suitable for memory-bandwidth-bound layers. Through this adaptive approach, intensity-guided ABFT reduces execution-time overhead by 1.09--5.3× across a variety of NNs, lowering the cost of fault tolerance for current and future NN inference workloads. BibTeX: @article{Kosaian2021, author = {Jack Kosaian and K. V. Rashmi}, title = {Arithmetic-Intensity-Guided Fault Tolerance for Neural Network Inference on GPUs}, year = {2021} }  Kozický C and Šimeček I (2021), "Joint direct and transposed sparse matrix-vector multiplication for multithreaded CPUs", Concurrency and Computation: Practice and Experience., February, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Repeatedly performing sparse matrix‐vector multiplication (SpMV) followed by transposed sparse matrix‐vector multiplication (SpM^TV) with the same matrix is a part of several algorithms, for example, the Lanczos biorthogonalization algorithm and the biconjugate gradient method. Such algorithms can benefit from combining parallel SpMV and SpM^TV into a single operation we call joint direct and transposed sparse matrix‐vector multiplication (SpMM^TV). In this article, we present a parallel SpMM^TV algorithm for shared‐memory CPUs. The algorithm uses a sparse matrix format that divides the stored matrix into sparse matrix blocks and compresses the row and column indices of the matrix. This sparse matrix format can be also used for SpMV, SpM^TV, and similar sparse matrix‐vector operations. We expand upon existing research by suggesting new variants of the parallel SpMM^TV algorithm and by extending the algorithm to efficiently support symmetric matrices. 
We compare the performance of the presented parallel SpMM^TV algorithm with alternative approaches, which use state‐of‐the‐art sparse matrix formats and libraries, using sparse matrices from real‐world applications. The performance results indicate that the median performance of our proposed parallel SpMM^TV algorithm is up to 45% higher than of the alternative approaches. BibTeX: @article{Kozicky2021, author = {Claudio Kozický and Ivan Šimeček}, title = {Joint direct and transposed sparse matrix-vector multiplication for multithreaded CPUs}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6236} }  Krasnopolsky B and Medvedev A (2021), "XAMG: A library for solving linear systems with multiple right-hand side vectors", March, 2021. [Abstract] [BibTeX] Abstract: This paper presents the XAMG library for solving large sparse systems of linear algebraic equations with multiple right-hand side vectors. The library specializes but is not limited to the solution of linear systems obtained from the discretization of elliptic differential equations. A corresponding set of numerical methods includes Krylov subspace, algebraic multigrid, Jacobi, Gauss-Seidel, and Chebyshev iterative methods. The parallelization is implemented with MPI+POSIX shared memory hybrid programming model, which introduces a three-level hierarchical decomposition using the corresponding per-level synchronization and communication primitives. The code contains a number of optimizations, including the multilevel data segmentation, compression of indices, mixed-precision floating-point calculations, vector status flags, and others. The XAMG library uses the program code of the well-known hypre library to construct the multigrid matrix hierarchy. The XAMG's own implementation for the solve phase of the iterative methods provides up to a twofold speedup compared to hypre for the tests performed. 
Additionally, XAMG provides extended functionality to solve systems with multiple right-hand side vectors. BibTeX: @article{Krasnopolsky2021, author = {Boris Krasnopolsky and Alexey Medvedev}, title = {XAMG: A library for solving linear systems with multiple right-hand side vectors}, year = {2021} }  Kronbichler M, Sashko D and Munch P (2021), "Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations", The International Journal of High Performance Computing Applications. [Abstract] [BibTeX] Abstract: This work presents a variant of the conjugate gradient (CG) method with minimal memory access for the vector operations, targeting high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. The algorithm relies on a data-dependency analysis and interleaves the vector updates and inner products in a CG iteration with the matrix-vector product. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow RAM memory exactly once per iteration, with all additional access hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that the new method is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance near the strong scaling limit. BibTeX: @article{Kronbichler2021, author = {Martin Kronbichler and Dmytro Sashko and Peter Munch}, title = {Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations}, journal = {The International Journal of High Performance Computing Applications}, year = {2021} }  Kronqvist J, Misener R and Tsay C (2021), "Between steps: Intermediate relaxations between big-M and convex hull formulations", January, 2021. 
[Abstract] [BibTeX] Abstract: This work develops a class of relaxations in between the big-M and convex hull formulations of disjunctions, drawing advantages from both. The proposed "P-split" formulations split convex additively separable constraints into P partitions and form the convex hull of the partitioned disjuncts. Parameter P represents the trade-off of model size vs. relaxation strength. We examine the novel formulations and prove that, under certain assumptions, the relaxations form a hierarchy starting from a big-M equivalent and converging to the convex hull. We computationally compare the proposed formulations to big-M and convex hull formulations on a test set including: K-means clustering, P_ball problems, and ReLU neural networks. The computational results show that the intermediate P-split formulations can form strong outer approximations of the convex hull with fewer variables and constraints than the extended convex hull formulations, giving significant computational advantages over both the big-M and convex hull. BibTeX: @article{Kronqvist2021, author = {Jan Kronqvist and Ruth Misener and Calvin Tsay}, title = {Between steps: Intermediate relaxations between big-M and convex hull formulations}, year = {2021} }  Kwasniewski G, Kabić M, Ben-Nun T, Ziogas AN, Saethre JE, Gaillard A, Schneider T, Besta M, Kozhevnikov A, VandeVondele J and Hoefler T (2021), "On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations", Published at Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November, 2021(SC'21)., August, 2021. [Abstract] [BibTeX] [DOI] Abstract: Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. 
We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library. BibTeX: @article{Kwasniewski2021, author = {Grzegorz Kwasniewski and Marko Kabić and Tal Ben-Nun and Alexandros Nikolaos Ziogas and Jens Eirik Saethre and André Gaillard and Timo Schneider and Maciej Besta and Anton Kozhevnikov and Joost VandeVondele and Torsten Hoefler}, title = {On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations}, journal = {Published at Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November, 2021(SC'21)}, year = {2021}, doi = {10.1145/3458817.3476167} }  Lang C and Yang X-T (2021), "On the Numerical Evaluation of Invertible Preconditioners Based on High-Order Discretizations of the Laplace Operator for Linear Systems", October, 2021. Research Square Platform LLC. [Abstract] [BibTeX] [DOI] Abstract: To alleviate the ill-posed condition of linear systems, the discrete Laplace operator is often used as a preconditioner incorporated with iterative methods. 
However, with a traditional lower-order (typically, second-order) approximation of the Laplace operator it is often difficult to achieve the optimal effect. For this reason, we construct preconditioners based on high-order finite difference discretizations to further develop the potential of the discrete Laplace operator and evaluate their numerical efficiency. The sparse band structure and symmetric property of such high-order preconditioners are derived from the theoretical aspect, revealing the cheap computing cost of the corresponding preconditioning processes for each problem dimension. Numerical experiments are implemented to confirm our analysis and the computing results show advantages of the proposed preconditioned iterative methods compared with the classical methods. BibTeX: @article{Lang2021, author = {Chao Lang and Xiao-Ting Yang}, title = {On the Numerical Evaluation of Invertible Preconditioners Based on High-Order Discretizations of the Laplace Operator for Linear Systems}, publisher = {Research Square Platform LLC}, year = {2021}, doi = {10.21203/rs.3.rs-959956/v1} }  Leplat V, Nesterov Y, Gillis N and Glineur F (2021), "Exact Nonnegative Matrix Factorization via Conic Optimization", May, 2021. [Abstract] [BibTeX] Abstract: In this paper, we present two new approaches for computing exact nonnegative matrix factorizations (NMFs). Exact NMF can be defined as follows: given an input nonnegative matrix V ∊ ℝ_+^F × N and a factorization rank K, compute, if possible, two nonnegative matrices, W ∊ ℝ_+^F × K and H ∊ ℝ_+^K × N, such that V=WH. The two proposed approaches to tackle exact NMF, which is NP-hard in general, rely on the same two steps. First, we reformulate exact NMF as minimizing a concave function over a product of convex cones; one approach is based on the exponential cone, and the other on the second-order cone. 
Second, we solve these reformulations iteratively: at each step, we minimize exactly, over the feasible set, a majorization of the objective functions obtained via linearization at the current iterate. Hence these subproblems are convex conic programs and can be solved efficiently using dedicated algorithms. We show that our approaches, which we call successive conic convex approximations, are able to compute exact NMFs, and compete favorably with the state of the art when applied to several classes of nonnegative matrices; namely, randomly generated, infinite rigid and slack matrices. BibTeX: @article{Leplat2021, author = {Valentin Leplat and Yurii Nesterov and Nicolas Gillis and François Glineur}, title = {Exact Nonnegative Matrix Factorization via Conic Optimization}, year = {2021} }  Lestandi L (2021), "Numerical Study of Low Rank Approximation Methods for Mechanics Data and Its Analysis", Journal of Scientific Computing., February, 2021. Vol. 87(1) Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: This paper proposes a comparison of the numerical aspect and efficiency of several low rank approximation techniques for multidimensional data, namely CPD, HOSVD, TT-SVD, RPOD, QTT-SVD and HT. This approach is different from the numerous papers that compare the theoretical aspects of these methods or propose efficient implementation of a single technique. Here, after a brief presentation of the studied methods, they are tested in practical conditions in order to draw hindsight at which one should be preferred. Synthetic data provides sufficient evidence for dismissing CPD, T-HOSVD and RPOD. Then, three examples from mechanics provide data for realistic application of TT-SVD and ST-HOSVD. The obtained low rank approximation provides different levels of compression and accuracy depending on how separable the data is. 
In all cases, the data layout has significant influence on the analysis of modes and computing time while remaining similarly efficient at compressing information. Both methods provide satisfactory compression, from 0.1% to 20% of the original size within a few percent error in L^2 norm. ST-HOSVD provides an orthonormal basis while TT-SVD doesn’t. QTT is performing well only when one dimension is very large. A final experiment is applied to an order 7 tensor with (4 × 8 × 8 × 64 × 64 × 64 × 64) entries (32 GB) from complex multi-physics experiment. In that case, only HT provides actual compression (50%) due to the low separability of this data. However, it is better suited for higher order d. Finally, these numerical tests have been performed with pydecomp, an open source python library developed by the author. BibTeX: @article{Lestandi2021, author = {Lucas Lestandi}, title = {Numerical Study of Low Rank Approximation Methods for Mechanics Data and Its Analysis}, journal = {Journal of Scientific Computing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, volume = {87}, number = {1}, doi = {10.1007/s10915-021-01421-2} }  Li Z, Menon H, Mohror K, Bremer P-T, Livant Y and Pascucci V (2021), "Understanding a program's resiliency through error propagation", In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming., February, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), which represents a serious problem; a fault introducing an error that is not readily detected into an HPC simulation. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. 
Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this solution does not provide any feedback on the code regions without samples. In this research, we develop a method to systematically analyze all fault injection sites in an application with a low number of fault injection experiments. We use fault propagation data from a fault injection experiment to predict the resiliency of other untested fault sites and obtain an approximate fault tolerance threshold value for each site, which represents the largest error that can be introduced at the site without incurring incorrect simulation results. We define the collection of threshold values over all fault sites in the program as a fault tolerance boundary and propose a simple but efficient method to approximate the boundary. In our experiments, we show our method reduces the number of fault injection samples required to understand a program's resiliency by several orders of magnitude when compared with a traditional fault injection study. BibTeX: @inproceedings{Li2021, author = {Zhimin Li and Harshitha Menon and Kathryn Mohror and Peer-Timo Bremer and Yarden Livant and Valerio Pascucci}, title = {Understanding a program's resiliency through error propagation}, booktitle = {Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, publisher = {ACM}, year = {2021}, doi = {10.1145/3437801.3441589} }  Li J and Cai X-C (2021), "Summation pollution of principal component analysis and an improved algorithm for location sensitive data", Numerical Linear Algebra with Applications., March, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Principal component analysis (PCA) is widely used for dimensionality reduction and unsupervised learning. The reconstruction error is sometimes large even when a large number of eigenmode is used. 
In this paper, we show that this unexpected error source is the pollution effect of a summation operation in the objective function of the PCA algorithm. The summation operator brings together unrelated parts of the data into the same optimization and the result is the reduction of the accuracy of the overall algorithm. We introduce a domain decomposed PCA that improves the accuracy, and surprisingly also increases the parallelism of the algorithm. To demonstrate the accuracy and parallel efficiency of the proposed algorithm, we consider three applications including a face recognition problem, a brain tumor detection problem using two‐ and three‐dimensional MRI images. BibTeX: @article{Li2021a, author = {Jingwei Li and Xiao-Chuan Cai}, title = {Summation pollution of principal component analysis and an improved algorithm for location sensitive data}, journal = {Numerical Linear Algebra with Applications}, publisher = {Wiley}, year = {2021}, doi = {10.1002/nla.2370} }  Li Y, Xu W and Gao X (2021), "Graphical-model based high dimensional generalized linear models", Electronic Journal of Statistics., January, 2021. Vol. 15(1) Institute of Mathematical Statistics. [Abstract] [BibTeX] [DOI] Abstract: We consider the problem of both prediction and model selection in high dimensional generalized linear models. Predictive performance can be improved by leveraging structure information among predictors. In this paper, a graphic model-based doubly sparse regularized estimator is discussed under the high dimensional generalized linear models, that utilizes the graph structure among the predictors. The graphic information among predictors is incorporated node-by-node using a decomposed representation and the sparsity is encouraged both within and between the decomposed components. We propose an efficient iterative proximal algorithm to solve the optimization problem. 
Statistical convergence rates and selection consistency for the doubly sparse regularized estimator are established in the ultra-high dimensional setting. Specifically, we allow the dimensionality grows exponentially with the sample size. We compare the estimator with existing methods through numerical analysis on both simulation study and a microbiome data analysis. BibTeX: @article{Li2021b, author = {Yaguang Li and Wei Xu and Xin Gao}, title = {Graphical-model based high dimensional generalized linear models}, journal = {Electronic Journal of Statistics}, publisher = {Institute of Mathematical Statistics}, year = {2021}, volume = {15}, number = {1}, doi = {10.1214/21-ejs1831} }  Li R, Sjögreen B and Yang UM (2021), "A New Class of AMG Interpolation Methods Based on Matrix-Matrix Multiplications", SIAM Journal on Scientific Computing., July, 2021. , pp. S540-S564. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: A new class of distance-two interpolation methods for algebraic multigrid (AMG) that can be formulated in terms of sparse matrix-matrix multiplications is presented and analyzed. Compared with similar distance-two prolongation operators [H. De Sterck et al., Numer. Linear Algebra Appl., 15 (2008), pp. 115--139], the proposed algorithms exhibit improved efficiency and portability to various computing platforms, since they allow one to easily exploit existing high-performance sparse matrix kernels. The new interpolation methods have been implemented in hypre [R. D. Falgout and U. M. Yang, hypre: A library of high performance preconditioners, in Computational Science --- ICCS 2002, P. M. A. Sloot et al., eds., Springer, Berlin, Heidelberg, 2002, pp. 632--641], a widely used parallel multigrid solver library. With the proposed interpolations, the overall time of hypre's BoomerAMG setup can be considerably reduced, while sustaining equivalent, sometimes improved, convergence rates. 
Numerical results for a variety of test problems on parallel machines are presented that support the superiority of the proposed interpolation operators over the existing ones in hypre. BibTeX: @article{Li2021c, author = {Ruipeng Li and Björn Sjögreen and Ulrike Meier Yang}, title = {A New Class of AMG Interpolation Methods Based on Matrix-Matrix Multiplications}, journal = {SIAM Journal on Scientific Computing}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, pages = {S540--S564}, doi = {10.1137/20m134931x} }  Li X, Cai L, Li J, Yu CKW and Hu Y (2021), "A survey of clustering methods via optimization methodology" [Abstract] [BibTeX] Abstract: Clustering is one of fundamental tasks in unsupervised learning and plays a very important role in various application areas. This paper aims to present a survey of five types of clustering methods in the perspective of optimization methodology, including center-based methods, convex clustering, spectral clustering, subspace clustering, and optimal transport based clustering. The connection between optimization methodology and clustering algorithms is not only helpful to advance the understanding of the principle and theory of existing clustering algorithms, but also useful to inspire new ideas of efficient clustering algorithms. Preliminary numerical experiments of various clustering algorithms for datasets of various shapes are provided to show the preference and specificity of each algorithm. BibTeX: @article{Li2021d, author = {Xiaotian Li and Linju Cai and Jingchao Li and Carisa Kwok Wai Yu and Yaohua Hu}, title = {A survey of clustering methods via optimization methodology}, year = {2021} }  Li C, Xia T, Zhao W, Zheng N and Ren P (2021), "SpV8: Pursuing Optimal Vectorization and Regular Computation Pattern in SpMV", In 2021 58th ACM/IEEE Design Automation Conference (DAC)., December, 2021. IEEE. 
[Abstract] [BibTeX] [DOI] Abstract: Sparse Matrix-Vector Multiplication (SpMV) plays an important role in many scientific and industry applications, and remains a well-known challenge due to the high sparsity and irregularity. Most existing researches on SpMV try to pursue high vectorization efficiency. However, such approaches may suffer from non-negligible speculation penalty due to their irregular computation patterns. In this paper, we propose SpV8, a novel approach that optimizes both speculation and vectorization in SpMV. Specifically, SpV8 analyzes data distribution in different matrices and row panels, and accordingly applies optimization method that achieves the maximal vectorization with regular computation patterns. We evaluate SpV8 on Intel Xeon CPU and compare with multiple state-of-art SpMV algorithms using 71 sparse matrices. The results show that SpV8 achieves up to 10× speedup (average 2.8×) against the standard MKL SpMV routine, and up to 2.4× speedup (average 1.4×) against the best existing approach. Moreover, SpMV features very low preprocessing overhead in all compared approaches, which indicates SpV8 is highly-applicable in real-world applications. BibTeX: @inproceedings{Li2021e, author = {Chenyang Li and Tian Xia and Wenzhe Zhao and Nanning Zheng and Pengju Ren}, title = {SpV8: Pursuing Optimal Vectorization and Regular Computation Pattern in SpMV}, booktitle = {2021 58th ACM/IEEE Design Automation Conference (DAC)}, publisher = {IEEE}, year = {2021}, doi = {10.1109/dac18074.2021.9586251} }  Li C-X and Wu S-L (2022), "A SHSS–SS iteration method for non-Hermitian positive definite linear systems", Results in Applied Mathematics., February, 2022. Vol. 13, pp. 100225. Elsevier BV. 
[Abstract] [BibTeX] [DOI] Abstract: In this paper, based on the single-step HSS (SHSS) method and shift-splitting (SS) method, we propose a SHSS--SS iteration method to solve the non-Hermitian positive definite linear systems by coupling the SHSS method with the SS method. Theoretical analysis shows that the SHSS--SS method is convergent under suitable conditions. Numerical experiments are reported to verify the efficiency of the SHSS--SS method; numerical comparisons show that the proposed SHSS--SS method is superior to the SHSS method and the SS method. BibTeX: @article{Li2022, author = {Cui-Xia Li and Shi-Liang Wu}, title = {A SHSS–SS iteration method for non-Hermitian positive definite linear systems}, journal = {Results in Applied Mathematics}, publisher = {Elsevier BV}, year = {2022}, volume = {13}, pages = {100225}, doi = {10.1016/j.rinam.2021.100225} }  Liang G, Tong Q, Zhu C and Bi J (2021), "Escaping Saddle Points with Stochastically Controlled Stochastic Gradient Methods", March, 2021. [Abstract] [BibTeX] Abstract: Stochastically controlled stochastic gradient (SCSG) methods have been proved to converge efficiently to first-order stationary points which, however, can be saddle points in nonconvex optimization. It has been observed that a stochastic gradient descent (SGD) step introduces anisotropic noise around saddle points for deep learning and non-convex half space learning problems, which indicates that SGD satisfies the correlated negative curvature (CNC) condition for these problems. Therefore, we propose to use a separate SGD step to help the SCSG method escape from strict saddle points, resulting in the CNC-SCSG method. The SGD step plays a role similar to noise injection but is more stable. We prove that the resultant algorithm converges to a second-order stationary point with a convergence rate of O(𝜖^{-2} log(1/𝜖)) where 𝜖 is the pre-specified error tolerance. 
This convergence rate is independent of the problem dimension, and is faster than that of CNC-SGD. A more general framework is further designed to incorporate the proposed CNC-SCSG into any first-order method for the method to escape saddle points. Simulation studies illustrate that the proposed algorithm can escape saddle points in much fewer epochs than the gradient descent methods perturbed by either noise injection or a SGD step. BibTeX: @article{Liang2021, author = {Guannan Liang and Qianqian Tong and Chunjiang Zhu and Jinbo Bi}, title = {Escaping Saddle Points with Stochastically Controlled Stochastic Gradient Methods}, year = {2021} }  Liang M, Zheng B and Zheng Y (2021), "A proximal point like method for solving tensor least-squares problems", Calcolo., November, 2021. Vol. 59(1) Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: The goal of this paper is to solve the tensor least-squares (TLS) problem associated with multilinear system A x^{m-1} = b, where A is an mth-order n-dimensional tensor and b is a vector in R^n, which has practical applications in numerical PDEs, data mining, tensor complementary problems, higher order statistics and so on. We transform the TLS problem into a multi-block optimization problem with consensus constraints, and propose an alternating linearized method with proximal regularization for it. Under some mild assumptions, it is shown that every limit point of the sequence generated by this method is a stationary point. Moreover, when the tensor A can be constructed in the tensor-train format explicitly, the total number of operations with respect to the method mentioned above decreases from the order of 𝒪(n^m-1) to 𝒪((m-1)^2 nr^2)+𝒪(mnr^3), alleviating the curse-of-dimensionality. As an application, the inverse iteration methods, derived from the proposed methods, for solving the tensor eigenvalue problems are presented. 
Some numerical examples are provided to illustrate the feasibility of our algorithms. BibTeX: @article{Liang2021a, author = {Maolin Liang and Bing Zheng and Yutao Zheng}, title = {A proximal point like method for solving tensor least-squares problems}, journal = {Calcolo}, publisher = {Springer Science and Business Media LLC}, year = {2021}, volume = {59}, number = {1}, doi = {10.1007/s10092-021-00450-5} }  Liao L-D, Li R-X and Wang X (2021), "A new iterative method for a class of linear system arising from image restoration problems", Results in Applied Mathematics., November, 2021. Vol. 12, pp. 100221. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: In this paper, by utilizing the matrix properties arising from the image restoration model, a new iterative method for solving the corresponding augmented linear system is proposed. Theoretical results about the convergence properties and computational advantage of the new method are studied in detail, showing that it just involves a matrix–vector product, which can be implemented by fast Fourier transform (FFT) or discrete Cosine transform (DCT) algorithms and can save much computation cost. Numerical experiments are provided, further confirm that our theoretical results is reliable and our method is feasible and effective. BibTeX: @article{Liao2021, author = {Li-Dan Liao and Rui-Xia Li and Xiang Wang}, title = {A new iterative method for a class of linear system arising from image restoration problems}, journal = {Results in Applied Mathematics}, publisher = {Elsevier BV}, year = {2021}, volume = {12}, pages = {100221}, doi = {10.1016/j.rinam.2021.100221} }  Lim JP, Aanjaneya M, Gustafson J and Nagarakatte S (2021), "An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants", In Proceedings of the ACM Conference on Programming Languages. Vol. 
5(29) [Abstract] [BibTeX] Abstract: Given the importance of floating point (FP) performance in numerous domains, several new variants of FP and its alternatives have been proposed (e.g., Bfloat16, TensorFloat32, and posits). These representations do not have correctly rounded math libraries. Further, the use of existing FP libraries for these new representations can produce incorrect results. This paper proposes a novel approach for generating polynomial approximations that can be used to implement correctly rounded math libraries. Existing methods generate polynomials that approximate the real value of an elementary function f(x) and produce wrong results due to approximation errors and rounding errors in the implementation. In contrast, our approach generates polynomials that approximate the correctly rounded value of f(x) (i.e., the value of f(x) rounded to the target representation). It provides more margin to identify efficient polynomials that produce correctly rounded results for all inputs. We frame the problem of generating efficient polynomials that produce correctly rounded results as a linear programming problem. Using our approach, we have developed correctly rounded, yet faster, implementations of elementary functions for multiple target representations. BibTeX: @inproceedings{Lim2021, author = {Jay P. Lim and Mridul Aanjaneya and John Gustafson and Santosh Nagarakatte}, title = {An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants}, booktitle = {Proceedings of the ACM Conference on Programming Languages}, year = {2021}, volume = {5}, number = {29} }  Liu J, Jiang J, Lin J and Wang J (2021), "Generalized Newton methods for graph signal matrix completion", Digital Signal Processing., February, 2021. , pp. 103009. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The matrix completion problem can be found in many applications such as classification, image inpainting and collaborative filtering. 
In recent years, the emerging field of graph signal processing (GSP) has shed new light on this problem, deriving the graph signal matrix completion problem which incorporates the correlation of data elements. The nuclear-norm based methods possess satisfactory recovery performance, while they suffer from high computational cost and usually have slow convergence rate. In this paper, we propose two new iterative algorithms for solving the nuclear-norm regularization based graph signal matrix completion (NRGSMC) problem. By adopting approximate diagonalization approaches to estimate singular value decomposition (SVD), we obtain two generalized Newton algorithms, the generalized Newton with truncated Jacobi method (GNTJM) and the generalized Newton with parallel truncated Jacobi method (GNPTJM). The proposed methods are with low complexity and fast convergence by using the second-order information associated with the problem. Numerical results on three real-world data sets demonstrate that our schemes have evidently faster convergence rate than the gradient method with exact SVD, while maintain the similar completion performance. BibTeX: @article{Liu2021, author = {Jinling Liu and Junzheng Jiang and Jiming Lin and Junyi Wang}, title = {Generalized Newton methods for graph signal matrix completion}, journal = {Digital Signal Processing}, publisher = {Elsevier BV}, year = {2021}, pages = {103009}, doi = {10.1016/j.dsp.2021.103009} }  Liu X, Liu Y, Dun M, Yin B, Yang H, Luan Z and Qian D (2021), "Accelerating Sparse Approximate Matrix Multiplication on GPUs", March, 2021. [Abstract] [BibTeX] Abstract: Although the matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for matrix multiplication of the near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms to fill the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. 
However, existing SpAMM algorithms fail to exploit the performance potential of GPUs for acceleration. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. Several performance optimizations have been proposed, including algorithm re-design to adapt to the thread parallelism, blocking strategies for memory access optimization, and the acceleration with the tensor core. In addition, we scale cuSpAMM to run on multiple GPUs with an effective load balance scheme. We evaluate cuSpAMM on both synthesized and real-world datasets on multiple GPUs. The experiment results show that cuSpAMM achieves significant performance speedup compared to vendor optimized cuBLAS and cuSPARSE libraries. BibTeX: @article{Liu2021a, author = {Xiaoyan Liu and Yi Liu and Ming Dun and Bohong Yin and Hailong Yang and Zhongzhi Luan and Depei Qian}, title = {Accelerating Sparse Approximate Matrix Multiplication on GPUs}, year = {2021} }  Liu Y, Du X and Ma S (2021), "Innovative study on clustering center and distance measurement of K-means algorithm: mapreduce efficient parallel algorithm based on user data of JD mall", Electronic Commerce Research., March, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: The traditional K-means algorithm is very sensitive to the selection of clustering centers and the calculation of distances, so the algorithm easily converges to a locally optimal solution. In addition, the traditional algorithm has slow convergence speed and low clustering accuracy, as well as memory bottleneck problems when processing massive data. Therefore, an improved K-means algorithm is proposed in this paper. In this algorithm, the selection of the initial points in the traditional clustering algorithm is improved first, and then a new global measure, the effective distance measure, is proposed. Its main idea is to calculate the effective distance between two data samples by sparse reconstruction. 
Finally, on the basis of the MapReduce framework, the efficiency of the algorithm is further improved by adjusting the Hadoop cluster. Based on the real customer data from the JD Mall dataset, this paper introduces the DBI, Rand and other indicators to evaluate the clustering effects of various algorithms. The results show that the proposed algorithm not only has good convergence and accuracy but also achieves better performances than those of other compared algorithms. BibTeX: @article{Liu2021b, author = {Yang Liu and Xinxin Du and Shuaifeng Ma}, title = {Innovative study on clustering center and distance measurement of K-means algorithm: mapreduce efficient parallel algorithm based on user data of JD mall}, journal = {Electronic Commerce Research}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10660-021-09458-z} }  Liu X, Wang J and Yuan X (2021), "Spectral Clustering Algorithm Based on OptiSim Selection", IAENG International Journal of Applied Mathematics. [Abstract] [BibTeX] Abstract: The spectral clustering (SC) method has a good clustering effect on arbitrary structure datasets because of its solid theoretical basis. However, the required time complexity is high, thus limiting the application of SC in big datasets. To reduce time complexity, we propose an SC algorithm based on OptiSim Selection (SCOSS) in this study. This new algorithm starts from selecting a representative subset by using an optimizable k-dissimilarity selection algorithm (OptiSim) and then uses the Nyström method to approximate the eigenvectors of the similarity matrix. Theoretical deductions and experiment results show that the proposed algorithm can use less clustering time to achieve a good clustering result. 
BibTeX: @article{Liu2021c, author = {Xuejuan Liu and Junguo Wang and Xiangying Yuan}, title = {Spectral Clustering Algorithm Based on OptiSim Selection}, journal = {IAENG International Journal of Applied Mathematics}, year = {2021} }  Liu X, Wen Z and Yuan Y-X (2021), "Subspace Methods for Nonlinear Optimization", CSIAM Transactions on Applied Mathematics. Vol. 2(4), pp. 585-651. [Abstract] [BibTeX] [URL] Abstract: Subspace techniques such as Krylov subspace methods have been well known and extensively used in numerical linear algebra. They are also ubiquitous and becoming indispensable tools in nonlinear optimization due to their ability to handle large scale problems. There are generally two types of principles: i) the decision variable is updated in a lower dimensional subspace; ii) the objective function or constraints are approximated in a certain smaller subspace of their domain. The key ingredients are the constructions of suitable subspaces and subproblems according to the specific structures of the variables and functions such that either the exact or inexact solutions of subproblems are readily available and the corresponding computational cost is significantly reduced. A few relevant techniques include, but are not limited to, direct combinations, block coordinate descent, active sets, limited-memory, Anderson acceleration, subspace correction, sampling and sketching. This paper gives a comprehensive survey on the subspace methods and their recipes in unconstrained and constrained optimization, nonlinear least squares problem, sparse and low rank optimization, linear and nonlinear eigenvalue computation, semidefinite programming, stochastic optimization, etc. 
In order to provide helpful guidelines, we emphasize high level concepts for the development and implementation of practical algorithms from the subspace framework. BibTeX: @article{Liu2021d, author = {Xin Liu and Zaiwen Wen and Ya-Xiang Yuan}, title = {Subspace Methods for Nonlinear Optimization}, journal = {CSIAM Transactions on Applied Mathematics}, year = {2021}, volume = {2}, number = {4}, pages = {585--651}, url = {https://doc.global-sci.org/uploads/Issue/CSIAM-AM/v2n4/24_585.pdf?1638174694} }  Long R (2021), "An Approach to Predicting Performance of Sparse Computations on Nvidia GPUs". Thesis at: The University of Texas at El Paso. [Abstract] [BibTeX] Abstract: Sparse problems arise from a variety of applications, from scientific simulations to graph analytics. Traditional HPC systems have failed to effectively provide high bandwidth for sparse problems. This limitation is primarily because of the nature of sparse computations and their irregular memory access patterns. We predict the performance of sparse computations given an input matrix and GPU hardware characteristics. This prediction is done by identifying hardware bottlenecks in modern NVIDIA GPUs using roofline trajectory models. Roofline trajectory models give us insight into the performance by simultaneously showing us the effects of strong and weak scaling. We then create regression models for our benchmarks to model performance metrics. 
The outputs of these models are compared against empirical results. We expect our results to be useful to application developers in understanding the performance of their sparse algorithms in GPUs and to hardware designers in fine-tuning GPU features to better meet the requirements of sparse applications. BibTeX: @phdthesis{Long2021, author = {Long, Rogelio}, title = {An Approach to Predicting Performance of Sparse Computations on Nvidia GPUs}, school = {The University of Texas at El Paso}, year = {2021} }  Lu Z (2021), "A sparse approximate inverse for triangular matrices based on Jacobi iteration", June, 2021. [Abstract] [BibTeX] Abstract: In this paper, we propose a sparse approximate inverse for triangular matrices (SAIT) based on Jacobi iteration. The main operation of the algorithm is matrix-matrix multiplication. We apply the SAIT to iterative methods with ILU preconditioners. Then the two triangular solvers in the ILU preconditioning procedure are replaced by two matrix-vector multiplications, which can be fine-grained parallelized. We test the new algorithm by solving some linear systems and eigenvalue problems. BibTeX: @article{Lu2021, author = {Zhongjie Lu}, title = {A sparse approximate inverse for triangular matrices based on Jacobi iteration}, year = {2021} }  Luo X-l and Xiao H (2021), "Generalized continuation Newton methods and the trust-region updating strategy for the underdetermined system", March, 2021. [Abstract] [BibTeX] Abstract: This paper considers the generalized continuation Newton method and the trust-region updating strategy for the underdetermined system of nonlinear equations. Moreover, in order to improve its computational efficiency, the new method uses a switching updating technique of the Jacobian matrix. That is to say, it does not compute the next Jacobian matrix and replaces it with the current Jacobian matrix when the linear approximation model of the merit function approximates it well. 
The numerical results show that the new method is more robust and faster than the traditional optimization method such as the Levenberg-Marquardt method (a variant of trust-region methods, the built-in subroutine fsolve.m of the MATLAB environment). The computational speed of the new method is about eight to fifty times as fast as that of fsolve. Furthermore, it also proves the global convergence and the local superlinear convergence of the new method under some standard assumptions. BibTeX: @article{Luo2021, author = {Xin-long Luo and Hang Xiao}, title = {Generalized continuation Newton methods and the trust-region updating strategy for the underdetermined system}, year = {2021} }  Luo H (2021), "Accelerated primal-dual methods for linearly constrained convex optimization problems", September, 2021. [Abstract] [BibTeX] Abstract: This work proposes an accelerated primal-dual dynamical system for affine constrained convex optimization and presents a class of primal-dual methods with nonergodic convergence rates. In continuous level, exponential decay of a novel Lyapunov function is established and in discrete level, implicit, semi-implicit and explicit numerical discretizations for the continuous model are considered sequentially and lead to new accelerated primal-dual methods for solving linearly constrained optimization problems. Special structures of the subproblems in those schemes are utilized to develop efficient inner solvers. In addition, nonergodic convergence rates in terms of primal-dual gap, primal objective residual and feasibility violation are proved via a tailored discrete Lyapunov function. Moreover, our method has also been applied to decentralized distributed optimization for fast and efficient solution. 
BibTeX: @article{Luo2021a, author = {Hao Luo}, title = {Accelerated primal-dual methods for linearly constrained convex optimization problems}, year = {2021} }  Luo L, Liu Y, Yang H and Qian D (2021), "Magas: matrix-based asynchronous graph analytics on shared memory systems", The Journal of Supercomputing., October, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Graph analytics plays an important role in many areas such as big data and artificial intelligence. The vertex-centric programming model provides friendly interfaces to programmers and is extensively used in graph processing frameworks. However, it is prone to generate many irregular memory accesses and scheduling overhead due to vertex-based execution and scheduling of programs in the backend. Instead, the matrix-based model provides a different approach by using high-performance matrix operations in the backend to improve the efficiency of graph processing. Unfortunately, current matrix-based frameworks only support the synchronous parallel model, which constrains its application to various graph algorithms. To address these problems, this paper proposes a graph processing framework, which combines matrix operations with the asynchronous model while providing friendly programming interfaces similar to vertex-centric programming model. Firstly, we propose an approach to map the vertex-based graph processing to matrix operations in the asynchronous model. Then, we propose two asynchronous scheduling policies, Gauss–Seidel policy and relaxed Gauss–Seidel policy, for different graph algorithms. After that, our framework applies the batch scheduling and optimized in-memory data structure to reduce the scheduling overhead introduced by the asynchronous model. 
Experimental results show that our framework performs better than the popular vertex programming frameworks such as GraphLab and GRACE in both performance and speedup and achieves similar performance compared to the BSP-based matrix framework such as GraphMat. BibTeX: @article{Luo2021b, author = {Le Luo and Yi Liu and Hailong Yang and Depei Qian}, title = {Magas: matrix-based asynchronous graph analytics on shared memory systems}, journal = {The Journal of Supercomputing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11227-021-04091-x} }  Lupu D and Necoara I (2021), "Convergence analysis of stochastic higher-order majorization-minimization algorithms", March, 2021. [Abstract] [BibTeX] Abstract: Majorization-minimization algorithms consist of successively minimizing a sequence of upper bounds of the objective function so that along the iterations the objective function decreases. Such a simple principle allows to solve a large class of optimization problems, even nonconvex, nonsmooth and stochastic. We present a stochastic higher-order algorithmic framework for minimizing the average of a very large number of sufficiently smooth functions. Our stochastic framework is based on the notion of stochastic higher-order upper bound approximations of the finite-sum objective function and minibatching. We present convergence guarantees for nonconvex and convex optimization when the higher-order upper bounds approximate the objective function up to an error that is p times differentiable and has a Lipschitz continuous p derivative. More precisely, we derive asymptotic stationary point guarantees for nonconvex problems, and for convex ones we establish local linear convergence results, provided that the objective function is uniformly convex. Unlike other higher-order methods, ours work with any batch size. 
Moreover, in contrast to most existing stochastic Newton and third-order methods, our approach guarantees local convergence faster than with first-order oracle and adapts to the problem's curvature. Numerical simulations also confirm the efficiency of our algorithmic framework. BibTeX: @article{Lupu2021, author = {Daniela Lupu and Ion Necoara}, title = {Convergence analysis of stochastic higher-order majorization-minimization algorithms}, year = {2021} }  Ma D, Orban D and Saunders MA (2021), "A Julia implementation of Algorithm NCL for constrained optimization", January, 2021. [Abstract] [BibTeX] [DOI] Abstract: Algorithm NCL is designed for general smooth optimization problems where first and second derivatives are available, including problems whose constraints may not be linearly independent at a solution (i.e., do not satisfy the LICQ). It is equivalent to the LANCELOT augmented Lagrangian method, reformulated as a short sequence of nonlinearly constrained subproblems that can be solved efficiently by IPOPT and KNITRO, with warm starts on each subproblem. We give numerical results from a Julia implementation of Algorithm NCL on tax policy models that do not satisfy the LICQ, and on nonlinear least-squares problems and general problems from the CUTEst test set. BibTeX: @article{Ma2021, author = {Ding Ma and Dominique Orban and Michael A. Saunders}, title = {A Julia implementation of Algorithm NCL for constrained optimization}, year = {2021}, doi = {10.13140/RG.2.2.29888.35841} }  Ma W, Hu Y, Yuan W and Liu X (2021), "Developing a Multi-GPU-Enabled Preconditioned GMRES with Inexact Triangular Solves for Block Sparse Matrices", Mathematical Problems in Engineering. [Abstract] [BibTeX] [DOI] Abstract: Solving triangular systems is the building block for preconditioned GMRES algorithm. Inexact preconditioning becomes attractive because of the feature of high parallelism on accelerators. 
In this paper, we propose and implement an iterative, inexact block triangular solve on multi-GPUs based on PETSc’s framework. In addition, by developing a distributed block sparse matrix-vector multiplication procedure and investigating the optimized vector operations, we form the multi-GPU-enabled preconditioned GMRES with the block Jacobi preconditioner. In the implementation, the GPU-Direct technique is employed to avoid host-device memory copies. The preconditioning step used by PETSc’s structure and the cuSPARSE library are also investigated for performance comparisons. The experiments show that the developed GMRES with inexact preconditioning on 8 GPUs can achieve up to 4.4x speedup over the CPU-only implementation with exact preconditioning using 8 MPI processes. BibTeX: @article{Ma2021a, author = {Wenpeng Ma and Yiwen Hu and Wu Yuan and Xiazhen Liu}, title = {Developing a Multi-GPU-Enabled Preconditioned GMRES with Inexact Triangular Solves for Block Sparse Matrices}, journal = {Mathematical Problems in Engineering}, year = {2021}, doi = {https://www.hindawi.com/journals/mpe/2021/6804723/} }  Manieri L, Falsone A and Prandini M (2021), "Hyper-graph partitioning for a multi-agent reformulation of large-scale MILPs", IEEE Control Systems Letters. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: This paper addresses the challenge of solving large-scale Mixed Integer Linear Programs (MILPs). A resolution scheme is proposed for the class of MILPs with a hidden constraint-coupled multi-agent structure. In particular, we focus on the problem of disclosing such a structure to then apply a computationally efficient decentralized optimization algorithm recently proposed in the literature. 
The multi-agent reformulation problem consists in manipulating the matrix defining the linear constraints of the MILP so as to put it in a singly-bordered block-angular form, where the blocks define local constraints and decision variables of the agents, whereas the border defines the coupling constraints. We translate the matrix reformulation problem into a hyper-graph partitioning problem and introduce a novel algorithm which accounts for the specific requirements on the singly-bordered block-angular form to best take advantage of the decentralized optimization approach. Numerical results show the effectiveness of the proposed hyper-graph partitioning algorithm. BibTeX: @article{Manieri2021, author = {Lucrezia Manieri and Alessandro Falsone and Maria Prandini}, title = {Hyper-graph partitioning for a multi-agent reformulation of large-scale MILPs}, journal = {IEEE Control Systems Letters}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/lcsys.2021.3093338} }  Markvardsen A, Rees T, Wathen M, Lister A, Odagiu P, Anuchitanukul A, Farmer T, Lim A, Montesino F, Snow T and McCluskey A (2021), "FitBenchmarking: an open source Python package comparing data fitting software", June, 2021. Vol. 6(62), pp. 3127. The Open Journal. [Abstract] [BibTeX] [DOI] Abstract: Fitting a mathematical model to data is a fundamental task across all scientific disciplines. FitBenchmarking has been designed to help: (1) scientists, who want to know the best algorithm for fitting their data to a given model using specific hardware; (2) scientific software developers, who want to identify the best fitting algorithms and implementations, which allows them to recommend a default solver, to see if it is worth adding a new minimizer, and to test their implementation; 
and (3) mathematicians and numerical software developers, who want to understand the types of problems on which current algorithms do not perform well, and to have a route to expose newly developed methods to users. Representatives of each of these communities have got together to build FitBenchmarking. We hope this tool will help foster fruitful interactions and collaborations across the disciplines. BibTeX: @article{Markvardsen2021, author = {Anders Markvardsen and Tyrone Rees and Michael Wathen and Andrew Lister and Patrick Odagiu and Atijit Anuchitanukul and Tom Farmer and Anthony Lim and Federico Montesino and Tim Snow and Andrew McCluskey}, title = {FitBenchmarking: an open source Python package comparing data fitting software}, publisher = {The Open Journal}, year = {2021}, volume = {6}, number = {62}, pages = {3127}, doi = {10.21105/joss.03127} }  Marrakchi S and Jemni M (2021), "Static Scheduling with Load Balancing for Solving Triangular Band Linear Systems on Multicore Processors", Fundamenta Informaticae. Vol. 179, pp. 35-58. IOS Press. [Abstract] [BibTeX] [DOI] Abstract: A new approach for solving triangular band linear systems is established in this study to balance the load and obtain a high degree of parallelism. Our investigation consists to attribute both adequate start time and processor to each task and eliminate the useless dependencies which are not used in the parallel solve stage. Thereby, processors execute in parallel their related tasks taking account of the considered precedence constraints. The theoretical lower bounds for parallel execution time and the number of processors required to carry out the task graph in the shortest time are determined. Experimentations are realized on a shared-memory multicore processor. The experimental results are fitted to the values derived from the determined mathematical formulas. 
The comparison of results obtained by our contribution with those from triangular systems resolution routine belonging to the library PLASMA, Parallel Linear Algebra Software for Multicore Architectures, confirms the efficiency of the proposed approach. BibTeX: @article{Marrakchi2021, author = {Sirine Marrakchi and Mohamed Jemni}, title = {Static Scheduling with Load Balancing for Solving Triangular Band Linear Systems on Multicore Processors}, journal = {Fundamenta Informaticae}, publisher = {IOS Press}, year = {2021}, volume = {179}, pages = {35-58}, doi = {10.3233/FI-2021-2012} }  Mei G, Tu J, Xiao L and Piccialli F (2021), "An efficient graph clustering algorithm by exploiting k-core decomposition and motifs", November, 2021. , pp. 107564. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Clustering analysis has been widely used in trust evaluation for various complex networks such as wireless sensor networks and online social networks. Spectral clustering is one of the most commonly used algorithms for graph-structured data (networks). However, conventional spectral clustering is inherently difficult to perform in large networks. In this paper, we proposed an efficient graph clustering algorithm, KCoreMotif, specifically for large networks by exploiting k-core decomposition and motifs. We first conducted the k-core decomposition of the large input network, then performed the motif-based spectral clustering for the top k-core subgraphs, and finally grouped the remaining vertices in the rest (k-1)-core subgraphs into previously found clusters to obtain the final clusters. Comparative results of 18 groups of real-world datasets demonstrated that KCoreMotif was accurate yet efficient for large networks, which also means it can be further used to evaluate the intra-cluster and inter-cluster trusts for large networks. 
BibTeX: @article{Mei2021, author = {Gang Mei and Jingzhi Tu and Lei Xiao and Francesco Piccialli}, title = {An efficient graph clustering algorithm by exploiting k-core decomposition and motifs}, publisher = {Elsevier BV}, year = {2021}, pages = {107564}, doi = {10.1016/j.compeleceng.2021.107564} }  Meng X, Xu F, Ye H and Cao F (2021), "The sparse factorization of nonnegative matrix in distributed network", Advances in Computational Intelligence., September, 2021. Vol. 1(5) Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: This paper proposes some distributed algorithms to solve the sparse factorization of a large-scale nonnegative matrix (SFNM). These distributed algorithms combine some merits of classical nonnegative matrix factorization (NMF) algorithms and distributed learning network. Our proposed algorithms utilize the whole nodes of network to solve a factorization problem of a nonnegative matrix; the fact is that per node copes with a part of the matrix, then uses the distributed average consensus (DAC) algorithm or regional nodes to communicate the parameters gained by each node to ensure them to be convergent or easy to calculation. Different from other existing distributed learning algorithms of NMF, which always need high-qualified hardware or complicated computing methods, our algorithms make a full use of the simplicity of traditional NMF algorithms and distributed thoughts. Some artificial datasets are used for testing these algorithms, and the experimental results with comparisons show that the proposed algorithms perform favorably in terms of accuracy and efficiency. 
BibTeX: @article{Meng2021, author = {Xinhong Meng and Fusheng Xu and Hailiang Ye and Feilong Cao}, title = {The sparse factorization of nonnegative matrix in distributed network}, journal = {Advances in Computational Intelligence}, publisher = {Springer Science and Business Media LLC}, year = {2021}, volume = {1}, number = {5}, doi = {10.1007/s43674-021-00009-5} }  Milaković S, Selvitopi O, Nisa I, Budimlić Z and Buluc A (2021), "Parallel Algorithms for Masked Sparse Matrix-Matrix Products", November, 2021. [Abstract] [BibTeX] Abstract: Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-product similarities. For an important class of algorithms, only a subset of the output entries are needed, and the resulting operation is known as Masked SpGEMM since a subset of the output entries is considered to be "masked out". Existing algorithms for Masked SpGEMM usually do not consider mask as part of multiplication and either first compute a regular SpGEMM followed by masking, or perform a sparse inner product only for output elements that are not masked out. In this work, we investigate various novel algorithms and data structures for this rather challenging and important computation, and provide guidelines on how to design a fast Masked-SpGEMM for shared-memory architectures. Our evaluations show that factors such as matrix and mask density, mask structure and cache behavior play a vital role in attaining high performance for Masked SpGEMM. We evaluate our algorithms on a large number of matrices using several real-world benchmarks and show that our algorithms in most cases significantly outperform the state of the art for Masked SpGEMM implementations. 
BibTeX: @article{Milakovic2021, author = {Srdjan Milaković and Oguz Selvitopi and Israt Nisa and Zoran Budimlić and Aydin Buluc}, title = {Parallel Algorithms for Masked Sparse Matrix-Matrix Products}, year = {2021} }  Miletto MC, Nesi LL, Schnorr LM and Legrand A (2021), "Performance Analysis of Irregular Task-Based Applications on Hybrid Platforms: Structure Matters" [Abstract] [BibTeX] [URL] Abstract: Efficiently exploiting computational resources in heterogeneous platforms is a real challenge which has motivated the adoption of the task-based programming paradigm where resource usage is dynamic and adaptive. Unfortunately, classical performance visualization techniques used in routine performance analysis often fail to provide any insight in this new context, especially when the application structure is irregular. In this paper, we propose several performance visualization techniques tailored for the analysis of task-based multifrontal sparse linear solvers whose structure is particularly complex. We show that by building on both a performance model of irregular tasks and on structure of the application (in particular the elimination tree), we can detect and highlight anomalies and understand resource utilization from the application point-of-view in a very insightful way. We validate these novel performance analysis techniques with the QR_mumps sparse parallel solver by describing a series of case studies where we identify and address non trivial performance issues thanks to our visualization methodology. BibTeX: @article{Miletto2021, author = {Marcelo Cogo Miletto and Lucas Leandro Nesi and Lucas Mello Schnorr and Arnaud Legrand}, title = {Performance Analysis of Irregular Task-Based Applications on Hybrid Platforms: Structure Matters}, year = {2021}, url = {https://hal.inria.fr/hal-03298021/document} }  Miyatake Y and Sogabe T (2021), "Adaptive projected SOR algorithms for nonnegative quadratic programming", December, 2021. 
[Abstract] [BibTeX] Abstract: The optimal value of the projected successive overrelaxation (PSOR) method for nonnegative quadratic programming problems is problem-dependent. We present a novel adaptive PSOR algorithm that adaptively controls the relaxation parameter using the Wolfe conditions. The method and its variants can be applied to various problems without requiring a specific assumption regarding the matrix defining the objective function, and the cost for updating the parameter is negligible in the whole iteration. Numerical experiments show that the proposed methods often perform comparably to (or sometimes superior to) the PSOR method with a nearly optimal relaxation parameter. BibTeX: @article{Miyatake2021, author = {Yuto Miyatake and Tomohiro Sogabe}, title = {Adaptive projected SOR algorithms for nonnegative quadratic programming}, year = {2021} }  Mlakar D, Winter M, Parger M and Steinberger M (2021), "Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures" [Abstract] [BibTeX] [URL] Abstract: Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU are viable options. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. 
Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multicore CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa. Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU-thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to the ground-truth single-threaded algorithm. We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA’s single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss. BibTeX: @article{Mlakar2021, author = {Daniel Mlakar and Martin Winter and Mathias Parger and Markus Steinberger}, title = {Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures}, year = {2021}, url = {https://www.markussteinberger.net/papers/SpeculativeRCM.pdf} }  Møyner O (2021), "Faster Simulation with Optimized Automatic Differentiation and Compiled Linear Solvers", In Advanced Modeling with the MATLAB Reservoir Simulation Toolbox., November, 2021. , pp. 200-254. Cambridge University Press. [Abstract] [BibTeX] [DOI] Abstract: Many different factors contribute to elapsed runtime of reservoir simulators. 
Once the cases become larger and more complex, the required wait time for results can become prohibitive. This chapter discusses three features recently introduced into the MRST AD-OO framework to make simulation of large cases more efficient. By analysing the sparsity pattern of the Jacobians for some of the most common operations involved in computing residual flow equations, we have developed different implementations of automatic differentiation that offer better memory usage and require fewer floating point operations. Using these so-called AD backends ensures (much) faster assembly of linearized systems. Likewise, these systems can be solved much faster by utilizing external packages for linear algebra; herein, primarily represented by the AMGCL header-only C++ library for solving large sparse linear systems with algebraic multigrid (AMG) methods. Last, but not least, the new "packed problem" format simplifies the management of multiple simulation cases and enables automatic restart of simulations and an ability for early inspection of results from large batch simulations. Altogether, these features are essential if you are working with bigger simulation models and want timely results that persist across MATLAB sessions. BibTeX: @incollection{Moeyner2021, author = {Olav Møyner}, title = {Faster Simulation with Optimized Automatic Differentiation and Compiled Linear Solvers}, booktitle = {Advanced Modeling with the MATLAB Reservoir Simulation Toolbox}, publisher = {Cambridge University Press}, year = {2021}, pages = {200--254}, doi = {10.1017/9781009019781.011} }  Mollaebrahim S and Beferull-Lozano B (2021), "Design of Asymmetric Shift Operators for Efficient Decentralized Subspace Projection", IEEE Transactions on Signal Processing. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). 
[Abstract] [BibTeX] [DOI] Abstract: A large number of applications in decentralized signal processing includes projecting a vector of noisy observations onto a subspace dictated by prior information about the field being monitored. Accomplishing such a task in a centralized fashion in networks is prone to a number of issues such as large power consumption, congestion at certain nodes and suffers from robustness issues against possible node failures. Decentralized subspace projection is an alternative method to address those issues. Recently, it has been shown that graph filters (GFs) can be implemented to perform decentralized subspace projection. However, most of the existing methods have focused on designing GFs for symmetric topologies. However, in this paper, motivated by the typical scenario of asymmetric communications in Wireless Sensor Networks, we study the optimal design of graph shift operators to perform decentralized subspace projection for asymmetric topologies. Firstly, the existence of feasible solutions (graph shift operators) to achieve an exact projection is characterized, and then an optimization problem is proposed to obtain the shift operator. We also provide an ADMM-based decentralized algorithm for the design of the shift operator. A necessary condition for the existence of an exact subspace projection solution is also provided in terms of network connectivity. In the case where achieving an exact projection is not feasible due to the sparse connectivity, we provide an efficient solution to compute the projection matrix with high accuracy by using a set of parallel graph filters. Numerical results show the benefits of the proposed methods, as compared to the currently existing state-of-the-art methods. 
BibTeX: @article{Mollaebrahim2021, author = {Siavash Mollaebrahim and Baltasar Beferull-Lozano}, title = {Design of Asymmetric Shift Operators for Efficient Decentralized Subspace Projection}, journal = {IEEE Transactions on Signal Processing}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tsp.2021.3066787} }  Montoison A and Orban D (2021), "GPMR: An Iterative Method for Unsymmetric Partitioned Linear Systems", November, 2021. [Abstract] [BibTeX] [DOI] Abstract: We introduce an iterative method named GPMR for solving 2x2 block unsymmetric linear systems. GPMR is based on a new process that reduces simultaneously two rectangular matrices to upper Hessenberg form and that is closely related to the block-Arnoldi process. GPMR is tantamount to Block-GMRES with two right-hand sides in which the two approximate solutions are summed at each iteration, but requires less storage and work per iteration. We compare the performance of GPMR with GMRES and Block-GMRES on linear systems from the SuiteSparse Matrix Collection. In our experiments, GPMR terminates significantly earlier than GMRES on a residual-based stopping condition with an improvement ranging from around 10% up to 50% in terms of number of iterations. We also illustrate by experiment that GPMR appears more resilient to loss of orthogonality than Block-GMRES. BibTeX: @article{Montoison2021, author = {Alexis Montoison and Dominique Orban}, title = {GPMR: An Iterative Method for Unsymmetric Partitioned Linear Systems}, year = {2021}, doi = {10.13140/RG.2.2.24069.68326} }  Moríñigo JA, Bustos A and Mayo-García R (2021), "Error resilience of three GMRES implementations under fault injection", The Journal of Supercomputing., November, 2021. Springer Science and Business Media LLC. 
[Abstract] [BibTeX] [DOI] Abstract: The resilience behavior of three GMRES prototyped implementations (with Incomplete LU, Flexible and randomized-SVD-based preconditioners) has been analyzed with a soft errors injection approach. A low-level fault injector is inserted into the GMRES solvers, which randomly select locations in the program to inject the fault across multiple executions. This fault injection approach combines the configurability of high-level and the accuracy of low-level techniques at the same time, so the effect of faults may be closely emulated. In order to gather enough statistical data, a set of eighteen sparse matrix-based linear systems Ax = b has been solved with these GMRES implementations in the injection experiments and monitored. The results of this prototype-based fault injection suggest an improved error resilience behavior of the randomized-SVD-based preconditioned GMRES version in many of the analyzed matrices, which points out to its interest in supercomputing applications where silent errors are more prominent. BibTeX: @article{Morinigo2021, author = {José A. Moríñigo and Andrés Bustos and Rafael Mayo-García}, title = {Error resilience of three GMRES implementations under fault injection}, journal = {The Journal of Supercomputing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11227-021-04148-x} }  Mudiyanselage TKB (2021), "Machine Learning Methods for Effectively Discovering Complex Relationships in Graph Data". Thesis at: Georgia State University. [Abstract] [BibTeX] Abstract: Graphs are extensively employed in many systems due to their capability to capture the interactions (edges) among data (nodes) in many real-life scenarios. Social networks, biological networks and molecular graphs are some of the domains where data have inherent graph structural information. 
Built graphs can be used to make predictions in Machine Learning (ML) such as node classifications, link predictions, graph classifications, etc. But, existing ML algorithms hold a core assumption that data instances are independent of each other and hence prevent incorporating graph information into ML. This irregular and variable sized nature of non-Euclidean data makes learning underlying patterns of the graph more sophisticated. One approach is to convert the graph information into a lower dimensional space and use traditional learning methods on the reduced space. Meanwhile, Deep Learning has better performance than ML due to convolutional layers and recurrent layers which consider simple correlations in spatial and temporal data, respectively. This proves the importance of taking data interrelationships into account and Graph Convolutional Networks (GCNs) are inspired by this fact to exploit the structure of graphs to make better inference in both node-centric and graph-centric applications. In this dissertation, the graph based ML prediction is addressed in terms of both node classification and link prediction tasks. At first, GCN is thoroughly studied and compared with other graph embedding methods specific to biological networks. Next, we present several new GCN algorithms to improve the prediction performance related to biomedical networks and medical imaging tasks. A circularRNA (circRNA) and disease association network is modeled for both node classification and link prediction tasks to predict diseases relevant to circRNAs to demonstrate the effectiveness of graph convolutional learning. A GCN based chest X-ray image classification outperforms state-of-the-art transfer learning methods. Next, the graph representation is used to analyze the feature dependencies of data and select an optimal feature subset which respects the original data structure. 
Finally, the usability of this algorithm is discussed in identifying disease specific genes by exploiting gene-gene interactions. BibTeX: @phdthesis{Mudiyanselage2021, author = {Thosini K. Bamunu Mudiyanselage}, title = {Machine Learning Methods for Effectively Discovering Complex Relationships in Graph Data}, school = {Georgia State University}, year = {2021} }  Murthy DK (2021), "Optimized Sparse Matrix Operations And Hardware Implementation Using FPGA". Thesis at: Texas State University. [Abstract] [BibTeX] [URL] Abstract: The increasing importance of sparse connectivity representing real-world data has been exemplified by the recent work in areas of graph analytics, machine language, and high-performance. Sparse matrices are the critical component in many scientific computing applications, where increasing the sparse matrix operation efficiency can contribute significantly to improve overall system efficiency. The primary challenge is handling the nonzero values efficiently by storing them using specific storage format and performing matrix operations, taking advantage of the sparsity. This thesis proposes an optimized algorithm for performing sparse matrix operations concerning storage and hardware implementation on FPGAs. The proposed thesis work includes simple arithmetic operations to complex decomposition algorithms using Verilog design. Operations of the sparse matrix are tested with testbench matrices of different size, sparsity percentage, and sparsity pattern. The design was able to achieve low latency, high throughput, and minimal resources utilization when compared with the conventional matrix algorithm. Our approach enables solving more significant problems than previously possible, allowing FPGAs to more interesting issues. 
BibTeX: @phdthesis{Murthy2021, author = {Dinesh Kumar Murthy}, title = {Optimized Sparse Matrix Operations And Hardware Implementation Using FPGA}, school = {Texas State University}, year = {2021}, url = {https://digital.library.txstate.edu/bitstream/handle/10877/14054/MURTHY-THESIS-2021.pdf?sequence=1} }  Muts P (2021), "Decomposition methods for mixed-integer nonlinear programming". Thesis at: Universidad de Málaga. [Abstract] [BibTeX] [URL] Abstract: Mixed-integer nonlinear programming (MINLP) is an important and challenging field of optimization. The problems from this class can contain continuous and integral variables as well as linear and nonlinear constraints. This class of problems has a pivotal role in science and industry, since it provides an accurate way to describe phenomena in different areas like chemical and mechanical engineering, supply chain, management, etc. Most of the state-of-the-art algorithms for solving nonconvex MINLP problems are based on branch-and-bound. The main drawback of this approach is that the search tree may grow very fast preventing the algorithm to find a high-quality solution in a reasonable time. An alternative to avoid generating one big search tree is to make use of decomposition to make the solution process more tractable. Decomposition provides a general framework where one splits the original problem into smaller sub-problems and combines their solutions into an easier global master problem. This thesis deals with decomposition methods for mixed-integer nonlinear programming. The main objective of this thesis is to develop alternative approaches to branch-and-bound based on decomposition-based successive approximation methods. For industry and science, it is important to compute an optimal solution, or at least, improve the best available so far. Moreover, this should be done within a reasonable time. 
Therefore, the goal is to design efficient algorithms to solve large-scale problems that have a direct practical application. In particular, we focus on models that have an application in energy system planning and operation. In this thesis, two main research lines can be distinguished. The first deals with Outer Approximation methods while the second studies a Column Generation approach. We investigate and analyse theoretical and practical aspects of both ideas within a decomposition framework. The main purpose of this study is to develop systematic decomposition-based successive approximation approaches to solve large-scale problems using Outer Approximation and Column Generation. Chapter 1 introduces an important concept needed for decomposition, i.e. a block-separable reformulation of a MINLP problem. In addition, it describes the above-mentioned methods, including branch-and-bound, and several other key concepts needed for this thesis, e.g. Inner Approximation, etc. Chapters 2, 3 and 4 investigate the use of Outer Approximation. Chapter 2 presents a decomposition-based Outer Approximation algorithm for solving convex MINLP problems based on construction of supporting hyperplanes for a feasible set. Chapter 3 extends decomposition-based Outer Approximation algorithm to nonconvex MINLP problems by introducing a piecewise nonconvex Outer Approximation of a nonconvex feasible set. Another perspective of the Outer Approximation definition for nonconvex problems is considered in Chapter 4. It presents a decomposition-based Inner and Outer Refinement algorithm, which constructs an Outer Approximation while computing the Inner Approximation using Column Generation. The Outer Approximation used in the Inner and Outer Refinement algorithm is based on the multi-objective view of the so-called resource-constrained version of the original problem. Two chapters are devoted to Column Generation. 
Chapter 4 presents a Column Generation algorithm to compute an Inner Approximation of the original problem. Moreover, it describes a partition-based heuristic algorithm which uses an Inner Approximation refinement. Chapter 5 discusses several acceleration techniques for Column Generation. Furthermore, it presents a Column Generation-based heuristic algorithm that can be applied to any MINLP problem. The algorithm utilizes a projection-based primal heuristic to generate several high-quality solution candidates. Chapter 6 contains a short description of the implementation in Python of the MINLP solver DECOGO. Chapter 7 summarizes the findings obtained during the elaboration of this thesis. BibTeX: @phdthesis{Muts2021, author = {Pavlo Muts}, title = {Decomposition methods for mixed-integer nonlinear programming}, school = {Universidad de Málaga}, year = {2021}, url = {https://www.researchgate.net/profile/Pavlo-Muts/publication/353526416_Decomposition_methods_for_mixed-integer_nonlinear_programming/links/6101a4fc1ca20f6f86e5edcf/Decomposition-methods-for-mixed-integer-nonlinear-programming.pdf} }  Myers JM and Dunlavy DM (2021), "Using Computation Effectively for Scalable Poisson Tensor Factorization: Comparing Methods Beyond Computational Efficiency" [Abstract] [BibTeX] Abstract: Poisson Tensor Factorization (PTF) is an important data analysis method for analyzing patterns and relationships in multiway count data. In this work, we consider several algorithms for computing a low-rank PTF of tensors with sparse count data values via maximum likelihood estimation. Such an approach reduces to solving a nonlinear, non-convex optimization problem, which can leverage considerable parallel computation due to the structure of the problem. 
However, since the maximum likelihood estimator corresponds to the global minimizer of this optimization problem, it is important to consider how effective methods are at both leveraging this inherent parallelism as well as computing a good approximation to the global minimizer. In this work we present comparisons of multiple methods for PTF that illustrate the tradeoffs in computational efficiency and accurately computing the maximum likelihood estimator. We present results using synthetic and real-world data tensors to demonstrate some of the challenges when choosing a method for a given tensor. BibTeX: @article{Myers2021, author = {Jeremy M. Myers and Daniel M. Dunlavy}, title = {Using Computation Effectively for Scalable Poisson Tensor Factorization: Comparing Methods Beyond Computational Efficiency}, year = {2021} }  Nayak P, Göbel F and Anzt H (2021), "A Collaborative Peer Review Process for Grading Coding Assignments", In Computational Science – ICCS 2021. , pp. 654-660. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: With software technology becoming one of the most important aspects of computational science, it is imperative that we train students in the use of software development tools and teach them to adhere to sustainable software development workflows. In this paper, we showcase how we employ a collaborative peer review workflow for the homework assignments of our course on Numerical Linear Algebra for High Performance Computing (HPC). In the workflow we employ, the students are required to operate with the git version control system, perform code reviews, realize unit tests, and plug into a continuous integration system. From the students’ performance and feedback, we are optimistic that this workflow encourages the acceptance and usage of software development tools in academic software development. 
BibTeX: @incollection{Nayak2021, author = {Pratik Nayak and Fritz Göbel and Hartwig Anzt}, title = {A Collaborative Peer Review Process for Grading Coding Assignments}, booktitle = {Computational Science – ICCS 2021}, publisher = {Springer International Publishing}, year = {2021}, pages = {654--660}, doi = {10.1007/978-3-030-77980-1_49} }  Nepomuceno R, Sterle R, Valarini G, Pereira M, Yviquel H and Araujo G (2021), "Enabling OpenMP Task Parallelism on Multi-FPGAs", March, 2021. [Abstract] [BibTeX] Abstract: FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture capable of accelerating a single application. However, programming such architecture is a challenging endeavor that still requires additional research. This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increase. BibTeX: @article{Nepomuceno2021, author = {R. Nepomuceno and R. Sterle and G. Valarini and M. Pereira and H. Yviquel and G. Araujo}, title = {Enabling OpenMP Task Parallelism on Multi-FPGAs}, year = {2021} }  Neto-Bradley AP (2021), "A Social Logic of Energy -- A data science approach to understanding and modelling energy transitions of India’s urban poor". Thesis at: University of Cambridge. 
[Abstract] [BibTeX] Abstract: Continued use of traditional solid biomass fuels for cooking in Indian households poses a serious public health risk. Particulate emissions in the form of soot contributed to approximately 600,000 deaths in 2019, a burden that falls disproportionately on women, children, and vulnerable populations. Despite over 95% of the population having access to clean cooking fuel distribution, following recent government initiatives to promote liquefied petroleum gas, biomass cooking fuel use is still widespread. This is the case even in cities, where low-income households have low levels of sustained clean cooking fuel use. Interventions to promote transition to clean cooking often focus on cost and technology, informed by an economic-technical view of energy transition, but not all households benefit as expected from these interventions. Previous studies on socio-economic determinants of transition offer limited insight into the reasons for why some households can slip through the net of such interventions. The explanation lies in the socio-cultural and economic heterogeneity across households and the inherent spatial inequalities in urban India. This thesis explores the influence of local socio-economic and cultural factors, and household practices and habits, on clean cooking transition with a view to understanding how the associated heterogeneity can be characterised, and integrated into quantitative energy models and methods. Public national survey and census data is supplemented with primary data collection, which provides valuable quantitative and qualitative data on low-income urban households. Tree-based regression is used to investigate the influence of socio-economic and cultural factors within quantitative models. Determinants are found to exhibit non-linear trends, with thresholds for change in influence on transition. 
A statistical clustering reveals different typologies of household amongst clean cooking adopters, indicative of different enabling circumstances and pathways to transition. Continued use of biomass is found to be common across recently transitioned households. The heterogeneity amongst low-income households, and the emergent transition pathways, are further investigated through data collected on low-income households in Bangalore. A novel method is used which combines mixed data in a two-stage clustering analysis, offering a means to characterise heterogeneity across households, identifying distinct transition pathways and associated barriers. The findings illustrate how wider socio-economic inequality is intertwined with access to sustained clean cooking. A Bayesian multilevel microsimulation approach is proposed to model the spatial heterogeneity in clean cooking at a city scale. This approach combines publicly available data to generate a synthetic population, and estimates cooking fuel use and fuel stacking using a Bayesian multilevel model. The model takes into account household cooking practices, local spatial effects, and city level economic and policy context. The model reveals how low uptake of clean cooking fuel, and continued biomass use, is related to underlying spatial socio-economic inequalities in cities. BibTeX: @phdthesis{NetoBradley2021, author = {André Paul Neto-Bradley}, title = {A Social Logic of Energy -- A data science approach to understanding and modelling energy transitions of India’s urban poor}, school = {University of Cambridge}, year = {2021} }  Nguyen T, MacLean C, Siracusa M, Doerfler D, Wright NJ and Williams S (2021), "FPGA-based HPC accelerators: An evaluation on performance and energy efficiency", Concurrency and Computation: Practice and Experience., August, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Hardware specialization is a promising direction for the future of digital computing. 
Reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost, but their performance and energy efficiency compared to state-of-the-art processor architectures remain an open question. In this article, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs, but we also extend the Empirical Roofline Toolkit (ERT) to FPGAs in order to assess our results in terms of the Roofline model. We show design optimization and tuning techniques for peak FPGA performance at reasonable hardware usage and power consumption. As FPGA peak performance is known to be far less than that of a GPU, we also benchmark the energy efficiency of each platform for the scientific kernels comparing against microbenchmark and technological limits. Results show that while FPGAs struggle to compete in absolute terms with GPUs on memory- and compute-intensive kernels, they require far less power and can deliver nearly the same energy efficiency. BibTeX: @article{Nguyen2021, author = {Tan Nguyen and Colin MacLean and Marco Siracusa and Douglas Doerfler and Nicholas J. Wright and Samuel Williams}, title = {FPGA-based HPC accelerators: An evaluation on performance and energy efficiency}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6570} }  Nickolay S, Jung E-S, Kettimuthu R and Foster I (2021), "Towards Accommodating Real-time Jobs on HPC Platforms", March, 2021. [Abstract] [BibTeX] Abstract: Increasing data volumes in scientific experiments necessitate the use of high-performance computing (HPC) resources for data analysis. In many scientific fields, the data generated from scientific instruments and supercomputer simulations must be analyzed rapidly. 
In fact, the requirement for quasi-instant feedback is growing. Scientists want to use results from one experiment to guide the selection of the next or even to improve the course of a single experiment. Current HPC systems are typically batch-scheduled under policies in which an arriving job is run immediately only if enough resources are available; otherwise, it is queued. It is hard for these systems to support real-time jobs. Real-time jobs, in order to meet their requirements, should sometimes have to preempt batch jobs and/or be scheduled ahead of batch jobs that were submitted earlier. Accommodating real-time jobs may negatively impact system utilization also, especially when preemption/restart of batch jobs is involved. We first explore several existing scheduling strategies to make real-time jobs more likely to be scheduled in due time. We then rigorously formulate the problem as a mixed-integer linear programming for offline scheduling and develop novel scheduling heuristics for online scheduling. We perform simulation studies using trace logs of Mira, the IBM BG/Q system at Argonne National Laboratory, to quantify the impact of real-time jobs on batch job performance for various percentages of real-time jobs in the workload. We present new insights gained from grouping jobs into different categories based on runtime and the number of nodes used and studying the performance of each category. Our results show that with 10% real-time job percentages, just-in-time checkpointing combined with our heuristic can improve the slowdowns of real-time jobs by 35% while limiting the increase of the slowdowns of batch jobs to 10%. 
BibTeX: @article{Nickolay2021, author = {Sam Nickolay and Eun-Sung Jung and Rajkumar Kettimuthu and Ian Foster}, title = {Towards Accommodating Real-time Jobs on HPC Platforms}, year = {2021} }  Nisa I, Pandey P, Ellis M, Oliker L, Buluc A and Yelick K (2021), "Distributed-Memory k-mer Counting on GPUs", In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium., May, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: A fundamental step in many bioinformatics computations is to count the frequency of fixed-length sequences, called k-mers, a problem that has received considerable attention as an important target for shared memory parallelization. With datasets growing at an exponential rate, distributed memory parallelization is becoming increasingly critical. Existing distributed memory k-mer counters do not take advantage of GPUs for accelerating computations. Additionally, they do not employ domain-specific optimizations to reduce communication volume in a distributed environment. In this paper, we present the first GPU-accelerated distributed-memory parallel k-mer counter. We evaluate the communication volume as the major bottleneck in scaling k-mer counting to multiple GPU-equipped compute nodes and implement a supermer-based optimization to reduce the communication volume and to enhance scalability. Our empirical analysis examines the balance of communication to computation on a state-of-the-art system, the Summit supercomputer at Oak Ridge National Lab. Results show overall speedups of up to two orders of magnitude with GPU optimization over CPU-based k-mer counters. Furthermore, we show an additional 1.5× speedup using the supermer-based communication optimization. 
BibTeX: @inproceedings{Nisa2021, author = {Israt Nisa and Prashant Pandey and Marquita Ellis and Leonid Oliker and Aydin Buluc and Katherine Yelick}, title = {Distributed-Memory k-mer Counting on GPUs}, booktitle = {Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdps49936.2021.00061} }  Nobel P, Agrawal A and Boyd S (2021), "Computing Tighter Bounds on the n-Queens Constant via Newton's Method", December, 2021. [Abstract] [BibTeX] Abstract: In recent work Simkin shows that bounds on an exponent occurring in the famous n-queens problem can be evaluated by solving convex optimization problems, allowing him to find bounds far tighter than previously known. In this note we use Simkin's formulation, a sharper bound developed by Knuth, and a Newton method that scales to large problem instances, to find even sharper bounds. BibTeX: @article{Nobel2021, author = {Parth Nobel and Akshay Agrawal and Stephen Boyd}, title = {Computing Tighter Bounds on the n-Queens Constant via Newton's Method}, year = {2021} }  Nolet CJ, Gala D, Raff E, Eaton J, Rees B, Zedlewski J and Oates T (2021), "Semiring Primitives for Sparse Neighborhood Methods on the GPU", April, 2021. [Abstract] [BibTeX] Abstract: High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of skewed degree distributions and limits on memory consumption that are typically not issues in dense operations. We demonstrate that a sparse semiring primitive can be flexible enough to support a wide range of critical distance measures while maintaining performance and memory efficiency on the GPU. We further show that this primitive is a foundational component for enabling many neighborhood-based information retrieval and machine learning algorithms to accept sparse input. 
To our knowledge, this is the first work aiming to unify the computation of several critical distance measures on the GPU under a single flexible design paradigm and we hope that it provides a good baseline for future research in this area. Our implementation is fully open source and publicly available at https://github.com/rapidsai/cuml. BibTeX: @article{Nolet2021, author = {Corey J. Nolet and Divye Gala and Edward Raff and Joe Eaton and Brad Rees and John Zedlewski and Tim Oates}, title = {Semiring Primitives for Sparse Neighborhood Methods on the GPU}, year = {2021} }  Nystrom JK (2021), "Wavelet Methods for Very-short Term Forecasting of functional Time Series". Thesis at: Air Force Institute of Technology. [Abstract] [BibTeX] [URL] Abstract: Space launch operations at Kennedy Space Center and Cape Canaveral Space Force Station (KSC/CCSFS) are complicated by unique requirements for near-real time determination of risk from lightning. Lightning forecast weather sensor networks produce data that are noisy, high volume, and high frequency time series for which traditional forecasting methods are often ill-suited. Current approaches result in significant residual uncertainties and consequentially may result in forecasting operational policies that are excessively conservative or inefficient. This work proposes a new methodology of wavelet-enabled semiparametric modeling to develop accurate and timely forecasts robust against chaotic functional data. Wavelets methods are first used to de-noise the weather data, which is then used to estimate a single-index model for forecasting. This semiparametric technique mitigates noise of the chaotic signal while avoiding any possible distributional misspecification. A screening experiment with augmentations is used to demonstrate how to explore the complex factor space of model parameters, guiding decisions regarding model formulation and gaining insight for follow-on research. 
Imputation methods are applied on the spatially-based EFM time series making use of the inherit autocorrelation in the data, resulting in improved modeling using machine learning and artificial intelligence techniques. Results indicate a promising technique for operationally relevant lightning prediction from chaotic sensor measurements. BibTeX: @phdthesis{Nystrom2021, author = {Jared K. Nystrom}, title = {Wavelet Methods for Very-short Term Forecasting of functional Time Series}, school = {Air Force Institute of Technology}, year = {2021}, url = {https://scholar.afit.edu/etd/5085/} }  Ohmori S and Yoshimoto K (2021), "A Primal-Dual Interior-Point Method for Facility Layout Problem with Relative-Positioning Constraints", Algorithms., February, 2021. Vol. 14(2), pp. 60. MDPI AG. [Abstract] [BibTeX] [DOI] Abstract: We consider the facility layout problem (FLP) in which we find the arrangements of departments with the smallest material handling cost that can be expressed as the product of distance times flows between departments. It is known that FLP can be formulated as a linear programming problem if the relative positioning of departments is specified, and, thus, can be solved to optimality. In this paper, we describe a custom interior-point algorithm for solving FLP with relative positioning constraints (FLPRC) that is much faster than the standard methods used in the general-purpose solver. We build a compact formation of FLPRC and its duals, which enables us to establish the optimal condition very quickly. We use this optimality condition to implement the primal-dual interior-point method with an efficient Newton step computation that exploit special structure of a Hessian. We confirm effectiveness of our proposed model through applications to several well-known benchmark data sets. Our algorithm shows much faster speed for finding the optimal solution. 
BibTeX: @article{Ohmori2021, author = {Shunichi Ohmori and Kazuho Yoshimoto}, title = {A Primal-Dual Interior-Point Method for Facility Layout Problem with Relative-Positioning Constraints}, journal = {Algorithms}, publisher = {MDPI AG}, year = {2021}, volume = {14}, number = {2}, pages = {60}, doi = {10.3390/a14020060} }  Oktay E and Carson E (2021), "Multistage Mixed Precision Iterative Refinement", July, 2021. [Abstract] [BibTeX] Abstract: Low precision arithmetic, in particular half precision (16-bit) floating point arithmetic, is now available in commercial hardware. Using lower precision can offer significant savings in computation and communication costs with proportional savings in energy. Motivated by this, there have recently emerged a number of new iterative refinement schemes for solving linear systems Ax=b, both based on standard LU factorization and GMRES solvers, that exploit multiple different precisions. Each particular algorithm and each combination of precisions leads to different condition number-based constraints for convergence of the backward and forward errors, and each has different performance costs. Given that the user may not necessarily know the condition number of their matrix a priori, it may be difficult to select the optimal variant for their problem. In this work, we develop a three-stage mixed precision iterative refinement solver which aims to combine existing mixed precision approaches to balance performance and accuracy and improve usability. For a given combination of precisions, the algorithm begins with the least expensive approach and convergence is monitored via inexpensive computations with quantities produced during the iteration. If slow convergence or divergence is detected using particular stopping criteria, the algorithm switches to use more expensive, but more reliable GMRES-based refinement approaches. 
After presenting the algorithm and its details, we perform extensive numerical experiments on a variety of random dense problems and problems from real applications. Our experiments demonstrate that the theoretical constraints derived in the literature are often overly strict in practice, further motivating the need for a multistage approach. BibTeX: @article{Oktay2021, author = {Eda Oktay and Erin Carson}, title = {Multistage Mixed Precision Iterative Refinement}, year = {2021} }  Oladipo ID, AbdulRaheem M, Awotunde JB, Bhoi AK, Adeniyi EA and Abiodun MK (2021), "Machine Learning and Deep Learning Algorithms for Smart Cities: A Start-of-the-Art Review", In IoT and IoE Driven Smart Cities., December, 2021. , pp. 143-162. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: The development in our urban cities has increased significant risks with everyday lives, like traffic congestion, pollution of the atmosphere, energy use, and public safety among others. Internet of Things (IoT) system has been used to tackle different research issues in a smart city. With the rapid development of IoT technologies, researchers have been motivated to develop smart services that extract knowledge from big data generated from IoT-based devices/sensors. The development of various models like forecast, preparation, monitoring, and ambiguity exploration in smart cities has been enhanced by the applications of deep learning (DL) and machine learning (ML) techniques, and for the urban development. These have also yielded greater results in the process of the huge data and input variables coming from IoT-based cognitive cities. Therefore, this chapter reviews the applicability of the state-of-the-art ML and DL in smart cities' developments. It also discusses the novel application taxonomy of ML and DL smart cities and environmental planning that includes terms that are used interchangeably. 
Research shows that urban transportation, energy, and healthcare system are the main areas of applications that ML and DL techniques contributed in addressing their problems. The finding from the reviews reveals that ML and DL methods that are mostly applicable, and used in smart cities and urban development, are decision trees, support vector machine, artificial neural network, Bayesian, neuro-fuzzy, ensembles, and their hybridizations. Due to the complexities of both ML and DL with broad coverage of smart city applications, the study shows that there are various challenges ahead in applying these algorithms for this emerging field. The chapter discusses a range of potential directions related to ML and DL efficacy, evolving frameworks, convergence of information, and protection of privacy hoping that these would take the relevant research one step further to fully develop data analytics for smart cities. BibTeX: @incollection{Oladipo2021, author = {Idowu Dauda Oladipo and Muyideen AbdulRaheem and Joseph Bamidele Awotunde and Akash Kumar Bhoi and Emmanuel Abidemi Adeniyi and Moses Kazeem Abiodun}, title = {Machine Learning and Deep Learning Algorithms for Smart Cities: A Start-of-the-Art Review}, booktitle = {IoT and IoE Driven Smart Cities}, publisher = {Springer International Publishing}, year = {2021}, pages = {143--162}, doi = {10.1007/978-3-030-82715-1_7} }  Oliver SE, Cartis C, Kriest I, Tett SFB and Khatiwala S (2021), "A derivative-free optimisation method for global ocean biogeochemical models", September, 2021. Copernicus GmbH. [Abstract] [BibTeX] [DOI] Abstract: The performance of global ocean biogeochemical models, and the Earth System Models in which they are embedded, can be improved by systematic calibration of the parameter values against observations. However, such tuning is seldom undertaken as these models are computationally very expensive. 
Here we investigate the performance of DFO-LS, a local, derivative-free optimisation algorithm which has been designed for computationally expensive models with irregular model-data misfit landscapes typical of biogeochemical models. We use DFO-LS to calibrate six parameters of a relatively complex global ocean biogeochemical model (MOPS) against synthetic dissolved oxygen, inorganic phosphate and inorganic nitrate observations from a reference run of the same model with a known parameter configuration. The performance of DFO-LS is compared with that of CMA-ES, another derivative-free algorithm that was applied in a previous study to the same model in one of the first successful attempts at calibrating a global model of this complexity. We find that DFO-LS successfully recovers 5 of the 6 parameters in approximately 40 evaluations of the misfit function (each one requiring a 3000 year run of MOPS to equilibrium), while CMA-ES needs over 1200 evaluations. Moreover, DFO-LS reached a baseline misfit, defined by observational noise, in just 11–14 evaluations, whereas CMA-ES required approximately 340 evaluations. We also find that the performance of DFO-LS is not significantly affected by observational sparsity, however fewer parameters were successfully optimised in the presence of observational uncertainty. The results presented here suggest that DFO-LS is sufficiently inexpensive and robust to apply to the calibration of complex, global ocean biogeochemical models. BibTeX: @article{Oliver2021, author = {Sophy Elizabeth Oliver and Coralia Cartis and Iris Kriest and Simon F. B. Tett and Samar Khatiwala}, title = {A derivative-free optimisation method for global ocean biogeochemical models}, publisher = {Copernicus GmbH}, year = {2021}, doi = {10.5194/gmd-2021-175} }  Olofsson S, Schultz ES, Mhamdi A, Mitsos A, Deisenroth MP and Misener R (2021), "Design of Dynamic Experiments for Black-Box Model Discrimination", February, 2021. 
[Abstract] [BibTeX] Abstract: Diverse domains of science and engineering require and use mechanistic mathematical models, e.g. systems of differential algebraic equations. Such models often contain uncertain parameters to be estimated from data. Consider a dynamic model discrimination setting where we wish to choose: (i) what is the best mechanistic, time-varying model and (ii) what are the best model parameter estimates. These tasks are often termed model discrimination/selection/validation/verification. Typically, several rival mechanistic models can explain data, so we incorporate available data and also run new experiments to gather more data. Design of dynamic experiments for model discrimination helps optimally collect data. For rival mechanistic models where we have access to gradient information, we extend existing methods to incorporate a wider range of problem uncertainty and show that our proposed approach is equivalent to historical approaches when limiting the types of considered uncertainty. We also consider rival mechanistic models as dynamic black boxes that we can evaluate, e.g. by running legacy code, but where gradient or other advanced information is unavailable. We replace these black-box models with Gaussian process surrogate models and thereby extend the model discrimination setting to additionally incorporate rival black-box models. We also explore the consequences of using Gaussian process surrogates to approximate gradient-based methods. BibTeX: @article{Olofsson2021, author = {Simon Olofsson and Eduardo S. 
Schultz and Adel Mhamdi and Alexander Mitsos and Marc Peter Deisenroth and Ruth Misener}, title = {Design of Dynamic Experiments for Black-Box Model Discrimination}, year = {2021} }  Ondel L, Yee-Mui L-ML, Kocour M, Corro CF and Burget L (2021), "GPU-Accelerated Forward-Backward Algorithm with Application to Lattice-Free MMI" [Abstract] [BibTeX] Abstract: We propose to express the forward-backward algorithm in terms of operations between sparse matrices in a specific semiring. This new perspective naturally leads to a GPU-friendly algorithm which is easy to implement in Julia or any programming languages with native support of semiring algebra. We use this new implementation to train a TDNN with the LF-MMI objective function and we compare the training time of our system with PyChain—a recently introduced C++/CUDA implementation of the LF-MMI loss. Our implementation is about two times faster while not having to use any approximation such as the "leaky-HMM". BibTeX: @inproceedings{Ondel2021, author = {Lucas Ondel and Léa-Marie Lam-Yee-Mui and Martin Kocour and Caio Filippo Corro and Lukás Burget}, title = {GPU-Accelerated Forward-Backward Algorithm with Application to Lattice-Free MMI}, year = {2021} }  Operto S (2021), "Up-to-date assessment of 3D frequency-domain full waveform inversion based on the sparse multifrontal solver MUMPS", In Fifth EAGE Workshop on High Performance Computing for Upstream. European Association of Geoscientists & Engineers. [Abstract] [BibTeX] [DOI] Abstract: Efficient frequency-domain Full Waveform Inversion (FWI) can be applied on long-offset/wide-azimuth stationary-recording seabed acquisitions carried out with ocean-bottom cables (OBC) and ocean bottom nodes (OBN) since the wide angular illumination provided by these surveys allows for limiting the inversion to a few discrete frequencies. 
In the frequency domain, the forward problem is a boundary value problem requiring the solution of large and sparse linear systems with multiple right-hand sides. In this study, we revisit the potential of the massively-parallel sparse multifrontal solver MUMPS to perform efficiently the multi-source forward problem of 3D visco-acoustic FWI. The execution time and memory consumption of the solver are further improved by exploiting the low rank properties of the sub-blocks of the dense frontal matrices, the sparsity of the right-hand sides (seismic sources) and the work in progress on the use of mixed precision arithmetic. We revisit a 3D OBC case study from the North Sea in the 3.5 Hz-13 Hz frequency band using between 10 and 70 nodes of the Jean-Zay supercomputer of IDRIS and show that, even without exploiting low rank properties, problems involving 50 millions of unknowns and probably more can be tackled today with this technology. BibTeX: @inproceedings{Operto2021, author = {S. Operto}, title = {Up-to-date assessment of 3D frequency-domain full waveform inversion based on the sparse multifrontal solver MUMPS}, booktitle = {Fifth EAGE Workshop on High Performance Computing for Upstream}, publisher = {European Association of Geoscientists & Engineers}, year = {2021}, doi = {10.3997/2214-4609.2021612016} }  Oseledets I and Fanaskov V (2021), "Direct optimization of BPX preconditioners", Journal of Computational and Applied Mathematics., September, 2021. , pp. 113811. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: We consider an automatic construction of locally optimal preconditioners for positive definite linear systems. To achieve this goal, we introduce a differentiable loss function that does not explicitly include the estimation of minimal eigenvalue. Nevertheless, the resulting optimization problem is equivalent to a direct minimization of the condition number. To demonstrate our approach, we construct a parametric family of modified BPX preconditioners. 
Namely, we define a set of empirical basis functions for coarse finite element spaces and tune them to achieve better condition number. For considered model equations (that includes Poisson, Helmholtz, Convection–diffusion, Biharmonic, and others), we achieve from two to twenty times smaller condition numbers for symmetric positive definite linear systems. BibTeX: @article{Oseledets2021, author = {Ivan Oseledets and Vladimir Fanaskov}, title = {Direct optimization of BPX preconditioners}, journal = {Journal of Computational and Applied Mathematics}, publisher = {Elsevier BV}, year = {2021}, pages = {113811}, doi = {10.1016/j.cam.2021.113811} }  Otero J (2021), "Heritage Conservation Future: Where We Stand, Challenges Ahead, and a Paradigm Shift", Global Challenges. (2100084) [Abstract] [BibTeX] [DOI] Abstract: Global cultural heritage is a lucrative asset. It is an important industry generating millions of jobs and billions of euros in revenue yearly. However, despite the tremendous economic and socio-cultural benefits, little attention is usually paid to its conservation and to developing innovative big-picture strategies to modernize its professional field. This perspective aims to compile some of the relevant current global needs to explore alternative ways for shaping future steps associated with the 2030 Agenda for Sustainable Development. From this perspective, it is conceptualized how emerging artificial intelligence (AI) and digital socio-technological models of production based on democratic Peer-2-Peer (P2P) interactions can represent an alternative transformative solution by going beyond the current global communication and technical limitations in the heritage conservation community, while also providing novel digital tools to conservation practitioners, which can truly revolutionize the conservation decision-making process and improve global conservation standards. 
BibTeX: @article{Otero2021, author = {Jorge Otero}, title = {Heritage Conservation Future: Where We Stand, Challenges Ahead, and a Paradigm Shift}, journal = {Global Challenges}, year = {2021}, number = {2100084}, doi = {10.1002/gch2.202100084} }  Oyarzun G, Peyrolon D, Alvarez C and Martorell X (2021), "An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations", July, 2021. [Abstract] [BibTeX] Abstract: Field Programmable Gate Arrays generate algorithmic specific architectures that improve the code's FLOP per watt ratio. Such devices are re-gaining interest due to the rise of new tools that facilitate their programming, such as OmpSs. The computational fluid dynamics community is always investigating new architectures that can improve its algorithm's performance. Commonly, those algorithms have a low arithmetic intensity and only reach a small percentage of the peak performance. The sparse matrix-vector multiplication is one of the most time-consuming operations on unstructured simulations. The matrix's sparsity pattern determines the indirect memory accesses of the multiplying vector. This data path is hard to predict, making traditional implementations fail. In this work, we present an FPGA architecture that maximizes the vector's re-usability by introducing a cache-like architecture. The cache is implemented as a circular list that maintains the BRAM vector components while needed. Following this strategy, up to 16 times of acceleration is obtained compared to a naive implementation of the algorithm. BibTeX: @article{Oyarzun2021, author = {Guillermo Oyarzun and Daniel Peyrolon and Carlos Alvarez and Xavier Martorell}, title = {An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations}, year = {2021} }  Palanduz KM (2021), "On separable primal-dual algorithms for very large-scale optimization". Thesis at: Faculty of Engineering at Stellenbosch University. 
[Abstract] [BibTeX] Abstract: In this study, we develop two novel separable primal-dual algorithms, containing closed-form primal and dual variable expressions. The separability of the primal-dual expressions allow both algorithms to exploit massively parallel computational devices, which is desirable for very large-scale optimization. One of the algorithms is ideally suited for low-rank singular value decomposition (SVD), since the separable primal and dual updates become embarrassingly parallel for the SVD problem, allowing the algorithm to efficiently exploit general purpose graphical compute units (GPGPUs). In the first part of this study, we develop an iterative separable augmented Lagrangian algorithm (SALA), which has the salient feature of embarrassingly parallel primal and dual variable expressions, hence the algorithm is ideal for implementation on massively parallel computational devices, such as GPGPUs. SALA solves a sequence of quadratic-like problems, able to capture reciprocal and exponential-like behavior; a desirable property in structural optimization. Since SALA resides in the class of alternating directions of multiplier method type algorithms, we demonstrate numerical results on structural problems requiring medium levels of accuracy. In the second part of this study, we propose a separable Lagrangian algorithm (SLA) for very large-scale optimization. SLA, derived from the dual of Falk, solves a sequence of quadratic-like problems and, like SALA, is able to capture reciprocal and exponential-like behavior. SLA has embarrassingly parallel primal updates, while the dual variables require the solution of a positive-definite linear system. Indeed, both primal and dual variable updates can exploit massively parallel computational devices. 
We demonstrate numerical results for structural problems involving hundreds of millions of variables and constraints, solved for in a few minutes on a single quad-core machine. Following the development of SLA, we address the low-rank SVD problem. Two separate algorithms are developed, using a variation of SLA that exploits the structure of the SVD problem, resulting in embarrassingly parallel primal and dual updates. Both algorithms use a GPGPU accelerated, constrained and convex sequential approximate optimization (SAO) approach to maximize the well-known Rayleigh quotient, while addressing the difficulties inherent to state-of-the-art Krylov subspace methods, such as resilience to slowly decaying singular values and constant memory requirements. The convex SAO subproblems are conditioned using a novel scaling strategy, allowing for generic solver settings to be used across a wide range of singular value distributions. We demonstrate outstanding numerical results compared to state-of-the-art Lanczos methods, in both CPU and GPGPU implementations, which significantly reduce the time-complexity required for large-scale problems. Finally, we propose a multi-solver approach to soften the no-free-lunch (NFL) theorems for optimization on large-scale structural problems. State-of-the-art algorithms and SLA, each exploiting different solution strategies, compete simultaneously for a problem solution on a single multi-core system. Numerical results demonstrate the efficacy of using a multi-solver approach over a range of test problems, since said approach outperforms any standalone solver tested in terms of mean solution time. 
BibTeX: @phdthesis{Palanduz2021, author = {Kemal Marice Palanduz}, title = {On separable primal-dual algorithms for very large-scale optimization}, school = {Faculty of Engineering at Stellenbosch University}, year = {2021} }  Pan K, Sun H-W, Xu Y and Xu Y (2021), "An efficient multigrid solver for two-dimensional spatial fractional diffusion equations with variable coefficients", Applied Mathematics and Computation., August, 2021. Vol. 402, pp. 126091. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Extrapolation cascadic multigrid (EXCMG) method with the conjugate gradient smoother is shown to be an efficient solver for large sparse symmetric positive definite systems resulting from linear finite element discretization of second-order elliptic boundary value problems [Pan et al. J. Comput. Phys. 344 (2017) 499–515]. In this paper, we generalize the EXCMG method to solve a class of spatial fractional diffusion equations (SFDEs) with variable coefficients. Both steady-state and time-dependent problems are considered. First of all, space-fractional derivatives defined in Riemann–Liouville sense are discretized by using the weighted average of shifted Grünwald formula, which results in a fully dense and nonsymmetric linear system for the steady-state problem, or a semi-discretized ODE system for the time-dependent problem. Then, to solve the former problem, we propose the EXCMG method with the biconjugate gradient stabilized smoother to deal with the dense and nonsymmetric linear system. Next, such technique is extended to solve the latter problem since it becomes fully discrete when the Crank-Nicolson scheme is introduced to handle the temporal derivative. Finally, several numerical examples are reported to show that the EXCMG method is an efficient solver for both steady-state and time-dependent SFDEs, and performs much better than the V-cycle multigrid method with banded-splitting smoother for time-dependent SFDEs [Lin et al. J. Comput. Phys. 336 (2017) 69–86]. 
BibTeX: @article{Pan2021, author = {Kejia Pan and Hai-Wei Sun and Yuan Xu and Yufeng Xu}, title = {An efficient multigrid solver for two-dimensional spatial fractional diffusion equations with variable coefficients}, journal = {Applied Mathematics and Computation}, publisher = {Elsevier BV}, year = {2021}, volume = {402}, pages = {126091}, doi = {10.1016/j.amc.2021.126091} }  Pandey P, Wheatman B, Xu H and Buluç A (2021), "Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs", In Proceedings of the ACM SIGMOD/PODS Conference. [Abstract] [BibTeX] Abstract: Various applications model problems as streaming graphs, which need to quickly apply a stream of updates and run algorithms on the updated graph. Furthermore, many dynamic real-world graphs, such as social networks, follow a skewed distribution of vertex degrees, where there are a few high-degree vertices and many low-degree vertices. Existing static graph-processing systems optimized for graph skewness achieve high performance and low space usage by preprocessing a cache-efficient graph partitioning based on vertex degree. In the streaming setting, the whole graph is not available upfront, however, so finding an optimal partitioning is not feasible in the presence of updates. As a result, existing streaming graph-processing systems take a "one-size-fits-all" approach, leaving performance on the table. We present Terrace, a system for streaming graphs that uses a hierarchical data structure design to store a vertex’s neighbors in different data structures depending on the degree of the vertex. This multi-level structure enables Terrace to dynamically partition vertices based on their degrees and adapt to skewness in the underlying graph. Our experiments show that Terrace supports faster batch insertions for batch sizes up to 1M when compared to Aspen, a state-of-the-art graph streaming system. 
On graph query algorithms, Terrace is between 1.7×--2.6× faster than Aspen and between 0.5×--1.3× as fast as Ligra, a state-of-the-art static graph-processing system. BibTeX: @inproceedings{Pandey2021, author = {Prashant Pandey and Brian Wheatman and Helen Xu and Aydin Buluç}, title = {Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs}, booktitle = {Proceedings of the ACM SIGMOD/PODS Conference}, year = {2021} }  Parasyris K, Georgakoudis G, Menon H, Diffenderfer J, Laguna I, Osei-Kuffuor D and Schordan M (2021), "HPAC", November, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: As we approach the limits of Moore's law, researchers are exploring new paradigms for future high-performance computing (HPC) systems. Approximate computing has gained traction by promising to deliver substantial computing power. However, due to the stringent accuracy requirements of HPC scientific applications, the broad adoption of approximate computing methods in HPC requires an in-depth understanding of the application's amenability to approximations. We develop HPAC, a framework with compiler and runtime support for code annotation and transformation, and accuracy vs. performance trade-off analysis of OpenMP HPC applications. We use HPAC to perform an in-depth analysis of the effectiveness of approximate computing techniques when applied to HPC applications. The results reveal possible performance gains of approximation and its interplay with parallel execution. For instance, in the LULESH proxy application approximation provides substantial performance gains due to the reduction of memory accesses. However, in the leukocyte benchmark approximation induces load imbalance in the parallel execution and thus limiting the performance gains. 
BibTeX: @inproceedings{Parasyris2021, author = {Konstantinos Parasyris and Giorgis Georgakoudis and Harshitha Menon and James Diffenderfer and Ignacio Laguna and Daniel Osei-Kuffuor and Markus Schordan}, title = {HPAC}, publisher = {ACM}, year = {2021}, doi = {10.1145/3458817.3476216} }  Park J and Lee K (2021), "S-MPEC: Sparse Matrix Multiplication Performance Estimator on a Cloud Environment", Cluster Computing., May, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Sparse matrix multiplication (SPMM) is widely used for various machine learning algorithms. As the applications of SPMM using large-scale datasets become prevalent, executing SPMM jobs on an optimized setup has become very important. Execution environments of distributed SPMM tasks on cloud resources can be set up in diverse ways with respect to the input sparse datasets, distinct SPMM implementation methods, and the choice of cloud instance types. In this paper, we propose S-MPEC which can predict latency to complete various SPMM tasks using Apache Spark on distributed cloud environments. We first characterize various distributed SPMM implementations on Apache Spark. Considering the characters and hardware specifications on the cloud, we propose unique features to build a GB-regressor model and Bayesian optimizations. Our proposed S-MPEC model can predict latency on an arbitrary SPMM task accurately and recommend an optimal implementation method. Thorough evaluation of the proposed system reveals that a user can expect 44% less latency to complete SPMM tasks compared with the native SPMM implementations in Apache Spark. 
BibTeX: @article{Park2021, author = {Jueon Park and Kyungyong Lee}, title = {S-MPEC: Sparse Matrix Multiplication Performance Estimator on a Cloud Environment}, journal = {Cluster Computing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10586-021-03287-3} }  Parker J, Hill P, Dickinson D and Dudson B (2021), "Parallel tridiagonal matrix inversion with a hybrid multigrid–Thomas algorithm method", Journal of Computational and Applied Mathematics., July, 2021. , pp. 113706. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Tridiagonal matrix inversion is an important operation with many applications. It arises frequently in solving discretized one-dimensional elliptic partial differential equations, and forms the basis for many algorithms for block tridiagonal matrix inversion for discretized PDEs in higher-dimensions. In such systems, this operation is often the scaling bottleneck in parallel computation. In this paper, we derive a hybrid multigrid–Thomas algorithm designed to efficiently invert tridiagonal matrix equations in a highly-scalable fashion in the context of time evolving partial differential equation systems. We decompose the domain between processors, using multigrid to solve on a grid consisting of the boundary points of each processor’s local domain. We then reconstruct the solution on each processor using a direct solve with the Thomas algorithm. This algorithm has the same theoretical optimal scaling as cyclic reduction and recursive doubling. We use our algorithm to solve Poisson’s equation as part of the spatial discretization of a time-evolving PDE system. Our algorithm is faster than cyclic reduction per inversion and retains good scaling efficiency to twice as many cores. BibTeX: @article{Parker2021, author = {J.T. Parker and P.A. Hill and D. Dickinson and B.D. 
Dudson}, title = {Parallel tridiagonal matrix inversion with a hybrid multigrid–Thomas algorithm method}, journal = {Journal of Computational and Applied Mathematics}, publisher = {Elsevier BV}, year = {2021}, pages = {113706}, doi = {10.1016/j.cam.2021.113706} }  Pas P, Schuurmans M and Patrinos P (2021), "Alpaqa: A matrix-free solver for nonlinear MPC and large-scale nonconvex optimization", December, 2021. [Abstract] [BibTeX] Abstract: This paper presents alpaqa, an open-source C++ implementation of an augmented Lagrangian method for nonconvex constrained numerical optimization, using the first-order PANOC algorithm as inner solver. The implementation is packaged as an easy-to-use library that can be used in C++ and Python. Furthermore, two improvements to the PANOC algorithm are proposed and their effectiveness is demonstrated in NMPC applications and on the CUTEst benchmarks for numerical optimization. The source code of the alpaqa library is available at https://github.com/kul-optec/alpaqa and binary packages can be installed from https://pypi.org/project/alpaqa . BibTeX: @article{Pas2021, author = {Pieter Pas and Mathijs Schuurmans and Panagiotis Patrinos}, title = {Alpaqa: A matrix-free solver for nonlinear MPC and large-scale nonconvex optimization}, year = {2021} }  Pasadakis D, Alappat CL, Schenk O and Wellein G (2021), "Multiway p-spectral graph cuts on Grassmann manifolds", Machine Learning., November, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. We present a novel direct multiway spectral clustering algorithm in the p-norm, for p ∈ (1,2]. The problem of computing multiple eigenvectors of the graph p-Laplacian, a nonlinear generalization of the standard graph Laplacian, is recast as an unconstrained minimization problem on a Grassmann manifold. 
The value of p is reduced in a pseudocontinuous manner, promoting sparser solution vectors that correspond to optimal graph cuts as p approaches one. Monitoring the monotonic decrease of the balanced graph cuts guarantees that we obtain the best available solution from the p-levels considered. We demonstrate the effectiveness and accuracy of our algorithm in various artificial test-cases. Our numerical examples and comparative results with various state-of-the-art clustering methods indicate that the proposed method obtains high quality clusters both in terms of balanced graph cut metrics and in terms of the accuracy of the labelling assignment. Furthermore, we conduct studies for the classification of facial images and handwritten characters to demonstrate the applicability in real-world datasets. BibTeX: @article{Pasadakis2021, author = {Dimosthenis Pasadakis and Christie Louis Alappat and Olaf Schenk and Gerhard Wellein}, title = {Multiway p-spectral graph cuts on Grassmann manifolds}, journal = {Machine Learning}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10994-021-06108-1} }  Patil SV and Kulkarni DB (2021), "K-way spectral graph partitioning for load balancing in parallel computing", International Journal of Information Technology., August, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: A domain of problem-solving models the problems using graphs, for the graphs are effective representation of such problems, leading to their efficient solutions. The nodes in a graph represent a division of unit work--the computation, and the connecting edges represent communication required among the nodes to accomplish that unit work. The weight is assigned to the nodes and connecting edges for the cost incurred to compute and to collaborate, respectively. 
Graph partitioning exploits the concurrency in the problem being modeled and maps the problem onto parallel processors to guarantee efficient and load-balanced execution. The objective is to--(i) equally distribute the computations on available computing power (parallel processors) and (ii) minimize the cost of collaboration. To achieve the said objectives for any complex problem, the spectral graph partitioning is demonstrated here--that uses eigenvectors of the graph’s laplacian matrix. The results are tested via the realization of the stochastic block model. The quality of graph partitioning is tested by comparing it with ground truth results. Further, for a large-scale graph, the parallel implementation of spectral graph partitioning on GPGPU is presented. The GPGPU implementation provides better speedup with scalability. BibTeX: @article{Patil2021, author = {S. V. Patil and D. B. Kulkarni}, title = {K-way spectral graph partitioning for load balancing in parallel computing}, journal = {International Journal of Information Technology}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s41870-021-00777-w} }  Pearson JW and Potschka A (2021), "A Preconditioned Inexact Active-Set Method for Large-Scale Nonlinear Optimal Control Problems", December, 2021. [Abstract] [BibTeX] Abstract: We provide a global convergence proof of the recently proposed sequential homotopy method with an inexact Krylov--semismooth-Newton method employed as a local solver. The resulting method constitutes an active-set method in function space. After discretization, it allows for efficient application of Krylov-subspace methods. For a certain class of optimal control problems with PDE constraints, in which the control enters the Lagrangian only linearly, we propose and analyze an efficient, parallelizable, symmetric positive definite preconditioner based on a double Schur complement approach. 
We conclude with numerical results for a badly conditioned and highly nonlinear benchmark optimization problem with elliptic partial differential equations and control bounds. The resulting method is faster than using direct linear algebra for the 2D benchmark and allows for the parallel solution of large 3D problems. BibTeX: @article{Pearson2021, author = {John W. Pearson and Andreas Potschka}, title = {A Preconditioned Inexact Active-Set Method for Large-Scale Nonlinear Optimal Control Problems}, year = {2021} }  Peng R and Vempala S (2021), "Solving Sparse Linear Systems Faster than Matrix Multiplication", January, 2021. , pp. 504-521. Society for Industrial and Applied Mathematics. [Abstract] [BibTeX] [DOI] Abstract: Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an n × n linear system Ax = b is O(n^ω), where ω < 2.372864 is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly(n) condition number. In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any ω > 2. This speedup holds for any input matrix A with o(n^(ω–1)/log(κ(A))) non-zeros, where κ(A) is the condition number of A. For poly(n)-conditioned matrices with O(n) nonzeros, and the current value of ω, the bit complexity of our algorithm to solve to within any 1/poly(n) error is O(n^2.331645). Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC ‘06 ‘07] for inverting matrices over finite fields. 
In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices. BibTeX: @incollection{Peng2021, author = {Richard Peng and Santosh Vempala}, title = {Solving Sparse Linear Systems Faster than Matrix Multiplication}, publisher = {Society for Industrial and Applied Mathematics}, year = {2021}, pages = {504--521}, doi = {10.1137/1.9781611976465.31} }  Petelin G, Antoniou M and Papa G (2021), "Multi-objective approaches to ground station scheduling for optimization of communication with satellites", Optimization and Engineering., March, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: The ground station scheduling problem is a complex scheduling problem involving multiple objectives. Evolutionary techniques for multi-objective optimization are becoming popular among different fields, due to their effectiveness in obtaining a set of trade-off solutions. In contrast to some conventional methods, that aggregate the objectives into one weighted-sum objective function, multi-objective evolutionary algorithms manage to find a set of solutions in the Pareto-optimal front. Selecting one algorithm, however, for a specific problem adds additional challenge. In this paper the ground station scheduling problem was solved through six different evolutionary multi-objective algorithms, the NSGA-II, NSGA-III, SPEA2, GDE3, IBEA, and MOEA/D. The goal is to test their efficacy and performance to a number of benchmark static instances of the ground scheduling problem. Benchmark instances are of different sizes, allowing further testing of the behavior of the algorithms to different dimensionality of the problem. The solutions are compared to the recent solutions of a weighted-sum approach solved by the GA. 
The results show that all multi-objective algorithms manage to find as good solution as the weighted-sum, while giving more additional alternatives. The decomposition-based MOEA/D outperforms the rest of the algorithms for the specific problem in almost all aspects. BibTeX: @article{Petelin2021, author = {Gašper Petelin and Margarita Antoniou and Gregor Papa}, title = {Multi-objective approaches to ground station scheduling for optimization of communication with satellites}, journal = {Optimization and Engineering}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11081-021-09617-z} }  Phillips M, Kerkemeier S and Fischer P (2021), "Auto-Tuned Preconditioners for the Spectral Element Method on GPUs", October, 2021. [Abstract] [BibTeX] Abstract: The Poisson pressure solve resulting from the spectral element discretization of the incompressible Navier-Stokes equation requires fast, robust, and scalable preconditioning. In the current work, a parallel scaling study of Chebyshev-accelerated Schwarz and Jacobi preconditioning schemes is presented, with special focus on GPU architectures, such as OLCF's Summit. Convergence properties of the Chebyshev-accelerated schemes are compared with alternative methods, such as low-order preconditioners combined with algebraic multigrid. Performance and scalability results are presented for a variety of preconditioner and solver settings. The authors demonstrate that Chebyshev-accelerated-Schwarz methods provide a robust and effective smoothing strategy when using p-multigrid as a preconditioner in a Krylov-subspace projector. At the same time, optimal preconditioning parameters can vary for different geometries, problem sizes, and processor counts. This variance motivates the development of an autotuner to optimize solver parameters on-line, during the course of production simulations. 
BibTeX: @article{Phillips2021, author = {Malachi Phillips and Stefan Kerkemeier and Paul Fischer}, title = {Auto-Tuned Preconditioners for the Spectral Element Method on GPUs}, year = {2021} }  Pistikopoulos EN, Barbosa-Povoa A, Lee JH, Misener R, Mitsos A, Reklaitis GV, Venkatasubramanian V, You F and Gani R (2021), "Process Systems Engineering – The Generation Next?", Computers & Chemical Engineering., February, 2021. , pp. 107252. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Process Systems Engineering (PSE) is the scientific discipline of integrating scales and components describing the behavior of a physicochemical system, via mathematical modelling, data analytics, design, optimization and control. PSE provides the ‘glue’ within scientific chemical engineering, and offers a scientific basis and computational tools towards addressing contemporary and future challenges such as in energy, environment, the ‘industry of tomorrow’ and sustainability. This perspective article offers a guide towards the next generation of PSE developments by looking at its history, core competencies, current status and ongoing trends. BibTeX: @article{Pistikopoulos2021, author = {E N Pistikopoulos and Ana Barbosa-Povoa and Jay H Lee and Ruth Misener and Alexander Mitsos and G V Reklaitis and V Venkatasubramanian and Fengqi You and Rafiqul Gani}, title = {Process Systems Engineering – The Generation Next?}, journal = {Computers & Chemical Engineering}, publisher = {Elsevier BV}, year = {2021}, pages = {107252}, doi = {10.1016/j.compchemeng.2021.107252} }  Qi B, Komatsu K, Sato M and Kobayashi H (2021), "A dynamic parameter tuning method for SpMM parallel execution", Concurrency and Computation: Practice and Experience., December, 2021. Wiley. [Abstract] [BibTeX] [DOI] Abstract: Sparse matrix-matrix multiplication (SpMM) is a basic kernel that is used by many algorithms. Several researches focus on various optimizations for SpMM parallel execution. 
However, a division of a task for parallelization is not well considered yet. Generally, a matrix is equally divided into blocks for processes even though the sparsities of input matrices are different. The parameter that divides a task into multiple processes for parallelization is fixed. As a result, load imbalance among the processes occurs. To balance the loads among the processes, this article proposes a dynamic parameter tuning method by analyzing the sparsities of input matrices. The experimental results show that the proposed method improves the performance of SpMM for examined matrices by up to 39.5% on a single vector engine and 3.49× on a single CPU. BibTeX: @article{Qi2021, author = {Bin Qi and Kazuhiko Komatsu and Masayuki Sato and Hiroaki Kobayashi}, title = {A dynamic parameter tuning method for SpMM parallel execution}, journal = {Concurrency and Computation: Practice and Experience}, publisher = {Wiley}, year = {2021}, doi = {10.1002/cpe.6755} }  Qian X (2021), "Graph processing and machine learning architectures with emerging memory technologies: a survey", Science China Information Sciences., May, 2021. Vol. 64(6) Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: This paper surveys domain-specific architectures (DSAs) built from two emerging memory technologies. Hybrid memory cube (HMC) and high bandwidth memory (HBM) can reduce data movement between memory and computation by placing computing logic inside memory dies. On the other hand, the emerging non-volatile memory, metal-oxide resistive random access memory (ReRAM) has been considered as a promising candidate for future memory architecture due to its high density, fast read access and low leakage power. The key feature is ReRAM’s capability to perform the inherently parallel in-situ matrix-vector multiplication in the analog domain. We focus on the DSAs for two important applications—graph processing and machine learning acceleration. 
Based on the understanding of the recent architectures and our research experience, we also discuss several potential research directions. BibTeX: @article{Qian2021, author = {Xuehai Qian}, title = {Graph processing and machine learning architectures with emerging memory technologies: a survey}, journal = {Science China Information Sciences}, publisher = {Springer Science and Business Media LLC}, year = {2021}, volume = {64}, number = {6}, doi = {10.1007/s11432-020-3219-6} }  Qiu J, Dhulipala L, Tang J, Peng R and Wang C (2021), "LightNE: A Lightweight Graph Processing System for Network Embedding", In Proceedings of the 2021 International Conference on Management of Data., June, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: We propose LightNE, a cost-effective, scalable, and high quality network embedding system that scales to graphs with hundreds of billions of edges on a single machine. In contrast to the mainstream belief that distributed architecture and GPUs are needed for large-scale network embedding with good quality, we prove that we can achieve higher quality, better scalability, lower cost and faster runtime with shared-memory, CPU-only architecture. LightNE combines two theoretically grounded embedding methods NetSMF and ProNE. We introduce the following techniques to network embedding for the first time: (1) a newly proposed downsampling method to reduce the sample complexity of NetSMF while preserving its theoretical advantages; (2) a high-performance parallel graph processing stack GBBS to achieve high memory efficiency and scalability; (3) sparse parallel hash table to aggregate and maintain the matrix sparsifier in memory; and (4) Intel MKL for efficient randomized SVD and spectral propagation. 
BibTeX: @inproceedings{Qiu2021, author = {Jiezhong Qiu and Laxman Dhulipala and Jie Tang and Richard Peng and Chi Wang}, title = {LightNE: A Lightweight Graph Processing System for Network Embedding}, booktitle = {Proceedings of the 2021 International Conference on Management of Data}, publisher = {ACM}, year = {2021}, doi = {10.1145/3448016.3457329} }  Rajamanickam S, Acer S, Berger-Vergiat L, Dang V, Ellingwood N, Harvey E, Kelley B, Trott CR, Wilke J and Yamazaki I (2021), "Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels", March, 2021. [Abstract] [BibTeX] Abstract: As hardware architectures are evolving in the push towards exascale, developing Computational Science and Engineering (CSE) applications depend on performance portable approaches for sustainable software development. This paper describes one aspect of performance portability with respect to developing a portable library of kernels that serve the needs of several CSE applications and software frameworks. We describe Kokkos Kernels, a library of kernels for sparse linear algebra, dense linear algebra and graph kernels. We describe the design principles of such a library and demonstrate portable performance of the library using some selected kernels. Specifically, we demonstrate the performance of four sparse kernels, three dense batched kernels, two graph kernels and one team level algorithm. BibTeX: @article{Rajamanickam2021, author = {Sivasankaran Rajamanickam and Seher Acer and Luc Berger-Vergiat and Vinh Dang and Nathan Ellingwood and Evan Harvey and Brian Kelley and Christian R. Trott and Jeremiah Wilke and Ichitaro Yamazaki}, title = {Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels}, year = {2021} }  Ramezani M, Sivasubramaniam A and Kandemir MT (2021), "GraphGuess: Approximate Graph Processing System with Adaptive Correction", April, 2021. 
[Abstract] [BibTeX] Abstract: Graph-based data structures have drawn great attention in recent years. The large and rapidly growing trend on developing graph processing systems focus mostly on improving the performance by preprocessing the input graph and modifying its layout. These systems usually take several hours to days to complete processing a single graph on high-end machines, let alone the overhead of pre-processing which most of the time can be dominant. Yet for most of graph applications the exact answer is not always crucial, and providing a rough estimate of the final result is adequate. Approximate computing is introduced to trade off accuracy of results for computation or energy savings that could not be achieved by conventional techniques alone. Although various computing platforms and application domains benefit from approximate computing, it has not been thoroughly explored yet in the context of graph processing systems. In this work, we design, implement and evaluate GraphGuess, inspired from the domain of approximate graph theory and extend it to a general, practical graph processing system. GraphGuess is essentially an approximate graph processing technique with adaptive correction, which can be implemented on top of any graph processing system. We build a vertex-centric processing system based on GraphGuess, where it allows the user to trade off accuracy for better performance. Our experimental studies show that using GraphGuess can significantly reduce the processing time for large scale graphs while maintaining high accuracy. BibTeX: @article{Ramezani2021, author = {Morteza Ramezani and Anand Sivasubramaniam and Mahmut T. 
Kandemir}, title = {GraphGuess: Approximate Graph Processing System with Adaptive Correction}, year = {2021} }  Rasouli M, Kirby RM and Sundar H (2021), "A Compressed, Divide and Conquer Algorithm for Scalable Distributed Matrix-Matrix Multiplication", In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. [Abstract] [BibTeX] Abstract: Matrix-matrix multiplication (GEMM) is a widely used linear algebra primitive common in scientific computing and data sciences. While several highly-tuned libraries and implementations exist, these typically target either sparse or dense matrices. The performance of these tuned implementations on unsupported types can be poor, and this is critical in cases where the structure of the computations is associated with varying degrees of sparsity. One such example is Algebraic Multigrid (AMG), a popular solver and preconditioner for large sparse linear systems. In this work, we present a new divide and conquer sparse GEMM, that is also highly performant and scalable when the matrix becomes dense, as in the case of AMG matrix hierarchies. In addition, we implement a lossless data compression method to reduce the communication cost. We combine this with an efficient communication pattern during distributed-memory GEMM to provide 2.24 times (on average) better performance than the state-of-the-art library PETSc. Additionally, we show that the performance and scalability of our method surpass PETSc even more when the density of the matrix increases. We demonstrate the efficacy of our methods by comparing our GEMM with PETSc on a wide range of matrices. BibTeX: @inproceedings{Rasouli2021, author = {Majid Rasouli and Robert M. 
Kirby and Hari Sundar}, title = {A Compressed, Divide and Conquer Algorithm for Scalable Distributed Matrix-Matrix Multiplication}, booktitle = {Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region}, year = {2021} }  Rees T and Wathen M (2021), "An Element-Based Preconditioner for Mixed Finite Element Problems", January, 2021. Vol. 43(5), pp. S884-S907. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: We introduce a new and generic approximation to Schur complements arising from inf-sup stable mixed finite element discretizations of self-adjoint multiphysics problems. The approximation exploits the discretization mesh by forming local, or element, Schur complements of an appropriate system and projecting them back to the global degrees of freedom. The resulting Schur complement approximation is sparse, has low construction cost (with the same order of operations as assembling a general finite element matrix), and can be solved using off-the-shelf techniques, such as multigrid. Using results from saddle point theory, we give conditions such that this approximation is spectrally equivalent to the global Schur complement. We present several numerical results to demonstrate the viability of this approach on a range of applications. Interestingly, numerical results show that the method gives an effective approximation to the nonsymmetric Schur complement from the steady state Navier--Stokes equations. BibTeX: @article{Rees2021, author = {Tyrone Rees and Michael Wathen}, title = {An Element-Based Preconditioner for Mixed Finite Element Problems}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, volume = {43}, number = {5}, pages = {S884--S907}, doi = {10.1137/20m1336461} }  Regev S, Chiang N-Y, Darve E, Petra CG, Saunders MA, Świrydowicz K and Peleš S (2021), "A Hybrid Direct-Iterative Method for Solving KKT Linear Systems", October, 2021. 
[Abstract] [BibTeX] Abstract: We propose a solution strategy for linear systems arising in interior method optimization, which is suitable for implementation on hardware accelerators such as graphical processing units (GPUs). The current gold standard for solving these systems is the LDL^T factorization. However, LDL^T requires pivoting during factorization, which substantially increases communication cost and degrades performance on GPUs. Our novel approach solves a large indefinite system by solving multiple smaller positive definite systems, using an iterative solve for the Schur complement and an inner direct solve (via Cholesky factorization) within each iteration. Cholesky is stable without pivoting, thereby reducing communication and allowing reuse of the symbolic factorization. We demonstrate the practicality of our approach and show that on large systems it can efficiently utilize GPUs and outperform LDL^T factorization of the full system. BibTeX: @article{Regev2021, author = {Shaked Regev and Nai-Yuan Chiang and Eric Darve and Cosmin G. Petra and Michael A. Saunders and Kasia Świrydowicz and Slaven Peleš}, title = {A Hybrid Direct-Iterative Method for Solving KKT Linear Systems}, year = {2021} }  Rehfeldt D, Hobbie H, Schönheit D, Koch T, Möst D and Gleixner A (2021), "A massively parallel interior-point solver for LPs with generalized arrowhead structure, and applications to energy system models", European Journal of Operational Research., July, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Linear energy system models are a crucial component of energy system design and operations, as well as energy policy consulting. If detailed enough, such models lead to large-scale linear programs, which can be intractable even for the best state-of-the-art solvers. This article introduces an interior-point solver that exploits common structures of energy system models to efficiently run in parallel on distributed-memory systems. 
The solver is designed for linear programs with doubly-bordered block-diagonal constraint matrix and makes use of a Schur complement based decomposition. In order to handle the large number of linking constraints and variables commonly observed in energy system models, a distributed Schur complement preconditioner is used. In addition, the solver features a number of more generic techniques such as parallel matrix scaling and structure-preserving presolving. The implementation is based on the solver PIPS-IPM. We evaluate the computational performance on energy system models with up to four billion nonzero entries in the constraint matrix--and up to one billion columns and one billion rows. This article mainly concentrates on the energy system model ELMOD, which is a linear optimization model representing the European electricity markets by the use of a nodal pricing market-clearing. It has been widely applied in the literature on energy system analyses in recent years. However, it will be demonstrated that the new solver is also applicable to other energy system models. BibTeX: @article{Rehfeldt2021, author = {Daniel Rehfeldt and Hannes Hobbie and David Schönheit and Thorsten Koch and Dominik Möst and Ambros Gleixner}, title = {A massively parallel interior-point solver for LPs with generalized arrowhead structure, and applications to energy system models}, journal = {European Journal of Operational Research}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.ejor.2021.06.063} }  Rekatsinas C (2022), "A fast global nodewise mass matrix inversion framework tailored for sparse block-diagonal systems", Thin-Walled Structures., March, 2022. Vol. 172, pp. 108700. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The present work aims to elucidate the benefits of high-order layerwise mechanics in explicit dynamic and eigen frequency problems applicable to thin-walled structures. 
High order layerwise mechanics are mainly used for the solution of linear and non-linear static problems due to their ability to predict accurately the strain and stress fields. The increased accuracy of high order layerwise mechanics is attributed to the coupling between rotational–translational degrees of freedom; but simultaneously consists their major drawback when it comes to the formulation of the mass matrix. That is, while the 2D and 3D continuum elements can form diagonal mass matrices, which are easily inverted, in the case of high order layerwise elements block-diagonal mass matrices are formulated due to the coupling of the degrees of freedom. Even if the sparsity of block-diagonal matrices is large, the solution time of explicit dynamics and eigen frequency problems is significantly increased. The present work aims to alleviate this issue by proposing an efficient numerical framework, adopting a global nodewise mass matrix inversion procedure which relies on the block-diagonal nature of the matrix, attained by exploiting the Gauss–Lobatto–Legendre quadrature. Applications of sandwich structures are considered and multiphysics explicit dynamics and modal analyses are conducted exploiting the proposed mass inversion methodology by accelerating the solution procedure in computationally demanding problems focused on thin-walled structures made of different materials through the thickness by incorporating high order layerwise mechanics. BibTeX: @article{Rekatsinas2022, author = {C.S. 
Rekatsinas}, title = {A fast global nodewise mass matrix inversion framework tailored for sparse block-diagonal systems}, journal = {Thin-Walled Structures}, publisher = {Elsevier BV}, year = {2022}, volume = {172}, pages = {108700}, doi = {10.1016/j.tws.2021.108700} }  Richter M, Tanner G, Carpentieri B and Chappell DJ (2021), "Solving linear systems from dynamical energy analysis - using and reusing preconditioners", INTER-NOISE and NOISE-CON Congress and Conference Proceedings., August, 2021. Vol. 263(5), pp. 1041-1052. Institute of Noise Control Engineering (INCE). [Abstract] [BibTeX] [DOI] Abstract: Dynamical energy analysis (DEA) is a computational method to address high-frequency vibro-acoustics in terms of ray densities. It has been used to describe wave equations governing structure-borne sound in two-dimensional shell elements as well as three-dimensional electrodynamics. To describe either of those problems, the wave equation is reformulated as a propagation of boundary densities. These densities are expressed by finite dimensional approximations. All use-cases have in common that they describe the resulting linear problem using a very large matrix which is block-sparse, often real-valued, but non-symmetric. In order to efficiently use DEA, it is therefore important to also address the performance of solving the corresponding linear system. We will cover three aspects in order to reduce the computational time: The use of preconditioners, properly chosen initial conditions, and choice of iterative solvers. Especially the aspect of potentially reusing preconditioners for different input parameters is investigated. BibTeX: @article{Richter2021, author = {Martin Richter and Gregor Tanner and Bruno Carpentieri and David J. 
Chappell}, title = {Solving linear systems from dynamical energy analysis - using and reusing preconditioners}, journal = {INTER-NOISE and NOISE-CON Congress and Conference Proceedings}, publisher = {Institute of Noise Control Engineering (INCE)}, year = {2021}, volume = {263}, number = {5}, pages = {1041--1052}, doi = {10.3397/in-2021-1740} }  Rolinger TB, Krieger CD and Sussman A (2021), "Optimizing Memory-Compute Colocation for Irregular Applications on a Migratory Thread Architecture", In Proceedings of the 35th International Parallel & Distributed Processing Symposium. [Abstract] [BibTeX] Abstract: The movement of data between memory and processors has become a performance bottleneck for many applications. This is made worse for applications with sparse and irregular memory accesses, as they exhibit weak locality and make poor utilization of cache. As a result, colocating memory and compute is crucial for achieving high performance on irregular applications. There are two paradigms for memory-compute colocation. The first is the conventional approach of moving the data to the compute. The second paradigm is to move the compute to the data, which is less conventional and not as well understood. An example are migratory threads, which physically relocate upon remote accesses to the compute resource that hosts the data. In this paper, we explore the paradigm of moving compute to the data by optimizing memory-compute colocation for irregular applications on a migratory thread architecture. Our optimization method includes both initial data placement as well as data replication. We evaluate our optimization on sparse matrix-vector multiply (SpMV) and sparse matrix-matrix multiply (SpGEMM). Our results show that we can achieve speed-ups as high as 4.2x on SpMV and 6x on SpGEMM when compared to the default data layout. We also highlight that our optimization to improve memory-compute colocation can be applicable to both migratory threads and more conventional systems. 
To this end, we evaluate our optimization approach on a conventional compute cluster using the Chapel programming language. We demonstrate speed-ups as high as 18x for SpMV. BibTeX: @inproceedings{Rolinger2021, author = {Rolinger, Thomas B. and Krieger, Christopher D. and Sussman, Alan}, title = {Optimizing Memory-Compute Colocation for Irregular Applications on a Migratory Thread Architecture}, booktitle = {Proceedings of the 35th International Parallel & Distributed Processing Symposium}, year = {2021} }  Rolinger TB, Craft J, Krieger CD and Sussman A (2021), "Towards High Productivity and Performance for Irregular Applications in Chapel", In Proceedings of the 4th Annual Parallel Applications Workshop, Alternatives To MPI+X. [Abstract] [BibTeX] [URL] Abstract: Large scale irregular applications, such as sparse linear algebra and graph analytics, exhibit fine-grained memory access patterns and operate on very large data sets. The Partitioned Global Address Space (PGAS) model simplifies the development of distributed-memory irregular applications, as all the memory in the system is viewed logically as a single shared address space. The Chapel programming language provides a PGAS programming model and offers high productivity for irregular application developers, as remote communication is performed implicitly. However, irregular applications written in Chapel often struggle to achieve high performance due to implicit fine-grained remote communication. In this work, we explore techniques to bridge the gap between high productivity and high performance for irregular applications using the Chapel programming language. We present high-level implementations of the Breadth First Search (BFS) and PageRank applications. We then describe optimized versions that utilize message aggregation and data replication in ways that could potentially be applied automatically, improving performance by as much as 1,219x for BFS and 22x for PageRank. 
When compared to MPI+OpenMP implementations that employ optimizations of the same type as those applied to the Chapel codes, our optimized code is 3.7x faster on average for BFS but 1.3x slower for PageRank. BibTeX: @inproceedings{Rolinger2021a, author = {Thomas B. Rolinger and Joseph Craft and Christopher D. Krieger and Alan Sussman}, title = {Towards High Productivity and Performance for Irregular Applications in Chapel}, booktitle = {Proceedings of the 4th Annual Parallel Applications Workshop, Alternatives To MPI+X}, year = {2021}, url = {https://www.researchgate.net/profile/Thomas-Rolinger/publication/356907247_Towards_High_Productivity_and_Performance_for_Irregular_Applications_in_Chapel/links/61b22cf7ec18aa3c19e35625/Towards-High-Productivity-and-Performance-for-Irregular-Applications-in-Chapel.pdf} }  Rusu C (2021), "An iterative Jacobi-like algorithm to compute a few sparse eigenvalue-eigenvector pairs", May, 2021. [Abstract] [BibTeX] Abstract: In this paper, we describe a new algorithm to compute the extreme eigenvalue/eigenvector pairs of a symmetric matrix. The proposed algorithm can be viewed as an extension of the Jacobi transformation method for symmetric matrix diagonalization to the case where we want to compute just a few eigenvalues/eigenvectors. The method is also particularly well suited for the computation of sparse eigenspaces. We show the effectiveness of the method for sparse low-rank approximations and show applications to random symmetric matrices, graph Fourier transforms, and with the sparse principal component analysis in image classification experiments. BibTeX: @article{Rusu2021, author = {Cristian Rusu}, title = {An iterative Jacobi-like algorithm to compute a few sparse eigenvalue-eigenvector pairs}, year = {2021} }  Saad-Eldin A, Pedigo BD, Priebe CE and Vogelstein JT (2021), "Graph Matching via Optimal Transport", November, 2021. 
[Abstract] [BibTeX] Abstract: The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving the graph matching is increasingly important due to its applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good accuracy. The main computational bottleneck of these algorithms is the linear assignment problem, which must be solved at each iteration. In this paper, we leverage the recent advances in the field of optimal transport to replace the accepted use of linear assignment algorithms. We present GOAT, a modification to the state-of-the-art graph matching approximation algorithm "FAQ" (Vogelstein, 2015), replacing its linear sum assignment step with the "Lightspeed Optimal Transport" method of Cuturi (2013). The modification provides improvements to both speed and empirical matching accuracy. The effectiveness of the approach is demonstrated in matching graphs in simulated and real data examples. BibTeX: @article{SaadEldin2021, author = {Ali Saad-Eldin and Benjamin D. Pedigo and Carey E. Priebe and Joshua T. Vogelstein}, title = {Graph Matching via Optimal Transport}, year = {2021} }  Saeed YA, Ismail RA and Al-Haj Baddar SW (2021), "D-SAG: A Distributed Sort-Based Algorithm for Graph Clustering", Arabian Journal for Science and Engineering. [Abstract] [BibTeX] [DOI] [URL] Abstract: Graph clustering has become a mainstream branch of computing due to its necessity for solving a wide range of problems nowadays. Thus, harnessing the capabilities of parallel and distributed computing has become instrumental. In this work, we introduce SAG, a quasilinear sort-based algorithm for graph clustering that maps naturally to distributed and/or parallel architectures. The main idea behind SAG is that nodes within a cluster naturally have similar adjacent nodes. 
Experiments on graphs with varying sizes compared SAG to its distributed counter-part, D-SAG, in terms of execution time, space, speedup, efficiency, and cost. Results showed the superiority of D-SAG in terms of execution time for graphs with more than 0.2 × 10^6 nodes. Moreover, the best speedup D-SAG achieved was 3.7-fold for synthetic graphs and 3.96-fold for real-world graphs, both using 6 computers. BibTeX: @article{Saeed2021, author = {Saeed, Yaman A. and Ismail, Raed A. and Al-Haj Baddar, Sherenaz W.}, title = {D-SAG: A Distributed Sort-Based Algorithm for Graph Clustering}, journal = {Arabian Journal for Science and Engineering}, year = {2021}, url = {https://doi.org/10.1007/s13369-021-05664-x}, doi = {10.1007/s13369-021-05664-x} }  Sáez RC (2021), "Analysis of Parallelization Strategies in the context of Hierarchical Matrix Factorizations". Thesis at: Universitat Jaume I. [Abstract] [BibTeX] Abstract: ℋ-Matrices were born as a powerful numerical tool to tackle applications whose data generates structures that end laying in between dense and sparse scenarios. The key benefit that makes ℋ-Matrices valuable is the savings that offer both in terms of storage and computations, in such a way that they are reduced to log-linear costs. The key behind the success of ℋ-Matrices is the controllable compression they offer: by choosing the appropriate admissibility condition to discern important versus dispensable data and designing good partitioning algorithms, one can choose the accuracy loss that wants to assume in exchange for computations acceleration and memory consumption reduction. 
This is the reason why ℋ-Matrices are specially suitable for boundary element and finite element methods where the pursued result does not need to be totally accurate, but it is important to have it ready as fast as possible, as it can determine, for example, whether an engineering design is ready to be produced or needs to be improved. On their side, task-parallelism has proved sufficiently its benefits when being employed to optimize the parallel execution when solving linear systems of equations. Particularly, tiled or block algorithms combined with this parallelism strategy have widely been (and are still) employed to provide the scientific community with powerful and efficient parallel solutions for multicore architectures. The main objective of this thesis is designing, implementing and evaluating parallel algorithms to operate efficiently with ℋ-Matrices in multicore architectures. To this end, the first contribution we describe is a study in which we prove that task-parallelism is suitable for operating with ℋ-Matrices, by simplifying as much as possible the ℋ-Arithmetic scenario. Next, we describe in detail the difficulties that need to be addressed when parallelizing the complex implementations that operate with this type of matrices. Afterwards, we explain how the new features included in OmpSs-2 programming model helped us avoiding the majority of the described issues and thus we were able to attain a fair efficiency when executing a task-parallel ℋ-LU. Lastly, we illustrate how we explored a regularized version of ℋ-Matrices, which we call Tile ℋ-Matrices, in which we are able to maintain competitive-with-pure-ℋ-Matrices precision and compression ratios, while leveraging the well known benefits of tile algorithms applied to matrices provided with (regular) tiles (this is, mostly homogeneous block dimensions). 
BibTeX: @phdthesis{Saez2021, author = {Rocío Carratalá Sáez}, title = {Analysis of Parallelization Strategies in the context of Hierarchical Matrix Factorizations}, school = {Universitat Jaume I}, year = {2021} }  Sala R, Schlüter A, Sator C and Müller R (2022), "Unifying relations between iterative linear equation solvers and explicit Euler approximations for associated parabolic regularized equations", Results in Applied Mathematics., February, 2022. Vol. 13, pp. 100227. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Iterative methods to solve linear equation systems are widely used in computational physics, engineering and many areas of applied mathematics. In recent works, performance improvements have been achieved based on modifications of several classes of iterative algorithms by various research communities driven by different perspectives and applications. This note presents a brief analysis of conventional and unifying perspectives by highlighting relations between several well-known iterative methods to solve linear equation systems and explicit Euler approximations of the associated parabolic regularized equations. Special cases of equivalence and general relations between different iterative methods such as Jacobi iterations, Richardson iterations, Steepest Descent and Quasi-Newton methods are shown and discussed. The results and discussion extend the conventional perspectives on these iterative methods and give way to intuitive physical interpretations and analogies. The accessibly presented relations give complementary educational insights and aim to inspire transdisciplinary developments of new iterative methods, solvers and preconditioners. BibTeX: @article{Sala2022, author = {R. Sala and A. Schlüter and C. Sator and R. 
Müller}, title = {Unifying relations between iterative linear equation solvers and explicit Euler approximations for associated parabolic regularized equations}, journal = {Results in Applied Mathematics}, publisher = {Elsevier BV}, year = {2022}, volume = {13}, pages = {100227}, doi = {10.1016/j.rinam.2021.100227} }  Salim A, Condat L, Kovalev D and Richtárik P (2021), "An Optimal Algorithm for Strongly Convex Minimization under Affine Constraints", February, 2021. [Abstract] [BibTeX] Abstract: Optimization problems under affine constraints appear in various areas of machine learning. We consider the task of minimizing a smooth strongly convex function F(x) under the affine constraint K x = b, with an oracle providing evaluations of the gradient of F and matrix-vector multiplications by K and its transpose. We provide lower bounds on the number of gradient computations and matrix-vector multiplications to achieve a given accuracy. Then we propose an accelerated primal-dual algorithm achieving these lower bounds. Our algorithm is the first optimal algorithm for this class of problems. BibTeX: @article{Salim2021, author = {Adil Salim and Laurent Condat and Dmitry Kovalev and Peter Richtárik}, title = {An Optimal Algorithm for Strongly Convex Minimization under Affine Constraints}, year = {2021} }  Sandhu P, Verbrugge C and Hendren L (2021), "A Hybrid Synchronization Mechanism for Parallel Sparse Triangular Solve", In Proceedings of the 34th International Workshop on Languages and Compilers for Parallel Computing. [Abstract] [BibTeX] [URL] Abstract: Sparse triangular solve, SpTS, is an important and recurring component of many sparse linear solvers that are extensively used in many big-data analytics and machine learning algorithms. Despite its inherent sequential execution, a number of parallel algorithms like level-set and synchronization-free have been proposed. 
The coarse-grained synchronization mechanism of the level-set method uses a synchronization barrier between the generated level-sets, while the fine-grained synchronization approach of the sync-free algorithm makes use of atomic operations for each non-zero access. Both the synchronization mechanisms can prove to be expensive on CPUs for different sparsity structures of the matrices. We propose a novel and efficient synchronization approach which brings out the best of these two algorithms by avoiding the synchronization barrier while minimizing the use of atomic operations. Our web-based and parallel SpTS implementation with this hybrid synchronization mechanism, tested on around 2000 real-life sparse matrices, shows impressive performance speedups for a number of matrices over the classic level-set implementation. BibTeX: @inproceedings{Sandhu2021, author = {Prabhjot Sandhu and Clark Verbrugge and Laurie Hendren}, title = {A Hybrid Synchronization Mechanism for Parallel Sparse Triangular Solve}, booktitle = {Proceedings of the 34th International Workshop on Languages and Compilers for Parallel Computing}, year = {2021}, url = {https://lcpc2021.github.io/pre_workshop_papers/Sandhu_lcpc21.pdf} }  Santos FFD, Brandalero M, Sullivan M, Junior RLR, Basso PM, Hubner PM, Carro L and Rech P (2021), "Reduced Precision DWC: an Efficient Hardening Strategy for Mixed-Precision Architectures", IEEE Transactions on Computers. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Duplication with Comparison (DWC) is an effective software-level solution to improve the reliability of computing devices. However, it introduces significant performance and energy consumption overheads that could render the protected application unsuitable for high-performance computing or real-time safety-critical applications. 
Modern computing architectures offer the possibility to execute operations in various precisions, and recent NVIDIA GPUs even feature dedicated functional units for computing with programmable accuracy. In this work, we propose Reduced-Precision Duplication with Comparison (RP-DWC) as a means to leverage the available mixed-precision hardware resources to implement software-level fault detection with reduced overheads. We discuss the benefits and challenges associated with RP-DWC and show that the intrinsic difference between the mixed-precision copies allows for the detection of most, but not all, errors. However, as the undetected faults are the ones that fall into the difference between precisions, they are the ones that produce a much smaller impact in the application output. We investigate, through fault injection and beam experiment campaigns, using three microbenchmarks and two real applications on Volta GPUs, RP-DWC impact into fault detection, performance, and energy consumption. We show that RP-DWC achieves an excellent coverage (up to 86%) with minimal overheads (0.1% time and 24% energy consumption overhead). BibTeX: @article{Santos2021, author = {Fernando Fernandes Dos Santos and Marcelo Brandalero and Michael Sullivan and Rubens Luiz Rech Junior and Pedro Martins Basso and Prof. Michael Hubner and Luigi Carro and Paolo Rech}, title = {Reduced Precision DWC: an Efficient Hardening Strategy for Mixed-Precision Architectures}, journal = {IEEE Transactions on Computers}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tc.2021.3058872} }  Sateesan A, Sinha S, Smitha K. G and Vinod AP (2021), "A Survey of Algorithmic and Hardware Optimization Techniques for Vision Convolutional Neural Networks on FPGAs", Neural Processing Letters. [Abstract] [BibTeX] Abstract: In today’s world, the applications of convolutional neural networks (CNN) are limitless and are employed in numerous fields. 
The CNNs get wider and deeper to achieve near-human accuracy. Implementing such networks on resource constrained hardware is a cumbersome task. CNNs need to be optimized both on hardware and algorithmic levels to compress and fit into resource limited devices. This survey aims to investigate different optimization techniques of Vision CNNs, both on algorithmic and hardware level, which would help in efficient hardware implementation, especially for FPGAs. BibTeX: @article{Sateesan2021, author = {Arish Sateesan and Sharad Sinha and Smitha K. G. and A. P. Vinod}, title = {A Survey of Algorithmic and Hardware Optimization Techniques for Vision Convolutional Neural Networks on FPGAs}, journal = {Neural Processing Letters}, year = {2021} }  Schuman CD, Kay B, Date P, Kannan R, Sao P and Potok TE (2021), "Sparse Binary Matrix-Vector Multiplication on Neuromorphic Computers", In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops., June, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Neuromorphic computers offer the opportunity for low-power, efficient computation. Though they have been primarily applied to neural network tasks, there is also the opportunity to leverage the inherent characteristics of neuromorphic computers (low power, massive parallelism, collocated processing and memory) to perform non-neural network tasks. Here, we demonstrate an approach for performing sparse binary matrix-vector multiplication on neuromorphic computers. We describe the approach, which relies on the connection between binary matrix-vector multiplication and breadth first search, and we introduce the algorithm for performing this calculation in a neuromorphic way. We validate the approach in simulation. Finally, we provide a discussion of the runtime of this algorithm and discuss where neuromorphic computers in the future may have a computational advantage when performing this computation. 
BibTeX: @inproceedings{Schuman2021, author = {Catherine D. Schuman and Bill Kay and Prasanna Date and Ramakrishnan Kannan and Piyush Sao and Thomas E. Potok}, title = {Sparse Binary Matrix-Vector Multiplication on Neuromorphic Computers}, booktitle = {Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdpsw52791.2021.00054} }  Scola L (2021), "Artificial Intelligence Against Climate Change", In Lecture Notes in Networks and Systems. , pp. 378-397. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: The industrial, transportation, and residential sectors draw the most energy in the United States. With most energy created by burning fossil fuels, a highly inefficient method of energy creation, global greenhouse gas levels are rising, raising the temperature of the earth, causing natural processes to become unbalanced. The health of the earth is declining. The rise of technology and persisting growth of computing devices known as the Internet of Things (IoT) and increasing automation of systems through Artificial Intelligence (AI) and Machine Learning (ML) is a factor of energy expenditure as more humans desire devices and more systems are built. The ethical implications of utilizing new technology should be evaluated before creating more. This paper explores modern computing systems in the sectors that draw the most energy, and, more specifically, the role AI and IoT play in them. Each sector may become more energy efficient, productive, and safer by introducing edge computing through IoT devices and coupling it with AI computing abilities that already automate most processes. Multiple studies show energy consumption and costs are lowered when edge computing is paired with the IoT and AI. There is less human involvement, more regularity in execution and performance, and more widespread use because of the accessibility. 
This creates safer, cheaper, energy-efficient systems that utilize existing technology. The ethical implications of these systems are much more positive than what already exists. Coupling the power of AI with the IoT will reduce energy expenditure in modern systems and create a more sustainable world. BibTeX: @incollection{Scola2021, author = {Leila Scola}, title = {Artificial Intelligence Against Climate Change}, booktitle = {Lecture Notes in Networks and Systems}, publisher = {Springer International Publishing}, year = {2021}, pages = {378--397}, doi = {10.1007/978-3-030-80126-7_29} }  Scott J and Tůma M (2021), "A Null-Space Approach for Large-Scale Symmetric Saddle Point Systems with a Small and Non Zero (2,2) Block" [Abstract] [BibTeX] [URL] Abstract: Null-space methods have long been used to solve large sparse n × n symmetric saddle point systems of equations in which the (2, 2) block is zero. This paper focuses on the case where the (1, 1) block is ill conditioned or rank deficient and the k × k (2, 2) block is non zero and small (k ≪ n). Such systems arise in a range of practical applications, including sparse-dense linear least squares problems. A novel null-space approach is proposed that transforms the system matrix into a nicer symmetric saddle point matrix of order n that has a non zero (2, 2) block of order at most 2k and, importantly, the (1, 1) block is positive definite. Success of any null-space approach depends on constructing a suitable null-space basis. We propose new methods for constructing such bases for wide matrices that have far fewer rows than columns with the aim of balancing stability of the transformed saddle point matrix with preserving sparsity in the (1, 1) block. Linear least squares problems that contain a small number of dense rows are used to illustrate our ideas and to explore their potential for solving large-scale systems. 
BibTeX: @article{Scott2021, author = {Jennifer Scott and Miroslav Tůma}, title = {A Null-Space Approach for Large-Scale Symmetric Saddle Point Systems with a Small and Non Zero (2,2) Block}, year = {2021}, url = {https://www2.karlin.mff.cuni.cz/~mirektuma/ps/null.pdf} }  Scott CB and Mjolsness E (2021), "Graph diffusion distance: Properties and efficient computation", PLOS ONE., April, 2021. Vol. 16(4), pp. e0249624. Public Library of Science (PLoS). [Abstract] [BibTeX] [DOI] Abstract: We define a new family of similarity and distance measures on graphs, and explore their theoretical properties in comparison to conventional distance metrics. These measures are defined by the solution(s) to an optimization problem which attempts to find a map minimizing the discrepancy between two graph Laplacian exponential matrices, under norm-preserving and sparsity constraints. Variants of the distance metric are introduced to consider such optimized maps under sparsity constraints as well as fixed time-scaling between the two Laplacians. The objective function of this optimization is multimodal and has discontinuous slope, and is hence difficult for univariate optimizers to solve. We demonstrate a novel procedure for efficiently calculating these optima for two of our distance measure variants. We present numerical experiments demonstrating that (a) upper bounds of our distance metrics can be used to distinguish between lineages of related graphs; (b) our procedure is faster at finding the required optima, by as much as a factor of 10^3; and (c) the upper bounds satisfy the triangle inequality exactly under some assumptions and approximately under others. We also derive an upper bound for the distance between two graph products, in terms of the distance between the two pairs of factors. Additionally, we present several possible applications, including the construction of infinite “graph limits” by means of Cauchy sequences of graphs related to one another by our distance measure. 
BibTeX: @article{Scott2021a, author = {C. B. Scott and Eric Mjolsness}, editor = {Gabriele Oliva}, title = {Graph diffusion distance: Properties and efficient computation}, journal = {PLOS ONE}, publisher = {Public Library of Science (PLoS)}, year = {2021}, volume = {16}, number = {4}, pages = {e0249624}, doi = {10.1371/journal.pone.0249624} }  Scott J and Tuma M (2021), "Solving large linear least squares with linear equality constraints", June, 2021. [Abstract] [BibTeX] Abstract: We consider the problem of efficiently solving large-scale linear least squares problems that have one or more linear constraints that must be satisfied exactly. Whilst some classical approaches are theoretically well founded, they can face difficulties when the matrix of constraints contains dense rows or if an algorithmic transformation used in the solution process results in a modified problem that is much denser than the original one. To address this, we propose modifications and new ideas, with an emphasis on requiring the constraints are satisfied with a small residual. We examine combining the null-space method with our recently developed algorithm for computing a null space basis matrix for a "wide" matrix. We further show that a direct elimination approach enhanced by careful pivoting can be effective in transforming the problem to an unconstrained sparse-dense least squares problem that can be solved with existing direct or iterative methods. We also present a number of solution variants that employ an augmented system formulation, which can be attractive when solving a sequence of related problems. Numerical experiments using problems coming from practical applications are used throughout to demonstrate the effectiveness of the different approaches. 
BibTeX: @article{Scott2021b, author = {Jennifer Scott and Miroslav Tuma}, title = {Solving large linear least squares with linear equality constraints}, year = {2021} }  Sebbouh O, Cuturi M and Peyré G (2021), "Randomized Stochastic Gradient Descent Ascent", November, 2021. [Abstract] [BibTeX] Abstract: An increasing number of machine learning problems, such as robust or adversarial variants of existing algorithms, require minimizing a loss function that is itself defined as a maximum. Carrying a loop of stochastic gradient ascent (SGA) steps on the (inner) maximization problem, followed by an SGD step on the (outer) minimization, is known as Epoch Stochastic Gradient Descent Ascent (ESGDA). While successful in practice, the theoretical analysis of ESGDA remains challenging, with no clear guidance on choices for the inner loop size nor on the interplay between inner/outer step sizes. We propose RSGDA (Randomized SGDA), a variant of ESGDA with stochastic loop size with a simpler theoretical analysis. RSGDA comes with the first (among SGDA algorithms) almost sure convergence rates when used on nonconvex min/strongly-concave max settings. RSGDA can be parameterized using optimal loop sizes that guarantee the best convergence rates known to hold for SGDA. We test RSGDA on toy and larger scale problems, using distributionally robust optimization and single-cell data matching using optimal transport as a testbed. BibTeX: @article{Sebbouh2021, author = {Othmane Sebbouh and Marco Cuturi and Gabriel Peyré}, title = {Randomized Stochastic Gradient Descent Ascent}, year = {2021} }  Selvitopi O, Brock B, Nisa I, Tripathy A, Yelick K and Buluç A (2021), "Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication", In Proceedings of the ACM International Conference on Supercomputing., June, 2021. ACM. 
[Abstract] [BibTeX] [DOI] Abstract: Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes by focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affect computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies. BibTeX: @inproceedings{Selvitopi2021, author = {Oguz Selvitopi and Benjamin Brock and Israt Nisa and Alok Tripathy and Katherine Yelick and Aydın Buluç}, title = {Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication}, booktitle = {Proceedings of the ACM International Conference on Supercomputing}, publisher = {ACM}, year = {2021}, doi = {10.1145/3447818.3461472} }  Sethi JK and Mittal M (2021), "An efficient correlation based adaptive LASSO regression method for air quality index prediction", Earth Science Informatics., April, 2021. Springer Science and Business Media LLC. 
[Abstract] [BibTeX] [DOI] Abstract: One of the adverse effects of population growth and urbanization in developing countries is air pollution, due to which more than 4.2 million deaths occur every year. Therefore, prediction of air quality is a subject worth in-depth research and has received substantial interest in the recent years from academic units and the government. Feature selection methods are applied before prediction to identify potentially significant predictors based on exploratory data analysis. In this research work, a feature selection method based on Least Absolute Selection and Shrinkage Operator (LASSO) named Correlation based Adaptive LASSO (CbAL) Regression method has been proposed for predicting the air quality. For the experimental evaluation, cross regional data, including the concentration of pollutants and the meteorological factors of Delhi and its surrounding cities, has been taken from the Central Pollution Control Board (CPCB) Website. Further, to validate this feature selection method, various machine learning techniques have been taken into consideration and some preventive measures have been suggested to enhance the air quality. Feature selection analysis reveals that carbon monoxide, sulphur dioxide, nitrogen dioxide and Ozone are the most important factors for forecasting the air quality and the pollutants found in the cities of Noida and Gurugram have a more substantial impact on the Air Quality Index of Delhi than other surrounding cities. The model evaluation depicts that the feature subset extracted by the proposed method performs better than the complete dataset and the subset extracted by LASSO Regression with an average classification accuracy of 78%. The findings of this study can help to identify important contributors of AQI so that viable measures to improve the air quality of Delhi can be carried out. 
BibTeX: @article{Sethi2021, author = {Jasleen Kaur Sethi and Mamta Mittal}, title = {An efficient correlation based adaptive LASSO regression method for air quality index prediction}, journal = {Earth Science Informatics}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s12145-021-00618-1} }  Sgherzi F, Parravicini A, Siracusa M and Santambrogio MD (2021), "Solving Large Top-K Graph Eigenproblems with a Memory and Compute-optimized FPGA Design", March, 2021. [Abstract] [BibTeX] Abstract: Large-scale eigenvalue computations on sparse matrices are a key component of graph analytics techniques based on spectral methods. In such applications, an exhaustive computation of all eigenvalues and eigenvectors is impractical and unnecessary, as spectral methods can retrieve the relevant properties of enormous graphs using just the eigenvectors associated with the Top-K largest eigenvalues. In this work, we propose a hardware-optimized algorithm to approximate a solution to the Top-K eigenproblem on sparse matrices representing large graph topologies. We prototype our algorithm through a custom FPGA hardware design that exploits HBM, Systolic Architectures, and mixed-precision arithmetic. We achieve a speedup of 6.22x compared to the highly optimized ARPACK library running on an 80-thread CPU, while keeping high accuracy and 49x better power efficiency. BibTeX: @article{Sgherzi2021, author = {Francesco Sgherzi and Alberto Parravicini and Marco Siracusa and Marco Domenico Santambrogio}, title = {Solving Large Top-K Graph Eigenproblems with a Memory and Compute-optimized FPGA Design}, year = {2021} }  Shah N, Olascoaga LIG, Zhao S, Meert W and Verhelst M (2021), "DPU: DAG Processing Unit for Irregular Graphs with Precision-Scalable Posit Arithmetic in 28nm", December, 2021. 
[Abstract] [BibTeX] [DOI] Abstract: Computation in several real-world applications like probabilistic machine learning, sparse linear algebra, and robotic navigation, can be modeled as irregular directed acyclic graphs (DAGs). The irregular data dependencies in DAGs pose challenges to parallel execution on general-purpose CPUs and GPUs, resulting in severe under-utilization of the hardware. This paper proposes DPU, a specialized processor designed for the efficient execution of irregular DAGs. The DPU is equipped with parallel compute units that execute different subgraphs of a DAG independently. The compute units can synchronize within a cycle using a hardware-supported synchronization primitive, and communicate via an efficient interconnect to a global banked scratchpad. Furthermore, a precision-scalable posit arithmetic unit is developed to enable application-dependent precision. The DPU is taped-out in 28nm CMOS, achieving a speedup of 5.1× and 20.6× over state-of-the-art CPU and GPU implementations on DAGs of sparse linear algebra and probabilistic machine learning workloads. This performance is achieved while operating at a power budget of 0.23W, as opposed to 55W and 98W of the CPU and GPU, resulting in a peak efficiency of 538 GOPS/W with DPU, which is 1350× and 9000× higher than the CPU and GPU, respectively. Thus, with specialized architecture, DPU enables low-power execution of irregular DAG workloads. BibTeX: @article{Shah2021, author = {Nimish Shah and Laura Isabel Galindez Olascoaga and Shirui Zhao and Wannes Meert and Marian Verhelst}, title = {DPU: DAG Processing Unit for Irregular Graphs with Precision-Scalable Posit Arithmetic in 28nm}, year = {2021}, doi = {10.1109/JSSC.2021.3134897} }  Shang F, Huang H, Fan J, Liu Y, Liu H and Liu J (2021), "Asynchronous Parallel, Sparse Approximated SVRG for High-Dimensional Machine Learning", IEEE Transactions on Knowledge and Data Engineering. , pp. 1-1. 
Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: With increasing of data size and development of multi-core computers, asynchronous parallel stochastic optimization algorithms such as KroMagnon have gained significant attention. In this paper, we propose a new Sparse approximation and asynchronous parallel Stochastic Variance Reduced Gradient (SSVRG) method for sparse and high-dimensional machine learning problems. Unlike standard SVRG and its asynchronous parallel variant, KroMagnon, the snapshot point of SSVRG is set to the average of all the iterates in the previous epoch, which allows it to take much larger learning rates and makes it more robust to the choice of learning rates. In particular, we use the sparse approximation of the popular SVRG estimator to perform completely sparse updates. Therefore, SSVRG has a much lower per-iteration computational cost than its dense counterpart, SVRG++, and is very friendly to asynchronous parallel implementation. Moreover, we provide the convergence guarantees of SSVRG for both SC and non-SC problems, while existing asynchronous algorithms (e.g., KroMagnon) only have convergence guarantees for SC problems. Finally, we extend SSVRG to non-smooth and asynchronous parallel settings. Numerical results demonstrate that SSVRG converges significantly faster than the state-of-the-art asynchronous parallel methods, e.g., KroMagnon, and is usually more than three orders of magnitude faster than SVRG++. 
BibTeX: @article{Shang2021, author = {Fanhua Shang and Hua Huang and Jun Fan and Yuanyuan Liu and Hongying Liu and Jianhui Liu}, title = {Asynchronous Parallel, Sparse Approximated SVRG for High-Dimensional Machine Learning}, journal = {IEEE Transactions on Knowledge and Data Engineering}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tkde.2021.3070539} }  Sharp D, Stoyanov M, Tomov S and Dongarra J (2021), "A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms", In Proceedings of the IEEE High Performance Extreme Computing Conference. Thesis at: University of Tennessee. [Abstract] [BibTeX] [URL] Abstract: The Highly Efficient Fast Fourier Transform for Exascale (heFFTe) numerical library is a C++ implementation of distributed multidimensional FFTs targeting heterogeneous and scalable systems. To date, the library has relied on users to provide at least one installation from a selection of well-known libraries for the single node/MPI-rank one-dimensional FFT calculations that heFFTe is built on. In this paper, we describe the development of a CPU-based backend to heFFTe as a reference, or "stock", implementation. This allows the user to install and run heFFTe without any external dependencies that may include restrictive licensing or mandate specific hardware. Furthermore, this stock backend was implemented to take advantage of SIMD capabilities on the modern CPU, and includes both a custom vectorized complex data-type and a run-time generated call-graph for selecting which specific FFT algorithm to call. The performance of this backend greatly increases when vectorized instructions are available and, when vectorized, it provides reasonable scalability in both performance and accuracy compared to an alternative CPU-based FFT backend. 
In particular, we illustrate a highly-performant O(N log N) code that is about 10× faster compared to non-vectorized code for the complex arithmetic, and a scalability that matches heFFTe's scalability when used with vendor or other highly-optimized 1D FFT backends. The same technology can be used to derive other Fourier-related transformations that may be even not available in vendor libraries, e.g., the discrete sine (DST) or cosine (DCT) transforms, as well as their extension to multiple dimensions and O(N log N) timing. BibTeX: @inproceedings{Sharp2021, author = {Sharp, Daniel and Miroslav Stoyanov and Stanimire Tomov and Jack Dongarra}, title = {A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms}, booktitle = {Proceedings of the IEEE High Performance Extreme Computing Conference}, school = {University of Tennessee}, year = {2021}, url = {https://www.icl.utk.edu/files/publications/2021/icl-utk-1497-2021.pdf} }  Shen Y, Zuo Y and Zhang X (2021), "A faster generalized ADMM-based algorithm using a sequential updating scheme with relaxed stepsizes for multiple-block linearly constrained separable convex programming", Journal of Computational and Applied Mathematics., February, 2021. , pp. 113503. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The multi-block linearly constrained separable convex optimization is frequently applied in numerous applications, including image/signal processing, statistical learning, and data mining, where the objective function is the sum of multiple individual convex functions, and the key constraints are linear. A classical approach to solving such optimization problem could be the alternating direction method of multipliers(ADMM). It decomposes the subproblem into a series of small-scale ones such that its per-iteration cost may be meager. 
ADMM, however, is designed initially for the two-block model, and its convergence can not be guaranteed for a general multi-block model without additional assumptions. Dai et al. (2017) proposed the algorithm SUSLM (for Sequential Updating Scheme of the Lagrange Multiplier) for separable convex programming problems. The Lagrange multipliers are updated several times at each iteration, and a correction step is imposed at the end of each iteration. In order to derive its convergence property, a correction step is imposed at the end of each iteration. In this paper, we improve the SUSLM algorithm by introducing two controlled parameters in the updating expressions for decision variables and Lagrange multipliers. The condition of step sizes is then relaxed. We show experimentally that our SUSLM algorithm converges faster than SUSLM. Moreover, result comparisons on robust principal component analysis (RPCA) show better performances than other ADMM-based algorithms. BibTeX: @article{Shen2021, author = {Yuan Shen and Yannian Zuo and Xiayang Zhang}, title = {A faster generalized ADMM-based algorithm using a sequential updating scheme with relaxed stepsizes for multiple-block linearly constrained separable convex programming}, journal = {Journal of Computational and Applied Mathematics}, publisher = {Elsevier BV}, year = {2021}, pages = {113503}, doi = {10.1016/j.cam.2021.113503} }  Shen Z-L, Su M, Carpentieri B and Wen C (2021), "Shifted power-GMRES method accelerated by extrapolation for solving PageRank with multiple damping factors", Applied Mathematics and Computation., December, 2021. , pp. 126799. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Starting from the seminal paper published by Brin and Page in 1998, the PageRank model has been extended to many fields far beyond search engine rankings, such as chemistry, biology, bioinformatics, social network analysis, to name a few. 
Due to the large dimension of PageRank problems, in the past decade or so, considerable research efforts have been devoted to their efficient solution especially for the difficult cases where the damping factors are close to 1. However, there exists few research work concerning about the solution of the case where several PageRank problems with the same network structure and various damping factors need to be solved. In this paper, we generalize the Power method to solving the PageRank problem with multiple damping factors. We demonstrate that the solution has almost the equative cost of solving the most difficult PageRank system of the sequence, and the residual vectors of the PageRank systems after running this method are collinear. Based upon these results, we develop a more efficient method that combines this Power method with the shifted GMRES method. For further accelerating the solving phase, we present a seed system choosing strategy combined with an extrapolation technique, and analyze their effect. Numerical experiments demonstrate the potential of the proposed iterative solver for accelerating realistic PageRank computations with multiple damping factors. BibTeX: @article{Shen2021a, author = {Zhao-Li Shen and Meng Su and Bruno Carpentieri and Chun Wen}, title = {Shifted power-GMRES method accelerated by extrapolation for solving PageRank with multiple damping factors}, journal = {Applied Mathematics and Computation}, publisher = {Elsevier BV}, year = {2021}, pages = {126799}, doi = {10.1016/j.amc.2021.126799} }  Sheng Y (2021), "Applications of Random Matrix Theory in Statistics and Machine Learning". Thesis at: University of Pennsylvania. [Abstract] [BibTeX] Abstract: We live in an age of big data. Analyzing modern data sets can be very difficult because they usually present the following features: massive, high-dimensional, and heterogeneous. How to deal with these new features often plays a key role in modern statistical and machine learning research. 
This dissertation uses random matrix theory (RMT), a powerful mathematical tool, to study several important problems where the data is massive, high-dimensional, and sometimes heterogeneous. The first chapter briefly introduces some basics of random matrix theory (RMT). We also cover some classical applications of RMT to statistics and machine learning. The second chapter is about distributed linear regression, where we consider the ordinary least squares (OLS) estimators. Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck. We study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending back the weighted average and doing local ridge regressions centered at it. How does this work compared to doing linear regression on the full data? Here we study the performance loss in estimation and test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We find the performance loss in one-step weighted averaging, and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework. The third chapter studies a fundamental and highly important problem in this area: How to do ridge regression in a distributed computing environment? Ridge regression is an extremely popular method for supervised learning, and has several optimality properties, thus it is important to study. We study one-shot methods that construct weighted combinations of ridge regression estimators computed on each machine. 
By analyzing the mean squared error in a high dimensional random-effects model where each predictor has a small effect, we discover several new phenomena. We also propose a new Weighted ONe-shot DistributEd Ridge regression (WONDER) algorithm. We test WONDER in simulation studies and using the Million Song Dataset as an example. There it can save at least 100x in computation time, while nearly preserving test accuracy. The fourth chapter is trying to solve another possible issue with modern data sets, that is heterogeneity. Dimensionality reduction via PCA and factor analysis is an important tool of data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, likelihood ratio, parallel analysis, etc) do not have statistical guarantees in the increasingly common setting where the data are heterogeneous. There each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares data singular values to those of “empirical null” data generated by flipping the sign of each entry randomly with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings. Here classical parallel analysis is no longer effective. To do this, we propose to leverage recent breakthroughs in random matrix theory, such as dimension-free operator norm bounds and large deviations for the top eigenvalues of nonhomogeneous matrices. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples. 
BibTeX: @phdthesis{Sheng2021, author = {Yue Sheng}, title = {Applications of Random Matrix Theory in Statistics and Machine Learning}, school = {University of Pennsylvania}, year = {2021} }  Shi H-JM, Xuan MQ, Oztoprak F and Nocedal J (2021), "On the Numerical Performance of Derivative-Free Optimization Methods Based on Finite-Difference Approximations", February, 2021. [Abstract] [BibTeX] Abstract: The goal of this paper is to investigate an approach for derivative-free optimization that has not received sufficient attention in the literature and is yet one of the simplest to implement and parallelize. It consists of computing gradients of a smoothed approximation of the objective function (and constraints), and employing them within established codes. These gradient approximations are calculated by finite differences, with a differencing interval determined by the noise level in the functions and a bound on the second or third derivatives. It is assumed that noise level is known or can be estimated by means of difference tables or sampling. The use of finite differences has been largely dismissed in the derivative-free optimization literature as too expensive in terms of function evaluations and/or as impractical when the objective function contains noise. The test results presented in this paper suggest that such views should be re-examined and that the finite-difference approach has much to be recommended. The tests compared NEWUOA, DFO-LS and COBYLA against the finite-difference approach on three classes of problems: general unconstrained problems, nonlinear least squares, and general nonlinear programs with equality constraints. 
BibTeX: @article{Shi2021, author = {Hao-Jun Michael Shi and Melody Qiming Xuan and Figen Oztoprak and Jorge Nocedal}, title = {On the Numerical Performance of Derivative-Free Optimization Methods Based on Finite-Difference Approximations}, year = {2021} }  Shi T, Ruth M and Townsend A (2021), "Parallel algorithms for computing the tensor-train decomposition", November, 2021. [Abstract] [BibTeX] Abstract: The tensor-train (TT) decomposition expresses a tensor in a data-sparse format used in molecular simulations, high-order correlation functions, and optimization. In this paper, we propose four parallelizable algorithms that compute the TT format from various tensor inputs: (1) Parallel-TTSVD for traditional format, (2) PSTT and its variants for streaming data, (3) Tucker2TT for Tucker format, and (4) TT-fADI for solutions of Sylvester tensor equations. We provide theoretical guarantees of accuracy, parallelization methods, scaling analysis, and numerical results. For example, for a d-dimension tensor in ℝ^{n×⋯×n}, a two-sided sketching algorithm PSTT2 is shown to have a memory complexity of 𝒪(n^{⌊d/2⌋}), improving upon 𝒪(n^{d-1}) from previous algorithms. BibTeX: @article{Shi2021a, author = {Tianyi Shi and Maximilian Ruth and Alex Townsend}, title = {Parallel algorithms for computing the tensor-train decomposition}, year = {2021} }  Shioya A and Yamamoto Y (2021), "Block red-black MILU(0) preconditioner with relaxation on GPU", Parallel Computing., February, 2021. , pp. 102760. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: To accelerate the Krylov subspace-based linear equation solvers on Graphics Processing Units (GPUs), a stable, efficient and highly parallel preconditioner is essential. One of the strong candidates for such a preconditioner is the combination of the block red-black ordering and the relaxed modified incomplete LU factorization without fill-ins (MILU(0)). 
In this paper, we present techniques for implementing this type of preconditioner on General-purpose computing on GPU (GPGPU) using OpenACC. Our implementation is designed for 3-dimensional finite-difference computations with 7-point stencil, and the matrix storage format is optimized to realize coalesced memory access. Also, mixed-precision computation is employed to exploit the high single-precision performance of GPUs without sacrificing the accuracy of the computed solution. Extensive numerical tests were performed and the optimal values of various tunable parameters such as the number of blocks in each direction and the number of workers specified in OpenACC clauses are discussed. Performance comparison on NVIDIA Quadro GP100 and Tesla K40t GPUs shows that our solver is much faster than existing libraries like cuSPARSE, MAGMA, ViennaCL, and Ginkgo, especially when multiple linear equations with coefficient matrices sharing the same nonzero pattern are solved. BibTeX: @article{Shioya2021, author = {Akemi Shioya and Yusaku Yamamoto}, title = {Block red-black MILU(0) preconditioner with relaxation on GPU}, journal = {Parallel Computing}, publisher = {Elsevier BV}, year = {2021}, pages = {102760}, doi = {10.1016/j.parco.2021.102760} }  Shivakumar S, Li J, Kannan R and Aluru S (2021), "Efficient Parallel Sparse Symmetric Tucker Decomposition for High-Order Tensors", In Proceedings of the SIAM Conference on Applied and Computational Discrete Algorithms., January, 2021. , pp. 193-204. Society for Industrial and Applied Mathematics. [Abstract] [BibTeX] [DOI] Abstract: Tensor based methods are receiving renewed attention in recent years due to their prevalence in diverse real-world applications. There is considerable literature on tensor representations and algorithms for tensor decompositions, both for dense and sparse tensors. 
Many applications in hypergraph analytics, machine learning, psychometry, and signal processing result in tensors that are both sparse and symmetric, making it an important class for further study. Similar to the critical Tensor Times Matrix chain operation (TTMc) in general sparse tensors, the Sparse Symmetric Tensor Times Same Matrix chain (S3TTMc) operation is compute and memory intensive due to high tensor order and the associated factorial explosion in the number of non-zeros. In this work, we present a novel compressed storage format CSS for sparse symmetric tensors, along with an efficient parallel algorithm for the S3TTMc operation. We theoretically establish that S3TTMc on CSS achieves a better memory versus run-time trade-off compared to state-of-the-art implementations. We demonstrate experimental findings that confirm these results and achieve up to 2.9× speedup on synthetic and real datasets. BibTeX: @incollection{Shivakumar2021, author = {Shruti Shivakumar and Jiajia Li and Ramakrishnan Kannan and Srinivas Aluru}, title = {Efficient Parallel Sparse Symmetric Tucker Decomposition for High-Order Tensors}, booktitle = {Proceedings of the SIAM Conference on Applied and Computational Discrete Algorithms}, publisher = {Society for Industrial and Applied Mathematics}, year = {2021}, pages = {193--204}, doi = {10.1137/1.9781611976830.18} }  Singh N, Ma L, Yang H and Solomonik E (2021), "Comparison of Accuracy and Scalability of Gauss--Newton and Alternating Least Squares for CANDECOMC/PARAFAC Decomposition", SIAM Journal on Scientific Computing., January, 2021. Vol. 43(4), pp. C290-C311. Society for Industrial & Applied Mathematics (SIAM). [Abstract] [BibTeX] [DOI] Abstract: Alternating least squares is the most widely used algorithm for CANDECOMC/PARAFAC (CP) tensor decomposition. However, alternating least squares may exhibit slow or no convergence, especially when high accuracy is required. 
An alternative approach is to regard CP decomposition as a nonlinear least squares problem and employ Newton-like methods. Direct solution of linear systems involving an approximated Hessian is generally expensive. However, recent advancements have shown that use of an implicit representation of the linear system makes these methods competitive with alternating least squares (ALS). We provide the first parallel implementation of a Gauss--Newton method for CP decomposition, which iteratively solves linear least squares problems at each Gauss--Newton step. In particular, we leverage a formulation that employs tensor contractions for implicit matrix-vector products within the conjugate gradient method. The use of tensor contractions enables us to employ the Cyclops library for distributed-memory tensor computations to parallelize the Gauss--Newton approach with a high-level Python implementation. In addition, we propose a regularization scheme for the Gauss--Newton method to improve convergence properties without any additional cost. We study the convergence of variants of the Gauss--Newton method relative to ALS for finding exact CP decompositions as well as approximate decompositions of real-world tensors. We evaluate the performance of sequential and parallel versions of both approaches, and study the parallel scalability on the Stampede2 supercomputer. BibTeX: @article{Singh2021, author = {Navjot Singh and Linjian Ma and Hongru Yang and Edgar Solomonik}, title = {Comparison of Accuracy and Scalability of Gauss--Newton and Alternating Least Squares for CANDECOMC/PARAFAC Decomposition}, journal = {SIAM Journal on Scientific Computing}, publisher = {Society for Industrial & Applied Mathematics (SIAM)}, year = {2021}, volume = {43}, number = {4}, pages = {C290--C311}, doi = {10.1137/20m1344561} }  Song C, Wright SJ and Diakonikolas J (2021), "Variance Reduction via Primal-Dual Accelerated Dual Averaging for Nonsmooth Convex Finite-Sums", February, 2021. 
[Abstract] [BibTeX] Abstract: We study structured nonsmooth convex finite-sum optimization that appears widely in machine learning applications, including support vector machines and least absolute deviation. For the primal-dual formulation of this problem, we propose a novel algorithm called Variance Reduction via Primal-Dual Accelerated Dual Averaging (VRPDA). In the nonsmooth and general convex setting, VRPDA has the overall complexity O(nd log min{1/𝜖, n} + d/𝜖) in terms of the primal-dual gap, where n denotes the number of samples, d the dimension of the primal variables, and 𝜖 the desired accuracy. In the nonsmooth and strongly convex setting, the overall complexity of VRPDA becomes O(nd log min{1/𝜖, n} + d/√𝜖) in terms of both the primal-dual gap and the distance between iterate and optimal solution. Both these results for VRPDA improve significantly on state-of-the-art complexity estimates, which are O(nd log min{1/𝜖, n} + √n d/𝜖) for the nonsmooth and general convex setting and O(nd log min{1/𝜖, n} + √n d/√𝜖) for the nonsmooth and strongly convex setting, in a much more simple and straightforward way. Moreover, both complexities are better than lower bounds for general convex finite sums that lack the particular (common) structure that we consider. Our theoretical results are supported by numerical experiments, which confirm the competitive performance of VRPDA compared to state-of-the-art. BibTeX: @article{Song2021, author = {Chaobing Song and Stephen J. Wright and Jelena Diakonikolas}, title = {Variance Reduction via Primal-Dual Accelerated Dual Averaging for Nonsmooth Convex Finite-Sums}, year = {2021} }  Song L, Chi Y, Guo L and Cong J (2021), "Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication", November, 2021. [Abstract] [BibTeX] Abstract: Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. 
SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing accelerators for SpMV. In this paper, we present Serpens, an HBM based accelerator for general-purpose SpMV. Serpens features (1) a general-purpose design, (2) memory-centric processing engines, and (3) index coalescing to support the efficient processing of arbitrary SpMVs. From the evaluation of twelve large-size matrices, Serpens is 1.91x and 1.76x better in terms of geomean throughput than the latest accelerators GraphLily and Sextans, respectively. We also evaluate 2,519 SuiteSparse matrices, and Serpens achieves 2.10x higher throughput than a K80 GPU. For the energy efficiency, Serpens is 1.71x, 1.90x, and 42.7x better compared with GraphLily, Sextans, and K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to 30,204 MTEPS and up to 3.79x over GraphLily. BibTeX: @article{Song2021a, author = {Linghao Song and Yuze Chi and Licheng Guo and Jason Cong}, title = {Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication}, year = {2021} }  Srinivasan A and Todorov E (2021), "Computing the Newton-step faster than Hessian accumulation", August, 2021. [Abstract] [BibTeX] Abstract: Computing the Newton-step of a generic function with N decision variables takes O(N^3) flops. In this paper, we show that given the computational graph of the function, this bound can be reduced to O(mτ^3), where τ, m are the width and size of a tree-decomposition of the graph. The proposed algorithm generalizes nonlinear optimal-control methods based on LQR to general optimization problems and provides non-trivial gains in iteration-complexity even in cases where the Hessian is dense. 
BibTeX: @article{Srinivasan2021, author = {Akshay Srinivasan and Emanuel Todorov}, title = {Computing the Newton-step faster than Hessian accumulation}, year = {2021} }  Stramondo G (2021), "Memory system design for application-specific hardware". Thesis at: University of Amsterdam., April, 2021. [Abstract] [BibTeX] [URL] Abstract: Computing drives a lot of developments all around us, and leads to innovation in many fields of science, engineering, and entertainment. As such, the need for computing is increasing at fast pace. This pace has seen the prevalent use of multi- and many-core processors, where parallelism is a sustainable way (for now) to feed our computing needs. We now see single machines reaching multiple TFLOPs in performance, when combining multi-core CPUs and many-core accelerators. However, a second bottleneck arises in many of these computing systems: the memory. The so-called "memory wall", a term coined in 1994 by Wulf and McKee, is a metaphor for the significant performance limitations that the memory itself poses for computing systems. Simply put, the memory system is often unable to provide enough data to the computing system, thus limiting the performance of the entire computing system. One way to go around the memory wall is to redesign the memory system to support more parallelism, and be better suited for the applications running on the computing system. The work presented in this thesis illustrates different ways in which such a novel design can be approached and deployed, as well as the potential performance gains such novel memory systems can provide. BibTeX: @phdthesis{Stramondo2021, author = {G. 
Stramondo}, title = {Memory system design for application-specific hardware}, school = {University of Amsterdam}, year = {2021}, url = {https://hdl.handle.net/11245.1/2ebac42a-43a3-4938-8861-04dc3e90a2af} }  Sun G, Li L, Fang J and Li Q (2021), "On lower confidence bound improvement matrix-based approaches for multiobjective Bayesian optimization and its applications to thin-walled structures", Thin-Walled Structures., April, 2021. Vol. 161, pp. 107248. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: In engineering practice, most design criteria require time-consuming functional evaluation. To tackle such design problems, multiobjective Bayesian optimization has been widely applied to generation of optimal Pareto solutions. However, improvement function-based expected improvement (EI) and the hypervolume improvement-based lower confidence bound (LCB) infill-criteria are frequently criticized for their high computational cost. To address this issue, this study proposes a novel approach for developing multiobjective LCB criteria on the basis of LCB improvement matrix. Specifically, three cheap yet efficient infill-criteria are suggested by introducing three different improvement functions (namely, hypervolume improvement, Euclidean distance and maximin distance) that assemble the improvement matrix to a scalar value, which is then maximized for adding solution points sequentially. All these criteria have closed-form expressions and can maintain the anticipated properties, thereby largely reducing computational efforts without either integration or expensive evaluation of hypervolume indicator. The efficiency of the proposed criteria is demonstrated through the ZDT and DTLZ tests with different numbers of design variables and different complexities of objectives. The testing results exhibit that the proposed criteria have faster convergence, and enable to generate satisfactory Pareto front with fairly low computational cost compared with other conventional criteria. 
Finally, the best performing criterion is further applied to real-life design problems of tailor rolled blank (TRB) thin-walled structures under impact loads, which demonstrates a strong search capability for with good distribution of Pareto points, potentially providing an effective means to engineering design with strong nonlinearity and sophistication. BibTeX: @article{Sun2021, author = {Guangyong Sun and Linsong Li and Jianguang Fang and Qing Li}, title = {On lower confidence bound improvement matrix-based approaches for multiobjective Bayesian optimization and its applications to thin-walled structures}, journal = {Thin-Walled Structures}, publisher = {Elsevier BV}, year = {2021}, volume = {161}, pages = {107248}, doi = {10.1016/j.tws.2020.107248} }  Sun Q, Liu Y, Yang H, Dun M, Luan Z, Gan L, Yang G and Qian D (2021), "Input-aware Sparse Tensor Storage Format Selection for Optimizing MTTKRP", IEEE Transactions on Computers. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: The major bottleneck of Canonical polyadic decomposition (CPD) is matricized tensor times Khatri-Rao product (MTTKRP). To optimize the performance of MTTKRP, various sparse tensor formats have been proposed such as CSF and HiCOO. However, due to the spatial complexity of the tensors, no single format fits all tensors. To address this problem, we propose SpTFS, a framework that automatically predicts the optimal storage format for an input sparse tensor. Specifically, SpTFS leverages a set of sampling methods to lower the sparse tensor to fix-sized matrices and sparsity features. In addition, SpTFS adopts both supervised learning based and unsupervised learning based methods to predict the optimal sparse tensor storage formats. For supervised learning, we propose TnsNet that combines convolution neural network (CNN) and the feature layer, which effectively captures the sparsity patterns of the input tensors. 
Whereas for unsupervised learning, we propose TnsClustering that consists of a feature encoder using convolutional layers and fully connected layers, and a K-means++ model to cluster sparse tensors for optimal tensor format prediction, without massively profiling on the hardware platform. The experimental results show that both TnsNet and TnsClustering can achieve higher prediction accuracy and performance speedup compared to the state-of-the-art works. BibTeX: @article{Sun2021a, author = {Qingxiao Sun and Yi Liu and Hailong Yang and Ming Dun and Zhongzhi Luan and Lin Gan and Guangwen Yang and Depei Qian}, title = {Input-aware Sparse Tensor Storage Format Selection for Optimizing MTTKRP}, journal = {IEEE Transactions on Computers}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tc.2021.3113028} }  Sundararajan A (2021), "Analysis and Design of Distributed Optimization Algorithms". Thesis at: The University of Wisconsin - Madison. [Abstract] [BibTeX] Abstract: This work concerns the analysis and design of distributed first-order optimization algorithms. The goal of such algorithms is to optimize a global function that is the average of local functions using only local computations and communications. Many recent algorithms have been proposed that achieve linear convergence to the global optimum. We provide a unified analysis that yields the worst-case linear convergence rate as a function of the properties of the functions and underlying network, as well as the parameters of the algorithm. The framework requires solving a small semidefinite program whose feasibility is a sufficient condition for certifying linear convergence of a distributed algorithm. We present results for both known, fixed graphs and unknown, time-varying graphs. 
The analysis framework is a computationally efficient method for distributed algorithm analysis that enables the rapid comparison, selection, and tuning of algorithms. This work also makes an effort to systematize distributed algorithm design by devising a canonical form for first-order distributed algorithms. The canonical form characterizes any distributed algorithm that can be implemented using a single round of communication and gradient computation per iteration, and where each agent stores up to two state variables. The canonical form features a minimal set of parameters that are both unique and expressive enough to capture any distributed algorithm in this class. Using this canonical form, we propose a new algorithm, which we call SVL, that is easily implementable and achieves a faster worst-case convergence rate than all other known algorithms. BibTeX: @phdthesis{Sundararajan2021, author = {Akhil Sundararajan}, title = {Analysis and Design of Distributed Optimization Algorithms}, school = {The University of Wisconsin - Madison}, year = {2021} }  Swirydowicz K, Darve E, Jones W, Maack J, Regev S, Saunders MA, Thomas SJ and Peles S (2021), "Linear solvers for power grid optimization problems: a review of GPU-accelerated linear solvers", June, 2021. [Abstract] [BibTeX] Abstract: The linear equations that arise in interior methods for constrained optimization are sparse symmetric indefinite and become extremely ill-conditioned as the interior method converges. These linear systems present a challenge for existing solver frameworks based on sparse LU or LDL^T decompositions. We benchmark five well known direct linear solver packages using matrices extracted from power grid optimization problems. The achieved solution accuracy varies greatly among the packages. None of the tested packages delivers significant GPU acceleration for our test cases. 
BibTeX: @article{Swirydowicz2021, author = {Kasia Swirydowicz and Eric Darve and Wesley Jones and Jonathan Maack and Shaked Regev and Michael A. Saunders and Stephen J. Thomas and Slaven Peles}, title = {Linear solvers for power grid optimization problems: a review of GPU-accelerated linear solvers}, year = {2021} }  Szárnyas G, Bader DA, Davis TA, Kitchen J, Mattson TG, McMillan S and Welch E (2021), "LAGraph: Linear Algebra, Network Analysis Libraries, and the Study of Graph Algorithms", April, 2021. [Abstract] [BibTeX] Abstract: Graph algorithms can be expressed in terms of linear algebra. GraphBLAS is a library of low-level building blocks for such algorithms that targets algorithm developers. LAGraph builds on top of the GraphBLAS to target users of graph algorithms with high-level algorithms common in network analysis. In this paper, we describe the first release of the LAGraph library, the design decisions behind the library, and performance using the GAP benchmark suite. LAGraph, however, is much more than a library. It is also a project to document and analyze the full range of algorithms enabled by the GraphBLAS. To that end, we have developed a compact and intuitive notation for describing these algorithms. In this paper, we present that notation with examples from the GAP benchmark suite. BibTeX: @article{Szarnyas2021, author = {Gábor Szárnyas and David A. Bader and Timothy A. Davis and James Kitchen and Timothy G. Mattson and Scott McMillan and Erik Welch}, title = {LAGraph: Linear Algebra, Network Analysis Libraries, and the Study of Graph Algorithms}, year = {2021} }  Tadonki C (2021), "High Performance Optimization at the Door of the Exascale", June, 2021. [Abstract] [BibTeX] Abstract: quest for processing speed potential. In fact, we always get a fraction of the technically available computing power (so-called theoretical peak), and the gap is likely to go hand-to-hand with the hardware complexity of the target system. 
Among the key aspects of this complexity, we have: the heterogeneity of the computing units, the memory hierarchy and partitioning including the non-uniform memory access (NUMA) configuration, and the interconnect for data exchanges among the computing nodes. Scientific investigations and cutting-edge technical activities should ideally scale-up with respect to sustained performance. The special case of quantitative approaches for solving (large-scale) problems deserves a special focus. Indeed, most of common real-life problems, even when considering the artificial intelligence paradigm, rely on optimization techniques for the main kernels of algorithmic solutions. Mathematical programming and pure combinatorial methods are not easy to implement efficiently on large-scale supercomputers because of irregular control flow, complex memory access patterns, heterogeneous kernels, numerical issues, to name a few. We describe and examine our thoughts from the standpoint of large-scale supercomputers. BibTeX: @article{Tadonki2021, author = {Claude Tadonki}, title = {High Performance Optimization at the Door of the Exascale}, year = {2021} }  Taffet PA (2021), "Techniques for Measurement, Analysis, and Optimization for HPC Communication Performance". Thesis at: Rice University. [Abstract] [BibTeX] [URL] Abstract: Inter-node communication is a critical component of tightly coupled applications running on parallel high performance computing systems. Surveys of high performance computing benchmarks and applications show that most applications spend at least 20% of their execution time communicating, and some spend more than 50%. Thus, inter-node communication performance is important to the overall performance of parallel applications. Furthermore, as the scale of parallelism increases, communicating efficiently becomes more important and typically more difficult. 
Application developers often cannot address communication performance issues on their own, whether because of a lack of useful diagnostic information, or because they stem from system-level issues such as poor routing. This dissertation describes several techniques for measuring, analyzing, and optimizing communication performance for parallel applications running on a supercomputer with a fat tree interconnect, all of which can aid in improving communication performance of applications. First, I describe a sampling-based monitoring technique that uses a small amount of performance-related data in each packet to reconstruct quantitative estimates of traffic and congestion correlated with both application contexts and individual links. Using this information, it can distinguish between problems with an application’s communication pattern, its mapping onto a parallel system, and outside interference. Second, I propose an approach for generating optimized, traffic-aware routes on a statically routed network. The core of this approach is a combination of linear programming formulations for the optimal static routing problem. Third, I propose a technique for reconstructing application traffic patterns via compressed sensing from switch counters and other system-level information. The second and third contributions, combined to form a system called CoGARFrSN, use measures of communication traffic to produce better static routes that reduce congestion, which can be used effectively to turn a statically routed network into a coarse-grained adaptively routed network. Experiments with a network simulator show that CoGARFrSN routes often result in a 4-7× speedup over the traffic-oblivious static routing strategy typically used in fat trees for several communication motifs, and CoGARFrSN routes sometimes even perform significantly better than fine-grained hardware adaptive routing. BibTeX: @phdthesis{Taffet2021, author = {Philip A. 
Taffet}, title = {Techniques for Measurement, Analysis, and Optimization for HPC Communication Performance}, school = {Rice University}, year = {2021}, url = {https://scholarship.rice.edu/bitstream/handle/1911/111185/TAFFET-DOCUMENT-2021.pdf?sequence=1} }  Takayama N, Yaguchi T and Zhang Y (2021), "Comparison of Numerical Solvers for Differential Equations for Holonomic Gradient Method in Statistics", November, 2021. [Abstract] [BibTeX] Abstract: Definite integrals with parameters of holonomic functions satisfy holonomic systems of linear partial differential equations. When we restrict parameters to a one dimensional curve, the system becomes a linear ordinary differential equation (ODE) with respect to a curve in the parameter space. We can evaluate the integral by solving the linear ODE numerically. This approach to evaluate numerically definite integrals is called the holonomic gradient method (HGM) and it is useful to evaluate several normalizing constants in statistics. We will discuss and compare methods to solve linear ODE's to evaluate normalizing constants. BibTeX: @article{Takayama2021, author = {Nobuki Takayama and Takaharu Yaguchi and Yi Zhang}, title = {Comparison of Numerical Solvers for Differential Equations for Holonomic Gradient Method in Statistics}, year = {2021} }  Tavakoli EB, Riera M, Quraishi MH and Ren F (2021), "FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs", December, 2021. [Abstract] [BibTeX] Abstract: General sparse matrix-matrix multiplication (SpGEMM) is an integral part of many scientific computing, high-performance computing (HPC), and graph analytic applications. This paper presents a new compressed sparse vector (CSV) format for representing sparse matrices and FSpGEMM, an OpenCL-based HPC framework for accelerating general sparse matrix-matrix multiplication on FPGAs. 
The proposed FSpGEMM framework includes an FPGA kernel implementing a throughput-optimized hardware architecture based on Gustavson's algorithm and a host program implementing pre-processing functions for converting input matrices to the CSV format tailored for the proposed architecture. FSpGEMM utilizes a new buffering scheme tailored to Gustavson's algorithm. We compare FSpGEMM implemented on an Intel Arria 10 GX FPGA development board with Intel Math Kernel Library (MKL) implemented on an Intel Xeon E5-2637 CPU and cuSPARSE on an NVIDIA GTX TITAN X GPU, respectively, for multiplying a set of sparse matrices selected from SuiteSparse Matrix Collection. The experiment results show that the proposed FSpGEMM solution achieves on average 4.9x and 1.7x higher performance with 31.9x and 13.1x lower energy consumption per SpGEMM computation than the CPU and GPU implementations, respectively. BibTeX: @article{Tavakoli2021, author = {Erfan Bank Tavakoli and Michael Riera and Masudul Hassan Quraishi and Fengbo Ren}, title = {FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs}, year = {2021} }  Tavakoli EB, Riera M, Quraishi MH and Ren F (2021), "FSCHOL: An OpenCL-based HPC Framework for Accelerating Sparse Cholesky Factorization on FPGAs", In Proceedings of the 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing., October, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: The proposed FSCHOL framework consists of an FPGA kernel implementing a throughput-optimized hardware architecture for accelerating the supernodal multifrontal algorithm for sparse Cholesky factorization and a host program implementing a novel scheduling algorithm for finding the optimal execution order of supernodes computations for an elimination tree on the FPGA to eliminate the need for off-chip memory access for storing intermediate results. 
Moreover, the proposed scheduling algorithm minimizes on-chip memory requirements for buffering intermediate results by resolving the dependency of parent nodes in an elimination tree through temporal parallelism. Experiment results for factorizing a set of sparse matrices in various sizes from SuiteSparse Matrix Collection show that the proposed FSCHOL implemented on an Intel Stratix 10 GX FPGA development board achieves on average 5.5× and 9.7× higher performance and 10.3× and 24.7× lower energy consumption than implementations of CHOLMOD on an Intel Xeon E5-2637 CPU and an NVIDIA V100 GPU, respectively. BibTeX: @inproceedings{Tavakoli2021a, author = {Erfan Bank Tavakoli and Michael Riera and Masudul Hassan Quraishi and Fengbo Ren}, title = {FSCHOL: An OpenCL-based HPC Framework for Accelerating Sparse Cholesky Factorization on FPGAs}, booktitle = {Proceedings of the 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing}, publisher = {IEEE}, year = {2021}, doi = {10.1109/sbac-pad53543.2021.00032} }  Thebelt A, Tsay C, Lee RM, Sudermann-Merx N, Walz D, Tranter T and Misener R (2022), "Multi-objective constrained optimization for energy applications via tree ensembles", January, 2022. Vol. 306, pp. 118061. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: Energy systems optimization problems are complex due to strongly non-linear system behavior and multiple competing objectives, e.g. economic gain vs. environmental impact. Moreover, a large number of input variables and different variable types, e.g. continuous and categorical, are challenges commonly present in real-world applications. In some cases, proposed optimal solutions need to obey explicit input constraints related to physical properties or safety-critical operating conditions. 
This paper proposes a novel data-driven strategy using tree ensembles for constrained multi-objective optimization of black-box problems with heterogeneous variable spaces for which underlying system dynamics are either too complex to model or unknown. In an extensive case study comprised of synthetic benchmarks and relevant energy applications we demonstrate the competitive performance and sampling efficiency of the proposed algorithm compared to other state-of-the-art tools, making it a useful all-in-one solution for real-world applications with limited evaluation budgets. BibTeX: @article{Thebelt2022, author = {Alexander Thebelt and Calvin Tsay and Robert M. Lee and Nathan Sudermann-Merx and David Walz and Tom Tranter and Ruth Misener}, title = {Multi-objective constrained optimization for energy applications via tree ensembles}, publisher = {Elsevier BV}, year = {2022}, volume = {306}, pages = {118061}, doi = {10.1016/j.apenergy.2021.118061} }  Thi HAL, Luu HPH and Dinh TP (2021), "Online Stochastic DCA with applications to Principal Component Analysis", August, 2021. [Abstract] [BibTeX] Abstract: Stochastic algorithms are well-known for their performance in the era of big data. In convex optimization, stochastic algorithms have been studied in depth and breadth. However, the current body of research on stochastic algorithms for nonsmooth, nonconvex optimization is relatively limited. In this paper, we propose new stochastic algorithms based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) - the backbone of nonconvex, nonsmooth optimization. Since most real-world nonconvex programs fall into the framework of DC programming, our proposed methods can be employed in various situations, in which they confront stochastic nature and nonconvexity simultaneously. The convergence analysis of the proposed algorithms is studied intensively with the help of tools from modern convex analysis and martingale theory. 
Finally, we study several aspects of the proposed algorithms on an important problem in machine learning: the expected problem in Principal Component Analysis. BibTeX: @article{Thi2021, author = {Hoai An Le Thi and Hoang Phuc Hau Luu and Tao Pham Dinh}, title = {Online Stochastic DCA with applications to Principal Component Analysis}, year = {2021} }  Thies J, Röhrig-Zöllner M and Basermann A (2021), "(R)SE challenges in HPC", December, 2021. [Abstract] [BibTeX] Abstract: We discuss some specific software engineering challenges in the field of high-performance computing, and argue that the slow adoption of SE tools and techniques is at least in part caused by the fact that these do not address the HPC challenges 'out-of-the-box'. By giving some examples of solutions for designing, testing and benchmarking HPC software, we intend to bring software engineering and HPC closer together. BibTeX: @article{Thies2021, author = {Jonas Thies and Melven Röhrig-Zöllner and Achim Basermann}, title = {(R)SE challenges in HPC}, year = {2021} }  Thomas S, Carr A, Świrydowicz K and Day M (2021), "ILUT Smoothers for Hybrid C-AMG with Scaled Triangular Factors", November, 2021. [Abstract] [BibTeX] Abstract: Relaxation methods such as Jacobi or Gauss-Seidel are often applied as smoothers in algebraic multigrid. Incomplete factorizations can also be employed; however, direct triangular solves are comparatively slow on GPUs. Previous work by Anzt et al. [Anzt2015] proposed an iterative approach for solving such sparse triangular systems. However, when using the stationary Jacobi iteration, if the upper or lower triangular factor is highly non-normal, the iterations will diverge. An ILUT smoother is introduced for classical Ruge-Stüben C-AMG that applies Ruiz scaling to mitigate the non-normality of the upper triangular factor. Our approach facilitates the use of Jacobi iteration in place of the inherently sequential triangular solve. 
Because the scaling is applied to the upper triangular factor as opposed to the global matrix, it can be done locally on an MPI-rank for a diagonal block of the global matrix. A performance model is provided along with numerical results for matrices extracted from the PeleLM pressure continuity solver. BibTeX: @article{Thomas2021, author = {Stephen Thomas and Arielle Carr and Kasia Świrydowicz and Marc Day}, title = {ILUT Smoothers for Hybrid C-AMG with Scaled Triangular Factors}, year = {2021} }  Thomas S, Carr A, Mullowney P, Li R and Świrydowicz K (2021), "Neumann Series in GMRES and Algebraic Multigrid Smoothers", December, 2021. [Abstract] [BibTeX] Abstract: Neumann series underlie both Krylov methods and algebraic multigrid smoothers. A low-synch modified Gram-Schmidt (MGS)-GMRES algorithm is described that employs a Neumann series to accelerate the projection step. A corollary to the backward stability result of Paige et al. (2006) demonstrates that the truncated Neumann series approximation is sufficient for convergence of GMRES. The lower triangular solver associated with the correction matrix T_m = (I + L_m)^{-1} may then be replaced by a matrix-vector product with T_m = I - L_m. Next, Neumann series are applied to accelerate the classical Ruge-Stüben algebraic multigrid preconditioner using either a polynomial Gauss-Seidel or an ILU smoother. The sparse triangular solver employed in these smoothers is replaced by an inner iteration based upon matrix-vector products. Henrici's departure from normality of the associated iteration matrices leads to a better understanding of these series. Connections are made between the (non)normality of the L and U factors and nonlinear stability analysis, as well as the pseudospectra of the coefficient matrix. Furthermore, re-orderings that preserve structural symmetry also reduce the departure from normality of the upper triangular factor and improve the relative residual of the triangular solves. 
To demonstrate the effectiveness of this approach on many-core architectures, the proposed solver and preconditioner are applied to the pressure continuity equation for the incompressible Navier-Stokes equations of fluid motion. The pressure solve time is reduced considerably with no change in the convergence rate and the polynomial Gauss-Seidel smoother is compared with a Jacobi smoother. Numerical and timing results are presented for Nalu-Wind and the PeleLM combustion codes, where ILU with iterative triangular solvers is shown to be much more effective than polynomial Gauss-Seidel. BibTeX: @article{Thomas2021a, author = {Stephen Thomas and Arielle Carr and Paul Mullowney and Ruipeng Li and Kasia Świrydowicz}, title = {Neumann Series in GMRES and Algebraic Multigrid Smoothers}, year = {2021} }  Tian R, Guo L, Li J, Ren B and Kestor G (2021), "A High-Performance Sparse Tensor Algebra Compiler in Multi-Level IR", February, 2021. [Abstract] [BibTeX] Abstract: Tensor algebra is widely used in many applications, such as scientific computing, machine learning, and data analytics. The tensors representing real-world data are usually large and sparse. There are tens of storage formats designed for sparse matrices and/or tensors and the performance of sparse tensor operations depends on a particular architecture and/or selected sparse format, which makes it challenging to implement and optimize every tensor operation of interest and transfer the code from one architecture to another. We propose a tensor algebra domain-specific language (DSL) and compiler infrastructure to automatically generate kernels for mixed sparse-dense tensor algebra operations, named COMET. The proposed DSL provides high-level programming abstractions that resemble the familiar Einstein notation to represent tensor algebra operations. The compiler performs code optimizations and transformations for efficient code generation while covering a wide range of tensor storage formats. 
COMET compiler also leverages data reordering to improve spatial or temporal locality for better performance. Our results show that the performance of automatically generated kernels outperforms the state-of-the-art sparse tensor algebra compiler, with up to 20.92x, 6.39x, and 13.9x performance improvement, for parallel SpMV, SpMM, and TTM over TACO, respectively. BibTeX: @article{Tian2021, author = {Ruiqin Tian and Luanzheng Guo and Jiajia Li and Bin Ren and Gokcen Kestor}, title = {A High-Performance Sparse Tensor Algebra Compiler in Multi-Level IR}, year = {2021} }  Tian Z, Liu Z and Dong Y (2021), "The coupled iteration algorithms for computing PageRank", Numerical Algorithms., August, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: In this paper, based on the splittings of the coefficient matrix in the PageRank problem, the coupled iteration algorithms are presented for computing PageRank vector. Convergence conditions of the proposed algorithms are analyzed in detail. Furthermore, the choices of the optimal parameters are discussed for some special cases. Finally, several numerical examples are given to illustrate the effectiveness of the proposed algorithms. BibTeX: @article{Tian2021a, author = {Zhaolu Tian and Zhongyun Liu and Yinghui Dong}, title = {The coupled iteration algorithms for computing PageRank}, journal = {Numerical Algorithms}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11075-021-01166-x} }  Tian R, Guo L, Li J, Ren B and Kestor G (2021), "A High Performance Sparse Tensor Algebra Compiler in MLIR", In Proceedings of the 2021 IEEE/ACM 7th Workshop on the LLVM Compiler Infrastructure in HPC., November, 2021. IEEE. 
[BibTeX] [DOI] BibTeX: @inproceedings{Tian2021b, author = {Ruiqin Tian and Luanzheng Guo and Jiajia Li and Bin Ren and Gokcen Kestor}, title = {A High Performance Sparse Tensor Algebra Compiler in MLIR}, booktitle = {Proceedings of the 2021 IEEE/ACM 7th Workshop on the LLVM Compiler Infrastructure in HPC}, publisher = {IEEE}, year = {2021}, doi = {10.1109/llvmhpc54804.2021.00009} }  Torres C and Valero A (2021), "The Exergy Cost Theory Revisited", March, 2021. Vol. 14(6), pp. 1594. MDPI AG. [Abstract] [BibTeX] [DOI] Abstract: This paper reviews the fundamentals of the Exergy Cost Theory, an energy cost accounting methodology to evaluate the physical costs of products of energy systems and their associated waste. Besides, a mathematical and computational approach is presented, which will allow the practitioner to carry out studies on production systems regardless of their structural complexity. The exergy cost theory was proposed in 1986 by Valero et al. in their “General theory of exergy savings”. It has been recognized as a powerful tool in the analysis of energy systems and has been applied to the evaluation of energy saving alternatives, local optimisation, thermoeconomic diagnosis, or industrial symbiosis. The waste cost formation process is presented from a thermodynamic perspective rather than the economist’s approach. It is proposed to consider waste as external irreversibilities occurring in plant processes. A new concept, called irreversibility carrier, is introduced, which will allow the identification of the origin, transfer, partial recovery, and disposal of waste. BibTeX: @article{Torres2021, author = {César Torres and Antonio Valero}, title = {The Exergy Cost Theory Revisited}, publisher = {MDPI AG}, year = {2021}, volume = {14}, number = {6}, pages = {1594}, doi = {10.3390/en14061594} }  Tran H and Cutkosky A (2021), "Correcting Momentum with Second-order Information", March, 2021. 
[Abstract] [BibTeX] Abstract: We develop a new algorithm for non-convex stochastic optimization that finds an 𝜖-critical point in the optimal O(𝜖^{-3}) stochastic gradient and Hessian-vector product computations. Our algorithm uses Hessian-vector products to "correct" a bias term in the momentum of SGD with momentum. This leads to better gradient estimates in a manner analogous to variance reduction methods. In contrast to prior work, we do not require excessively large batch sizes (or indeed any restrictions at all on the batch size), and both our algorithm and its analysis are much simpler. We validate our results on a variety of large-scale deep learning benchmarks and architectures, where we see improvements over SGD and Adam. BibTeX: @article{Tran2021, author = {Hoang Tran and Ashok Cutkosky}, title = {Correcting Momentum with Second-order Information}, year = {2021} }  Tsai YM, Cojean T and Anzt H (2021), "Porting a sparse linear algebra math library to Intel GPUs", March, 2021. [Abstract] [BibTeX] Abstract: With the announcement that the Aurora Supercomputer will be composed of general purpose Intel CPUs complemented by discrete high performance Intel GPUs, and the deployment of the oneAPI ecosystem, Intel has committed to enter the arena of discrete high performance GPUs. A central requirement for the scientific computing community is the availability of production-ready software stacks and a glimpse of the performance they can expect to see on Intel high performance GPUs. In this paper, we present the first platform-portable open source math library supporting Intel GPUs via the DPC++ programming environment. We also benchmark some of the developed sparse linear algebra functionality on different Intel GPUs to assess the efficiency of the DPC++ programming ecosystem to translate raw performance into application performance. 
Aside from quantifying the efficiency within the hardware-specific roofline model, we also compare against routines providing the same functionality that ship with Intel's oneMKL vendor library. BibTeX: @article{Tsai2021, author = {Yuhsiang M. Tsai and Terry Cojean and Hartwig Anzt}, title = {Porting a sparse linear algebra math library to Intel GPUs}, year = {2021} }  Tsay C, Kronqvist J, Thebelt A and Misener R (2021), "Partition-based formulations for mixed-integer optimization of trained ReLU neural networks", February, 2021. [Abstract] [BibTeX] Abstract: This paper introduces a class of mixed-integer formulations for trained ReLU neural networks. The approach balances model size and tightness by partitioning node inputs into a number of groups and forming the convex hull over the partitions via disjunctive programming. At one extreme, one partition per input recovers the convex hull of a node, i.e., the tightest possible formulation for each node. For fewer partitions, we develop smaller relaxations that approximate the convex hull, and show that they outperform existing formulations. Specifically, we propose strategies for partitioning variables based on theoretical motivations and validate these strategies using extensive computational experiments. Furthermore, the proposed scheme complements known algorithmic approaches, e.g., optimization-based bound tightening captures dependencies within a partition. BibTeX: @article{Tsay2021, author = {Calvin Tsay and Jan Kronqvist and Alexander Thebelt and Ruth Misener}, title = {Partition-based formulations for mixed-integer optimization of trained ReLU neural networks}, year = {2021} }  Tumurbaatar A and Sottile MJ (2021), "Algebraic Algorithms for Betweenness Centrality and Percolation Centrality", Journal of Graph Algorithms and Applications. Vol. 25(1), pp. 241-261. Journal of Graph Algorithms and Applications. 
[Abstract] [BibTeX] [DOI] Abstract: In this paper, we explored different ways to write the algebraic version of the betweenness centrality algorithm. Particularly, we focused on Brandes’ algorithm [8]. We aimed for algebraic betweenness centrality that can be parallelized easily. We proposed 3-tuple geodetic semiring as an extension to the usual geodetic semiring with 2-tuples [6]. Using the 3-tuple geodetic semiring, Dijkstra’s and Brandes’ algorithm, we wrote a more concise and general algebraic betweenness centrality (ABC) algorithm which is valid for weighted and directed graphs. We also proposed an alternative version of ABC using the usual geodetic semiring with 2-tuple where we used a simple way to construct shortest path tree after computing shortest path distances in the usual geodetic semiring. This allows us to avoid computational complexity of ABC implementation using 3-tuple geodetic semiring. We used numba [18] to optimize and parallelize ABC. We evaluated the performance of ABC using 2-tuple geodetic semiring as compared to NetworkX [16], a common Python package for graph algorithms. We did scalability experiments on parallel ABC and showed its total speedup. We also showed that with small modification, ABC can be adapted to algebraically compute other centrality measures such as percolation centrality. BibTeX: @article{Tumurbaatar2021, author = {Altansuren Tumurbaatar and Matthew J. Sottile}, title = {Algebraic Algorithms for Betweenness Centrality and Percolation Centrality}, journal = {Journal of Graph Algorithms and Applications}, publisher = {Journal of Graph Algorithms and Applications}, year = {2021}, volume = {25}, number = {1}, pages = {241--261}, doi = {10.7155/jgaa.00558} }  Tuzcu A and Arslan H (2021), "Betweenness Centrality in Sparse Real World and Wireless Multi-hop Networks", In Proceedings of the Intelligent and Fuzzy Techniques for Emerging Conditions Conference, pp. 217-224. 
[Abstract] [BibTeX] [DOI] Abstract: Graphs are one of the compact ways to represent information about real-life and intelligent system networks like wireless sensor networks. Betweenness centrality is an important network measure that evaluates the significance of a node based on the shortest paths and is widely used in biological, social, transportation, complex, and communication networks. In this study, we implement an efficient algorithm computing betweenness centrality of nodes for real-world and wireless multi-hop networks. Large sparse graphs are stored using compressed sparse row storage format and modified version of Dijkstra’s algorithm is used to compute shortest paths. We conduct a comprehensive experimental study on real-world networks as well as wireless sensor networks that are state-of-the-art technologies for different applications such as intelligence structures, industrial and home automation as well as health care. We evaluate the effect of network dimension on the time needed to compute betweenness centrality. Experimental results demonstrate that computation time required to compute betweenness centrality varies from 0.9 to 52.5 s when the number of vertices changes from 10,000 to 60,000. We also observe that the proposed algorithm efficiently computes betweenness centrality for networks coming from machine learning, power network, and networks obtained from optimization problems as well as computational fluid dynamics. 
BibTeX: @inproceedings{Tuzcu2021, author = {Atakan Tuzcu and Hilal Arslan}, title = {Betweenness Centrality in Sparse Real World and Wireless Multi-hop Networks}, booktitle = {Proceedings of the Intelligent and Fuzzy Techniques for Emerging Conditions Conference}, year = {2021}, pages = {217--224}, doi = {10.1007/978-3-030-85626-7_27} }  Ullah I, Liu K, Yamamoto T, Mamlook REA and Jamal A (2021), "A comparative performance of machine learning algorithm to predict electric vehicles energy consumption: A path towards sustainability", 10, 2021. , pp. 0958305X2110449. SAGE Publications. [Abstract] [BibTeX] [DOI] Abstract: The rapid growth of transportation sector and related emissions are attracting the attention of policymakers to ensure environmental sustainability. Therefore, the deriving factors of transport emissions are extremely important to comprehend. The role of electric vehicles is imperative amid rising transport emissions. Electric vehicles pave the way towards a low-carbon economy and sustainable environment. Successful deployment of electric vehicles relies heavily on energy consumption models that can predict energy consumption efficiently and reliably. Improving electric vehicles’ energy consumption efficiency will significantly help to alleviate driver anxiety and provide an essential framework for operation, planning, and management of the charging infrastructure. To tackle the challenge of electric vehicles’ energy consumption prediction, this study aims to employ advanced machine learning models, extreme gradient boosting, and light gradient boosting machine to compare with traditional machine learning models, multiple linear regression, and artificial neural network. Electric vehicles energy consumption data in the analysis were collected in Aichi Prefecture, Japan. To evaluate the performance of the prediction models, three evaluation metrics were used; coefficient of determination (R^2), root mean square error, and mean absolute error. 
The prediction outcome exhibits that the extreme gradient boosting and light gradient boosting machine provided better and robust results compared to multiple linear regression and artificial neural network. The models based on extreme gradient boosting and light gradient boosting machine yielded higher values of R^2, lower mean absolute error, and root mean square error values have proven to be more accurate. However, the results demonstrated that the light gradient boosting machine outperformed the extreme gradient boosting model. A detailed feature importance analysis was carried out to demonstrate the impact and relative influence of different input variables on electric vehicles energy consumption prediction. The results imply that an advanced machine learning model can enhance the prediction performance of electric vehicles energy consumption. BibTeX: @article{Ullah2021, author = {Irfan Ullah and Kai Liu and Toshiyuki Yamamoto and Rabia Emhamed Al Mamlook and Arshad Jamal}, title = {A comparative performance of machine learning algorithm to predict electric vehicles energy consumption: A path towards sustainability}, publisher = {SAGE Publications}, year = {2021}, pages = {0958305X2110449}, doi = {10.1177/0958305x211044998} }  Upadhyaya P, Jarlebring E and Tudisco F (2021), "The self-consistent field iteration for p-spectral clustering", November, 2021. [Abstract] [BibTeX] Abstract: The self-consistent field (SCF) iteration, combined with its variants, is one of the most widely used algorithms in quantum chemistry. We propose a procedure to adapt the SCF iteration for the p-Laplacian eigenproblem, which is an important problem in the field of unsupervised learning. We formulate the p-Laplacian eigenproblem as a type of nonlinear eigenproblem with one eigenvector nonlinearity, which then allows us to adapt the SCF iteration for its solution after the application of suitable regularization techniques. 
The results of our numerical experiments confirm the viability of our approach. BibTeX: @article{Upadhyaya2021, author = {Parikshit Upadhyaya and Elias Jarlebring and Francesco Tudisco}, title = {The self-consistent field iteration for p-spectral clustering}, year = {2021} }  Uppal AJ, Choi J, Rolinger T and Huang HH (2021), "Faster Stochastic Block Partition using Aggressive Initial Merging, Compressed Representation, and Parallelism Control", In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference. [Abstract] [BibTeX] Abstract: The community detection problem continues to be challenging, particularly for large graph data. Although optimal graph partitioning is NP-hard, stochastic methods, such as in the IEEE HPEC GraphChallenge, can provide good approximate solutions in reasonable time. But the scalability with increasing graph size of such solutions remains a challenge. In this work, we describe three new techniques to speed up the stochastic block partition algorithm. The first technique relies on reducing the initial number of communities via aggressive agglomerative merging (a portion of the algorithm with high parallel scalability) to quickly reduce the amount of data that must be processed, resulting in an independent speedup of 1.85x for a 200k node graph. Our second technique uses a novel compressed data structure to store the main bookkeeping information of the algorithm. Our compressed representation allows the processing of large graphs that would otherwise be impossible due to memory constraints, and has a speedup of up to 1.19x over our uncompressed baseline representation. The third technique carefully manages the amount of parallelism during different phases of the algorithm. Compared to our best baseline configuration using a fixed number of threads, this technique yields an independent speedup of 2.26x for a 200k node graph. 
Combined together, our techniques result in speedups of 3.78x for a 50k node graph, 4.71x for a 100k node graph, and 5.13x for a 200k node graph over our previous best parallel algorithm. BibTeX: @inproceedings{Uppal2021, author = {Ahsen J Uppal and Jaeseok Choi and Thomas Rolinger and H. Howie Huang}, title = {Faster Stochastic Block Partition using Aggressive Initial Merging, Compressed Representation, and Parallelism Control}, booktitle = {Proceedings of the 2021 IEEE High Performance Extreme Computing Conference}, year = {2021} }  Védrine F, Jacquemin M, Kosmatov N and Signoles J (2021), "Runtime Abstract Interpretation for Numerical Accuracy and Robustness", Proceedings of the 22nd International Conference on Verification, Model Checking, and Abstract Interpretation. [Abstract] [BibTeX] [URL] Abstract: Verification of numerical accuracy properties in modern software remains an important and challenging task. One of its difficulties is related to unstable tests, where the execution can take different branches for real and floating-point numbers. This paper presents a new verification technique for numerical properties, named Runtime Abstract Interpretation (RAI), that, given an annotated source code, embeds into it an abstract analyzer in order to analyze the program behavior at runtime. RAI is a hybrid technique combining abstract interpretation and runtime verification that aims at being sound as the former while taking benefit from the concrete run to gain greater precision from the latter when necessary. It solves the problem of unstable tests by surrounding an unstable test by two carefully defined program points, forming a so-called split-merge section, for which it separately analyzes different executions and merges the computed domains at the end of the section. 
The implementation of this technique relies on two basic tools, FLDCompiler, that performs a source-to-source transformation of the given program and defines the split-merge sections, and an instrumentation library FLDLib that provides necessary primitives to explore relevant (partial) executions of each section and propagate accuracy properties. Initial experiments show that the proposed technique can efficiently and soundly analyze numerical accuracy for industrial programs on thin numerical scenarios. BibTeX: @inproceedings{Vedrine2021, author = {Franck Védrine and Maxime Jacquemin and Nikolai Kosmatov and Julien Signoles}, title = {Runtime Abstract Interpretation for Numerical Accuracy and Robustness}, journal = {Proceedings of the 22nd International Conference on Verification, Model Checking, and Abstract Interpretation}, year = {2021}, url = {http://julien.signoles.free.fr/publis/2021_vmcai.pdf} }  Vincent J, Gong J, Karp M, Peplinski A, Jansson N, Podobas A, Jocksch A, Yao J, Hussain F, Markidis S, Karlsson M, Pleiter D, Laure E and Schlatter P (2021), "Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems", September, 2021. [Abstract] [BibTeX] Abstract: We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers Re_360 and Re_550, based on friction velocity and pipe radius. The strong scaling is tested on several GPU-enabled HPC systems, including the Swiss Piz Daint system, TACC's Longhorn, Jülich's JUWELS Booster, and Berzelius in Sweden. The performance results show that speed-up between 3-5 can be achieved using the GPU accelerated version compared with the CPU version on these different systems. 
The run-time for 20 timesteps reduces from 43.5 to 13.2 seconds when increasing the number of GPUs from 64 to 512 for the Re_550 case on the JUWELS Booster system. This illustrates the potential of the GPU accelerated version for high throughput. At the same time, the strong scaling limit is significantly larger for GPUs, at about 2000-5000 elements per rank; compared to about 50-100 for a CPU-rank. BibTeX: @article{Vincent2021, author = {Jonathan Vincent and Jing Gong and Martin Karp and Adam Peplinski and Niclas Jansson and Artur Podobas and Andreas Jocksch and Jie Yao and Fazle Hussain and Stefano Markidis and Matts Karlsson and Dirk Pleiter and Erwin Laure and Philipp Schlatter}, title = {Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems}, year = {2021} }  Vracem GV and Nijssen S (2021), "Iterated Matrix Reordering", In Machine Learning and Knowledge Discovery in Databases. Research Track. , pp. 745-761. Springer International Publishing. [Abstract] [BibTeX] [DOI] Abstract: Heatmaps are a popular data visualization technique that allows to visualize a matrix or table in its entirety. An important step in the creation of insightful heatmaps is the determination of a good order for the rows and the columns, that is, the use of appropriate matrix reordering or seriation techniques. Unfortunately, by using artificial data with known patterns, it can be shown that existing matrix ordering techniques often fail to identify good orderings in data in the presence of noise. In this paper, we propose a novel technique that addresses this weakness. Its key idea is to make an underlying base matrix ordering technique more robust to noise by embedding it into an iterated loop with image processing techniques. Experiments show that this iterative technique improves the quality of the matrix ordering found by all base ordering methods evaluated, both for artificial and real-world data, while still offering high levels of computational performance as well. 
BibTeX: @incollection{Vracem2021, author = {Gauthier Van Vracem and Siegfried Nijssen}, title = {Iterated Matrix Reordering}, booktitle = {Machine Learning and Knowledge Discovery in Databases. Research Track}, publisher = {Springer International Publishing}, year = {2021}, pages = {745--761}, doi = {10.1007/978-3-030-86523-8_45} }  Walden A, Zubair M, Stone C and Nielsen E (2021), "Memory Optimizations for Sparse Linear Algebra on GPU Hardware", In Proceedings of the Workshop on Memory Centric High Performance Computing at Supercomputing. [Abstract] [BibTeX] Abstract: An effort to maximize memory bandwidth utilization for a sparse linear algebra kernel executing on NVIDIA Tesla V100 and A100 Graphics Processing Units (GPUs) is described. The kernel consists of a block-sparse matrix-vector product and a series of forward/backward triangular solves. The computation is memory-bound and exhibits low arithmetic intensity. An earlier implementation yielded good memory performance on the V100 architecture. However, a new approach, which assigns a warp to six rows of the matrix, is proposed for the A100. In addition, two new features offered by the A100 architecture are explored. L2 residency control enables a portion of the L2 cache to be used for persistent data access, and the asynchronous copy instruction allows data to be loaded directly from the main memory into shared memory. The new implementation improves memory bandwidth utilization from 71.5% to 81.2% of the peak available on the A100 architecture. 
BibTeX: @inproceedings{Walden2021, author = {Aaron Walden and Mohammad Zubair and Christopher Stone and Eric Nielsen}, title = {Memory Optimizations for Sparse Linear Algebra on GPU Hardware}, booktitle = {Proceedings of the Workshop on Memory Centric High Performance Computing at Supercomputing}, year = {2021} }  Wang Z, Berg SW-v and Braun M (2021), "Fast Parallel Newton-Raphson Power Flow Solver for Large Number of System Calculations with CPU and GPU", January, 2021. [Abstract] [BibTeX] Abstract: To analyze large sets of grid states, e.g. in time series calculations or parameter studies, large number of power flow calculations have to be performed, as well as evaluating the impact from uncertainties of the renewable energy e.g. wind and PV of power systems with Monte-Carlo simulation. For the application in real-time grid operation and in cases when computational time is critical, a novel approach on parallelization of Newton-Raphson power flow for many calculations on CPU and with GPU-acceleration is proposed. The result shows a speed-up of over x100 comparing to the open-source tool pandapower, when performing repetitive power flows of system with admittance matrix of the same sparsity pattern on both CPU and GPU. The speed-up relies on the optimized algorithm and parallelization strategy, which can reduce the repetitive work and saturate the high hardware performance of modern CPUs and GPUs well. This is achieved with the proposed batched sparse matrix operation and batched linear solver based on LU-refactorization. The batched linear solver shows a large performance improvement comparing to the state-of-art linear system solver KLU library and a better saturation of the GPU performance with small problem scale. Finally, the method of integrating the proposed solver into pandapower is presented, thus the parallel power flow solver with outstanding performance can be easily applied in challenging real-life grid operation and innovative researches e.g. 
data-driven machine learning studies. BibTeX: @article{Wang2021, author = {Zhenqi Wang and Sebastian Wende-von Berg and Martin Braun}, title = {Fast Parallel Newton-Raphson Power Flow Solver for Large Number of System Calculations with CPU and GPU}, year = {2021} }  Wang Y (2021), "High Performance Spectral Methods for Graph-Based Machine Learning". Thesis at: Michigan Technological University. [Abstract] [BibTeX] [URL] Abstract: Graphs play a critical role in machine learning and data mining fields. The success of graph-based machine learning algorithms highly depends on the quality of the underlying graphs. Desired graphs should have two characteristics: 1) they should be able to well-capture the underlying structures of the data sets. 2) they should be sparse enough so that the downstream algorithms can be performed efficiently on them. This dissertation first studies the application of a two-phase spectrum-preserving spectral sparsification method that enables to construct very sparse sparsifiers with guaranteed preservation of original graph spectra for spectral clustering. Experiments show that the computational challenge due to the eigen-decomposition procedure in spectral clustering can be fundamentally addressed. We then propose a highly-scalable spectral graph learning approach GRASPEL (Graph Spectral Learning at Scale). GRASPEL can learn high-quality graphs from high dimensional input data. Compared with prior state-of-the-art graph learning and construction methods [26, 27, 38], GRASPEL leads to substantially improved algorithm performance. 
BibTeX: @phdthesis{Wang2021a, author = {Wang, Yongyu}, title = {High Performance Spectral Methods for Graph-Based Machine Learning}, school = {Michigan Technological University}, year = {2021}, url = {https://search.proquest.com/openview/c1551e2ddafc1052c25367e2753854af/1.pdf?pq-origsite=gscholar&cbl=18750&diss=y} }  Wang Y, Baciu G and Li C (2021), "A Layout-Based Classification Method for Visualizing Time-Varying Graphs", ACM Transactions on Knowledge Discovery from Data., March, 2021. Vol. 15(4), pp. 1-24. Association for Computing Machinery (ACM). [Abstract] [BibTeX] [DOI] Abstract: Connectivity analysis between the components of large evolving systems can reveal significant patterns of interaction. The systems can be simulated by topological graph structures. However, such analysis becomes challenging on large and complex graphs. Tasks such as comparing, searching, and summarizing structures, are difficult due to the enormous number of calculations required. For time-varying graphs, the temporal dimension even intensifies the difficulty. In this article, we propose to reduce the complexity of analysis by focusing on subgraphs that are induced by closely related entities. To summarize the diverse structures of subgraphs, we build a supervised layout-based classification model. The main premise is that the graph structures can induce a unique appearance of the layout. In contrast to traditional graph theory-based and contemporary neural network-based methods of graph classification, our approach generates low costs and there is no need to learn informative graph representations. Combined with temporally stable visualizations, we can also facilitate the understanding of sub-structures and the tracking of graph evolution. The method is evaluated on two real-world datasets. The results show that our system is highly effective in carrying out visual-based analytics of large graphs. 
BibTeX: @article{Wang2021b, author = {Yunzhe Wang and George Baciu and Chenhui Li}, title = {A Layout-Based Classification Method for Visualizing Time-Varying Graphs}, journal = {ACM Transactions on Knowledge Discovery from Data}, publisher = {Association for Computing Machinery (ACM)}, year = {2021}, volume = {15}, number = {4}, pages = {1--24}, doi = {10.1145/3441301} }  Wang H, Yang X and Deng X (2021), "A Hybrid First-Order Method for Nonconvex ℓ_p-ball Constrained Optimization", April, 2021. [Abstract] [BibTeX] Abstract: In this paper, we consider the nonconvex optimization problem in which the objective is continuously differentiable and the feasible set is a nonconvex ℓ_p ball. Studying such optimization offers the possibility to bridge the gap between a range of intermediate values p ∈ (0,1) and p ∈ {0,1} in the norm-constraint sparse optimization, both in theory and practice. We propose a novel hybrid method within a first-order algorithmic framework for solving such problems by combining the Frank-Wolfe method and the gradient projection method. During iterations, it solves a Frank-Wolfe subproblem if the current iterate is in the interior of the ℓ_p ball, and it solves a gradient projection subproblem with a weighted ℓ_1-ball constraint if the current iterate is on the boundary of the ℓ_p ball. Global convergence is proved, and a worst-case iteration complexity O(1/ε^2) of the optimality error is also established. We believe our proposed method is the first practical algorithm for solving ℓ_p-ball constrained nonlinear problems with theoretical guarantees. Numerical experiments demonstrate the practicability and efficiency of the proposed algorithm. 
BibTeX: @article{Wang2021c, author = {Hao Wang and Xiangyu Yang and Xin Deng}, title = {A Hybrid First-Order Method for Nonconvex $\ell_p$-ball Constrained Optimization}, year = {2021} }  Wang R, Yang Z, Xu H and Lu L (2021), "A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution", The Journal of Supercomputing., June, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: In the past few decades, general matrix multiplication (GEMM), as the basic component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in various fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to deconstruct the problem into multiple smaller sub-problems, today’s BLAS libraries have implemented batched GEMM routines to achieve high performance in this scenario. MAGMA proposes a vbatch routine to calculate batched GEMM with variable size on GPU, but unbalanced input will cause some workgroups and threads to be idle, thereby affecting performance. In addition, unbalanced input will also affect the load balancing of the Computing Unit in GPU, and extreme input will lead to insufficient utilization of hardware resources. In this paper we propose a high-performance batched GEMM computing framework on GPU. For a large batch of small matrices with variable sizes and unbalanced distribution, the proposed framework considered the hardware architecture and the possible data distribution, and adopted three methods (flexible tile, sort-up and split-down) to improve hardware utilization and achieve better load balancing. Experimental results show that our framework has a 3.02× performance improvement compared to the latest MAGMA implementation on AMD Radeon Instinct MI50 GPU, and 3.14× speedup on MI100. 
BibTeX: @article{Wang2021d, author = {Ruimin Wang and Zhiwei Yang and Hao Xu and Lu Lu}, title = {A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution}, journal = {The Journal of Supercomputing}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s11227-021-03936-9} }  Wang Y, Li W and Gao J (2021), "A parallel sparse approximate inverse preconditioning algorithm based on MPI and CUDA", November, 2021. , pp. 100007. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: In this study, we present an efficient parallel sparse approximate inverse (SPAI) preconditioning algorithm based on MPI and CUDA, called HybridSPAI. For HybridSPAI, it optimizes a latest static SPAI preconditioning algorithm, and is extended from one GPU to multiple GPUs in order to process large-scale matrices. We make the following significant contributions: (1) a general parallel framework for optimizing the static SPAI preconditioner based on MPI and CUDA is presented, and (2) for each component of the preconditioner, a decision tree is established to choose the optimal kernel of computing it. Experimental results show that HybridSPAI is effective, and outperforms the popular preconditioning algorithms in two public libraries, and a latest parallel SPAI preconditioning algorithm. BibTeX: @article{Wang2021e, author = {Yizhou Wang and Wenhao Li and Jiaquan Gao}, title = {A parallel sparse approximate inverse preconditioning algorithm based on MPI and CUDA}, publisher = {Elsevier BV}, year = {2021}, pages = {100007}, doi = {10.1016/j.tbench.2021.100007} }  Wang Z, Chen H, Zhu J and Ding Z (2021), "Daily PM2.5 and PM10 forecasting using linear and nonlinear modeling framework based on robust local mean decomposition and moving window ensemble strategy", Applied Soft Computing., November, 2021. , pp. 108110. Elsevier BV. 
[Abstract] [BibTeX] [DOI] Abstract: Highly accurate forecasting of particulate matter concentration (PMC) is essential and effective for establishing a reliable air pollution early warning system and has both theoretical and practical significance. To meet this demand, a novel multi-scale hybrid learning framework based on robust local mean decomposition (RLMD) and moving window (MW) ensemble strategy is developed for PM2.5 and PM10 forecasting. In this architecture, the RLMD is adopted to adaptively decompose the PMC time series (PMCTS) into several production functions and one residue with different frequencies. These subseries are simpler than the original PMCTS, but they still work alongside mode aliasing. Thus, following the well-established “linear and nonlinear” modeling philosophy, a novel hybrid learning framework, composed of the autoregressive integrated moving average (ARIMA) and combined kernel function relevance vector machine (RVM_com), is proposed to capture both the linear and nonlinear patterns in the subseries. To obtain better final outputs, based on the definition of the ensemble improvement degree, the MW ensemble method is used to merge the forecasting results of all subseries. A comprehensive experiment is conducted using PM2.5 and PM10 datasets from four municipalities in China to investigate the forecasting performance of our proposed framework, and the results demonstrate that our proposed RLMD-ARIMA-RVM_com-MW (R-A&R_com-M) model is superior to other considered methods in terms of forecasting accuracy and generalization ability. This means that the developed forecasting architecture has a great application value in the field of PMCTS prediction. 
BibTeX: @article{Wang2021f, author = {Zicheng Wang and Huayou Chen and Jiaming Zhu and Zhenni Ding}, title = {Daily PM2.5 and PM10 forecasting using linear and nonlinear modeling framework based on robust local mean decomposition and moving window ensemble strategy}, journal = {Applied Soft Computing}, publisher = {Elsevier BV}, year = {2021}, pages = {108110}, doi = {10.1016/j.asoc.2021.108110} }  Wang C and Tang P (2021), "A dual semismooth Newton based augmented Lagrangian method for large-scale linearly constrained sparse group square-root Lasso problems", November, 2021. [Abstract] [BibTeX] Abstract: Square-root Lasso problems are proven robust regression problems. Furthermore, square-root regression problems with structured sparsity also plays an important role in statistics and machine learning. In this paper, we focus on the numerical computation of large-scale linearly constrained sparse group square-root Lasso problems. In order to overcome the difficulty that there are two nonsmooth terms in the objective function, we propose a dual semismooth Newton (SSN) based augmented Lagrangian method (ALM) for it. That is, we apply the ALM to the dual problem with the subproblem solved by the SSN method. To apply the SSN method, the positive definiteness of the generalized Jacobian is very important. Hence we characterize the equivalence of its positive definiteness and the constraint nondegeneracy condition of the corresponding primal problem. In numerical implementation, we fully employ the second order sparsity so that the Newton direction can be efficiently obtained. Numerical experiments demonstrate the efficiency of the proposed algorithm. 
BibTeX: @article{Wang2021g, author = {Chengjing Wang and Peipei Tang}, title = {A dual semismooth Newton based augmented Lagrangian method for large-scale linearly constrained sparse group square-root Lasso problems}, year = {2021} }  Ward JP, Narcowich FJ and Ward JD (2021), "Locally supported, quasi-interpolatory bases on graphs", January, 2021. [Abstract] [BibTeX] Abstract: Lagrange functions are localized bases that have many applications in signal processing and data approximation. Their structure and fast decay make them excellent tools for constructing approximations. Here, we propose perturbations of Lagrange functions on graphs that maintain the nice properties of Lagrange functions while also having the added benefit of being locally supported. Moreover, their local construction means that they can be computed in parallel, and they are easily implemented via quasi-interpolation. BibTeX: @article{Ward2021, author = {John Paul Ward and Francis J. Narcowich and Joseph D. Ward}, title = {Locally supported, quasi-interpolatory bases on graphs}, year = {2021} }  Wiebe J and Misener R (2021), "ROmodel: Modeling robust optimization problems in Pyomo", May, 2021. [Abstract] [BibTeX] Abstract: This paper introduces ROmodel, an open source Python package extending the modeling capabilities of the algebraic modeling language Pyomo to robust optimization problems. ROmodel helps practitioners transition from deterministic to robust optimization through modeling objects which allow formulating robust models in close analogy to their mathematical formulation. ROmodel contains a library of commonly used uncertainty sets which can be generated using their matrix representations, but it also allows users to define custom uncertainty sets using Pyomo constraints. ROmodel supports adjustable variables via linear decision rules. The resulting models can be solved using ROmodels solvers which implement both the robust reformulation and cutting plane approach. 
ROmodel is a platform to implement and compare custom uncertainty sets and reformulations. We demonstrate ROmodel's capabilities by applying it to six case studies. We implement custom uncertainty sets based on (warped) Gaussian processes to show how ROmodel can integrate data-driven models with optimization. BibTeX: @article{Wiebe2021, author = {Johannes Wiebe and Ruth Misener}, title = {ROmodel: Modeling robust optimization problems in Pyomo}, year = {2021} }  Wiegele A and Zhao S (2021), "SDP-based bounds for graph partition via extended ADMM", May, 2021. [Abstract] [BibTeX] Abstract: We study two NP-complete graph partition problems, k-equipartition problems and graph partition problems with knapsack constraints (GPKC). We introduce tight SDP relaxations with nonnegativity constraints to get lower bounds, the SDP relaxations are solved by an extended alternating direction method of multipliers (ADMM). In this way, we obtain high quality lower bounds for k-equipartition on large instances up to n =1000 vertices within as few as five minutes and for GPKC problems up to n=500 vertices within as little as one hour. On the other hand, interior point methods fail to solve instances from n=300 due to memory requirements. We also design heuristics to generate upper bounds from the SDP solutions, giving us tighter upper bounds than other methods proposed in the literature with low computational expense. BibTeX: @article{Wiegele2021, author = {Angelika Wiegele and Shudian Zhao}, title = {SDP-based bounds for graph partition via extended ADMM}, year = {2021} }  Wilkinson WJ, Solin A and Adam V (2021), "Sparse Algorithms for Markovian Gaussian Processes", In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics., March, 2021. [Abstract] [BibTeX] Abstract: Approximate Bayesian inference methods that scale to very large datasets are crucial in leveraging probabilistic models for real-world time series. 
Sparse Markovian Gaussian processes combine the use of inducing variables with efficient Kalman filter-like recursions, resulting in algorithms whose computational and memory requirements scale linearly in the number of inducing points, whilst also enabling parallel parameter updates and stochastic optimisation. Under this paradigm, we derive a general site-based approach to approximate inference, whereby we approximate the non-Gaussian likelihood with local Gaussian terms, called sites. Our approach results in a suite of novel sparse extensions to algorithms from both the machine learning and signal processing literature, including variational inference, expectation propagation, and the classical nonlinear Kalman smoothers. The derived methods are suited to large time series, and we also demonstrate their applicability to spatio-temporal data, where the model has separate inducing points in both time and space. BibTeX: @inproceedings{Wilkinson2021, author = {William J. Wilkinson and Arno Solin and Vincent Adam}, title = {Sparse Algorithms for Markovian Gaussian Processes}, booktitle = {Proceedings of the 24th International Conference on Artificial Intelligence and Statistics}, year = {2021} }  Witemeyer B, Weidner NJ, Davis TA, Kim T and Sueda S (2021), "QLB: Collision-Aware Quasi-Newton Solver with Cholesky and L-BFGS for Nonlinear Time Integration", In Proceedings of the Motion, Interaction and Games Conference. [Abstract] [BibTeX] Abstract: We advocate for the straightforward applications of the Cholesky and the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithms in the context of nonlinear time integration of deformable objects with dynamic collisions. At the beginning of each time step, we form and factor the Hessian matrix, accounting for all internal forces while omitting the implicit cross-coupling terms from the collision forces between multiple dynamic objects or self collisions. 
Then during the nonlinear solver iterations of the time step, we implicitly update this Hessian with L-BFGS. This approach is simple to implement and can be readily applied to any nonlinear time integration scheme, including higher-order schemes and quasistatics. We show that this approach works well in a wide range of settings involving complex nonlinear materials, including heterogeneity and anisotropy, as well as collisions, including frictional contact and self collisions. BibTeX: @inproceedings{Witemeyer2021, author = {Bethany Witemeyer and Nicholas J. Weidner and Timothy A. Davis and Theodore Kim and Shinjiro Sueda}, title = {QLB: Collision-Aware Quasi-Newton Solver with Cholesky and L-BFGS for Nonlinear Time Integration}, booktitle = {Proceedings of the Motion, Interaction and Games Conference}, year = {2021} }  Wu X, Wang H and Lu J (2021), "Distributed Optimization with Coupling Constraints", February, 2021. [Abstract] [BibTeX] Abstract: In this paper, we develop a novel distributed algorithm for addressing convex optimization with both nonlinear inequality and linear equality constraints, where the objective function can be a general nonsmooth convex function and all the constraints can be fully coupled. Specifically, we first separate the constraints into three groups, and design two primal-dual methods and utilize a virtual-queue-based method to handle each group of the constraints independently. Then, we integrate these three methods in a strategic way, leading to an integrated primal-dual proximal (IPLUX) algorithm, and enable the distributed implementation of IPLUX. We show that IPLUX achieves an O(1/k) rate of convergence in terms of optimality and feasibility, which is stronger than the convergence results of the state-of-the-art distributed algorithms for convex optimization with coupling nonlinear constraints. Finally, IPLUX exhibits competitive practical performance in the simulations. 
BibTeX: @article{Wu2021, author = {Xuyang Wu and He Wang and Jie Lu}, title = {Distributed Optimization with Coupling Constraints}, year = {2021} }  Wu J, Sun J, Sun H and Sun G (2021), "Performance Analysis of Graph Neural Network Frameworks", In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)., March, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Graph neural networks (GNNs) are effective models to address learning problems on graphs and have been successfully applied to numerous domains. To improve the productivity of implementing GNNs, various GNN programming frameworks have been developed. Both the effectiveness (accuracy, loss, etc) and the performance (latency, bandwidth, etc) are essential metrics to evaluate the implementation of GNNs. There are many comparative studies related to the effectiveness of different GNN models on domain tasks. However, the performance characteristics of different GNN frameworks are still lacking. In this study, we evaluate the effectiveness and performance of six popular GNN models, GCN, GIN, GAT, GraphSAGE, MoNet, and GatedGCN, across several common benchmarks under two popular GNN frameworks, PyTorch Geometric and Deep Graph Library. We analyze the training time, GPU utilization, and memory usage of different evaluation settings and the performance of models across different hardware configurations under the two frameworks. Our evaluation provides in-depth observations of performance bottlenecks of GNNs and the performance differences between the two popular GNN frameworks. Our work helps GNN researchers understand the performance differences of the popular GNN frameworks, and gives guidelines for developers to find potential performance bugs of frameworks and optimization possibilities of GNNs. 
BibTeX: @inproceedings{Wu2021a, author = {Junwei Wu and Jingwei Sun and Hao Sun and Guangzhong Sun}, title = {Performance Analysis of Graph Neural Network Frameworks}, booktitle = {2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ispass51385.2021.00029} }  Xia Y, Jiang P, Agrawal G and Ramnath R (2021), "Scaling Sparse Matrix Multiplication on CPU-GPU Nodes", In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)., May, 2021. IEEE. [Abstract] [BibTeX] [DOI] Abstract: Multiplication of two sparse matrices (SpGEMM) is a popular kernel behind many numerical solvers, and also features in implementing many common graph algorithms. Though many recent research efforts have focused on implementing SpGEMM efficiently on a single GPU, none of the existing work has considered the case where the memory requirements exceed the size of GPU memory. Similarly, the use of the aggregate computing power of CPU and GPU has also not been addressed for those large matrices. In this paper, we present a framework for scaling SpGEMM computations for matrices that do not fit into GPU memory. We address how the computation and data can be partitioned across kernel executions on GPUs. An important emphasis in our work is overlapping data movement and computation. We achieve this by addressing many challenges, such as avoiding dynamic memory allocations, and re-scheduling data transfers with the computation of chunks. We extend our framework to make efficient use of both GPU and CPU, by developing an efficient work distribution strategy. Our evaluation on 9 large matrices shows that our out-of-core GPU implementation achieves 1.98-3.03X speedups over a state-of-the-art multi-core CPU implementation, our hybrid implementation further achieves speedups up to 3.74x, and that our design choices are directly contributing towards achieving this performance. 
BibTeX: @inproceedings{Xia2021, author = {Yang Xia and Peng Jiang and Gagan Agrawal and Rajiv Ramnath}, title = {Scaling Sparse Matrix Multiplication on CPU-GPU Nodes}, booktitle = {2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, publisher = {IEEE}, year = {2021}, doi = {10.1109/ipdps49936.2021.00047} }  Xie Z, Tan G, Liu W and Sun N (2021), "A Pattern-based SpGEMM Library for Multi-core and Many-core Architectures", IEEE Transactions on Parallel and Distributed Systems. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: General sparse matrix-matrix multiplication (SpGEMM) is one of the most important mathematical library routines in a number of applications. In recent years, several efficient SpGEMM algorithms have been proposed, however, most of them are based on the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. And some specific algorithms are restricted to parameter tuning that has a significant impact on performance. So the particular format, algorithm, and parameter that yield the best performance for SpGEMM remain undetermined. In this paper, we conduct a prospective study on format-specific parallel SpGEMM algorithms and analyze their pros and cons. We then propose a pattern-based SpGEMM library, that provides a unified programming interface in the CSR format, analyses the pattern of two input matrices, and automatically determines the best format, algorithm, and parameter for arbitrary matrix pairs. For this purpose, we build an algorithm set that integrates three new designed algorithms with existing popular libraries, and design a hybrid deep learning model called MatNet to quickly identify patterns of input matrices and accurately predict the best solution by using sparse features and density representations. 
The evaluation shows that this library consistently outperforms the state-of-the-art library. We also demonstrate its adaptability in an AMG solver and a BFS algorithm with 30% performance improvement. BibTeX: @article{Xie2021, author = {Zhen Xie and Guangming Tan and Weifeng Liu and Ninghui Sun}, title = {A Pattern-based SpGEMM Library for Multi-core and Many-core Architectures}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tpds.2021.3090328} }  Xiong J-S (2021), "Modified upper and lower triangular splitting iterative method for a class of block two-by-two linear systems", Linear and Multilinear Algebra., December, 2021. , pp. 1-12. Informa UK Limited. [Abstract] [BibTeX] [DOI] Abstract: Based on two triangular splitting of the coefficient matrix, a modified upper and lower triangular (MULT) splitting iterative method is proposed in this paper for solving a class of block two-by-two linear systems. The convergence property and optimal iterated parameters are derived under suitable conditions. Finally, the effectiveness of the proposed MULT splitting iterative method is supported by two numerical examples. BibTeX: @article{Xiong2021, author = {Jin-Song Xiong}, title = {Modified upper and lower triangular splitting iterative method for a class of block two-by-two linear systems}, journal = {Linear and Multilinear Algebra}, publisher = {Informa UK Limited}, year = {2021}, pages = {1--12}, doi = {10.1080/03081087.2021.2017833} }  Xu Z and Li P (2021), "On the Riemannian Search for Eigenvector Computation", Journal of Machine Learning Research. Vol. 22 [Abstract] [BibTeX] [URL] Abstract: Eigenvector computation is central to numerical algebra and often critical to many data analysis tasks nowadays. 
Most research on this problem has been focusing on projection methods like power iterations, such that this category of algorithms can achieve both optimal convergence rates and cheap per-iteration costs. In contrast, search methods belonging to another main category are less understood in this respect. In this work, we consider the leading eigenvector computation as a non-convex optimization problem on the (generalized) Stiefel manifold and cover the cases for both standard and generalized eigenvectors. It is shown that the inexact Riemannian gradient method induced by the shift-and-invert preconditioning is guaranteed to converge to one of the ground-truth eigenvectors at an optimal rate, e.g., O( \kappa_B \frac{\lambda_1}{\lambda_1 - \lambda_{p+1}} \log \frac{1}{\epsilon} ) for a pair of real symmetric matrices (A, B) with B being positive definite, where \lambda_i represents the i-th largest generalized eigenvalue of the matrix pair, p is the multiplicity of \lambda_1, and \kappa_B stands for the condition number of B. The standard eigenvector computation is recovered by setting B to an identity matrix. Our analysis reduces the dependence on the eigengap, making it the first Riemannian eigensolver that achieves the optimal rate. Experiments demonstrate that the proposed search method is able to deliver significantly better performance than projection methods by taking advantage of step-size schemes. BibTeX: @article{Xu2021, author = {Zhiqiang Xu and Ping Li}, title = {On the Riemannian Search for Eigenvector Computation}, journal = {Journal of Machine Learning Research}, year = {2021}, volume = {22}, url = {https://www.jmlr.org/papers/volume22/20-033/20-033.pdf} }  Yang Y, Zou N, Lin E, Suo F and Chen Z (2021), "A neural network method for nonconvex optimization and its application on parameter retrieval", IEEE Transactions on Signal Processing. , pp. 1-1. Institute of Electrical and Electronics Engineers (IEEE). 
[Abstract] [BibTeX] [DOI] Abstract: Parameter retrieval is a typical nonconvex optimization problem in a wide range of research and engineering fields. Classic methods tackle the parameter retrieval problem by feature extraction from the subspace or transform domain. In this paper, we proposed a network-based method to directly solve the nonconvex optimization problem on parameters estimation of complex exponential signals, with no requirement of labeled data. The proposed network has an architecture similar to the Autoencoder network but with the decoder sub-network replaced by a complex exponential signal generator. After training the network to fit the signal parameters to the acquired data, one could obtain the parameters, i.e. frequencies, decay rates, and intensities, and reconstruct the signal. By this work, we show that with a simple application of a lightweight neural network, nonconvex optimization problems like parameter retrieval can be solved efficiently, even without any intricately designed algorithms. We also discuss the robustness of the network-based method by repeated experiments and present the failure cases to indicate the limitations of this method. BibTeX: @article{Yang2021, author = {Yu Yang and Nannan Zou and Enping Lin and Fei Suo and Zhong Chen}, title = {A neural network method for nonconvex optimization and its application on parameter retrieval}, journal = {IEEE Transactions on Signal Processing}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, pages = {1--1}, doi = {10.1109/tsp.2021.3080426} }  Yang Y and Liu Z (2021), "Heuristics for Finding Sparse Solutions of Linear Inequalities", Asia-Pacific Journal of Operational Research., December, 2021. World Scientific Pub Co Pte Ltd. [Abstract] [BibTeX] [DOI] Abstract: In this paper, we consider the problem of finding a sparse solution, with a minimal number of nonzero components, for a set of linear inequalities. 
This optimization problem is combinatorial and arises in various fields such as machine learning and compressed sensing. We present three new heuristics for the problem. The first two are greedy algorithms minimizing the sum of infeasibilities in the primal and dual spaces with different selection rules. The third heuristic is a combination of the greedy heuristic in the dual space and a local search algorithm. In numerical experiments, our proposed heuristics are compared with the weighted-l_1 algorithm and DCA programming with three different non-convex approximations of the zero norm. The computational results demonstrate the efficiency of our methods. BibTeX: @article{Yang2021a, author = {Yichen Yang and Zhaohui Liu}, title = {Heuristics for Finding Sparse Solutions of Linear Inequalities}, journal = {Asia-Pacific Journal of Operational Research}, publisher = {World Scientific Pub Co Pte Ltd}, year = {2021}, doi = {10.1142/s021759592240005x} }  Yaşar A, Gabert K and Çatalyürek ÜV (2021), "Parallel graph algorithms by blocks", In Proceedings of the 18th ACM International Conference on Computing Frontiers., May, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: In today's data-driven world and heterogeneous computing environments, processing large-scale graphs in an architecture agnostic manner has become more crucial than ever before. In terms of graph analytics frameworks, on the one side, there has been a significant interest in developing hand-optimized high-performance computing solutions. On the systems side, following the big data movement and to bring parallel computing to the masses, researchers have proposed several graph processing and management systems to handle large-scale graphs. Hand optimized HPC approaches require high expertise and are expensive to maintain and develop, and graph processing frameworks suffer from limited expressibility and performance. 
We propose Parallel Graph Algorithms by Blocks (PGAbB), a block-based graph algorithms framework for shared-memory, multi-core, multi-GPU machines. PGAbB offers a sweet spot between efficient parallelism and architecture agnostic algorithm design for a wide class of graph problems while performing close to hand-optimized HPC implementations. While our PGAbB framework, as well as many other recent HPC graph-analytics frameworks, are highly tuned and able to run complex graph analytics in fractions of seconds on billion-edge graphs, there remains a gap in their end-to-end use. Despite the significant improvements that modern hardware and operating systems have made towards input and output, reading the graph from file systems easily takes thousands of times longer than running the computational kernel itself. This slowdown causes both a disconnect for end users and a loss of productivity for researchers and developers. We close this gap by providing a simple to use, small, header-only, and dependency-free C++11 library, PIGO, that brings I/O improvements to graph and sparse matrix systems. Using PIGO, we improve the end-to-end performance for state-of-the-art systems significantly---in many cases by over 40X. BibTeX: @inproceedings{Yasar2021, author = {Abdurrahman Yaşar and Kasimir Gabert and Ümit V. Çatalyürek}, title = {Parallel graph algorithms by blocks}, booktitle = {Proceedings of the 18th ACM International Conference on Computing Frontiers}, publisher = {ACM}, year = {2021}, doi = {10.1145/3457388.3459987} }  Yayavaram S and Chanda SS (2021), "Decision making under high complexity: a computational model for the science of muddling through", Computational and Mathematical Organization Theory., November, 2021. Springer Science and Business Media LLC. [Abstract] [BibTeX] [DOI] Abstract: It is well recognized that many organizations operate under situations of high complexity that arises from pervasive interdependencies between their decision elements. 
While prior work has discussed the benefits of low to moderate complexity, the literature on how to cope with high complexity is relatively sparse. In this study, we seek to demonstrate that Lindblom’s decision-making principle of muddling through is a very effective approach that organizations can use to cope with high complexity. Using a computational simulation (NK) model, we show that Lindblom’s muddling through approach obtains outcomes superior to those obtained from boundedly rational decision-making approaches when complexity is high. Moreover, our results also show that muddling through is an appropriate vehicle for bringing in radical organizational change or far-reaching adaptation. BibTeX: @article{Yayavaram2021, author = {Sai Yayavaram and Sasanka Sekhar Chanda}, title = {Decision making under high complexity: a computational model for the science of muddling through}, journal = {Computational and Mathematical Organization Theory}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10588-021-09354-9} }  Yazdani A, Haynes RD and Ruuth SJ (2021), "A Convergence Analysis of the Parallel Schwarz Solution of the Continuous Closest Point Method", September, 2021. [Abstract] [BibTeX] Abstract: The discretization of surface intrinsic PDEs has challenges that one might not face in the flat space. The closest point method (CPM) is an embedding method that represents surfaces using a function that maps points in the flat space to their closest points on the surface. This mapping brings intrinsic data onto the embedding space, allowing us to numerically approximate PDEs by the standard methods in the tubular neighborhood of the surface. Here, we solve the surface intrinsic positive Helmholtz equation by the CPM paired with finite differences which usually yields a large, sparse, and non-symmetric system. Domain decomposition methods, especially Schwarz methods, are robust algorithms to solve these linear systems. 
While there have been substantial works on Schwarz methods, Schwarz methods for solving surface differential equations have not been widely analyzed. In this work, we investigate the convergence of the CPM coupled with Schwarz method on 1-manifolds in d-dimensional space of real numbers. BibTeX: @article{Yazdani2021, author = {Alireza Yazdani and Ronald D. Haynes and Steven J. Ruuth}, title = {A Convergence Analysis of the Parallel Schwarz Solution of the Continuous Closest Point Method}, year = {2021} }  Yilmaz B (2021), "Graph Transformation and Specialized Code Generation For Sparse Triangular Solve (SpTRSV)", March, 2021. [Abstract] [BibTeX] Abstract: Sparse Triangular Solve (SpTRSV) is an important computational kernel used in the solution of sparse linear algebra systems in many scientific and engineering applications. It is difficult to parallelize SpTRSV in today's architectures. The limited parallelism due to the dependencies between calculations and the irregular nature of the computations require an effective load balancing and synchronization mechanism approach. In this work, we present a novel graph transformation method where the equation representing a row is rewritten to break the dependencies. Using this approach, we propose a dependency graph transformation and code generation framework that increases the parallelism of the parts of a sparse matrix where it is scarce, reducing the need for synchronization points. In addition, the proposed framework generates specialized code for the transformed dependency graph on CPUs using domain-specific optimizations. BibTeX: @article{Yilmaz2021, author = {Buse Yilmaz}, title = {Graph Transformation and Specialized Code Generation For Sparse Triangular Solve (SpTRSV)}, year = {2021} }  Yin M, Xu X, Zhang T and Ye C (2021), "Performance Evaluation Model for Matrix Calculation on GPU", 10, 2021. World Scientific Pub Co Pte Ltd. 
[Abstract] [BibTeX] [DOI] Abstract: Establishment of a performance evaluation model is a hotspot of current research. In this paper, the performance bottleneck is analyzed quantitatively, which provided programmers with a guidance to optimize the performance bottleneck. This paper takes a matrix as an example; the matrix is divided into a dense matrix or a sparse matrix. For dense matrix, the performance is first analyzed in a quantitative way, and an evaluation model is developed, which includes the instruction pipeline, shared memory, and global memory. For sparse matrix, this paper aims at the four formats of CSR, ELL, COO, and HYB, through the observation data obtained from the actual operation of large datasets, finds the relationship between the running time, dataset form, and storage model, and establishes their relational model functions. Through practical test and comparison, the error between the execution time of the test dataset that is predicted by the model function and the actual running time is found to be within a stable finite deviation threshold, proving that the model has certain practicability. BibTeX: @article{Yin2021, author = {Mengjia Yin and Xianbin Xu and Tao Zhang and Conghuan Ye}, title = {Performance Evaluation Model for Matrix Calculation on GPU}, publisher = {World Scientific Pub Co Pte Ltd}, year = {2021}, doi = {10.1142/s0218001421540306} }  Yi-Wen W, Chen W-M and Tsai H-H (2021), "Accelerating the Bron-Kerbosch algorithm for maximal clique enumeration using GPUs", IEEE Transactions on Parallel and Distributed Systems. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Maximal clique enumeration (MCE) is a classic problem in graph theory to identify all complete subgraphs in a graph. In prior MCE work, the Bron-Kerbosch algorithm is one of the most popular solutions, and there are several improved algorithms proposed on CPU platforms. 
However, while few studies have focused on the related issue of parallel implementation, recently, there have been numerous explorations of the acceleration of general purpose applications using a graphics processing unit (GPU) to reduce the computing power consumption. In this paper, we develop a GPU-based Bron-Kerbosch algorithm that efficiently solves the MCE problem in parallel by optimizing the process of subproblem decomposition and computing resource usage. To speed up the computations, we use coalesced memory accesses and warp reductions to increase bandwidth and reduce memory latency. Our experimental results show that the proposed algorithm can fully exploit the resources of GPU architectures, allowing for the vast acceleration of operations to solve the MCE problem. BibTeX: @article{YiWen2021, author = {Wei Yi-Wen and Wei-Mei Chen and Hsin-Hung Tsai}, title = {Accelerating the Bron-Kerbosch algorithm for maximal clique enumeration using GPUs}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, doi = {10.1109/tpds.2021.3067053} }  Yoshio N and Biegler LT (2021), "A Nested Schur Decomposition Approach for Multiperiod Optimization of Chemical Processes", Computers & Chemical Engineering., August, 2021. , pp. 107509. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: This work develops an algorithm for solving nonlinear multiperiod optimization (MPO) problems using a nested Schur decomposition (NSD) approach. The NSD approach decomposes MPO using a Schur complement and allows us to solve the decomposed nonlinear programming (NLP) problem in parallel. The NSD partitions the MPO into a two-level problem with individual NLPs at the lower level. In MPO problems for chemical processes, the upper problem generally has inventory, demand, and design constraints set over the entire period. The lower problem consists of a single process model for each period. 
The problem-level decomposition facilitates the flexible selection of the lower-level solver. For example, an efficient barrier solver such as IPOPT can be used when the problem is well-conditioned. Conversely, a robust active-set solver such as CONOPT can be selected when degeneracy exists in the problem. In this paper, the NSD approach is demonstrated with different process models for MPO under uncertain demand in both serial and parallel implementation. The solutions are also compared with the direct approach, which solves the entire MPO problem simultaneously. The demonstration shows the capability of the flexible inner solver selection with IPOPT and CONOPT. The result shows that NSD converges to the same optimum as the direct approach, regardless of the choice of the inner solver. Furthermore, IPOPT could be more efficient than CONOPT when the problem is well-conditioned. Moreover, it is noted that the NSD outperforms the direct approach when the size of the process model is large with CONOPT as the inner solver. From those results, we observe that NSD is well-suited to solve large MPO problems for chemical processes in an efficient, flexible, and robust manner. BibTeX: @article{Yoshio2021, author = {Noriyuki Yoshio and Lorenz T. Biegler}, title = {A Nested Schur Decomposition Approach for Multiperiod Optimization of Chemical Processes}, journal = {Computers & Chemical Engineering}, publisher = {Elsevier BV}, year = {2021}, pages = {107509}, doi = {10.1016/j.compchemeng.2021.107509} }  Yurtsever A, Mangalick V and Sra S (2021), "Three Operator Splitting with a Nonconvex Loss Function", March, 2021. [Abstract] [BibTeX] Abstract: We consider the problem of minimizing the sum of three functions, one of which is nonconvex but differentiable, and the other two are convex but possibly nondifferentiable. We investigate the Three Operator Splitting method (TOS) of Davis & Yin (2017) with an aim to extend its theoretical guarantees for this nonconvex problem template. 
In particular, we prove convergence of TOS with nonasymptotic bounds on its nonstationarity and infeasibility errors. In contrast with the existing work on nonconvex TOS, our guarantees do not require additional smoothness assumptions on the terms comprising the objective; hence they cover instances of particular interest where the nondifferentiable terms are indicator functions. We also extend our results to a stochastic setting where we have access only to an unbiased estimator of the gradient. Finally, we illustrate the effectiveness of the proposed method through numerical experiments on quadratic assignment problems. BibTeX: @article{Yurtsever2021, author = {Alp Yurtsever and Varun Mangalick and Suvrit Sra}, title = {Three Operator Splitting with a Nonconvex Loss Function}, year = {2021} }  Zaitsev DA, Shmeleva TR and Luszczek P (2021), "Aggregation of clans to speed-up solving linear systems on parallel architectures", International Journal of Parallel, Emergent and Distributed Systems., November, 2021. , pp. 1-22. Informa UK Limited. [Abstract] [BibTeX] [DOI] Abstract: The paper further refines the clan composition technique that is considered a way of matrix partitioning into a union of block-diagonal and block-column matrices. This enables solving the individual systems for each horizontal block on a separate computing node, followed by solving the composition system. The size of minimal clans, obtained as a result of matrix decomposition, varies considerably. For load balancing, early versions of ParAd software were using dynamic scheduling of jobs. The present paper studies a task of static balancing the clan size. Rather good results are obtained using a fast bin packing algorithm with the first fit on a sorted array which are considerably improved applying a multi-objective graph partitioning with software package METIS. 
Aggregation of clans allows us to obtain up to three times extra speed-up, including systems over fields of real numbers, on matrices from Model Checking Contest and Matrix Market. BibTeX: @article{Zaitsev2021, author = {Dmitry A. Zaitsev and Tatiana R. Shmeleva and Piotr Luszczek}, title = {Aggregation of clans to speed-up solving linear systems on parallel architectures}, journal = {International Journal of Parallel, Emergent and Distributed Systems}, publisher = {Informa UK Limited}, year = {2021}, pages = {1--22}, doi = {10.1080/17445760.2021.2004412} }  Zhai Y, Giem E, Fan Q, Zhao K, Liu J and Chen Z (2021), "FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance", April, 2021. [Abstract] [BibTeX] [DOI] Abstract: Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an Algorithm-Based Fault Tolerance mechanism for computing-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from delicate assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance -- faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14% and 21.70%, respectively, for routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute. 
BibTeX: @article{Zhai2021, author = {Yujia Zhai and Elisabeth Giem and Quan Fan and Kai Zhao and Jinyang Liu and Zizhong Chen}, title = {FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance}, year = {2021}, doi = {10.1145/3447818.3460364} }  Zhang F, Su J, Liu W, He B, Wu R, Du X and Wang R (2021), "YuenyeungSpTRSV: A Thread-Level and Warp-Level Fusion Synchronization-Free Sparse Triangular Solve", IEEE Transactions on Parallel and Distributed Systems. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Sparse triangular solves are widely used in linear algebra domains. Synchronization-free SpTRSVs, due to their short preprocessing time and high performance, are the most popular SpTRSV algorithms. However, we observe that the performance of those SpTRSVs on different matrices can vary greatly. Our further studies show that when the average number of components per level is high and the average number of nonzero elements per row is low, those SpTRSVs exhibit extremely low performance. The reason is that they use a warp on the GPU to process a row in sparse matrices, and such warp-level designs have severe underutilization of the GPU. To solve this problem, we propose YuenyeungSpTRSV, a thread-level and warp-level fusion synchronization-free SpTRSV, which handles the rows with a large number of nonzero elements at warp-level while the rows with a low number of nonzero elements at thread-level. It can achieve good performance on the most popular sparse matrix storage, compressed sparse row format, and thus users do not need to conduct format conversion. We evaluate YuenyeungSpTRSV with 245 matrices from the Florida Sparse Matrix Collection on four platforms, and experiments show that our SpTRSV exhibits 7.14 GFLOPS/s, which is 5.98x speedup over the state-of-the-art synchronization-free SpTRSV. 
BibTeX: @article{Zhang2021, author = {Feng Zhang and Jiya Su and Weifeng Liu and Bingsheng He and Ruofan Wu and Xiaoyong Du and Rujia Wang}, title = {YuenyeungSpTRSV: A Thread-Level and Warp-Level Fusion Synchronization-Free Sparse Triangular Solve}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, doi = {10.1109/tpds.2021.3066635} }  Zhang C, Song Y, Cai X and Han D (2021), "An extended proximal ADMM algorithm for three-block nonconvex optimization problems", Journal of Computational and Applied Mathematics., December, 2021. Vol. 398, pp. 113681. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: We propose a new proximal alternating direction method of multipliers (ADMM) for solving a class of three-block nonconvex optimization problems with linear constraints. The proposed method updates the third primal variable twice per iteration and introduces semidefinite proximal terms to the subproblems with the first two blocks. The method can be regarded as an extension of the method proposed in Sun et al. (2015) which is specialized to the convex case with the third block of the objective function being quadratic. Based on the powerful Kurdyka--Łojasiewicz property, we prove that each bounded sequence generated by the proposed method converges to a critical point of the considered problem. Some numerical results are reported to indicate the effectiveness and superiority of the proposed method. 
BibTeX: @article{Zhang2021a, author = {Chun Zhang and Yongzhong Song and Xingju Cai and Deren Han}, title = {An extended proximal ADMM algorithm for three-block nonconvex optimization problems}, journal = {Journal of Computational and Applied Mathematics}, publisher = {Elsevier BV}, year = {2021}, volume = {398}, pages = {113681}, doi = {10.1016/j.cam.2021.113681} }  Zhang Y and Li H (2021), "A count sketch maximal weighted residual Kaczmarz method for solving highly overdetermined linear systems", Applied Mathematics and Computation., December, 2021. Vol. 410, pp. 126486. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: In this paper, combining count sketch and maximal weighted residual Kaczmarz method, we propose a fast randomized algorithm for highly overdetermined linear systems. Convergence analysis of the new algorithm is provided. Numerical experiments show that, for the same accuracy, our method behaves better in computing time compared with the maximal weighted residual Kaczmarz algorithm. BibTeX: @article{Zhang2021b, author = {Yanjun Zhang and Hanyu Li}, title = {A count sketch maximal weighted residual Kaczmarz method for solving highly overdetermined linear systems}, journal = {Applied Mathematics and Computation}, publisher = {Elsevier BV}, year = {2021}, volume = {410}, pages = {126486}, doi = {10.1016/j.amc.2021.126486} }  Zhang X, Li B and Jiang J (2021), "Efficient Convolutional Dictionary Learning Using Preconditioned ADMM", International Journal of Pattern Recognition and Artificial Intelligence., July, 2021. Vol. 35(09), pp. 2151009. World Scientific Pub Co Pte Lt. [Abstract] [BibTeX] [DOI] Abstract: Given training data, convolutional dictionary learning (CDL) seeks a translation-invariant sparse representation, which is characterized by a set of convolutional kernels. However, even a small training set with moderate sample size can render the optimization process both computationally challenging and memory starving. 
Under a biconvex optimization strategy for CDL, we propose to diagonally precondition the system matrices in the filter learning sub-problem that can be solved by the alternating direction method of multipliers (ADMM). This method leads to the substitution of matrix inversion (O(n³)) and matrix multiplication (O(n³)) involved in ADMM with an element-wise operation (O(n)), which significantly reduces the computational complexity as well as the memory requirement. Numerical experiments validate the performance advantage of the proposed method over the state-of-the-arts. Code is available at https://github.com/baopingli/Efficient-Convolutional-Dictionary-Learning-using-PADMM. BibTeX: @article{Zhang2021c, author = {Xuesong Zhang and Baoping Li and Jing Jiang}, title = {Efficient Convolutional Dictionary Learning Using Preconditioned ADMM}, journal = {International Journal of Pattern Recognition and Artificial Intelligence}, publisher = {World Scientific Pub Co Pte Lt}, year = {2021}, volume = {35}, number = {09}, pages = {2151009}, doi = {10.1142/s0218001421510095} }  Zhang W, Feng X, Xiao F and Wang X (2021), "A class of bilinear matrix constraint optimization problem and its applications", Knowledge-Based Systems., August, 2021. , pp. 107429. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: A broad class of minimization problems involving the sum of nonconvex and nonsmooth functions with a bilinear matrix equality constraint is introduced. The constraint condition can be regarded as a generalization of the multiplicative decomposition and additive decomposition of the original data. Augmented Lagrangian multiplier method and proximal alternating linearized minimization algorithm are applied for effectively solving the problem. Convergence guarantee is given under some mild assumptions. Taking two applications for instance to show that many practical problems can be converted to the general model with simple reformation, and effectively solved by the algorithm. 
The numerical experimental result shows the proposed method has better convergence property, better recovery result and less time-consuming than the compared methods. BibTeX: @article{Zhang2021d, author = {Wenjuan Zhang and Xiangchu Feng and Feng Xiao and Xudong Wang}, title = {A class of bilinear matrix constraint optimization problem and its applications}, journal = {Knowledge-Based Systems}, publisher = {Elsevier BV}, year = {2021}, pages = {107429}, doi = {10.1016/j.knosys.2021.107429} }  Zhang S and Bailey CP (2021), "Accelerated Primal-Dual Algorithm for Distributed Non-convex Optimization", August, 2021. [Abstract] [BibTeX] Abstract: This paper investigates accelerating the convergence of distributed optimization algorithms on non-convex problems. We propose a distributed primal-dual stochastic gradient descent (SGD) equipped with "powerball" method to accelerate. We show that the proposed algorithm achieves the linear speedup convergence rate 𝒪(1/nT) for general smooth (possibly non-convex) cost functions. We demonstrate the efficiency of the algorithm through numerical experiments by training two-layer fully connected neural networks and convolutional neural networks on the MNIST dataset to compare with state-of-the-art distributed SGD algorithms and centralized SGD algorithms. BibTeX: @article{Zhang2021e, author = {Shengjun Zhang and Colleen P. Bailey}, title = {Accelerated Primal-Dual Algorithm for Distributed Non-convex Optimization}, year = {2021} }  Zhang Y, Yang W, Li K, Tang D and Li K (2021), "Performance analysis and optimization for SpMV based on aligned storage formats on an ARM processor", Journal of Parallel and Distributed Computing., December, 2021. Vol. 158, pp. 126-137. Elsevier BV. 
[Abstract] [BibTeX] [DOI] Abstract: Sparse matrix-vector multiplication (SpMV) has always been a hot topic of research for scientific computing and big data processing, but the sparsity and discontinuity of the nonzero elements in a sparse matrix lead to the memory bottleneck of SpMV. In this paper, we propose aligned CSR (ACSR) and aligned ELL (AELL) formats and a parallel SpMV algorithm to utilize NEON SIMD registers on ARM processors. We analyze the impact of SIMD instruction latency, cache access, and cache misses on SpMV with different formats. In the experiments, our SpMV algorithm based on ACSR achieves 1.18x and 1.56x speedup over SpMV based on CSR and SpMV in PETSc, respectively, and AELL achieves 1.21x speedup over ELL. The deviations between the theoretical results and experimental results in the instruction latency and cache access are 10.26% and 10.51% in ACSR and 5.68% and 2.91% in AELL, respectively. BibTeX: @article{Zhang2021f, author = {Yufeng Zhang and Wangdong Yang and Kenli Li and Dahai Tang and Keqin Li}, title = {Performance analysis and optimization for SpMV based on aligned storage formats on an ARM processor}, journal = {Journal of Parallel and Distributed Computing}, publisher = {Elsevier BV}, year = {2021}, volume = {158}, pages = {126--137}, doi = {10.1016/j.jpdc.2021.08.002} }  Zhao K, Hu T, Li Y, Dongarra J and Moler C (2021), "Package 'Rbeast'" [Abstract] [BibTeX] Abstract: A Bayesian model averaging algorithm called BEAST to decompose time series or 1D sequential data into individual components, such as abrupt changes, trends, and periodic/seasonal variations. BEAST is useful for changepoint detection (e.g., breakpoints or structural breaks), nonlinear trend analysis, time series decomposition, and time series segmentation. 
BibTeX: @techreport{Zhao2021, author = {Kaiguang Zhao and Tongxi Hu and Yang Li and Jack Dongarra and Cleve Moler}, title = {Package 'Rbeast'}, year = {2021} }  Zheng R and Pai S (2021), "Efficient Execution of Graph Algorithms on CPU with SIMD Extensions", In Proceedings of the International Symposium on Code Generation and Optimization. [Abstract] [BibTeX] Abstract: Existing state-of-the-art CPU graph frameworks take advantage of multiple cores, but not the SIMD capability within each core. In this work, we retarget an existing GPU graph algorithm compiler to obtain the first graph framework that uses SIMD extensions on CPUs to efficiently execute graph algorithms. We evaluate this compiler on 10 benchmarks and 3 graphs on 3 different CPUs and also compare to the GPU. Evaluation results show that on a 8-core machine, enabling SIMD on a naive multi-core implementation achieves an additional 7.48x speedup, averaged across 10 benchmarks and 3 inputs. Applying our SIMD-targeted optimizations improves the plain SIMD implementation by 1.67x, outperforming a serial implementation by 12.46x. On average, the optimized multi-core SIMD version also outperforms the state-of-the-art graph framework, GraphIt, by 1.53x, averaged across 5 (common) benchmarks. SIMD execution on CPUs closes the gap between the CPU and GPU to 1.76x, but the CPU virtual memory performs better when graphs are much bigger than available physical memory. BibTeX: @inproceedings{Zheng2021, author = {Ruohuang Zheng and Sreepathi Pai}, title = {Efficient Execution of Graph Algorithms on CPU with SIMD Extensions}, booktitle = {Proceedings of the International Symposium on Code Generation and Optimization}, year = {2021} }  Zheng Z, Chen J and Chen Y-F (2021), "A fully structured preconditioner for a class of complex symmetric indefinite linear systems", BIT Numerical Mathematics., August, 2021. Springer Science and Business Media LLC. 
[Abstract] [BibTeX] [DOI] Abstract: Based on the equivalent transformation of the complex coefficient matrix, a fully structured preconditioner which is economic to implement within GMRES acceleration is presented for solving a class of complex symmetric indefinite linear systems. We analyze the computational complexity of the proposed preconditioner and show that all eigenvalues of the corresponding preconditioned matrix are clustered at a top half annulus. Compared with some other existing preconditioners, the validity of theoretical analysis and the effectiveness of the proposed preconditioner are verified by numerical experiments. BibTeX: @article{Zheng2021a, author = {Zhong Zheng and Jing Chen and Yue-Fen Chen}, title = {A fully structured preconditioner for a class of complex symmetric indefinite linear systems}, journal = {BIT Numerical Mathematics}, publisher = {Springer Science and Business Media LLC}, year = {2021}, doi = {10.1007/s10543-021-00887-8} }  Zhong D, Cao Q, Bosilca G and Dongarra J (2021), "Using long vector extensions for MPI reductions", Parallel Computing., December, 2021. , pp. 102871. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: The modern CPU’s design, including the deep memory hierarchies and SIMD/vectorization capability have a more significant impact on algorithms’ efficiency than the modest frequency increase observed recently. The current introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become a critical software component to increase efficiency and close the gap to peak performance. In this paper, we investigate the impact of the vectorization of MPI reduction operations. We propose an implementation of predefined MPI reduction operations using vector intrinsics (AVX and SVE) to improve the time-to-solution of the predefined MPI reduction operations. 
The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures. Experiments conducted on varied architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX), show that the proposed vector extension optimized reduction operations significantly reduce completion time for collective communication reductions. With these optimizations, we achieve higher memory bandwidth and an increased efficiency for local computations, which directly benefit the overall cost of collective reductions and applications based on them. BibTeX: @article{Zhong2021, author = {Dong Zhong and Qinglei Cao and George Bosilca and Jack Dongarra}, title = {Using long vector extensions for MPI reductions}, journal = {Parallel Computing}, publisher = {Elsevier BV}, year = {2021}, pages = {102871}, doi = {10.1016/j.parco.2021.102871} }  Zhou T, Gao L and Guan X (2021), "A Fault-Tolerant Distributed Framework for Asynchronous Iterative Computations", IEEE Transactions on Parallel and Distributed Systems., August, 2021. Vol. 32(8), pp. 2062-2073. Institute of Electrical and Electronics Engineers (IEEE). [Abstract] [BibTeX] [DOI] Abstract: Asynchronous iterative computations (AIC) are common in machine learning and data mining systems. However, the lack of synchronization barriers in asynchronous processing brings challenges for continuous processing while workers might fail. There is no global synchronization point that all workers can roll back to. In this article, we propose a fault-tolerant framework for asynchronous iterative computations (FAIC). Our framework takes a virtual snapshot of the AIC system without halting the computation of any worker. We prove that the virtual snapshot capture by FAIC can recover the AIC system correctly. We evaluate our FAIC framework on two existing AIC systems, Maiter and NOMAD. 
Our experiment result shows that the checkpoint overhead of FAIC is more than 50 percent shorter than the synchronous checkpoint method. FAIC is around 10 percent faster than other asynchronous snapshot algorithms, such as the Chandy-Lamport algorithm. Our experiments on a large cluster demonstrate that FAIC scales with the number of workers. BibTeX: @article{Zhou2021, author = {Tian Zhou and Lixin Gao and Xiaohong Guan}, title = {A Fault-Tolerant Distributed Framework for Asynchronous Iterative Computations}, journal = {IEEE Transactions on Parallel and Distributed Systems}, publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, year = {2021}, volume = {32}, number = {8}, pages = {2062--2073}, doi = {10.1109/tpds.2021.3059420} }  Zhou Y, Lin W, Hao J-K, Xiao M and Jin Y (2021), "An effective branch-and-bound algorithm for the maximum s-bundle problem", European Journal of Operational Research., May, 2021. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: An s-bundle (where s is a positive integer) is a connected graph, the vertex connectivity of which is at least n-s, where n is the number of vertices in the graph. As a relaxation of the classical clique model, the s-bundle is relevant for representing cohesive groups with an emphasis on the connectivity of members; thus, it is of great practical importance. In this work, we investigate the fundamental problem of finding the maximum s-bundle from a given graph and present an effective branch-and-bound algorithm for solving this NP-hard problem. The proposed algorithm is distinguished owing to its new multi-branching rules, graph coloring-based bounding technique, and reduction rules using structural information. The experiments indicate that the algorithm outperforms the best-known approaches on a wide range of well-known benchmark graphs for different s values. 
In particular, compared with the popular Russian Doll Search algorithm, the proposed algorithm almost doubles the success rate of solving large social networks in an hour when s=5. BibTeX: @article{Zhou2021a, author = {Yi Zhou and Weibo Lin and Jin-Kao Hao and Mingyu Xiao and Yan Jin}, title = {An effective branch-and-bound algorithm for the maximum s-bundle problem}, journal = {European Journal of Operational Research}, publisher = {Elsevier BV}, year = {2021}, doi = {10.1016/j.ejor.2021.05.001} }  Zhou K, Adhianto L, Anderson J, Cherian A, Grubisic D, Krentel M, Liu Y, Meng X and Mellor-Crummey J (2021), "Measurement and analysis of GPU-accelerated applications with HPCToolkit", Parallel Computing., September, 2021. , pp. 102837. Elsevier BV. [Abstract] [BibTeX] [DOI] Abstract: To address the challenge of performance analysis on the US DOE’s forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit’s measurement and analysis tools attribute metrics to calling contexts that span both CPUs and GPUs. To measure GPU-accelerated applications efficiently, HPCToolkit employs a novel wait-free data structure to coordinate monitoring and attribution of GPU performance. To help developers understand the performance of complex GPU code generated from high-level programming models, HPCToolkit constructs sophisticated approximations of call path profiles for GPU computations. To support fine-grained analysis and tuning, HPCToolkit uses PC sampling and instrumentation to measure and attribute GPU performance metrics to source lines, loops, and inlined code. To supplement fine-grained measurements, HPCToolkit can measure GPU kernel executions using hardware performance counters. 
To provide a view of how an execution evolves over time, HPCToolkit can collect, analyze, and visualize call path traces within and across nodes. Finally, on NVIDIA GPUs, HPCToolkit can derive and attribute a collection of useful performance metrics based on measurements using GPU PC samples. We illustrate HPCToolkit’s new capabilities for analyzing GPU-accelerated applications with several codes developed as part of the Exascale Computing Project. BibTeX: @article{Zhou2021b, author = {Keren Zhou and Laksono Adhianto and Jonathon Anderson and Aaron Cherian and Dejan Grubisic and Mark Krentel and Yumeng Liu and Xiaozhu Meng and John Mellor-Crummey}, title = {Measurement and analysis of GPU-accelerated applications with HPCToolkit}, journal = {Parallel Computing}, publisher = {Elsevier BV}, year = {2021}, pages = {102837}, doi = {10.1016/j.parco.2021.102837} }  Zhu Y, Swami A and Segarra S (2021), "Free Energy Node Embedding via Generalized Skip-gram with Negative Sampling", May, 2021. [Abstract] [BibTeX] Abstract: A widely established set of unsupervised node embedding methods can be interpreted as consisting of two distinctive steps: i) the definition of a similarity matrix based on the graph of interest followed by ii) an explicit or implicit factorization of such matrix. Inspired by this viewpoint, we propose improvements in both steps of the framework. On the one hand, we propose to encode node similarities based on the free energy distance, which interpolates between the shortest path and the commute time distances, thus, providing an additional degree of flexibility. On the other hand, we propose a matrix factorization method based on a loss function that generalizes that of the skip-gram model with negative sampling to arbitrary similarity matrices. Compared with factorizations based on the widely used ℓ₂ loss, the proposed method can better preserve node pairs associated with higher similarity scores. 
Moreover, it can be easily implemented using advanced automatic differentiation toolkits and computed efficiently by leveraging GPU resources. Node clustering, node classification, and link prediction experiments on real-world datasets demonstrate the effectiveness of incorporating free-energy-based similarities as well as the proposed matrix factorization compared with state-of-the-art alternatives. BibTeX: @article{Zhu2021, author = {Yu Zhu and Ananthram Swami and Santiago Segarra}, title = {Free Energy Node Embedding via Generalized Skip-gram with Negative Sampling}, year = {2021} }  Zhu L, Hua Q-S and Jin H (2021), "Communication Avoiding All-Pairs Shortest Paths Algorithm for Sparse Graphs", In 50th International Conference on Parallel Processing., August, 2021. ACM. [Abstract] [BibTeX] [DOI] Abstract: In this paper, we propose a parallel algorithm for computing all-pairs shortest paths (APSP) for sparse graphs on the distributed memory system with p processors. To exploit the graph sparsity, we first preprocess the graph by utilizing several known algorithmic techniques in linear algebra such as fill-in reducing ordering and elimination tree parallelism. Then we map the preprocessed graph on the distributed memory system for both load balancing and communication reduction. Finally, we design a new scheduling strategy to minimize the communication cost. The bandwidth cost (communication volume) and the latency cost (number of messages) of our algorithm are and O(log 2p), respectively, where S is a minimal vertex separator that partitions the graph into two components of roughly equal size. Compared with the state-of-the-art result for dense graphs where the bandwidth and latency costs are and ), respectively, our algorithm reduces the latency cost by a factor of , and reduces the bandwidth cost by a factor of for sparse graphs with . 
We also present the bandwidth and latency costs lower bounds for computing APSP on sparse graphs, which are and (log 2p), respectively. This implies that the bandwidth cost of our algorithm is nearly optimal and the latency cost is optimal. BibTeX: @inproceedings{Zhu2021a, author = {Lin Zhu and Qiang-Sheng Hua and Hai Jin}, title = {Communication Avoiding All-Pairs Shortest Paths Algorithm for Sparse Graphs}, booktitle = {50th International Conference on Parallel Processing}, publisher = {ACM}, year = {2021}, doi = {10.1145/3472456.3472524} }  Ziogas AN, Ben-Nun T, Schneider T and Hoefler T (2021), "NPBench: A Benchmarking Suite for High-Performance NumPy", In Proceedings of the 2021 International Conference on Supercomputing. [Abstract] [BibTeX] Abstract: Python, already one of the most popular languages for scientific computing, has made significant inroads in High Performance Computing (HPC). At the center of Python’s ecosystem is NumPy, an efficient implementation of the multi-dimensional array (tensor) structure, together with basic arithmetic and linear algebra. Compared to traditional HPC languages, the relatively low performance of Python and NumPy has spawned significant research in compilers and frameworks that decouple Python’s compact representation from the underlying implement